Ruby LLM Benchmarks

Benchmark Detail

Calendar System

EASY · 181 models tested · Top 92.1

Benchmark

Success Rate: 78.3%
Quality Score: 60
Avg Tests Passed: 18
Models Tested: 181

1–50 of 181

#	Model	Provider	Config	Score	Success	Quality	Date
1	OSS 120B	OpenAI	—	92.1	95.7%	60	2025-08
2	Gemma 4 26B A4B	Google	none	91.9	95.7%	58	2026-04
3	Claude Opus 4	Anthropic	—	91.3	95.7%	52	2025-08
4	Claude 4.5 Haiku	Anthropic	—	90.9	95.7%	48	2025-10
5	Claude Opus 4.1	Anthropic	—	90.9	95.7%	48	2025-08
6	Gemini 2.0 Flash 001	Google	—	88.6	91.3%	64	2025-08
7	GPT 5 nano	OpenAI	—	87.2	91.3%	50	2025-08
8	GPT 4 Turbo	OpenAI	—	87.0	91.3%	48	2025-08
9	GPT 5.1 Codex Mini	OpenAI	—	86.7	87.0%	84	2025-11
10	Kimi K2.6	Moonshotai	low	86.1	87.0%	78	2026-04
11	GPT 4.1	OpenAI	—	85.6	91.3%	34	2025-08
12	Claude Sonnet 4.6	Anthropic	—	85.3	87.0%	70	2026-02
13	Codestral 25.08	Mistral	—	85.3	87.0%	70	2025-08
14	Claude Fable 5	Anthropic	gen	85.1	87.0%	68	2026-07
15	Claude Opus 4.8	Anthropic	medium	85.1	87.0%	68	2026-05
16	Gemini 3.5 Flash	Google	xhigh	85.1	87.0%	68	2026-05
17	Claude Fable 5	Anthropic	low	84.9	87.0%	66	2026-06
18	Claude Opus 4.8	Anthropic	high	84.9	87.0%	66	2026-05
19	Claude Sonnet 4.6	Anthropic	high · 8,192 tokens	84.9	87.0%	66	2026-04
20	Gemini 2.5 Flash	Google	—	84.9	87.0%	66	2025-08
21	GLM 5	Z.AI	—	84.9	87.0%	66	2026-02
22	Claude Fable 5	Anthropic	high	84.7	87.0%	64	2026-06
23	↳ Claude Fable 5	Anthropic	low	84.7	87.0%	64	2026-07
24	↳ Claude Fable 5	Anthropic	medium	84.7	87.0%	64	2026-06
25	Claude Opus 4.6	Anthropic	medium · 2,048 tokens	84.7	87.0%	64	2026-04
26	Claude Opus 4.7	Anthropic	low	84.7	87.0%	64	2026-05
27	Grok 4	xAI	—	84.7	87.0%	64	2025-08
28	Claude 3.5 Haiku	Anthropic	—	84.5	87.0%	62	2025-08
29	Claude Fable 5	Anthropic	high	84.5	87.0%	62	2026-07
30	Claude Opus 4.5	Anthropic	—	84.5	87.0%	62	2025-11
31	Claude Opus 4.6	Anthropic	high · 8,192 tokens	84.5	87.0%	62	2026-04
32	Claude Opus 4.7	Anthropic	medium	84.5	87.0%	62	2026-05
33	↳ Claude Opus 4.7	Anthropic	medium · 2,048 tokens	84.5	87.0%	62	2026-04
34	Claude Opus 4.8	Anthropic	low	84.5	87.0%	62	2026-05
35	Claude Opus 4.8 (Fast)	Anthropic	medium	84.5	87.0%	62	2026-05
36	Gemini 2.5 Pro	Google	—	84.5	87.0%	62	2025-08
37	GPT 4.1 nano	OpenAI	—	84.5	87.0%	62	2025-08
38	o4 mini	OpenAI	—	84.5	87.0%	62	2025-08
39	Claude Opus 4.6 Fast	Anthropic	—	84.3	87.0%	60	2026-04
40	Claude Opus 4.6	Anthropic	—	84.3	87.0%	60	2026-02
41	Claude Opus 4.7	Anthropic	high · 8,192 tokens	84.3	87.0%	60	2026-04
42	Claude Opus 4.8 (Fast)	Anthropic	high	84.3	87.0%	60	2026-05
43	DeepSeek V4 Pro	DeepSeek	gen	84.3	87.0%	60	2026-05
44	Gemini 3.1 Flash Lite Preview	Google	medium	84.3	87.0%	60	2026-04
45	Kimi K2 Thinking	Moonshot AI		84.3	87.0%	60	2025-12
46	Claude Opus 4.8 (Fast)	Anthropic	low	84.1	87.0%	58	2026-05
47	Gemma 4 26B A4B	Google	low	84.1	87.0%	58	2026-04
48	Claude Opus 4.7	Anthropic	—	83.9	87.0%	56	2026-04
49	Claude Opus 4.8 (Fast)	Anthropic	gen	83.9	87.0%	56	2026-05
50	Gemini 3.1 Flash Lite Preview	Google	low	83.9	87.0%	56	2026-04
