Benchmark Detail View
Calendar System
EASY Challenge | 94 models tested | Top Score: 92.1
Success Rate: 77.1%
Quality Score: 58
Tests Passed: 18
Models Tested: 94
Calendar System Benchmark - Individual Model Results
Showing 94 of 94 models
Rank | Model | Provider | Date | Score | Success Rate
1 | OSS 120B | OpenAI | 08/2025 | 92.1 | 95.7%
2 | Claude 4 Opus | Anthropic | 08/2025 | 91.3 | 95.7%
3 | Claude 4.5 Haiku | Anthropic | 10/2025 | 90.9 | 95.7%
4 | Claude 4.1 Opus | Anthropic | 08/2025 | 90.9 | 95.7%
5 | Gemini 2.0 Flash 001 | Google | 08/2025 | 88.6 | 91.3%
6 | GPT 5 nano | OpenAI | 08/2025 | 87.2 | 91.3%
7 | GPT 4 Turbo | OpenAI | 08/2025 | 87.0 | 91.3%
8 | GPT 5.1 Codex Mini | OpenAI | 11/2025 | 86.7 | 87.0%
9 | GPT 4.1 | OpenAI | 08/2025 | 85.6 | 91.3%
10 | Claude 4.6 Sonnet | Anthropic | 02/2026 | 85.3 | 87.0%
11 | Codestral 25.08 | Mistral | 08/2025 | 85.3 | 87.0%
12 | Gemini 2.5 Flash | Google | 08/2025 | 84.9 | 87.0%
13 | GLM 5 | Z.AI | 02/2026 | 84.9 | 87.0%
14 | Grok 4 | xAI | 08/2025 | 84.7 | 87.0%
15 | Claude 3.5 Haiku | Anthropic | 08/2025 | 84.5 | 87.0%
16 | Claude 4.5 Opus | Anthropic | 11/2025 | 84.5 | 87.0%
17 | Gemini 2.5 Pro | Google | 08/2025 | 84.5 | 87.0%
18 | GPT 4.1 nano | OpenAI | 08/2025 | 84.5 | 87.0%
19 | o4 mini | OpenAI | 08/2025 | 84.5 | 87.0%
20 | Claude 4.6 Opus | Anthropic | 02/2026 | 84.3 | 87.0%
21 | Kimi K2 Thinking | Moonshot AI | 12/2025 | 84.3 | 87.0%
22 | Llama 4 Maverick | Meta | 08/2025 | 83.9 | 87.0%
23 | o1 mini | OpenAI | 08/2025 | 83.7 | 87.0%
24 | GPT 5.3 Codex | OpenAI | 02/2026 | 83.2 | 82.6%
25 | GPT 5.4 | OpenAI | 03/2026 | 83.0 | 82.6%
26 | MiniMax M2.5 | Minimax | 02/2026 | 82.9 | 87.0%
27 | GPT 5.2 | OpenAI | 12/2025 | 82.8 | 82.6%
28 | GPT 5 Codex | OpenAI | 10/2025 | 82.8 | 82.6%
29 | GPT 5 nano | OpenAI | 09/2025 | 82.5 | 87.0%
30 | GPT 5.1 Codex | OpenAI | 11/2025 | 82.0 | 82.6%
31 | GPT 5.2 | OpenAI | 12/2025 | 82.0 | 82.6%
32 | GLM 4.7 | Z.AI | 12/2025 | 81.9 | 87.0%
33 | GPT 4 | OpenAI | 08/2025 | 81.5 | 82.6%
34 | GLM 4.6 | Z.AI | 10/2025 | 80.8 | 82.6%
35 | o4 mini (High) | OpenAI | 08/2025 | 80.8 | 82.6%
36 | GPT 4.1 mini | OpenAI | 08/2025 | 80.5 | 82.6%
37 | Claude 3.5 Sonnet | Anthropic | 08/2025 | 80.3 | 82.6%
38 | Coder Large | Other | 08/2025 | 80.3 | 82.6%
39 | DeepSeek V3.2 Speciale | DeepSeek | 02/2026 | 80.2 | 82.6%
40 | Grok Code Fast 1 | xAI | 09/2025 | 80.2 | 82.6%
41 | o3 mini | OpenAI | 08/2025 | 80.2 | 82.6%
42 | Claude 4 Sonnet | Anthropic | 08/2025 | 79.5 | 82.6%
43 | o3 mini (High) | OpenAI | 08/2025 | 79.3 | 82.6%
44 | Claude 3.7 Sonnet (Thinking) | Anthropic | 08/2025 | 78.8 | 82.6%
45 | GPT 5 mini | OpenAI | 08/2025 | 78.5 | 82.6%
46 | GPT 5.2 Codex | OpenAI | 01/2026 | 78.3 | 82.6%
47 | GPT 5.3 Chat | OpenAI | 03/2026 | 78.2 | 82.6%
48 | Gemini 2.5 Flash Lite | Google | 08/2025 | 78.0 | 78.3%
49 | GPT 4o | OpenAI | 08/2025 | 78.0 | 78.3%
50 | GPT 4o | OpenAI | 08/2025 | 78.0 | 82.6%
51 | GPT 5.1 | OpenAI | 11/2025 | 77.8 | 78.3%
52 | Qwen3 Max | Qwen | 10/2025 | 77.8 | 82.6%
53 | Trinity Large Preview | Arcee AI | 02/2026 | 76.8 | 78.3%
54 | DeepSeek V3.2 Exp | DeepSeek | 10/2025 | 76.6 | 78.3%
55 | Horizon Beta | Other | 08/2025 | 76.6 | 78.3%
56 | GPT 5.1 | OpenAI | 11/2025 | 76.6 | 78.3%
57 | GPT 5 | OpenAI | 08/2025 | 76.6 | 78.3%
58 | GPT 5 | OpenAI | 09/2025 | 76.2 | 78.3%
59 | Grok 3 Mini | xAI | 08/2025 | 75.8 | 78.3%
60 | Gemini 3 Flash Preview | Google | 12/2025 | 75.2 | 78.3%
61 | GPT 5 | OpenAI | 08/2025 | 74.6 | 78.3%
62 | Kimi K2.5 | Moonshot AI | 02/2026 | 74.2 | 78.3%
63 | Gemini 3.1 Pro Preview | Google | 02/2026 | 73.7 | 73.9%
64 | Nova Pro V1 | Amazon | 08/2025 | 73.7 | 73.9%
65 | Llama 4 Scout | Meta | 08/2025 | 73.5 | 73.9%
66 | GPT 4o mini | OpenAI | 08/2025 | 73.3 | 73.9%
67 | DeepSeek V3 | DeepSeek | 08/2025 | 73.1 | 73.9%
68 | Claude 4.5 Sonnet | Anthropic | 10/2025 | 72.7 | 73.9%
69 | DeepSeek V3.2 Exp | DeepSeek | 12/2025 | 72.7 | 73.9%
70 | Gemini 3 Pro Preview | Google | 11/2025 | 72.5 | 73.9%
71 | Grok 4 Fast | xAI | 10/2025 | 72.4 | 78.3%
72 | DeepSeek R1 | DeepSeek | 08/2025 | 72.1 | 73.9%
73 | Step 3.5 Flash | StepFun | 02/2026 | 71.8 | 78.3%
74 | Sonoma Sky Alpha | Other | 09/2025 | 71.6 | 78.3%
75 | Qwen3 Coder Next | Qwen | 02/2026 | 69.5 | 73.9%
76 | Mistral Large 25.12 | Mistral | 12/2025 | 68.6 | 69.6%
77 | Nova 2 Lite V1 | Amazon | 02/2026 | 68.6 | 69.6%
78 | GPT 5 mini | OpenAI | 09/2025 | 66.2 | 69.6%
79 | Kimi K2 (0905) | Moonshot AI | 10/2025 | 65.5 | 65.2%
80 | MIMO V2 Flash | Minimax | 12/2025 | 65.4 | 69.6%
81 | Devstral 25.12 | Mistral | 12/2025 | 65.1 | 65.2%
82 | Grok 3 | xAI | 08/2025 | 65.1 | 65.2%
83 | Kimi K2 | Moonshot AI | 08/2025 | 64.9 | 65.2%
84 | Claude 3.7 Sonnet | Anthropic | 08/2025 | 63.5 | 65.2%
85 | Qwen3 Coder Plus | Qwen | 10/2025 | 63.5 | 65.2%
86 | Nova Lite V1 | Amazon | 08/2025 | 63.2 | 60.9%
87 | Mistral Medium 3 | Mistral | 08/2025 | 61.4 | 60.9%
88 | GPT 3.5 Turbo | OpenAI | 08/2025 | 59.7 | 56.5%
89 | Qwen3 Coder | Qwen | 08/2025 | 59.2 | 60.9%
90 | Claude 3 Haiku | Anthropic | 08/2025 | 56.0 | 52.2%
91 | OSS 20B | OpenAI | 08/2025 | 53.0 | 52.2%
92 | Nova Micro V1 | Amazon | 08/2025 | 22.6 | 17.4%
93 | Gemma 3 4B IT | Google | 08/2025 | 15.7 | 13.0%
94 | Command A | Cohere | 08/2025 | 7.2 | 0.0%
Top Performers
#1 OSS 120B (OpenAI) | Score: 92.1
Success Rate: 95.7% (22 of 23 tests passed)
Quality: 60
Issues: 20
#2 Claude 4 Opus (Anthropic) | Score: 91.3
Success Rate: 95.7% (22 of 23 tests passed)
Quality: 52
Issues: 24
#3 Claude 4.5 Haiku (Anthropic) | Score: 90.9
Success Rate: 95.7% (22 of 23 tests passed)
Quality: 48
Issues: 26
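The success rates shown for the top performers are consistent with tests passed divided by total tests, rounded to one decimal place (22 of 23 gives 95.7%). The page does not state this formula, so treat it as an inference from the displayed numbers. The helper name `success_rate` and the default of 23 total tests are assumptions made for illustration. A minimal Python sketch under those assumptions:

```python
def success_rate(tests_passed: int, total_tests: int = 23) -> float:
    """Return the pass percentage rounded to one decimal place.

    Assumes success rate = tests passed / total tests; this is inferred
    from the figures on the page, not a documented formula.
    """
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    if not 0 <= tests_passed <= total_tests:
        raise ValueError("tests_passed must be between 0 and total_tests")
    return round(100 * tests_passed / total_tests, 1)


# Figures from the Top Performers cards: 22 of 23 tests passed.
print(success_rate(22))  # 95.7
```

Under the same assumption, the other percentages in the results table also map to whole test counts out of 23, for example 87.0% to 20 passed and 0.0% to none passed.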