Benchmark Detail View
Parking Garage Management
HARD Challenge | 95 models tested | Top Score: 67.1
Success Rate: 43.3%
Quality Score: 41
Tests Passed: 17
Models Tested: 95
Parking Garage Benchmark - Individual Model Results
Rank | Model | Provider | Date | Score | Success Rate
1 | Claude 3.7 Sonnet (Thinking) | Anthropic | 08/2025 | 67.1 | 69.2%
2 | Claude 4.6 Sonnet | Anthropic | 02/2026 | 66.3 | 69.2%
3 | DeepSeek V3.2 Exp | DeepSeek | 12/2025 | 66.3 | 69.2%
4 | Claude 4.6 Opus | Anthropic | 02/2026 | 65.9 | 69.2%
5 | Grok 3 | xAI | 08/2025 | 65.9 | 69.2%
6 | Claude 4.5 Opus | Anthropic | 11/2025 | 64.3 | 69.2%
7 | Claude 4 Opus | Anthropic | 08/2025 | 64.1 | 69.2%
8 | DeepSeek V3 | DeepSeek | 08/2025 | 64.1 | 64.1%
9 | Claude 3.5 Sonnet | Anthropic | 08/2025 | 63.4 | 61.5%
10 | Claude 4 Sonnet | Anthropic | 08/2025 | 63.2 | 66.7%
11 | Claude 4.1 Opus | Anthropic | 08/2025 | 62.9 | 69.2%
12 | Claude 4.5 Sonnet | Anthropic | 10/2025 | 62.8 | 66.7%
13 | GLM 5 | Z.AI | 02/2026 | 62.3 | 69.2%
14 | Codestral 25.08 | Mistral | 08/2025 | 60.6 | 61.5%
15 | GLM 4.6 | Z.AI | 10/2025 | 60.0 | 61.5%
16 | Qwen3 Coder Next | Qwen | 02/2026 | 59.1 | 64.1%
17 | DeepSeek V3.2 Exp | DeepSeek | 10/2025 | 58.1 | 59.0%
18 | GPT 4o | OpenAI | 08/2025 | 55.5 | 53.8%
19 | GPT 5.4 | OpenAI | 03/2026 | 54.1 | 51.3%
20 | GPT 5.3 Codex | OpenAI | 02/2026 | 53.4 | 51.3%
21 | Devstral 25.12 | Mistral | 12/2025 | 53.1 | 51.3%
22 | GPT 5.2 | OpenAI | 12/2025 | 51.8 | 51.3%
23 | GPT 5 | OpenAI | 08/2025 | 51.5 | 51.3%
24 | Gemini 3 Flash Preview | Google | 12/2025 | 50.4 | 51.3%
25 | Qwen3 Max | Qwen | 10/2025 | 50.3 | 53.8%
26 | GLM 4.7 | Z.AI | 12/2025 | 50.1 | 51.3%
27 | Gemini 3.1 Pro Preview | Google | 02/2026 | 50.0 | 51.3%
28 | Claude 3.7 Sonnet | Anthropic | 08/2025 | 49.1 | 51.3%
29 | Claude Opus 4.7 | Anthropic | 04/2026 | 49.1 | 51.3%
30 | GPT 5.1 | OpenAI | 11/2025 | 48.6 | 48.7%
31 | GPT 5.3 Chat | OpenAI | 03/2026 | 48.0 | 48.7%
32 | Kimi K2.5 | Moonshot AI | 02/2026 | 48.0 | 51.3%
33 | Grok 4.1 Fast | xAI | 02/2026 | 47.9 | 48.7%
34 | Kimi K2 (0905) | Moonshot AI | 10/2025 | 47.6 | 48.7%
35 | Horizon Beta | Other | 08/2025 | 47.3 | 48.7%
36 | GPT 4.1 mini | OpenAI | 08/2025 | 46.7 | 46.2%
37 | GPT 5 | OpenAI | 09/2025 | 46.6 | 48.7%
38 | GPT 5.2 Codex | OpenAI | 01/2026 | 46.5 | 51.3%
39 | GPT 4.1 | OpenAI | 08/2025 | 46.3 | 48.7%
40 | GPT 5 | OpenAI | 08/2025 | 45.5 | 48.7%
41 | DeepSeek R1 | DeepSeek | 08/2025 | 45.4 | 43.6%
42 | Llama 4 Scout | Meta | 08/2025 | 45.0 | 43.6%
43 | Qwen3 Coder Plus | Qwen | 10/2025 | 44.9 | 46.2%
44 | Mistral Large 25.12 | Mistral | 12/2025 | 44.8 | 43.6%
45 | GPT 5.1 Codex | OpenAI | 11/2025 | 44.7 | 41.0%
46 | GPT 4o | OpenAI | 08/2025 | 44.4 | 43.6%
47 | Claude 4.5 Haiku | Anthropic | 10/2025 | 44.2 | 43.6%
48 | Kimi K2 | Moonshot AI | 08/2025 | 44.2 | 43.6%
49 | MIMO V2 Flash | Minimax | 12/2025 | 43.5 | 46.2%
50 | Kimi K2 Thinking | Moonshot AI | 12/2025 | 42.5 | 41.0%
51 | Llama 4 Maverick | Meta | 08/2025 | 42.5 | 41.0%
52 | GPT 5.2 | OpenAI | 12/2025 | 42.5 | 41.0%
53 | GPT 5.1 | OpenAI | 11/2025 | 42.2 | 43.6%
54 | Mistral Medium 3 | Mistral | 08/2025 | 41.6 | 38.5%
55 | o3 mini | OpenAI | 08/2025 | 40.7 | 41.0%
56 | MiniMax M2.5 | Minimax | 02/2026 | 40.5 | 41.0%
57 | GPT 4 Turbo | OpenAI | 08/2025 | 40.4 | 38.5%
58 | Qwen3 Coder | Qwen | 08/2025 | 40.3 | 41.0%
59 | GPT 4 | OpenAI | 08/2025 | 39.8 | 38.5%
60 | Gemini 2.0 Flash 001 | Google | 08/2025 | 39.6 | 43.6%
61 | DeepSeek V3.2 Speciale | DeepSeek | 02/2026 | 39.3 | 41.0%
62 | Sonoma Sky Alpha | Other | 09/2025 | 39.3 | 41.0%
63 | Grok 4 | xAI | 08/2025 | 39.2 | 38.5%
64 | GPT 5 Codex | OpenAI | 10/2025 | 38.9 | 41.0%
65 | Trinity Large Preview | Arcee AI | 02/2026 | 38.6 | 38.5%
66 | Grok 4 Fast | xAI | 10/2025 | 38.3 | 41.0%
67 | Nova 2 Lite V1 | Amazon | 02/2026 | 38.0 | 38.5%
68 | o1 mini | OpenAI | 08/2025 | 37.1 | 35.9%
69 | GPT 3.5 Turbo | OpenAI | 08/2025 | 36.4 | 33.3%
70 | MiniMax M2.1 | Minimax | 12/2025 | 36.1 | 35.9%
71 | Claude 3 Haiku | Anthropic | 08/2025 | 36.0 | 33.3%
72 | Qwen3 14B | Qwen | 08/2025 | 35.0 | 33.3%
73 | o4 mini | OpenAI | 08/2025 | 32.9 | 30.8%
74 | GPT 5.1 Codex Mini | OpenAI | 11/2025 | 32.8 | 28.2%
75 | Grok Code Fast 1 | xAI | 09/2025 | 32.3 | 30.8%
76 | o3 mini (High) | OpenAI | 08/2025 | 31.8 | 33.3%
77 | OSS 20B | OpenAI | 08/2025 | 31.6 | 28.2%
78 | GPT 5 mini | OpenAI | 09/2025 | 31.2 | 33.3%
79 | Gemini 3 Pro Preview | Google | 11/2025 | 30.1 | 30.8%
80 | Gemini 2.5 Flash | Google | 08/2025 | 29.7 | 30.8%
81 | GPT 5 nano | OpenAI | 08/2025 | 29.0 | 28.2%
82 | GPT 5 mini | OpenAI | 08/2025 | 28.9 | 30.8%
83 | o4 mini (High) | OpenAI | 08/2025 | 27.9 | 25.6%
84 | Nova Micro V1 | Amazon | 08/2025 | 27.0 | 23.1%
85 | Nova Lite V1 | Amazon | 08/2025 | 23.8 | 17.9%
86 | Gemini 2.5 Pro | Google | 08/2025 | 23.5 | 20.5%
87 | Grok 3 Mini | xAI | 08/2025 | 23.4 | 23.1%
88 | GPT 5 nano | OpenAI | 09/2025 | 22.7 | 20.5%
89 | OSS 120B | OpenAI | 08/2025 | 20.4 | 17.9%
90 | Gemini 2.5 Flash Lite | Google | 08/2025 | 19.1 | 20.5%
91 | GPT 4o mini | OpenAI | 08/2025 | 19.0 | 15.4%
92 | Claude 3.5 Haiku | Anthropic | 08/2025 | 16.9 | 12.8%
93 | Nova Pro V1 | Amazon | 08/2025 | 14.6 | 10.3%
94 | GPT 4.1 nano | OpenAI | 08/2025 | 11.9 | 12.8%
95 | Coder Large | Other | 08/2025 | 9.3 | 7.7%
Top Performers
#1 Claude 3.7 Sonnet (Thinking) (Anthropic)
Score: 67.1 | Success Rate: 69.2% | Tests Passed: 27 of 39 | Quality: 48 | Issues: 26
#2 Claude 4.6 Sonnet (Anthropic)
Score: 66.3 | Success Rate: 69.2% | Tests Passed: 27 of 39 | Quality: 40 | Issues: 30
#3 DeepSeek V3.2 Exp (DeepSeek)
Score: 66.3 | Success Rate: 69.2% | Tests Passed: 27 of 39 | Quality: 40 | Issues: 30