Benchmark Detail View
School Library Management
MEDIUM Challenge | 95 models tested | Top Score: 80.1
Success Rate: 59.1%
Quality Score: 82
Tests Passed: 17
Models Tested: 95
School Library Benchmark - Individual Model Results
Rank | Model | Provider | Date | Score | Success Rate
1 | DeepSeek V3.2 Speciale | DeepSeek | 02/2026 | 80.1 | 78.6%
2 | Claude 3 Haiku | Anthropic | 08/2025 | 79.9 | 78.6%
3 | GLM 5 | Z.AI | 02/2026 | 79.7 | 78.6%
4 | Grok 4 | xAI | 08/2025 | 79.5 | 78.6%
5 | Horizon Beta | Other | 08/2025 | 78.1 | 78.6%
6 | GPT 5.2 | OpenAI | 12/2025 | 76.7 | 75.0%
7 | Claude 4.6 Opus | Anthropic | 02/2026 | 76.5 | 75.0%
8 | GPT 5.1 | OpenAI | 11/2025 | 76.3 | 75.0%
9 | GPT 5.3 Chat | OpenAI | 03/2026 | 76.3 | 75.0%
10 | Claude 4.5 Sonnet | Anthropic | 10/2025 | 76.1 | 75.0%
11 | o4 mini | OpenAI | 08/2025 | 76.1 | 75.0%
12 | OSS 120B | OpenAI | 08/2025 | 76.1 | 75.0%
13 | DeepSeek R1 | DeepSeek | 08/2025 | 76.1 | 75.0%
14 | o1 mini | OpenAI | 08/2025 | 75.9 | 75.0%
15 | Grok 3 Mini | xAI | 08/2025 | 75.7 | 75.0%
16 | o3 mini (High) | OpenAI | 08/2025 | 75.5 | 75.0%
17 | o3 mini | OpenAI | 08/2025 | 75.5 | 75.0%
18 | GPT 4o | OpenAI | 08/2025 | 75.3 | 75.0%
19 | GLM 4.7 | Z.AI | 12/2025 | 74.7 | 75.0%
20 | GPT 5 | OpenAI | 09/2025 | 74.5 | 75.0%
21 | o4 mini (High) | OpenAI | 08/2025 | 74.5 | 75.0%
22 | GPT 4.1 | OpenAI | 08/2025 | 74.3 | 75.0%
23 | Nova Pro V1 | Amazon | 08/2025 | 73.3 | 71.4%
24 | Mistral Medium 3 | Mistral | 08/2025 | 73.1 | 71.4%
25 | GPT 5 mini | OpenAI | 09/2025 | 72.9 | 75.0%
26 | Claude 4 Sonnet | Anthropic | 08/2025 | 72.7 | 71.4%
27 | GPT 5.1 Codex | OpenAI | 11/2025 | 72.7 | 71.4%
28 | Kimi K2 Thinking | Moonshot AI | 12/2025 | 72.5 | 71.4%
29 | MIMO V2 Flash | Minimax | 12/2025 | 72.5 | 71.4%
30 | GPT 4.1 nano | OpenAI | 08/2025 | 71.9 | 71.4%
31 | Sonoma Sky Alpha | Other | 09/2025 | 71.3 | 71.4%
32 | Nova Lite V1 | Amazon | 08/2025 | 70.9 | 67.9%
33 | GPT 4o mini | OpenAI | 08/2025 | 70.7 | 71.4%
34 | Trinity Large Preview | Arcee AI | 02/2026 | 70.3 | 67.9%
35 | GPT 5.1 Codex Mini | OpenAI | 11/2025 | 70.1 | 67.9%
36 | GPT 5 mini | OpenAI | 08/2025 | 69.3 | 71.4%
37 | Step 3.5 Flash | StepFun | 02/2026 | 68.9 | 67.9%
38 | Coder Large | Other | 08/2025 | 66.9 | 64.3%
39 | Grok Code Fast 1 | xAI | 09/2025 | 66.5 | 64.3%
40 | Nova Micro V1 | Amazon | 08/2025 | 66.3 | 64.3%
41 | Gemini 3 Flash Preview | Google | 12/2025 | 64.2 | 60.7%
42 | Gemini 3 Pro Preview | Google | 11/2025 | 64.0 | 60.7%
43 | MiniMax M2.1 | Minimax | 12/2025 | 63.6 | 60.7%
44 | GPT 5.2 | OpenAI | 12/2025 | 63.4 | 60.7%
45 | Gemini 2.5 Flash Lite | Google | 08/2025 | 62.2 | 60.7%
46 | OSS 20B | OpenAI | 08/2025 | 62.2 | 60.7%
47 | Grok 3 | xAI | 08/2025 | 60.8 | 57.1%
48 | Claude 4.6 Sonnet | Anthropic | 02/2026 | 60.2 | 57.1%
49 | Claude 4.5 Opus | Anthropic | 11/2025 | 59.8 | 57.1%
50 | GPT 5.3 Codex | OpenAI | 02/2026 | 59.2 | 57.1%
51 | GPT 5 Codex | OpenAI | 10/2025 | 59.2 | 57.1%
52 | DeepSeek V3 | DeepSeek | 08/2025 | 57.8 | 53.6%
53 | DeepSeek V3.2 Exp | DeepSeek | 12/2025 | 57.4 | 53.6%
54 | GPT 5.2 Codex | OpenAI | 01/2026 | 57.4 | 53.6%
55 | Mistral Large 25.12 | Mistral | 12/2025 | 57.2 | 53.6%
56 | Llama 4 Scout | Meta | 08/2025 | 56.8 | 53.6%
57 | Claude 4.1 Opus | Anthropic | 08/2025 | 56.6 | 53.6%
58 | Gemini 2.5 Flash | Google | 08/2025 | 56.6 | 53.6%
59 | Qwen3 Coder Plus | Qwen | 10/2025 | 56.6 | 53.6%
60 | Gemini 2.0 Flash 001 | Google | 08/2025 | 56.2 | 53.6%
61 | GPT 4 | OpenAI | 08/2025 | 55.8 | 53.6%
62 | Qwen3 14B | Qwen | 08/2025 | 55.8 | 53.6%
63 | Gemini 2.5 Pro | Google | 08/2025 | 53.8 | 50.0%
64 | Kimi K2 | Moonshot AI | 08/2025 | 53.8 | 50.0%
65 | Kimi K2.5 | Moonshot AI | 02/2026 | 53.6 | 53.6%
66 | Qwen3 Coder | Qwen | 08/2025 | 53.2 | 50.0%
67 | Claude 4 Opus | Anthropic | 08/2025 | 53.0 | 50.0%
68 | Qwen3 Coder Next | Qwen | 02/2026 | 52.8 | 50.0%
69 | DeepSeek V3.2 Exp | DeepSeek | 10/2025 | 52.6 | 50.0%
70 | Claude 3.7 Sonnet | Anthropic | 08/2025 | 51.2 | 46.4%
71 | Claude 3.7 Sonnet (Thinking) | Anthropic | 08/2025 | 51.2 | 46.4%
72 | Claude 3.5 Sonnet | Anthropic | 08/2025 | 51.0 | 46.4%
73 | Qwen3 Max | Qwen | 10/2025 | 50.8 | 46.4%
74 | GLM 4.5 | Z.AI | 08/2025 | 50.6 | 46.4%
75 | GPT 4 Turbo | OpenAI | 08/2025 | 50.6 | 46.4%
76 | Gemini 3.1 Pro Preview | Google | 02/2026 | 50.4 | 50.0%
77 | Kimi K2 (0905) | Moonshot AI | 10/2025 | 50.2 | 46.4%
78 | Claude 3.5 Haiku | Anthropic | 08/2025 | 50.0 | 46.4%
79 | GLM 4.6 | Z.AI | 10/2025 | 49.6 | 46.4%
80 | Claude 4.5 Haiku | Anthropic | 10/2025 | 49.2 | 46.4%
81 | Grok 4 Fast | xAI | 10/2025 | 49.0 | 46.4%
82 | MiniMax M2.5 | Minimax | 02/2026 | 48.4 | 46.4%
83 | GPT 4.1 mini | OpenAI | 08/2025 | 48.2 | 46.4%
84 | Llama 4 Maverick | Meta | 08/2025 | 47.8 | 42.9%
85 | Codestral 25.08 | Mistral | 08/2025 | 47.2 | 42.9%
86 | GPT 5.4 | OpenAI | 03/2026 | 47.0 | 42.9%
87 | Grok 4.1 Fast | xAI | 02/2026 | 46.8 | 42.9%
88 | GPT 5 nano | OpenAI | 08/2025 | 46.0 | 42.9%
89 | GPT 4o | OpenAI | 08/2025 | 44.6 | 39.3%
90 | GPT 5 | OpenAI | 08/2025 | 44.4 | 39.3%
91 | Devstral 25.12 | Mistral | 12/2025 | 44.0 | 39.3%
92 | GPT 3.5 Turbo | OpenAI | 08/2025 | 41.5 | 35.7%
93 | GPT 5 | OpenAI | 08/2025 | 41.4 | 39.3%
94 | GPT 5.1 | OpenAI | 11/2025 | 31.5 | 32.1%
95 | Command A | Cohere | 08/2025 | 13.0 | 3.6%
Top Performers
#1 DeepSeek V3.2 Speciale (DeepSeek) | Score: 80.1
Success Rate: 78.6%
Tests Passed: 22 of 28
Quality: 94
Issues: 3
#2 Claude 3 Haiku (Anthropic) | Score: 79.9
Success Rate: 78.6%
Tests Passed: 22 of 28
Quality: 92
Issues: 4
#3 GLM 5 (Z.AI) | Score: 79.7
Success Rate: 78.6%
Tests Passed: 22 of 28
Quality: 90
Issues: 5