Ruby LLM Benchmarks: AI Model Performance Dashboard
Comprehensive performance analysis of LLM models across four program-fixing benchmarks, testing Ruby code generation on real programming challenges validated against test suites and RuboCop quality standards.
4 Benchmarks · 99 AI Models · 59.0% Avg Success
Overall Performance Rankings - All Benchmarks Combined
Showing 99 of 99 models
Rank | Model | Provider | Release | Score | Success Rate
1 | Claude 4.6 Opus | Anthropic | 02/2026 | 76.1 | 77.2%
2 | GLM 5 | Z.AI | 02/2026 | 75.6 | 77.2%
3 | Claude 4 Sonnet | Anthropic | 08/2025 | 72.6 | 73.7%
4 | Claude 4.5 Sonnet | Anthropic | 10/2025 | 72.3 | 73.2%
5 | Claude 4.6 Sonnet | Anthropic | 02/2026 | 72.2 | 72.7%
6 | GPT 5.2 | OpenAI | 12/2025 | 72.1 | 71.6%
7 | Claude 4.5 Opus | Anthropic | 11/2025 | 71.6 | 72.7%
8 | Claude 4.1 Opus | Anthropic | 08/2025 | 71.3 | 73.2%
9 | Claude 4 Opus | Anthropic | 08/2025 | 70.8 | 72.3%
10 | GPT 4.1 | OpenAI | 08/2025 | 70.8 | 73.1%
11 | GPT 5.3 Chat | OpenAI | 03/2026 | 70.2 | 70.9%
12 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7%
13 | GPT 5.1 | OpenAI | 11/2025 | 69.8 | 69.8%
14 | GPT 5.1 Codex | OpenAI | 11/2025 | 69.4 | 68.1%
15 | DeepSeek V3.2 Speciale | DeepSeek | 02/2026 | 69.3 | 69.9%
16 | Kimi K2 Thinking | Moonshot AI | 12/2025 | 69.2 | 69.2%
17 | GPT 5 | OpenAI | 09/2025 | 68.8 | 69.8%
18 | GPT 4o | OpenAI | 08/2025 | 68.7 | 68.6%
19 | GPT 5.3 Codex | OpenAI | 02/2026 | 68.6 | 67.1%
20 | Claude 3.7 Sonnet (Thinking) | Anthropic | 08/2025 | 68.6 | 68.9%
21 | DeepSeek V3.2 Exp | DeepSeek | 12/2025 | 68.6 | 68.5%
22 | o3 mini | OpenAI | 08/2025 | 68.5 | 69.0%
23 | o1 mini | OpenAI | 08/2025 | 68.3 | 68.8%
24 | Claude 3.5 Sonnet | Anthropic | 08/2025 | 68.0 | 67.0%
25 | DeepSeek R1 | DeepSeek | 08/2025 | 67.9 | 67.5%
26 | o4 mini | OpenAI | 08/2025 | 67.7 | 67.5%
27 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3%
28 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4%
29 | GLM 4.6 | Z.AI | 10/2025 | 67.1 | 67.0%
30 | GPT 5.2 | OpenAI | 12/2025 | 66.9 | 65.4%
31 | Gemini 3 Flash Preview | Google | 12/2025 | 66.9 | 66.9%
32 | OSS 120B | OpenAI | 08/2025 | 66.7 | 66.5%
33 | GPT 5.1 Codex Mini | OpenAI | 11/2025 | 66.7 | 65.1%
34 | DeepSeek V3.2 Exp | DeepSeek | 10/2025 | 66.2 | 66.2%
35 | o3 mini (High) | OpenAI | 08/2025 | 66.1 | 67.1%
36 | GPT 5.4 | OpenAI | 03/2026 | 65.7 | 63.5%
37 | o4 mini (High) | OpenAI | 08/2025 | 65.2 | 65.2%
38 | GPT 5.2 Codex | OpenAI | 01/2026 | 65.0 | 66.2%
39 | GPT 5 Codex | OpenAI | 10/2025 | 64.6 | 64.5%
40 | Sonoma Sky Alpha | Other | 09/2025 | 64.4 | 66.2%
41 | Qwen3 Max | Qwen | 10/2025 | 64.1 | 65.1%
42 | GPT 4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4%
43 | GPT 4 | OpenAI | 08/2025 | 63.8 | 63.0%
44 | Kimi K2.5 | Moonshot AI | 02/2026 | 63.3 | 65.1%
45 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1%
46 | GPT 5 mini | OpenAI | 08/2025 | 63.3 | 65.6%
47 | GLM 4.7 | Z.AI | 12/2025 | 63.2 | 63.8%
48 | GPT 4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2%
49 | Gemini 3.1 Pro Preview | Google | 02/2026 | 63.0 | 63.2%
50 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1%
51 | GPT 5 | OpenAI | 08/2025 | 62.5 | 61.6%
52 | MiniMax M2.5 | Minimax | 02/2026 | 62.4 | 63.0%
53 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5%
54 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2%
55 | GPT 5 mini | OpenAI | 09/2025 | 61.7 | 63.8%
56 | GPT 4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2%
57 | Qwen3 Coder Plus | Qwen | 10/2025 | 60.7 | 60.6%
58 | Claude 3.7 Sonnet | Anthropic | 08/2025 | 60.3 | 60.1%
59 | Kimi K2 (0905) | Moonshot AI | 10/2025 | 60.2 | 59.4%
60 | GPT 5 nano | OpenAI | 08/2025 | 60.1 | 59.9%
61 | Gemini 2.5 Pro | Google | 08/2025 | 60.0 | 58.7%
62 | GPT 5 | OpenAI | 08/2025 | 59.7 | 60.9%
63 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6%
64 | Grok 4 Fast | xAI | 10/2025 | 59.3 | 60.8%
65 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4%
66 | Trinity Large Preview | Arcee AI | 02/2026 | 58.1 | 56.6%
67 | GPT 5.1 | OpenAI | 11/2025 | 57.4 | 57.9%
68 | Gemini 2.0 Flash 001 | Google | 08/2025 | 57.3 | 57.6%
69 | Claude 3.5 Haiku | Anthropic | 08/2025 | 57.3 | 55.9%
70 | Claude 4.5 Haiku | Anthropic | 10/2025 | 56.8 | 56.1%
71 | MIMO V2 Flash | Minimax | 12/2025 | 56.7 | 57.3%
72 | Qwen3 Coder Next | Qwen | 02/2026 | 56.7 | 57.5%
73 | Grok Code Fast 1 | xAI | 09/2025 | 56.2 | 54.9%
74 | GPT 4o | OpenAI | 08/2025 | 55.8 | 54.4%
75 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2%
76 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6%
77 | Step 3.5 Flash | StepFun | 02/2026 | 54.5 | 55.9%
78 | Mistral Large 25.12 | Mistral | 12/2025 | 54.3 | 52.2%
79 | Claude 3 Haiku | Anthropic | 08/2025 | 53.9 | 50.7%
80 | Gemini 3 Pro Preview | Google | 11/2025 | 53.2 | 51.8%
81 | Devstral 25.12 | Mistral | 12/2025 | 52.1 | 49.4%
82 | Kimi K2 | Moonshot AI | 08/2025 | 52.1 | 50.2%
83 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4%
84 | GPT 4o mini | OpenAI | 08/2025 | 51.7 | 49.9%
85 | Coder Large | Other | 08/2025 | 50.6 | 49.1%
86 | Qwen3 Coder | Qwen | 08/2025 | 49.7 | 48.5%
87 | OSS 20B | OpenAI | 08/2025 | 48.2 | 45.8%
88 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1%
89 | Nova 2 Lite V1 | Amazon | 02/2026 | 45.8 | 46.4%
90 | GPT 5 nano | OpenAI | 09/2025 | 45.5 | 46.2%
91 | GPT 3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1%
92 | Grok 4.1 Fast | xAI | 02/2026 | 43.1 | 42.2%
93 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3%
94 | MiniMax M2.1 | Minimax | 12/2025 | 36.6 | 34.6%
95 | Qwen3 14B | Qwen | 08/2025 | 33.6 | 31.4%
96 | GLM 4.5 | Z.AI | 08/2025 | 24.0 | 22.1%
97 | Gemma 3 4B IT | Google | 08/2025 | 14.8 | 12.9%
98 | Magnum V4 72B | NousResearch | 08/2025 | 10.9 | 9.7%
99 | Command A | Cohere | 08/2025 | 8.0 | 1.7%
How Scoring Works
Test Success Rate (90% weight): Percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.
Code Quality (10% weight): Based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50.
RuboCop uses strict default settings and may not reflect real-world code quality preferences. The quality score should be interpreted as adherence to Ruby style guidelines rather than overall code quality.
📐 Calculation Formula
Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 - ((RuboCop Offenses ÷ 50) × 100), capped 0-100
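The two formulas above can be sketched in Ruby (method names are illustrative, not taken from the benchmark harness):

```ruby
# Quality decreases linearly from 100 to 0 as RuboCop offenses go
# from 0 to 50; anything beyond 50 offenses is capped at 0.
def quality_score(offenses)
  (100 - (offenses / 50.0) * 100).clamp(0, 100)
end

# Composite score: 90% test success rate, 10% code quality.
def composite_score(test_success_pct, offenses)
  test_success_pct * 0.90 + quality_score(offenses) * 0.10
end

puts quality_score(10)                   # => 80.0
puts quality_score(75)                   # => 0 (capped)
puts composite_score(80.0, 20).round(1)  # => 78.0
```

Note that a model with a perfect quality score but no passing tests can earn at most 10 points, which is why the rankings track the success-rate column so closely.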
Top Performers
#1 Claude 4.6 Opus (Anthropic) · Score: 76.1
Success Rate: 77.2% · Tests Passed: 92 of 121 · Quality: 67 · Issues: 67
#2 GLM 5 (Z.AI) · Score: 75.6
Success Rate: 77.2% · Tests Passed: 92 of 121 · Quality: 61 · Issues: 88
#3 Claude 4 Sonnet (Anthropic) · Score: 72.6
Success Rate: 73.7% · Tests Passed: 88 of 121 · Quality: 63 · Issues: 75
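As a sanity check, the 90/10 weighting reproduces the top three composite scores from their displayed success rates and quality scores (a sketch; the hash literal is illustrative, and small deviations come from rounding of the displayed inputs):

```ruby
# Displayed per-model stats from the Top Performers cards.
top_performers = {
  "Claude 4.6 Opus" => { success: 77.2, quality: 67 },  # displayed score 76.1
  "GLM 5"           => { success: 77.2, quality: 61 },  # displayed score 75.6
  "Claude 4 Sonnet" => { success: 73.7, quality: 63 },  # displayed score 72.6
}

top_performers.each do |name, stats|
  score = (stats[:success] * 0.90 + stats[:quality] * 0.10).round(1)
  puts format("%-16s %.1f", name, score)
end
# Recomputed scores land within 0.1 of the displayed values.
```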
Performance Overview
AI Models Tested: 99 · Benchmarks: 4 · Avg Success Rate: 59.0% · Avg Quality Score: 63
Benchmark Challenges
Vending Machine · Medium · 78.9
Average Success Rate: 65.2% · Quality Score: 80 · Models: 99
Calendar System · Easy · 92.1
Average Success Rate: 77.1% · Quality Score: 58 · Models: 94
Parking Garage · Hard · 67.1
Average Success Rate: 43.2% · Quality Score: 41 · Models: 94
School Library · Medium · 80.1
Average Success Rate: 59.1% · Quality Score: 82 · Models: 95
Dive Deeper into the Analysis (coming soon)
Explore detailed benchmark results, model comparisons, and performance insights across all coding challenges.