Ruby LLM Benchmarks: AI Model Performance Dashboard
Comprehensive performance analysis of LLMs across four program-fixing benchmarks, testing Ruby code-generation capabilities on real programming challenges validated against test suites and RuboCop quality standards.
4 Benchmarks
83 AI Models
59.4% Avg Success
Overall Performance Rankings - All Benchmarks Combined
Showing 83 of 83 models
| # | Model | Provider | Date | Score | Success Rate |
|---|---|---|---|---|---|
| 1 | Claude 4 Sonnet | Claude | 08/2025 | 72.6 | 73.7% |
| 2 | Claude 4.5 Sonnet | Claude | 10/2025 | 72.3 | 73.2% |
| 3 | OpenAI GPT-5.2 Chat | OpenAI | 12/2025 | 72.1 | 71.6% |
| 4 | Claude 4.5 Opus | Claude | 11/2025 | 71.6 | 72.7% |
| 5 | Claude 4.1 Opus | Claude | 08/2025 | 71.3 | 73.2% |
| 6 | Claude 4 Opus | Claude | 08/2025 | 70.8 | 72.3% |
| 7 | OpenAI GPT-4.1 | OpenAI | 08/2025 | 70.8 | 73.1% |
| 8 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7% |
| 9 | OpenAI GPT-5.1 Chat | OpenAI | 11/2025 | 69.8 | 69.8% |
| 10 | OpenAI 5.1 Codex | OpenAI | 11/2025 | 69.4 | 68.1% |
| 11 | Kimi K2 | Moonshot | 12/2025 | 69.2 | 69.2% |
| 12 | OpenAI GPT-5 | OpenAI | 09/2025 | 68.8 | 69.8% |
| 13 | OpenAI GPT-4o | OpenAI | 08/2025 | 68.7 | 68.6% |
| 14 | Claude 3.7 Sonnet (Thinking) | Claude | 08/2025 | 68.6 | 68.9% |
| 15 | DeepSeek V3 | DeepSeek | 12/2025 | 68.6 | 68.5% |
| 16 | OpenAI o3-mini | OpenAI | 08/2025 | 68.5 | 69.0% |
| 17 | OpenAI o1-mini | OpenAI | 08/2025 | 68.3 | 68.8% |
| 18 | Claude 3.5 Sonnet | Claude | 08/2025 | 68.0 | 67.0% |
| 19 | R1 | DeepSeek | 08/2025 | 67.9 | 67.5% |
| 20 | OpenAI o4-mini | OpenAI | 08/2025 | 67.7 | 67.5% |
| 21 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3% |
| 22 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4% |
| 23 | GLM 4.6 | Other | 10/2025 | 67.1 | 67.0% |
| 24 | OpenAI GPT-5.2 | OpenAI | 12/2025 | 66.9 | 65.4% |
| 25 | Gemini 3 Flash | Google | 12/2025 | 66.9 | 66.9% |
| 26 | OpenAI OSS 120B | OpenAI | 08/2025 | 66.7 | 66.5% |
| 27 | OpenAI 5.1 Codex Mini | OpenAI | 11/2025 | 66.7 | 65.1% |
| 28 | DeepSeek V3 | DeepSeek | 10/2025 | 66.2 | 66.2% |
| 29 | OpenAI o3-mini (High) | OpenAI | 08/2025 | 66.1 | 67.1% |
| 30 | OpenAI o4-mini (High) | OpenAI | 08/2025 | 65.2 | 65.2% |
| 31 | OpenAI 5 Codex | OpenAI | 10/2025 | 64.6 | 64.5% |
| 32 | Sonoma Sky Alpha | Other | 09/2025 | 64.4 | 66.2% |
| 33 | Qwen3 Max | Alibaba | 10/2025 | 64.1 | 65.1% |
| 34 | OpenAI GPT-4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4% |
| 35 | OpenAI GPT-4 | OpenAI | 08/2025 | 63.8 | 63.0% |
| 36 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1% |
| 37 | OpenAI GPT-5 mini | OpenAI | 08/2025 | 63.3 | 65.6% |
| 38 | GLM 4.7 | Other | 12/2025 | 63.2 | 63.8% |
| 39 | OpenAI GPT-4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2% |
| 40 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1% |
| 41 | OpenAI GPT-5 Chat | OpenAI | 08/2025 | 62.5 | 61.6% |
| 42 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5% |
| 43 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2% |
| 44 | OpenAI GPT-5 mini | OpenAI | 09/2025 | 61.7 | 63.8% |
| 45 | OpenAI GPT-4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2% |
| 46 | OpenAI GPT-5 nano | OpenAI | 09/2025 | 60.7 | 61.6% |
| 47 | Qwen 3 Coder | Alibaba | 10/2025 | 60.7 | 60.6% |
| 48 | Claude 3.7 Sonnet | Claude | 08/2025 | 60.3 | 60.1% |
| 49 | Kimi K2 | Moonshot | 10/2025 | 60.2 | 59.4% |
| 50 | OpenAI GPT-5 nano | OpenAI | 08/2025 | 60.1 | 59.9% |
| 51 | Gemini 2.5 Pro | Google | 08/2025 | 60.0 | 58.7% |
| 52 | OpenAI GPT-5 | OpenAI | 08/2025 | 59.7 | 60.9% |
| 53 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6% |
| 54 | Grok 4 | xAI | 10/2025 | 59.3 | 60.8% |
| 55 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4% |
| 56 | OpenAI GPT-5.1 | OpenAI | 11/2025 | 57.4 | 57.9% |
| 57 | Gemini 2.0 Flash-001 | Google | 08/2025 | 57.3 | 57.6% |
| 58 | Claude 3.5 Haiku | Claude | 08/2025 | 57.3 | 55.9% |
| 59 | Claude 4.5 Haiku | Claude | 10/2025 | 56.8 | 56.1% |
| 60 | Mimo V2 Flash Free | Other | 12/2025 | 56.7 | 57.3% |
| 61 | Grok Code Fast 1 | xAI | 09/2025 | 56.2 | 54.9% |
| 62 | OpenAI GPT-4o | OpenAI | 08/2025 | 55.8 | 54.4% |
| 63 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2% |
| 64 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6% |
| 65 | Mistral Large 2512 | Mistral | 12/2025 | 54.3 | 52.2% |
| 66 | Claude 3 Haiku | Claude | 08/2025 | 53.9 | 50.7% |
| 67 | Gemini 3 Pro Preview | Google | 11/2025 | 53.2 | 51.8% |
| 68 | Devstral 2512 | Other | 12/2025 | 52.1 | 49.4% |
| 69 | Kimi K2 | Moonshot | 08/2025 | 52.1 | 50.2% |
| 70 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4% |
| 71 | OpenAI GPT-4o mini | OpenAI | 08/2025 | 51.7 | 49.9% |
| 72 | Coder Large | Other | 08/2025 | 50.6 | 49.1% |
| 73 | Qwen 3 Coder | Alibaba | 08/2025 | 49.7 | 48.5% |
| 74 | MiniMax M2.1 | Other | 12/2025 | 48.8 | 46.2% |
| 75 | OpenAI OSS 20B | OpenAI | 08/2025 | 48.2 | 45.8% |
| 76 | GLM 4.5 | Other | 08/2025 | 48.0 | 44.2% |
| 77 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1% |
| 78 | OpenAI GPT-3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1% |
| 79 | Qwen3 14B | Alibaba | 08/2025 | 44.8 | 41.9% |
| 80 | Magnum V4 72B | NousResearch | 08/2025 | 43.6 | 38.7% |
| 81 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3% |
| 82 | Gemma 3 4B IT | Google | 08/2025 | 29.7 | 25.9% |
| 83 | Command A | Cohere | 08/2025 | 10.6 | 2.3% |
How Scoring Works
Test Success Rate (90% of score): Percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.

Code Quality (10% of score): Based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50. Note that RuboCop uses strict default settings and may not reflect real-world code quality preferences, so the quality score should be interpreted as adherence to Ruby style guidelines rather than overall code quality.
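The offense count behind this score can be read from RuboCop's JSON formatter, which reports a total under `summary.offense_count`. A minimal Ruby sketch, assuming a report shaped like the output of `rubocop --format json`; the sample string below is hypothetical, trimmed to the fields used here:

```ruby
require "json"

# Parse a RuboCop JSON report and return the total offense count.
# In practice the report would come from: `rubocop --format json solution.rb`
def offense_count(report_json)
  JSON.parse(report_json).dig("summary", "offense_count")
end

# Hypothetical sample report (real reports also list per-file offenses).
sample = '{"summary":{"offense_count":12,"inspected_file_count":1}}'
offense_count(sample) # => 12
```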
📐 Calculation Formula
Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 - ((RuboCop Offenses ÷ 50) × 100), capped 0-100
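The two formulas above translate directly to Ruby. This is a sketch of the published scoring rules, not the benchmark's actual implementation:

```ruby
# Quality falls linearly from 100 to 0 as offenses go from 0 to 50,
# then floors at 0 (the cap described above).
def quality_score(offenses)
  (100 - (offenses / 50.0) * 100).clamp(0, 100)
end

# Composite score: test success weighted 90%, style quality 10%.
def composite_score(test_success_rate, offenses)
  test_success_rate * 0.9 + quality_score(offenses) * 0.1
end

quality_score(25)          # => 50.0
composite_score(80.0, 25)  # => 77.0
```

For example, a model passing 80.0% of tests with 25 RuboCop offenses scores 80.0 × 0.9 + 50.0 × 0.1 = 77.0.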
Top Performers
| Rank | Model | Score | Success Rate | Tests Passed | Quality | Issues |
|---|---|---|---|---|---|---|
| #1 | Claude 4 Sonnet (Claude) | 72.6 | 73.7% | 88 of 121 | 63 | 75 |
| #2 | Claude 4.5 Sonnet (Claude) | 72.3 | 73.2% | 88 of 121 | 64 | 73 |
| #3 | OpenAI GPT-5.2 Chat (OpenAI) | 72.1 | 71.6% | 84 of 121 | 77 | 46 |
Performance Overview
- AI Models Tested: 83
- Benchmarks: 4
- Avg Success Rate: 59.4%
- Avg Quality Score: 66
Benchmark Challenges
| Benchmark | Difficulty | Avg Score | Avg Success Rate | Quality Score | Models Tested |
|---|---|---|---|---|---|
| Vending Machine | Medium | 78.9 | 63.7% | 80 | 83 |
| Parking Garage | Hard | 67.1 | 41.5% | 42 | 79 |
| School Library | Medium | 79.9 | 58.9% | 82 | 80 |
| Calendar System | Easy | 92.1 | 76.4% | 59 | 79 |
Dive Deeper into the Analysis (coming soon)
Explore detailed benchmark results, model comparisons, and performance insights across all coding challenges.