Ruby LLM Benchmarks: AI Model Performance Dashboard
Comprehensive performance analysis of LLMs across four program-fixing benchmarks, testing Ruby code generation on real programming challenges validated against test suites and RuboCop quality standards.
4 Benchmarks · 58 AI Models · 58.0% Avg Success
Overall Performance Rankings - All Benchmarks Combined
| Rank | Model | Provider | Date | Score | Success Rate |
|------|-------|----------|------|-------|--------------|
| 1 | Claude 4 Sonnet | Claude | 08/2025 | 72.6 | 73.7% |
| 2 | Claude 4.1 Opus | Claude | 08/2025 | 71.3 | 73.2% |
| 3 | Claude 4 Opus | Claude | 08/2025 | 70.8 | 72.3% |
| 4 | OpenAI GPT-4.1 | OpenAI | 08/2025 | 70.8 | 73.1% |
| 5 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7% |
| 6 | OpenAI GPT-5 | OpenAI | 09/2025 | 68.8 | 69.8% |
| 7 | OpenAI GPT-4o | OpenAI | 08/2025 | 68.7 | 68.6% |
| 8 | Claude 3.7 Sonnet (Thinking) | Claude | 08/2025 | 68.6 | 68.9% |
| 9 | OpenAI o3-mini | OpenAI | 08/2025 | 68.5 | 69.0% |
| 10 | OpenAI o1-mini | OpenAI | 08/2025 | 68.3 | 68.8% |
| 11 | Claude 3.5 Sonnet | Claude | 08/2025 | 68.0 | 67.0% |
| 12 | R1 | DeepSeek | 08/2025 | 67.9 | 67.5% |
| 13 | OpenAI o4-mini | OpenAI | 08/2025 | 67.7 | 67.5% |
| 14 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3% |
| 15 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4% |
| 16 | OpenAI OSS 120B | OpenAI | 08/2025 | 66.7 | 66.5% |
| 17 | OpenAI o3-mini (High) | OpenAI | 08/2025 | 66.1 | 67.1% |
| 18 | OpenAI o4-mini (High) | OpenAI | 08/2025 | 65.2 | 65.2% |
| 19 | Sonoma Sky Alpha | Other | 09/2025 | 64.4 | 66.2% |
| 20 | OpenAI GPT-4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4% |
| 21 | OpenAI GPT-4 | OpenAI | 08/2025 | 63.8 | 63.0% |
| 22 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1% |
| 23 | OpenAI GPT-5 mini | OpenAI | 08/2025 | 63.3 | 65.6% |
| 24 | OpenAI GPT-4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2% |
| 25 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1% |
| 26 | OpenAI GPT-5 Chat | OpenAI | 08/2025 | 62.5 | 61.6% |
| 27 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5% |
| 28 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2% |
| 29 | OpenAI GPT-5 mini | OpenAI | 09/2025 | 61.7 | 63.8% |
| 30 | OpenAI GPT-4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2% |
| 31 | OpenAI GPT-5 nano | OpenAI | 09/2025 | 60.7 | 61.6% |
| 32 | Claude 3.7 Sonnet | Claude | 08/2025 | 60.3 | 60.1% |
| 33 | OpenAI GPT-5 nano | OpenAI | 08/2025 | 60.1 | 59.9% |
| 34 | Gemini 2.5 Pro | Google | 08/2025 | 60.0 | 58.7% |
| 35 | OpenAI GPT-5 | OpenAI | 08/2025 | 59.7 | 60.9% |
| 36 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6% |
| 37 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4% |
| 38 | Gemini 2.0 Flash-001 | Google | 08/2025 | 57.3 | 57.6% |
| 39 | Claude 3.5 Haiku | Claude | 08/2025 | 57.3 | 55.9% |
| 40 | Grok Code Fast 1 | xAI | 09/2025 | 56.2 | 54.9% |
| 41 | OpenAI GPT-4o | OpenAI | 08/2025 | 55.8 | 54.4% |
| 42 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2% |
| 43 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6% |
| 44 | Claude 3 Haiku | Claude | 08/2025 | 53.9 | 50.7% |
| 45 | Kimi K2 | Moonshot | 08/2025 | 52.1 | 50.2% |
| 46 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4% |
| 47 | OpenAI GPT-4o mini | OpenAI | 08/2025 | 51.7 | 49.9% |
| 48 | Coder Large | Other | 08/2025 | 50.6 | 49.1% |
| 49 | Qwen 3 Coder | Alibaba | 08/2025 | 49.7 | 48.5% |
| 50 | OpenAI OSS 20B | OpenAI | 08/2025 | 48.2 | 45.8% |
| 51 | GLM 4.5 | Other | 08/2025 | 48.0 | 44.2% |
| 52 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1% |
| 53 | OpenAI GPT-3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1% |
| 54 | Qwen3 14B | Alibaba | 08/2025 | 44.8 | 41.9% |
| 55 | Magnum V4 72B | NousResearch | 08/2025 | 43.6 | 38.7% |
| 56 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3% |
| 57 | Gemma 3 4B IT | Google | 08/2025 | 29.7 | 25.9% |
| 58 | Command A | Cohere | 08/2025 | 10.6 | 2.3% |
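How the four per-benchmark scores are combined into this overall ranking is not spelled out on the dashboard. The sketch below assumes an unweighted mean and uses placeholder model names and scores rather than real results:

```ruby
# Placeholder per-benchmark scores for two hypothetical models;
# the real dashboard combines four benchmarks per model.
results = {
  "Model A" => [80.0, 60.0, 75.0, 90.0],
  "Model B" => [70.0, 55.0, 65.0, 85.0]
}

# Assumed aggregation: unweighted mean of the benchmark scores.
overall = results.transform_values { |scores| (scores.sum / scores.size).round(1) }

overall.sort_by { |_, score| -score }.each.with_index(1) do |(model, score), rank|
  puts format("%2d  %-8s %5.1f", rank, model, score)
end
```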
How Scoring Works
Test Success Rate (90% of the score): the percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.

Code Quality (10% of the score): based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50. Note that RuboCop uses strict default settings and may not reflect real-world code-quality preferences; this score is better read as adherence to Ruby style guidelines than as a measure of overall code quality.
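For reference, offense counts like the ones used here can be collected from RuboCop's JSON formatter. This is a minimal sketch, not the benchmark's actual harness; the file path is a placeholder:

```ruby
require "json"
require "open3"

# Run RuboCop with its default (strict) settings and read the offense
# total from the JSON report. "solution.rb" is a placeholder path.
def offense_count(path)
  stdout, _stderr, _status = Open3.capture3("rubocop", "--format", "json", path)
  JSON.parse(stdout).dig("summary", "offense_count")
end

puts offense_count("solution.rb")
```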
📐 Calculation Formula

Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 - ((RuboCop Offenses ÷ 50) × 100), clamped to the 0-100 range
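As a concrete illustration, here is a minimal Ruby sketch of that formula; the method names are invented for this example and are not part of the benchmark harness:

```ruby
# Quality falls linearly from 100 (zero offenses) to 0 (50 or more offenses).
def quality_score(rubocop_offenses)
  (100 - (rubocop_offenses / 50.0) * 100).clamp(0, 100)
end

# Composite score: 90% weight on test success, 10% on RuboCop-based quality.
# test_success_rate is a percentage, e.g. 73.7 for 73.7%.
def composite_score(test_success_rate, quality)
  (test_success_rate * 0.9) + (quality * 0.1)
end

quality_score(10)          # => 80.0  (10 offenses out of the 50-offense cap)
composite_score(73.7, 63)  # => 72.63, which rounds to Claude 4 Sonnet's 72.6
```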
Top Performers
| Rank | Model | Provider | Score | Success Rate | Tests Passed | Quality | Issues |
|------|-------|----------|-------|--------------|--------------|---------|--------|
| 1 | Claude 4 Sonnet | Claude | 72.6 | 73.7% | 88 / 121 | 63 | 75 |
| 2 | Claude 4.1 Opus | Claude | 71.3 | 73.2% | 87 / 121 | 55 | 90 |
| 3 | Claude 4 Opus | Claude | 70.8 | 72.3% | 86 / 121 | 58 | 84 |
Performance Overview
58 AI Models Tested · 4 Benchmarks · 58.0% Avg Success Rate · 66 Avg Quality Score
Benchmark Challenges
| Benchmark | Difficulty | Score | Avg Success Rate | Quality Score | Models Tested |
|-----------|------------|-------|------------------|---------------|---------------|
| Vending Machine | Medium | 78.5 | 62.1% | 80 | 58 |
| Parking Garage | Hard | 67.1 | 38.4% | 42 | 54 |
| School Library | Medium | 79.9 | 59.3% | 82 | 55 |
| Calendar System | Easy | 92.1 | 75.4% | 58 | 55 |
Dive Deeper into the Analysis (coming soon)
Explore detailed benchmark results, model comparisons, and performance insights across all coding challenges.