Ruby LLM Benchmarks: AI Model Performance Dashboard

Comprehensive performance analysis of LLM models across all program-fixing benchmarks. Each model's Ruby code generation is exercised on real programming challenges and validated against both test suites and RuboCop quality standards.
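For context, the validation loop behind each benchmark run looks roughly like the Ruby sketch below. This is a minimal illustration only, assuming RSpec-style test suites and RuboCop's JSON formatter; the `evaluate` helper and file names are hypothetical, not taken from the benchmark's actual harness.

```ruby
require "json"
require "open3"

# Sketch of scoring one model-generated solution: run the task's tests,
# then count RuboCop offenses. Names and file layout are illustrative.
def evaluate(solution_file, spec_file)
  # Run the challenge's test suite against the generated code.
  rspec_json, _status = Open3.capture2("rspec", spec_file, "--format", "json")
  summary = JSON.parse(rspec_json)["summary"]
  passed = summary["example_count"] - summary["failure_count"]

  # Count style/quality offenses in the generated file.
  rubocop_json, _status = Open3.capture2("rubocop", solution_file, "--format", "json")
  offenses = JSON.parse(rubocop_json)["summary"]["offense_count"]

  { passed: passed, total: summary["example_count"], offenses: offenses }
end
```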

4 Benchmarks · 85 AI Models · 59.7% Avg Success
Overall Performance Rankings - All Benchmarks Combined
Score is the weighted composite defined under "How Scoring Works" below; Success Rate is the raw test pass rate.

| Rank | Model | Provider | Date | Score | Success Rate |
|-----:|-------|----------|------|------:|-------------:|
| 1 | Claude 4.6 Opus | Claude | 02/2026 | 76.1 | 77.2% |
| 2 | Claude 4 Sonnet | Claude | 08/2025 | 72.6 | 73.7% |
| 3 | Claude 4.5 Sonnet | Claude | 10/2025 | 72.3 | 73.2% |
| 4 | OpenAI GPT-5.2 Chat | OpenAI | 12/2025 | 72.1 | 71.6% |
| 5 | Claude 4.5 Opus | Claude | 11/2025 | 71.6 | 72.7% |
| 6 | Claude 4.1 Opus | Claude | 08/2025 | 71.3 | 73.2% |
| 7 | Claude 4 Opus | Claude | 08/2025 | 70.8 | 72.3% |
| 8 | OpenAI GPT-4.1 | OpenAI | 08/2025 | 70.8 | 73.1% |
| 9 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7% |
| 10 | OpenAI GPT-5.1 Chat | OpenAI | 11/2025 | 69.8 | 69.8% |
| 11 | OpenAI 5.1 Codex | OpenAI | 11/2025 | 69.4 | 68.1% |
| 12 | Kimi K2 | Moonshot | 12/2025 | 69.2 | 69.2% |
| 13 | OpenAI GPT-5 | OpenAI | 09/2025 | 68.8 | 69.8% |
| 14 | OpenAI GPT-4o | OpenAI | 08/2025 | 68.7 | 68.6% |
| 15 | Claude 3.7 Sonnet (Thinking) | Claude | 08/2025 | 68.6 | 68.9% |
| 16 | DeepSeek V3 | DeepSeek | 12/2025 | 68.6 | 68.5% |
| 17 | OpenAI o3-mini | OpenAI | 08/2025 | 68.5 | 69.0% |
| 18 | OpenAI o1-mini | OpenAI | 08/2025 | 68.3 | 68.8% |
| 19 | Claude 3.5 Sonnet | Claude | 08/2025 | 68.0 | 67.0% |
| 20 | R1 | DeepSeek | 08/2025 | 67.9 | 67.5% |
| 21 | OpenAI o4-mini | OpenAI | 08/2025 | 67.7 | 67.5% |
| 22 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3% |
| 23 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4% |
| 24 | GLM 4.6 | Other | 10/2025 | 67.1 | 67.0% |
| 25 | OpenAI GPT-5.2 | OpenAI | 12/2025 | 66.9 | 65.4% |
| 26 | Gemini 3 Flash | Google | 12/2025 | 66.9 | 66.9% |
| 27 | OpenAI OSS 120B | OpenAI | 08/2025 | 66.7 | 66.5% |
| 28 | OpenAI 5.1 Codex Mini | OpenAI | 11/2025 | 66.6 | 65.1% |
| 29 | DeepSeek V3 | DeepSeek | 10/2025 | 66.2 | 66.2% |
| 30 | OpenAI o3-mini (High) | OpenAI | 08/2025 | 66.1 | 67.1% |
| 31 | OpenAI o4-mini (High) | OpenAI | 08/2025 | 65.2 | 65.2% |
| 32 | OpenAI 5.2 Codex | OpenAI | 01/2026 | 65.0 | 66.2% |
| 33 | OpenAI 5 Codex | OpenAI | 10/2025 | 64.6 | 64.5% |
| 34 | Sonoma Sky Alpha | Other | 09/2025 | 64.4 | 66.2% |
| 35 | Qwen3 Max | Alibaba | 10/2025 | 64.1 | 65.1% |
| 36 | OpenAI GPT-4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4% |
| 37 | OpenAI GPT-4 | OpenAI | 08/2025 | 63.8 | 63.0% |
| 38 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1% |
| 39 | OpenAI GPT-5 mini | OpenAI | 08/2025 | 63.3 | 65.6% |
| 40 | GLM 4.7 | Other | 12/2025 | 63.2 | 63.8% |
| 41 | OpenAI GPT-4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2% |
| 42 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1% |
| 43 | OpenAI GPT-5 Chat | OpenAI | 08/2025 | 62.5 | 61.6% |
| 44 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5% |
| 45 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2% |
| 46 | OpenAI GPT-5 mini | OpenAI | 09/2025 | 61.7 | 63.8% |
| 47 | OpenAI GPT-4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2% |
| 48 | OpenAI GPT-5 nano | OpenAI | 09/2025 | 60.7 | 61.6% |
| 49 | Qwen 3 Coder | Alibaba | 10/2025 | 60.7 | 60.6% |
| 50 | Claude 3.7 Sonnet | Claude | 08/2025 | 60.3 | 60.1% |
| 51 | Kimi K2 | Moonshot | 10/2025 | 60.2 | 59.4% |
| 52 | OpenAI GPT-5 nano | OpenAI | 08/2025 | 60.1 | 59.9% |
| 53 | Gemini 2.5 Pro | Google | 08/2025 | 60.0 | 58.7% |
| 54 | OpenAI GPT-5 | OpenAI | 08/2025 | 59.7 | 60.9% |
| 55 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6% |
| 56 | Grok 4 | xAI | 10/2025 | 59.3 | 60.8% |
| 57 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4% |
| 58 | OpenAI GPT-5.1 | OpenAI | 11/2025 | 57.4 | 57.9% |
| 59 | Gemini 2.0 Flash-001 | Google | 08/2025 | 57.3 | 57.6% |
| 60 | Claude 3.5 Haiku | Claude | 08/2025 | 57.3 | 55.9% |
| 61 | Claude 4.5 Haiku | Claude | 10/2025 | 56.8 | 56.1% |
| 62 | MiMo V2 Flash Free | Other | 12/2025 | 56.7 | 57.3% |
| 63 | Grok Code Fast 1 | xAI | 09/2025 | 56.2 | 54.9% |
| 64 | OpenAI GPT-4o | OpenAI | 08/2025 | 55.8 | 54.4% |
| 65 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2% |
| 66 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6% |
| 67 | Mistral Large 2512 | Mistral | 12/2025 | 54.3 | 52.2% |
| 68 | Claude 3 Haiku | Claude | 08/2025 | 53.9 | 50.7% |
| 69 | Gemini 3 Pro Preview | Google | 11/2025 | 53.2 | 51.8% |
| 70 | Devstral 2512 | Other | 12/2025 | 52.1 | 49.4% |
| 71 | Kimi K2 | Moonshot | 08/2025 | 52.1 | 50.2% |
| 72 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4% |
| 73 | OpenAI GPT-4o mini | OpenAI | 08/2025 | 51.7 | 49.9% |
| 74 | Coder Large | Other | 08/2025 | 50.6 | 49.1% |
| 75 | Qwen 3 Coder | Alibaba | 08/2025 | 49.7 | 48.5% |
| 76 | MiniMax M2.1 | Other | 12/2025 | 48.8 | 46.2% |
| 77 | OpenAI OSS 20B | OpenAI | 08/2025 | 48.2 | 45.8% |
| 78 | GLM 4.5 | Other | 08/2025 | 48.0 | 44.2% |
| 79 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1% |
| 80 | OpenAI GPT-3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1% |
| 81 | Qwen3 14B | Alibaba | 08/2025 | 44.8 | 41.9% |
| 82 | Magnum V4 72B | NousResearch | 08/2025 | 43.6 | 38.7% |
| 83 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3% |
| 84 | Gemma 3 4B IT | Google | 08/2025 | 29.7 | 25.9% |
| 85 | Command A | Cohere | 08/2025 | 10.6 | 2.3% |
How Scoring Works

- Test Success Rate (90% of the score): the percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.
- Code Quality (10% of the score): based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50.

Calculation Formula

Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 − ((RuboCop Offenses ÷ 50) × 100), capped 0–100
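Translated directly into Ruby, the formula reads as follows (method and parameter names are illustrative; `clamp` implements the 0–100 cap):

```ruby
# Quality falls linearly from 100 (no offenses) to 0 (50+ offenses).
def quality_score(rubocop_offenses)
  (100 - (rubocop_offenses / 50.0) * 100).clamp(0, 100)
end

# 90% weight on the test pass rate, 10% on RuboCop cleanliness.
def overall_score(test_success_rate, rubocop_offenses)
  (test_success_rate * 0.9) + (quality_score(rubocop_offenses) * 0.1)
end

overall_score(77.2, 20) # => 75.48 (77.2 * 0.9 + 60 * 0.1)
```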

Top Performers

#1 Claude 4.6 Opus (Claude): Score 76.1, Success Rate 77.2%, 92/121 tests passed, Quality 67, 67 issues
#2 Claude 4 Sonnet (Claude): Score 72.6, Success Rate 73.7%, 88/121 tests passed, Quality 63, 75 issues
#3 Claude 4.5 Sonnet (Claude): Score 72.3, Success Rate 73.2%, 88/121 tests passed, Quality 64, 73 issues

Performance Overview

- AI models tested: 85
- Benchmarks: 4
- Average success rate: 59.7%
- Average quality score: 66

Benchmark Challenges

Dive Deeper into the Analysis (coming soon)

Explore detailed benchmark results, model comparisons, and performance insights across all coding challenges.