Ruby LLM Benchmarks
AI Model Performance Dashboard

Comprehensive performance analysis of LLMs across all program-fixing benchmarks, testing Ruby code generation through real programming challenges validated against test suites and RuboCop quality standards.

4 Benchmarks
100 AI Models
59.1% Avg Success
Overall Performance Rankings - All Benchmarks Combined
Rank | Model | Provider | Date | Score | Success Rate
1 | Claude 4.6 Opus | Anthropic | 02/2026 | 76.1 | 77.2%
2 | GLM 5 | Z.AI | 02/2026 | 75.6 | 77.2%
3 | Claude 4 Sonnet | Anthropic | 08/2025 | 72.6 | 73.7%
4 | Claude 4.5 Sonnet | Anthropic | 10/2025 | 72.3 | 73.2%
5 | Claude 4.6 Sonnet | Anthropic | 02/2026 | 72.2 | 72.7%
6 | GPT 5.2 | OpenAI | 12/2025 | 72.1 | 71.6%
7 | Claude 4.5 Opus | Anthropic | 11/2025 | 71.6 | 72.7%
8 | Claude 4.1 Opus | Anthropic | 08/2025 | 71.3 | 73.2%
9 | Claude 4 Opus | Anthropic | 08/2025 | 70.8 | 72.3%
10 | GPT 4.1 | OpenAI | 08/2025 | 70.8 | 73.1%
11 | GPT 5.3 Chat | OpenAI | 03/2026 | 70.2 | 70.9%
12 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7%
13 | GPT 5.1 | OpenAI | 11/2025 | 69.8 | 69.8%
14 | GPT 5.1 Codex | OpenAI | 11/2025 | 69.4 | 68.1%
15 | DeepSeek V3.2 Speciale | DeepSeek | 02/2026 | 69.3 | 69.9%
16 | Kimi K2 Thinking | Moonshot AI | 12/2025 | 69.2 | 69.2%
17 | GPT 5 | OpenAI | 09/2025 | 68.8 | 69.8%
18 | GPT 4o | OpenAI | 08/2025 | 68.7 | 68.6%
19 | GPT 5.3 Codex | OpenAI | 02/2026 | 68.6 | 67.1%
20 | Claude 3.7 Sonnet (Thinking) | Anthropic | 08/2025 | 68.6 | 68.9%
21 | DeepSeek V3.2 Exp | DeepSeek | 12/2025 | 68.6 | 68.5%
22 | o3 mini | OpenAI | 08/2025 | 68.5 | 69.0%
23 | o1 mini | OpenAI | 08/2025 | 68.3 | 68.8%
24 | Claude 3.5 Sonnet | Anthropic | 08/2025 | 68.0 | 67.0%
25 | DeepSeek R1 | DeepSeek | 08/2025 | 67.9 | 67.5%
26 | o4 mini | OpenAI | 08/2025 | 67.7 | 67.5%
27 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3%
28 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4%
29 | GLM 4.6 | Z.AI | 10/2025 | 67.1 | 67.0%
30 | GPT 5.2 | OpenAI | 12/2025 | 66.9 | 65.4%
31 | Gemini 3 Flash Preview | Google | 12/2025 | 66.9 | 66.9%
32 | OSS 120B | OpenAI | 08/2025 | 66.7 | 66.5%
33 | GPT 5.1 Codex Mini | OpenAI | 11/2025 | 66.6 | 65.1%
34 | Claude Opus 4.7 | Anthropic | 04/2026 | 66.3 | 66.4%
35 | DeepSeek V3.2 Exp | DeepSeek | 10/2025 | 66.2 | 66.2%
36 | o3 mini (High) | OpenAI | 08/2025 | 66.1 | 67.1%
37 | GPT 5.4 | OpenAI | 03/2026 | 65.7 | 63.5%
38 | o4 mini (High) | OpenAI | 08/2025 | 65.2 | 65.2%
39 | GPT 5.2 Codex | OpenAI | 01/2026 | 65.0 | 66.2%
40 | GPT 5 Codex | OpenAI | 10/2025 | 64.6 | 64.5%
41 | Sonoma Sky Alpha | Other | 09/2025 | 64.4 | 66.2%
42 | Qwen3 Max | Qwen | 10/2025 | 64.1 | 65.1%
43 | GPT 4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4%
44 | GPT 4 | OpenAI | 08/2025 | 63.8 | 63.0%
45 | Kimi K2.5 | Moonshot AI | 02/2026 | 63.3 | 65.1%
46 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1%
47 | GPT 5 mini | OpenAI | 08/2025 | 63.3 | 65.6%
48 | GLM 4.7 | Z.AI | 12/2025 | 63.2 | 63.8%
49 | GPT 4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2%
50 | Gemini 3.1 Pro Preview | Google | 02/2026 | 63.0 | 63.2%
51 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1%
52 | GPT 5 | OpenAI | 08/2025 | 62.5 | 61.6%
53 | MiniMax M2.5 | Minimax | 02/2026 | 62.4 | 63.0%
54 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5%
55 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2%
56 | GPT 5 mini | OpenAI | 09/2025 | 61.7 | 63.8%
57 | GPT 4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2%
58 | Qwen3 Coder Plus | Qwen | 10/2025 | 60.7 | 60.6%
59 | Claude 3.7 Sonnet | Anthropic | 08/2025 | 60.3 | 60.1%
60 | Kimi K2 (0905) | Moonshot AI | 10/2025 | 60.2 | 59.4%
61 | GPT 5 nano | OpenAI | 08/2025 | 60.1 | 59.9%
62 | Gemini 2.5 Pro | Google | 08/2025 | 60.1 | 58.7%
63 | GPT 5 | OpenAI | 08/2025 | 59.7 | 60.9%
64 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6%
65 | Grok 4 Fast | xAI | 10/2025 | 59.3 | 60.8%
66 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4%
67 | Trinity Large Preview | Arcee AI | 02/2026 | 58.1 | 56.6%
68 | GPT 5.1 | OpenAI | 11/2025 | 57.4 | 57.9%
69 | Gemini 2.0 Flash 001 | Google | 08/2025 | 57.3 | 57.6%
70 | Claude 3.5 Haiku | Anthropic | 08/2025 | 57.3 | 55.9%
71 | Claude 4.5 Haiku | Anthropic | 10/2025 | 56.8 | 56.1%
72 | MIMO V2 Flash | Minimax | 12/2025 | 56.7 | 57.3%
73 | Qwen3 Coder Next | Qwen | 02/2026 | 56.7 | 57.5%
74 | Grok Code Fast 1 | xAI | 09/2025 | 56.2 | 54.9%
75 | GPT 4o | OpenAI | 08/2025 | 55.8 | 54.4%
76 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2%
77 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6%
78 | Step 3.5 Flash | StepFun | 02/2026 | 54.5 | 55.9%
79 | Mistral Large 25.12 | Mistral | 12/2025 | 54.3 | 52.2%
80 | Claude 3 Haiku | Anthropic | 08/2025 | 53.9 | 50.7%
81 | Gemini 3 Pro Preview | Google | 11/2025 | 53.1 | 51.8%
82 | Devstral 25.12 | Mistral | 12/2025 | 52.1 | 49.4%
83 | Kimi K2 | Moonshot AI | 08/2025 | 52.1 | 50.2%
84 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4%
85 | GPT 4o mini | OpenAI | 08/2025 | 51.7 | 49.9%
86 | Coder Large | Other | 08/2025 | 50.6 | 49.1%
87 | Qwen3 Coder | Qwen | 08/2025 | 49.7 | 48.5%
88 | OSS 20B | OpenAI | 08/2025 | 48.2 | 45.8%
89 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1%
90 | Nova 2 Lite V1 | Amazon | 02/2026 | 45.8 | 46.4%
91 | GPT 5 nano | OpenAI | 09/2025 | 45.5 | 46.2%
92 | GPT 3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1%
93 | Grok 4.1 Fast | xAI | 02/2026 | 43.1 | 42.2%
94 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3%
95 | MiniMax M2.1 | Minimax | 12/2025 | 36.6 | 34.6%
96 | Qwen3 14B | Qwen | 08/2025 | 33.6 | 31.4%
97 | GLM 4.5 | Z.AI | 08/2025 | 24.0 | 22.1%
98 | Gemma 3 4B IT | Google | 08/2025 | 14.8 | 12.9%
99 | Magnum V4 72B | NousResearch | 08/2025 | 10.9 | 9.7%
100 | Command A | Cohere | 08/2025 | 8.0 | 1.7%

How Scoring Works
Test Success Rate (90% of the score)

Percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.

Code Quality (10% of the score)

Based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50, and stays at 0 beyond 50 offenses.

Calculation Formula

Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 - ((RuboCop Offenses ÷ 50) × 100), capped 0-100
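
The two formulas above can be sketched in a few lines of Python. This is only an illustration of the stated arithmetic, not the dashboard's actual implementation; the function names `quality_score` and `overall_score` are hypothetical, and both inputs are assumed to be on a 0-100 scale.

```python
def quality_score(offenses: int) -> float:
    """Quality falls linearly from 100 (0 offenses) to 0 (50 offenses), clamped to 0-100."""
    raw = 100 - (offenses / 50) * 100
    return max(0.0, min(100.0, raw))


def overall_score(test_success_rate: float, quality: float) -> float:
    """Weighted score: 90% test success rate plus 10% quality (both 0-100)."""
    return test_success_rate * 0.90 + quality * 0.10


if __name__ == "__main__":
    # Hypothetical run: 80% of tests pass and RuboCop reports 25 offenses.
    q = quality_score(25)
    print(q, round(overall_score(80.0, q), 1))  # prints: 50.0 77.0
```

Because quality carries only a 10% weight, even a perfect RuboCop result adds at most 10 points, so the ranking is driven mostly by how many tests pass.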

Top Performers

Champions
#1 Claude 4.6 Opus (Anthropic)
Score: 76.1
Success Rate: 77.2%
Tests Passed: 92 (121 total tests)
Quality: 67
Issues: 67

#2 GLM 5 (Z.AI)
Score: 75.6
Success Rate: 77.2%
Tests Passed: 92 (121 total tests)
Quality: 61
Issues: 88

#3 Claude 4 Sonnet (Anthropic)
Score: 72.6
Success Rate: 73.7%
Tests Passed: 88 (121 total tests)
Quality: 63
Issues: 75
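
As a quick sanity check, the weighted formula from How Scoring Works can be applied to the success rate and quality values shown for the three top performers. The sketch below is illustrative only; `recompute` is a hypothetical helper, and it uses the displayed Quality figure directly rather than deriving it from the Issues count.

```python
# Values as displayed for the top performers:
# (model, listed score, success rate %, quality)
top_performers = [
    ("Claude 4.6 Opus", 76.1, 77.2, 67),
    ("GLM 5", 75.6, 77.2, 61),
    ("Claude 4 Sonnet", 72.6, 73.7, 63),
]


def recompute(success_rate: float, quality: float) -> float:
    """Score = 90% of the test success rate plus 10% of the quality score."""
    return success_rate * 0.90 + quality * 0.10


for name, listed, success_rate, quality in top_performers:
    recomputed = recompute(success_rate, quality)
    print(f"{name}: listed {listed}, recomputed {recomputed:.2f}")
```

The recomputed values (76.18, 75.58, 72.63) agree with the listed scores to within 0.1. The small gaps are presumably due to rounding of the displayed inputs or to aggregation across benchmarks, which is an assumption rather than something the page states.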

Performance Overview

100 AI Models Tested
4 Benchmarks
59.1% Avg Success Rate
63 Avg Quality Score

Benchmark Challenges
