Ruby LLM Benchmarks: AI Model Performance Dashboard

Comprehensive performance analysis of LLM models across all program-fixing benchmarks. Each model is tested on Ruby code generation through real programming challenges, with solutions validated against test suites and RuboCop quality standards.

4 Benchmarks · 45 AI Models · 57.1% Avg Success
Overall Performance Rankings - All Benchmarks Combined
| Rank | Model | Provider | Tested | Score | Success Rate |
|------|-------|----------|--------|-------|--------------|
| 1 | Claude 4 Sonnet | Claude | 08/2025 | 72.6 | 73.7% |
| 2 | Claude 4 Opus | Claude | 08/2025 | 70.8 | 72.3% |
| 3 | OpenAI GPT-4.1 | OpenAI | 08/2025 | 70.8 | 73.1% |
| 4 | Horizon Beta | Other | 08/2025 | 69.8 | 70.7% |
| 5 | OpenAI GPT-4o | OpenAI | 08/2025 | 68.7 | 68.6% |
| 6 | Claude 3.7 Sonnet (Thinking) | Claude | 08/2025 | 68.6 | 68.9% |
| 7 | OpenAI o3-mini | OpenAI | 08/2025 | 68.5 | 69.0% |
| 8 | OpenAI o1-mini | OpenAI | 08/2025 | 68.3 | 68.8% |
| 9 | Claude 3.5 Sonnet | Claude | 08/2025 | 68.0 | 67.0% |
| 10 | R1 | DeepSeek | 08/2025 | 67.9 | 67.5% |
| 11 | OpenAI o4-mini | OpenAI | 08/2025 | 67.7 | 67.5% |
| 12 | Grok 3 | xAI | 08/2025 | 67.3 | 67.3% |
| 13 | Codestral 25.08 | Mistral | 08/2025 | 67.1 | 66.4% |
| 14 | OpenAI o3-mini (High) | OpenAI | 08/2025 | 66.1 | 67.1% |
| 15 | OpenAI o4-mini (High) | OpenAI | 08/2025 | 65.2 | 65.2% |
| 16 | OpenAI GPT-4 Turbo | OpenAI | 08/2025 | 63.9 | 63.4% |
| 17 | OpenAI GPT-4 | OpenAI | 08/2025 | 63.8 | 63.0% |
| 18 | Llama 4 Scout | Meta | 08/2025 | 63.3 | 62.1% |
| 19 | OpenAI GPT-4.1 mini | OpenAI | 08/2025 | 63.0 | 63.2% |
| 20 | Llama 4 Maverick | Meta | 08/2025 | 63.0 | 62.1% |
| 21 | Grok 4 | xAI | 08/2025 | 62.2 | 61.5% |
| 22 | Gemini 2.5 Flash | Google | 08/2025 | 62.2 | 62.2% |
| 23 | OpenAI GPT-4.1 nano | OpenAI | 08/2025 | 61.6 | 62.2% |
| 24 | Claude 3.7 Sonnet | Claude | 08/2025 | 60.3 | 60.1% |
| 25 | Gemini 2.5 Pro | Google | 08/2025 | 60.0 | 58.7% |
| 26 | DeepSeek V3 | DeepSeek | 08/2025 | 59.6 | 57.6% |
| 27 | Gemini 2.5 Flash Lite | Google | 08/2025 | 58.5 | 58.4% |
| 28 | Gemini 2.0 Flash-001 | Google | 08/2025 | 57.3 | 57.6% |
| 29 | Claude 3.5 Haiku | Claude | 08/2025 | 57.3 | 55.9% |
| 30 | OpenAI GPT-4o | OpenAI | 08/2025 | 55.8 | 54.4% |
| 31 | Mistral Medium 3 | Mistral | 08/2025 | 55.5 | 53.2% |
| 32 | Grok 3 Mini | xAI | 08/2025 | 55.0 | 54.6% |
| 33 | Claude 3 Haiku | Claude | 08/2025 | 53.9 | 50.7% |
| 34 | Kimi K2 | Moonshot | 08/2025 | 52.1 | 50.2% |
| 35 | Nova Pro V1 | Amazon | 08/2025 | 51.9 | 49.4% |
| 36 | OpenAI GPT-4o mini | OpenAI | 08/2025 | 51.7 | 49.9% |
| 37 | Coder Large | Other | 08/2025 | 50.6 | 49.1% |
| 38 | Qwen 3 Coder | Alibaba | 08/2025 | 49.7 | 48.5% |
| 39 | Nova Lite V1 | Amazon | 08/2025 | 47.5 | 43.1% |
| 40 | OpenAI GPT-3.5 Turbo | OpenAI | 08/2025 | 45.3 | 41.1% |
| 41 | Qwen3 14b | Alibaba | 08/2025 | 44.8 | 41.9% |
| 42 | Magnum V4 72B | NousResearch | 08/2025 | 43.6 | 38.7% |
| 43 | Nova Micro V1 | Amazon | 08/2025 | 38.3 | 34.3% |
| 44 | Gemma 3 4B IT | Google | 08/2025 | 29.7 | 25.9% |
| 45 | Command A | Cohere | 08/2025 | 10.6 | 2.3% |
How Scoring Works

Test Success Rate (90% of the score)
Percentage of test cases that pass. This measures whether the AI-generated code actually works correctly.

Code Quality (10% of the score)
Based on RuboCop static analysis. The quality score decreases linearly from 100 to 0 as offenses increase from 0 to 50.

Calculation Formula

Score = (Test Success Rate × 90%) + (Quality Score × 10%)
Quality = 100 − ((RuboCop Offenses ÷ 50) × 100), capped at 0-100
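A minimal sketch of this calculation in Ruby, following the formula above (the method names and the sample run below are illustrative, not part of the benchmark harness):

```ruby
# Quality falls linearly from 100 to 0 as RuboCop offenses go from 0 to 50.
def quality_score(offenses)
  (100.0 - (offenses / 50.0) * 100.0).clamp(0.0, 100.0)
end

# Combined score: 90% weight on test success, 10% on code quality.
def combined_score(tests_passed:, tests_total:, offenses:)
  success_rate = tests_passed.to_f / tests_total * 100.0
  success_rate * 0.90 + quality_score(offenses) * 0.10
end

# Illustrative run (not taken from the dashboard): 88 of 121 tests
# passing with 20 RuboCop offenses.
combined_score(tests_passed: 88, tests_total: 121, offenses: 20).round(1)
# => 71.5
```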

Top Performers

#1 Claude 4 Sonnet (Claude): Score 72.6 · Success Rate 73.7% · 88 of 121 tests passed · Quality 63 · 75 RuboCop issues
#2 Claude 4 Opus (Claude): Score 70.8 · Success Rate 72.3% · 86 of 121 tests passed · Quality 58 · 84 RuboCop issues
#3 OpenAI GPT-4.1 (OpenAI): Score 70.8 · Success Rate 73.1% · 85 of 121 tests passed · Quality 50 · 101 RuboCop issues

Performance Overview

45 AI Models Tested · 4 Benchmarks · 57.1% Avg Success Rate · 68 Avg Quality Score

Benchmark Challenges

Dive Deeper into the Analysis (coming soon)

Explore detailed benchmark results, model comparisons, and performance insights across all coding challenges.