AI Agent Code Quality Benchmark
How well do AI coding agents write code? Measured by roam-code.
7 tasks · 4 combos · 26 evaluations · Source & methodology
Results: Algorithm Tasks
Agent Quality Score (AQS) per task. Scale: 0–100. Grade: A (90+), B (80+), C (70+), D (60+), F (<60).
| Task | Claude Code 2.1.45 / Sonnet 4.6 | Claude Code 2.1.42 / Sonnet 4.5 | Claude Code 2.1.42 / Opus 4.6 | Codex 0.101.0 / GPT-5.3 |
| Python ETL Pipeline | 60 (D) | -- | 98 (A) | 57 (F) |
| TypeScript Pathfinder | 52 (F) | -- | 50 (F) | 50 (F) |
| Average | 56 (F) | -- | 74 (C) | 54 (F) |
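The grade bands stated above can be sketched as a small lookup. This is only an illustration of the published thresholds; the function name `grade` is hypothetical and is not taken from roam-code.

```python
def grade(score: float) -> str:
    """Map a 0-100 AQS to a letter using the bands A (90+), B (80+), C (70+), D (60+), F (<60)."""
    # Thresholds are checked from highest to lowest; the first match wins.
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= threshold:
            return letter
    return "F"


# Spot-check against cells from the algorithm results table.
for score in (98, 60, 57, 74):
    print(score, grade(score))
```

A score of exactly 60 lands in D and 57 in F, which matches the Python ETL Pipeline row.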
Category Breakdown
AQS breaks into 6 categories: Health (35), Quality (20), Architecture (15), Algorithms (10), Testing (15), Completeness (5).
Python ETL Pipeline
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 60 | 23 | 11 | 6 | 0 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 98 | 35 | 20 | 15 | 10 | 15 | 3 |
| Codex 0.101.0 / GPT-5.3 | 57 | 25 | 13 | 7 | 0 | 9 | 3 |
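In the Python ETL Pipeline rows above, the six category scores add up to the reported AQS. A minimal sketch of that arithmetic follows, assuming the AQS is a plain sum of the categories; the names `CATEGORY_MAX` and `aqs_total` are hypothetical, and roam-code's actual computation is not described here.

```python
# Category maxima as listed in the breakdown description (they total 100).
CATEGORY_MAX = {
    "health": 35,
    "quality": 20,
    "architecture": 15,
    "algorithms": 10,
    "testing": 15,
    "completeness": 5,
}


def aqs_total(breakdown: dict) -> int:
    """Sum category points after checking each one against its maximum."""
    if set(breakdown) != set(CATEGORY_MAX):
        raise ValueError("breakdown must contain exactly the six categories")
    for name, points in breakdown.items():
        if not 0 <= points <= CATEGORY_MAX[name]:
            raise ValueError(f"{name} out of range: {points}")
    return sum(breakdown.values())


# Python ETL Pipeline, Claude Code 2.1.45 / Sonnet 4.6 row.
sonnet_46 = {"health": 23, "quality": 11, "architecture": 6,
             "algorithms": 0, "testing": 15, "completeness": 5}
print(aqs_total(sonnet_46))
```

The same check against the Opus 4.6 and Codex rows gives 98 and 57, matching the AQS column.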
TypeScript Pathfinder
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 52 | 22 | 6 | 6 | 0 | 15 | 3 |
| Claude Code 2.1.42 / Opus 4.6 | 50 | 19 | 11 | 0 | 0 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 50 | 26 | 7 | 12 | 0 | 0 | 5 |
Results: Standard Tasks
Agent Quality Score (AQS) per task. Scale: 0–100. Grade: A (90+), B (80+), C (70+), D (60+), F (<60).
| Task | Claude Code 2.1.45 / Sonnet 4.6 | Claude Code 2.1.42 / Sonnet 4.5 | Claude Code 2.1.42 / Opus 4.6 | Codex 0.101.0 / GPT-5.3 |
| Astro Landing | 90 (A) | 100 (A) | 98 (A) | 93 (A) |
| C++ Calculator | 50 (F) | 84 (B) | 85 (B) | 87 (B) |
| Go Log Analyzer | 52 (F) | 80 (B) | 67 (D) | 71 (C) |
| Python Crawler | 74 (C) | 73 (C) | 68 (D) | 64 (D) |
| React TODO | 72 (C) | 88 (B) | 85 (B) | 72 (C) |
| Average | 68 (D) | 85 (B) | 81 (B) | 77 (C) |
Category Breakdown
AQS breaks into 6 categories: Health (35), Quality (20), Architecture (15), Algorithms (10), Testing (15), Completeness (5).
Astro Landing
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 90 | 35 | 20 | 15 | 10 | 7 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 100 | 35 | 20 | 15 | 10 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 98 | 35 | 20 | 15 | 10 | 13 | 5 |
| Codex 0.101.0 / GPT-5.3 | 93 | 34 | 16 | 15 | 10 | 13 | 5 |
C++ Calculator
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 50 | 20 | 13 | 7 | 7 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 84 | 32 | 18 | 12 | 10 | 7 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 85 | 34 | 13 | 15 | 9 | 9 | 5 |
| Codex 0.101.0 / GPT-5.3 | 87 | 34 | 20 | 15 | 6 | 7 | 5 |
Go Log Analyzer
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 52 | 27 | 6 | 12 | 4 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 80 | 32 | 6 | 15 | 7 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 67 | 26 | 6 | 12 | 3 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 71 | 27 | 7 | 10 | 9 | 13 | 5 |
Python Crawler
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 74 | 31 | 12 | 10 | 1 | 15 | 5 |
| Claude Code 2.1.42 / Sonnet 4.5 | 73 | 30 | 10 | 12 | 1 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 68 | 27 | 12 | 9 | 0 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 64 | 27 | 10 | 7 | 0 | 15 | 5 |
React TODO
| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
| Claude Code 2.1.45 / Sonnet 4.6 | 72 | 34 | 10 | 15 | 10 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 88 | 30 | 16 | 12 | 10 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 85 | 30 | 13 | 12 | 10 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 72 | 18 | 20 | 10 | 10 | 9 | 5 |
Combo Averages (All Tasks)
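Average scores across all tasks each combo was evaluated on (vanilla mode). Claude Code 2.1.42 / Sonnet 4.5 has no algorithm-task results (shown as -- in the algorithm results), so its averages cover the five standard tasks only.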
| Combo | Avg AQS | Avg Health | Avg Quality | Avg Arch | Avg Algo | Avg Testing | Avg Complete |
| Claude Code 2.1.45 / Sonnet 4.6 | 64 (D) | 27.4/35 | 11.1/20 | 10.1/15 | 4.6/10 | 7.4/15 | 3.6/5 |
| Claude Code 2.1.42 / Sonnet 4.5 | 85 (B) | 31.8/35 | 14.0/20 | 13.2/15 | 7.6/10 | 13.4/15 | 5.0/5 |
| Claude Code 2.1.42 / Opus 4.6 | 79 (C) | 29.4/35 | 13.6/20 | 11.1/15 | 6.0/10 | 13.9/15 | 4.7/5 |
| Codex 0.101.0 / GPT-5.3 | 71 (C) | 27.3/35 | 13.3/20 | 10.9/15 | 5.0/10 | 9.4/15 | 4.7/5 |
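The Avg AQS column above is consistent with a per-combo mean over the tasks that have a score, skipping the cells marked "--". The sketch below reproduces that arithmetic from the per-task AQS values in the results tables. The helper `mean_of_present` is hypothetical, and averaging the already-rounded per-task scores is an assumption, since the benchmark may average unrounded values.

```python
# Per-task AQS copied from the results tables; None marks a "--" cell.
# Task order: Astro Landing, C++ Calculator, Go Log Analyzer, Python Crawler,
# React TODO, Python ETL Pipeline, TypeScript Pathfinder.
scores = {
    "Claude Code 2.1.45 / Sonnet 4.6": [90, 50, 52, 74, 72, 60, 52],
    "Claude Code 2.1.42 / Sonnet 4.5": [100, 84, 80, 73, 88, None, None],
    "Claude Code 2.1.42 / Opus 4.6": [98, 85, 67, 68, 85, 98, 50],
    "Codex 0.101.0 / GPT-5.3": [93, 87, 71, 64, 72, 57, 50],
}


def mean_of_present(values):
    """Average only the tasks that were actually evaluated."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no evaluated tasks for this combo")
    return sum(present) / len(present)


for combo, values in scores.items():
    print(f"{combo}: {mean_of_present(values):.1f}")
```

Rounded to whole numbers, these means agree with the 64, 85, 79, and 71 reported in the table. Because Sonnet 4.5 is averaged over five tasks rather than seven, its figure is not directly comparable with the other three.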
Combo Signatures
| Combo | CLI Tool | CLI Version | Model |
| Claude Code 2.1.45 / Sonnet 4.6 | claude | 2.1.45 | claude-sonnet-4-6 |
| Claude Code 2.1.42 / Sonnet 4.5 | claude | 2.1.42 | claude-sonnet-4-5-20250929 |
| Claude Code 2.1.42 / Opus 4.6 | claude | 2.1.45 | claude-opus-4-6 |
| Codex 0.101.0 / GPT-5.3 | codex | 0.101.0 | gpt-5.3-codex |
Evaluator: roam-code 9.0.0
Raw Metrics by Task
Detailed roam-code metrics for each task and combo.
Astro Landing
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 99 | 0 | 0.7 | 1.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 97 | 0 | 5.3 | 25.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
C++ Calculator
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 58 | 0 | 6.2 | 25.0 | 1 | 11.10 | 1 | 1 | 1 | 1 | 1 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 91 | 0 | 2.7 | 9.0 | 1 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 96 | 0 | 6.1 | 26.0 | 1 | 0.00 | 20 | 1 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 97 | 0 | 3.7 | 11.0 | 0 | 0.00 | 20 | 4 | 0 | 0 | 0 |
Go Log Analyzer
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 76 | 2 | 6.7 | 22.0 | 2 | 0.00 | 16 | 3 | 0 | 1 | 2 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 92 | 37 | 5.5 | 15.0 | 6 | 0.00 | 20 | 1 | 1 | 0 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 74 | 44 | 5.4 | 16.0 | 8 | 0.00 | 20 | 4 | 1 | 1 | 2 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 78 | 14 | 4.7 | 11.0 | 4 | 1.10 | 20 | 1 | 0 | 0 | 3 |
Python Crawler
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 90 | 13 | 1.8 | 4.0 | 0 | 2.10 | 20 | 4 | 2 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 86 | 10 | 3.2 | 4.0 | 1 | 0.00 | 20 | 8 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 77 | 3 | 2.0 | 4.0 | 1 | 0.00 | 20 | 9 | 3 | 2 | 2 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 76 | 5 | 4.0 | 6.0 | 1 | 1.50 | 20 | 12 | 1 | 1 | 2 |
Python ETL Pipeline
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 65 | 2 | 3.0 | 3.0 | 3 | 0.00 | 20 | 30 | 0 | 3 | 4 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 100 | 0 | -- | -- | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 71 | 1 | 8.2 | 13.0 | 1 | 6.10 | 19 | 13 | 0 | 1 | 1 |
React TODO
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 98 | 0 | 26.0 | 26.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 87 | 1 | 4.8 | 13.0 | 1 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 86 | 0 | 5.1 | 18.0 | 2 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 52 | 0 | 4.1 | 9.0 | 0 | 16.10 | 20 | 0 | 0 | 0 | 4 |
TypeScript Pathfinder
| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 62 | 3 | 12.5 | 9.0 | 1 | 0.00 | 20 | 24 | 5 | 3 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 55 | 2 | 4.9 | 14.0 | 3 | 1.60 | 20 | 29 | 0 | 4 | 7 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 73 | 35 | 3.2 | 4.0 | 11 | 0.00 | 0 | 30 | 2 | 1 | 2 |