AI Agent Code Quality Benchmark
How well do AI coding agents write code? Measured by roam-code.
5 tasks · 3 agents · 15 evaluations · Source & methodology
Results at a Glance
Agent Quality Score (AQS) per task. Scale: 0–100. Grade: A (90+), B (80+), C (70+), D (60+), F (<60).
| Task | Claude Opus 4.6 | Claude Sonnet 4.5 | Codex (GPT-5.3) |
| React TODO | 90 (A) | 94 (A) | 70 (C) |
| Astro Landing | 98 (A) | 100 (A) | 91 (A) |
| Python Crawler | 75 (C) | 83 (B) | 70 (C) |
| C++ Calculator | 92 (A) | 89 (B) | 91 (A) |
| Go Log Analyzer | 74 (C) | 79 (C) | 67 (D) |
| Average | 86 (B) | 89 (B) | 78 (C) |
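The grade cutoffs stated above can be expressed as a small lookup. This is a minimal sketch, not roam-code's implementation: the function name `grade` is hypothetical, and it assumes each cutoff is inclusive (a score of exactly 90 is an A), which matches the "90 (A)" and "70 (C)" entries in the table.

```python
def grade(aqs: float) -> str:
    """Map an AQS value (0-100) to a letter grade.

    Assumption: cutoffs are inclusive lower bounds, as implied by
    "A (90+), B (80+), C (70+), D (60+), F (<60)".
    """
    if not 0 <= aqs <= 100:
        raise ValueError(f"AQS must be within 0-100, got {aqs}")
    # Check the highest cutoff first so each score gets its best grade.
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if aqs >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Scores taken from the results table above.
    for score in (100, 90, 89, 70, 67):
        print(score, grade(score))
```

Applied to the table, this reproduces the published letters, for example 89 maps to B and 67 maps to D.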
Category Breakdown by Task
AQS breaks into 5 categories: Health (40), Quality (25), Architecture (15), Testing (15), Completeness (5).
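As a rough illustration of how the category points relate to the total, the sketch below assumes AQS is the plain sum of the five category scores, each capped at its stated maximum. That assumption agrees with the per-task rows in this report, but the names `CATEGORY_MAX` and `total_aqs` are hypothetical and not part of roam-code.

```python
# Category maximums as stated in the report; they sum to 100.
CATEGORY_MAX = {
    "health": 40,
    "quality": 25,
    "architecture": 15,
    "testing": 15,
    "completeness": 5,
}


def total_aqs(points: dict[str, int]) -> int:
    """Sum category points into an AQS, validating each against its cap.

    Assumption: AQS is an unweighted sum of the category scores.
    """
    if set(points) != set(CATEGORY_MAX):
        raise ValueError("expected exactly the five AQS categories")
    for name, value in points.items():
        if not 0 <= value <= CATEGORY_MAX[name]:
            raise ValueError(f"{name} must be within 0-{CATEGORY_MAX[name]}")
    return sum(points.values())


if __name__ == "__main__":
    # Claude Opus 4.6 on the React TODO task, from the breakdown in this report.
    opus_react = {
        "health": 37,
        "quality": 18,
        "architecture": 15,
        "testing": 15,
        "completeness": 5,
    }
    print(total_aqs(opus_react))
```

With the React TODO figures for Claude Opus 4.6 (37, 18, 15, 15, 5), the sum is 90, which matches the reported AQS.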
React TODO
| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
| Claude Opus 4.6 | 90 | 37 | 18 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 94 | 38 | 21 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 70 | 21 | 25 | 10 | 9 | 5 |
Astro Landing
| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
| Claude Opus 4.6 | 98 | 40 | 25 | 15 | 13 | 5 |
| Claude Sonnet 4.5 | 100 | 40 | 25 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 91 | 38 | 20 | 15 | 13 | 5 |
Python Crawler
| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
| Claude Opus 4.6 | 75 | 31 | 15 | 9 | 15 | 5 |
| Claude Sonnet 4.5 | 83 | 37 | 11 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 70 | 30 | 13 | 7 | 15 | 5 |
C++ Calculator
| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
| Claude Opus 4.6 | 92 | 40 | 17 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 89 | 39 | 23 | 15 | 7 | 5 |
| Codex (GPT-5.3) | 91 | 39 | 25 | 15 | 7 | 5 |
Go Log Analyzer
| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
| Claude Opus 4.6 | 74 | 32 | 7 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 79 | 36 | 8 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 67 | 31 | 8 | 10 | 13 | 5 |
Agent Averages
Average scores across all 5 tasks (vanilla mode).
| Agent | Avg AQS | Avg Health | Avg Quality | Avg Arch | Avg Testing | Avg Complete |
| Claude Opus 4.6 | 86 (B) | 36.0/40 | 16.4/25 | 13.8/15 | 14.6/15 | 5.0/5 |
| Claude Sonnet 4.5 | 89 (B) | 38.0/40 | 17.6/25 | 15.0/15 | 13.4/15 | 5.0/5 |
| Codex (GPT-5.3) | 78 (C) | 31.8/40 | 18.2/25 | 11.4/15 | 11.4/15 | 5.0/5 |
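The averages above can be reproduced from the per-task AQS values. The sketch below assumes a plain arithmetic mean over the five tasks, rounded to the nearest integer for display. The rounding rule is an assumption, and the names `TASK_AQS` and `average_aqs` are hypothetical.

```python
# Per-task AQS in report order: React TODO, Astro Landing, Python Crawler,
# C++ Calculator, Go Log Analyzer.
TASK_AQS = {
    "Claude Opus 4.6": [90, 98, 75, 92, 74],
    "Claude Sonnet 4.5": [94, 100, 83, 89, 79],
    "Codex (GPT-5.3)": [70, 91, 70, 91, 67],
}


def average_aqs(scores: dict[str, list[int]]) -> dict[str, float]:
    """Return the unweighted mean AQS per agent."""
    return {agent: sum(values) / len(values) for agent, values in scores.items()}


if __name__ == "__main__":
    for agent, avg in average_aqs(TASK_AQS).items():
        # Assumption: the report rounds the mean to the nearest integer.
        print(f"{agent}: mean {avg:.1f}, reported as {round(avg)}")
```

Under these assumptions the means are 85.8, 89.0, and 77.8, which round to the reported 86, 89, and 78.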
Agent Signatures
| Agent | CLI Version | Model |
| Claude Opus 4.6 | claude 2.1.42 | claude-opus-4-6 |
| Claude Sonnet 4.5 | claude 2.1.42 | claude-sonnet-4-5-20250929 |
| Codex (GPT-5.3) | codex 0.101.0 | gpt-5.3-codex |
Evaluator: roam-code 8.0.1
Raw Metrics by Task
Detailed roam-code metrics for each task and agent.
React TODO
| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
| Claude Opus 4.6 | vanilla | 92 | 0 | 5.1 | 18.0 | 2 | 0.00 | 20 | 0 | 1 |
| Claude Sonnet 4.5 | vanilla | 94 | 1 | 4.8 | 13.0 | 1 | 0.00 | 20 | 0 | 1 |
| Codex (GPT-5.3) | vanilla | 52 | 0 | 4.1 | 9.0 | 0 | 16.10 | 20 | 0 | 4 |
Astro Landing
| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
| Claude Opus 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 |
| Claude Sonnet 4.5 | vanilla | 99 | 0 | 0.7 | 1.0 | 0 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 96 | 0 | 5.3 | 25.0 | 0 | 0.00 | 20 | 0 | 0 |
Python Crawler
| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
| Claude Opus 4.6 | vanilla | 77 | 3 | 2.1 | 4.0 | 2 | 0.00 | 20 | 2 | 2 |
| Claude Sonnet 4.5 | vanilla | 92 | 10 | 3.3 | 4.0 | 2 | 0.00 | 20 | 0 | 1 |
| Codex (GPT-5.3) | vanilla | 76 | 5 | 4.2 | 6.0 | 1 | 1.50 | 20 | 1 | 2 |
C++ Calculator
| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
| Claude Opus 4.6 | vanilla | 99 | 0 | 6.0 | 26.0 | 1 | 0.00 | 0 | 0 | 0 |
| Claude Sonnet 4.5 | vanilla | 98 | 0 | 2.7 | 9.0 | 1 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 97 | 0 | 3.7 | 11.0 | 0 | 0.00 | 20 | 0 | 0 |
Go Log Analyzer
| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
| Claude Opus 4.6 | vanilla | 80 | 44 | 5.4 | 16.0 | 8 | 0.00 | 19 | 0 | 3 |
| Claude Sonnet 4.5 | vanilla | 91 | 37 | 5.5 | 15.0 | 6 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 78 | 14 | 4.7 | 11.0 | 4 | 1.10 | 17 | 0 | 1 |