AI Agent Code Quality Benchmark

How well do AI coding agents write code? Measured by roam-code.

5 tasks · 3 agents · 15 evaluations

Results at a Glance

Agent Quality Score (AQS) per task. Scale: 0–100. Grade: A (90+), B (80+), C (70+), D (60+), F (<60).

| Task | Claude Opus 4.6 | Claude Sonnet 4.5 | Codex (GPT-5.3) |
|---|---|---|---|
| React TODO | 90 (A) | 94 (A) | 70 (C) |
| Astro Landing | 98 (A) | 100 (A) | 91 (A) |
| Python Crawler | 75 (C) | 83 (B) | 70 (C) |
| C++ Calculator | 92 (A) | 89 (B) | 91 (A) |
| Go Log Analyzer | 74 (C) | 79 (C) | 67 (D) |
| Average | 86 (B) | 89 (B) | 78 (C) |
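
The grade bands stated above can be sketched as a small lookup. This is a minimal illustration of the published thresholds only; the function name `grade` is hypothetical and is not taken from roam-code.

```python
def grade(aqs: float) -> str:
    """Map an AQS value (0-100) to the letter grade bands used in this report.

    Hypothetical helper: the thresholds come from the report text,
    the function itself is an illustration, not roam-code's API.
    """
    if not 0 <= aqs <= 100:
        raise ValueError(f"AQS must be between 0 and 100, got {aqs}")
    # Bands are checked from highest to lowest; the first match wins.
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if aqs >= threshold:
            return letter
    return "F"


if __name__ == "__main__":
    # Go Log Analyzer scores from the table above.
    for agent, score in (("Claude Opus 4.6", 74),
                         ("Claude Sonnet 4.5", 79),
                         ("Codex (GPT-5.3)", 67)):
        print(f"{agent}: {score} ({grade(score)})")
```

Applied to the Go Log Analyzer row, this reproduces the C, C, and D grades shown in the table.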

Category Breakdown by Task

AQS breaks into 5 categories: Health (40), Quality (25), Architecture (15), Testing (15), Completeness (5).

React TODO

| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 90 | 37 | 18 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 94 | 38 | 21 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 70 | 21 | 25 | 10 | 9 | 5 |

Astro Landing

| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 98 | 40 | 25 | 15 | 13 | 5 |
| Claude Sonnet 4.5 | 100 | 40 | 25 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 91 | 38 | 20 | 15 | 13 | 5 |

Python Crawler

| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 75 | 31 | 15 | 9 | 15 | 5 |
| Claude Sonnet 4.5 | 83 | 37 | 11 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 70 | 30 | 13 | 7 | 15 | 5 |

C++ Calculator

| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 92 | 40 | 17 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 89 | 39 | 23 | 15 | 7 | 5 |
| Codex (GPT-5.3) | 91 | 39 | 25 | 15 | 7 | 5 |

Go Log Analyzer

| Agent | AQS | Health /40 | Quality /25 | Arch /15 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 74 | 32 | 7 | 15 | 15 | 5 |
| Claude Sonnet 4.5 | 79 | 36 | 8 | 15 | 15 | 5 |
| Codex (GPT-5.3) | 67 | 31 | 8 | 10 | 13 | 5 |
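
In every table above, the AQS equals the sum of the five category points. The sketch below checks that relationship for a few rows. The category maxima come from this report; the names `MAX_POINTS` and `total_aqs` are hypothetical and are not part of roam-code.

```python
# Maximum points per category, as stated in this report.
MAX_POINTS = {
    "health": 40,
    "quality": 25,
    "architecture": 15,
    "testing": 15,
    "completeness": 5,
}


def total_aqs(points: dict[str, int]) -> int:
    """Sum category points into an AQS, validating each against its maximum.

    Hypothetical helper for illustration; not roam-code's implementation.
    """
    if set(points) != set(MAX_POINTS):
        raise ValueError("points must contain exactly the five AQS categories")
    for name, value in points.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name}={value} is outside 0..{MAX_POINTS[name]}")
    return sum(points.values())


if __name__ == "__main__":
    # React TODO, Claude Opus 4.6 (row taken from the table above).
    react_opus = {"health": 37, "quality": 18, "architecture": 15,
                  "testing": 15, "completeness": 5}
    # Go Log Analyzer, Codex (GPT-5.3).
    go_codex = {"health": 31, "quality": 8, "architecture": 10,
                "testing": 13, "completeness": 5}
    print(total_aqs(react_opus))
    print(total_aqs(go_codex))
```

The maxima themselves add up to 100, which matches the 0–100 scale of the AQS.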

Agent Averages

Average scores across all 5 tasks (vanilla mode).

| Agent | Avg AQS | Avg Health | Avg Quality | Avg Arch | Avg Testing | Avg Complete |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | 86 (B) | 36.0/40 | 16.4/25 | 13.8/15 | 14.6/15 | 5.0/5 |
| Claude Sonnet 4.5 | 89 (B) | 38.0/40 | 17.6/25 | 15.0/15 | 13.4/15 | 5.0/5 |
| Codex (GPT-5.3) | 78 (C) | 31.8/40 | 18.2/25 | 11.4/15 | 11.4/15 | 5.0/5 |
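
These averages can be reproduced from the per-task tables as plain arithmetic means. The sketch below does so for Claude Opus 4.6, assuming category averages are rounded to one decimal and the AQS average to the nearest integer, which is consistent with the figures shown. The names `OPUS_ROWS` and `column_means` are hypothetical.

```python
# Per-task rows for Claude Opus 4.6, copied from the category tables:
# (AQS, Health, Quality, Arch, Testing, Complete)
OPUS_ROWS = {
    "React TODO":      (90, 37, 18, 15, 15, 5),
    "Astro Landing":   (98, 40, 25, 15, 13, 5),
    "Python Crawler":  (75, 31, 15, 9, 15, 5),
    "C++ Calculator":  (92, 40, 17, 15, 15, 5),
    "Go Log Analyzer": (74, 32, 7, 15, 15, 5),
}


def column_means(rows: dict[str, tuple[int, ...]]) -> list[float]:
    """Return the arithmetic mean of each column across all tasks."""
    columns = zip(*rows.values())  # transpose task rows into columns
    return [sum(col) / len(rows) for col in columns]


if __name__ == "__main__":
    aqs, health, quality, arch, testing, complete = column_means(OPUS_ROWS)
    # Assumed rounding: integer for AQS, one decimal for categories.
    print(f"Avg AQS: {round(aqs)}")
    print(f"Avg Health: {health:.1f}/40")
    print(f"Avg Quality: {quality:.1f}/25")
    print(f"Avg Arch: {arch:.1f}/15")
    print(f"Avg Testing: {testing:.1f}/15")
    print(f"Avg Complete: {complete:.1f}/5")
```

The unrounded AQS mean for Claude Opus 4.6 is 85.8, so the 86 in the table is a rounded value.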

Agent Signatures

| Agent | CLI Version | Model |
|---|---|---|
| Claude Opus 4.6 | claude 2.1.42 | claude-opus-4-6 |
| Claude Sonnet 4.5 | claude 2.1.42 | claude-sonnet-4-5-20250929 |
| Codex (GPT-5.3) | codex 0.101.0 | gpt-5.3-codex |

Evaluator: roam-code 8.0.1

Raw Metrics by Task

Detailed roam-code metrics for each task and agent.

React TODO

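| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | vanilla | 92 | 0 | 5.1 | 18.0 | 2 | 0.00 | 20 | 0 | 1 |
| Claude Sonnet 4.5 | vanilla | 94 | 1 | 4.8 | 13.0 | 1 | 0.00 | 20 | 0 | 1 |
| Codex (GPT-5.3) | vanilla | 52 | 0 | 4.1 | 9.0 | 0 | 16.10 | 20 | 0 | 4 |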

Astro Landing

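| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 |
| Claude Sonnet 4.5 | vanilla | 99 | 0 | 0.7 | 1.0 | 0 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 96 | 0 | 5.3 | 25.0 | 0 | 0.00 | 20 | 0 | 0 |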

Python Crawler

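| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | vanilla | 77 | 3 | 2.1 | 4.0 | 2 | 0.00 | 20 | 2 | 2 |
| Claude Sonnet 4.5 | vanilla | 92 | 10 | 3.3 | 4.0 | 2 | 0.00 | 20 | 0 | 1 |
| Codex (GPT-5.3) | vanilla | 76 | 5 | 4.2 | 6.0 | 1 | 1.50 | 20 | 1 | 2 |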

C++ Calculator

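| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | vanilla | 99 | 0 | 6.0 | 26.0 | 1 | 0.00 | 0 | 0 | 0 |
| Claude Sonnet 4.5 | vanilla | 98 | 0 | 2.7 | 9.0 | 1 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 97 | 0 | 3.7 | 11.0 | 0 | 0.00 | 20 | 0 | 0 |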

Go Log Analyzer

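| Agent | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | vanilla | 80 | 44 | 5.4 | 16.0 | 8 | 0.00 | 19 | 0 | 3 |
| Claude Sonnet 4.5 | vanilla | 91 | 37 | 5.5 | 15.0 | 6 | 0.00 | 20 | 0 | 0 |
| Codex (GPT-5.3) | vanilla | 78 | 14 | 4.7 | 11.0 | 4 | 1.10 | 17 | 0 | 1 |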
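
The raw Health values (0–100) in these tables line up with the Health /40 category points reported earlier if the raw score is scaled by 40/100 and rounded to the nearest integer. That relationship is an observation from the numbers in this report, not a documented roam-code formula. The function name `health_points` is hypothetical.

```python
def health_points(raw_health: int, max_points: int = 40) -> int:
    """Scale a raw 0-100 health score to category points.

    Assumption: linear scaling with rounding to the nearest integer.
    This matches the figures in this report but is not confirmed
    as the evaluator's actual method.
    """
    if not 0 <= raw_health <= 100:
        raise ValueError(f"raw health must be 0..100, got {raw_health}")
    return round(raw_health * max_points / 100)


if __name__ == "__main__":
    # (raw Health, reported Health /40) pairs for the React TODO task.
    react_todo = {
        "Claude Opus 4.6": (92, 37),
        "Claude Sonnet 4.5": (94, 38),
        "Codex (GPT-5.3)": (52, 21),
    }
    for agent, (raw, reported) in react_todo.items():
        print(f"{agent}: raw {raw} -> {health_points(raw)} (reported {reported})")
```

The same scaling reproduces the Health /40 column for all 15 task and agent combinations in this report.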