AI Agent Code Quality Benchmark

How well do AI coding agents write code? Measured by roam-code.

7 tasks · 4 combos · 26 evaluations · Source & methodology

Results: Algorithm Tasks

Agent Quality Score (AQS) per task, on a 0–100 scale. Grades: A (90+), B (80+), C (70+), D (60+), F (<60). A "--" entry means no score is reported for that combo on that task.

| Task | Claude Code 2.1.45 / Sonnet 4.6 | Claude Code 2.1.42 / Sonnet 4.5 | Claude Code 2.1.42 / Opus 4.6 | Codex 0.101.0 / GPT-5.3 |
|---|---|---|---|---|
| Python ETL Pipeline | 60 (D) | -- | 98 (A) | 57 (F) |
| TypeScript Pathfinder | 52 (F) | -- | 50 (F) | 50 (F) |
| Average | 56 (F) | -- | 74 (C) | 54 (F) |
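
The grade scale stated above can be sketched as a small lookup. This is only an illustration of the published cutoffs, not roam-code's own implementation; the names `GRADE_CUTOFFS`, `grade`, and `label` are hypothetical.

```python
# Grade cutoffs as stated in the legend: A (90+), B (80+), C (70+), D (60+), F (<60).
# GRADE_CUTOFFS, grade, and label are illustrative names, not part of roam-code.
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def grade(aqs: float) -> str:
    """Return the letter grade for an AQS value on the 0-100 scale."""
    if not 0 <= aqs <= 100:
        raise ValueError(f"AQS must be within 0-100, got {aqs}")
    # Cutoffs are ordered from highest to lowest, so the first match wins.
    for cutoff, letter in GRADE_CUTOFFS:
        if aqs >= cutoff:
            return letter
    return "F"


def label(aqs: int) -> str:
    """Format a score the way the results tables show it, e.g. '60 (D)'."""
    return f"{aqs} ({grade(aqs)})"


if __name__ == "__main__":
    # Python ETL Pipeline scores from the table above.
    for score in (60, 98, 57):
        print(label(score))
```

Applied to the Python ETL Pipeline row, this reproduces the labels 60 (D), 98 (A), and 57 (F).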

Category Breakdown

AQS breaks into 6 categories: Health (35), Quality (20), Architecture (15), Algorithms (10), Testing (15), Completeness (5).

Python ETL Pipeline

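| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 60 | 23 | 11 | 6 | 0 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 98 | 35 | 20 | 15 | 10 | 15 | 3 |
| Codex 0.101.0 / GPT-5.3 | 57 | 25 | 13 | 7 | 0 | 9 | 3 |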

TypeScript Pathfinder

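| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 52 | 22 | 6 | 6 | 0 | 15 | 3 |
| Claude Code 2.1.42 / Opus 4.6 | 50 | 19 | 11 | 0 | 0 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 50 | 26 | 7 | 12 | 0 | 0 | 5 |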
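
The six category maxima add up to 100, and in the tables above each AQS equals the sum of its six category scores. A minimal sketch of that arithmetic follows, assuming a plain unweighted sum; `CATEGORY_MAX`, `row`, and `total_aqs` are hypothetical names used for illustration, not roam-code functions.

```python
# Maximum points per category, as listed in the breakdown legend.
# All names here are illustrative; roam-code's internals may differ.
CATEGORY_MAX = {
    "health": 35,
    "quality": 20,
    "architecture": 15,
    "algorithms": 10,
    "testing": 15,
    "completeness": 5,
}


def row(*values: int) -> dict:
    """Pair positional values with category names in table-column order."""
    return dict(zip(CATEGORY_MAX, values))


def total_aqs(scores: dict) -> int:
    """Sum six category scores into an AQS, checking each against its maximum."""
    if set(scores) != set(CATEGORY_MAX):
        raise ValueError("expected exactly the six AQS categories")
    for name, value in scores.items():
        if not 0 <= value <= CATEGORY_MAX[name]:
            raise ValueError(f"{name} score {value} is outside 0-{CATEGORY_MAX[name]}")
    return sum(scores.values())


# Python ETL Pipeline rows: (reported AQS, category scores).
ETL_ROWS = {
    "Claude Code 2.1.45 / Sonnet 4.6": (60, row(23, 11, 6, 0, 15, 5)),
    "Claude Code 2.1.42 / Opus 4.6": (98, row(35, 20, 15, 10, 15, 3)),
    "Codex 0.101.0 / GPT-5.3": (57, row(25, 13, 7, 0, 9, 3)),
}

if __name__ == "__main__":
    for combo, (reported, scores) in ETL_ROWS.items():
        computed = total_aqs(scores)
        print(f"{combo}: computed {computed}, reported {reported}")
```

The same check can be run against any other breakdown row by passing its six values to `row` in column order.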

Results: Standard Tasks

Agent Quality Score (AQS) per task. Scale: 0–100. Grade: A (90+), B (80+), C (70+), D (60+), F (<60).

| Task | Claude Code 2.1.45 / Sonnet 4.6 | Claude Code 2.1.42 / Sonnet 4.5 | Claude Code 2.1.42 / Opus 4.6 | Codex 0.101.0 / GPT-5.3 |
|---|---|---|---|---|
| Astro Landing | 90 (A) | 100 (A) | 98 (A) | 93 (A) |
| C++ Calculator | 50 (F) | 84 (B) | 85 (B) | 87 (B) |
| Go Log Analyzer | 52 (F) | 80 (B) | 67 (D) | 71 (C) |
| Python Crawler | 74 (C) | 73 (C) | 68 (D) | 64 (D) |
| React TODO | 72 (C) | 88 (B) | 85 (B) | 72 (C) |
| Average | 68 (D) | 85 (B) | 81 (B) | 77 (C) |

Category Breakdown

AQS breaks into 6 categories: Health (35), Quality (20), Architecture (15), Algorithms (10), Testing (15), Completeness (5).

Astro Landing

| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 90 | 35 | 20 | 15 | 10 | 7 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 100 | 35 | 20 | 15 | 10 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 98 | 35 | 20 | 15 | 10 | 13 | 5 |
| Codex 0.101.0 / GPT-5.3 | 93 | 34 | 16 | 15 | 10 | 13 | 5 |

C++ Calculator

| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 50 | 20 | 13 | 7 | 7 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 84 | 32 | 18 | 12 | 10 | 7 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 85 | 34 | 13 | 15 | 9 | 9 | 5 |
| Codex 0.101.0 / GPT-5.3 | 87 | 34 | 20 | 15 | 6 | 7 | 5 |

Go Log Analyzer

| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 52 | 27 | 6 | 12 | 4 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 80 | 32 | 6 | 15 | 7 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 67 | 26 | 6 | 12 | 3 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 71 | 27 | 7 | 10 | 9 | 13 | 5 |

Python Crawler

| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 74 | 31 | 12 | 10 | 1 | 15 | 5 |
| Claude Code 2.1.42 / Sonnet 4.5 | 73 | 30 | 10 | 12 | 1 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 68 | 27 | 12 | 9 | 0 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 64 | 27 | 10 | 7 | 0 | 15 | 5 |

React TODO

| Combo | AQS | Health /35 | Quality /20 | Arch /15 | Algo /10 | Testing /15 | Complete /5 |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 72 | 34 | 10 | 15 | 10 | 0 | 3 |
| Claude Code 2.1.42 / Sonnet 4.5 | 88 | 30 | 16 | 12 | 10 | 15 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | 85 | 30 | 13 | 12 | 10 | 15 | 5 |
| Codex 0.101.0 / GPT-5.3 | 72 | 18 | 20 | 10 | 10 | 9 | 5 |

Combo Averages (All Tasks)

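Average scores across all tasks each combo was evaluated on (vanilla mode). Claude Code 2.1.42 / Sonnet 4.5 has no algorithm-task results, so its averages cover the five standard tasks only.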

| Combo | Avg AQS | Avg Health | Avg Quality | Avg Arch | Avg Algo | Avg Testing | Avg Complete |
|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | 64 (D) | 27.4/35 | 11.1/20 | 10.1/15 | 4.6/10 | 7.4/15 | 3.6/5 |
| Claude Code 2.1.42 / Sonnet 4.5 | 85 (B) | 31.8/35 | 14.0/20 | 13.2/15 | 7.6/10 | 13.4/15 | 5.0/5 |
| Claude Code 2.1.42 / Opus 4.6 | 79 (C) | 29.4/35 | 13.6/20 | 11.1/15 | 6.0/10 | 13.9/15 | 4.7/5 |
| Codex 0.101.0 / GPT-5.3 | 71 (C) | 27.3/35 | 13.3/20 | 10.9/15 | 5.0/10 | 9.4/15 | 4.7/5 |
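
The Avg AQS column is consistent with an unweighted mean of each combo's per-task AQS, skipping tasks with no score and rounding to the nearest integer. The sketch below works under that assumption; it is not taken from the benchmark's source, and `AQS_BY_COMBO` and `average_aqs` are hypothetical names.

```python
from statistics import mean

# Per-task AQS in the order: Python ETL Pipeline, TypeScript Pathfinder,
# Astro Landing, C++ Calculator, Go Log Analyzer, Python Crawler, React TODO.
# None marks a task with no reported score ("--" in the results tables).
AQS_BY_COMBO = {
    "Claude Code 2.1.45 / Sonnet 4.6": [60, 52, 90, 50, 52, 74, 72],
    "Claude Code 2.1.42 / Sonnet 4.5": [None, None, 100, 84, 80, 73, 88],
    "Claude Code 2.1.42 / Opus 4.6": [98, 50, 98, 85, 67, 68, 85],
    "Codex 0.101.0 / GPT-5.3": [57, 50, 93, 87, 71, 64, 72],
}


def average_aqs(scores: list) -> int:
    """Unweighted mean over the evaluated tasks, rounded to the nearest integer.

    Assumption: unscored tasks are skipped rather than counted as zero.
    """
    evaluated = [s for s in scores if s is not None]
    if not evaluated:
        raise ValueError("no evaluated tasks to average")
    return round(mean(evaluated))


if __name__ == "__main__":
    for combo, scores in AQS_BY_COMBO.items():
        counted = sum(s is not None for s in scores)
        print(f"{combo}: {average_aqs(scores)} over {counted} tasks")
```

Skipping unscored tasks rather than treating them as zero is what makes the Sonnet 4.5 average comparable only across its five standard tasks.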

Combo Signatures

| Combo | CLI Tool | CLI Version | Model |
|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | claude | 2.1.45 | claude-sonnet-4-6 |
| Claude Code 2.1.42 / Sonnet 4.5 | claude | 2.1.42 | claude-sonnet-4-5-20250929 |
| Claude Code 2.1.42 / Opus 4.6 | claude | 2.1.45 | claude-opus-4-6 |
| Codex 0.101.0 / GPT-5.3 | codex | 0.101.0 | gpt-5.3-codex |

Evaluator: roam-code 9.0.0

Raw Metrics by Task

Detailed roam-code metrics for each task and combo.

Astro Landing

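| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 99 | 0 | 0.7 | 1.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 99 | 0 | -- | -- | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 97 | 0 | 5.3 | 25.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |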

C++ Calculator

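| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 58 | 0 | 6.2 | 25.0 | 1 | 11.10 | 1 | 1 | 1 | 1 | 1 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 91 | 0 | 2.7 | 9.0 | 1 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 96 | 0 | 6.1 | 26.0 | 1 | 0.00 | 20 | 1 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 97 | 0 | 3.7 | 11.0 | 0 | 0.00 | 20 | 4 | 0 | 0 | 0 |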

Go Log Analyzer

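| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 76 | 2 | 6.7 | 22.0 | 2 | 0.00 | 16 | 3 | 0 | 1 | 2 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 92 | 37 | 5.5 | 15.0 | 6 | 0.00 | 20 | 1 | 1 | 0 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 74 | 44 | 5.4 | 16.0 | 8 | 0.00 | 20 | 4 | 1 | 1 | 2 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 78 | 14 | 4.7 | 11.0 | 4 | 1.10 | 20 | 1 | 0 | 0 | 3 |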

Python Crawler

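| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 90 | 13 | 1.8 | 4.0 | 0 | 2.10 | 20 | 4 | 2 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 86 | 10 | 3.2 | 4.0 | 1 | 0.00 | 20 | 8 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 77 | 3 | 2.0 | 4.0 | 1 | 0.00 | 20 | 9 | 3 | 2 | 2 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 76 | 5 | 4.0 | 6.0 | 1 | 1.50 | 20 | 12 | 1 | 1 | 2 |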

Python ETL Pipeline

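| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 65 | 2 | 3.0 | 3.0 | 3 | 0.00 | 20 | 30 | 0 | 3 | 4 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 100 | 0 | -- | -- | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 71 | 1 | 8.2 | 13.0 | 1 | 6.10 | 19 | 13 | 0 | 1 | 1 |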

React TODO

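| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 98 | 0 | 26.0 | 26.0 | 0 | 0.00 | 20 | 0 | 0 | 0 | 0 |
| Claude Code 2.1.42 / Sonnet 4.5 | vanilla | 87 | 1 | 4.8 | 13.0 | 1 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 86 | 0 | 5.1 | 18.0 | 2 | 0.00 | 20 | 0 | 0 | 1 | 0 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 52 | 0 | 4.1 | 9.0 | 0 | 16.10 | 20 | 0 | 0 | 0 | 4 |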

TypeScript Pathfinder

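| Combo | Mode | Health | Dead | AvgCx | P90Cx | HiCx | Tangle | HidCoup | AntiPat | AP-Hi | Crit | Warn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Code 2.1.45 / Sonnet 4.6 | vanilla | 62 | 3 | 12.5 | 9.0 | 1 | 0.00 | 20 | 24 | 5 | 3 | 5 |
| Claude Code 2.1.42 / Opus 4.6 | vanilla | 55 | 2 | 4.9 | 14.0 | 3 | 1.60 | 20 | 29 | 0 | 4 | 7 |
| Codex 0.101.0 / GPT-5.3 | vanilla | 73 | 35 | 3.2 | 4.0 | 11 | 0.00 | 0 | 30 | 2 | 1 | 2 |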