Team-Harness Ablation & Multi-Agent Comparison

Generated 2026-05-19 22:37 UTC · agent codex · model gpt-5.5 · subset flash (50 task pairs) · backend docker · 1 seed

Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.

1. Setting comparison

Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).

configurationpassedpass rate median task timerun wall
coop (messaging only)13/50
26%
3m02s71m38s
team — no scratchpad15/50
30%
5m24s71m37s
team — no task_list20/50
40%
5m07s69m43s
solo (1 agent)24/50
48%
2m50s57m29s
coop + git28/50
56%
3m42s76m13s
team — no mcp30/50
60%
4m17s59m09s
team — no auto_refresh30/50
60%
4m04s50m20s
team — all features (baseline)31/50
62%
3m56s51m58s
team — no protocol35/50
70%
4m01s54m35s

Green row = team baseline (all features on). coop/coop+git ran at lower concurrency alongside the ablation sweep, inflating their "run wall" — the median task time is the cleaner cross-run comparison.

2. Feature ablation (one feature off per row)

All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.

feature removed task_listscratchpadmcpauto_refreshprotocol passedΔ
(baseline — none off)ononononon31/500
task_listOFFonononon20/50-11
scratchpadonOFFononon15/50-16
mcpononOFFonon30/50-1
auto_refreshonononOFFon30/50-1
protocolononononOFF35/50+4

3. Key findings

4. Methodology

Eval protocol (per task pair repo/task [f_a, f_b]):

Step budget: codex exec ran unbounded (no --max-turns), capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.

Caveats. Single seed, n=50, codex/gpt-5.5 only. Effective discriminating n is smaller than 50 — many pairs pass or fail regardless of coordination. Costs/model field show $0 because codex's --json stream omits a cost field (real spend was nonzero). Team runs used the scratchpad for code-sharing, not a git server (--git was off) — so "team vs coop+git" compares two different sharing substrates, not "team = coop+git plus extras". The untested cell team --git (both substrates) is a follow-up.