Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.
Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).
| configuration | passed | pass rate | median task time | run wall |
|---|---|---|---|---|
| coop (messaging only) | 13/50 | 3m02s | 71m38s | |
| team — no scratchpad | 15/50 | 5m24s | 71m37s | |
| team — no task_list | 20/50 | 5m07s | 69m43s | |
| solo (1 agent) | 24/50 | 2m50s | 57m29s | |
| coop + git | 28/50 | 3m42s | 76m13s | |
| team — no mcp | 30/50 | 4m17s | 59m09s | |
| team — no auto_refresh | 30/50 | 4m04s | 50m20s | |
| team — all features (baseline) | 31/50 | 3m56s | 51m58s | |
| team — no protocol | 35/50 | 4m01s | 54m35s |
Green row = team baseline (all features on). coop/coop+git ran at lower concurrency alongside the ablation sweep, inflating their "run wall" — the median task time is the cleaner cross-run comparison.
All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.
| feature removed | task_list | scratchpad | mcp | auto_refresh | protocol | passed | Δ |
|---|---|---|---|---|---|---|---|
| (baseline — none off) | on | on | on | on | on | 31/50 | 0 |
| task_list | OFF | on | on | on | on | 20/50 | -11 |
| scratchpad | on | OFF | on | on | on | 15/50 | -16 |
| mcp | on | on | OFF | on | on | 30/50 | -1 |
| auto_refresh | on | on | on | OFF | on | 30/50 | -1 |
| protocol | on | on | on | on | OFF | 35/50 | +4 |
scratchpad (−16) and
task_list (−11) — account for nearly all of team mode's value. Remove
either and team drops below solo: two uncoordinated agents are worse than one.auto_refresh is expected to be a no-op — it
only fires in Python-loop adapters, and codex is a CLI adapter.Eval protocol (per task pair repo/task [f_a, f_b]):
Step budget: codex exec ran unbounded (no --max-turns),
capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.
--json stream omits a cost field (real spend was
nonzero). Team runs used the scratchpad for code-sharing, not a
git server (--git was off) — so "team vs coop+git" compares two different
sharing substrates, not "team = coop+git plus extras". The untested cell
team --git (both substrates) is a follow-up.