Team-Harness Ablation & Multi-Agent Comparison

Generated 2026-05-19 22:37 UTC · agent codex · model gpt-5.5 · subset flash (50 task pairs) · backend docker · 1 seed

Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.

1. Setting comparison

Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).

configuration	passed	pass rate	median task time	run wall
coop (messaging only)	13/50	26%	3m02s	71m38s
team — no scratchpad	15/50	30%	5m24s	71m37s
team — no task_list	20/50	40%	5m07s	69m43s
solo (1 agent)	24/50	48%	2m50s	57m29s
coop + git	28/50	56%	3m42s	76m13s
team — no mcp	30/50	60%	4m17s	59m09s
team — no auto_refresh	30/50	60%	4m04s	50m20s
team — all features (baseline)	31/50	62%	3m56s	51m58s
team — no protocol	35/50	70%	4m01s	54m35s

Green row = team baseline (all features on). coop/coop+git ran at lower concurrency alongside the ablation sweep, inflating their "run wall" — the median task time is the cleaner cross-run comparison.

2. Feature ablation (one feature off per row)

All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.

feature removed	task_list	scratchpad	mcp	auto_refresh	protocol	passed	Δ
(baseline — none off)	on	on	on	on	on	31/50	0
task_list	OFF	on	on	on	on	20/50	-11
scratchpad	on	OFF	on	on	on	15/50	-16
mcp	on	on	OFF	on	on	30/50	-1
auto_refresh	on	on	on	OFF	on	30/50	-1
protocol	on	on	on	on	OFF	35/50	+4

3. Key findings

Code-sharing is the load-bearing mechanism. The two features that let agents see each other's work — scratchpad (−16) and task_list (−11) — account for nearly all of team mode's value. Remove either and team drops below solo: two uncoordinated agents are worse than one.
mcp, auto_refresh, protocol have no positive effect for codex. mcp and auto_refresh land within noise of baseline (−1 each); protocol-off actually scored higher (+4), suggesting the typed request/respond verbs add mild overhead without payoff here. auto_refresh is expected to be a no-op — it only fires in Python-loop adapters, and codex is a CLI adapter.
Most multi-agent value = a shared code substrate, not the orchestration. coop+git (git remote) reaches 56% and team (scratchpad+task_list) reaches 62% — both far above messaging-only coop (26%), which is the worst configuration of all, below solo.

4. Methodology

Eval protocol (per task pair repo/task [f_a, f_b]):

solo: apply the single patch, both feature suites must pass against it.
coop / team: apply each agent's patch to its own branch, then: (1) if patches are byte-identical, use one; (2) else attempt naive 3-way merge — if clean, the merged tree is authoritative and both suites must pass; (3) if the merge conflicts or a patch fails to apply, fall back to the lead's patch alone. No union merge, no member fallback.
A pass requires both feature suites green against the same tree. Eval runs in a fresh container from the task's frozen image, after the agent containers are torn down.

Step budget: codex exec ran unbounded (no --max-turns), capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.

Caveats. Single seed, n=50, codex/gpt-5.5 only. Effective discriminating n is smaller than 50 — many pairs pass or fail regardless of coordination. Costs/model field show $0 because codex's --json stream omits a cost field (real spend was nonzero). Team runs used the scratchpad for code-sharing, not a git server (--git was off) — so "team vs coop+git" compares two different sharing substrates, not "team = coop+git plus extras". The untested cell team --git (both substrates) is a follow-up.