Team-Harness Ablation & Multi-Agent Comparison

Generated 2026-05-19 · updated 2026-06-05 (workload balance, parallelism & coordination→efficiency) · agent codex · model gpt-5.5 · subset flash (50 task pairs) · backend docker · 1 seed

Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.

1. Setting comparison

Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).

| configuration | passed | pass rate | median task time | run wall |
|---|---|---|---|---|
| coop (messaging only) | 13/50 | 26% | 3m02s | 71m38s |
| team (no scratchpad) | 15/50 | 30% | 5m24s | 71m37s |
| team (no task_list) | 20/50 | 40% | 5m07s | 69m43s |
| solo (1 agent) | 24/50 | 48% | 2m50s | 57m29s |
| coop + git | 28/50 | 56% | 3m42s | 76m13s |
| team (no mcp) | 30/50 | 60% | 4m17s | 59m09s |
| team (no auto_refresh) | 30/50 | 60% | 4m04s | 50m20s |
| team (all features, baseline) | 31/50 | 62% | 3m56s | 51m58s |
| team (no protocol) | 35/50 | 70% | 4m01s | 54m35s |

The team baseline is the row with all features on. coop and coop+git ran at lower concurrency alongside the ablation sweep, which inflates their "run wall"; median task time is the cleaner cross-run comparison.

2. Feature ablation (one feature off per row)

All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.

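| feature removed | task_list | scratchpad | mcp | auto_refresh | protocol | passed | Δ |
|---|---|---|---|---|---|---|---|
| (baseline, none off) | on | on | on | on | on | 31/50 | 0 |
| task_list | OFF | on | on | on | on | 20/50 | -11 |
| scratchpad | on | OFF | on | on | on | 15/50 | -16 |
| mcp | on | on | OFF | on | on | 30/50 | -1 |
| auto_refresh | on | on | on | OFF | on | 30/50 | -1 |
| protocol | on | on | on | on | OFF | 35/50 | +4 |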
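
The Δ column is plain subtraction against the all-on baseline. A minimal sketch that recomputes it from the passed counts in the table and ranks the features by how much removing them hurts; the function name `ablation_deltas` is hypothetical and not part of the harness:

```python
def ablation_deltas(baseline_passed, passed_by_removed_feature):
    """Change in passed count when one feature is switched off."""
    return {
        feature: passed - baseline_passed
        for feature, passed in passed_by_removed_feature.items()
    }


# Passed counts (out of 50) copied from the ablation table.
passed = {
    "task_list": 20,
    "scratchpad": 15,
    "mcp": 30,
    "auto_refresh": 30,
    "protocol": 35,
}
deltas = ablation_deltas(31, passed)

# Most harmful removal first; a positive delta means the run did
# better with the feature off.
for feature, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{feature:<13} {delta:+d}")
```

With a single seed and 50 pairs, a difference of one pair (mcp, auto_refresh) is probably too small to interpret on its own.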

3. Key findings

4. Workload balance between agents

How evenly is the coding work split between the two agents? The only per-agent workload signal populated across these runs is patch_lines (codex does not emit usable per-step counts, tokens, or cost), so we define a balance factor = min(patch_lines) / max(patch_lines) per pair: 1.0 = a perfectly even code split, →0 = one agent wrote everything.
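
As a minimal sketch of the balance factor defined above: the function name and the handling of a pair where neither agent produced a patch are assumptions, since the source only gives the min/max formula.

```python
def balance_factor(patch_lines_a: int, patch_lines_b: int) -> float:
    """min(patch_lines) / max(patch_lines) for one task pair."""
    larger = max(patch_lines_a, patch_lines_b)
    if larger == 0:
        # Assumption: two empty patches count as an even split.
        return 1.0
    return min(patch_lines_a, patch_lines_b) / larger


print(balance_factor(120, 120))  # even split
print(balance_factor(200, 50))   # one agent wrote four times as much
print(balance_factor(300, 0))    # one agent wrote everything
```

Because the factor is computed per pair and then averaged, the mean balance factor of a configuration need not equal the ratio of its mean line counts: the lead can dominate in some pairs and the member in others.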

| feature removed | balance factor | lead : member lines | temporal concurrency | mean task time |
|---|---|---|---|---|
| (baseline, none off) | 0.61 | 179 : 114 | 0.62 | 248s |
| task_list | 0.59 | 169 : 115 | (no task events) | 316s |
| scratchpad | 0.60 | 109 : 105 | 0.48 | 350s |
| mcp | 0.58 | 193 : 116 | 0.66 | 270s |
| auto_refresh | 0.57 | 184 : 119 | 0.67 | 248s |
| protocol | 0.59 | 190 : 117 | 0.67 | 244s |

Balance and concurrency averaged over the 50 pairs per config (same runs as §2). "temporal concurrency" is defined in §5; it is unmeasurable for −task_list (no task-list events to timestamp).

[Figure: workload balance vs execution time]

5. Are the teams actually parallel?

Balance measures who produced how much, not whether the agents worked at the same time. Using the real wall-clock timestamps on every coordination event (task-list create/claim/update + inter-agent messages; codex records no per-action timestamps, verified against the raw HF logs), we compute a temporal-concurrency factor = overlap / union of the two agents' active spans (first→last coordination event). Over 1618 pairs (flash ablation + the 652-pair full-dataset runs) the mean is 0.63 (median 0.67).
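
The overlap / union definition can be sketched as follows. This assumes each agent's coordination events are available as a list of timestamps in seconds; the function name and the choice to return `None` when a span cannot be measured are illustrative, not taken from the harness.

```python
from typing import Optional, Sequence


def temporal_concurrency(
    events_a: Sequence[float], events_b: Sequence[float]
) -> Optional[float]:
    """Overlap / union of two agents' active spans.

    Each span runs from an agent's first to its last coordination event.
    """
    if not events_a or not events_b:
        return None  # no events to timestamp, so the factor is unmeasurable
    start_a, end_a = min(events_a), max(events_a)
    start_b, end_b = min(events_b), max(events_b)
    overlap = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Union of two intervals: sum of lengths minus the shared part.
    union = (end_a - start_a) + (end_b - start_b) - overlap
    if union <= 0:
        return None  # both spans are single instants
    return overlap / union


# Agent A active 0s to 100s, agent B active 40s to 120s:
# overlap is 60s, union is 120s.
print(temporal_concurrency([0, 60, 100], [40, 120]))  # 0.5
```

A value of 1.0 means both agents were active over the same window; 0.0 means they worked strictly one after the other.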

[Figure: real-timestamp concurrency]

6. Does coordination predict efficiency?

Coordination shouldn't predict gross wall-time — its payoff is specific: it spares the lead the cost of integrating the member's patch. The right efficiency target is the merge outcome (clean vs conflict), not total time. We test this on the full-dataset runs (652 tasks) for which we also have the matched solo run.

6.1 Merge effort is exactly what makes a second agent (not) pay off

Across 652 matched tasks the team is slower than solo overall (median speedup solo/team = 0.86; the team is faster on only 37% of tasks), but the slowdown comes from the pairs whose merge conflicts:

| team merge outcome | n | mean speedup vs solo | reading |
|---|---|---|---|
| clean | 148 | 1.03 | team matches/beats solo |
| conflict | 504 | 0.92 | team loses to solo |

The difference is significant (Mann-Whitney p=0.007). The team pass rate (0.60) still beats solo (0.56), so a clean-merging team gets the best of both: solo's speed and better accuracy. But about 77% of team merges conflict (504 of 652), which is why the average team is a net efficiency loss.
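
A minimal sketch of how the per-outcome speedups can be computed, assuming each matched task is available as a (solo seconds, team seconds, merge outcome) tuple. The tuple layout, the function name and the toy numbers are hypothetical and are not the report's data.

```python
from collections import defaultdict
from statistics import mean


def mean_speedup_by_outcome(tasks):
    """Mean solo/team speedup per merge outcome.

    tasks: iterable of (solo_seconds, team_seconds, outcome) tuples.
    A speedup above 1.0 means the team finished faster than solo.
    """
    speedups = defaultdict(list)
    for solo_seconds, team_seconds, outcome in tasks:
        speedups[outcome].append(solo_seconds / team_seconds)
    return {outcome: mean(values) for outcome, values in speedups.items()}


# Toy data only, to show the shape of the computation.
toy_tasks = [
    (100, 100, "clean"),
    (120, 100, "clean"),
    (90, 100, "conflict"),
    (80, 100, "conflict"),
]
print(mean_speedup_by_outcome(toy_tasks))
```

The significance test quoted above would then compare the two per-outcome lists of speedups rather than their means.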

[Figure: team vs solo speedup by merge outcome]

6.2 What coordination signal predicts the merge? (n=634)

We scored every team pair on a battery of coordination signals and measured each one's AUC for predicting a merge conflict (the lead's merge effort). Result: cheap structural counts are useless; a semantic read of the coordination process is the best honest signal.

| signal | AUC | note |
|---|---|---|
| message count, task-update count, claim timing, concurrency | 0.51–0.53 | structural counts: no signal |
| LLM semantic read of the process (conflict_risk) | 0.68 | best honest, pre-merge signal |
| patch file-overlap, member-only files, duration | 0.61–0.66 | leaky: downstream of the merge, unusable as a predictor |
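
For readers unfamiliar with the metric, AUC is the probability that a randomly chosen conflicting pair receives a higher score than a randomly chosen clean pair, with ties counted as half. The sketch below computes it by direct pairwise comparison in pure Python; it illustrates the metric and is not the report's scoring code, and the toy scores are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    scores: one signal value per pair (higher = more conflict risk).
    labels: truthy for a merge conflict, falsy for a clean merge.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one conflict and one clean merge")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: one clean pair outscores one conflicting pair.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

An AUC of 0.5 is chance level, which is why the 0.51–0.53 range for the structural counts reads as no signal.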

[Figure: coordination signal ranking]

Chain of evidence: coordination → clean merge (LLM read, AUC 0.68–0.84) → efficiency (clean teams 1.03× solo vs 0.92× for conflicts). Coordination predicts efficiency through the merge channel — not as raw speed, and not via any countable proxy.

6.3 Counting coordination fails; judging coordination success works

If a semantic read predicts conflict, can we reduce it to a checklist of good behaviors? We extracted five textbook coordination behaviors from each pair's process. None of those tabulated below reduces conflicts: the conflict rate is flat to higher when the behavior is present.

| behavior present | conflict rate (present) | conflict rate (absent) | Δ |
|---|---|---|---|
| declared ownership early | 0.81 | 0.71 | +0.10 |
| flagged a potential overlap | 0.82 | 0.75 | +0.07 |
| agreed on an interface | 0.83 | 0.78 | +0.05 |
| explicit file partition | 0.80 | 0.75 | +0.05 |
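
Each row of this comparison is a difference between two conditional conflict rates. A minimal sketch, assuming every pair has been reduced to a (behavior present, merge conflicted) tuple of booleans; the function name and the toy data are hypothetical.

```python
def conflict_rate_delta(pairs):
    """Conflict rate with and without a behavior, plus their difference.

    pairs: iterable of (behavior_present, conflicted) booleans.
    Returns (rate_present, rate_absent, delta); a positive delta means
    conflicts are more frequent when the behavior is present.
    """
    present = [conflicted for has_behavior, conflicted in pairs if has_behavior]
    absent = [conflicted for has_behavior, conflicted in pairs if not has_behavior]
    if not present or not absent:
        raise ValueError("need pairs both with and without the behavior")
    rate_present = sum(present) / len(present)
    rate_absent = sum(absent) / len(absent)
    return rate_present, rate_absent, rate_present - rate_absent


# Toy data: 3 of 4 pairs with the behavior conflict, 1 of 2 without.
toy_pairs = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False),
]
print(conflict_rate_delta(toy_pairs))  # (0.75, 0.5, 0.25)
```

The Δ is a correlation across pairs, so it does not by itself show that a behavior causes conflicts.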

[Figure: behaviors vs holistic risk]

6.4 The real ceiling: the tasks aren't file-partitionable

Why do roughly 80% of merges conflict, and why does coordination help so little? Because the benchmark's two features are, by construction, almost always editing the same files. Measuring the gold (reference) patches of each pair's two features: median file-overlap = 1.00; 91% of the 652 tasks have overlap > 0.8; not one pair is disjoint. For example, in go_chi/56, feature 2 touches {context, mux, tree}.go and feature 5 touches {context, cors, mux, tree}.go.
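
The report does not state the overlap formula. The sketch below assumes the overlap coefficient, |A ∩ B| / min(|A|, |B|), which gives 1.0 whenever one feature's file set is contained in the other's; under Jaccard similarity the same go_chi/56 example would score 0.75. Both function names are illustrative.

```python
def overlap_coefficient(files_a, files_b):
    """|A & B| / min(|A|, |B|); assumed definition, not confirmed by the source."""
    a, b = set(files_a), set(files_b)
    if not a or not b:
        return 0.0  # assumption: an empty patch overlaps with nothing
    return len(a & b) / min(len(a), len(b))


def jaccard(files_a, files_b):
    """|A & B| / |A | B|, shown only for comparison."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# File sets from the go_chi/56 example in the text.
feature_2 = {"context.go", "mux.go", "tree.go"}
feature_5 = {"context.go", "cors.go", "mux.go", "tree.go"}

print(overlap_coefficient(feature_2, feature_5))  # 1.0
print(jaccard(feature_2, feature_5))              # 0.75
```

Under either definition a pair is disjoint only when the score is exactly 0.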

[Figure: gold feature overlap distribution]

7. Methodology

Eval protocol (per task pair repo/task [f_a, f_b]):

Step budget: codex exec ran unbounded (no --max-turns), capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.

Caveats. Single seed, n=50, codex/gpt-5.5 only. The effective discriminating n is smaller than 50, because many pairs pass or fail regardless of coordination. The cost fields show $0 because codex's --json stream omits a cost field (real spend was nonzero). Team runs used the scratchpad for code-sharing, not a git server (--git was off), so "team vs coop+git" compares two different sharing substrates, not "team = coop+git plus extras". The untested cell, team with --git (both substrates), is left as a follow-up.