Team-Harness Ablation & Multi-Agent Comparison

Generated 2026-05-19 · updated 2026-06-05 (workload balance, parallelism & coordination→efficiency) · agent codex · model gpt-5.5 · subset flash (50 task pairs) · backend docker · 1 seed

Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.

1. Setting comparison

Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).

configuration	passed	pass rate	median task time	run wall
coop (messaging only)	13/50	26%	3m02s	71m38s
team — no scratchpad	15/50	30%	5m24s	71m37s
team — no task_list	20/50	40%	5m07s	69m43s
solo (1 agent)	24/50	48%	2m50s	57m29s
coop + git	28/50	56%	3m42s	76m13s
team — no mcp	30/50	60%	4m17s	59m09s
team — no auto_refresh	30/50	60%	4m04s	50m20s
team — all features (baseline)	31/50	62%	3m56s	51m58s
team — no protocol	35/50	70%	4m01s	54m35s

The baseline row is team mode with all features on. coop and coop+git ran at lower concurrency alongside the ablation sweep, which inflates their "run wall"; median task time is the cleaner cross-run comparison.

2. Feature ablation (one feature off per row)

All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.

feature removed	task_list	scratchpad	mcp	auto_refresh	protocol	passed	Δ
(baseline — none off)	on	on	on	on	on	31/50	0
task_list	OFF	on	on	on	on	20/50	-11
scratchpad	on	OFF	on	on	on	15/50	-16
mcp	on	on	OFF	on	on	30/50	-1
auto_refresh	on	on	on	OFF	on	30/50	-1
protocol	on	on	on	on	OFF	35/50	+4
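
The Δ column is plain subtraction against the all-on baseline. A minimal sketch that recomputes it from the passed counts in the table above; the variable and function names are illustrative, not part of the harness:

```python
# Passed counts copied from the ablation table (50 pairs per row).
BASELINE_PASSED = 31
PASSED_WITH_FEATURE_OFF = {
    "task_list": 20,
    "scratchpad": 15,
    "mcp": 30,
    "auto_refresh": 30,
    "protocol": 35,
}


def ablation_deltas(baseline, passed_off):
    """Return {feature: passed_with_feature_off - baseline}."""
    return {name: passed - baseline for name, passed in passed_off.items()}


deltas = ablation_deltas(BASELINE_PASSED, PASSED_WITH_FEATURE_OFF)

# Most harmful removal first: a negative delta means the feature was helping.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:<13}{delta:+d}")
```

With a single seed and 50 pairs, deltas of one or two tasks are best read as noise rather than as a feature effect.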

3. Key findings

Code-sharing is the load-bearing mechanism. The two features that let agents see each other's work — scratchpad (−16) and task_list (−11) — account for nearly all of team mode's value. Remove either and team drops below solo: two uncoordinated agents are worse than one.
mcp, auto_refresh, and protocol have no positive effect for codex. mcp and auto_refresh land within noise of baseline (−1 each); protocol-off actually scored higher (+4), suggesting the typed request/respond verbs add mild overhead without payoff here. auto_refresh is expected to be a no-op: it only fires in Python-loop adapters, and codex is a CLI adapter.
Most multi-agent value = a shared code substrate, not the orchestration. coop+git (git remote) reaches 56% and team (scratchpad+task_list) reaches 62% — both far above messaging-only coop (26%), which is the worst configuration of all, below solo.

4. Workload balance between agents

How evenly is the coding work split between the two agents? The only per-agent workload signal populated across these runs is patch_lines (codex does not emit usable per-step counts, tokens, or cost), so we define a balance factor = min(patch_lines) / max(patch_lines) per pair: 1.0 = a perfectly even code split, →0 = one agent wrote everything.

feature removed	balance factor	lead : member lines	temporal concurrency	mean task time
(baseline — none off)	0.61	179 : 114	0.62	248s
task_list	0.59	169 : 115	— (no task events)	316s
scratchpad	0.60	109 : 105	0.48	350s
mcp	0.58	193 : 116	0.66	270s
auto_refresh	0.57	184 : 119	0.67	248s
protocol	0.59	190 : 117	0.67	244s

Balance and concurrency averaged over the 50 pairs per config (same runs as §2). "temporal concurrency" is defined in §5; it is unmeasurable for −task_list (no task-list events to timestamp).
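
The balance factor defined above can be sketched as follows. The input pairs are hypothetical, and returning `None` when neither agent changed any line is an assumption, since the report does not say how empty patches are handled:

```python
from statistics import mean


def balance_factor(lead_lines, member_lines):
    """min/max of the two agents' patch sizes: 1.0 = even split, 0.0 = one-sided."""
    larger = max(lead_lines, member_lines)
    if larger == 0:
        return None  # assumption: undefined when neither agent changed anything
    return min(lead_lines, member_lines) / larger


# Hypothetical (lead, member) patch_lines for three task pairs.
pairs = [(200, 100), (90, 120), (150, 0)]
factors = [balance_factor(lead, member) for lead, member in pairs]

# Per-config balance is the mean of the per-pair factors.
config_balance = mean(f for f in factors if f is not None)
print(factors, round(config_balance, 2))
```

Because the factor is averaged per pair, it need not equal the ratio of the mean member lines to the mean lead lines shown in the "lead : member lines" column.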

Figure: workload balance vs execution time

Imbalance is structural, not feature-driven. The balance factor sits in a tight 0.57–0.61 band regardless of which coordination feature is removed. Ablating features moves time and pass rate a lot (§1–2) but barely moves balance — the split is governed by the lead/member role, not the harness.
The lead does most of the coding. Across 1694 pairs the lead is the larger contributor in 96% of cases, averaging 1.56× the member's patch size (175 vs 112 lines) — partly because the lead also integrates the merged tree.
Balance does not track execution time. Pooled r = 0.16 is statistically significant (p = 0.007) but explains only 2.4% of variance and is unstable across configs (per-config r ranges −0.12 to +0.49) — i.e. a real p-value with a negligible, unreliable effect.

5. Are the teams actually parallel?

Balance measures who produced how much, not whether the agents worked at the same time. Using the real wall-clock timestamps on every coordination event (task-list create/claim/update + inter-agent messages; codex records no per-action timestamps, verified against the raw HF logs), we compute a temporal-concurrency factor = overlap / union of the two agents' active spans (first→last coordination event). Over 1618 pairs (flash ablation + the 652-pair full-dataset runs) the mean is 0.63 (median 0.67).
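
A minimal sketch of the overlap / union computation, assuming each agent's coordination events are available as a list of timestamps in seconds. The union here is the total time covered by either span (so a gap between disjoint spans is not counted), which is one reasonable reading of the definition rather than the confirmed one:

```python
def temporal_concurrency(events_a, events_b):
    """overlap / union of two agents' active spans (first to last event)."""
    if not events_a or not events_b:
        return None  # no coordination events, so the span is unmeasurable
    start_a, end_a = min(events_a), max(events_a)
    start_b, end_b = min(events_b), max(events_b)
    # Length of the time window in which both agents are inside their span.
    overlap = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Time covered by at least one span.
    union = (end_a - start_a) + (end_b - start_b) - overlap
    if union == 0:
        return None  # assumption: both spans are single instants
    return overlap / union


# Hypothetical event timestamps (seconds since the run started).
overlapping = temporal_concurrency([0, 60, 100], [20, 80, 120])
disjoint = temporal_concurrency([0, 10], [20, 30])
print(overlapping, disjoint)
```

An agent with no timestamped events yields `None`, which matches the "unmeasurable" entry for the run without task-list events.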

Figure: real-timestamp concurrency

Balanced output ≠ parallel execution. Across 1618 pairs, output balance and temporal concurrency are uncorrelated (r = 0.07): teams can split the code evenly yet work in disjoint time windows, or overlap heavily while one agent dominates the output. min/max patch_lines is therefore a poor parallelism indicator on its own.
The shared surface is what enables co-activity. Removing scratchpad drops concurrency to 0.48 (the lowest) and is also the slowest config (350s mean task time): without a place to see each other's work, the agents serialize. Removing protocol ties for the highest concurrency (0.67, level with auto_refresh off) and gives the fastest time (244s), consistent with the §3 reading that the typed request/respond verbs add serializing overhead without payoff.
Better parallelism metrics need per-action timestamps. This concurrency factor is a coordination-event envelope and likely over-estimates instant-by-instant overlap. A true parallel-efficiency / co-activity metric (and a serialization/blocking fraction) would require logging per-action wall-clock in the adapters — codex currently records none.

6. Does coordination predict efficiency?

Coordination shouldn't predict gross wall-time — its payoff is specific: it spares the lead the cost of integrating the member's patch. The right efficiency target is the merge outcome (clean vs conflict), not total time. We test this on the full-dataset runs (652 tasks) for which we also have the matched solo run.

6.1 Merge effort is exactly what makes a second agent (not) pay off

Across 652 matched tasks the team is slower than solo overall (median speedup solo/team = 0.86; the team is faster on only 37% of tasks), but the deficit comes entirely from merges that conflict:

team merge outcome	n	mean speedup vs solo	reading
clean	148	1.03	team matches/beats solo
conflict	504	0.92	team loses to solo

The difference is significant (Mann-Whitney p=0.007). Team pass rate (0.60) still beats solo (0.56), so a clean-merging team gets the best of both: solo's speed and better accuracy. Roughly four in five team merges conflict (504 of 652, 77%), which is why the average team is a net efficiency loss.
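
The speedup breakdown can be sketched as below. The task records are hypothetical and only illustrate the grouping; speedup is solo time divided by team time, as in the text:

```python
from statistics import mean, median


def speedup(solo_seconds, team_seconds):
    """solo/team wall time; a value above 1.0 means the team was faster."""
    return solo_seconds / team_seconds


# Hypothetical records: (solo_seconds, team_seconds, merge_outcome).
tasks = [
    (200.0, 180.0, "clean"),
    (150.0, 150.0, "clean"),
    (120.0, 160.0, "conflict"),
    (300.0, 400.0, "conflict"),
]

by_outcome = {}
for solo_s, team_s, outcome in tasks:
    by_outcome.setdefault(outcome, []).append(speedup(solo_s, team_s))

mean_speedup = {outcome: mean(values) for outcome, values in by_outcome.items()}

all_speedups = [s for values in by_outcome.values() for s in values]
overall_median = median(all_speedups)
team_faster_share = sum(1 for s in all_speedups if s > 1.0) / len(all_speedups)
print(mean_speedup, overall_median, team_faster_share)
```

A rank test on the two groups (for example `scipy.stats.mannwhitneyu`) would then check whether the clean and conflict distributions differ; it is left out here to keep the sketch dependency-free.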

Figure: team vs solo speedup by merge outcome

6.2 What coordination signal predicts the merge? (n=634)

We scored every team pair on a battery of coordination signals and measured each one's AUC for predicting a merge conflict (the lead's merge effort). Result: cheap structural counts are useless; a semantic read of the coordination process is the best honest signal.

signal	AUC	note
message count, task-update count, claim timing, concurrency	0.51–0.53	structural counts — no signal
LLM semantic read of the process (conflict_risk)	0.68	best honest, pre-merge signal
patch file-overlap, member-only files, duration	0.61–0.66	leaky — downstream of the merge, unusable as a predictor
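
For readers who want to reproduce the ranking, a single-signal AUC can be computed without any library as the probability that a randomly chosen conflicting pair scores higher than a randomly chosen clean pair. This is a sketch with hypothetical data, not the report's scoring code:

```python
def auc(scores, labels):
    """P(random positive outranks random negative); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical pairs: label 1 = merge conflict, 0 = clean merge.
conflict = [1, 1, 1, 0, 0, 1]
message_count = [3, 2, 4, 3, 3, 3]                 # a structural count
conflict_risk = [0.9, 0.6, 0.7, 0.2, 0.65, 0.8]    # a judged risk score

print(auc(message_count, conflict), auc(conflict_risk, conflict))
```

An AUC of 0.5 means the signal ranks conflicting and clean pairs no better than chance. The pairwise loop is quadratic, which is acceptable at a few hundred pairs.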

Figure: coordination signal ranking

Counting coordination doesn't work. How many messages/claims the agents exchange, when they first claim, and even real-timestamp concurrency all sit at AUC 0.51–0.53: the volume of coordination says nothing about whether the merge will be clean.
Reading coordination does. An LLM judging the same process (messages + task notes, no code) hits AUC 0.68 from a single conflict_risk score, and 5-fold CV-AUC 0.64 — more stable than the structural bundle (0.65 ± 0.16). Combining signals does not beat it; they capture the same thin process.
Beware leakage. Patch file-overlap looks predictive but in team mode the lead's patch reflects whether integration succeeded — it encodes the answer. Such post-merge signals must be excluded from any predictor.
The ceiling is input richness, not method. 0.65 is the limit on this run's sparse coordination trace (~3 messages/pair). On a coop run with full message threads and non-leaky file intent, the same LLM judge predicted clean-vs-conflict at 84% (vs a 51% file-overlap baseline). To predict — and reward — merge efficiency, log richer coordination: explicit file/interface ownership declared before editing.
It generalizes. The same zero-shot judge on an independent 651-pair run (team-noproto) reproduces the pattern: LLM signals predict conflict out-of-sample (conflict_risk AUC 0.60, communication 0.72) while structural counts stay at chance (messages 0.53, task-updates 0.50). And it is actionable: across two coordination conditions, the one with fewer merge conflicts (trust-prompt 24% vs shared-doc 38%) had the higher pass rate (32% vs 20%). Changing the coordination changed merge effort, which in turn changed the outcome.

Chain of evidence: coordination → clean merge (LLM read, AUC 0.68–0.84) → efficiency (clean teams 1.03× solo vs 0.92× for conflicts). Coordination predicts efficiency through the merge channel — not as raw speed, and not via any countable proxy.

6.3 Counting coordination fails; judging coordination success works

If a semantic read predicts conflict, can we reduce it to a checklist of good behaviors? We extracted five textbook coordination behaviors from each pair's process. None reduce conflicts — they are flat-to-positive:

behavior	conflict rate (present)	conflict rate (absent)	Δ (present − absent)
declared ownership early	0.81	0.71	+0.10
flagged a potential overlap	0.82	0.75	+0.07
agreed on an interface	0.83	0.78	+0.05
explicit file partition	0.80	0.75	+0.05
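
Each row of the table is a difference of two conditional conflict rates. A minimal sketch, with hypothetical pair records and an illustrative flag name (`declared_ownership`):

```python
def conflict_rate_delta(pairs, behavior):
    """Return (rate_present, rate_absent, delta) for one behavior flag."""
    present = [p["conflict"] for p in pairs if p[behavior]]
    absent = [p["conflict"] for p in pairs if not p[behavior]]
    if not present or not absent:
        raise ValueError("behavior must be present in some pairs and absent in others")
    rate_present = sum(present) / len(present)
    rate_absent = sum(absent) / len(absent)
    return rate_present, rate_absent, rate_present - rate_absent


# Hypothetical pairs: each records one behavior flag and the merge outcome.
pairs = [
    {"declared_ownership": True, "conflict": True},
    {"declared_ownership": True, "conflict": True},
    {"declared_ownership": True, "conflict": False},
    {"declared_ownership": True, "conflict": True},
    {"declared_ownership": False, "conflict": True},
    {"declared_ownership": False, "conflict": False},
]

rate_present, rate_absent, delta = conflict_rate_delta(pairs, "declared_ownership")
print(rate_present, rate_absent, delta)
```

The delta is a raw association: it does not adjust for task difficulty, so a positive value cannot be read as the behavior causing conflicts.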

Figure: behaviors vs holistic risk

Performing coordination is a marker of a hard task, not a cause of a clean merge. Pairs that declare ownership, flag overlaps and agree interfaces are precisely the ones on entangled features that need it — so the behavior correlates with more conflict. This is why both structural counts (§6.2) and behavior flags fail: both measure "did they coordinate," which is confounded by difficulty.
What works is judging whether the coordination will actually hold. The LLM's holistic conflict_risk sorts pairs into conflict rates of 0.69 → 0.83 → 0.91 across terciles — it reads not "did they coordinate" but "given what they said and own, will the two patches collide." That is the coordination signal that predicts efficiency.
Implication for harness design. Don't reward coordination theater (a behavior checklist is gameable and difficulty-confounded). Reward predicted coordination success — or change the task structure so features are partitionable in the first place (the lead-as-integrator bottleneck and 80% conflict rate suggest the current split is the problem).

6.4 The real ceiling: the tasks aren't file-partitionable

Why do 80% of merges conflict, and why does coordination help so little? Because the benchmark's two features almost always edit the same files by construction. Measured on the gold (reference) patches of each pair's two features: median file-overlap = 1.00; 91% of 652 tasks overlap >0.8; not one pair is disjoint (e.g. go_chi/56: feature 2 = {context, mux, tree}.go, feature 5 = {context, cors, mux, tree}.go).
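
The report does not spell out the file-overlap formula. One reading consistent with a median of 1.00 is the overlap coefficient (shared files divided by the size of the smaller file set); this is an assumption, and the sketch below shows it next to Jaccard similarity on the go_chi/56 example for comparison:

```python
def overlap_coefficient(files_a, files_b):
    """|A & B| / min(|A|, |B|): 1.0 when the smaller file set is fully shared."""
    a, b = set(files_a), set(files_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def jaccard(files_a, files_b):
    """|A & B| / |A | B|: 1.0 only when both file sets are identical."""
    a, b = set(files_a), set(files_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# File sets from the go_chi/56 example in the text.
feature_2 = {"context.go", "mux.go", "tree.go"}
feature_5 = {"context.go", "cors.go", "mux.go", "tree.go"}

print(overlap_coefficient(feature_2, feature_5), jaccard(feature_2, feature_5))
```

Under the overlap coefficient the example scores 1.0 because every file touched by feature 2 is also touched by feature 5; under Jaccard it scores 0.75.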

Figure: gold feature overlap distribution

Clean file partitioning is impossible here. The agents are set up to collide; the best coordination can do is sequence edits or agree interfaces within shared files, not avoid them. This is the structural reason behaviors like "explicit file partition" don't help — there is nothing to partition.
But it's not all task structure. Combining the oracle entanglement (gold file-overlap + shared-file count) with the live LLM conflict_risk lifts 5-fold CV-AUC to 0.78, well above either alone (0.65 / 0.68). The size of the shared surface and whether the agents navigated it are complementary — so within these forced-overlap tasks, coordination quality still moves the outcome.
Highest-leverage fix is structural, not prompt-level. Coordination's predictive ceiling for efficiency is bounded by non-partitionable tasks. To make a second agent pay off, change the task/role structure (partitionable features, or a union/sequential merge protocol that doesn't discard on overlap) rather than asking for better coordination on tasks that can't be cleanly divided.

7. Methodology

Eval protocol (per task pair repo/task [f_a, f_b]):

solo: apply the single patch, both feature suites must pass against it.
coop / team: apply each agent's patch to its own branch, then: (1) if patches are byte-identical, use one; (2) else attempt naive 3-way merge — if clean, the merged tree is authoritative and both suites must pass; (3) if the merge conflicts or a patch fails to apply, fall back to the lead's patch alone. No union merge, no member fallback.
A pass requires both feature suites green against the same tree. Eval runs in a fresh container from the task's frozen image, after the agent containers are torn down.
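
The coop / team tree-selection rules above can be sketched as a small decision function. `apply_patch` and `three_way_merge` are hypothetical callbacks standing in for the real patch and merge steps (each returns `None` on failure), and `MergeDecision` is an illustrative type, not part of the harness:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MergeDecision:
    source: str           # "identical", "merged", or "lead_only"
    tree: Optional[str]   # opaque handle to the tree both suites run against


def choose_tree(
    lead_patch: bytes,
    member_patch: bytes,
    apply_patch: Callable[[bytes], Optional[str]],
    three_way_merge: Callable[[str, str], Optional[str]],
) -> MergeDecision:
    lead_tree = apply_patch(lead_patch)
    # (1) Byte-identical patches: use one of them.
    if lead_patch == member_patch:
        return MergeDecision("identical", lead_tree)
    member_tree = apply_patch(member_patch)
    # (2) Naive 3-way merge: a clean result is authoritative.
    if lead_tree is not None and member_tree is not None:
        merged = three_way_merge(lead_tree, member_tree)
        if merged is not None:
            return MergeDecision("merged", merged)
    # (3) Conflict or failed apply: fall back to the lead's patch alone.
    return MergeDecision("lead_only", lead_tree)


# Stub callbacks for illustration only.
def apply_ok(patch: bytes) -> Optional[str]:
    return "tree:" + patch.decode()


def merge_conflict(a: str, b: str) -> Optional[str]:
    return None


print(choose_tree(b"x", b"y", apply_ok, merge_conflict))
```

There is deliberately no branch that falls back to the member's patch or unions the two patches, mirroring the "no union merge, no member fallback" rule. How a failed lead patch is scored is not stated in the text; here it simply yields a `None` tree.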

Step budget: codex exec ran unbounded (no --max-turns), capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.

Caveats. Single seed, n=50, codex/gpt-5.5 only. Effective discriminating n is smaller than 50, since many pairs pass or fail regardless of coordination. The costs/model field shows $0 because codex's --json stream omits a cost field (real spend was nonzero). Team runs used the scratchpad for code-sharing, not a git server (--git was off), so "team vs coop+git" compares two different sharing substrates, not "team = coop+git plus extras". The untested cell team --git (both substrates) is a follow-up.