Measures (a) how N-agent settings compare — solo (1 agent), coop (messaging only), coop+git (shared git remote), and team (lead/member + shared task list + scratchpad) — and (b) the marginal contribution of each of the five team-harness coordination features via one-feature-off ablation.
Each row is the same 50 task pairs. A pair "passes" only if both features' held-out test suites pass against one merged tree (see Methodology).
| configuration | passed | pass rate | median task time | run wall |
|---|---|---|---|---|
| coop (messaging only) | 13/50 | 3m02s | 71m38s | |
| team — no scratchpad | 15/50 | 5m24s | 71m37s | |
| team — no task_list | 20/50 | 5m07s | 69m43s | |
| solo (1 agent) | 24/50 | 2m50s | 57m29s | |
| coop + git | 28/50 | 3m42s | 76m13s | |
| team — no mcp | 30/50 | 4m17s | 59m09s | |
| team — no auto_refresh | 30/50 | 4m04s | 50m20s | |
| team — all features (baseline) | 31/50 | 3m56s | 51m58s | |
| team — no protocol | 35/50 | 4m01s | 54m35s |
Green row = team baseline (all features on). coop/coop+git ran at lower concurrency alongside the ablation sweep, inflating their "run wall" — the median task time is the cleaner cross-run comparison.
All rows are team mode on the same 50 pairs; Δ is the change in passed count vs the all-on baseline. The lead/member role split stays on in every row — it is the defining property of team mode, not a toggle.
| feature removed | task_list | scratchpad | mcp | auto_refresh | protocol | passed | Δ |
|---|---|---|---|---|---|---|---|
| (baseline — none off) | on | on | on | on | on | 31/50 | 0 |
| task_list | OFF | on | on | on | on | 20/50 | -11 |
| scratchpad | on | OFF | on | on | on | 15/50 | -16 |
| mcp | on | on | OFF | on | on | 30/50 | -1 |
| auto_refresh | on | on | on | OFF | on | 30/50 | -1 |
| protocol | on | on | on | on | OFF | 35/50 | +4 |
scratchpad (−16) and
task_list (−11) — account for nearly all of team mode's value. Remove
either and team drops below solo: two uncoordinated agents are worse than one.auto_refresh is expected to be a no-op — it
only fires in Python-loop adapters, and codex is a CLI adapter.How evenly is the coding work split between the two agents? The only per-agent
workload signal populated across these runs is patch_lines (codex does
not emit usable per-step counts, tokens, or cost), so we define a
balance factor = min(patch_lines) / max(patch_lines) per pair:
1.0 = a perfectly even code split, →0 = one agent wrote
everything.
| feature removed | balance factor | lead : member lines | temporal concurrency | mean task time |
|---|---|---|---|---|
| (baseline — none off) | 0.61 | 179 : 114 | 0.62 | 248s |
| task_list | 0.59 | 169 : 115 | — (no task events) | 316s |
| scratchpad | 0.60 | 109 : 105 | 0.48 | 350s |
| mcp | 0.58 | 193 : 116 | 0.66 | 270s |
| auto_refresh | 0.57 | 184 : 119 | 0.67 | 248s |
| protocol | 0.59 | 190 : 117 | 0.67 | 244s |
Balance and concurrency averaged over the 50 pairs per config (same runs as
§2). "temporal concurrency" is defined in §5; it is unmeasurable for
−task_list (no task-list events to timestamp).
r = 0.16 is statistically significant
(p = 0.007) but explains only 2.4% of variance and is
unstable across configs (per-config r ranges −0.12 to +0.49) —
i.e. a real p-value with a negligible, unreliable effect.Balance measures who produced how much, not whether the agents worked at the same time. Using the real wall-clock timestamps on every coordination event (task-list create/claim/update + inter-agent messages; codex records no per-action timestamps, verified against the raw HF logs), we compute a temporal-concurrency factor = overlap / union of the two agents' active spans (first→last coordination event). Over 1618 pairs (flash ablation + the 652-pair full-dataset runs) the mean is 0.63 (median 0.67).
r = 0.07): teams can split the code evenly yet work in disjoint
time windows, or overlap heavily while one agent dominates the output. min/max
patch_lines is therefore a poor parallelism indicator on its own.scratchpad drops concurrency to 0.48 (the lowest) and is also the
slowest config — without a place to see each other's work, the agents serialize.
Removing protocol gives the highest concurrency (0.67) and the fastest time,
consistent with §2's finding that the typed request/respond verbs add serializing
overhead without payoff.Coordination shouldn't predict gross wall-time — its payoff is specific: it spares the
lead the cost of integrating the member's patch. The right efficiency target is the
merge outcome (clean vs conflict), not total time. We test this on the full-dataset
runs (652 tasks) for which we also have the matched solo run.
Across 652 matched tasks the team is slower than solo overall (median speedup solo/team = 0.86; the team is faster on only 37% of tasks) — but that is entirely the merge:
| team merge outcome | n | mean speedup vs solo | reading |
|---|---|---|---|
| clean | 148 | 1.03 | team matches/beats solo |
| conflict | 504 | 0.92 | team loses to solo |
Difference significant (Mann-Whitney p=0.007). Team pass rate (0.60) still beats solo (0.56) — so a clean-merging team is the best of both: solo's speed and better accuracy. 80% of team merges conflict, which is why the average team is a net efficiency loss.
We scored every team pair on a battery of coordination signals and measured each one's AUC for predicting a merge conflict (the lead's merge effort). Result: cheap structural counts are useless; a semantic read of the coordination process is the best honest signal.
| signal | AUC | note |
|---|---|---|
| message count, task-update count, claim timing, concurrency | 0.51–0.53 | structural counts — no signal |
| LLM semantic read of the process (conflict_risk) | 0.68 | best honest, pre-merge signal |
| patch file-overlap, member-only files, duration | 0.61–0.66 | leaky — downstream of the merge, unusable as a predictor |
conflict_risk score, and 5-fold CV-AUC 0.64
— more stable than the structural bundle (0.65 ± 0.16). Combining signals does
not beat it; they capture the same thin process.team-noproto) reproduces the pattern: LLM signals predict conflict out-of-sample
(conflict_risk AUC 0.60, communication 0.72) while structural counts stay at chance
(messages 0.53, task-updates 0.50). And it is actionable: across two coordination
conditions, the one with fewer merge conflicts (trust-prompt 24% vs shared-doc 38%) had the
higher pass rate (32% vs 20%) — moving coordination moved merge effort moved outcome.Chain of evidence: coordination → clean merge (LLM read, AUC 0.68–0.84) → efficiency (clean teams 1.03× solo vs 0.92× for conflicts). Coordination predicts efficiency through the merge channel — not as raw speed, and not via any countable proxy.
If a semantic read predicts conflict, can we reduce it to a checklist of good behaviors? We extracted five textbook coordination behaviors from each pair's process. None reduce conflicts — they are flat-to-positive:
| behavior present | conflict rate (present) | (absent) | Δ |
|---|---|---|---|
| declared ownership early | 0.81 | 0.71 | +0.10 |
| flagged a potential overlap | 0.82 | 0.75 | +0.07 |
| agreed on an interface | 0.83 | 0.78 | +0.05 |
| explicit file partition | 0.80 | 0.75 | +0.05 |
conflict_risk sorts pairs into conflict rates of 0.69 → 0.83 →
0.91 across terciles — it reads not "did they coordinate" but "given what they said and
own, will the two patches collide." That is the coordination signal that predicts efficiency.Why is 80% of merges a conflict, and why does coordination help so little? Because the benchmark's
two features are, by construction, almost always editing the same files. Measuring the
gold (reference) patches of each pair's two features:
median file-overlap = 1.00; 91% of 652 tasks overlap >0.8; not one pair is disjoint.
(e.g. go_chi/56: feature 2 = {context, mux, tree}.go, feature 5 = {context,
cors, mux, tree}.go.)
conflict_risk lifts 5-fold CV-AUC to
0.78, well above either alone (0.65 / 0.68). The size of the shared surface and
whether the agents navigated it are complementary — so within these forced-overlap tasks,
coordination quality still moves the outcome.Eval protocol (per task pair repo/task [f_a, f_b]):
Step budget: codex exec ran unbounded (no --max-turns),
capped only by a 2-hour wall-clock timeout; agents self-terminated after ~50–95 tool calls each.
--json stream omits a cost field (real spend was
nonzero). Team runs used the scratchpad for code-sharing, not a
git server (--git was off) — so "team vs coop+git" compares two different
sharing substrates, not "team = coop+git plus extras". The untested cell
team --git (both substrates) is a follow-up.