CooperBench Coordination Study — full dataset

Generated 2026-05-23 · agent codex (+ mini_swe_agent_v2 head-to-head) · model gpt-5.5-hao (Azure) · backend docker · 1 seed

Compares how multi-agent coordination affects task success on CooperBench, where each task pairs two features that two agents can split. Headline: a structured team harness (lead/member + shared task list + scratchpad) beats a single agent, while plain shared-git coop underperforms solo — coordination only pays off when it's structured.

1. Setting comparison — full dataset (652 pairs)

setting	passed	pass rate	median task time
coop + git	329/645	51%	3m26s
solo (1 agent)	362/652	56%	3m50s
team (full harness)	390/636	61%	4m23s
team − protocol	403/651	62%	4m06s

A pair passes only if both features' held-out suites pass against one merged tree. Green = the team harness. Median task time: coop+git fastest (each agent does one feature, in parallel); team slowest (coordination + integration overhead). Note: solo was timed under heavier host load than the others, so its true median is a bit lower.

2. Feature ablation — flash (50 pairs), one feature off per row

feature removed	task_list	scratchpad	mcp	auto_refresh	protocol	passed	Δ
(baseline)	on	on	on	on	on	31/50	—
task_list	OFF	on	on	on	on	20/48	-11
scratchpad	on	OFF	on	on	on	15/46	-16
mcp	on	on	OFF	on	on	30/50	-1
auto_refresh	on	on	on	OFF	on	30/50	-1
protocol	on	on	on	on	OFF	35/50	+4

All rows are team mode; Δ vs the all-on baseline. The lead/member split stays on in every row.

3. msa vs codex — flash (50 pairs, same model)

setting	codex pass	msa pass	codex median	msa median
solo	27/50 (54%)	28/50 (56%)	3m39s	3m05s
coop+git	25/50 (50%)	20/48 (42%)	3m25s	4m57s

Solo: msa ≈ codex (and msa a touch faster). coop+git: codex clearly better (more accurate + faster) — codex coordinates the 2-agent+git setting better. (msa coop+git had 2 tasks error out.)

4. Why team is slowest — timing decomposition

Both coop and team run their agents concurrently in threads, so team's longer wall time isn't from running agents back-to-back. It's structural: team serializes integration onto the lead. The member implements its one feature and exits early; the lead then keeps working a whole extra phase — pulling in the member's patch, integrating it on top of its own feature, re-running tests, and driving the task list to "all done" before submitting. Per-agent wall is reconstructed from each container's stream.log mtime relative to the task's start.

config	→ first claim	member finishes	lead finishes	lead tail (after member)
team (full harness)	0m29s	2m12s	4m23s	2m12s
team − protocol	0m28s	2m04s	4m06s	1m58s

The lead tail — how long the lead works after the member is already done — is the team tax: ≈2m12s (median), present in 100% of tasks, and roughly a second agent-length of work sitting on the critical path. By contrast coop+git has no integrator: its two agents finish in parallel at ≈2m39s / ≈3m26s and nobody waits, so coop's wall ≈ the slower agent. Protocol cost: dropping it trims the lead tail (2m12s → 1m58s) and overall median time for ~no accuracy change — the lead spends those turns chasing the member to mark task-list items "done."

5. Why pairs fail — merge outcomes & failure modes

CooperBench's paired features touch overlapping code by design, so merge conflict is the norm. What separates team from coop is that the lead produces one integrated tree; coop has no such step, so a conflict falls back to applying agent 1 alone (which is missing feature 2). And when a pair fails, it's mostly because the evaluated patch doesn't even run — it didn't apply or broke import, so the suite couldn't collect (0 tests ran) — not because the logic was wrong.

config	merge: conflicts	clean	identical	fail: 0 tests ran	tests ran, failed
coop + git	56%	13%	32%	72%	28%
team (full harness)	79%	20%	0%	87%	13%
team − protocol	77%	22%	0%	81%	19%

First three columns are share of all pairs; last two are share of failures. coop+git's 32% identical patches are the over-sync failure mode — both agents git pull each other and collapse to one feature's solution, losing the other. team converts most of its 79% conflicts into passes via lead integration, but its residual failures are 87% broken-patch (a partial/broken integration) — the lead is a single point of failure. In both modes the bottleneck is integration, not implementation.

Findings

Structured coordination beats solo; raw git-sharing doesn't. On the full dataset, team (~61-62%) > solo (~55%) > coop+git (~51%). Two agents with only a shared git remote do worse than one agent; the task-list + scratchpad + role split is what turns the corner.
The protocol feature is dead weight. team-no-protocol ≥ team on both the full dataset and the flash ablation — dropping the typed request/respond verbs is free or marginally better.
scratchpad and task_list are load-bearing (flash ablation: −16 / −11); mcp and auto_refresh are within noise for codex (auto_refresh is a no-op for CLI adapters by design).
msa ≈ codex solo, codex > msa at coordination. Equivalent alone; codex integrates the 2-agent+git setting better.
Accuracy costs time, and the cost is a serialized integration pass. team is slowest because the lead works a whole extra phase after the member finishes (lead tail ≈2m12s, in 100% of tasks); coop has no integrator so its agents finish in parallel. That same lead-integration step is exactly what buys team its higher pass rate (§4).
The bottleneck is integration, not implementation. Merge conflict is the norm (79% of team pairs), and most failures are a merged patch that doesn't even run (0 tests collected), not wrong logic. coop's 32% "identical patches" are agents over-syncing into a single feature (§5).
Cutting the protocol is a free speedup. It trims the lead tail (2m12s → 1m58s) with no accuracy loss — the lead stops burning turns chasing task-list bookkeeping.

Data & trajectories. Per-config CSVs sit beside this page; full agent trajectories (all 6 runs) are at huggingface.co/datasets/CooperBench/coordination-trajectories.

Caveats. Single seed; Azure gpt-5.5-hao deployment; docker backend. The Azure codex path carries no token/cost telemetry (codex's JSONL stream omits it and is incompatible with Azure, so the run used plain output). msa was scoped to solo + coop+git on flash only (no msa team-mode). Solo's timing was recorded under heavier host load than the other configs. Per-agent wall times in §4 are reconstructed from container stream.log mtimes (the Azure codex path stores no per-step telemetry), so treat them as ±a few seconds.