CooperBench Coordination Study — full dataset

Generated 2026-05-23 · agent codex (+ mini_swe_agent_v2 head-to-head) · model gpt-5.5-hao (Azure) · backend docker · 1 seed

Compares how multi-agent coordination affects task success on CooperBench, where each task pairs two features that two agents can split. Headline: a structured team harness (lead/member + shared task list + scratchpad) beats a single agent, while plain shared-git coop underperforms solo — coordination only pays off when it's structured.

1. Setting comparison — full dataset (652 pairs)

| setting | passed | pass rate | median task time |
|---|---|---|---|
| coop + git | 329/645 | 51% | 3m26s |
| solo (1 agent) | 362/652 | 56% | 3m50s |
| team (full harness) | 390/636 | 61% | 4m23s |
| team − protocol | 403/651 | 62% | 4m06s |

A pair passes only if both features' held-out suites pass against one merged tree. Median task time: coop+git is fastest (each agent does one feature, in parallel); team is slowest (coordination plus integration overhead). Note: solo was timed under heavier host load than the others, so its true median is probably a bit lower.
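The pass criterion and the pass-rate column can be sketched in a few lines. This is a minimal illustration, not the benchmark's own scoring code: the `PairResult` type and both function names are hypothetical, and only the counts are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Hypothetical record: did each feature's held-out suite pass on the merged tree?"""
    feature1_suite_passed: bool
    feature2_suite_passed: bool


def pair_passes(result: PairResult) -> bool:
    # A pair counts as passed only when BOTH suites pass against the same tree.
    return result.feature1_suite_passed and result.feature2_suite_passed


def pass_rate(passed: int, scored: int) -> float:
    """Pass rate in percent over the pairs that were actually scored."""
    return 100.0 * passed / scored


# (passed, scored) counts copied from the setting-comparison table.
SETTINGS = {
    "coop + git": (329, 645),
    "solo (1 agent)": (362, 652),
    "team (full harness)": (390, 636),
    "team - protocol": (403, 651),
}

if __name__ == "__main__":
    for name, (passed, scored) in SETTINGS.items():
        print(f"{name}: {pass_rate(passed, scored):.1f}%")
```

Because the denominators differ between settings, comparing rates rather than raw counts is the safer reading of the table.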

2. Feature ablation — flash (50 pairs), one feature off per row

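| feature removed | task_list | scratchpad | mcp | auto_refresh | protocol | passed | Δ |
|---|---|---|---|---|---|---|---|
| (baseline) | on | on | on | on | on | 31/50 | |
| task_list | OFF | on | on | on | on | 20/48 | -11 |
| scratchpad | on | OFF | on | on | on | 15/46 | -16 |
| mcp | on | on | OFF | on | on | 30/50 | -1 |
| auto_refresh | on | on | on | OFF | on | 30/50 | -1 |
| protocol | on | on | on | on | OFF | 35/50 | +4 |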

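All rows are team mode. Δ is the change in passed-pair count versus the all-on baseline (31/50); note that the task_list and scratchpad rows scored 48 and 46 pairs rather than 50. The lead/member split stays on in every row.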
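Since two ablation rows have fewer scored pairs than the baseline, it can help to look at the change in pass rate alongside the raw count difference. The sketch below recomputes both from the table; the function names are illustrative and the percentage-point view is an added reading, not a column in the original report.

```python
# (passed, scored) counts copied from the ablation table.
BASELINE = (31, 50)
ABLATIONS = {
    "task_list": (20, 48),
    "scratchpad": (15, 46),
    "mcp": (30, 50),
    "auto_refresh": (30, 50),
    "protocol": (35, 50),
}


def count_delta(row, base=BASELINE):
    """Difference in passed pairs, as reported in the table's last column."""
    return row[0] - base[0]


def rate_delta_pp(row, base=BASELINE):
    """Difference in pass rate, in percentage points, using each row's own denominator."""
    return 100.0 * row[0] / row[1] - 100.0 * base[0] / base[1]


if __name__ == "__main__":
    for feature, row in ABLATIONS.items():
        print(f"{feature:>12}: {count_delta(row):+d} pairs, {rate_delta_pp(row):+.1f} pp")
```

Either view gives the same ranking: removing the scratchpad or the task list hurts most, mcp and auto_refresh barely matter, and removing the protocol helps.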

3. msa vs codex — flash (50 pairs, same model)

| setting | codex pass | msa pass | codex median | msa median |
|---|---|---|---|---|
| solo | 27/50 (54%) | 28/50 (56%) | 3m39s | 3m05s |
| coop+git | 25/50 (50%) | 20/48 (42%) | 3m25s | 4m57s |

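Solo: msa (mini_swe_agent_v2) and codex are roughly level (28/50 vs 27/50), with msa about 34s faster at the median. coop+git: codex is clearly better, both more accurate (50% vs 42%) and faster (3m25s vs 4m57s), which suggests codex handles the two-agent shared-git setting better. Two msa coop+git tasks errored out, hence its 48-pair denominator.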

4. Why team is slowest — timing decomposition

Both coop and team run their agents concurrently in threads, so team's longer wall time isn't from running agents back-to-back. It's structural: team serializes integration onto the lead. The member implements its one feature and exits early; the lead then keeps working through a whole extra phase: pulling in the member's patch, integrating it on top of its own feature, re-running tests, and driving the task list to "all done" before submitting. Per-agent wall time is reconstructed from each container's stream.log mtime relative to the task's start.
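The mtime-based reconstruction described above might look roughly like the following. This is a sketch under assumptions: the directory layout, the function names, and the demo timestamps are all hypothetical, and the only idea taken from the text is "last write to stream.log minus task start".

```python
import os
import tempfile
from pathlib import Path


def agent_wall_seconds(stream_log: Path, task_start: float) -> float:
    """Wall time of one agent: last write to its stream.log minus the task start."""
    return stream_log.stat().st_mtime - task_start


def lead_tail_seconds(lead_log: Path, member_log: Path, task_start: float) -> float:
    """How long the lead kept working after the member's log went quiet."""
    lead = agent_wall_seconds(lead_log, task_start)
    member = agent_wall_seconds(member_log, task_start)
    return max(0.0, lead - member)


def demo() -> tuple:
    """Build two fake logs with known mtimes and decompose the timing."""
    task_start = 1_700_000_000.0  # arbitrary epoch second, for illustration only
    with tempfile.TemporaryDirectory() as tmp:
        member_log = Path(tmp) / "member" / "stream.log"
        lead_log = Path(tmp) / "lead" / "stream.log"
        # Illustrative finish offsets in seconds; not measurements from the study.
        for log, finished_after in ((member_log, 132), (lead_log, 263)):
            log.parent.mkdir(parents=True)
            log.write_text("agent output\n")
            # Force the mtime so the example is deterministic.
            stamp = task_start + finished_after
            os.utime(log, (stamp, stamp))
        return (
            agent_wall_seconds(member_log, task_start),
            agent_wall_seconds(lead_log, task_start),
            lead_tail_seconds(lead_log, member_log, task_start),
        )


if __name__ == "__main__":
    member, lead, tail = demo()
    print(f"member {member:.0f}s, lead {lead:.0f}s, lead tail {tail:.0f}s")
```

An mtime only records the last write, so an agent that stops logging before it actually exits would look faster than it was; that is one reason to treat these numbers as approximate.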

| config | time to first claim | member finishes | lead finishes | lead tail (after member) |
|---|---|---|---|---|
| team (full harness) | 0m29s | 2m12s | 4m23s | 2m12s |
| team − protocol | 0m28s | 2m04s | 4m06s | 1m58s |

The lead tail, the time the lead keeps working after the member is already done, is the team tax: a median of ≈2m12s, present in 100% of tasks, and roughly a second agent's worth of work sitting on the critical path. By contrast, coop+git has no integrator: its two agents finish in parallel at ≈2m39s and ≈3m26s and nobody waits, so coop's wall time ≈ that of the slower agent. Protocol cost: dropping the protocol trims the lead tail (2m12s → 1m58s) and the overall median time (4m23s → 4m06s) with no loss in pass rate (61% → 62% on the full dataset). With the protocol on, the lead spends those extra turns chasing the member to mark task-list items "done."
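The two wall-time shapes can be written down as a toy model: coop's wall time is the slower of two parallel agents, while team's is the member's finish time plus the lead tail. The helper names below are illustrative, and the inputs are the medians quoted in this section.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string such as '2m12s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def format_duration(seconds: int) -> str:
    return f"{seconds // 60}m{seconds % 60:02d}s"


def coop_wall(agent_finishes):
    """No integrator: the task ends when the slower of the parallel agents ends."""
    return max(agent_finishes)


def team_wall(member_finish: int, lead_tail: int) -> int:
    """Integration is serialized onto the lead, so the tail adds to the critical path."""
    return member_finish + lead_tail


if __name__ == "__main__":
    coop = coop_wall([parse_duration("2m39s"), parse_duration("3m26s")])
    team = team_wall(parse_duration("2m12s"), parse_duration("2m12s"))
    no_protocol = team_wall(parse_duration("2m04s"), parse_duration("1m58s"))
    print("coop+git:", format_duration(coop))
    print("team (full harness):", format_duration(team))
    print("team - protocol:", format_duration(no_protocol))
```

The team sums land within a few seconds of the reported medians (4m23s and 4m06s) rather than exactly on them, which is expected: each column is a median taken separately, and medians do not add exactly.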

5. Why pairs fail — merge outcomes & failure modes

CooperBench's paired features touch overlapping code by design, so merge conflict is the norm. What separates team from coop is that the lead produces one integrated tree; coop has no such step, so a conflict falls back to applying agent 1's patch alone (which is missing feature 2). And when a pair fails, it's mostly because the evaluated patch doesn't even run: it failed to apply or broke an import, so the suite couldn't collect (0 tests ran). Wrong logic is the less common cause.

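| config | merge: conflicts | merge: clean | merge: identical | failures: 0 tests ran | failures: tests ran, failed |
|---|---|---|---|---|---|
| coop + git | 56% | 13% | 32% | 72% | 28% |
| team (full harness) | 79% | 20% | 0% | 87% | 13% |
| team − protocol | 77% | 22% | 0% | 81% | 19% |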

First three columns are share of all pairs; last two are share of failures. coop+git's 32% identical patches are the over-sync failure mode — both agents git pull each other and collapse to one feature's solution, losing the other. team converts most of its 79% conflicts into passes via lead integration, but its residual failures are 87% broken-patch (a partial/broken integration) — the lead is a single point of failure. In both modes the bottleneck is integration, not implementation.
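The outcome categories in the table can be expressed as a small classifier. This is a hedged sketch: the report does not say how patches are compared or how "0 tests ran" is detected, so the string comparison, the `merged_cleanly` flag, and every name below are assumptions made for illustration.

```python
from enum import Enum


class MergeOutcome(Enum):
    CONFLICT = "conflict"
    CLEAN = "clean"
    IDENTICAL = "identical"


def classify_merge(patch_a: str, patch_b: str, merged_cleanly: bool) -> MergeOutcome:
    """Bucket a pair of agent patches into the table's three merge categories."""
    # Assumption: "identical" means both agents submitted the same patch text.
    if patch_a.strip() == patch_b.strip():
        return MergeOutcome.IDENTICAL
    return MergeOutcome.CLEAN if merged_cleanly else MergeOutcome.CONFLICT


def coop_patch_to_evaluate(patch_a: str, merged_patch: str, outcome: MergeOutcome) -> str:
    """In coop mode a conflict falls back to agent 1's patch alone."""
    return patch_a if outcome is MergeOutcome.CONFLICT else merged_patch


def classify_failure(tests_collected: int) -> str:
    """Split failures into 'patch did not run' versus 'logic was wrong'."""
    return "0 tests ran" if tests_collected == 0 else "tests ran, failed"


if __name__ == "__main__":
    outcome = classify_merge("diff for feature 1", "diff for feature 2", merged_cleanly=False)
    print(outcome.value)
    print(coop_patch_to_evaluate("diff for feature 1", "merged diff", outcome))
    print(classify_failure(tests_collected=0))
```

Separating "0 tests ran" from "tests ran, failed" is what lets the analysis attribute most failures to integration rather than to the feature logic itself.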

Findings

Data & trajectories. Per-config CSVs sit beside this page; full agent trajectories (all 6 runs) are at huggingface.co/datasets/CooperBench/coordination-trajectories.

Caveats. Single seed; Azure gpt-5.5-hao deployment; docker backend. The Azure codex path carries no token/cost telemetry (codex's JSONL stream omits it and is incompatible with Azure, so the run used plain output). msa was scoped to solo and coop+git on flash only (no msa team mode). Solo's timing was recorded under heavier host load than the other configs. Per-agent wall times in §4 are reconstructed from container stream.log mtimes (the Azure codex path stores no per-step telemetry), so treat them as accurate only to within a few seconds.