Compares how multi-agent coordination affects task success on CooperBench, where each task pairs two features that two agents can split. Headline: a structured team harness (lead/member + shared task list + scratchpad) beats a single agent, while plain shared-git coop underperforms solo — coordination only pays off when it's structured.
| setting | passed | pass rate | median task time |
|---|---|---|---|
| coop + git | 329/645 | 51% | 3m26s |
| solo (1 agent) | 362/652 | 56% | 3m50s |
| team (full harness) | 390/636 | 61% | 4m23s |
| team − protocol | 403/651 | 62% | 4m06s |
A pair passes only if both features' held-out suites pass against one merged tree. Green = the team harness. Median task time: coop+git fastest (each agent does one feature, in parallel); team slowest (coordination + integration overhead). Note: solo was timed under heavier host load than the others, so its true median is a bit lower.
| feature removed | task_list | scratchpad | mcp | auto_refresh | protocol | passed | Δ |
|---|---|---|---|---|---|---|---|
| (baseline) | on | on | on | on | on | 31/50 | — |
| task_list | OFF | on | on | on | on | 20/48 | -11 |
| scratchpad | on | OFF | on | on | on | 15/46 | -16 |
| mcp | on | on | OFF | on | on | 30/50 | -1 |
| auto_refresh | on | on | on | OFF | on | 30/50 | -1 |
| protocol | on | on | on | on | OFF | 35/50 | +4 |
All rows are team mode; Δ vs the all-on baseline. The lead/member split stays on in every row.
| setting | codex pass | msa pass | codex median | msa median |
|---|---|---|---|---|
| solo | 27/50 (54%) | 28/50 (56%) | 3m39s | 3m05s |
| coop+git | 25/50 (50%) | 20/48 (42%) | 3m25s | 4m57s |
Solo: msa ≈ codex (and msa a touch faster). coop+git: codex clearly better (more accurate + faster) — codex coordinates the 2-agent+git setting better. (msa coop+git had 2 tasks error out.)
Both coop and team run their agents concurrently in threads, so team's longer wall time
isn't from running agents back-to-back. It's structural: team serializes integration onto the
lead. The member implements its one feature and exits early; the lead then keeps working a whole
extra phase — pulling in the member's patch, integrating it on top of its own feature, re-running tests,
and driving the task list to "all done" before submitting. Per-agent wall is reconstructed from each
container's stream.log mtime relative to the task's start.
| config | → first claim | member finishes | lead finishes | lead tail (after member) |
|---|---|---|---|---|
| team (full harness) | 0m29s | 2m12s | 4m23s | 2m12s |
| team − protocol | 0m28s | 2m04s | 4m06s | 1m58s |
The lead tail — how long the lead works after the member is already done — is the team tax: ≈2m12s (median), present in 100% of tasks, and roughly a second agent-length of work sitting on the critical path. By contrast coop+git has no integrator: its two agents finish in parallel at ≈2m39s / ≈3m26s and nobody waits, so coop's wall ≈ the slower agent. Protocol cost: dropping it trims the lead tail (2m12s → 1m58s) and overall median time for ~no accuracy change — the lead spends those turns chasing the member to mark task-list items "done."
CooperBench's paired features touch overlapping code by design, so merge conflict is the norm. What separates team from coop is that the lead produces one integrated tree; coop has no such step, so a conflict falls back to applying agent 1 alone (which is missing feature 2). And when a pair fails, it's mostly because the evaluated patch doesn't even run — it didn't apply or broke import, so the suite couldn't collect (0 tests ran) — not because the logic was wrong.
| config | merge: conflicts | clean | identical | fail: 0 tests ran | tests ran, failed |
|---|---|---|---|---|---|
| coop + git | 56% | 13% | 32% | 72% | 28% |
| team (full harness) | 79% | 20% | 0% | 87% | 13% |
| team − protocol | 77% | 22% | 0% | 81% | 19% |
First three columns are share of all pairs; last two are share of failures.
coop+git's 32% identical patches
are the over-sync failure mode — both agents git pull each other and collapse to one feature's
solution, losing the other. team converts most of its 79%
conflicts into passes via lead integration, but its residual failures are
87% broken-patch (a partial/broken integration) —
the lead is a single point of failure. In both modes the bottleneck is integration, not implementation.
Data & trajectories. Per-config CSVs sit beside this page; full agent trajectories (all 6 runs) are at huggingface.co/datasets/CooperBench/coordination-trajectories.
gpt-5.5-hao deployment;
docker backend. The Azure codex path carries no token/cost telemetry (codex's JSONL stream omits it and is
incompatible with Azure, so the run used plain output). msa was scoped to solo + coop+git on flash only
(no msa team-mode). Solo's timing was recorded under heavier host load than the other configs.
Per-agent wall times in §4 are reconstructed from container stream.log mtimes (the Azure codex
path stores no per-step telemetry), so treat them as ±a few seconds.