This is a synthesis of all robust (n=50) data from the coordination-design study, including the new coop+git+tasklist ("between coop+git and team") harness measured on both adapters. The Pareto plot below shows speed (x) vs accuracy (y); points up-and-to-the-left are better. Configs with a black outline are on the per-adapter Pareto frontier.
| harness | pass | median wall | note |
|---|---|---|---|
| plain coop | 13/48 (27%) | 182s | no sharing |
| ★ solo | 24/47 (51%) | 170s | 1 agent, no coordination |
| coop+git+tasklist | 24/50 (48%) | 203s | NEW — the between harness (this study) |
| ★ coop+git | 28/50 (56%) | 222s | peers + git, no integrator |
| ★ team | 31/50 (62%) | 237s | lead + tasklist + scratchpad + integrate |
| ★ team-noproto | 35/50 (70%) | 241s | team minus protocol — best accuracy |
★ = on the Pareto frontier. The frontier is solo (51%/170s) → coop+git (56%/222s) → team (62%/237s) → team-noproto (70%/241s) — four points monotonically increasing in both time and accuracy. team-noproto buys +8pp over team for only +4s wall time, so for accuracy-priority workloads it's the practical pick; the protocol is dead weight.
| harness | pass | median wall | note |
|---|---|---|---|
| ★ solo | 28/50 (56%) | 186s | 1 agent |
| coop+git | 20/48 (42%) | 298s | peers + git |
| coop+git+tasklist | 25/50 (50%) | 272s | NEW — between harness (+8pp vs coop+git) |
| ★ vanilla team | 29/50 (58%) | 216s | lead + integration |
Frontier is solo (56%/186s) → vanilla team (58%/216s). Team's accuracy edge over solo is only +2pp (within noise) at +30s cost. coop+git on msa is the worst harness (42%, 298s).
result.json; pass is both_passed from eval.json.
6-pair mini-set results during iteration are intentionally NOT plotted — they were too noisy
to anchor conclusions. The codex coop+git+tasklist run is new (this study); the codex baselines
come from the existing CooperBench report (compare-coop-flash, compare-coop-git-flash,
compare-solo-flash, ablate-flash-11111 = team, ablate-flash-11110 = team-noproto). msa solo
and msa coop+git come from cmp-flash-msa-*; msa vanilla team and msa coop+git+tasklist are new.Independent 10-iteration autoresearch sweep of the editable
team_harness/ surface on the same flash-50 / gpt-5.5-hao /
mini_swe_agent_v2 setting. One code-level change kept (a reproducible
mechanism for the msa frontier point), nine reverted with mechanisms, all under
this report's ±7pp / ±30s noise framing. Two-tier core(10)→flash(50) screen+confirm,
flash re-run for borderline winners; eval-locked to
src/cooperbench/team_harness/. Full ledger and ten reproducible patches
on branch autoresearch/team-harness, dir
autoresearch/checkpoints/.
The published vanilla-team point (0.58 / 216s) on the msa
frontier comes from a single-seed run. The unmodified-baseline single-seed run on
the same code measured 0.50 here (within their ±7pp band of 0.58);
a 10-line patch to team_harness/loop_refresh.py reliably reproduces
≈0.58 across replications (avg of 0.60 + 0.56). At least ~+4pp of the msa
frontier is a fixable harness behavior rather than ambient noise.
| msa run | flash pass | mean dur/pair | $/pair | note |
|---|---|---|---|---|
| vanilla team (this loop's unmod baseline) | 0.50 | 275s | $3.06 | 1 single run; n=50 |
| vanilla team (this published report) | 0.58 | 216s | — | 1 single run; within noise of 0.50 |
+ auto_refresh dedup (iter3, commit a0b90fd) | 0.58 avg | 268s | $3.15 | replicated: 0.60 then 0.56 |
Mechanism. TeamPoller.poll() previously re-rendered
and re-injected the full task-list summary before every LLM step
(~17 steps / agent), producing ~17 near-identical
[Team task list] open:… done:… blocks in conversation context. The
patch makes poll() return None when the summary is
byte-identical to the previous emission; the agent re-reads the list only when it
changes, and can still call coop-task-list on demand. Pure per-step
deduplication — no removed information.
auto_refresh only fires under multi-step Python-loop adapters
(mini_swe_agent_v2, swe_agent, openhands_sdk).
codex runs as a single codex exec shot (steps: 1), so the
in-loop task-list injector is never read on codex — only the env / Redis setup
latency runs, which is a small drag. Same code path, opposite net sign by adapter
type.--team-no-protocol vs the
0.58 baseline (this loop), within noise. Cause: the typed-protocol verbs
(coop-request / coop-respond / coop-pending)
only appear in agents/codex/'s prompt path. The
mini_swe_agent_v2/config/coop.yaml and the team-mode
_lead_block / _member_block reference only
coop-task-* and coop-send / coop-recv. On
msa the flag toggles nothing the model reads. The codex +8pp is a
prompt-content effect, not a feature-mechanism effect.coop.yaml tells agents the eval auto-merges and warns "do not merge
their branch into yours, the automatic merge will see duplicate changes and fail."
The editable _lead_block instructs the lead to do exactly that, and a
sampled failed task shows the lead's patch containing 100% of the member's added
lines — the visible duplicate. Realigning the lead block to the auto-merge model
(iter5) dropped pass 0.58 → 0.53. Breakdown:
| config | naive (fail) | identical (pass-prone) | solo-agent1 |
|---|---|---|---|
| baseline | 40 (22 fail) | 9 | 1 |
| lead realigned (iter5) | 48 (22 fail) | 1 | 0 |
team_harness/jit_merge.py runs git merge-file in-container
every step under msa. The 22 dominant naive-merge failures all happen on
clean merges — trees compose textually but break semantically (both
reimplement the same helper differently, etc.). Extending the probe to flag clean
disjoint co-edits is forbidden by its own tests (test_disjoint_edits_not_flagged,
and correctly so), and no static probe can see semantic breaks without running the
tests. The probe is not a tuning target; the failures are upstream, in agent
decomposition.| iter | hypothesis | flash pass | dur/pair | decision |
|---|---|---|---|---|
| 3 | dedupe auto_refresh per-step injection (loop_refresh.py) | 0.58 | 268s | kept · a0b90fd |
| 1 | lead git apply --3way (+--reject fallback) | 0.64* | 285s | reverted |
| 2 | members defer shared-region edits + apply-verbatim mandate | 0.60* | 254s | reverted |
| 4 | all-done one-time integration nudge in format_task_summary | 0.56 | 258s | reverted (null) |
| 5 | realign lead+member to auto-merge model (lead stops git apply) | 0.53 | 261s | reverted — see 5c |
| 6 | enable --git collaboration (shared remote + dry-run merge) | 0.58 | 286s | reverted — composition fixes (naive 22→17) but ~+18s overhead → J<best |
| 7 | uncap cost_limit (3.0 → 20.0) via --agent-config | 0.60 avg | 287s | reverted — rescues ~7-8% truncated agents, aggregate within noise |
| 8 | trim one-time CLI docs (_TEAM_LIST_USAGE) | 0.52 | 261s | reverted — one-time scaffolding content carries its weight |
| 9 | reorder format_task_summary with own tasks first | 0.56 | 281s | reverted (null) |
| 10 | --team-no-protocol on msa | 0.54 | 250s | reverted (null on msa; see 5b) |
* iters 1–2 ran on the codex epoch (msa baseline was 0.50) before a mid-study agent switch; included for completeness, not directly comparable to the msa-epoch numbers above.
a0b90fd).