Coordination Harness Pareto — temporary report

Generated 2026-05-26 · CooperBench flash subset (50 pairs) · single seed · codex + mini_swe_agent_v2, model gpt-5.5-hao

This report synthesizes all robust (nominal n=50) data from the coordination-design study, including the new coop+git+tasklist harness (positioned between coop+git and team), measured on both adapters. A few runs scored 47 or 48 pairs rather than 50, as the pass denominators show. The Pareto plot below shows speed (x) vs accuracy (y); points up-and-to-the-left are better. Configs with a black outline are on the per-adapter Pareto frontier.

1. Pareto plot

[Figure: Pareto scatter plot. x-axis: median wall time (per pair, seconds), 150s to 300s. y-axis: pass rate (both features), 20% to 70%. Adapters: codex (flash 50) and mini_swe_agent_v2 (flash 50). Black outline = on Pareto frontier; dashed line = frontier per adapter; up-and-left = better (higher pass rate, lower wall time). The plotted points are the (pass rate, median wall) pairs listed in sections 2 and 3.]

2. Codex flash 50 spectrum

  harness           | pass        | median wall | note
  plain coop        | 13/48 (27%) | 182s        | no sharing
★ solo              | 24/47 (51%) | 170s        | 1 agent, no coordination
  coop+git+tasklist | 24/50 (48%) | 203s        | NEW: the between harness (this study)
★ coop+git          | 28/50 (56%) | 222s        | peers + git, no integrator
★ team              | 31/50 (62%) | 237s        | lead + tasklist + scratchpad + integrate
★ team-noproto      | 35/50 (70%) | 241s        | team minus protocol; best accuracy

★ = on the Pareto frontier. The frontier is solo (51%/170s) → coop+git (56%/222s) → team (62%/237s) → team-noproto (70%/241s): four points monotonically increasing in both time and accuracy. team-noproto buys +8pp over team for only +4s wall time, so for accuracy-priority workloads it is the practical pick. On this single-seed evidence the protocol adds no measurable accuracy on codex, though the +8pp gap is only just outside the ±7pp noise band.
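The frontier membership above can be re-derived from the table with a few lines of Python. This is a minimal sketch, assuming the usual dominance rule (a config is dominated if another is at least as good on both axes and strictly better on one). The `Config` type and `pareto_frontier` function are illustrative names, not part of the CooperBench code.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    pass_rate: float  # percent of pairs where both features pass
    wall_s: float     # median wall time per pair, in seconds


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configs that no other config dominates, sorted by wall time."""
    def dominates(a: Config, b: Config) -> bool:
        # a dominates b: at least as good on both axes, strictly better on one.
        return (a.pass_rate >= b.pass_rate and a.wall_s <= b.wall_s
                and (a.pass_rate > b.pass_rate or a.wall_s < b.wall_s))

    frontier = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(frontier, key=lambda c: c.wall_s)


# Values copied from the codex flash 50 table.
CODEX = [
    Config("plain coop", 27, 182),
    Config("solo", 51, 170),
    Config("coop+git+tasklist", 48, 203),
    Config("coop+git", 56, 222),
    Config("team", 62, 237),
    Config("team-noproto", 70, 241),
]

print([c.name for c in pareto_frontier(CODEX)])
# ['solo', 'coop+git', 'team', 'team-noproto']
```

plain coop and coop+git+tasklist both drop out because solo is faster and more accurate than either.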

3. mini_swe_agent_v2 flash 50 spectrum

  harness           | pass        | median wall | note
★ solo              | 28/50 (56%) | 186s        | 1 agent
  coop+git          | 20/48 (42%) | 298s        | peers + git
  coop+git+tasklist | 25/50 (50%) | 272s        | NEW: between harness (+8pp vs coop+git)
★ vanilla team      | 29/50 (58%) | 216s        | lead + integration

The frontier is solo (56%/186s) → vanilla team (58%/216s). Vanilla team's accuracy edge over solo is only +2pp (within noise) at a +30s cost. coop+git is the worst harness on mini_swe_agent_v2 (msa): 42%, 298s.

4. Headline findings

Caveats. All results are single-seed (one run per pair), so differences within roughly ±7pp pass rate or ±30s wall time should be treated as run-to-run noise. Median wall time is per pair, from result.json; pass is both_passed from eval.json. The 6-pair mini-set results collected during iteration are intentionally not plotted; they were too noisy to anchor conclusions. The codex coop+git+tasklist run is new (this study); the codex baselines come from the existing CooperBench report (compare-coop-flash, compare-coop-git-flash, compare-solo-flash, ablate-flash-11111 = team, ablate-flash-11110 = team-noproto). msa solo and msa coop+git come from cmp-flash-msa-*; msa vanilla team and msa coop+git+tasklist are new.
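The noise framing can be applied mechanically to any pair of configurations. The sketch below assumes the approximate ±7pp / ±30s band is treated as a hard cutoff, which is a simplification of the "roughly" in the caveat; `exceeds_noise` is a hypothetical helper, not something from the study's tooling.

```python
PASS_NOISE_PP = 7.0   # from the caveat: about ±7pp pass rate
WALL_NOISE_S = 30.0   # from the caveat: about ±30s wall time


def exceeds_noise(delta_pp: float, delta_wall_s: float) -> dict[str, bool]:
    """Flag which deltas fall outside the single-seed noise band.

    Treating the band as a strict threshold is an assumption of this sketch.
    """
    return {
        "pass_rate": abs(delta_pp) > PASS_NOISE_PP,
        "wall_time": abs(delta_wall_s) > WALL_NOISE_S,
    }


# codex: team-noproto (70%, 241s) vs team (62%, 237s)
print(exceeds_noise(70 - 62, 241 - 237))
# {'pass_rate': True, 'wall_time': False}

# msa: vanilla team (58%, 216s) vs solo (56%, 186s)
print(exceeds_noise(58 - 56, 216 - 186))
# {'pass_rate': False, 'wall_time': False}
```

Under this reading, the codex team-noproto gain clears the pass-rate band by 1pp, while the msa vanilla-team vs solo comparison is inside the band on both axes.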

5. Autoresearch supplement (mini_swe_agent_v2 10-iteration sweep)

An independent 10-iteration autoresearch sweep of the editable team_harness/ surface, run on the same flash-50 / gpt-5.5-hao / mini_swe_agent_v2 setting. One code-level change was kept (a reproducible mechanism for the msa frontier point) and nine were reverted, each with a stated mechanism; all are read under this report's ±7pp / ±30s noise framing. Screening was two-tier, core(10) → flash(50) screen+confirm, with a flash re-run for borderline winners; eval-locked to src/cooperbench/team_harness/. The full ledger and ten reproducible patches are on branch autoresearch/team-harness, dir autoresearch/checkpoints/.

5a. msa frontier — a concrete code-level mechanism for the vanilla-team point

The published vanilla-team point (0.58 / 216s) on the msa frontier comes from a single-seed run. The unmodified-baseline single-seed run on the same code measured 0.50 here, 8pp below the published 0.58 and so at the edge of this report's ±7pp noise band. A 10-line patch to team_harness/loop_refresh.py reproduces ≈0.58 across two replications (0.60 and 0.56, mean 0.58). This suggests that at least ~+4pp of the msa frontier is a fixable harness behavior rather than ambient noise.

msa run                                        | flash pass | mean dur/pair | $/pair | note
vanilla team (this loop's unmodified baseline) | 0.50       | 275s          | $3.06  | 1 single run; n=50
vanilla team (this published report)           | 0.58       | 216s          |        | 1 single run; 8pp above 0.50, at the edge of the ±7pp noise band
+ auto_refresh dedup (iter3, commit a0b90fd)   | 0.58 avg   | 268s          | $3.15  | replicated: 0.60 then 0.56
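The replication arithmetic behind the 0.58 average can be checked directly. This sketch works in integer percentage points to avoid floating-point rounding, and assumes every run here used the full 50-pair denominator (stated for the baseline row, assumed for the two replications).

```python
N_PAIRS = 50  # flash subset size; assumed denominator for every run in this check

baseline_pp = 50             # unmodified vanilla team, this loop (0.50)
replications_pp = [60, 56]   # patched runs (0.60 then 0.56)

mean_pp = sum(replications_pp) / len(replications_pp)
delta_pp = mean_pp - baseline_pp          # gain over the unmodified baseline
delta_pairs = delta_pp * N_PAIRS / 100    # same gain expressed in pairs
pp_per_pair = 100 / N_PAIRS               # resolution of one pair at n=50

print(mean_pp, delta_pp, delta_pairs, pp_per_pair)
# 58.0 8.0 4.0 2.0
```

At n=50 one pair is worth 2pp, so the 8pp gap between the patched mean and the unmodified baseline corresponds to 4 pairs.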

Mechanism. TeamPoller.poll() previously re-rendered and re-injected the full task-list summary before every LLM step (~17 steps / agent), producing ~17 near-identical [Team task list] open:… done:… blocks in conversation context. The patch makes poll() return None when the summary is byte-identical to the previous emission; the agent re-reads the list only when it changes, and can still call coop-task-list on demand. This is pure per-step deduplication; no information is removed.
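The deduplication described above can be sketched as follows. This is not the actual patch to loop_refresh.py: `DedupPoller`, its constructor argument, and the `render` helper are hypothetical stand-ins, and only the "return None when byte-identical to the last emission" behavior is taken from the description.

```python
from typing import Callable, Optional


class DedupPoller:
    """Hypothetical stand-in for TeamPoller; only the dedup logic is illustrated."""

    def __init__(self, render_summary: Callable[[], str]) -> None:
        self._render_summary = render_summary
        self._last_emitted: Optional[bytes] = None

    def poll(self) -> Optional[str]:
        summary = self._render_summary()
        encoded = summary.encode("utf-8")
        if encoded == self._last_emitted:
            # Byte-identical to the previous emission: inject nothing this step.
            return None
        self._last_emitted = encoded
        return summary


# Toy task state standing in for the shared team task list.
state = {"open": ["t1", "t2"], "done": []}


def render() -> str:
    return (f"[Team task list] open: {', '.join(state['open'])} "
            f"done: {', '.join(state['done'])}")


poller = DedupPoller(render)
first = poller.poll()    # first call emits the summary
second = poller.poll()   # nothing changed, so None
state["open"].remove("t1")
state["done"].append("t1")
third = poller.poll()    # list changed, so the new summary is emitted
```

Comparing against the last emitted summary, rather than the last rendered one, means a change is never suppressed: the two are the same in this sketch, but the distinction matters if rendering and emission are ever separated.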

5b. Mechanisms for two adapter-dependent puzzles in this report

5c. Two negative findings worth callouts

5d. Full sweep (10 iterations, mini_swe_agent_v2 unless noted)

iter | hypothesis | flash pass | dur/pair | decision
3  | dedupe auto_refresh per-step injection (loop_refresh.py) | 0.58 | 268s | kept · a0b90fd
1  | lead git apply --3way (+--reject fallback) | 0.64* | 285s | reverted
2  | members defer shared-region edits + apply-verbatim mandate | 0.60* | 254s | reverted
4  | all-done one-time integration nudge in format_task_summary | 0.56 | 258s | reverted (null)
5  | realign lead+member to auto-merge model (lead stops git apply) | 0.53 | 261s | reverted; see 5c
6  | enable --git collaboration (shared remote + dry-run merge) | 0.58 | 286s | reverted: composition fixes (naive 22→17) but ~+18s overhead → J<best
7  | uncap cost_limit (3.0 → 20.0) via --agent-config | 0.60 avg | 287s | reverted: rescues ~7-8% truncated agents, aggregate within noise
8  | trim one-time CLI docs (_TEAM_LIST_USAGE) | 0.52 | 261s | reverted: one-time scaffolding content carries its weight
9  | reorder format_task_summary with own tasks first | 0.56 | 281s | reverted (null)
10 | --team-no-protocol on msa | 0.54 | 250s | reverted (null on msa; see 5b)

* iters 1–2 ran on the codex epoch (msa baseline was 0.50) before a mid-study agent switch; included for completeness, not directly comparable to the msa-epoch numbers above.

5e. What this confirms in the original report

Supplement appended 2026-05-26 by an autoresearch loop on CooperBench branch autoresearch/team-harness. Ledger: autoresearch/ledger.jsonl · checkpoints: autoresearch/checkpoints/iter{N}-{kept|reverted}.patch (each manifest entry names its base_commit so any version is reproducible).