Coordination Harness Pareto — temporary report

Generated 2026-05-26 · CooperBench flash subset (50 pairs) · single seed · codex + mini_swe_agent_v2, model gpt-5.5-hao

This is a synthesis of all robust (n=50) data from the coordination-design study, including the new coop+git+tasklist ("between coop+git and team") harness measured on both adapters. The Pareto plot below shows speed (x) vs accuracy (y); points up-and-to-the-left are better. Configs with a black outline are on the per-adapter Pareto frontier.

1. Pareto plot

2. Codex flash 50 spectrum

harness	pass	median wall	note
plain coop	13/48 (27%)	182s	no sharing
★ solo	24/47 (51%)	170s	1 agent, no coordination
coop+git+tasklist	24/50 (48%)	203s	NEW — the between harness (this study)
★ coop+git	28/50 (56%)	222s	peers + git, no integrator
★ team	31/50 (62%)	237s	lead + tasklist + scratchpad + integrate
★ team-noproto	35/50 (70%)	241s	team minus protocol — best accuracy

★ = on the Pareto frontier. The frontier is solo (51%/170s) → coop+git (56%/222s) → team (62%/237s) → team-noproto (70%/241s) — four points monotonically increasing in both time and accuracy. team-noproto buys +8pp over team for only +4s wall time, so for accuracy-priority workloads it's the practical pick; the protocol is dead weight.

3. mini_swe_agent_v2 flash 50 spectrum

harness	pass	median wall	note
★ solo	28/50 (56%)	186s	1 agent
coop+git	20/48 (42%)	298s	peers + git
coop+git+tasklist	25/50 (50%)	272s	NEW — between harness (+8pp vs coop+git)
★ vanilla team	29/50 (58%)	216s	lead + integration

Frontier is solo (56%/186s) → vanilla team (58%/216s). Team's accuracy edge over solo is only +2pp (within noise) at +30s cost. coop+git on msa is the worst harness (42%, 298s).

4. Headline findings

coop+git+tasklist (the "between" harness) is dominated on both adapters. On codex it's strictly worse than coop+git (48% vs 56%); on msa it's strictly worse than vanilla team (50%/272s vs 58%/216s) and dominated by solo (56%/186s).
The task list's effect on coop+git is adapter-dependent and sign-flipping — on msa it lifts accuracy by +8pp (42→50%); on codex it cuts accuracy by −8pp (56→48%). Real, measured at n=50, but the wrong sign for codex.
team-noproto is the practical accuracy winner on codex: 70% at 241s (only +4s over team for +8pp accuracy — the protocol is dead weight). The single robust merge-worthy finding from the whole exercise — and it's from the existing report's feature ablation, not anything I built.
No coordination addition beats the frontier. Across 12 variants (line-overlap, trial-merge, coop-merge, lean, scaffold-first, coop-api, coop-api+assembly, frl-loop, shared-task, coop+git+tasklist on msa, coop+git+tasklist on codex, …), every one of them is dominated by either solo or team-noproto.
Same direction across the two adapters: the parallelism savings on these features barely earn the coordination overhead, so the accuracy ceiling (~58% msa, ~70% codex) is set more by the model than by the coordination design.

Caveats. All results are single-seed (one run per pair), so differences ≲ ±7pp pass-rate / ±30s wall are within run-to-run noise. Median wall is per-pair from result.json; pass is both_passed from eval.json. 6-pair mini-set results during iteration are intentionally NOT plotted — they were too noisy to anchor conclusions. The codex coop+git+tasklist run is new (this study); the codex baselines come from the existing CooperBench report (compare-coop-flash, compare-coop-git-flash, compare-solo-flash, ablate-flash-11111 = team, ablate-flash-11110 = team-noproto). msa solo and msa coop+git come from cmp-flash-msa-*; msa vanilla team and msa coop+git+tasklist are new.

5. Autoresearch supplement (mini_swe_agent_v2 10-iteration sweep)

Independent 10-iteration autoresearch sweep of the editable team_harness/ surface on the same flash-50 / gpt-5.5-hao / mini_swe_agent_v2 setting. One code-level change kept (a reproducible mechanism for the msa frontier point), nine reverted with mechanisms, all under this report's ±7pp / ±30s noise framing. Two-tier core(10)→flash(50) screen+confirm, flash re-run for borderline winners; eval-locked to src/cooperbench/team_harness/. Full ledger and ten reproducible patches on branch autoresearch/team-harness, dir autoresearch/checkpoints/.

5a. msa frontier — a concrete code-level mechanism for the vanilla-team point

The published vanilla-team point (0.58 / 216s) on the msa frontier comes from a single-seed run. The unmodified-baseline single-seed run on the same code measured 0.50 here (within their ±7pp band of 0.58); a 10-line patch to team_harness/loop_refresh.py reliably reproduces ≈0.58 across replications (avg of 0.60 + 0.56). At least ~+4pp of the msa frontier is a fixable harness behavior rather than ambient noise.

msa run	flash pass	mean dur/pair	$/pair	note
vanilla team (this loop's unmod baseline)	0.50	275s	$3.06	1 single run; n=50
vanilla team (this published report)	0.58	216s	—	1 single run; within noise of 0.50
+ auto_refresh dedup (iter3, commit `a0b90fd`)	0.58 avg	268s	$3.15	replicated: 0.60 then 0.56

Mechanism. TeamPoller.poll() previously re-rendered and re-injected the full task-list summary before every LLM step (~17 steps / agent), producing ~17 near-identical [Team task list] open:… done:… blocks in conversation context. The patch makes poll() return None when the summary is byte-identical to the previous emission; the agent re-reads the list only when it changes, and can still call coop-task-list on demand. Pure per-step deduplication — no removed information.

5b. Mechanisms for two adapter-dependent puzzles in this report

Task-list sign-flip (+8pp msa, −8pp codex on coop+git+tasklist). auto_refresh only fires under multi-step Python-loop adapters (mini_swe_agent_v2, swe_agent, openhands_sdk). codex runs as a single codex exec shot (steps: 1), so the in-loop task-list injector is never read on codex — only the env / Redis setup latency runs, which is a small drag. Same code path, opposite net sign by adapter type.
codex team-noproto +8pp does not transfer to msa. Tested directly (iter10): msa flash 0.54 with --team-no-protocol vs the 0.58 baseline (this loop), within noise. Cause: the typed-protocol verbs (coop-request / coop-respond / coop-pending) only appear in agents/codex/'s prompt path. The mini_swe_agent_v2/config/coop.yaml and the team-mode _lead_block / _member_block reference only coop-task-* and coop-send / coop-recv. On msa the flag toggles nothing the model reads. The codex +8pp is a prompt-content effect, not a feature-mechanism effect.

5c. Two negative findings worth callouts

Lead manual integration helps (do not remove). The locked coop.yaml tells agents the eval auto-merges and warns "do not merge their branch into yours, the automatic merge will see duplicate changes and fail." The editable _lead_block instructs the lead to do exactly that, and a sampled failed task shows the lead's patch containing 100% of the member's added lines — the visible duplicate. Realigning the lead block to the auto-merge model (iter5) dropped pass 0.58 → 0.53. Breakdown:

config naive (fail) identical (pass-prone) solo-agent1

baseline 40 (22 fail) 9 1

lead realigned (iter5) 48 (22 fail) 1 0

Naive-merge failures did not drop (still 22) — the duplicate merges textually clean. What collapsed were the identical-merge passes (9 → 1): when the lead applies a member's patch verbatim and submits the combined tree, both patches converge byte-identical, the eval takes its fast "identical" path, and the result passes. The duplicate worry merges cleanly more often than it breaks. Keep the manual lead integration.
The JIT 3-way merge probe is provably already-correctly-scoped. team_harness/jit_merge.py runs git merge-file in-container every step under msa. The 22 dominant naive-merge failures all happen on clean merges — trees compose textually but break semantically (both reimplement the same helper differently, etc.). Extending the probe to flag clean disjoint co-edits is forbidden by its own tests (test_disjoint_edits_not_flagged, and correctly so), and no static probe can see semantic breaks without running the tests. The probe is not a tuning target; the failures are upstream, in agent decomposition.

config	naive (fail)	identical (pass-prone)	solo-agent1
baseline	40 (22 fail)	9	1
lead realigned (iter5)	48 (22 fail)	1	0

5d. Full sweep (10 iterations, mini_swe_agent_v2 unless noted)

iter	hypothesis	flash pass	dur/pair	decision
3	dedupe `auto_refresh` per-step injection (`loop_refresh.py`)	0.58	268s	kept · `a0b90fd`
1	lead `git apply --3way` (+`--reject` fallback)	0.64*	285s	reverted
2	members defer shared-region edits + apply-verbatim mandate	0.60*	254s	reverted
4	all-done one-time integration nudge in `format_task_summary`	0.56	258s	reverted (null)
5	realign lead+member to auto-merge model (lead stops `git apply`)	0.53	261s	reverted — see 5c
6	enable `--git` collaboration (shared remote + dry-run merge)	0.58	286s	reverted — composition fixes (naive 22→17) but ~+18s overhead → J<best
7	uncap `cost_limit` (3.0 → 20.0) via `--agent-config`	0.60 avg	287s	reverted — rescues ~7-8% truncated agents, aggregate within noise
8	trim one-time CLI docs (`_TEAM_LIST_USAGE`)	0.52	261s	reverted — one-time scaffolding content carries its weight
9	reorder `format_task_summary` with own tasks first	0.56	281s	reverted (null)
10	`--team-no-protocol` on msa	0.54	250s	reverted (null on msa; see 5b)

* iters 1–2 ran on the codex epoch (msa baseline was 0.50) before a mid-study agent switch; included for completeness, not directly comparable to the msa-epoch numbers above.

5e. What this confirms in the original report

The original "no coordination addition beats the frontier" verdict replicates: across 10 attempts on the editable surface, only a per-step removal moves the msa frontier, and only marginally.
The cross-adapter pattern "the accuracy ceiling is set more by the model than by the coordination design" is reinforced. 10 autoresearch iterations + this report's 12 variants = 22 honest attempts → one ~+4pp removal-of-repetition win. Strong evidence to invest in model capability rather than further coordination redesign on this surface.
The recommended codex winner (team-noproto) stands; the msa equivalent recommendation is the dedup patch (a0b90fd).

Supplement appended 2026-05-26 by an autoresearch loop on CooperBench branch autoresearch/team-harness. Ledger: autoresearch/ledger.jsonl · checkpoints: autoresearch/checkpoints/iter{N}-{kept|reverted}.patch (each manifest entry names its base_commit so any version is reproducible).