CooperAgents — Unified Self-Evolving Harness

Generated 2026-06-08, finalized 2026-06-09 · system cooperagents (standalone unified team+agent harness) · agent mini-swe-agent (held constant) · model gpt-5.5 · evaluator cooperbench eval · benchmark: fixed 10 flash pairs · backend docker (arm64 via QEMU) · 1 seed

Hard constraint: every agent runs in its own container — agents never share a live workspace. The coordinated team seeds each fresh container with teammates' committed diff (git apply) so an agent still builds on prior work without sharing a running container. (Earlier "shared single container" results are superseded; see §5.)

cooperagents unifies the previously-separate team and agent harnesses into one orchestrator: agents and the supervisor share a single bus and one git working tree, and the agent loop (vendored mini-swe-agent) is editable in-repo. This lets us co-optimize the team layer and the agent loop together ("the seam") rather than treating the agent as a black box. CooperBench is used only as a task source and evaluator (never modified).

Why mini-swe, not codex. The objective is a better co-designed system, not a leaderboard number. A black-box agent (codex) forecloses co-design and makes any gain un-attributable to the harness. CooperBench's published codex numbers (solo 48% / team 62%) are context, not a target to game. We hold the agent constant (mini-swe) so every delta is attributable to the harness/seam.

0. Lead result — C1: a regression-guarded sequential handoff (method → result)

Be precise about "coordination." C1 is sequential: agents run one after another, each in its own container seeded with the prior agent's committed diff, with no concurrency and no two-way negotiation. It is coordination only in the narrow sense that one agent's behavior is constrained by another's explicit contract (a published check). Every genuinely concurrent method we built underperformed (see the ranking on coupled set-1 below), so on coupled work the coordination that helps is a sequential contract handoff, not parallelism.

Method — diagnose, then target. CooperBench couples its feature pairs on purpose, so we first asked where a team fails in a way a single agent would not. Reading the sequential team's per-feature grader output on coupled tasks, the dominant such failure is f1=F, f2=P: the second agent, building on the first's work, silently breaks the first agent's feature while implementing its own. C1 (regression-guarded handoff) targets exactly this — each agent, in its own container, publishes a runnable check for its feature into the shared tree (.cb_checks/, transferred to teammates via the committed substrate but stripped before grading); every later agent must run and keep all prior checks green while adding its feature. The published check is a one-way contract handed agent → agent — one step richer than the bare substrate (which hands over code with no obligation), but still sequential. It is not ex-ante partitioning (impossible — the write-set isn't known until the code is written) and not concurrent negotiation.

C1 method: each agent publishes a runnable invariant check; the later agent must keep all prior checks green while adding its feature. The check is the coordination signal, communicated through the committed substrate.
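
A minimal sketch of the guard step described above, not the harness's actual code. It assumes (hypothetically) that each published check in .cb_checks/ is a standalone Python script that exits 0 while its feature still works; the function names and the timeout are illustrative.

```python
import subprocess
import sys
from pathlib import Path

CHECK_DIR = ".cb_checks"  # directory name from the report; its internal layout is assumed


def run_prior_checks(repo: Path, timeout_s: int = 120) -> dict[str, bool]:
    """Run every published check script and record which ones are green.

    Assumption: a check is green when the script exits with status 0.
    """
    results: dict[str, bool] = {}
    check_dir = repo / CHECK_DIR
    if not check_dir.is_dir():
        return results  # first agent in the chain: nothing to guard yet
    for script in sorted(check_dir.glob("*.py")):
        try:
            proc = subprocess.run(
                [sys.executable, str(script.resolve())],
                cwd=repo,
                capture_output=True,
                timeout=timeout_s,
            )
            results[script.name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[script.name] = False  # a hung check counts as red
    return results


def may_hand_off(results: dict[str, bool]) -> bool:
    """A later agent may commit only if every prior check stayed green."""
    return all(results.values())
```

In this sketch a later agent would call run_prior_checks before committing and keep iterating while may_hand_off is false; how the real agent loop reacts to a red check is not specified here.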

Result — the best result from an inter-agent contract, and it's sequential. On a regression-heavy coupled set C1 lifts the team 5/14 → 8/14, and every one of the +3 is the diagnosed failure reversing (feature 1 recovers F→P); no previously-passing pair is lost. It is genuinely targeted, not a blanket gain: on a control set with zero such regressions it is neutral-to-slightly-negative (overhead), and on the mixed flash set the sequential substrate alone is already best.

set	sequential team	+ C1	Δ
coupled set-1 (regression-heavy)	5/14 (36%)	8/14 (57%)	+3 (all feature-1 F→P recoveries)
coupled set-2 (0 regressions)	9/14 (64%)	8/14 (57%)	−1 (overhead; nothing to fix)
flash-10 (mixed)	5/10 (50%)	4/10 (40%)	−1 (overhead)

C1 helps iff the "later agent breaks an earlier teammate's feature" regression is present — a real fix for one specific hard-coordination failure, confirmed by it firing exactly when that failure occurs.

Concurrency did not help, the opposite of the usual intuition. On the same coupled set-1, every genuinely concurrent design underperformed the sequential ones, because the parallel split reintroduces the integration losses that serialization avoids (Round 9): C1 sequential contract 8/14 > sequential substrate 5/14 = solo 5/14 > guarded-merge decompose (concurrent + contracts) 3/14 > plain parallel decompose 2/14. So the coordination that pays off on coupled work is a sequential contract handoff, not parallelism or live negotiation; we do not (yet) have a concurrent-coordination method that beats serializing on the committed substrate.

Figure: success and efficiency vs baselines. (A) Like-for-like success (same agent/model/eval): the unified team beats solo (50% vs 30%) and matches the CooperBench team harness reference within n=10 noise. (B) Success vs. efficiency on the coupled (hard) set: the kept-best sequential team and, in the upper-left "better" corner, the C1 regression-guard (higher pass-rate and faster on this set). Parallel decompose / guarded merge trade efficiency for losses.

0.1 The foundation C1 builds on — the own-container substrate

C1 is the sequential, own-container, build-on-prior team plus the published invariant. That underlying substrate is itself the dominant lever (solo 30% → team 50% on flash) and is coordination-safe (Round 9: 0 / 28 coordination-caused losses); C1 adds the one piece of explicit coordination that pays off on coupled work.

Kept-best harness: agents run in isolated containers and coordinate by seeding committed diffs (sequential build-on-prior), generalizing to N features (agent i+1 seeded with δ1..δi). The TeamBus serves the orchestrator (task list / spawn queue / metrics); its messaging & spawn tools are wired to agents only in the builtin-agent and dynamic-spawn modes — the kept-best mini-swe agents do not use the bus, they coordinate via the seeding.
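
The seeding rule above (agent i+1 starts from δ1..δi) can be sketched as follows. This is an illustration under stated assumptions, not the harness's _run_isolated: the AgentRun record, the "delta:" labels, the patch paths, and the commit message are all hypothetical; only git apply, git add -A, and git commit -m are real git commands.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    feature: str
    # Diffs applied to the fresh container before the agent starts.
    seeded_with: list[str] = field(default_factory=list)


def sequential_schedule(features: list[str]) -> list[AgentRun]:
    """Agent i+1 gets a fresh container seeded with the diffs of agents 1..i."""
    runs: list[AgentRun] = []
    committed: list[str] = []
    for feature in features:
        runs.append(AgentRun(feature, seeded_with=list(committed)))
        committed.append(f"delta:{feature}")  # placeholder label for this agent's diff
    return runs


def seed_commands(patch_paths: list[str]) -> list[list[str]]:
    """git invocations that seed one container (patch paths are hypothetical)."""
    cmds = [["git", "apply", path] for path in patch_paths]
    if patch_paths:
        # Commit the seeded work so the agent's own diff is measured on top of it.
        cmds.append(["git", "add", "-A"])
        cmds.append(["git", "commit", "-m", "seed: teammates' committed diffs"])
    return cmds
```

For three features this yields seed lists of length 0, 1, and 2, so no agent ever touches a teammate's running container.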

1. Headline results

Same agent (mini-swe), model (gpt-5.5), evaluator, and 10 pairs. A pair passes only if both features' held-out suites pass on one merged tree.

harness	success	per-feature	notes
CooperAgents solo (1 agent, own container)	30% (3/10)	60%	single-agent baseline
CooperBench team harness (reference)	60% (6/10)	65%	like-for-like reference (own containers + scratchpad)
CooperAgents unified team — OWN containers (seed+commit)	50% (5/10)	65–70%	current; satisfies the hard constraint
unified team — shared single container	56%	67%	SUPERSEDED (violated own-container constraint)
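
The two columns in the table measure different things: a pair counts toward "success" only if both features pass, while "per-feature" counts each feature separately. A small sketch of that arithmetic follows; the per-pair outcomes are invented for illustration (the report does not list them) and are chosen only so the totals match the solo row.

```python
def score(results: dict[str, tuple[bool, bool]]) -> tuple[float, float]:
    """Return (pair success rate, per-feature pass rate).

    results maps a pair id to (feature-1 passed, feature-2 passed).
    """
    n = len(results)
    pair_rate = sum(1 for f1, f2 in results.values() if f1 and f2) / n
    feature_rate = sum(int(f1) + int(f2) for f1, f2 in results.values()) / (2 * n)
    return pair_rate, feature_rate


# Hypothetical breakdown: 3 pairs pass both features, 6 pass exactly one, 1 passes none.
illustrative = {f"pp{i}": (True, True) for i in range(3)}
illustrative.update({f"pf{i}": (True, False) for i in range(6)})
illustrative["ff0"] = (False, False)

pair_rate, feature_rate = score(illustrative)
print(f"success {pair_rate:.0%}, per-feature {feature_rate:.0%}")
```

Under this assumed breakdown, 12 of 20 features pass but only 3 of 10 pairs do, which is how a 30% success rate can coexist with a 60% per-feature rate.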

Under the own-container constraint, the team advantage holds: team 50% vs solo 30%, matching CooperBench's team harness (60%) within n=10 noise. The substrate value is preserved by seeding+committing teammates' diff into each isolated container.
The gap to CooperBench's published codex team (62%) is agent capability (codex ≫ mini-swe), not the harness — confirmed by holding the agent constant.
The team advantage comes from separability, not coordination (Round 6). On a purpose-built set of the 14 most coupled pairs (features sharing files/symbols, ranked from gold patches), team = solo = 5/14 and the best coordination seam (S2) doesn't help (4/14). The multi-agent win evaporates exactly where coordination "should" matter: on tightly-coupled work the sequential build-on-prior substrate causes interference, not leverage. Multi-agent value is in parallelizable work, not richer coordination.

2. Hypotheses tested

#	hypothesis	result	verdict
H1	Unifying team+agent onto one shared bus + git tree enables co-design	Agents + supervisor share a `TeamBus`; agent loop editable in-repo	SUPPORTED
H2	A team can spawn helper agents at runtime (capped)	Works (`spawn_helper` + supervisor); covered by tests + offline flash	SUPPORTED
H3	A shared code substrate beats isolated per-agent patches	Isolated → merge conflicts → team below solo. Shared tree → conflicts gone → team 30%→56%	STRONGLY SUPPORTED
H4	The unified harness can match CooperBench's team harness (same agent)	56% vs 60% on the same 10 pairs	SUPPORTED
H5	A stronger agent loop (toy JSON loop → mini-swe) raises the absolute level	solo 20%→30%, team 30%→56%	SUPPORTED
H6 (S5)	A team-level verify-and-fix integrator pass fixes remaining failures	go_chi probe 0/2→0/2, ~2× time, no change	REJECTED (failures were API-semantic, not build errors)
H7 (S8)	A spec-fidelity prompt fixes API mismatches	team 5/10 (= baseline)	REJECTED (misses are spec-ambiguous)
H8 (S2)	Live teammate context (show each agent the prior diff) improves integration	team 6/10 (+1, best variant)	SUPPORTED (suggestive, n=10)
H9	Combining S8+S2 beats either alone	combo 5/10 (< S2's 6/10), but fastest	REJECTED (S8 adds nothing)

3. Seam-improvement ablation — three metrics (own-container)

Same agent, same 10 pairs; only the team↔agent seam changes, all on the own-container + S2 baseline. Success (both-pass), Efficiency (avg wall-clock; lower better), Judge (mean LLM rubric 1–5). Reliable rows = 10/10 evals.

variant	Success	Efficiency (s)	Judge /5	verdict
baseline (own-container + S2)	5/10 (50%)	686	3.40	—
S6 budget-80	3/10	786	3.00	DROP
S7 guard	4/10	870	3.80	DROP
S8 spec-fidelity	5/10	1085	3.40	DROP
S5+S6	6/10	1434	3.70	INCONCLUSIVE
S5+S8	5/10	1564	3.60	DROP
S6+S7	6/10	937	3.80	INCONCLUSIVE
all (S5+S6+S7)	5/10	1342	—	DROP

The three metrics agree on the shape: the two 60% combos (S5+S6, S6+S7) also have the highest judge (3.70/3.80), a mild positive signal — but budget alone scored worst (30%, judge 3.00) and both winners are slower, so it's noise-consistent. No robust, attributable gain at n=10; only n≈30 could confirm S5+S6/S6+S7. Earlier shared-container S2/S8/combo numbers are superseded.

4. Line of evolution

Designs are added on top of what was kept; each is measured (success / efficiency / judge) and kept or rejected. KEEP stacks into the baseline; REJECT is recorded and dropped. The tree below reads top-to-bottom as the progressive build: first the architecture, then the substrate, then four distinct mechanism families of seam experiments layered on the kept baseline.

ARCHITECTURE
two-level design, toy JSON agent loop                 team < solo (uncoordinated)   REJECT
  └─ unify team+agent onto ONE shared bus + per-agent envs                            KEEP   (the seam)
       └─ + dynamic helper spawning (capped by --max-agents)                          KEEP   (capability)
            └─ + mini-swe worker  (real agent, HELD CONSTANT — never codex)           KEEP   (worker)

SUBSTRATE  (the one big lever)
                 └─ shared SINGLE container (build-on-each-other)       30%→56%        REJECT  ✗ violates own-container constraint
                      └─ OWN containers, each seeded w/ teammates' committed diff  →50%  KEEP  ★ BASELINE
                         │   ≈ CooperBench team (60%), ≈1.7× solo (30%); seed-commit ~2× faster
                         │   than independent+integrate at equal success/judge → kept.
                         │
   ┌─────────────────────┴─ four mechanism families layered on the ★ baseline ─────────────────────┐
   │
   ├─ FAMILY A · coordination-prompt seams  (state crossing team↔agent)
   │    ├─ S2 live teammate-diff context           6/10 @n10 (judge↑)   keep* marginal — new best on judge
   │    ├─ S8 spec-fidelity prompt                 5/10 = baseline       REJECT (misses are spec-ambiguous)
   │    ├─ S5 verify-fix · S6 budget · S7 git-guard 30–60%@n10 → 40%@n20 REJECT (n=10 noise; combos collapse)
   │    └─ reverse order · S3 contracts            50%                   REJECT (irrelevant / subsumed)
   │
   ├─ FAMILY B · post-hoc reviewer / repair passes  (extra LLM pass after the team) — analyst-generated
   │    └─ T3 completeness-review reviewer pass    6/10 @n10 → 10/20 @n20 = baseline   REJECT double-edged (+2 fixed / −2 broken)
   │
   ├─ FAMILY C · in-loop workflow preambles  (change the SAME agent's loop, no extra pass) — analyst-generated
   │    ├─ T2 spec-derived TDD (self-verify first) 5/10 = baseline, 0 flips  REJECT (pure wash)
   │    └─ T4 convention-mining before editing     4/10 (baseline −1)        REJECT (mining eats step budget)
   │
   └─ FAMILY D · selection over run-to-run variance  (best-of-N)
        └─ T6 best-of-N + LLM-judge self-select    5/10 = baseline (+1 / −1 flip)  REJECT 2× cost; judge mis-selects

   ═══ 4 families → the SAME ~50% plateau · binding constraint = agent capability (mini-swe) ═══

CHANGE THE EVAL SET  (Round 6 — is the plateau a sampling artifact?)
  └─ rank pairs by GOLD-patch coupling (shared files×2 + shared symbols)        (scripts/select_coupled.py)
       ├─ old fixed set was already 8/10 coupled → NOT a coupling-poor sample
       └─ pick 14 MOST-coupled pairs (lite pool, 9 repos)
            ├─ solo                       5/14 (36%)
            ├─ team — baseline substrate  5/14 (36%)   team = solo  ✗ multi-agent advantage GONE on coupled work
            └─ team + S2 coordination     4/14 (−1)    no help
       ★ INSIGHT: team wins come from SEPARABILITY, not coordination. Coupled features →
         build-on-prior substrate causes interference → cancels the parallelism benefit.

SEPARABILITY-AWARE ORCHESTRATION  (Round 7 — can we CREATE separability?)
  └─ planner re-cuts task into an independence-maximizing subtask DAG       (planner.py + _run_decomposed)
       ├─ file-level merge       OLD 30% · COUP 21%   ≈solo  CooperBench pairs always share files → merges to 1 agent
       └─ region-aware split      OLD 30% · COUP 14%   worse  splits 2–3 ways but 12/14 merges CONFLICT
       ★ LIMIT: a conflict-free partition needs the write-set BEFORE coding, but it's only
         known AFTER coding. Separability can't be manufactured ex-ante; it must exist in the task.

COORDINATION UNDER INTERDEPENDENCE  (Round 8 — coupling IS the target)
  └─ diagnose coupled failure: dominant = f1=F,f2=P (2nd agent BREAKS 1st agent's feature)
       └─ C1 regression-guard: each agent publishes a runnable check; later agents keep ALL green
            ├─ coupled set-1 (regression-heavy)  5/14 → 8/14   +3  all 3 = feature-1 F→P recovery
            ├─ coupled set-2 (0 regressions)     9/14 → 8/14   -1  nothing to fix → overhead
            └─ OLD-10 (mixed)                    5/10 → 4/10   -1  overhead
       ★ FIRST attributable coordination WIN on coupled work — TARGETED: helps iff the
         "later agent breaks earlier feature" regression is present. Diagnose failure → target it.

WHERE DO COORDINATION PROBLEMS LIVE?  (Round 9 — detector + fresh tasks)
  └─ detector: solo passes a feature but TEAM breaks it = coordination-CAUSED loss
       └─ harvest 28 pairs (coupled set-1 + 14 fresh, diverse repos):
            ├─ sequential-seed team   0/28 losses   coordination-SAFE  (serializes on committed work)
            └─ parallel decompose     7/28 losses   lossy  (split+merge breaks working features)
       ★ Coordination problems concentrate in the PARALLEL regime (the merge boundary). The
         substrate (build-on-prior) is dominant BECAUSE it eliminates them. Open hard problem =
         LOSS-FREE PARALLELISM (parallel speed without the 7/28 integration losses).

LOSS-FREE PARALLELISM  (Round 10 — guarded merge: decompose + published invariants)
  └─ each parallel branch publishes a check; guarded integrator repairs merge-broken features
       ├─ plain parallel decompose      2/14
       ├─ guarded-merge decompose       3/14   partial  recovers FF→PP (click f1_f3) but double-edged
       └─ sequential build-on-prior     5/14   still best
       ★ Guard repairs real merge losses but can't close the gap: residual unrepairable conflicts
         (Round 7 write-set limit) + overhead. Coupled work → serialize. Loss-free parallel only partial.

ADAPTIVE TOPOLOGY  (Round 11 — let the work decide, by observation)
  └─ run parallel → probe merge (git apply --check) → clean? keep parallel : fall back to sequential
       ├─ CooperBench (all coupled)   20/20 → fallback   no regression  ≈ sequential (parallel never engages)
       └─ separable work (controlled)  → parallel        1.99× faster  8.0s → 4.1s, both features intact
       ★ sequential's success on coupled + parallel's efficiency on separable, decided by conflict not prediction.

GENERALIZE  (Round 12 — agents choose the topology, domain-agnostic)
  └─ planner AGENT reasons about subtask deps (independent / needs-output / combine-at-end) → a DAG
       ├─ chain = sequential   ·   no edges = parallel   ·   diamond = parallel-then-join (pipeline)
       └─ executor runs the agent's DAG: parallel 4.0s · sequential 8.1s · pipeline 8.1s (2×~4s work)
       ★ topology decision + scheduling are domain-general; only the integration backend (git merge) is SWE.

How to read it: each FAMILY is a structurally different way to spend extra effort at the seam. The substrate (own-container, build-on-prior) is the one change that moved the needle (30%→50%); everything layered after it lands back on ~50%. Note the shared-single-container design briefly hit 56% but was rejected for violating the own-container constraint — re-run under the constraint the team holds 50%, so the substrate value transfers via committing teammates' diff into each isolated container. Families B and C were proposed by the harness's own LLM analyst reading its failures; D (best-of-N self-selection) is the last untried mechanism. D is the most telling negative: best-of-2 does expose run-to-run variance (≈1 pair flips each way), so headroom exists — but the LLM judge can't reliably pick the passing candidate (it mis-selected a pair baseline gets right). A selector good enough to net-gain would have to know which diff passes the hidden tests — i.e. it would have to be more capable than the agent, which is the constraint itself.
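
The three Round 12 topologies in the tree above (chain, no edges, diamond) can be illustrated with a small scheduling sketch. It uses the standard-library graphlib module and is not the harness's executor: the task names, the wave-synchronous execution model, and the ~4 s per-subtask durations are assumptions made for the example.

```python
from graphlib import TopologicalSorter


def waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into waves; every task in a wave has all its dependencies met.

    deps maps each task to the set of tasks it needs output from.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    out: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        out.append(ready)
        sorter.done(*ready)
    return out


def makespan(deps: dict[str, set[str]], seconds: dict[str, float]) -> float:
    """Wall-clock estimate if each wave runs in parallel and waves run in order."""
    return sum(max(seconds[task] for task in wave) for wave in waves(deps))


PARALLEL = {"a": set(), "b": set()}                # no edges
CHAIN = {"a": set(), "b": {"a"}}                   # b needs a's output
DIAMOND = {"a": set(), "b": set(), "join": {"a", "b"}}  # parallel, then join

ASSUMED_SECONDS = {"a": 4.0, "b": 4.0, "join": 4.0}  # illustrative durations

for name, dag in [("parallel", PARALLEL), ("chain", CHAIN), ("diamond", DIAMOND)]:
    print(name, waves(dag), makespan(dag, ASSUMED_SECONDS))
```

The point of the sketch is that the same executor covers all three shapes: the planner only has to emit dependency edges, and the schedule follows from them.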

5. Experiment log (chronological)

Each round explores hypotheses on top of the current kept baseline; winners stack.

Round 0 — architecture

Unified the team and agent layers onto one TeamBus (task list + messaging + spawn queue) and a shared environment, so coordination is just more tools on the agent and the agent loop is editable in-repo. Added runtime helper spawning. Enables co-design; not a pass-rate experiment.

Round 1 — establish the team baseline (on the unified harness)

Shared workspace (one git tree, agents build on each other): isolated per-agent patches → merge conflicts → team below solo; shared tree → conflicts gone, team 30% → 56%. KEPT — dominant lever.
mini-swe worker (replace toy loop with the real agent, held constant): solo 20%→30%, team 30%→56% — matches CooperBench's team harness (60%). KEPT.
S5 verify-and-fix: 0/2 on probe, ~2× time. REJECTED (failures are API-semantic, not build).
S8 spec-fidelity prompt: 5/10, judge 3.40 ≈ baseline 3.30. REJECTED (spec-ambiguous misses).
S2 live teammate context: 6/10, judge 3.70 — best on success & judge. KEPT → new baseline.

Round 1.5 — hard constraint: every agent in its OWN container (redo)

The shared-single-container design (56%) was rejected — it shared a live workspace, violating the constraint. Reworked to _run_isolated: each agent (and the integrator) in its own container, seeded with teammates' committed diff via git apply.
Re-measured under the constraint: team 50% (5/10) vs solo 30% — the advantage survives; matches CooperBench team (60%) within noise. New baseline.
S2 (teammate-context note) is now marginal (per-feature 13/20→14/20, both-pass unchanged) — committing the seeded prior already gives the agent the code, so the note is largely redundant.
Measurement lesson: the team briefly looked like 30% because about half of the eval.json files had not been written (eval-phase flakiness) and each missing file was counted as a fail. Fix: run the agent phase and the eval phase separately, and backfill evals single-process before reading any number. Parallel agent runs are fine.
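
One way to make this failure mode visible is to tally missing evals separately and refuse to report a rate until they are backfilled. The sketch below assumes a hypothetical layout (one directory per pair containing eval.json) and a hypothetical "both_passed" key; neither is taken from the real evaluator's output format.

```python
import json
from pathlib import Path


def tally(run_dir: Path, pair_ids: list[str]) -> dict[str, list[str]]:
    """Bucket pairs into pass / fail / missing instead of folding missing into fail."""
    buckets: dict[str, list[str]] = {"pass": [], "fail": [], "missing": []}
    for pair_id in pair_ids:
        eval_file = run_dir / pair_id / "eval.json"  # assumed layout
        if not eval_file.is_file():
            buckets["missing"].append(pair_id)
            continue
        data = json.loads(eval_file.read_text())
        # "both_passed" is a hypothetical key, used only for this illustration.
        buckets["pass" if data.get("both_passed") else "fail"].append(pair_id)
    return buckets


def success_rate(buckets: dict[str, list[str]]) -> float:
    """Pass rate over graded pairs; errors out while any eval is still missing."""
    if buckets["missing"]:
        raise ValueError(f"backfill evals first: {buckets['missing']}")
    graded = len(buckets["pass"]) + len(buckets["fail"])
    if graded == 0:
        raise ValueError("no graded pairs")
    return len(buckets["pass"]) / graded
```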

Round 2/3 — seam ablation on the own-container baseline (9 variants, parallel)

Nine variants run in parallel (no API rate-limiting; eval backfilled single-process). All on top of own-container + S2; only the named seam changes. Reliable rows = 10/10 evals.

variant	Success	Efficiency (s)	verdict
baseline (own-container + S2)	5/10 (50%)	686	—
S6 budget-80	3/10	786	DROP
S7 guard	4/10	870	DROP
S8 spec-fidelity	5/10	1085	DROP tied, slower
S5+S6	6/10	1434	INCONCLUSIVE uncorroborated, slower
S5+S8	5/10	1564	DROP
S6+S7	6/10	937	INCONCLUSIVE uncorroborated
all (S5+S6+S7)	5/10	1342	DROP

Finding — the loop has converged. Results scatter 30–60% around the 50% baseline, i.e. pure n=10 noise (±1–2 pairs). The two 60% combos aren't corroborated (budget alone scored 30%) and every "winner" is slower. No seam tweak gives an attributable gain at this scale. The dominant levers (shared substrate via seed-commit, real mini-swe agent) are already banked; the rest is noise + agent capability. Distinguishing these would require n≈30, not more tweaks.
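
The "±1–2 pairs" noise band can be checked with the binomial standard error. This is a back-of-envelope sketch that assumes pairs are independent trials with a true pass rate near the 50% baseline, which is a simplification.

```python
import math


def pass_rate_se(p: float, n: int) -> float:
    """Standard error of an observed pass rate under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


for n in (10, 20, 30):
    se = pass_rate_se(0.5, n)
    print(f"n={n}: ±{se:.3f} in pass rate, about ±{se * n:.1f} pairs")
```

At n=10 one standard error is roughly 16 percentage points (about 1.6 pairs), so a single-pair difference such as 6/10 vs 5/10 sits well inside it.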

S3 (interface contracts) subsumed by seed-commit; S4 (spawn-as-tool) deferred to the dynamic/parallel mode.

Round 4 — design-space closure + n=20 confirmation

experiment	Success	Efficiency (s)	Judge /5	finding
baseline (seed-commit)	5/10 (50%)	686	3.40	—
no-seed (independent + integrator merge)	5/10 (50%)	1245	3.30	DROP = on success/judge but ~2× slower
reverse feature order	5/10 (50%)	827	3.20	DROP ordering irrelevant
n=20 confirm — baseline	10/20 (50%)	—	—	baseline stable across n=10 and n=20
n=20 confirm — S6+S7 (the n=10 "60%")	8/20 (40%)	—	—	REJECT 60% was noise — collapses below baseline

Definitive convergence. The n=20 confirmation kills the only live signal: S6+S7 drops from 60% (n=10) to 40% (n=20), below the stable 50% baseline. The three metrics also refine the design choice: seed-commit and independent+integrate tie on success (50%) and judge (3.3–3.4), but seed-commit is ~2× faster (686s vs 1245s — no extra integrator pass), so the kept design wins on the efficiency axis. Ordering is irrelevant. No seam tweak produces a real, attributable gain — the dominant levers (own-container substrate + real mini-swe agent) are fully banked. The coordination-prompt seam sub-space has converged — but see Round 5: an LLM analyst reopened the loop on the untested tool/workflow axis.

Metric coverage: Rounds 1–3 used all three metrics; the Round 4 tie cases (no-seed, reverse-order) now carry all three (above). The n=20 confirmation is success-only by design: S6+S7 regressed on pass-rate, which is a drop regardless of judge.

Round 5 — LLM analyst generates the next hypotheses (loop reopened)

An LLM analyst (eval/analyst.py) read the 5 failing pairs' real evidence — spec, submitted diff, and the grader's actual error — diagnosed each, and proposed new hypotheses. Key correction it surfaced: 4 of 5 failures are "incomplete implementation" (a whole feature silently omitted/half-done), which IS seam-addressable; only 1 is truly spec-ambiguous. So the earlier "converged" call applies only to coordination-prompt seams — the tool/workflow axis is open (raising effective agent capability via harness scaffolding is legitimate co-design; the agent model stays fixed).

analyst-proposed hypothesis	class	targets
T1 — API-contract extraction + AST/probe gate before submit	tool	wrong API, incomplete
T2 — spec-derived acceptance tests (TDD) before coding	workflow	incomplete, wrong API
T3 — feature-coverage ledger: block submit until each feature has evidence	workflow	incomplete (4/5 failures)
T4 — existing-test / convention mining step before editing	tool	incomplete, conflict
T5 — ambiguity triage (prefer least-new API; flag SPEC-AMBIGUOUS)	capability	spec-ambiguous

These were generated by the harness from its own failures — the loop's reflect→generate step, automated.

T3 result (completeness review). n=10: 6/10 vs baseline 5/10 (+1, flipping an incomplete-impl pair via its intended mechanism). But the n=20 confirmation washes it out: T3 10/20 = baseline 10/20. Per-pair, T3 flips 2 fail→pass but regresses 2 pass→fail: the extra reviewer pass is double-edged (it fills omissions but sometimes breaks working code), net zero and slower. DROP. The n=10 +1 was noise (again).
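
An equal headline score can hide offsetting per-pair changes, which is why the per-pair view matters here. A minimal sketch of that comparison follows; the pair ids and outcomes are made up for illustration and are not the actual T3 results.

```python
def flips(baseline: dict[str, bool], variant: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Return (pairs fixed fail→pass, pairs broken pass→fail) for shared pair ids."""
    shared = baseline.keys() & variant.keys()
    fixed = sorted(p for p in shared if not baseline[p] and variant[p])
    broken = sorted(p for p in shared if baseline[p] and not variant[p])
    return fixed, broken


# Hypothetical outcomes: same pass count, different pairs.
baseline = {"p1": True, "p2": True, "p3": False, "p4": False}
variant = {"p1": False, "p2": True, "p3": True, "p4": False}

fixed, broken = flips(baseline, variant)
print(f"fixed={fixed} broken={broken} net={len(fixed) - len(broken)}")
```

A net of zero with non-empty fixed and broken lists is the "double-edged" signature: the pass changes outcomes without improving them.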

T2 / T4 result (in-loop workflows). Because post-hoc passes are double-edged, the next two analyst hypotheses were built with a distinct mechanism — in-loop prompt preambles that change what the same agent does within its own budget, adding no extra pass. T2 (spec-derived TDD: derive acceptance criteria + run throwaway local checks before submitting) scored 5/10 = baseline, zero pair flips — a pure wash. T4 (convention mining: inspect existing tests/usages before editing) scored 4/10, baseline −1 — the mining step spends step-budget on exploration and costs a pair. DROP both.

T6 result (best-of-N self-selection). The last untried mechanism: run the whole isolated team twice (own containers each), then the LLM judge — pairwise, both orders, no hidden-grader access — selects the candidate diff to submit. 5/10 = baseline, but with a flip each way (gains 1706 f5_f8, loses 1655 f6_f7) at 2× compute. DROP. This is the most informative negative: best-of-2 does surface run-to-run variance, so the headroom is real — but the judge mis-selected a pair the baseline gets right. A selector good enough to net-gain would need to know which diff passes the hidden tests, i.e. be more capable than the agent — the constraint itself.

Four-family convergence. The ~50% plateau now holds across four distinct mechanism families: (a) coordination-prompt seams (Rounds 2–4), (b) post-hoc reviewer/repair passes (T3/S5), (c) in-loop workflow preambles (T2/T4), and (d) selection over variance (T6). None lifts it: post-hoc passes regress as much as they fix, in-loop preambles are wash-or-harmful (extra exploration eats the step budget), and selection is bottlenecked on an oracle no cheaper than the agent. This is strong, mechanism-diverse evidence that the binding constraint is genuinely agent capability for mini-swe at this scale, not any single untried seam (the remaining T1/T5 are variants of already-washed families). The honest stopping point: a 50% plateau, within n=10 noise of the CooperBench team harness (60%) and ≈1.7× solo (30%); the next real lever is the agent, not the seam.

Round 6 — change the eval set: a coordination-sensitive benchmark

Hypothesis (from the question "is the plateau a sampling artifact of the fixed 10 pairs?"): maybe that set isn't where coordination matters, so seams have nothing to grab. To test it directly we built a graded coupling score from the gold solution patches (scripts/select_coupled.py): shared files (×2) + shared changed symbols. First surprise: the old fixed set was already 8/10 gold-conflict (coupling-heavy, above the 74% base rate), so the plateau was not from a coupling-poor sample. We then selected the 14 most-coupled pairs across the 100-pair lite pool (9 repos, cap 2/repo, 6 repos the old set never touched — e.g. dirty_equals/43 sharing 2 files +3 symbols, the click/2800 family sharing _termui_impl.py+termui.py).
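
The coupling score and the capped selection described above can be sketched as below. This is an illustration of the stated formula (shared files ×2 + shared changed symbols), not the contents of scripts/select_coupled.py; the function names, input shapes, and tie-breaking are assumptions.

```python
from collections import defaultdict


def coupling_score(files_a: set[str], files_b: set[str],
                   symbols_a: set[str], symbols_b: set[str]) -> int:
    """Shared files count double; shared changed symbols count once."""
    return 2 * len(files_a & files_b) + len(symbols_a & symbols_b)


def select_most_coupled(scored: list[tuple[str, str, int]],
                        k: int = 14, cap_per_repo: int = 2) -> list[str]:
    """Pick the k highest-scoring pairs, taking at most cap_per_repo from any repo.

    scored holds (repo, pair_id, score) tuples; ties keep input order.
    """
    picked: list[str] = []
    per_repo: dict[str, int] = defaultdict(int)
    for repo, pair_id, _score in sorted(scored, key=lambda row: row[2], reverse=True):
        if per_repo[repo] >= cap_per_repo:
            continue
        picked.append(pair_id)
        per_repo[repo] += 1
        if len(picked) == k:
            break
    return picked
```

Under this formula a pair whose gold patches share 2 files and 3 symbols scores 2×2 + 3 = 7.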

arm (14 most-coupled pairs)	success	note
solo (1 agent)	5/14 (36%)	single-agent baseline
team — baseline substrate	5/14 (36%)	= solo; team advantage gone
team + S2 coordination seam	4/14 (29%)	−1 vs baseline; no help

The result inverts the coordination intuition, and it is a strong negative. On the old mixed set the team (50%) clearly beat solo (30%); on the most-coupled set the gap vanishes (team 36% = solo 36%), and it vanishes because the team regressed (50%→36%) while solo held (~30→36%). The explicit coordination seam doesn't recover it. Interpretation: the multi-agent win is about separability, not coordination. When two features touch the same files/symbols, the sequential build-on-prior substrate creates interference (the second agent inherits and must reconcile edits to shared code) that cancels the parallelism benefit, and a single agent holding the whole coupled problem in one context does just as well. So a harder-via-coupling set does not unlock seam value; it removes the team advantage. The actionable lever for multi-agent gains is partitioning work to be separable (S1 region-partitioning), not richer coordination, which is the opposite of where the seam backlog was pointed.

Round 7 — separability-aware orchestration (decompose by write-set)

If multi-agent value needs separability (Round 6), build an orchestrator that creates it: a PLANNER (cooperagents/planner.py) re-cuts the task into a dependency DAG of subtasks chosen to maximize independence; the harness (_run_decomposed) runs independent subtasks in parallel (own containers, seeded only along DAG edges) and merges the branch deltas. Two versions: (a) file-level (merge work that shares a file) and, after the insight that conflict is region-level not file-level, (b) region-aware — split same-file work by function, give each agent a disjoint owns write-set, enforce it as an ownership boundary in the prompt. 84/100 lite pairs are the "win zone" (share a file, disjoint symbols) where (b) should pay off.

orchestration	OLD-10	COUPLED-14	note
solo	30%	36%	one agent, whole task
sequential team (build-on-prior)	50%	36%	best on OLD
decompose — file-level merge	30%	21%	collapses to ≈solo
decompose — region-aware split	30%	14%	worse 12/14 merges conflicted

The planner behaves correctly — file-level merges ~every pair (CooperBench pairs are coupled by construction: both features from the same PR, always sharing files), so decompose collapses to ≈solo and loses the sequential team's 50%. Region-aware does split (2–3 parallel subtasks/pair) but scores worse, and 12/14 merges still conflicted (.rej / conflict markers). The deep reason — a fundamental limit, not a bug: a conflict-free partition needs each subtask's write-set before the code is written, but the true write-set is only known after doing the work. The planner predicts regions from the spec; real features need cross-cutting edits it can't foresee (a shared import, a new field, a registration line), so agents either stray outside their lane → conflicts, or obey ownership → incomplete features. You cannot manufacture separability by ex-ante partitioning. On intrinsically-coupled work, one agent holding the whole context beats any predicted cut.
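
The write-set limit can be made concrete by comparing the planner's predicted ownership with what each agent actually had to edit. The sketch below is illustrative only: the region labels ("file::symbol") and the example data are hypothetical, and the real planner's owns format is not shown in this report.

```python
def stray_edits(owns: dict[str, set[str]], actual: dict[str, set[str]]) -> dict[str, set[str]]:
    """Regions each agent edited outside its predicted (owned) write-set."""
    return {
        agent: actual[agent] - owned
        for agent, owned in owns.items()
        if actual[agent] - owned
    }


def conflicting_regions(actual: dict[str, set[str]]) -> set[str]:
    """Regions written by more than one agent, i.e. merge-conflict candidates."""
    first_writer: dict[str, str] = {}
    clashes: set[str] = set()
    for agent, regions in actual.items():
        for region in regions:
            if region in first_writer and first_writer[region] != agent:
                clashes.add(region)
            first_writer.setdefault(region, agent)
    return clashes


# Hypothetical plan: disjoint functions in one file, predicted from the spec.
owns = {"agent1": {"mod.py::parse"}, "agent2": {"mod.py::render"}}
# Hypothetical reality: both features also need the shared import block.
actual = {
    "agent1": {"mod.py::parse", "mod.py::imports"},
    "agent2": {"mod.py::render", "mod.py::imports"},
}

print(stray_edits(owns, actual))
print(conflicting_regions(actual))
```

The plan is disjoint, yet both agents stray into the same unplanned region. That overlap can only be computed from actual, which exists only after the code is written.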

Round 8 — coordination under interdependence (the actual hard problem)

Reframe: CooperBench couples its pairs on purpose — the coupling IS the research target, so the lever must be coordination DURING execution, not partitioning before it. Diagnosis (from the sequential team's per-feature eval on coupled pairs): the dominant failure is f1=F, f2=P — the second agent, building on the first's work, silently breaks the first agent's feature while implementing its own. Mechanism (C1 — regression-guarded coordination): each agent PUBLISHES a runnable check for its own feature into the shared tree (.cb_checks/, transferred to teammates via the substrate but stripped before grading); later agents must run and keep ALL prior checks green. Coordination via a preserved invariant communicated agent→agent at runtime.

set	sequential team	+ C1 regression-guard	Δ
coupled set-1 (regression-heavy)	5/14 (36%)	8/14 (57%)	+3 — all 3 are feature-1 recovering F→P
coupled set-2 (0 regressions)	9/14 (64%)	8/14 (57%)	−1 (overhead; nothing to fix)
OLD-10 (mixed)	5/10 (50%)	4/10 (40%)	−1 (overhead)

The first attributable coordination win on coupled work — and precisely targeted. On set-1 the +3 is exactly the diagnosed regressions flipping (feature-1 goes F→P), no previously-passing pair lost. The replication on set-2 is the honest control: set-2 had zero f1=F,f2=P pairs, so C1 had nothing to fix and its check-writing overhead cost one pass. C1 helps iff the set contains "later agent breaks earlier teammate" regressions — a real solution to one specific hard-coordination failure mode, not a universal gain. This is the opposite lesson from the coordination seams of Rounds 2–4: a coordination mechanism does move the needle on coupled work when it targets a diagnosed failure with a verifiable runtime invariant (vs. a generic "reuse names" prompt). Next: make C1 adaptive (invoke only when regression risk is present) or cheaper (use the repo's existing tests as the invariant instead of agent-authored checks) to keep the set-1 win without the set-2 overhead.

Hunting the next coordination failure mode, and finding the boundary. Applying the same recipe (diagnose → targeted mechanism) to the other coupled failures (f1=P,f2=F and f1=F,f2=F) shows that the real grader errors are per-feature CORRECTNESS bugs, not coordination failures: pillow/290 fails a palette-sort assertion and is missing a ValueError; llama_index/18813 gets max-bytes and audio-bytes resolution wrong; dspy/8394 implements feature 1 fine but feature 2's compression simply doesn't reduce size. Each feature fails on its own merits (no "agent 2 broke agent 1", no missing-interface, no merge conflict), and solo fails these too, so they are agent-capability bound, not seam-addressable. Conclusion: C1 captures essentially the entire coordination-addressable surface of coupled work (the regression mode); the residual is capability. That is the useful boundary the hard benchmark was built to expose: coordination's job here is to stop the team from creating failures (regressions) that a solo agent wouldn't, and once that's done, what's left is the agent.

Round 9 — where do coordination problems actually live? (harvest new tasks)

To find more coordination problems, and more kinds of them, we need a detector and fresh tasks. Detector: a failure is coordination-caused iff solo passes a feature but the team breaks it (the team setup created a failure a single agent wouldn't have). We harvested 14 fresh pairs (diverse repos, none reused), ran solo, sequential-team, and parallel-decompose on each, and mined every solo-passes-but-team-breaks instance. Combined with coupled set-1, that gives 28 pairs total.

team design	coordination-caused feature losses (of 28 pairs)
sequential-seed (build-on-prior)	0 — coordination-safe
parallel decompose (split + merge)	7 (set-1: 5, fresh: 2)

Replicated cleanly: the sequential team broke nothing solo got right across 28 pairs, while the parallel design broke 7 working features (e.g. click/2068 f1_f4, llama/17244 f2_f6 went solo[PP]→dec[PF]). So coordination problems are real and findable — and they concentrate in the PARALLEL regime, at the merge/integration boundary. This unifies the whole report: the dominant lever (the own-container build-on-prior substrate, Round 1) is dominant precisely because it eliminates coordination losses — it serializes on committed work, so there are no conflicts and no regressions to coordinate away. The sequential substrate is the de-facto solution to coordination on coupled tasks. The genuine open hard problem is therefore loss-free parallelism: getting the speed of the parallel split without its 7/28 integration losses — which is exactly where C1's published-invariant idea should transfer (each branch publishes a check; the integrator must land a merge that keeps all of them green, repairing conflicts).
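The detector defined above (solo passes a feature, team breaks it) reduces to a per-feature comparison. The sketch below assumes results are stored as strings of P/F per pair, matching the solo[PP]→dec[PF] notation. The function name and the dictionary layout are assumptions, and the third pair in the example is made up to show a non-coordination failure.

```python
def coordination_losses(solo: dict[str, str], team: dict[str, str]) -> list[tuple[str, int]]:
    """Return (pair, feature position) for every feature solo passes but the team fails.

    Results are strings such as "PF": position 1 passed, position 2 failed.
    """
    losses = []
    for pair, solo_result in solo.items():
        team_result = team.get(pair)
        if team_result is None:
            continue  # pair not run under this team design
        for position, (s, t) in enumerate(zip(solo_result, team_result), start=1):
            if s == "P" and t == "F":
                losses.append((pair, position))
    return losses


# Two transitions quoted in the report, plus one illustrative capability failure.
solo = {"click/2068 f1_f4": "PP", "llama/17244 f2_f6": "PP", "example/0 f1_f2": "PF"}
decompose = {"click/2068 f1_f4": "PF", "llama/17244 f2_f6": "PF", "example/0 f1_f2": "PF"}

print(coordination_losses(solo, decompose))
# [('click/2068 f1_f4', 2), ('llama/17244 f2_f6', 2)]
```

The made-up third pair fails the same feature under both designs, so it is not counted: a feature that solo also fails is a capability miss, not a coordination loss.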

Round 10 — loss-free parallelism (guarded merge)

Built it: decompose + preserve_invariants → each parallel branch publishes a runnable check; after the mechanical merge, a guarded integrator runs every check and repairs whatever the merge broke until all pass (C1's invariant idea moved onto the parallel integrator). Tested on coupled set-1, where plain parallel decompose was lossy (2/14 passed; 12/14 merges conflicted).

design (coupled set-1)	pass
sequential build-on-prior (kept best)	5/14
solo	5/14
parallel decompose — guarded merge	3/14
parallel decompose — plain	2/14

The mechanism demonstrably works: the guarded integrator recovered merge-broken features, including click/2800 f1_f3 going FF→PP — the worst plain-decompose loss, where the split+merge had broken both features. But it is double-edged (repair overhead cost jinja f8_f9) and nets only 3/14 — still below the sequential substrate's 5/14. Two reasons it can't close the gap: residual conflicts the integrator can't fully repair (the same ex-ante write-set-prediction limit as Round 7), and the repair's own overhead. Conclusion: loss-free parallelism is only partially achievable; on intrinsically-coupled work, sequential build-on-prior remains the best design. Parallelism + guarding narrows the integration-loss gap but does not overcome the structural advantage of serializing on committed work.
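The guarded-merge loop described in this round can be sketched as "run all checks, repair, repeat, give up after a bound". Everything concrete below is an assumption for illustration: the merged state is a toy dictionary rather than a git tree, the repair step is a deterministic stub where the harness would invoke an agent, and max_rounds is a hypothetical bound.

```python
from typing import Callable

State = dict[str, list[str]]
Check = Callable[[State], bool]


def guarded_integrate(
    merged: State,
    checks: dict[str, Check],
    repair: Callable[[State, list[str]], State],
    max_rounds: int = 3,  # hypothetical bound on repair attempts
) -> tuple[State, list[str]]:
    """Repair the merged state until every branch's check is green or the bound is hit.

    Returns the final state and the names of checks that are still red.
    """
    for _ in range(max_rounds):
        red = [name for name, check in checks.items() if not check(merged)]
        if not red:
            return merged, []
        merged = repair(merged, red)
    red = [name for name, check in checks.items() if not check(merged)]
    return merged, red


# Toy scenario: both branches edit the same "exports" list, and the
# mechanical merge lets branch B overwrite branch A's entry.
branch_a: State = {"exports": ["a"]}
branch_b: State = {"exports": ["b"]}
mechanical_merge: State = {**branch_a, **branch_b}  # B wins the collision

checks: dict[str, Check] = {
    "f1": lambda s: "a" in s["exports"],
    "f2": lambda s: "b" in s["exports"],
}


def union_repair(state: State, red: list[str]) -> State:
    """Stub repair: restore both branches' exports (stands in for an agent)."""
    return {**state, "exports": sorted(set(branch_a["exports"]) | set(branch_b["exports"]))}


final_state, still_red = guarded_integrate(mechanical_merge, checks, union_repair)
print(final_state, still_red)  # {'exports': ['a', 'b']} []
```

A non-empty second return value corresponds to the residual conflicts the integrator could not repair, which is one of the two reasons given above for the remaining gap to the sequential substrate.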

Round 11 — adaptive topology: the harness picks sequential vs parallel by observation

Who decides: the harness, by a fixed rule — not the agents. There is no LLM judgment call ("this looks separable, go parallel"); the agents only implement their features. The topology adapts to the work (via an observed merge conflict), but the chooser is orchestrator code, not an agent. A genuinely agent-chosen topology is a separate, unbuilt mechanism (the ex-ante decompose planner is the closest, and it over-merges unreliably — Round 7).

Real multi-agent workloads are diverse: some tasks are hard-to-separate (CooperBench), some are easily separable. So instead of a fixed topology, the harness selects one per task at runtime, by observation, not by ex-ante prediction (which is impossible; the write-set isn't known until the code is written, Round 7). Method (a deterministic harness policy): run the features in parallel from base (each publishing an invariant check); probe the merge (git apply --check + run the checks); if the branches compose cleanly → keep the parallel result (fast); if they collide → fall back to the sequential build-on-prior handoff (safe), reusing branch 1. The observed conflict is the decision.

Adaptive selector: parallel-probe → keep-parallel on separable work, sequential-fallback on coupled work. The conflict signal decides — no prediction.
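The selector policy above fits in a few lines once the probe is abstracted. In this sketch a branch is reduced to the set of files it edited and the merge probe is "no file edited by two branches". That is a deliberate simplification of the report's probe (git apply --check plus running the published checks), and all class and function names are assumptions.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Branch:
    feature: str
    files: frozenset  # files this branch edited (toy stand-in for a diff)


@dataclass(frozen=True)
class Outcome:
    topology: str  # "parallel" or "sequential"
    branches: tuple


def run_from_base(feature: str, files: set) -> Branch:
    """Stub for running one agent; the real step produces a committed diff."""
    return Branch(feature, frozenset(files))


def merges_cleanly(branches: tuple) -> bool:
    """Toy probe: the branches compose iff no file is edited by two of them."""
    return all(a.files.isdisjoint(b.files) for a, b in combinations(branches, 2))


def adaptive_run(work: dict) -> Outcome:
    """Run all features from base, probe the merge, fall back to sequential on conflict."""
    items = list(work.items())
    branches = tuple(run_from_base(f, files) for f, files in items)
    if merges_cleanly(branches):
        return Outcome("parallel", branches)
    # Fallback: reuse branch 1 and rerun the remaining features on top of it, in order.
    rest = tuple(run_from_base(f, files) for f, files in items[1:])
    return Outcome("sequential", (branches[0],) + rest)


separable = {"f1": {"a.py"}, "f2": {"b.py"}}
coupled = {"f1": {"a.py", "pkg/__init__.py"}, "f2": {"b.py", "pkg/__init__.py"}}

print(adaptive_run(separable).topology)  # parallel
print(adaptive_run(coupled).topology)    # sequential
```

The coupled example collides only on pkg/__init__.py, a shared file both agents touched while their feature files stayed disjoint.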

Result — it adapts correctly; the win is on efficiency where work is separable. On a controlled separable task (two agents, disjoint files, ~4 s each) the selector takes the parallel path and is 1.99× faster (8.0 s → 4.1 s) with both features intact. On CooperBench it falls back on all 30 pairs tested — success holds at sequential levels (no regression), but the parallel path never engages.

Gold-separability ≠ agent-separability (the key correction). CooperBench is ~75% entangled / 25% not (gold-conflict report: 76.5% of gold-patch pairs conflict). One would expect adaptive to parallelize the 25%. It doesn't — and this is not a too-strict probe: switching the merge test from git apply --check to a real base-aware 3-way merge (verified to return PARALLEL on genuinely disjoint diffs) still fell back on 10/10 gold-"not-entangled" pairs. The reason: the 75/25 split is a property of the minimal gold patch; an LLM agent implementing a feature also touches shared scaffolding (imports, exports, registration, adjacent edits), so two independent agent implementations collide even when the gold patches don't. Agent-level separability is far rarer than gold-level. To exploit the 25% you'd have to stop agents touching the shared scaffolding independently — i.e. coordinate on it (back to the coupled problem) or constrain each to its region (the decompose/ownership route, which ships incomplete features — Round 7).

Net: the adaptive selector gives sequential's success on coupled work (safety, no regression) and parallel's efficiency on separable work (≈2×), choosing by observed conflict rather than a guess — a Pareto-safe default for diverse workloads. On CooperBench it (correctly) always serializes, because even the 25% gold-separable pairs are not separable once real agents implement them.

Round 12 — generalize: the agents choose the topology (domain-agnostic)

The Round-11 selector was SWE-bound (it decided via git merge conflicts) and harness-driven. Round 12 makes the decision (a) agent-made and (b) domain-general. A planner agent (planner.plan_topology) reasons about subtask relationships in task-agnostic terms (independent → parallel; needs-another's-output → sequential edge; combine-at-end → fan-in) and emits a dependency DAG. That one abstraction expresses every topology considered here: a chain is sequential, no edges is parallel, a diamond is parallel-then-sequential (pipeline), and arbitrary DAGs are hybrid. The executor runs whatever the agent picks; only the final integration/conflict-check is SWE-specific (git 3-way merge) and is a pluggable backend. The planning and scheduling are not SWE-specific.

agent-chosen topology	DAG (levels)	wall-clock (2×~4s work)
parallel (independent)	[[a,b]]	4.0s
sequential (chain)	[[a],[b]]	8.1s
pipeline (parallel→join)	[[a,b],[c]]	8.1s

The planner agent decides the structure (verified: it picks parallel / sequential / pipeline from domain-agnostic reasoning); the harness executes it with the matching wall-clock. So "the agents decide sequential vs parallel vs parallel-then-sequential" is now a general capability — CooperBench (code) is just one instantiation, where the integration backend happens to be git.
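The "DAG (levels)" column in the table can be derived from dependency edges with a standard topological leveling. The sketch below is an assumption about how such levels could be computed, not the planner's or executor's actual code. The wall-clock estimate further assumes level-synchronous execution with ~4 s per subtask and ignores scheduling overhead.

```python
def dag_levels(nodes: list[str], edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group nodes into levels; a node runs once all its dependencies are in earlier levels.

    `edges` are (before, after) pairs. Raises ValueError on a cycle.
    """
    deps = {n: set() for n in nodes}
    for before, after in edges:
        deps[after].add(before)
    levels: list[list[str]] = []
    done: set[str] = set()
    while len(done) < len(nodes):
        ready = sorted(n for n in nodes if n not in done and deps[n] <= done)
        if not ready:
            raise ValueError("cycle in dependency graph")
        levels.append(ready)
        done.update(ready)
    return levels


def wall_clock(levels: list[list[str]], duration: dict[str, float]) -> float:
    """Each level costs as long as its slowest node; levels run one after another."""
    return sum(max(duration[n] for n in level) for level in levels)


four_seconds = {"a": 4.0, "b": 4.0, "c": 4.0}  # assumed per-subtask cost

parallel = dag_levels(["a", "b"], [])
chain = dag_levels(["a", "b"], [("a", "b")])
pipeline = dag_levels(["a", "b", "c"], [("a", "c"), ("b", "c")])

print(parallel, wall_clock(parallel, four_seconds))  # [['a', 'b']] 4.0
print(chain, wall_clock(chain, four_seconds))        # [['a'], ['b']] 8.0
print(pipeline, wall_clock(pipeline, four_seconds))  # [['a', 'b'], ['c']] 8.0
```

Under these assumptions the estimates (4.0 s, 8.0 s, 8.0 s) line up with the measured 4.0 s, 8.1 s, and 8.1 s in the table.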

6. Key lessons

Let the work pick the topology — by observation, not prediction. A runtime selector (run parallel, probe the merge, keep it if clean else fall back to sequential) gives sequential's success on coupled work and ~2× efficiency on separable work, with no ex-ante guess. On all-coupled CooperBench it correctly always serializes (0/20 parallel) — the parallel win needs genuinely separable tasks, which this benchmark lacks.
Coordination problems live in the parallel regime, and the substrate already solves them. A clean detector (solo passes a feature, team breaks it) finds 0/28 coordination-caused losses for the sequential build-on-prior team but 7/28 for the parallel split+merge. The substrate is the dominant lever because it serializes on committed work and so has nothing to coordinate away; the open hard problem is loss-free parallelism.
A targeted runtime invariant beats generic coordination on coupled work. Publishing a per-feature regression check that teammates must keep green fixed the dominant coupled failure (later agent breaks earlier feature): +3/14 where those regressions occur, all confirmed as feature-1 F→P recoveries. But it only helps when that failure is present (a control set with 0 such regressions saw −1 from overhead) — diagnose the failure, then target it.
Separability can't be manufactured by planning — it has to exist in the task. Re-dividing coupled work into "disjoint" subtasks fails because the conflict-free partition needs the write-set before the work is done; predicted cuts stray (12/14 merges conflicted) or starve agents of cross-cutting edits. Decompose-by-write-set scored below solo on coupled pairs.
Multi-agent value is about separability, not coordination. The team beats solo on decoupled feature pairs (parallel, separable work); on the most-coupled pairs the advantage disappears entirely (team = solo) and coordination seams don't help. The build-on-prior substrate turns into interference when features share code. Corollary: the highest-value untried seam is the one that makes work separable (S1 region-partitioning / disjoint-file assignment), not the ones that enrich coordination — which is where the backlog had been aimed.
Shared code substrate is the dominant source of multi-agent value — independently reproduced (CooperBench's own ablation found the same). Most of the team>solo gain is agents working on one tree, not orchestration richness.
The remaining failures are agent-capability / spec-ambiguity bound, not seam-addressable. Recurring miss: a hidden test references an exact identifier (a field literally named Params) that the spec only describes ("URL parameters"). No team coordination invents a name that isn't in the spec — this is precisely why codex (stronger priors) outperforms mini-swe here.
Co-design > swapping in a stronger black-box agent. Holding the agent constant keeps gains attributable to the harness.
What lets a team beat solo is "build on teammates' work" — not which mechanism. Seed-commit and independent+integrate both reach 50%; the prompt-level seam tweaks (S2, S5–S8, ordering, combos) are all noise at this scale. The architecture matters; the garnish doesn't.
Validate before believing. A 60% n=10 result evaporated to 40% at n=20. Single-seed n=10 is far too noisy to adopt seam changes; confirm at n≥20–30.
Run agents massively parallel, eval serially. 9 concurrent agent runs caused zero rate-limiting; the only failures were eval-phase write gaps, fixed by single-process backfill.

7. Threats to validity

n = 10, single seed. A ±1-pair swing is within noise (~±15pp); S2's +1 is suggestive, not proven. Confirm at n≈30.
QEMU emulation: all but go_chi images are arm64-only, so runs on this amd64 host are emulated (efficiency numbers are inflated; pass/fail should be unaffected).
Missing evals: 2 runs had one pair whose eval.json didn't write (counted as a miss), slightly understating them.
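One way to sanity-check the ~±15pp noise figure is the binomial standard error of a pass rate. This treats each pair as an independent pass/fail trial with a true rate near 50%, which is an assumption; the report does not state how its noise estimate was derived.

```python
import math


def pass_rate_se(p: float, n: int) -> float:
    """Standard error of an observed pass rate under a binomial model."""
    return math.sqrt(p * (1 - p) / n)


for n in (10, 20, 30):
    se_pp = 100 * pass_rate_se(0.5, n)
    print(f"n={n}: one standard error = {se_pp:.1f}pp, one pair = {100 / n:.1f}pp")
# n=10: one standard error = 15.8pp, one pair = 10.0pp
# n=20: one standard error = 11.2pp, one pair = 5.0pp
# n=30: one standard error = 9.1pp, one pair = 3.3pp
```

Under this model a one-pair swing at n=10 (10pp) sits inside one standard error, and even at n=30 a single seed still carries roughly ±9pp of sampling noise.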

8. Reproduce

In the cooperagents repo:

scripts/measure.sh <label>                 # fixed-set solo+team (same agent/model/eval)
uv run python scripts/bench_compare.py --pairs <fixed 10> --team-only --teammate-context  # a seam variant (S2)
uv run python scripts/ablation_report.py --baseline cmp --candidates s8 s2 combo --setting team --judge

Self-improvement loop & backlog: docs/SELF_IMPROVEMENT_LOOP.md, docs/SEAM_BACKLOG.md in the CooperAgents repo.