CooperAgents — Unified Self-Evolving Harness

Generated 2026-06-08, finalized 2026-06-09 · system cooperagents (standalone unified team+agent harness) · agent mini-swe-agent (held constant) · model gpt-5.5 · evaluator cooperbench eval · benchmark: fixed 10 flash pairs · backend docker (arm64 via QEMU) · 1 seed
Hard constraint: every agent runs in its own container — agents never share a live workspace. The coordinated team seeds each fresh container with teammates' committed diff (git apply) so an agent still builds on prior work without sharing a running container. (Earlier "shared single container" results are superseded; see §5.)

cooperagents unifies the previously-separate team and agent harnesses into one orchestrator: agents and the supervisor share a single bus and one git working tree, and the agent loop (vendored mini-swe-agent) is editable in-repo. This lets us co-optimize the team layer and the agent loop together ("the seam") rather than treating the agent as a black box. CooperBench is used only as a task source and evaluator (never modified).

Why mini-swe, not codex. The objective is a better co-designed system, not a leaderboard number. A black-box agent (codex) forecloses co-design and makes any gain un-attributable to the harness. CooperBench's published codex numbers (solo 48% / team 62%) are context, not a target to game. We hold the agent constant (mini-swe) so every delta is attributable to the harness/seam.

0. Lead result — C1: a regression-guarded sequential handoff (method → result)

Be precise about "coordination." C1 is sequential: agents run one after another, each in its own container seeded with the prior's committed diff — no concurrency, no two-way negotiation. It is coordination only in the narrow sense that one agent's behavior is constrained by another's explicit contract (a published check). Every genuinely concurrent method we built underperformed (see the result table) — so on coupled work the coordination that helps is a sequential contract handoff, not parallelism.

Method — diagnose, then target. CooperBench couples its feature pairs on purpose, so we first asked where a team fails in a way a single agent would not. Reading the sequential team's per-feature grader output on coupled tasks, the dominant such failure is f1=F, f2=P: the second agent, building on the first's work, silently breaks the first agent's feature while implementing its own. C1 (regression-guarded handoff) targets exactly this — each agent, in its own container, publishes a runnable check for its feature into the shared tree (.cb_checks/, transferred to teammates via the committed substrate but stripped before grading); every later agent must run and keep all prior checks green while adding its feature. The published check is a one-way contract handed agent → agent — one step richer than the bare substrate (which hands over code with no obligation), but still sequential. It is not ex-ante partitioning (impossible — the write-set isn't known until the code is written) and not concurrent negotiation.

[Figure: UnifiedHarness, sequential team, own container per agent. C1 = build-on-prior substrate + a runtime invariant handed off agent → agent.]
Agent 1, feature 1 (own container): ① implement feature 1; ② publish .cb_checks/f1.py (runnable invariant); ③ run it to confirm feature 1 works; then git diff → δ1 (+ f1 check), commit.
Handoff: seed δ1 + invariant into Agent 2's container (git apply + commit).
Agent 2, feature 2 (own container, seeded): ① run .cb_checks/f1.py, which must stay green; ② add feature 2 without breaking feature 1; ③ publish .cb_checks/f2.py and re-run all checks; then git diff → δ1+δ2 (cumulative).
Integrated diff (δ1+δ2): strip .cb_checks/ + tests so the grader never sees them; cooperbench eval runs both hidden feature suites.
The published check is the coordination signal: it forces Agent 1 to verify its feature and lets Agent 2 prove it didn't break it.
C1 method: each agent publishes a runnable invariant check; the later agent must keep all prior checks green while adding its feature. The check is the coordination signal, communicated through the committed substrate.
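A minimal sketch of the guard step described above, under stated assumptions: each published check is a standalone Python script in `.cb_checks/` that exits 0 when its feature works. The function names (`run_prior_checks`, `all_green`) and the 300-second timeout are illustrative and are not taken from the cooperagents source.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_prior_checks(workspace: Path, checks_dir: str = ".cb_checks") -> dict[str, bool]:
    """Run every published check script and report pass/fail per script.

    Assumption: a check is a Python script that exits 0 when its feature works.
    """
    results: dict[str, bool] = {}
    for script in sorted((workspace / checks_dir).glob("*.py")):
        proc = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=workspace,
            capture_output=True,
            timeout=300,  # hypothetical limit, not from the source
        )
        results[script.name] = proc.returncode == 0
    return results


def all_green(results: dict[str, bool]) -> bool:
    """True only if no prior check regressed."""
    return all(results.values())


# Toy demonstration: one passing check and one failing check.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / ".cb_checks").mkdir()
    (ws / ".cb_checks" / "f1.py").write_text("assert 1 + 1 == 2\n")
    (ws / ".cb_checks" / "f2.py").write_text("raise SystemExit(1)\n")
    demo_results = run_prior_checks(ws)

print(demo_results, all_green(demo_results))
```

In this sketch a later agent would call `run_prior_checks` before and after its own edits and refuse to commit unless `all_green` holds.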

Result — the best result from an inter-agent contract, and it's sequential. On a regression-heavy coupled set C1 lifts the team 5/14 → 8/14, and every one of the +3 is the diagnosed failure reversing (feature 1 recovers F→P); no previously-passing pair is lost. It is genuinely targeted, not a blanket gain: on a control set with zero such regressions it is neutral-to-slightly-negative (overhead), and on the mixed flash set the sequential substrate alone is already best.

| set | sequential team | + C1 | Δ |
|---|---|---|---|
| coupled set-1 (regression-heavy) | 5/14 (36%) | 8/14 (57%) | +3 (all feature-1 F→P recoveries) |
| coupled set-2 (0 regressions) | 9/14 | 8/14 | −1 (overhead; nothing to fix) |
| flash-10 (mixed) | 5/10 | 4/10 | −1 (overhead) |

C1 helps iff the "later agent breaks an earlier teammate's feature" regression is present — a real fix for one specific hard-coordination failure, confirmed by it firing exactly when that failure occurs.

Concurrency did not help, contrary to the usual intuition. On the same coupled set-1, every genuinely concurrent design underperformed the sequential ones, because the parallel split reintroduces the integration losses that serialization avoids (Round 9): C1 sequential contract 8/14 > sequential substrate 5/14 = solo 5/14 > guarded-merge decompose (concurrent + contracts) 3/14 > plain parallel decompose 2/14. So the coordination that pays off on coupled work is a sequential contract handoff, not parallelism or live negotiation; we do not (yet) have a concurrent-coordination method that beats serializing on the committed substrate.

Success and efficiency vs baselines
A. Like-for-like success (same agent/model/eval): the unified team improves on solo (30% → 50%) and approaches the CooperBench team harness reference (60%). B. Success vs. efficiency on the coupled (hard) set: the kept-best sequential team and, in the upper-left "better" corner, the C1 regression-guard (higher pass-rate and faster on this set). Parallel decompose / guarded merge trade efficiency for losses.

0.1 The foundation C1 builds on — the own-container substrate

C1 is the sequential, own-container, build-on-prior team plus the published invariant. That underlying substrate is itself the dominant lever (solo 30% → team 50% on flash) and is coordination-safe (Round 9: 0 / 28 coordination-caused losses); C1 adds the one piece of live coordination that pays off on coupled work.

[Figure: UnifiedHarness architecture.]
Orchestrator: resolves N features → assignments (lead + members), sequences agents, accumulates the integrated diff.
TeamBus (orchestrator substrate): task list, messaging, spawn queue; exchanges task list / spawn / metrics with the orchestrator. Its messaging/spawn tools are wired to agents in builtin / spawn modes.
Agent 1, feature 1: own DockerEnv (task image, isolated); mini-swe loop (editable); LitellmModel; bash/file tools; git diff → δ1, then commit.
Handoff: seed δ1 (git apply + commit).
Agent 2, feature 2: own DockerEnv (fresh, seeded); mini-swe loop builds on δ1 (reuses teammate's public API); git diff → δ1+δ2 (cumulative).
Integrated diff = last cumulative (δ1+δ2): strip test sections → submitted in feature-slot 1; cooperbench eval runs both feature suites on the merged tree.
Kept-best harness: agents run in isolated containers and coordinate by seeding committed diffs (sequential build-on-prior), generalizing to N features (agent i+1 seeded with δ1..δi). The TeamBus serves the orchestrator (task list / spawn queue / metrics); its messaging & spawn tools are wired to agents only in the builtin-agent and dynamic-spawn modes — the kept-best mini-swe agents do not use the bus, they coordinate via the seeding.
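The seeding rule (agent i+1 starts from δ1..δi) can be illustrated with a toy in-memory model. This is a sketch only: there is no Docker and no git here, a workspace is a plain dict of path → contents, and `run_sequential_team`, `agent_one`, and `agent_two` are hypothetical names that do not come from the harness.

```python
from typing import Callable

Workspace = dict[str, str]  # path -> file contents (stand-in for a git tree)
Agent = Callable[[Workspace], Workspace]


def run_sequential_team(base: Workspace, agents: list[Agent]) -> Workspace:
    """Run agents one after another, each on a fresh copy of the committed state."""
    committed = dict(base)
    for agent in agents:
        seeded = dict(committed)         # fresh "container", seeded with prior commits
        committed = dict(agent(seeded))  # commit the cumulative result
    return committed


def agent_one(ws: Workspace) -> Workspace:
    ws["api.py"] = "def load(): ...\n"
    return ws


def agent_two(ws: Workspace) -> Workspace:
    # Builds on the teammate's committed API instead of re-inventing it.
    assert "api.py" in ws
    ws["cli.py"] = "from api import load\n"
    return ws


base_tree = {"README": "demo\n"}
integrated = run_sequential_team(base_tree, [agent_one, agent_two])
print(sorted(integrated))
```

Because each agent receives a copy, no two agents ever hold the same live workspace, which mirrors the hard constraint in spirit.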

1. Headline results

Same agent (mini-swe), model (gpt-5.5), evaluator, and 10 pairs. A pair passes only if both features' held-out suites pass on one merged tree.

| harness | success | per-feature | notes |
|---|---|---|---|
| CooperAgents solo (1 agent, own container) | 30% (3/10) | 60% | single-agent baseline |
| CooperBench team harness (reference) | 60% (6/10) | 65% | like-for-like reference (own containers + scratchpad) |
| CooperAgents unified team, OWN containers (seed+commit) | 50% (5/10) | 65–70% | current; satisfies the hard constraint |
| unified team, shared single container | 56% | 67% | SUPERSEDED (violated own-container constraint) |
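The two rates reported here measure different things: "success" counts pairs where both features pass on one merged tree, while "per-feature" counts individual feature passes. A small sketch of the arithmetic follows; the outcome list is made up for illustration (it is not the actual per-pair data) and `summarize` is a hypothetical helper.

```python
def summarize(outcomes: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (pair success rate, per-feature pass rate)."""
    pairs = len(outcomes)
    both_pass = sum(1 for f1, f2 in outcomes if f1 and f2)
    feature_passes = sum(int(f1) + int(f2) for f1, f2 in outcomes)
    return both_pass / pairs, feature_passes / (2 * pairs)


# Illustrative outcomes only: 3 pairs pass both features,
# 6 pairs pass exactly one, 1 pair passes neither.
example = [(True, True)] * 3 + [(True, False)] * 6 + [(False, False)]
success, per_feature = summarize(example)
print(success, per_feature)  # 0.3 0.6
```

The example shows how a 30% pair success rate can coexist with a 60% per-feature rate: many pairs get one feature right and miss the other.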

2. Hypotheses tested

| # | hypothesis | result | verdict |
|---|---|---|---|
| H1 | Unifying team+agent onto one shared bus + git tree enables co-design | Agents + supervisor share a TeamBus; agent loop editable in-repo | SUPPORTED |
| H2 | A team can spawn helper agents at runtime (capped) | Works (spawn_helper + supervisor); covered by tests + offline flash | SUPPORTED |
| H3 | A shared code substrate beats isolated per-agent patches | Isolated → merge conflicts → team below solo. Shared tree → conflicts gone → team 30%→56% | STRONGLY SUPPORTED |
| H4 | The unified harness can match CooperBench's team harness (same agent) | 56% vs 60% on the same 10 pairs | SUPPORTED |
| H5 | A stronger agent loop (toy JSON loop → mini-swe) raises the absolute level | solo 20%→30%, team 30%→56% | SUPPORTED |
| H6 (S5) | A team-level verify-and-fix integrator pass fixes remaining failures | go_chi probe 0/2→0/2, ~2× time, no change | REJECTED (failures were API-semantic, not build errors) |
| H7 (S8) | A spec-fidelity prompt fixes API mismatches | team 5/10 (= baseline) | REJECTED (misses are spec-ambiguous) |
| H8 (S2) | Live teammate context (show each agent the prior diff) improves integration | team 6/10 (+1, best variant) | SUPPORTED (suggestive, n=10) |
| H9 | Combining S8+S2 beats either alone | combo 5/10 (< S2's 6/10), but fastest | REJECTED (S8 adds nothing) |

3. Seam-improvement ablation — three metrics (own-container)

Same agent, same 10 pairs; only the team↔agent seam changes, all on the own-container + S2 baseline. Success (both-pass), Efficiency (avg wall-clock; lower better), Judge (mean LLM rubric 1–5). Reliable rows = 10/10 evals.

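| variant | Success | Efficiency (s) | Judge /5 | verdict |
|---|---|---|---|---|
| baseline (own-container + S2) | 5/10 (50%) | 686 | 3.40 | |
| S6 budget-80 | 3/10 | 786 | 3.00 | DROP |
| S7 guard | 4/10 | 870 | 3.80 | DROP |
| S8 spec-fidelity | 5/10 | 1085 | 3.40 | DROP |
| S5+S6 | 6/10 | 1434 | 3.70 | INCONCLUSIVE |
| S5+S8 | 5/10 | 1564 | 3.60 | DROP |
| S6+S7 | 6/10 | 937 | 3.80 | INCONCLUSIVE |
| all (S5+S6+S7) | 5/10 | 1342 | | DROP |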

The three metrics agree on the shape: the two 60% combos (S5+S6, S6+S7) also have the highest judge (3.70/3.80), a mild positive signal — but budget alone scored worst (30%, judge 3.00) and both winners are slower, so it's noise-consistent. No robust, attributable gain at n=10; only n≈30 could confirm S5+S6/S6+S7. Earlier shared-container S2/S8/combo numbers are superseded.

4. Line of evolution

Designs are added on top of what was kept; each is measured (success / efficiency / judge) and kept or rejected. KEEP stacks into the baseline; REJECT is recorded and dropped. The tree below reads top-to-bottom as the progressive build: first the architecture, then the substrate, then four distinct mechanism families of seam experiments layered on the kept baseline.

ARCHITECTURE
two-level design, toy JSON agent loop                 team < solo (uncoordinated)   REJECT
  └─ unify team+agent onto ONE shared bus + per-agent envs                            KEEP   (the seam)
       └─ + dynamic helper spawning (capped by --max-agents)                          KEEP   (capability)
            └─ + mini-swe worker  (real agent, HELD CONSTANT — never codex)           KEEP   (worker)

SUBSTRATE  (the one big lever)
                 └─ shared SINGLE container (build-on-each-other)       30%→56%        REJECT  ✗ violates own-container constraint
                      └─ OWN containers, each seeded w/ teammates' committed diff  →50%  KEEP  ★ BASELINE
                         │   ≈ CooperBench team (60%), 2× solo (30%); seed-commit ~2× faster
                         │   than independent+integrate at equal success/judge → kept.
                         │
   ┌─────────────────────┴─ four mechanism families layered on the ★ baseline ─────────────────────┐
   │
   ├─ FAMILY A · coordination-prompt seams  (state crossing team↔agent)
   │    ├─ S2 live teammate-diff context           6/10 @n10 (judge↑)   keep* marginal — new best on judge
   │    ├─ S8 spec-fidelity prompt                 5/10 = baseline       REJECT (misses are spec-ambiguous)
   │    ├─ S5 verify-fix · S6 budget · S7 git-guard 30–60%@n10 → 40%@n20 REJECT (n=10 noise; combos collapse)
   │    └─ reverse order · S3 contracts            50%                   REJECT (irrelevant / subsumed)
   │
   ├─ FAMILY B · post-hoc reviewer / repair passes  (extra LLM pass after the team) — analyst-generated
   │    └─ T3 completeness-review reviewer pass    6/10 @n10 → 10/20 = 10/20   REJECT double-edged (+2 fixed / −2 broken)
   │
   ├─ FAMILY C · in-loop workflow preambles  (change the SAME agent's loop, no extra pass) — analyst-generated
   │    ├─ T2 spec-derived TDD (self-verify first) 5/10 = baseline, 0 flips  REJECT (pure wash)
   │    └─ T4 convention-mining before editing     4/10 (baseline −1)        REJECT (mining eats step budget)
   │
   └─ FAMILY D · selection over run-to-run variance  (best-of-N)
        └─ T6 best-of-N + LLM-judge self-select    5/10 = baseline (+1 / −1 flip)  REJECT 2× cost; judge mis-selects

   ═══ 4 families → the SAME ~50% plateau · binding constraint = agent capability (mini-swe) ═══

CHANGE THE EVAL SET  (Round 6 — is the plateau a sampling artifact?)
  └─ rank pairs by GOLD-patch coupling (shared files×2 + shared symbols)        (scripts/select_coupled.py)
       └─ old fixed set was already 8/10 coupled → NOT a coupling-poor sample
       └─ pick 14 MOST-coupled pairs (lite pool, 9 repos)
            ├─ solo                       5/14 (36%)
            ├─ team — baseline substrate  5/14 (36%)   team = solo  ✗ multi-agent advantage GONE on coupled work
            └─ team + S2 coordination     4/14 (−1)    no help
       ★ INSIGHT: team wins come from SEPARABILITY, not coordination. Coupled features →
         build-on-prior substrate causes interference → cancels the parallelism benefit.

SEPARABILITY-AWARE ORCHESTRATION  (Round 7 — can we CREATE separability?)
  └─ planner re-cuts task into an independence-maximizing subtask DAG       (planner.py + _run_decomposed)
       ├─ file-level merge       OLD 30% · COUP 21%   ≈solo  CooperBench pairs always share files → merges to 1 agent
       └─ region-aware split      OLD 30% · COUP 14%   worse  splits 2–3 ways but 12/14 merges CONFLICT
       ★ LIMIT: a conflict-free partition needs the write-set BEFORE coding, but it's only
         known AFTER coding. Separability can't be manufactured ex-ante; it must exist in the task.

COORDINATION UNDER INTERDEPENDENCE  (Round 8 — coupling IS the target)
  └─ diagnose coupled failure: dominant = f1=F,f2=P (2nd agent BREAKS 1st agent's feature)
       └─ C1 regression-guard: each agent publishes a runnable check; later agents keep ALL green
            ├─ coupled set-1 (regression-heavy)  5/14 → 8/14   +3  all 3 = feature-1 F→P recovery
            ├─ coupled set-2 (0 regressions)     9/14 → 8/14   -1  nothing to fix → overhead
            └─ OLD-10 (mixed)                    5/10 → 4/10   -1  overhead
       ★ FIRST attributable coordination WIN on coupled work — TARGETED: helps iff the
         "later agent breaks earlier feature" regression is present. Diagnose failure → target it.

WHERE DO COORDINATION PROBLEMS LIVE?  (Round 9 — detector + fresh tasks)
  └─ detector: solo passes a feature but TEAM breaks it = coordination-CAUSED loss
       └─ harvest 28 pairs (coupled set-1 + 14 fresh, diverse repos):
            ├─ sequential-seed team   0/28 losses   coordination-SAFE  (serializes on committed work)
            └─ parallel decompose     7/28 losses   lossy  (split+merge breaks working features)
       ★ Coordination problems concentrate in the PARALLEL regime (the merge boundary). The
         substrate (build-on-prior) is dominant BECAUSE it eliminates them. Open hard problem =
         LOSS-FREE PARALLELISM (parallel speed without the 7/28 integration losses).

LOSS-FREE PARALLELISM  (Round 10 — guarded merge: decompose + published invariants)
  └─ each parallel branch publishes a check; guarded integrator repairs merge-broken features
       ├─ plain parallel decompose      2/14
       ├─ guarded-merge decompose       3/14   partial  recovers FF→PP (click f1_f3) but double-edged
       └─ sequential build-on-prior     5/14   still best
       ★ Guard repairs real merge losses but can't close the gap: residual unrepairable conflicts
         (Round 7 write-set limit) + overhead. Coupled work → serialize. Loss-free parallel only partial.

ADAPTIVE TOPOLOGY  (Round 11 — let the work decide, by observation)
  └─ run parallel → probe merge (git apply --check) → clean? keep parallel : fall back to sequential
       ├─ CooperBench (all coupled)   20/20 → fallback   no regression  ≈ sequential (parallel never engages)
       └─ separable work (controlled)  → parallel        1.99× faster  8.0s → 4.1s, both features intact
       ★ sequential's success on coupled + parallel's efficiency on separable, decided by conflict not prediction.

GENERALIZE  (Round 12 — agents choose the topology, domain-agnostic)
  └─ planner AGENT reasons about subtask deps (independent / needs-output / combine-at-end) → a DAG
       ├─ chain = sequential   ·   no edges = parallel   ·   diamond = parallel-then-join (pipeline)
       └─ executor runs the agent's DAG: parallel 4.0s · sequential 8.1s · pipeline 8.1s (2×~4s work)
       ★ topology decision + scheduling are domain-general; only the integration backend (git merge) is SWE.

How to read it: each FAMILY is a structurally different way to spend extra effort at the seam. The substrate (own-container, build-on-prior) is the one change that moved the needle (30%→50%); everything layered after it lands back on ~50%. Note the shared-single-container design briefly hit 56% but was rejected for violating the own-container constraint — re-run under the constraint the team holds 50%, so the substrate value transfers via committing teammates' diff into each isolated container. Families B and C were proposed by the harness's own LLM analyst reading its failures; D (best-of-N self-selection) is the last untried mechanism. D is the most telling negative: best-of-2 does expose run-to-run variance (≈1 pair flips each way), so headroom exists — but the LLM judge can't reliably pick the passing candidate (it mis-selected a pair baseline gets right). A selector good enough to net-gain would have to know which diff passes the hidden tests — i.e. it would have to be more capable than the agent, which is the constraint itself.
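The Round 12 entry in the tree above treats topology as a property of the subtask DAG: no edges means parallel, a chain means sequential, and a join adds one more stage. A sketch of that idea computes the wall-clock makespan as the longest dependency path. The 4-second durations are hypothetical values chosen to resemble the figures in the tree, and `makespan` is an illustrative helper, not the harness executor.

```python
from graphlib import TopologicalSorter


def makespan(durations: dict[str, float], deps: dict[str, set[str]]) -> float:
    """Longest-path finish time, assuming unlimited parallel workers.

    deps maps a subtask to the subtasks it must wait for; every name used in
    deps is assumed to also appear in durations.
    """
    graph = {node: deps.get(node, set()) for node in durations}
    finish: dict[str, float] = {}
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[node]), default=0.0)
        finish[node] = start + durations[node]
    return max(finish.values(), default=0.0)


two_tasks = {"a": 4.0, "b": 4.0}
parallel = makespan(two_tasks, {})             # no edges
chain = makespan(two_tasks, {"b": {"a"}})      # b waits for a
join = makespan({"a": 4.0, "b": 4.0, "join": 4.0}, {"join": {"a", "b"}})
print(parallel, chain, join)  # 4.0 8.0 8.0
```

Under these assumptions the parallel layout finishes in one task's time, while the chain and the parallel-then-join layout both take two stages.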

5. Experiment log (chronological)

Each round explores hypotheses on top of the current kept baseline; winners stack.

Round 0 — architecture

Unified the team and agent layers onto one TeamBus (task list + messaging + spawn queue) and a shared environment, so coordination is just more tools on the agent and the agent loop is editable in-repo. Added runtime helper spawning. Enables co-design; not a pass-rate experiment.

Round 1 — establish the team baseline (on the unified harness)

Round 1.5 — hard constraint: every agent in its OWN container (redo)

Round 2/3 — seam ablation on the own-container baseline (9 variants, parallel)

Nine variants run in parallel (no API rate-limiting; eval backfilled single-process). All on top of own-container + S2; only the named seam changes. Reliable rows = 10/10 evals.

| variant | Success | Efficiency (s) | verdict |
|---|---|---|---|
| baseline (own-container + S2) | 5/10 (50%) | 686 | |
| S6 budget-80 | 3/10 | 786 | DROP |
| S7 guard | 4/10 | 870 | DROP |
| S8 spec-fidelity | 5/10 | 1085 | DROP (tied, slower) |
| S5+S6 | 6/10 | 1434 | INCONCLUSIVE (uncorroborated, slower) |
| S5+S8 | 5/10 | 1564 | DROP |
| S6+S7 | 6/10 | 937 | INCONCLUSIVE (uncorroborated) |
| all (S5+S6+S7) | 5/10 | 1342 | DROP |

Finding — the loop has converged. Results scatter 30–60% around the 50% baseline, i.e. pure n=10 noise (±1–2 pairs). The two 60% combos aren't corroborated (budget alone scored 30%) and every "winner" is slower. No seam tweak gives an attributable gain at this scale. The dominant levers (shared substrate via seed-commit, real mini-swe agent) are already banked; the rest is noise + agent capability. Distinguishing these would require n≈30, not more tweaks.
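The "n=10 noise" reading can be checked with a short binomial calculation. The sketch below assumes, as an idealization, that each of the 10 pairs passes independently with probability 0.5 (the baseline rate), and asks how often a variant with no real effect would still score 6/10 or better, or 3/10 or worse. The helper names are illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k: int, n: int, p: float) -> float:
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def prob_at_most(k: int, n: int, p: float) -> float:
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


p_six_or_more = prob_at_least(6, 10, 0.5)   # about 0.377
p_three_or_less = prob_at_most(3, 10, 0.5)  # about 0.172
print(round(p_six_or_more, 3), round(p_three_or_less, 3))
```

Under this idealized model a no-effect variant reaches 6/10 or better more than a third of the time, so a single 60% result at n=10 cannot be separated from the 50% baseline.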

S3 (interface contracts) subsumed by seed-commit; S4 (spawn-as-tool) deferred to the dynamic/parallel mode.

Round 4 — design-space closure + n=20 confirmation

| experiment | Success | Efficiency (s) | Judge /5 | finding |
|---|---|---|---|---|
| baseline (seed-commit) | 5/10 (50%) | 686 | 3.40 | |
| no-seed (independent + integrator merge) | 5/10 (50%) | 1245 | 3.30 | DROP: ties on success/judge but ~2× slower |
| reverse feature order | 5/10 (50%) | 827 | 3.20 | DROP: ordering irrelevant |
| n=20 confirm: baseline | 10/20 (50%) | | | baseline stable across n=10 and n=20 |
| n=20 confirm: S6+S7 (the n=10 "60%") | 8/20 (40%) | | | REJECT: 60% was noise, collapses below baseline |

Definitive convergence. The n=20 confirmation kills the only live signal: S6+S7 drops from 60% (n=10) to 40% (n=20), below the stable 50% baseline. The three metrics also refine the design choice: seed-commit and independent+integrate tie on success (50%) and judge (3.3–3.4), but seed-commit is ~2× faster (686s vs 1245s — no extra integrator pass), so the kept design wins on the efficiency axis. Ordering is irrelevant. No seam tweak produces a real, attributable gain — the dominant levers (own-container substrate + real mini-swe agent) are fully banked. The coordination-prompt seam sub-space has converged — but see Round 5: an LLM analyst reopened the loop on the untested tool/workflow axis.

Metric coverage: Rounds 1–3 used all three metrics; the Round 4 tie cases (no-seed, reverse-order) now carry all three (above). The n=20 confirmation is success-only by design: S6+S7 regressed on pass-rate, which is a drop regardless of judge.

Round 5 — LLM analyst generates the next hypotheses (loop reopened)

An LLM analyst (eval/analyst.py) read the 5 failing pairs' real evidence — spec, submitted diff, and the grader's actual error — diagnosed each, and proposed new hypotheses. Key correction it surfaced: 4 of 5 failures are "incomplete implementation" (a whole feature silently omitted/half-done), which IS seam-addressable; only 1 is truly spec-ambiguous. So the earlier "converged" call applies only to coordination-prompt seams — the tool/workflow axis is open (raising effective agent capability via harness scaffolding is legitimate co-design; the agent model stays fixed).

| analyst-proposed hypothesis | class | targets |
|---|---|---|
| T1: API-contract extraction + AST/probe gate before submit | tool | wrong API, incomplete |
| T2: spec-derived acceptance tests (TDD) before coding | workflow | incomplete, wrong API |
| T3: feature-coverage ledger (block submit until each feature has evidence) | workflow | incomplete (4/5 failures) |
| T4: existing-test / convention mining step before editing | tool | incomplete, conflict |
| T5: ambiguity triage (prefer least-new API; flag SPEC-AMBIGUOUS) | capability | spec-ambiguous |

These were generated by the harness from its own failures — the loop's reflect→generate step, automated.

T3 result (completeness review). n=10: 6/10 vs baseline 5/10 (+1, flipping an incomplete-impl pair via its intended mechanism). But the n=20 confirmation washes it out: T3 scores 10/20, identical to the baseline's 10/20. Per-pair, T3 flips 2 fail→pass but regresses 2 pass→fail: the extra reviewer pass is double-edged (it fills omissions, but sometimes breaks working code), so it is net zero and slower. DROP. The n=10 +1 was noise (again).

T2 / T4 result (in-loop workflows). Because post-hoc passes are double-edged, the next two analyst hypotheses were built with a distinct mechanism — in-loop prompt preambles that change what the same agent does within its own budget, adding no extra pass. T2 (spec-derived TDD: derive acceptance criteria + run throwaway local checks before submitting) scored 5/10 = baseline, zero pair flips — a pure wash. T4 (convention mining: inspect existing tests/usages before editing) scored 4/10, baseline −1 — the mining step spends step-budget on exploration and costs a pair. DROP both.

T6 result (best-of-N self-selection). The last untried mechanism: run the whole isolated team twice (own containers each), then the LLM judge — pairwise, both orders, no hidden-grader access — selects the candidate diff to submit. 5/10 = baseline, but with a flip each way (gains 1706 f5_f8, loses 1655 f6_f7) at 2× compute. DROP. This is the most informative negative: best-of-2 does surface run-to-run variance, so the headroom is real — but the judge mis-selected a pair the baseline gets right. A selector good enough to net-gain would need to know which diff passes the hidden tests, i.e. be more capable than the agent — the constraint itself.
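A sketch of the pairwise, both-orders selection step may help make the mechanism concrete. Everything here is an assumption for illustration: the `Judge` signature (returns 0 if the first diff shown is preferred, 1 if the second), the stub judges, and in particular the tie-break (fall back to the first candidate when the two orderings disagree), which the text does not specify.

```python
from typing import Callable

# Hypothetical judge interface: 0 = prefer the first diff shown, 1 = the second.
Judge = Callable[[str, str], int]


def select_best_of_two(diff_a: str, diff_b: str, judge: Judge) -> str:
    """Ask the judge in both presentation orders to cancel position bias."""
    a_wins_when_first = judge(diff_a, diff_b) == 0
    a_wins_when_second = judge(diff_b, diff_a) == 1
    if a_wins_when_first and a_wins_when_second:
        return diff_a
    if not a_wins_when_first and not a_wins_when_second:
        return diff_b
    # Verdicts disagree across orderings: assumed fallback to the first candidate.
    return diff_a


def longer_is_better(first: str, second: str) -> int:
    """Stub judge with a consistent (if naive) preference."""
    return 0 if len(first) >= len(second) else 1


def always_first(first: str, second: str) -> int:
    """Stub judge with pure position bias."""
    return 0


picked = select_best_of_two("short", "a longer diff", longer_is_better)
biased = select_best_of_two("short", "a longer diff", always_first)
print(picked, "|", biased)
```

Note that consistency across orderings only removes position bias. It does not make the judge correct, which is the failure the T6 result reports.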

Four-family convergence. The ~50% plateau now holds across four distinct mechanism families: (a) coordination-prompt seams (Rounds 2–4), (b) post-hoc reviewer/repair passes (T3/S5), (c) in-loop workflow preambles (T2/T4), and (d) selection over variance (T6). None lifts it — post-hoc passes regress as much as they fix, in-loop preambles are wash-or-harmful (extra exploration eats the step budget), and selection is bottlenecked on an oracle no cheaper than the agent. This is strong, mechanism-diverse evidence that the binding constraint is genuinely agent capability for mini-swe at this scale, not any single untried seam (the remaining T1/T5 are variants of already-washed families). The honest stopping point: 50% plateau, ≈ CooperBench team (60%), 2× solo (30%); the next real lever is the agent, not the seam.

Round 6 — change the eval set: a coordination-sensitive benchmark

Hypothesis (from the question "is the plateau a sampling artifact of the fixed 10 pairs?"): maybe that set isn't where coordination matters, so seams have nothing to grab. To test it directly we built a graded coupling score from the gold solution patches (scripts/select_coupled.py): shared files (×2) + shared changed symbols. First surprise: the old fixed set was already 8/10 gold-conflict (coupling-heavy, above the 74% base rate), so the plateau was not from a coupling-poor sample. We then selected the 14 most-coupled pairs across the 100-pair lite pool (9 repos, cap 2/repo, 6 repos the old set never touched — e.g. dirty_equals/43 sharing 2 files +3 symbols, the click/2800 family sharing _termui_impl.py+termui.py).
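The coupling score and the capped selection can be sketched as follows. This is not the contents of scripts/select_coupled.py: the function names, the tuple layout, and the sample rows are hypothetical, and only the formula (shared files ×2 + shared changed symbols) and the per-repo cap of 2 come from the text.

```python
from collections import Counter


def coupling_score(files_a: set[str], files_b: set[str],
                   symbols_a: set[str], symbols_b: set[str]) -> int:
    """Shared files count double; shared changed symbols count once."""
    return 2 * len(files_a & files_b) + len(symbols_a & symbols_b)


def select_most_coupled(scored: list[tuple[str, str, int]], k: int,
                        cap_per_repo: int = 2) -> list[tuple[str, str, int]]:
    """Pick the k highest-scoring (repo, pair_id, score) rows, capped per repo."""
    taken: Counter[str] = Counter()
    chosen: list[tuple[str, str, int]] = []
    for repo, pair_id, score in sorted(scored, key=lambda row: row[2], reverse=True):
        if len(chosen) >= k:
            break
        if taken[repo] < cap_per_repo:
            chosen.append((repo, pair_id, score))
            taken[repo] += 1
    return chosen


# A pair sharing 2 files and 3 symbols scores 2*2 + 3 = 7.
score = coupling_score({"a.py", "b.py", "c.py"}, {"a.py", "b.py"},
                       {"f", "g", "h", "x"}, {"f", "g", "h"})

# Hypothetical rows: the third row from repo_x is skipped by the cap.
rows = [("repo_x", "p1", 9), ("repo_x", "p2", 8), ("repo_x", "p3", 7), ("repo_y", "p4", 5)]
top = select_most_coupled(rows, k=3)
print(score, top)
```

The cap keeps one heavily coupled repository from dominating the selected set.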

| arm (14 most-coupled pairs) | success | note |
|---|---|---|
| solo (1 agent) | 5/14 (36%) | single-agent baseline |
| team, baseline substrate | 5/14 (36%) | = solo; team advantage gone |
| team + S2 coordination seam | 4/14 (29%; −1 vs. team baseline) | no help |

The result inverts the coordination intuition — a strong negative. On the old mixed set team (50%) doubled solo (30%); on the most-coupled set the gap vanishes (team 36% = solo 36%), and it vanishes because the team regressed (50%→36%) while solo held (~30→36%). The explicit coordination seam doesn't recover it. Interpretation: the multi-agent win is about separability, not coordination. When two features touch the same files/symbols, the sequential build-on-prior substrate creates interference (the second agent inherits and must reconcile edits to shared code) that cancels the parallelism benefit — and a single agent holding the whole coupled problem in one context does just as well. So a harder-via-coupling set does not unlock seam value; it removes the team advantage. The actionable lever for multi-agent gains is partitioning work to be separable (S1 region-partitioning), not richer coordination — the opposite of where the seam backlog was pointed.

Round 7 — separability-aware orchestration (decompose by write-set)

If multi-agent value needs separability (Round 6), build an orchestrator that creates it: a PLANNER (cooperagents/planner.py) re-cuts the task into a dependency DAG of subtasks chosen to maximize independence; the harness (_run_decomposed) runs independent subtasks in parallel (own containers, seeded only along DAG edges) and merges the branch deltas. Two versions: (a) file-level (merge work that shares a file) and, after the insight that conflict is region-level not file-level, (b) region-aware — split same-file work by function, give each agent a disjoint owns write-set, enforce it as an ownership boundary in the prompt. 84/100 lite pairs are the "win zone" (share a file, disjoint symbols) where (b) should pay off.

| orchestration | OLD-10 | COUPLED-14 | note |
|---|---|---|---|
| solo | 30% | 36% | one agent, whole task |
| sequential team (build-on-prior) | 50% | 36% | best on OLD |
| decompose, file-level merge | 30% | 21% | collapses to ≈solo |
| decompose, region-aware split | 30% | 14% | worse; 12/14 merges conflicted |

The planner behaves correctly — file-level merges ~every pair (CooperBench pairs are coupled by construction: both features from the same PR, always sharing files), so decompose collapses to ≈solo and loses the sequential team's 50%. Region-aware does split (2–3 parallel subtasks/pair) but scores worse, and 12/14 merges still conflicted (.rej / conflict markers). The deep reason — a fundamental limit, not a bug: a conflict-free partition needs each subtask's write-set before the code is written, but the true write-set is only known after doing the work. The planner predicts regions from the spec; real features need cross-cutting edits it can't foresee (a shared import, a new field, a registration line), so agents either stray outside their lane → conflicts, or obey ownership → incomplete features. You cannot manufacture separability by ex-ante partitioning. On intrinsically-coupled work, one agent holding the whole context beats any predicted cut.
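The write-set limit can be made concrete with a small sketch that compares the planner's predicted ownership with what agents actually wrote. The region labels, agent names, and helper functions below are hypothetical; the only idea taken from the text is that cross-cutting edits (such as a shared import) fall outside every predicted lane and collide at merge time.

```python
def out_of_lane(owns: dict[str, set[str]],
                written: dict[str, set[str]]) -> dict[str, set[str]]:
    """Regions each agent wrote that were not in its predicted write-set."""
    strays: dict[str, set[str]] = {}
    for agent, regions in written.items():
        extra = regions - owns.get(agent, set())
        if extra:
            strays[agent] = extra
    return strays


def contested_regions(written: dict[str, set[str]]) -> dict[str, set[str]]:
    """Regions written by more than one agent, i.e. likely merge conflicts."""
    writers: dict[str, set[str]] = {}
    for agent, regions in written.items():
        for region in regions:
            writers.setdefault(region, set()).add(agent)
    return {region: agents for region, agents in writers.items() if len(agents) > 1}


# Predicted (ex-ante) ownership: disjoint by construction.
owns = {"agent1": {"mod.py::func_a"}, "agent2": {"mod.py::func_b"}}
# Actual (ex-post) writes: both agents also needed the shared import block.
written = {
    "agent1": {"mod.py::func_a", "mod.py::imports"},
    "agent2": {"mod.py::func_b", "mod.py::imports"},
}
strays = out_of_lane(owns, written)
conflicts = contested_regions(written)
print(strays, conflicts)
```

The predicted partition is conflict-free, yet the actual writes are not, which is the gap between ex-ante planning and ex-post write-sets described above.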

Round 8 — coordination under interdependence (the actual hard problem)

Reframe: CooperBench couples its pairs on purpose — the coupling IS the research target, so the lever must be coordination DURING execution, not partitioning before it. Diagnosis (from the sequential team's per-feature eval on coupled pairs): the dominant failure is f1=F, f2=P — the second agent, building on the first's work, silently breaks the first agent's feature while implementing its own. Mechanism (C1 — regression-guarded coordination): each agent PUBLISHES a runnable check for its own feature into the shared tree (.cb_checks/, transferred to teammates via the substrate but stripped before grading); later agents must run and keep ALL prior checks green. Coordination via a preserved invariant communicated agent→agent at runtime.

| set | sequential team | + C1 regression-guard | Δ |
|---|---|---|---|
| coupled set-1 (regression-heavy) | 5/14 (36%) | 8/14 (57%) | +3 (all 3 are feature-1 recovering F→P) |
| coupled set-2 (0 regressions) | 9/14 (64%) | 8/14 (57%) | −1 (overhead; nothing to fix) |
| OLD-10 (mixed) | 5/10 (50%) | 4/10 (40%) | −1 (overhead) |

The first attributable coordination win on coupled work — and precisely targeted. On set-1 the +3 is exactly the diagnosed regressions flipping (feature-1 goes F→P), no previously-passing pair lost. The replication on set-2 is the honest control: set-2 had zero f1=F,f2=P pairs, so C1 had nothing to fix and its check-writing overhead cost one pass. C1 helps iff the set contains "later agent breaks earlier teammate" regressions — a real solution to one specific hard-coordination failure mode, not a universal gain. This is the opposite lesson from the coordination seams of Rounds 2–4: a coordination mechanism does move the needle on coupled work when it targets a diagnosed failure with a verifiable runtime invariant (vs. a generic "reuse names" prompt). Next: make C1 adaptive (invoke only when regression risk is present) or cheaper (use the repo's existing tests as the invariant instead of agent-authored checks) to keep the set-1 win without the set-2 overhead.

Hunting the next coordination failure mode — and finding the boundary. Applying the same recipe (diagnose → targeted mechanism) to the other coupled failures (f1=P,f2=F and f1=F,f2=F), the real grader errors are per-feature CORRECTNESS bugs, not coordination: pillow/290 fails a palette-sort assertion and a missing ValueError; llama_index/18813 gets max-bytes and audio-bytes resolution wrong; dspy/8394 implements feature 1 fine but feature 2's compression simply doesn't reduce size. Each feature fails on its own merits (no "agent 2 broke agent 1", no missing-interface, no merge conflict), and solo fails these too — so they are agent-capability bound, not seam-addressable. Conclusion: C1 captures essentially the entire coordination-addressable surface of coupled work (the regression mode); the residual is capability. That is the useful boundary the hard benchmark was built to expose — coordination's job here is to stop the team from creating failures (regressions) that a solo agent wouldn't, and once that's done, what's left is the agent.
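The diagnosis step used throughout this round (tally per-pair outcome patterns, then target the dominant failing one) can be sketched in a few lines. The pair ids and outcomes below are invented for illustration, and `pattern`, `tally`, and `dominant_failure` are hypothetical helper names.

```python
from collections import Counter
from typing import Optional


def pattern(f1_pass: bool, f2_pass: bool) -> str:
    """Encode a pair outcome in the report's notation, e.g. 'f1=F,f2=P'."""
    return f"f1={'P' if f1_pass else 'F'},f2={'P' if f2_pass else 'F'}"


def tally(outcomes: dict[str, tuple[bool, bool]]) -> Counter:
    return Counter(pattern(f1, f2) for f1, f2 in outcomes.values())


def dominant_failure(outcomes: dict[str, tuple[bool, bool]]) -> Optional[str]:
    """Most common failing pattern, or None if every pair passes."""
    counts = tally(outcomes)
    counts.pop("f1=P,f2=P", None)
    return counts.most_common(1)[0][0] if counts else None


# Invented outcomes, for illustration only.
outcomes = {
    "pair-a": (False, True),
    "pair-b": (False, True),
    "pair-c": (True, False),
    "pair-d": (True, True),
}
print(tally(outcomes), dominant_failure(outcomes))
```

A tally like this only locates the pattern. Deciding whether it is coordination-caused still requires reading the grader errors, as the paragraph above does.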

Round 9 — where do coordination problems actually live? (harvest new tasks)

To find more — and more kinds of — coordination problems, we need a detector and fresh tasks. Detector: a failure is coordination-caused iff solo passes a feature but the team breaks it (the team setup created a failure a single agent wouldn't have). We harvested 14 fresh pairs (diverse repos, none reused) running solo + sequential-team + parallel-decompose, and mined every solo-passes-but-team-breaks instance — combined with coupled set-1, 28 pairs total.
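The detector rule above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the result format (one "P"/"F" flag per feature, keyed by pair) and the function name are assumptions, and the second pair in the demo is made up.

```python
from typing import Dict, List, Tuple

# Assumed result format: pair id -> one "P"/"F" flag per feature, e.g. "PF".
Results = Dict[str, str]

def coordination_losses(solo: Results, team: Results) -> List[Tuple[str, int]]:
    """Return (pair, feature_index) for every feature solo passes but the team fails."""
    losses = []
    for pair, solo_flags in solo.items():
        team_flags = team.get(pair)
        if team_flags is None:
            continue  # no team run for this pair, nothing to compare
        for i, (s, t) in enumerate(zip(solo_flags, team_flags)):
            if s == "P" and t == "F":
                losses.append((pair, i))
    return losses

# click/2068 f1_f4 mirrors the solo[PP] -> dec[PF] case reported in this round;
# "example/0 f1_f2" is a hypothetical pair that fails the same way in both setups.
solo = {"click/2068 f1_f4": "PP", "example/0 f1_f2": "PF"}
dec = {"click/2068 f1_f4": "PF", "example/0 f1_f2": "PF"}
print(coordination_losses(solo, dec))
```

A feature that fails in both settings (the hypothetical pair) is not counted: by this rule it is capability-bound, not coordination-caused.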

team design | coordination-caused feature losses (of 28 pairs)
sequential-seed (build-on-prior) | 0 (coordination-safe)
parallel decompose (split + merge) | 7 (set-1: 5, fresh: 2)

Replicated cleanly: the sequential team broke nothing solo got right across 28 pairs, while the parallel design broke 7 working features (e.g. click/2068 f1_f4, llama/17244 f2_f6 went solo[PP]→dec[PF]). So coordination problems are real and findable — and they concentrate in the PARALLEL regime, at the merge/integration boundary. This unifies the whole report: the dominant lever (the own-container build-on-prior substrate, Round 1) is dominant precisely because it eliminates coordination losses — it serializes on committed work, so there are no conflicts and no regressions to coordinate away. The sequential substrate is the de-facto solution to coordination on coupled tasks. The genuine open hard problem is therefore loss-free parallelism: getting the speed of the parallel split without its 7/28 integration losses — which is exactly where C1's published-invariant idea should transfer (each branch publishes a check; the integrator must land a merge that keeps all of them green, repairing conflicts).

Round 10 — loss-free parallelism (guarded merge)

Built it: decompose + preserve_invariants → each parallel branch publishes a runnable check; after the mechanical merge, a guarded integrator runs every check and repairs whatever the merge broke until all pass (C1's invariant idea moved onto the parallel integrator). Tested on coupled set-1, where plain parallel decompose was lossy (2/14 passed; 12/14 merges conflicted).
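The guarded-integrator loop can be sketched as follows. This is a toy model under stated assumptions: the workspace is a plain dict, checks and the repair step are hypothetical callables, and the max_rounds budget is added here for safety (the description above says only "until all pass").

```python
from typing import Callable, Dict, List

Check = Callable[[dict], bool]               # a published invariant: workspace -> passes?
Repair = Callable[[dict, List[str]], None]   # mutates the workspace given failing check names

def guarded_integrate(workspace: dict, checks: Dict[str, Check],
                      repair: Repair, max_rounds: int = 3) -> List[str]:
    """Run every check after the merge; repair until all pass or the budget runs out.

    Returns the names of checks that are still failing (empty list on success).
    """
    for _ in range(max_rounds):
        failing = [name for name, check in checks.items() if not check(workspace)]
        if not failing:
            return []
        repair(workspace, failing)
    return [name for name, check in checks.items() if not check(workspace)]

# Toy demo: the mechanical merge left feature f1 broken; the repair step fixes it.
ws = {"f1": False, "f3": True}
checks = {"f1": lambda w: w["f1"], "f3": lambda w: w["f3"]}

def toy_repair(w: dict, failing: List[str]) -> None:
    for name in failing:
        w[name] = True

remaining = guarded_integrate(ws, checks, toy_repair)
print(remaining)
```

Returning the still-failing checks rather than raising lets the caller decide whether to ship a partial merge or fall back to another design.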

design (coupled set-1) | pass
sequential build-on-prior (kept best) | 5/14
solo | 5/14
parallel decompose, guarded merge | 3/14
parallel decompose, plain | 2/14

The mechanism demonstrably works: the guarded integrator recovered merge-broken features, including click/2800 f1_f3 going FF→PP — the worst plain-decompose loss, where the split+merge had broken both features. But it is double-edged (repair overhead cost jinja f8_f9) and nets only 3/14 — still below the sequential substrate's 5/14. Two reasons it can't close the gap: residual conflicts the integrator can't fully repair (the same ex-ante write-set-prediction limit as Round 7), and the repair's own overhead. Conclusion: loss-free parallelism is only partially achievable; on intrinsically-coupled work, sequential build-on-prior remains the best design. Parallelism + guarding narrows the integration-loss gap but does not overcome the structural advantage of serializing on committed work.

Round 11 — adaptive topology: the harness picks sequential vs parallel by observation

Who decides: the harness, by a fixed rule — not the agents. There is no LLM judgment call ("this looks separable, go parallel"); the agents only implement their features. The topology adapts to the work (via an observed merge conflict), but the chooser is orchestrator code, not an agent. A genuinely agent-chosen topology is a separate, unbuilt mechanism (the ex-ante decompose planner is the closest, and it over-merges unreliably — Round 7).

Real multi-agent workloads are diverse: some tasks are hard-to-separate (CooperBench), some are easily separable. So instead of a fixed topology, the harness selects one per task at runtime, by observation, not by ex-ante prediction (which is impossible; the write-set isn't known until the code is written, Round 7). Method (a deterministic harness policy): run the features in parallel from base (each publishing an invariant check); probe the merge (git apply --check + run the checks); if the branches compose cleanly → keep the parallel result (fast); if they collide → fall back to the sequential build-on-prior handoff (safe), reusing branch 1. The observed conflict is the decision.

[Figure: Adaptive topology: decide by observation, not prediction.
① Run features in PARALLEL from base, own container each; each publishes an invariant check.
② Merge probe: git apply --check + run published checks. Collide? NO → KEEP PARALLEL (fast): separable work, wall-clock ≈ max(t₁, t₂). YES → SEQUENTIAL fallback (safe): coupled work, build-on-prior handoff.
Measured behaviour. CooperBench: 30/30 → fallback (incl. 3-way-merge probe); success ≈ sequential, no regression. Parallel never engages, even on the 25% gold-"not-entangled" pairs: independent agents over-touch shared scaffolding, so gold-separability ≠ agent-separability. Separable work (controlled): → parallel; 2 agents × ~4s work, disjoint files: 8.0s sequential → 4.1s adaptive = 1.99× faster.]
Adaptive selector: parallel-probe → keep-parallel on separable work, sequential-fallback on coupled work. The conflict signal decides — no prediction.
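The selector policy can be written as a small deterministic function. This is a sketch under assumptions, not the orchestrator's code: Branch, choose_topology, and the three callables are hypothetical names; in the real harness the probe would wrap git apply --check and the published checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Branch:
    feature: str
    diff: str  # committed diff produced in the agent's own container

def choose_topology(
    branches: List[Branch],
    merge_applies: Callable[[List[Branch]], bool],          # stand-in for the merge probe
    checks_pass: Callable[[List[Branch]], bool],            # runs every published invariant check
    sequential_fallback: Callable[[Branch], List[Branch]],  # build-on-prior, reusing branch 1
) -> Tuple[str, List[Branch]]:
    """Keep the parallel result if the branches compose; otherwise serialize."""
    if merge_applies(branches) and checks_pass(branches):
        return ("parallel", branches)
    return ("sequential", sequential_fallback(branches[0]))

# Toy demo with stubbed probes (no git involved).
bs = [Branch("f1", "diff-1"), Branch("f2", "diff-2")]

def fallback(first: Branch) -> List[Branch]:
    # Hypothetical: the second agent re-runs on top of the first branch's diff.
    return [first, Branch("f2", "diff-2-on-top-of-1")]

clean = choose_topology(bs, lambda b: True, lambda b: True, fallback)
collided = choose_topology(bs, lambda b: False, lambda b: True, fallback)
print(clean[0], collided[0])
```

The decision is a pure function of the observed probe outcome, which is what keeps the chooser in orchestrator code rather than in an agent.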

Result — it adapts correctly; the win is on efficiency where work is separable. On a controlled separable task (two agents, disjoint files, ~4 s each) the selector takes the parallel path and is 1.99× faster (8.0 s → 4.1 s) with both features intact. On CooperBench it falls back on all 30 pairs tested — success holds at sequential levels (no regression), but the parallel path never engages.

Gold-separability ≠ agent-separability (the key correction). CooperBench is ~75% entangled / 25% not (gold-conflict report: 76.5% of gold-patch pairs conflict). One would expect adaptive to parallelize the 25%. It doesn't — and this is not a too-strict probe: switching the merge test from git apply --check to a real base-aware 3-way merge (verified to return PARALLEL on genuinely disjoint diffs) still fell back on 10/10 gold-"not-entangled" pairs. The reason: the 75/25 split is a property of the minimal gold patch; an LLM agent implementing a feature also touches shared scaffolding (imports, exports, registration, adjacent edits), so two independent agent implementations collide even when the gold patches don't. Agent-level separability is far rarer than gold-level. To exploit the 25% you'd have to stop agents touching the shared scaffolding independently — i.e. coordinate on it (back to the coupled problem) or constrain each to its region (the decompose/ownership route, which ships incomplete features — Round 7).
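A toy illustration of the gap between the two notions of separability. It uses touched-file overlap as a coarse stand-in for the real probe (which works at hunk level via a 3-way merge), and every path below is hypothetical.

```python
def separable(touched_a: set, touched_b: set) -> bool:
    """Coarse file-level proxy: two patches are separable if they touch no common file."""
    return touched_a.isdisjoint(touched_b)

# Minimal gold patches: each feature lives in its own file.
gold_a = {"src/feature_a.py"}
gold_b = {"src/feature_b.py"}

# Agent implementations: same core files, plus shared scaffolding
# (here a hypothetical package __init__ where both register an export).
agent_a = gold_a | {"src/__init__.py"}
agent_b = gold_b | {"src/__init__.py"}

print(separable(gold_a, gold_b))    # gold-level separability
print(separable(agent_a, agent_b))  # agent-level separability
```

The gold patches compose, yet the agent patches collide on the shared file, which is the pattern the 10/10 fallback result points to.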

Net: the adaptive selector gives sequential's success on coupled work (safety, no regression) and parallel's efficiency on separable work (≈2×), choosing by observed conflict rather than a guess — a Pareto-safe default for diverse workloads. On CooperBench it (correctly) always serializes, because even the 25% gold-separable pairs are not separable once real agents implement them.

Round 12 — generalize: the agents choose the topology (domain-agnostic)

The Round-11 selector was SWE-bound (it decided via git merge conflicts) and harness-driven. Round 12 makes the decision (a) made by an agent and (b) domain-general. A planner agent (planner.plan_topology) reasons about subtask relationships in task-agnostic terms — independent → parallel; needs-another's-output → sequential edge; combine-at-end → fan-in — and emits a dependency DAG. That one abstraction expresses every topology the question raised: a chain is sequential, no edges is parallel, a diamond is parallel-then-sequential (pipeline), and arbitrary DAGs are hybrid. The executor runs whatever the agent picks; only the final integration/conflict-check is SWE-specific (git 3-way merge) and is a pluggable backend — the planning and scheduling are not.

agent-chosen topology | DAG (levels) | wall-clock (2×~4s work)
parallel (independent) | [[a,b]] | 4.0s
sequential (chain) | [[a],[b]] | 8.1s
pipeline (parallel→join) | [[a,b],[c]] | 8.1s

The planner agent decides the structure (verified: it picks parallel / sequential / pipeline from domain-agnostic reasoning); the harness executes it with the matching wall-clock. So "the agents decide sequential vs parallel vs parallel-then-sequential" is now a general capability — CooperBench (code) is just one instantiation, where the integration backend happens to be git.
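How a dependency DAG maps to the levels and wall-clock figures in the table can be sketched with a standard topological leveling. This is an illustration, not planner.plan_topology itself: the input format (task → set of prerequisites), the function names, and the flat 4.0 s per task are assumptions.

```python
from typing import Dict, List, Set

def dag_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into levels; every task depends only on tasks in earlier levels."""
    remaining = {task: set(d) for task, d in deps.items()}
    levels: List[List[str]] = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return levels

def wall_clock(levels: List[List[str]], seconds: Dict[str, float]) -> float:
    """Tasks within a level run concurrently, so each level costs its slowest task."""
    return sum(max(seconds[t] for t in level) for level in levels)

secs = {"a": 4.0, "b": 4.0, "c": 4.0}  # assumed ~4 s of work per task
parallel = dag_levels({"a": set(), "b": set()})
chain = dag_levels({"a": set(), "b": {"a"}})
pipeline = dag_levels({"a": set(), "b": set(), "c": {"a", "b"}})

for name, lv in [("parallel", parallel), ("chain", chain), ("pipeline", pipeline)]:
    print(name, lv, wall_clock(lv, secs))
```

Under this model the estimates are 4.0 s, 8.0 s and 8.0 s, in line with the measured 4.0 s, 8.1 s and 8.1 s once scheduling overhead is allowed for.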

6. Key lessons

7. Threats to validity

n = 10, single seed. A ±1-pair swing is within noise (~±15pp); S2's +1 is suggestive, not proven; confirm at n≈30.
QEMU emulation: all but go_chi images are arm64-only, so runs on this amd64 host are emulated (efficiency numbers are inflated; pass/fail should be unaffected).
Missing evals: 2 runs had one pair whose eval.json didn't write (counted as a miss), slightly understating them.
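One plausible way to arrive at a noise band of that size is the binomial standard error of a pass rate. This is an assumption: the report does not say how ~±15pp was derived, and p = 0.5 is taken as the worst case.

```python
import math

def pass_rate_se_pp(p: float, n: int) -> float:
    """Binomial standard error of a pass rate, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

one_pair_pp = 100.0 / 10            # a single pair on n = 10 moves the rate by 10pp
se_n10 = pass_rate_se_pp(0.5, 10)   # worst-case p = 0.5 (assumed)
se_n30 = pass_rate_se_pp(0.5, 30)

print(one_pair_pp, round(se_n10, 1), round(se_n30, 1))
```

On this reading a one-pair swing (10pp) sits inside one standard error at n = 10 (about 15.8pp), and moving to n≈30 narrows the band to roughly 9pp.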

8. Reproduce

In the cooperagents repo:

scripts/measure.sh <label>                 # fixed-set solo+team (same agent/model/eval)
uv run python scripts/bench_compare.py --pairs <fixed 10> --team-only --teammate-context  # a seam variant (S2)
uv run python scripts/ablation_report.py --baseline cmp --candidates s8 s2 combo --setting team --judge

Self-improvement loop & backlog: docs/SELF_IMPROVEMENT_LOOP.md, docs/SEAM_BACKLOG.md in the CooperAgents repo.