git apply) so an agent still builds on prior
work without sharing a running container. (Earlier "shared single container" results are
superseded; see §5.)
cooperagents unifies the previously-separate team and
agent harnesses into one orchestrator: agents and the supervisor share a
single bus and one git working tree, and the agent loop (vendored mini-swe-agent) is editable
in-repo. This lets us co-optimize the team layer and the agent loop together
("the seam") rather than treating the agent as a black box. CooperBench is used only as a task
source and evaluator (never modified).
Why mini-swe, not codex. The objective is a better co-designed system, not a leaderboard number. A black-box agent (codex) forecloses co-design and makes any gain un-attributable to the harness. CooperBench's published codex numbers (solo 48% / team 62%) are context, not a target to game. We hold the agent constant (mini-swe) so every delta is attributable to the harness/seam.
Method — diagnose, then target. CooperBench couples its feature pairs on purpose, so we
first asked where a team fails in a way a single agent would not. Reading the sequential team's
per-feature grader output on coupled tasks, the dominant such failure is f1=F, f2=P:
the second agent, building on the first's work, silently breaks the first agent's feature
while implementing its own. C1 (regression-guarded handoff) targets exactly this —
each agent, in its own container, publishes a runnable check for its feature into the shared
tree (.cb_checks/, transferred to teammates via the committed substrate but stripped before
grading); every later agent must run and keep all prior checks green while adding its
feature. The published check is a one-way contract handed agent → agent — one step richer
than the bare substrate (which hands over code with no obligation), but still sequential. It is not
ex-ante partitioning (impossible — the write-set isn't known until the code is written) and not
concurrent negotiation.
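The C1 handoff rule described above can be sketched as a small gate. This is a minimal illustration under stated assumptions, not the harness's actual code: it assumes each published check is a standalone Python script under `.cb_checks/` that exits 0 when its feature still works, and the helper names `run_published_checks` and `handoff_allowed` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def run_published_checks(tree: Path, checks_dir: str = ".cb_checks") -> dict[str, bool]:
    """Run every published check in the tree; map check name -> passed.

    Assumption: one script per feature, exit code 0 means the feature is green.
    """
    results: dict[str, bool] = {}
    check_root = tree / checks_dir
    if not check_root.is_dir():
        return results  # the first agent inherits no prior checks
    for script in sorted(check_root.glob("*.py")):
        proc = subprocess.run(
            [sys.executable, str(script.resolve())],
            cwd=tree,
            capture_output=True,
            text=True,
        )
        results[script.name] = proc.returncode == 0
    return results


def handoff_allowed(results: dict[str, bool]) -> bool:
    """A later agent may hand off only if ALL prior checks are still green."""
    return all(results.values())
```

A later agent would call `run_published_checks` on its own working tree before committing, and keep working while `handoff_allowed` is false. Stripping `.cb_checks/` before grading is a separate step not shown here.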
Result — the best result from an inter-agent contract, and it's sequential.
On a regression-heavy coupled set C1 lifts the team 5/14 → 8/14, and every one of the +3
is the diagnosed failure reversing (feature 1 recovers F→P); no previously-passing pair is
lost. It is genuinely targeted, not a blanket gain: on a control set with zero such regressions it
is neutral-to-slightly-negative (overhead), and on the mixed flash set the sequential substrate alone is
already best.
| set | sequential team | + C1 | Δ |
|---|---|---|---|
| coupled set-1 (regression-heavy) | 5/14 (36%) | 8/14 (57%) | +3 — all feature-1 F→P recoveries |
| coupled set-2 (0 regressions) | 9/14 | 8/14 | −1 (overhead; nothing to fix) |
| flash-10 (mixed) | 5/10 | 4/10 | −1 (overhead) |
Concurrency did not help, the opposite of the usual intuition. On the same coupled set-1, every genuinely concurrent design underperformed the sequential ones, because the parallel split reintroduces the integration losses that serialization avoids (Round 9): C1 sequential contract 8/14 > sequential substrate 5/14 = solo 5/14 > guarded-merge decompose (concurrent + contracts) 3/14 > plain parallel decompose 2/14. So the coordination that pays off on coupled work is a sequential contract handoff, not parallelism or live negotiation; we do not (yet) have a concurrent-coordination method that beats serializing on the committed substrate.
C1 is the sequential, own-container, build-on-prior team plus the published invariant. That underlying substrate is itself the dominant lever (solo 30% → team 50% on flash) and is coordination-safe (Round 9: 0 / 28 coordination-caused losses); C1 adds the one piece of live coordination that pays off on coupled work.
Same agent (mini-swe), model (gpt-5.5), evaluator, and 10 pairs. A pair passes only if both features' held-out suites pass on one merged tree.
| harness | success | per-feature | notes |
|---|---|---|---|
| CooperAgents solo (1 agent, own container) | 30% (3/10) | 60% | single-agent baseline |
| CooperBench team harness (reference) | 60% (6/10) | 65% | like-for-like reference (own containers + scratchpad) |
| CooperAgents unified team — OWN containers (seed+commit) | 50% (5/10) | 65–70% | current; satisfies the hard constraint |
| unified team — shared single container | 56% | 67% | SUPERSEDED (violated own-container constraint) |
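The two columns in the table above measure different things, which is why per-feature can sit well above success. A minimal sketch of both metrics follows; the function name `pair_metrics` and the outcome pattern in the example are hypothetical, chosen only to show how a 30% / 60% split can arise.

```python
def pair_metrics(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """results holds one (feature1_passed, feature2_passed) tuple per pair.

    success     = fraction of pairs where BOTH features pass
    per_feature = fraction of individual features that pass
    """
    n = len(results)
    success = sum(f1 and f2 for f1, f2 in results) / n
    per_feature = sum(f1 + f2 for f1, f2 in results) / (2 * n)
    return success, per_feature


# Hypothetical pattern: 3 pairs pass both features, 6 pass one, 1 passes none.
example = [(True, True)] * 3 + [(True, False)] * 6 + [(False, False)]
print(pair_metrics(example))  # (0.3, 0.6)
```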
| # | hypothesis | result | verdict |
|---|---|---|---|
| H1 | Unifying team+agent onto one shared bus + git tree enables co-design | Agents + supervisor share a TeamBus; agent loop editable in-repo | SUPPORTED |
| H2 | A team can spawn helper agents at runtime (capped) | Works (spawn_helper + supervisor); covered by tests + offline flash | SUPPORTED |
| H3 | A shared code substrate beats isolated per-agent patches | Isolated → merge conflicts → team below solo. Shared tree → conflicts gone → team 30%→56% | STRONGLY SUPPORTED |
| H4 | The unified harness can match CooperBench's team harness (same agent) | 56% vs 60% on the same 10 pairs | SUPPORTED |
| H5 | A stronger agent loop (toy JSON loop → mini-swe) raises the absolute level | solo 20%→30%, team 30%→56% | SUPPORTED |
| H6 (S5) | A team-level verify-and-fix integrator pass fixes remaining failures | go_chi probe 0/2→0/2, ~2× time, no change | REJECTED (failures were API-semantic, not build errors) |
| H7 (S8) | A spec-fidelity prompt fixes API mismatches | team 5/10 (= baseline) | REJECTED (misses are spec-ambiguous) |
| H8 (S2) | Live teammate context (show each agent the prior diff) improves integration | team 6/10 (+1, best variant) | SUPPORTED (suggestive, n=10) |
| H9 | Combining S8+S2 beats either alone | combo 5/10 (< S2's 6/10), but fastest | REJECTED (S8 adds nothing) |
Same agent, same 10 pairs; only the team↔agent seam changes, all on the own-container + S2 baseline. Success (both-pass), Efficiency (avg wall-clock; lower better), Judge (mean LLM rubric 1–5). Reliable rows = 10/10 evals.
| variant | Success | Efficiency (s) | Judge /5 | verdict |
|---|---|---|---|---|
| baseline (own-container + S2) | 5/10 (50%) | 686 | 3.40 | — |
| S6 budget-80 | 3/10 | 786 | 3.00 | DROP |
| S7 guard | 4/10 | 870 | 3.80 | DROP |
| S8 spec-fidelity | 5/10 | 1085 | 3.40 | DROP |
| S5+S6 | 6/10 | 1434 | 3.70 | INCONCLUSIVE |
| S5+S8 | 5/10 | 1564 | 3.60 | DROP |
| S6+S7 | 6/10 | 937 | 3.80 | INCONCLUSIVE |
| all (S5+S6+S7) | 5/10 | 1342 | — | DROP |
Designs are added on top of what was kept; each is measured (success / efficiency / judge) and kept or rejected. KEEP stacks into the baseline; REJECT is recorded and dropped. The tree below reads top-to-bottom as the progressive build: first the architecture, then the substrate, then four distinct mechanism families of seam experiments layered on the kept baseline.
ARCHITECTURE
  two-level design, toy JSON agent loop: team < solo (uncoordinated)  REJECT
  └─ unify team+agent onto ONE shared bus + per-agent envs  KEEP (the seam)
     └─ + dynamic helper spawning (capped by --max-agents)  KEEP (capability)
        └─ + mini-swe worker (real agent, HELD CONSTANT, never codex)  KEEP (worker)

SUBSTRATE (the one big lever)
  ├─ shared SINGLE container (build-on-each-other)  30%→56%  REJECT ✗ violates own-container constraint
  └─ OWN containers, each seeded w/ teammates' committed diff  →50%  KEEP ★ BASELINE
     │  ≈ CooperBench team (60%), ~1.7× solo (30%); seed-commit ~2× faster
     │  than independent+integrate at equal success/judge → kept.
     │
     │  four mechanism families layered on the ★ baseline:
     │
     ├─ FAMILY A · coordination-prompt seams (state crossing team↔agent)
     │  ├─ S2 live teammate-diff context  6/10 @n10 (judge↑)  keep* marginal, new best on judge
     │  ├─ S8 spec-fidelity prompt  5/10 = baseline  REJECT (misses are spec-ambiguous)
     │  ├─ S5 verify-fix · S6 budget · S7 git-guard  30–60%@n10 → 40%@n20  REJECT (n=10 noise; combos collapse)
     │  └─ reverse order · S3 contracts  50%  REJECT (irrelevant / subsumed)
     │
     ├─ FAMILY B · post-hoc reviewer / repair passes (extra LLM pass after the team), analyst-generated
     │  └─ T3 completeness-review reviewer pass  6/10 @n10 → 10/20 = 10/20  REJECT double-edged (+2 fixed / −2 broken)
     │
     ├─ FAMILY C · in-loop workflow preambles (change the SAME agent's loop, no extra pass), analyst-generated
     │  ├─ T2 spec-derived TDD (self-verify first)  5/10 = baseline, 0 flips  REJECT (pure wash)
     │  └─ T4 convention-mining before editing  4/10 (baseline −1)  REJECT (mining eats step budget)
     │
     └─ FAMILY D · selection over run-to-run variance (best-of-N)
        └─ T6 best-of-N + LLM-judge self-select  5/10 = baseline (+1 / −1 flip)  REJECT 2× cost; judge mis-selects

═══ 4 families → the SAME ~50% plateau · binding constraint = agent capability (mini-swe) ═══

CHANGE THE EVAL SET (Round 6: is the plateau a sampling artifact?)
  └─ rank pairs by GOLD-patch coupling (shared files×2 + shared symbols) (scripts/select_coupled.py)
     └─ old fixed set was already 8/10 coupled → NOT a coupling-poor sample
        └─ pick 14 MOST-coupled pairs (lite pool, 9 repos)
           ├─ solo  5/14 (36%)
           ├─ team, baseline substrate  5/14 (36%)  team = solo ✗ multi-agent advantage GONE on coupled work
           └─ team + S2 coordination  4/14 (−1)  no help
  ★ INSIGHT: team wins come from SEPARABILITY, not coordination. Coupled features →
    build-on-prior substrate causes interference → cancels the parallelism benefit.

SEPARABILITY-AWARE ORCHESTRATION (Round 7: can we CREATE separability?)
  └─ planner re-cuts task into an independence-maximizing subtask DAG (planner.py + _run_decomposed)
     ├─ file-level merge  OLD 30% · COUP 21%  ≈solo; CooperBench pairs always share files → merges to 1 agent
     └─ region-aware split  OLD 30% · COUP 14%  worse; splits 2–3 ways but 12/14 merges CONFLICT
  ★ LIMIT: a conflict-free partition needs the write-set BEFORE coding, but it's only known AFTER coding.
    Separability can't be manufactured ex-ante; it must exist in the task.

COORDINATION UNDER INTERDEPENDENCE (Round 8: coupling IS the target)
  └─ diagnose coupled failure: dominant = f1=F,f2=P (2nd agent BREAKS 1st agent's feature)
     └─ C1 regression-guard: each agent publishes a runnable check; later agents keep ALL green
        ├─ coupled set-1 (regression-heavy)  5/14 → 8/14  +3  all 3 = feature-1 F→P recovery
        ├─ coupled set-2 (0 regressions)  9/14 → 8/14  −1  nothing to fix → overhead
        └─ OLD-10 (mixed)  5/10 → 4/10  −1  overhead
  ★ FIRST attributable coordination WIN on coupled work. TARGETED: helps iff the "later agent
    breaks earlier feature" regression is present. Diagnose failure → target it.

WHERE DO COORDINATION PROBLEMS LIVE? (Round 9: detector + fresh tasks)
  └─ detector: solo passes a feature but TEAM breaks it = coordination-CAUSED loss
     └─ harvest 28 pairs (coupled set-1 + 14 fresh, diverse repos):
        ├─ sequential-seed team  0/28 losses  coordination-SAFE (serializes on committed work)
        └─ parallel decompose  7/28 losses  lossy (split+merge breaks working features)
  ★ Coordination problems concentrate in the PARALLEL regime (the merge boundary). The substrate
    (build-on-prior) is dominant BECAUSE it eliminates them. Open hard problem = LOSS-FREE PARALLELISM
    (parallel speed without the 7/28 integration losses).

LOSS-FREE PARALLELISM (Round 10: guarded merge = decompose + published invariants)
  └─ each parallel branch publishes a check; guarded integrator repairs merge-broken features
     ├─ plain parallel decompose  2/14
     ├─ guarded-merge decompose  3/14  partial; recovers FF→PP (click f1_f3) but double-edged
     └─ sequential build-on-prior  5/14  still best
  ★ Guard repairs real merge losses but can't close the gap: residual unrepairable conflicts
    (Round 7 write-set limit) + overhead. Coupled work → serialize. Loss-free parallel only partial.

ADAPTIVE TOPOLOGY (Round 11: let the work decide, by observation)
  └─ run parallel → probe merge (git apply --check) → clean? keep parallel : fall back to sequential
     ├─ CooperBench (all coupled)  20/20 → fallback  no regression ≈ sequential (parallel never engages)
     └─ separable work (controlled) → parallel  1.99× faster  8.0s → 4.1s, both features intact
  ★ sequential's success on coupled + parallel's efficiency on separable, decided by conflict not prediction.

GENERALIZE (Round 12: agents choose the topology, domain-agnostic)
  └─ planner AGENT reasons about subtask deps (independent / needs-output / combine-at-end) → a DAG
     ├─ chain = sequential · no edges = parallel · diamond = parallel-then-join (pipeline)
     └─ executor runs the agent's DAG: parallel 4.0s · sequential 8.1s · pipeline 8.1s (2×~4s work)
  ★ topology decision + scheduling are domain-general; only the integration backend (git merge) is SWE.
Each round explores hypotheses on top of the current kept baseline; winners stack.
Unified the team and agent layers onto one TeamBus (task list + messaging +
spawn queue) and a shared environment, so coordination is just more tools on the agent and the
agent loop is editable in-repo. Added runtime helper spawning. Enables co-design; not a
pass-rate experiment.
_run_isolated: each agent (and the integrator) runs in its own container, seeded with teammates' committed diff via git apply.

eval.json files hadn't been written (eval-phase flakiness), and a missing file was counted as a fail. Fix: run the agent phase and the eval phase separately and backfill evals single-process before reading any number. Parallel agent runs are fine.

Nine variants run in parallel (no API rate-limiting; eval backfilled single-process). All on top of own-container + S2; only the named seam changes. Reliable rows = 10/10 evals.
| variant | Success | Efficiency (s) | verdict |
|---|---|---|---|
| baseline (own-container + S2) | 5/10 (50%) | 686 | — |
| S6 budget-80 | 3/10 | 786 | DROP |
| S7 guard | 4/10 | 870 | DROP |
| S8 spec-fidelity | 5/10 | 1085 | DROP tied, slower |
| S5+S6 | 6/10 | 1434 | INCONCLUSIVE uncorroborated, slower |
| S5+S8 | 5/10 | 1564 | DROP |
| S6+S7 | 6/10 | 937 | INCONCLUSIVE uncorroborated |
| all (S5+S6+S7) | 5/10 | 1342 | DROP |
Finding — the loop has converged. Results scatter 30–60% around the 50% baseline, i.e. pure n=10 noise (±1–2 pairs). The two 60% combos aren't corroborated (budget alone scored 30%) and every "winner" is slower. No seam tweak gives an attributable gain at this scale. The dominant levers (shared substrate via seed-commit, real mini-swe agent) are already banked; the rest is noise + agent capability. Distinguishing these would require n≈30, not more tweaks.
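The "±1–2 pairs" noise band can be sanity-checked with a back-of-the-envelope binomial estimate. The sketch below assumes each pair is an independent pass/fail trial with the same pass probability (0.5, the baseline rate), which is a simplification since pairs differ in difficulty; `pair_noise` is a hypothetical helper name.

```python
import math


def pair_noise(n: int, p: float) -> tuple[float, float]:
    """One standard deviation of the pass count for n Bernoulli(p) pairs.

    Returns (std dev in pairs, std dev in percentage points).
    """
    sd_pairs = math.sqrt(n * p * (1 - p))
    return sd_pairs, 100 * sd_pairs / n


for n in (10, 20, 30):
    sd_pairs, sd_pp = pair_noise(n, 0.5)
    print(f"n={n}: ±{sd_pairs:.1f} pairs (±{sd_pp:.0f} pp)")
# n=10: ±1.6 pairs (±16 pp)
# n=20: ±2.2 pairs (±11 pp)
# n=30: ±2.7 pairs (±9 pp)
```

Under this model a single standard deviation at n=10 is about 16 percentage points, so a 30–60% scatter around 50% is within roughly two standard deviations, and the band only shrinks to about 9 points at n=30.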
| experiment | Success | Efficiency (s) | Judge /5 | finding |
|---|---|---|---|---|
| baseline (seed-commit) | 5/10 (50%) | 686 | 3.40 | — |
| no-seed (independent + integrator merge) | 5/10 (50%) | 1245 | 3.30 | DROP ties on success/judge but ~2× slower |
| reverse feature order | 5/10 (50%) | 827 | 3.20 | DROP ordering irrelevant |
| n=20 confirm — baseline | 10/20 (50%) | — | — | baseline stable across n=10 and n=20 |
| n=20 confirm — S6+S7 (the n=10 "60%") | 8/20 (40%) | — | — | REJECT 60% was noise — collapses below baseline |
Definitive convergence. The n=20 confirmation kills the only live signal: S6+S7 drops from 60% (n=10) to 40% (n=20), below the stable 50% baseline. The three metrics also refine the design choice: seed-commit and independent+integrate tie on success (50%) and judge (3.3–3.4), but seed-commit is ~2× faster (686s vs 1245s — no extra integrator pass), so the kept design wins on the efficiency axis. Ordering is irrelevant. No seam tweak produces a real, attributable gain — the dominant levers (own-container substrate + real mini-swe agent) are fully banked. The coordination-prompt seam sub-space has converged — but see Round 5: an LLM analyst reopened the loop on the untested tool/workflow axis.
An LLM analyst (eval/analyst.py) read the 5 failing pairs' real
evidence — spec, submitted diff, and the grader's actual error — diagnosed each, and proposed new
hypotheses. Key correction it surfaced: 4 of 5 failures are "incomplete implementation"
(a whole feature silently omitted/half-done), which IS seam-addressable; only 1 is truly
spec-ambiguous. So the earlier "converged" call applies only to coordination-prompt seams —
the tool/workflow axis is open (raising effective agent capability via harness
scaffolding is legitimate co-design; the agent model stays fixed).
| analyst-proposed hypothesis | class | targets |
|---|---|---|
| T1 — API-contract extraction + AST/probe gate before submit | tool | wrong API, incomplete |
| T2 — spec-derived acceptance tests (TDD) before coding | workflow | incomplete, wrong API |
| T3 — feature-coverage ledger: block submit until each feature has evidence | workflow | incomplete (4/5 failures) |
| T4 — existing-test / convention mining step before editing | tool | incomplete, conflict |
| T5 — ambiguity triage (prefer least-new API; flag SPEC-AMBIGUOUS) | capability | spec-ambiguous |
T3 result (completeness review). n=10: 6/10 vs baseline 5/10 (+1, flipping an incomplete-impl pair via its intended mechanism). But the n=20 confirmation washes it out: T3 10/20 = baseline 10/20. Per-pair, T3 flips 2 fail→pass but regresses 2 pass→fail: the extra reviewer pass is double-edged (fills omissions, but sometimes breaks working code), net zero and slower. DROP. The n=10 +1 was noise (again).
T2 / T4 result (in-loop workflows). Because post-hoc passes are double-edged, the next two analyst hypotheses were built with a distinct mechanism — in-loop prompt preambles that change what the same agent does within its own budget, adding no extra pass. T2 (spec-derived TDD: derive acceptance criteria + run throwaway local checks before submitting) scored 5/10 = baseline, zero pair flips — a pure wash. T4 (convention mining: inspect existing tests/usages before editing) scored 4/10, baseline −1 — the mining step spends step-budget on exploration and costs a pair. DROP both.
T6 result (best-of-N self-selection). The last untried mechanism: run the whole
isolated team twice (own containers each), then the LLM judge — pairwise, both orders, no hidden-grader
access — selects the candidate diff to submit. 5/10 = baseline, but with a flip each way
(gains 1706 f5_f8, loses 1655 f6_f7) at 2× compute.
DROP. This is the most informative negative: best-of-2 does
surface run-to-run variance, so the headroom is real — but the judge mis-selected a pair the
baseline gets right. A selector good enough to net-gain would need to know which diff passes the hidden
tests, i.e. be more capable than the agent — the constraint itself.
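The "pairwise, both orders" selection step can be sketched as below. This is an illustration under assumptions, not the harness's selector: `judge` stands in for the LLM judge and is assumed to return 0 if it prefers the first diff shown and 1 if it prefers the second; the tie-break (keep the first run) is also an assumption.

```python
from typing import Callable


def select_candidate(diff_a: str, diff_b: str, judge: Callable[[str, str], int]) -> str:
    """Pick one of two candidate diffs using a pairwise judge queried in both orders.

    Asking in both orders cancels pure position bias: a judge that always
    prefers whatever is shown first produces a 1-1 tie instead of a win.
    """
    forward = judge(diff_a, diff_b)   # A shown first
    backward = judge(diff_b, diff_a)  # B shown first
    a_votes = (forward == 0) + (backward == 1)
    b_votes = (forward == 1) + (backward == 0)
    # Assumed tie-break: keep the first run's diff.
    return diff_b if b_votes > a_votes else diff_a
```

Note that nothing in this loop sees the hidden grader, so its net gain is bounded by how well the judge's preference tracks the held-out tests.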
Four-family convergence. The ~50% plateau now holds across four distinct mechanism families: (a) coordination-prompt seams (Rounds 2–4), (b) post-hoc reviewer/repair passes (T3/S5), (c) in-loop workflow preambles (T2/T4), and (d) selection over variance (T6). None lifts it: post-hoc passes regress as much as they fix, in-loop preambles are wash-or-harmful (extra exploration eats the step budget), and selection is bottlenecked on an oracle no cheaper than the agent. This is strong, mechanism-diverse evidence that the binding constraint is genuinely agent capability for mini-swe at this scale, not any single untried seam (the remaining T1/T5 are variants of already-washed families). The honest stopping point: a 50% plateau, close to the CooperBench team harness (60%) and ~1.7× solo (30%); the next real lever is the agent, not the seam.
Hypothesis (from the question "is the plateau a sampling artifact of the fixed 10 pairs?"): maybe
that set isn't where coordination matters, so seams have nothing to grab. To test it directly
we built a graded coupling score from the gold solution patches
(scripts/select_coupled.py): shared files (×2) + shared changed symbols. First surprise:
the old fixed set was already 8/10 gold-conflict (coupling-heavy, above the 74% base
rate), so the plateau was not from a coupling-poor sample. We then selected the
14 most-coupled pairs across the 100-pair lite pool (9 repos, cap 2/repo,
6 repos the old set never touched — e.g. dirty_equals/43 sharing 2 files +3 symbols,
the click/2800 family sharing _termui_impl.py+termui.py).
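The coupling score and the capped selection described above can be sketched as follows. This is not the contents of scripts/select_coupled.py; the function names, the input shapes, and the greedy per-repo cap are assumptions made for illustration.

```python
def coupling_score(files_a, symbols_a, files_b, symbols_b) -> int:
    """Graded coupling of two gold patches: shared files (x2) + shared changed symbols."""
    shared_files = set(files_a) & set(files_b)
    shared_symbols = set(symbols_a) & set(symbols_b)
    return 2 * len(shared_files) + len(shared_symbols)


def select_most_coupled(pairs, k=14, cap_per_repo=2):
    """Greedy top-k by score with a per-repo cap.

    pairs: iterable of (repo, pair_id, score) tuples (hypothetical shape).
    """
    chosen, per_repo = [], {}
    for repo, pair_id, score in sorted(pairs, key=lambda p: p[2], reverse=True):
        if per_repo.get(repo, 0) >= cap_per_repo:
            continue  # this repo already contributed its share
        chosen.append((repo, pair_id, score))
        per_repo[repo] = per_repo.get(repo, 0) + 1
        if len(chosen) == k:
            break
    return chosen
```

Under this formula a pair sharing 2 files and 3 symbols scores 2×2 + 3 = 7.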
| arm (14 most-coupled pairs) | success | note |
|---|---|---|
| solo (1 agent) | 5/14 (36%) | single-agent baseline |
| team — baseline substrate | 5/14 (36%) | = solo; team advantage gone |
| team + S2 coordination seam | 4/14 (29%) | −1 vs baseline substrate; no help |
The result inverts the coordination intuition: a strong negative. On the old mixed set the team (50%) beat solo (30%) by about 1.7×; on the most-coupled set the gap vanishes (team 36% = solo 36%), and it vanishes because the team regressed (50%→36%) while solo held (~30→36%). The explicit coordination seam doesn't recover it. Interpretation: the multi-agent win is about separability, not coordination. When two features touch the same files/symbols, the sequential build-on-prior substrate creates interference (the second agent inherits and must reconcile edits to shared code) that cancels the parallelism benefit, and a single agent holding the whole coupled problem in one context does just as well. So a harder-via-coupling set does not unlock seam value; it removes the team advantage. The actionable lever for multi-agent gains is partitioning work to be separable (S1 region-partitioning), not richer coordination, which is the opposite of where the seam backlog was pointed.
If multi-agent value needs separability (Round 6), build an orchestrator that creates it: a
PLANNER (cooperagents/planner.py) re-cuts the task into a dependency DAG of subtasks chosen
to maximize independence; the harness (_run_decomposed) runs independent subtasks in
parallel (own containers, seeded only along DAG edges) and merges the branch deltas. Two versions:
(a) file-level (merge work that shares a file) and, after the insight that conflict is
region-level not file-level, (b) region-aware — split same-file work by
function, give each agent a disjoint owns write-set, enforce it as an ownership boundary in
the prompt. 84/100 lite pairs are the "win zone" (share a file, disjoint symbols) where (b) should pay off.
| orchestration | OLD-10 | COUPLED-14 | note |
|---|---|---|---|
| solo | 30% | 36% | one agent, whole task |
| sequential team (build-on-prior) | 50% | 36% | best on OLD |
| decompose — file-level merge | 30% | 21% | collapses to ≈solo |
| decompose — region-aware split | 30% | 14% | worse 12/14 merges conflicted |
The planner behaves correctly — file-level merges ~every pair (CooperBench pairs are coupled by construction: both features from the same PR, always sharing files), so decompose collapses to ≈solo and loses the sequential team's 50%. Region-aware does split (2–3 parallel subtasks/pair) but scores worse, and 12/14 merges still conflicted (.rej / conflict markers). The deep reason — a fundamental limit, not a bug: a conflict-free partition needs each subtask's write-set before the code is written, but the true write-set is only known after doing the work. The planner predicts regions from the spec; real features need cross-cutting edits it can't foresee (a shared import, a new field, a registration line), so agents either stray outside their lane → conflicts, or obey ownership → incomplete features. You cannot manufacture separability by ex-ante partitioning. On intrinsically-coupled work, one agent holding the whole context beats any predicted cut.
Reframe: CooperBench couples its pairs on purpose — the coupling IS the research target, so the
lever must be coordination DURING execution, not partitioning before it. Diagnosis (from
the sequential team's per-feature eval on coupled pairs): the dominant failure is
f1=F, f2=P — the second agent, building on the first's work, silently
breaks the first agent's feature while implementing its own. Mechanism (C1 —
regression-guarded coordination): each agent PUBLISHES a runnable check for its own feature into
the shared tree (.cb_checks/, transferred to teammates via the substrate but stripped before
grading); later agents must run and keep ALL prior checks green. Coordination via a preserved invariant
communicated agent→agent at runtime.
| set | sequential team | + C1 regression-guard | Δ |
|---|---|---|---|
| coupled set-1 (regression-heavy) | 5/14 (36%) | 8/14 (57%) | +3 — all 3 are feature-1 recovering F→P |
| coupled set-2 (0 regressions) | 9/14 (64%) | 8/14 (57%) | −1 (overhead; nothing to fix) |
| OLD-10 (mixed) | 5/10 (50%) | 4/10 (40%) | −1 (overhead) |
The first attributable coordination win on coupled work — and precisely targeted. On
set-1 the +3 is exactly the diagnosed regressions flipping (feature-1 goes F→P), no previously-passing
pair lost. The replication on set-2 is the honest control: set-2 had zero f1=F,f2=P
pairs, so C1 had nothing to fix and its check-writing overhead cost one pass. C1 helps iff the set
contains "later agent breaks earlier teammate" regressions — a real solution to one specific
hard-coordination failure mode, not a universal gain. This is the opposite lesson from the coordination
seams of Rounds 2–4: a coordination mechanism does move the needle on coupled work when it
targets a diagnosed failure with a verifiable runtime invariant (vs. a generic "reuse names" prompt).
Next: make C1 adaptive (invoke only when regression risk is present) or cheaper (use the repo's existing
tests as the invariant instead of agent-authored checks) to keep the set-1 win without the set-2 overhead.
Hunting the next coordination failure mode — and finding the boundary. Applying the
same recipe (diagnose → targeted mechanism) to the other coupled failures
(f1=P,f2=F and f1=F,f2=F), the real grader errors are per-feature CORRECTNESS
bugs, not coordination: pillow/290 fails a palette-sort assertion and a missing
ValueError; llama_index/18813 gets max-bytes and audio-bytes resolution wrong;
dspy/8394 implements feature 1 fine but feature 2's compression simply doesn't reduce size.
Each feature fails on its own merits (no "agent 2 broke agent 1", no missing-interface, no merge
conflict), and solo fails these too — so they are agent-capability bound, not seam-addressable.
Conclusion: C1 captures essentially the entire coordination-addressable surface of coupled work
(the regression mode); the residual is capability. That is the useful boundary the hard
benchmark was built to expose — coordination's job here is to stop the team from creating
failures (regressions) that a solo agent wouldn't, and once that's done, what's left is the agent.
To find more — and more kinds of — coordination problems, we need a detector and fresh tasks. Detector: a failure is coordination-caused iff solo passes a feature but the team breaks it (the team setup created a failure a single agent wouldn't have). We harvested 14 fresh pairs (diverse repos, none reused) running solo + sequential-team + parallel-decompose, and mined every solo-passes-but-team-breaks instance — combined with coupled set-1, 28 pairs total.
| team design | coordination-caused feature losses (of 28 pairs) |
|---|---|
| sequential-seed (build-on-prior) | 0 — coordination-safe |
| parallel decompose (split + merge) | 7 (set-1: 5, fresh: 2) |
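The detector defined above reduces to a per-feature comparison of solo and team outcomes. A minimal sketch, with a hypothetical function name and input shape (feature id mapped to pass/fail):

```python
def coordination_caused_losses(solo: dict[str, bool], team: dict[str, bool]) -> list[str]:
    """Features that solo passes but the team breaks.

    A feature the team never reports is treated as failed (an assumption).
    """
    return [feat for feat, passed in solo.items() if passed and not team.get(feat, False)]


# A solo[PP] -> dec[PF] pair: the parallel design breaks the second feature.
solo = {"f1": True, "f4": True}
decomposed = {"f1": True, "f4": False}
print(coordination_caused_losses(solo, decomposed))  # ['f4']
```

Features that fail under both solo and team are deliberately excluded: those count as capability failures, not coordination-caused ones.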
Replicated cleanly: the sequential team broke nothing solo got right across 28 pairs, while
the parallel design broke 7 working features (e.g. click/2068 f1_f4, llama/17244 f2_f6
went solo[PP]→dec[PF]). So coordination problems are real and findable — and they
concentrate in the PARALLEL regime, at the merge/integration boundary. This unifies the whole
report: the dominant lever (the own-container build-on-prior substrate, Round 1) is dominant
precisely because it eliminates coordination losses — it serializes on committed work, so there
are no conflicts and no regressions to coordinate away. The sequential substrate is the de-facto
solution to coordination on coupled tasks. The genuine open hard problem is therefore loss-free
parallelism: getting the speed of the parallel split without its 7/28 integration losses
— which is exactly where C1's published-invariant idea should transfer (each branch publishes a check; the
integrator must land a merge that keeps all of them green, repairing conflicts).
Built it: decompose + preserve_invariants → each parallel branch publishes a runnable check;
after the mechanical merge a guarded integrator runs every check and repairs whatever the
merge broke until all pass (C1's invariant idea moved onto the parallel integrator). Tested on coupled set-1,
where plain parallel decompose was lossy (2/14, 12/14 merges conflicted).
| design (coupled set-1) | pass |
|---|---|
| sequential build-on-prior (kept best) | 5/14 |
| solo | 5/14 |
| parallel decompose — guarded merge | 3/14 |
| parallel decompose — plain | 2/14 |
The mechanism demonstrably works: the guarded integrator recovered merge-broken features,
including click/2800 f1_f3 going FF→PP — the worst plain-decompose loss, where the
split+merge had broken both features. But it is double-edged (repair overhead cost
jinja f8_f9) and nets only 3/14 — still below the sequential substrate's 5/14.
Two reasons it can't close the gap: residual conflicts the integrator can't fully repair (the same ex-ante
write-set-prediction limit as Round 7), and the repair's own overhead. Conclusion: loss-free
parallelism is only partially achievable; on intrinsically-coupled work, sequential build-on-prior remains
the best design. Parallelism + guarding narrows the integration-loss gap but does not overcome the
structural advantage of serializing on committed work.
decompose planner is the closest, and it over-merges unreliably; see Round 7).

Real multi-agent workloads are diverse: some tasks are hard-to-separate (CooperBench), some are easily
separable. So instead of a fixed topology, the harness selects one per task at runtime, by
observation, not by ex-ante prediction (which is impossible; the write-set isn't known until the code
is written, Round 7). Method (a deterministic harness policy): run the features in parallel
from base (each publishing an invariant check); probe the merge (git apply --check + run the
checks); if the branches compose cleanly → keep the parallel result (fast); if they
collide → fall back to the sequential build-on-prior handoff (safe), reusing branch 1.
The observed conflict is the decision.
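The selection policy above can be sketched as a probe plus a two-way decision. This is a simplified illustration, not the harness's selector: `patch_applies_cleanly` wraps the real `git apply --check` dry run, while `choose_topology` and its callable arguments are hypothetical names, and the sketch assumes the probe is run on a tree that already contains the other branch's changes.

```python
import subprocess
from pathlib import Path
from typing import Callable


def patch_applies_cleanly(repo: Path, patch_file: Path) -> bool:
    """Dry-run a patch with `git apply --check`; exit code 0 means it would apply."""
    proc = subprocess.run(
        ["git", "apply", "--check", str(patch_file)],
        cwd=repo,
        capture_output=True,
    )
    return proc.returncode == 0


def choose_topology(branches_compose: Callable[[], bool],
                    checks_green: Callable[[], bool]) -> str:
    """Keep the parallel result only if the merge is clean AND all checks pass.

    Otherwise fall back to the sequential build-on-prior handoff.
    """
    if branches_compose() and checks_green():
        return "parallel"
    return "sequential"
```

Passing the probes as callables keeps the decision logic testable without a repository, and lets the checks be skipped when the merge already fails.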
Result — it adapts correctly; the win is on efficiency where work is separable. On a controlled separable task (two agents, disjoint files, ~4 s each) the selector takes the parallel path and is 1.99× faster (8.0 s → 4.1 s) with both features intact. On CooperBench it falls back on all 30 pairs tested — success holds at sequential levels (no regression), but the parallel path never engages.
Gold-separability ≠ agent-separability (the key correction). CooperBench is ~75%
entangled / 25% not (gold-conflict report: 76.5% of gold-patch pairs conflict). One would expect adaptive to
parallelize the 25%. It doesn't — and this is not a too-strict probe: switching the merge test from
git apply --check to a real base-aware 3-way merge (verified to return PARALLEL on
genuinely disjoint diffs) still fell back on 10/10 gold-"not-entangled" pairs. The
reason: the 75/25 split is a property of the minimal gold patch; an LLM agent implementing a feature
also touches shared scaffolding (imports, exports, registration, adjacent edits), so two
independent agent implementations collide even when the gold patches don't. Agent-level
separability is far rarer than gold-level. To exploit the 25% you'd have to stop agents touching the
shared scaffolding independently — i.e. coordinate on it (back to the coupled problem) or constrain each to
its region (the decompose/ownership route, which ships incomplete features — Round 7).
The Round-11 selector was SWE-bound (it decided via git merge conflicts) and harness-driven.
Round 12 makes the decision (a) made by an agent and (b) domain-general. A
planner agent (planner.plan_topology) reasons about subtask relationships in
task-agnostic terms — independent → parallel; needs-another's-output → sequential edge;
combine-at-end → fan-in — and emits a dependency DAG. That one abstraction expresses every
topology the question raised: a chain is sequential, no edges is parallel, a
diamond is parallel-then-sequential (pipeline), and arbitrary DAGs are hybrid.
The executor runs whatever the agent picks; only the final integration/conflict-check is SWE-specific (git
3-way merge) and is a pluggable backend — the planning and scheduling are not.
| agent-chosen topology | DAG (levels) | wall-clock (2×~4s work) |
|---|---|---|
| parallel (independent) | [[a,b]] | 4.0s |
| sequential (chain) | [[a],[b]] | 8.1s |
| pipeline (parallel→join) | [[a,b],[c]] | 8.1s |
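The "DAG (levels)" column can be reproduced from a dependency map with the standard library. The sketch below is an assumption-laden illustration rather than the executor's code: `dag_levels` and `wall_clock` are hypothetical names, every task (including the join) is assumed to take about 4 s, and tasks within a level are assumed to run fully in parallel.

```python
from graphlib import TopologicalSorter


def dag_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into levels; deps maps each task to the tasks it must wait for."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        levels.append(ready)
        sorter.done(*ready)
    return levels


def wall_clock(levels: list[list[str]], duration: dict[str, float]) -> float:
    """Each level costs as much as its slowest task; levels run back to back."""
    return sum(max(duration[task] for task in level) for level in levels)


topologies = {
    "parallel": {"a": set(), "b": set()},
    "sequential": {"a": set(), "b": {"a"}},
    "pipeline": {"a": set(), "b": set(), "c": {"a", "b"}},
}
four_seconds = {"a": 4.0, "b": 4.0, "c": 4.0}  # assumed per-task duration
for name, deps in topologies.items():
    levels = dag_levels(deps)
    print(name, levels, wall_clock(levels, four_seconds))
# parallel [['a', 'b']] 4.0
# sequential [['a'], ['b']] 8.0
# pipeline [['a', 'b'], ['c']] 8.0
```

The idealized 4 s / 8 s / 8 s figures line up with the measured 4.0 s / 8.1 s / 8.1 s in the table, with the small difference plausibly being scheduling overhead.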
Params) that the spec only describes ("URL parameters"). No team
coordination invents a name that isn't in the spec, which is precisely why codex (stronger
priors) outperforms mini-swe here.

eval.json didn't write (counted as a miss), slightly understating them.
In the cooperagents repo:
scripts/measure.sh <label> # fixed-set solo+team (same agent/model/eval)
uv run python scripts/bench_compare.py --pairs <fixed 10> --team-only --teammate-context # a seam variant (S2)
uv run python scripts/ablation_report.py --baseline cmp --candidates s8 s2 combo --setting team --judge