qwen35-9b · team-no-protocol · CooperData v1
First end-to-end run of the team→coop converter (CooperData PR #101) on a fresh model + dataset combination: Qwen3.5-9B (mini_swe_agent_v2, --setting team --team-no-protocol) against CooperData v1 (70 tasks, 345 conflict pairs). Adds a third run to CooperBench/team-coop alongside the two existing codex/gpt-5.5-hao runs.
Headline
| field | value |
| pairs evaluated | 345 / 345 |
| both-agent pass | 21 (6.09%) |
| fail | 323 |
| eval error | 1 |
| cost | $0.00 (self-hosted vLLM) |
| HF subdir | qwen35-cooperdata-team-noproto |
Pass rate by repo
Ten of 26 repos saw at least one both-agent pass.
arrow, cobra, and pyparsing contribute the largest absolute counts;
cobra, pyparsing, env, and typeguard each reach a pass rate of at least 18% within their own repo, though on small samples (5 to 16 pairs each).
| repo | n | pass | fail | err | rate |
| arrow_task | 63 | 6 | 57 | 0 | 9.5% |
| cobra_task | 16 | 3 | 13 | 0 | 18.8% |
| pyparsing_task | 15 | 3 | 12 | 0 | 20.0% |
| flask_task | 29 | 2 | 27 | 0 | 6.9% |
| env_task | 10 | 2 | 8 | 0 | 20.0% |
| axios_task | 74 | 1 | 73 | 0 | 1.4% |
| gin_task | 18 | 1 | 17 | 0 | 5.6% |
| pygments_task | 9 | 1 | 8 | 0 | 11.1% |
| oauthlib_task | 7 | 1 | 6 | 0 | 14.3% |
| typeguard_task | 5 | 1 | 4 | 0 | 20.0% |
| astroid_task | 23 | 0 | 23 | 0 | 0.0% |
| flake8_task | 18 | 0 | 18 | 0 | 0.0% |
| jinja_task | 9 | 0 | 9 | 0 | 0.0% |
| anyhow_task | 8 | 0 | 8 | 0 | 0.0% |
| sqlfluff_task | 6 | 0 | 6 | 0 | 0.0% |
| sqlglot_task | 6 | 0 | 6 | 0 | 0.0% |
| starlette_task | 6 | 0 | 6 | 0 | 0.0% |
| sqlparse_task | 4 | 0 | 4 | 0 | 0.0% |
| cantools_task | 3 | 0 | 3 | 0 | 0.0% |
| indicatif_task | 3 | 0 | 3 | 0 | 0.0% |
| roaring_task | 3 | 0 | 3 | 0 | 0.0% |
| trio_task | 3 | 0 | 3 | 0 | 0.0% |
| tweepy_task | 3 | 0 | 3 | 0 | 0.0% |
| click_task | 2 | 0 | 2 | 0 | 0.0% |
| avro_task | 1 | 0 | 1 | 0 | 0.0% |
| xpath_task | 1 | 0 | 1 | 0 | 0.0% |
| total | 345 | 21 | 323 | 1 | 6.09% |
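The rate column is passes divided by pairs evaluated. As a minimal sketch of that arithmetic, the Python below recomputes the column sums and the headline rate from the counts in the table above. The `rows` mapping and the `pass_rate` helper are illustrative names, not part of CooperBench.

```python
# Illustrative consistency check; counts are copied from the per-repo table.
# repo -> (n, passed, failed, err)
rows = {
    "arrow": (63, 6, 57, 0),
    "cobra": (16, 3, 13, 0),
    "pyparsing": (15, 3, 12, 0),
    "flask": (29, 2, 27, 0),
    "env": (10, 2, 8, 0),
    "axios": (74, 1, 73, 0),
    "gin": (18, 1, 17, 0),
    "pygments": (9, 1, 8, 0),
    "oauthlib": (7, 1, 6, 0),
    "typeguard": (5, 1, 4, 0),
    "astroid": (23, 0, 23, 0),
    "flake8": (18, 0, 18, 0),
    "jinja": (9, 0, 9, 0),
    "anyhow": (8, 0, 8, 0),
    "sqlfluff": (6, 0, 6, 0),
    "sqlglot": (6, 0, 6, 0),
    "starlette": (6, 0, 6, 0),
    "sqlparse": (4, 0, 4, 0),
    "cantools": (3, 0, 3, 0),
    "indicatif": (3, 0, 3, 0),
    "roaring": (3, 0, 3, 0),
    "trio": (3, 0, 3, 0),
    "tweepy": (3, 0, 3, 0),
    "click": (2, 0, 2, 0),
    "avro": (1, 0, 1, 0),
    "xpath": (1, 0, 1, 0),
}


def pass_rate(n: int, passed: int) -> float:
    """Both-agent pass rate as a percentage of evaluated pairs."""
    return 100.0 * passed / n


# Column-wise sums over all repos: n, pass, fail, err.
n_total, pass_total, fail_total, err_total = (
    sum(col) for col in zip(*rows.values())
)

print(f"repos={len(rows)} n={n_total} pass={pass_total} "
      f"fail={fail_total} err={err_total}")
print(f"overall rate: {pass_rate(n_total, pass_total):.2f}%")
```

Summed this way, the per-repo rows give 345 pairs and 21 passes, matching the total row. The fail column as listed sums to 324 with 0 errors, whereas the total row reports 323 fails and 1 eval error, so the single errored pair appears to be counted under fail in one of the per-repo rows.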
Run setup
| field | value |
| model | Qwen/Qwen3.5-9B served on Modal vLLM (cooperbench--qwen35-9b-128k-serve.modal.run) |
| agent_framework | mini_swe_agent_v2 |
| setting | team --team-no-protocol (lead + member, Redis-backed task list, shared scratchpad volume, MCP) |
| dataset | CooperData v1 — 70 tasks, 345 has_conflict pairs across 26 repos |
| concurrency | 3 pairs in flight (lowered from 5 after a host OOM mid-sweep) |
| backend | docker (task images: akhatua/cooperbench-<repo>:task<id>) |
| run wall-time | ~7h of active run time spread over ~30h elapsed (two host OOMs from pathological generated tests; sweep resumed each time) |
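Because the model is served from a self-hosted, OpenAI-compatible vLLM endpoint, it can help to confirm the endpoint responds before starting or resuming a sweep. The sketch below is one way to do that with the standard library only. It assumes the server exposes the usual OpenAI-style `/v1/models` route and accepts a placeholder bearer token; `models_url` and `list_models` are hypothetical helper names, not CooperBench functions.

```python
import json
import urllib.request

# Base URL as used for OPENAI_BASE_URL in this run.
BASE_URL = "https://cooperbench--qwen35-9b-128k-serve.modal.run/v1"


def models_url(base_url: str) -> str:
    """Return the OpenAI-compatible model-listing URL for a base URL."""
    return base_url.rstrip("/") + "/models"


def list_models(base_url: str = BASE_URL, timeout: float = 30.0) -> list[str]:
    """Query the endpoint and return the served model ids (network call)."""
    req = urllib.request.Request(
        models_url(base_url),
        # Assumption: a dummy key is accepted, as with OPENAI_API_KEY=dummy.
        headers={"Authorization": "Bearer dummy"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # OpenAI-style list response: {"object": "list", "data": [{"id": ...}, ...]}
    return [model["id"] for model in payload["data"]]
```

Calling `list_models()` performs the request. One would expect `Qwen/Qwen3.5-9B` among the returned ids, although that depends on how the server was launched and is not confirmed here.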
Regenerate
    cd /path/to/CooperBench

    # (1) trajectories
    export OPENAI_API_KEY=dummy
    export OPENAI_BASE_URL=https://cooperbench--qwen35-9b-128k-serve.modal.run/v1
    uv run cooperbench run \
      -n qwen35-cooperdata-team-noproto \
      -s cooperdata-all \
      -a mini_swe_agent_v2 \
      -m openai/Qwen/Qwen3.5-9B \
      --setting team --team-no-protocol \
      --backend docker --concurrency 3 --eval-concurrency 3 \
      --dataset-dir /path/to/CooperData/dataset

    # (2) team-harness logs → coop layout
    uv run python scripts/convert_team_to_coop.py logs/qwen35-cooperdata-team-noproto --out data

    # (3) upload
    hf upload CooperBench/team-coop \
      data/qwen35-cooperdata-team-noproto qwen35-cooperdata-team-noproto \
      --repo-type dataset
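To inspect the uploaded run without regenerating it, the subdirectory can be pulled back down from the dataset repo. The following is a minimal sketch using `huggingface_hub.snapshot_download`, restricted to this run's subdir. `download_kwargs` and `fetch_run` are hypothetical helper names, and whether the repo requires authentication is not stated here.

```python
# Sketch: fetch only this run's subdirectory from the Hugging Face dataset repo.
# Requires `pip install huggingface_hub`; repo and subdir names come from the
# headline table and the upload command.
REPO_ID = "CooperBench/team-coop"
SUBDIR = "qwen35-cooperdata-team-noproto"


def download_kwargs(subdir: str = SUBDIR) -> dict:
    """Build the arguments for snapshot_download, limited to one subdir."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        # Only files under the run's subdirectory are downloaded.
        "allow_patterns": [f"{subdir}/*"],
    }


def fetch_run(subdir: str = SUBDIR) -> str:
    """Download the run's files and return the local snapshot path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(subdir))
```

Calling `fetch_run()` performs the download and returns the local cache path, under which the run's files sit in a `qwen35-cooperdata-team-noproto/` directory.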