User.skill: how faithfully can we simulate a developer?

We distil each SWE-chat user (≥6 sessions) into a role-playable folder. A Claude-Code agent then role-plays that user, and we score the next message it produces against the real held-out one. This report covers the evaluation as it stands after several iterations: filtering targets to genuine user actions, scoring on content, realism and speech-act match, and an optimized simulator built on a shared scaffold. The evaluation covers 9 users; the distillation itself ran for all 99.

How we measure fidelity

Predicting a user's exact next message is near-impossible: at any point many messages are plausible, and the specific one depends on the user's private plan and the repo state. So we measure three complementary things (content, realism and speech-act match), and we only score turns the user actually authored (genuine prompts and interrupts; harness artifacts like injected docs / command output are filtered out).

The clean test for user-specific signal is own folder vs. wrong folder: it holds "has a folder" constant, so any gap is genuinely about which user. Across content, realism and speech-act, the distilled folder beats a wrong user's folder for the v1 simulators — the distillation encodes real user-specific behaviour.

Simulator variants compared

All on filtered user-action targets (9 users). "own" = the user's own folder; "wrong" = a different user's folder.

| Simulator | content own | content wrong | realism own | realism wrong | act own | act wrong |
| Inline (folder pasted in) | 25.2 | 20.6 | 29.5 | 24.3 | 29.4% | 25.5% |
| Folder-access (reads folder) | 27.1 | 24.0 | 31.6 | 30.3 | 23.5% | 19.6% |
| Folder + intent-first prompt | 24.3 | 27 | 31.9 | 30.5 | 25.5% | 25.5% |
| Folder + shared scaffold (optimized) | 26.4 | 30.2 | 36.0 | 32.3 | 29.2% | 29.2% |

The last row is the optimized simulator (shared scaffold). Content and realism are 0–100 judge means; act is the speech-act match rate.
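The own-vs-wrong comparison in the table above comes down to one subtraction per cell pair. The following is a minimal sketch, not the evaluation code: the dictionary keys and function name are illustrative, the numbers are copied from the table, and the intent-first "content wrong" value, printed as 27 in the source, is read as 27.0.

```python
# Scores copied from the table above as (own, wrong) pairs.
# Content and realism are 0-100 judge means; act is a match rate in percent.
# Key names are illustrative labels, not identifiers from the real pipeline.
SCORES = {
    "inline": {"content": (25.2, 20.6), "realism": (29.5, 24.3), "act": (29.4, 25.5)},
    "folder_access": {"content": (27.1, 24.0), "realism": (31.6, 30.3), "act": (23.5, 19.6)},
    "intent_first": {"content": (24.3, 27.0), "realism": (31.9, 30.5), "act": (25.5, 25.5)},
    "shared_scaffold": {"content": (26.4, 30.2), "realism": (36.0, 32.3), "act": (29.2, 29.2)},
}


def own_minus_wrong(scores):
    """Return {simulator: {metric: own - wrong}}, rounded to one decimal."""
    return {
        sim: {metric: round(own - wrong, 1) for metric, (own, wrong) in metrics.items()}
        for sim, metrics in scores.items()
    }


GAPS = own_minus_wrong(SCORES)

if __name__ == "__main__":
    for sim, gaps in GAPS.items():
        print(sim, gaps)
```

A positive gap means the user's own folder scored higher than a wrong user's folder. For act the gap is in percentage points.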

The optimization: a shared simulator scaffold

The biggest fidelity lever isn't voice — it's which move the user makes. We added a shared layer across all users (simulator/AGENT.md + move-playbook skills: drive-the-project, push-back, interrupt, report-what-you-notice, calibrate-to-the-user) that encodes how real developers drive a session and tells each simulation to match the user's move rates. The per-user folder then only specialises voice. Effect on the distilled condition (folder-v1 → optimized):

realism: +4.4 (31.6 → 36.0)
speech-act match: +5.7 percentage points (23.5% → 29.2%)
content: -0.7 (27.1 → 26.4)

The shared scaffold genuinely raised absolute realism and move-accuracy. But it exposed two hard limits, below.

Limit 1 — the simulator still won't diversify its moves

Real users approve only part of the time and make a wide range of moves; the simulator collapses toward "approve / continue" and — despite an explicit interrupt action and skill — never interrupts.

| move | real | simulated |
| approve_proceed | 28% | 54% |
| refine_redirect | 19% | 4% |
| new_work | 15% | 17% |
| bug_report | 13% | 8% |
| pushback | 11% | 8% |
| interrupt ⚠ | 9% | 0% |
| question | 6% | 8% |
8%

real = real users; simulated = the optimized simulator (labeled predictions). ⚠ = the simulator produced zero of this move across all targets.
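To make the collapse easier to read off, the move table above can be turned into per-move gaps. This is a small illustrative sketch: the rates are copied from the table, and the function names are assumptions rather than part of the evaluation code.

```python
# Move rates in percent, copied from the table above as (real, simulated).
MOVES = {
    "approve_proceed": (28, 54),
    "refine_redirect": (19, 4),
    "new_work": (15, 17),
    "bug_report": (13, 8),
    "pushback": (11, 8),
    "interrupt": (9, 0),
    "question": (6, 8),
}


def move_gaps(moves):
    """Return simulated - real per move, in percentage points,
    sorted from most over-produced to most under-produced."""
    gaps = {move: sim - real for move, (real, sim) in moves.items()}
    return dict(sorted(gaps.items(), key=lambda kv: kv[1], reverse=True))


def missing_moves(moves):
    """Moves that real users make but the simulator never produced."""
    return [move for move, (real, sim) in moves.items() if real > 0 and sim == 0]


if __name__ == "__main__":
    for move, gap in move_gaps(MOVES).items():
        print(f"{move:16s} {gap:+d} pp")
    print("never simulated:", missing_moves(MOVES))
```

Under these numbers approve_proceed is the most over-produced move and refine_redirect the most under-produced, with interrupt the only move that never appears.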

The base model's cooperative "approve and continue" prior survives prompt-level instructions. Forcing move-diversity needs move-conditioned generation: sample the move from the user's own rate distribution, then generate that move — the clear next iteration.
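Move-conditioned generation, as proposed above, could look roughly like the sketch below. It is not an implemented part of the simulator. It makes these assumptions: `generate` is a hypothetical callable wrapping whatever model call the simulator uses, `build_prompt` and its wording are invented for illustration, and the rates are the pooled real-user rates from the move table, standing in for a single user's own distribution.

```python
import random

# Assumption: pooled real-user move rates (percent) from the move table,
# used here as a stand-in for one user's own rate distribution.
REAL_MOVE_RATES = {
    "approve_proceed": 28,
    "refine_redirect": 19,
    "new_work": 15,
    "bug_report": 13,
    "pushback": 11,
    "interrupt": 9,
    "question": 6,
}


def sample_move(move_rates, rng):
    """Step 1: draw the move from the rate distribution (weights need not sum to 100)."""
    moves = list(move_rates)
    weights = [move_rates[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


def build_prompt(move, context):
    """Hypothetical prompt template that fixes the move before generation."""
    return (
        f"The user's next move is: {move}.\n"
        f"Write the user's next message making exactly that move.\n\n"
        f"Session so far:\n{context}"
    )


def simulate_next_message(move_rates, context, generate, rng=None):
    """Step 2: generate a message conditioned on the sampled move.

    `generate` is a hypothetical callable (prompt -> text) supplied by the caller.
    """
    rng = rng or random.Random()
    move = sample_move(move_rates, rng)
    return move, generate(build_prompt(move, context))


if __name__ == "__main__":
    # Stub generator so the sketch runs without a model.
    echo = lambda prompt: prompt.splitlines()[0]
    print(simulate_next_message(REAL_MOVE_RATES, "(session text)", echo, random.Random(0)))
```

Because the move is drawn outside the model, the simulated move distribution follows the supplied rates by construction, so interrupts occur at roughly their real frequency regardless of the model's cooperative prior. Whether the generated text then realises the sampled move faithfully would still need to be checked.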

Limit 2 — better, but not more user-specific

The shared scaffold lifts every condition — the no-folder generic baseline's realism rose to 46.7 and the wrong folder improved too. So while absolute fidelity went up, the own-vs-wrong gap did not: content even inverted (own 26.4 vs wrong 30.2) and speech-act match tied. General simulator skill and user-discriminability are partly in tension — shared knowledge helps the wrong folder just as much.

Where it stands