User.skill: how faithfully can we simulate a developer?

We distil each SWE-chat user (≥6 sessions) into a role-playable folder. A Claude-Code agent then role-plays the user from that folder, and we score the next message it produces against the real held-out one. This report covers the evaluation as it stands after several iterations: filtering targets to genuine user actions, scoring on content, realism and speech-act match, and an optimized simulator built on a shared scaffold. The evaluation covers 9 users; the distillation itself ran for all 99.

How we measure fidelity

Predicting a user's exact next message is near-impossible — at any point many messages are plausible, and the specific one depends on the user's private plan and the repo state. So we measure three complementary things, and we only score turns the user actually authored (genuine prompts and interrupts; harness artifacts like injected docs / command output are filtered out).
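The target filter described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Turn` record and the `kind` labels are hypothetical names, and the real logs may tag turns differently.

```python
from dataclasses import dataclass

# Hypothetical turn kinds; the real session logs may use different labels.
USER_AUTHORED = {"prompt", "interrupt"}


@dataclass
class Turn:
    kind: str  # e.g. "prompt", "interrupt", "injected_doc", "command_output"
    text: str


def scoreable_targets(turns):
    """Keep only turns the user actually wrote; drop harness artifacts."""
    return [t for t in turns if t.kind in USER_AUTHORED]
```

Only the turns that survive this filter would be used as held-out targets for scoring.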

Content (0–100) — did the simulator predict roughly the right ask? Useful when the substance is predictable; modest ceiling otherwise.
Realism (0–100) — is the message a plausible, in-character thing this user would send, judged by intent/substance and not catchphrase mimicry?
Speech-act match — did the simulator make the right move (new_work / refine_redirect / pushback / bug_report / approve_proceed / question / interrupt)? This is the fidelity a persona can fairly be held to when exact content is unknowable.
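As an illustration of the third metric, here is a minimal sketch of a speech-act match rate. It assumes each scored turn already has a move label for both the real and the simulated message; how those labels are produced is outside this sketch, and the function name is hypothetical.

```python
# The seven moves listed in the speech-act definition above.
MOVES = {
    "new_work", "refine_redirect", "pushback", "bug_report",
    "approve_proceed", "question", "interrupt",
}


def speech_act_match_rate(pairs):
    """pairs: iterable of (real_move, simulated_move), one per scored turn.

    Returns the fraction of turns where the simulated move equals the real one.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no scored turns")
    for real, simulated in pairs:
        if real not in MOVES or simulated not in MOVES:
            raise ValueError(f"unknown move label in {(real, simulated)!r}")
    hits = sum(real == simulated for real, simulated in pairs)
    return hits / len(pairs)
```

A turn counts as a hit only on an exact label match, so a simulator that always approves scores at most the real approve rate.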

The clean test for user-specific signal is own folder vs. wrong folder: both conditions have a folder, so any gap reflects which user the folder describes rather than the mere presence of one. Across content, realism and speech-act, the distilled folder beats a wrong user's folder for the v1 simulators, so the distillation encodes real user-specific behaviour.

Simulator variants compared

All on filtered user-action targets (9 users). "own" = the user's own folder; "wrong" = a different user's folder.

Simulator	content own	content wrong	realism own	realism wrong	act own	act wrong
Inline (folder pasted in)	25.2	20.6	29.5	24.3	29.4%	25.5%
Folder-access (reads folder)	27.1	24.0	31.6	30.3	23.5%	19.6%
Folder + intent-first prompt	24.3	27.0	31.9	30.5	25.5%	25.5%
Folder + shared scaffold (optimized)	26.4	30.2	36.0	32.3	29.2%	29.2%

The last row is the optimized simulator (shared scaffold). Content and realism are 0–100 judge means; act is the speech-act match rate.

The optimization: a shared simulator scaffold

The biggest fidelity lever isn't voice — it's which move the user makes. We added a shared layer across all users (simulator/AGENT.md + move-playbook skills: drive-the-project, push-back, interrupt, report-what-you-notice, calibrate-to-the-user) that encodes how real developers drive a session and tells each simulation to match the user's move rates. The per-user folder then only specialises voice. Effect on the distilled condition (folder-v1 → optimized):

realism: +4.4 (31.6 → 36.0)
speech-act match: +5.7 pp (23.5% → 29.2%)
content: -0.7 (27.1 → 26.4)

The shared scaffold genuinely raised absolute realism and move-accuracy. But it exposed two hard limits, below.

Limit 1 — the simulator still won't diversify its moves

Real users approve only part of the time and make a wide range of moves; the simulator collapses toward "approve / continue" and — despite an explicit interrupt action and skill — never interrupts.

move	real	simulated
approve_proceed	28%	54%
refine_redirect	19%	4%
new_work	15%	17%
bug_report	13%	8%
pushback	11%	8%
interrupt ⚠	9%	0%
question	6%	8%

Columns: real = real users; simulated = the optimized simulator (labeled predictions). ⚠ = the simulator produced zero of this move across all targets.
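One way to summarise how far the simulated move mix sits from the real one is a total variation distance over the table above. This is a sketch, not a metric the report itself uses. The percentages are copied from the table and renormalised, because the rounded columns sum to 101 and 99.

```python
# Percentages copied from the move table above.
REAL = {"approve_proceed": 28, "refine_redirect": 19, "new_work": 15,
        "bug_report": 13, "pushback": 11, "interrupt": 9, "question": 6}
SIMULATED = {"approve_proceed": 54, "refine_redirect": 4, "new_work": 17,
             "bug_report": 8, "pushback": 8, "interrupt": 0, "question": 8}


def normalize(pcts):
    """Rescale rounded percentages so they sum to exactly 1."""
    total = sum(pcts.values())
    return {move: value / total for move, value in pcts.items()}


def total_variation(p, q):
    """Half the L1 distance between two move distributions (0 = identical)."""
    p, q = normalize(p), normalize(q)
    moves = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in moves)


def largest_gaps(p, q, top=3):
    """Moves with the biggest simulated-minus-real difference, largest first."""
    p, q = normalize(p), normalize(q)
    gaps = {m: q.get(m, 0.0) - p.get(m, 0.0) for m in set(p) | set(q)}
    return sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]
```

Calling `largest_gaps(REAL, SIMULATED)` ranks the moves by how badly they are over- or under-produced, which makes the over-approval visible at a glance.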

The base model's cooperative "approve and continue" prior survives prompt-level instructions. Forcing move-diversity needs move-conditioned generation: sample the move from the user's own rate distribution, then generate that move — the clear next iteration.
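The proposed move-conditioned generation could look roughly like the sketch below: draw the move first, then ask the model for a message of that move. This is an assumption about how the next iteration might be built, not an existing implementation. The rates shown are the aggregate "real" column from the table (a per-user distribution would be used in practice), and `build_move_conditioned_prompt` is a hypothetical helper that stops short of calling any model.

```python
import random

# Aggregate real-user move rates (percent) from the table above; a per-user
# distribution would replace this in practice.
REAL_MOVE_RATES = {"approve_proceed": 28, "refine_redirect": 19, "new_work": 15,
                   "bug_report": 13, "pushback": 11, "interrupt": 9, "question": 6}


def sample_move(move_rates, rng):
    """Draw one move with probability proportional to its measured rate."""
    moves = list(move_rates)
    weights = [move_rates[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


def build_move_conditioned_prompt(move, session_context):
    """Hypothetical prompt: fix the move up front so the model cannot default to approval."""
    return (
        f"The user's next move is: {move}.\n"
        f"Write the user's next message so that it performs exactly this move.\n\n"
        f"Session so far:\n{session_context}"
    )


rng = random.Random(0)  # seeded for reproducible sampling
move = sample_move(REAL_MOVE_RATES, rng)
prompt = build_move_conditioned_prompt(move, "<session transcript goes here>")
```

Because the move is sampled outside the model, interrupts would appear at roughly their measured rate regardless of the base model's cooperative prior.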

Limit 2 — better, but not more user-specific

The shared scaffold lifts every condition: the no-folder generic baseline's realism rose to 46.7 and the wrong folder improved too. So while absolute fidelity went up, the own-vs-wrong gap did not improve across the board: realism still favours own (36.0 vs 32.3), but content inverted (own 26.4 vs wrong 30.2) and speech-act match tied (29.2% each). General simulator skill and user-discriminability are partly in tension, because shared knowledge also helps the wrong folder.
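The own-minus-wrong gaps can be read straight off the comparison table. The sketch below copies the table's numbers and computes the gap per simulator and axis; the dictionary layout and function name are illustrative only.

```python
# (own, wrong) pairs copied from the simulator comparison table.
# content and realism are 0-100 judge means; act is match rate in percent.
SCORES = {
    "Inline": {"content": (25.2, 20.6), "realism": (29.5, 24.3), "act": (29.4, 25.5)},
    "Folder-access": {"content": (27.1, 24.0), "realism": (31.6, 30.3), "act": (23.5, 19.6)},
    "Intent-first": {"content": (24.3, 27.0), "realism": (31.9, 30.5), "act": (25.5, 25.5)},
    "Shared scaffold": {"content": (26.4, 30.2), "realism": (36.0, 32.3), "act": (29.2, 29.2)},
}


def own_minus_wrong(scores):
    """Positive gap = the user's own folder beats a wrong user's folder."""
    return {
        simulator: {axis: round(own - wrong, 1) for axis, (own, wrong) in axes.items()}
        for simulator, axes in scores.items()
    }


for simulator, gaps in own_minus_wrong(SCORES).items():
    print(simulator, gaps)
```

A positive gap on every axis is what user-specific signal looks like; a zero or negative gap means the wrong folder did as well or better.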

Where it stands

The distillation captures real user signal — own beats wrong on all three axes (v1 sims).
Exact content has a low ceiling; speech-act + realism are the fair fidelity axes, and content stays useful where substance is predictable.
The shared scaffold raised absolute realism (+4.4) and move-accuracy (+5.7 pp), but can't force move-diversity or add user-specificity on its own.
Next: move-conditioned generation — sample each turn's move from the user's measured distribution, then generate — to fix both the over-approval/zero-interrupt bias and discriminability.