User.skill: recognizable vs. realistic role-play

We distilled each SWE-chat user with at least 6 sessions into a role-playable folder, then tested the folder by held-out next-message prediction. The test crosses three conditions (own folder / no folder / wrong user's folder) with two generation modes (inline: the folder is pasted into the prompt; folder-access: the agent reads users/<slug>/ itself). An LLM judge (claude-haiku-4-5-20251001) scores each message on three axes: content, style (surface recognizability) and realism (a plausible, in-character message judged by intent and substance, not catchphrase mimicry). The judge also runs a two-alternative forced choice (2AFC, own vs. wrong folder) under both a style and a realism criterion. 9 users; 2AFC chance = 50%.

2AFC · style criterion
83.9%
How often the judge picks the own-folder message as more recognizably this user than a wrong user's message (inline mode). Inline scores well here because it reproduces the user's signature phrases.
2AFC · realism criterion
76.8%
Same test, but rewarding the more plausible in-character message and discounting phrase-parroting. The signal survives (76.8% > 50%) but shrinks, which suggests that part of the style win was caricature.
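To make "chance = 50%" concrete, the following is a minimal sketch of a one-sided exact binomial check on the reported 2AFC rates. It assumes one 2AFC trial per held-out record (56 in total, matching the N in the per-condition tables) and recovers win counts by rounding; neither assumption is stated in the report, and the helper name `binom_tail_ge` is hypothetical.

```python
from math import comb


def binom_tail_ge(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): one-sided exact test against chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_TRIALS = 56  # assumption: one 2AFC trial per held-out record

# Reported 2AFC rates (fraction of trials where the own-folder message was chosen).
reported = {
    "inline / style": 0.839,
    "inline / realism": 0.768,
    "folder-access / style": 0.625,
    "folder-access / realism": 0.589,
}

results = {}
for name, rate in reported.items():
    wins = round(rate * N_TRIALS)  # assumed win count implied by the rate
    results[name] = (wins, binom_tail_ge(wins, N_TRIALS))

for name, (wins, p_value) in results.items():
    print(f"{name}: {wins}/{N_TRIALS} wins, one-sided p = {p_value:.4f}")
```

With only 56 trials, rates close to 60% sit near what chance alone can produce, so a check like this matters most for the folder-access numbers.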

The headline. Predicting a user's exact next message is intrinsically hard, so the signal is in the comparisons. Two things hold up. (1) In aggregate, the distilled folder beats a wrong user's folder under every criterion, so the distillation encodes genuinely user-specific behaviour. (2) Inline beats folder-access on both discrimination criteria, but this is largely a recognizability effect: inline reproduces signature catchphrases (e.g. one user's "looks good whats next" verbatim across unrelated turns), which a discrimination judge rewards.

On the per-record realism axis the ordering flips: folder-access produces more realistic own-folder messages (30.4 vs. 28.8 for inline). Realism is measurable but barely user-discriminative. In folder-access mode even the wrong folder scores as realistic (29.5), because the agent grounds itself in the live conversation, so a 2AFC can't separate own from wrong on realism alone.

Bottom line: inline is better at being recognizable as the user; folder-access is better at being realistic for the user. Discrimination tests measure the former; per-record realism/content measures the latter.
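The "realistic but not user-discriminative" point comes down to the gap between own-folder and wrong-folder realism in each mode. A small sketch of that arithmetic, using the four realism means from the headline comparison (the variable names are illustrative only):

```python
# Judge realism means (0-100 scale), copied from the headline comparison.
realism = {
    "inline": {"own": 28.8, "wrong": 24.3},
    "folder-access": {"own": 30.4, "wrong": 29.5},
}

# Own-minus-wrong gap: how much the correct folder adds over a wrong one.
gaps = {mode: round(s["own"] - s["wrong"], 1) for mode, s in realism.items()}

for mode, gap in gaps.items():
    print(f"{mode}: own - wrong = {gap}")
# inline: own - wrong = 4.5
# folder-access: own - wrong = 0.9
```

Folder-access has the higher own-folder realism but the smaller gap, which is why its 2AFC rates sit closer to chance.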

Headline comparison across modes

Mode | 2AFC style | 2AFC realism | Realism: own | Realism: wrong
Inline (folder pasted into prompt) | 83.9% | 76.8% | 28.8 | 24.3
Folder access (agent reads users/<slug>/) | 62.5% | 58.9% | 30.4 | 29.5

Inline (folder pasted into prompt)

2AFC discrimination

Criterion | Distilled chosen
Style / recognizability | 83.9%
Realism / fidelity | 76.8%

Judge means by condition (0–100)

Condition | N | Content | Style | Realism | Cosine
Own folder (distilled) | 56 | 23.8 | 24.3 | 28.8 | 0.2
No folder (generic dev) | 56 | 24.4 | 27.8 | 34.2 | 0.3
Wrong user's folder | 56 | 19.9 | 16.3 | 24.3 | 0.2

Per-user (realism, and both 2AFC criteria)

User | Own realism | Wrong realism | 2AFC style | 2AFC realism
marcus-sa | 22.5 | 15.1 | 87.5% | 100.0%
jeevanpillay | 27.8 | 31.3 | 83.3% | 83.3%
robouden | 34.8 | 13.1 | 44.4% | 33.3%
melagiri | 28.6 | 39.2 | 91.7% | 91.7%
dipree | 30.0 | 16.4 | 100.0% | 100.0%
ujuc | 45.0 | 3.0 | 100.0% | 100.0%
asragab | 28.0 | 35.0 | 100.0% | 66.7%
pavel401 | 25.0 | 27.0 | 83.3% | 66.7%
roo-oliv | 30.7 | 20.7 | 100.0% | 33.3%

Examples (best / median / worst by realism)

Best: robouden (realism 75, style 10)
real: [Request interrupted by user]
simulated: can you check what secret was leaked and fix it?

Median: jeevanpillay (realism 25, style 45)
real: whats last phase?
simulated: yes proceed

Worst: dipree (realism 0, style 0)
real: [Request interrupted by user for tool use]
simulated: commit and push

Folder access (agent reads users/<slug>/)

2AFC discrimination

Criterion | Distilled chosen
Style / recognizability | 62.5%
Realism / fidelity | 58.9%

Judge means by condition (0–100)

Condition | N | Content | Style | Realism | Cosine
Own folder (distilled) | 56 | 25.4 | 25.9 | 30.4 | 0.2
No folder (generic dev) | 56 | 24.7 | 29.2 | 34.3 | 0.3
Wrong user's folder | 56 | 23.3 | 21.7 | 29.5 | 0.2

Per-user (realism, and both 2AFC criteria)

User | Own realism | Wrong realism | 2AFC style | 2AFC realism
marcus-sa | 21.9 | 13.5 | 62.5% | 62.5%
jeevanpillay | 37.8 | 38.3 | 66.7% | 83.3%
robouden | 25.7 | 30.2 | 44.4% | 44.4%
melagiri | 41.6 | 36.2 | 41.7% | 50.0%
dipree | 24.4 | 29.6 | 100.0% | 75.0%
ujuc | 55.0 | 18.0 | 100.0% | 100.0%
asragab | 26.0 | 29.0 | 66.7% | 33.3%
pavel401 | 29.2 | 20.5 | 50.0% | 50.0%
roo-oliv | 22.0 | 47.7 | 100.0% | 66.7%
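The aggregate means can hide how uneven the per-user picture is. As a rough sketch, the snippet below counts the users whose own-folder realism exceeds the wrong-folder realism in each mode. The values are transcribed by hand from the two per-user tables (an assumption worth double-checking against the source data), and `own_beats_wrong` is a hypothetical helper, not part of the evaluation code.

```python
# (own realism, wrong realism) per user.
# Assumption: values transcribed by hand from the two per-user tables.
inline = {
    "marcus-sa": (22.5, 15.1), "jeevanpillay": (27.8, 31.3),
    "robouden": (34.8, 13.1), "melagiri": (28.6, 39.2),
    "dipree": (30.0, 16.4), "ujuc": (45.0, 3.0),
    "asragab": (28.0, 35.0), "pavel401": (25.0, 27.0),
    "roo-oliv": (30.7, 20.7),
}
folder_access = {
    "marcus-sa": (21.9, 13.5), "jeevanpillay": (37.8, 38.3),
    "robouden": (25.7, 30.2), "melagiri": (41.6, 36.2),
    "dipree": (24.4, 29.6), "ujuc": (55.0, 18.0),
    "asragab": (26.0, 29.0), "pavel401": (29.2, 20.5),
    "roo-oliv": (22.0, 47.7),
}


def own_beats_wrong(table):
    """Return the users whose own-folder realism exceeds the wrong-folder score."""
    return [user for user, (own, wrong) in table.items() if own > wrong]


wins = {
    "inline": len(own_beats_wrong(inline)),
    "folder-access": len(own_beats_wrong(folder_access)),
}
print(wins)  # {'inline': 5, 'folder-access': 4}
```

On this reading, own-folder realism wins for about half of the 9 users in either mode, which fits the point that realism alone is a weak discriminator. With so few users and only a handful of records each, these counts are noisy.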

Examples (best / median / worst by realism)

Best: melagiri (realism 78, style 8)
real: <task-notification> <task-id>ad5e3d5088512e329</task-id> <tool-use-id>REDACTED</tool-use-id> <output-file>/private/tmp/claude-501/-Users-melagiri-Workspace-codeInsights-code-insights/2fe9795d-e760-465b-a19a-ba0898e8dad0/tasks/ad5e3d5088512e329.output</output-file> <status>completed</status> <summary>Agent "Advise on test coverage priorities" completed</summary> <result> Let me ground myself in the current product state before advising. ## Strategic Assessment: Test Coverage Prioritization ### 1.
simulated: ok, continue

Median: jeevanpillay (realism 25, style 45)
real: nope not seeing it. do you understand the outcomes i want? use AskUserQuestions tool
simulated: yeah showing now. also sheet stays open after nav

Worst: marcus-sa (realism 2, style 5)
real:
<command-message>nw-research</command-message>
<command-name>/nw-research</command-name>
<command-args>how are we supposed to write migrations then ?</command-args>
simulated: go with B