User.skill: recognizable vs. realistic role-play

We distilled each SWE-chat user with at least 6 sessions into a role-playable folder, then tested the folder by held-out next-message prediction. Each held-out turn is generated under three conditions (own folder / no folder / wrong user's folder) and in two generation modes (inline = folder pasted into the prompt; folder-access = the agent reads users/<slug>/ itself). An LLM judge (claude-haiku-4-5-20251001) scores each message on three axes: content, style (surface recognizability), and realism (a plausible, in-character message judged by intent and substance, not catchphrase mimicry). The same judge also runs a 2-alternative forced choice (2AFC; own vs. wrong folder) under both a style and a realism criterion. 9 users; 2AFC chance = 50%.
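As a rough illustration of the three conditions in inline mode, the sketch below builds one prompt per condition for a single held-out turn. The function name, the prompt wording, and the folder contents are all hypothetical; the report does not show the actual prompts.

```python
from typing import Optional

def build_prompt(history: list[str], folder_text: Optional[str]) -> str:
    """Assemble an inline-mode prompt; folder_text=None is the no-folder baseline."""
    parts = []
    if folder_text is not None:
        parts.append("User profile (distilled folder):\n" + folder_text)
    else:
        parts.append("You are a generic developer.")  # hypothetical wording
    parts.append("Conversation so far:\n" + "\n".join(history))
    parts.append("Write the user's next message.")
    return "\n\n".join(parts)

# Hypothetical folder contents for two made-up users
folders = {"user-a": "Terse. Approves quickly.", "user-b": "Asks for detailed plans."}
history = ["assistant: Phase 2 is done. Shall I continue?"]

conditions = {
    "own": build_prompt(history, folders["user-a"]),    # the user's own folder
    "none": build_prompt(history, None),                # generic developer
    "wrong": build_prompt(history, folders["user-b"]),  # another user's folder
}
```

Only the profile block differs between conditions; the conversation history is identical, so any difference in the generated message can be attributed to the folder.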

2AFC · style criterion

83.9%

How often the judge picks the own-folder message as more recognizably this user than a message generated from a wrong user's folder (inline mode). Inline scores well above folder-access here (62.5%), largely because it reproduces the user's signature phrases.

2AFC · realism criterion

76.8%

Same test, but rewarding the more plausible in-character message and discounting phrase-parroting. The signal survives (76.8% vs. 50% chance) but shrinks by 7.1 points, which suggests that part of the style win was caricature.
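To gauge how far a 2AFC rate sits above chance, one option is an exact one-sided binomial test. The sketch below assumes the 2AFC was run once per held-out record on the same 56 records as the N column of the score tables, which the report does not state; under that assumption, 47/56 and 43/56 reproduce the reported 83.9% and 76.8%.

```python
from math import comb

def two_afc(correct: int, trials: int) -> tuple[float, float]:
    """Return (hit rate, one-sided exact binomial p-value against chance = 0.5)."""
    rate = correct / trials
    # P(X >= correct) for X ~ Binomial(trials, 0.5)
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return rate, tail / 2 ** trials

# Assumed counts (not given in the report): 47/56 and 43/56
for label, correct in [("style", 47), ("realism", 43)]:
    rate, p = two_afc(correct, 56)
    print(f"{label}: {rate:.1%} chosen, p = {p:.2g}")
```

The test treats trials as independent, which is only approximately true when several records come from the same user.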

The headline. Predicting a user's exact next message is intrinsically hard (mean judge scores stay below 35 out of 100 in every condition), so the signal is in the comparisons. Two things hold up. (1) The distilled folder beats a wrong user's folder on every aggregate measure: both 2AFC criteria and the mean content, style, and realism scores, in both modes. The distillation therefore encodes genuinely user-specific behaviour. (2) Inline beats folder-access on both discrimination criteria, but this is largely a recognizability effect: inline reproduces signature catchphrases (e.g. one user's "looks good whats next" verbatim across unrelated turns), which a discrimination judge rewards.

On the per-record realism axis the ordering flips: folder-access produces slightly more realistic own-folder messages (30.4 vs. 28.8 for inline). Realism, however, is not user-discriminative. In folder-access mode even the wrong folder scores as realistic (29.5, because the agent grounds itself in the live conversation), and the no-folder baseline scores highest of all in both modes (34.2 inline, 34.3 folder-access), so a 2AFC on realism barely separates own from wrong (58.9%).

Bottom line: inline is better at being recognizable as the user; folder-access is better at being realistic for the user. Discrimination tests measure the former; per-record realism/content scores measure the latter.
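The "realistic but not user-discriminative" point can be read off as the gap between own-folder and wrong-folder realism in each mode. The sketch below is only a reading aid: the numbers are copied from the headline comparison table, and the gap is not a statistic the report itself computes.

```python
# Mean realism scores (0-100) copied from the headline comparison table
realism = {
    "inline": {"own": 28.8, "wrong": 24.3},
    "folder-access": {"own": 30.4, "wrong": 29.5},
}

def own_minus_wrong(scores: dict[str, float]) -> float:
    """Gap between own-folder and wrong-folder realism; larger = more user-specific."""
    return round(scores["own"] - scores["wrong"], 1)

gaps = {mode: own_minus_wrong(s) for mode, s in realism.items()}
print(gaps)  # {'inline': 4.5, 'folder-access': 0.9}
```

Folder-access has the higher absolute realism but a gap of under one point, which is why a forced choice between own and wrong has little to work with in that mode.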

Headline comparison across modes

Mode	2AFC style	2AFC realism	Realism, own folder (0–100)	Realism, wrong folder (0–100)
Inline (folder pasted into prompt)	83.9%	76.8%	28.8	24.3
Folder access (agent reads users/<slug>/)	62.5%	58.9%	30.4	29.5

Inline (folder pasted into prompt)

2AFC discrimination

Criterion	Own (distilled) folder chosen
Style / recognizability	83.9%
Realism / fidelity	76.8%

Judge means by condition (0–100)

Condition	N	Content	Style	Realism	Cosine
Own folder (distilled)	56	23.8	24.3	28.8	0.2
No folder (generic dev)	56	24.4	27.8	34.2	0.3
Wrong user's folder	56	19.9	16.3	24.3	0.2

Per-user (realism, and both 2AFC criteria)

User	own realism	wrong realism	2AFC style	2AFC realism
marcus-sa	22.5	15.1	87.5%	100.0%
jeevanpillay	27.8	31.3	83.3%	83.3%
robouden	34.8	13.1	44.4%	33.3%
melagiri	28.6	39.2	91.7%	91.7%
dipree	30.0	16.4	100.0%	100.0%
ujuc	45.0	3.0	100.0%	100.0%
asragab	28.0	35.0	100.0%	66.7%
pavel401	25.0	27.0	83.3%	66.7%
roo-oliv	30.7	20.7	100.0%	33.3%
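The aggregate means hide considerable per-user variation. As a reading aid (not part of the reported analysis), the sketch below tallies the users whose own-folder realism exceeds their wrong-folder realism, using the values from the inline per-user table above.

```python
# (own realism, wrong realism) per user, copied from the inline per-user table
inline_realism = {
    "marcus-sa": (22.5, 15.1), "jeevanpillay": (27.8, 31.3),
    "robouden": (34.8, 13.1), "melagiri": (28.6, 39.2),
    "dipree": (30.0, 16.4), "ujuc": (45.0, 3.0),
    "asragab": (28.0, 35.0), "pavel401": (25.0, 27.0),
    "roo-oliv": (30.7, 20.7),
}

# Users for whom the own folder scores as more realistic than a wrong folder
own_wins = sorted(u for u, (own, wrong) in inline_realism.items() if own > wrong)
print(f"{len(own_wins)}/{len(inline_realism)} users: {own_wins}")
```

In inline mode the own folder wins on realism for 5 of the 9 users, so the aggregate own-over-wrong advantage rests on a few large gaps (ujuc, robouden) and is not uniform across users.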

Examples (best / median / worst by realism)

robouden — realism 75, style 10

real

[Request interrupted by user]

simulated

can you check what secret was leaked and fix it?

jeevanpillay — realism 25, style 45

real

whats last phase?

simulated

yes proceed

dipree — realism 0, style 0

real

[Request interrupted by user for tool use]

simulated

commit and push

Folder access (agent reads users/<slug>/)

2AFC discrimination

Criterion	Own (distilled) folder chosen
Style / recognizability	62.5%
Realism / fidelity	58.9%

Judge means by condition (0–100)

Condition	N	Content	Style	Realism	Cosine
Own folder (distilled)	56	25.4	25.9	30.4	0.2
No folder (generic dev)	56	24.7	29.2	34.3	0.3
Wrong user's folder	56	23.3	21.7	29.5	0.2

Per-user (realism, and both 2AFC criteria)

User	own realism	wrong realism	2AFC style	2AFC realism
marcus-sa	21.9	13.5	62.5%	62.5%
jeevanpillay	37.8	38.3	66.7%	83.3%
robouden	25.7	30.2	44.4%	44.4%
melagiri	41.6	36.2	41.7%	50.0%
dipree	24.4	29.6	100.0%	75.0%
ujuc	55.0	18.0	100.0%	100.0%
asragab	26.0	29.0	66.7%	33.3%
pavel401	29.2	20.5	50.0%	50.0%
roo-oliv	22.0	47.7	100.0%	66.7%

Examples (best / median / worst by realism)

melagiri — realism 78, style 8

real

<task-notification> <task-id>ad5e3d5088512e329</task-id> <tool-use-id>REDACTED</tool-use-id> <output-file>/private/tmp/claude-501/-Users-melagiri-Workspace-codeInsights-code-insights/2fe9795d-e760-465b-a19a-ba0898e8dad0/tasks/ad5e3d5088512e329.output</output-file> <status>completed</status> <summary>Agent "Advise on test coverage priorities" completed</summary> <result> Let me ground myself in the current product state before advising. ## Strategic Assessment: Test Coverage Prioritization ### 1.

simulated

ok, continue

jeevanpillay — realism 25, style 45

real

nope not seeing it. do you understand the outcomes i want? use AskUserQuestions tool

simulated

yeah showing now. also sheet stays open after nav

marcus-sa — realism 2, style 5

real

<command-message>nw-research</command-message>
<command-name>/nw-research</command-name>
<command-args>how are we supposed to write migrations then ?</command-args>

simulated

go with B