We distilled each SWE-chat user (≥6 sessions) into a role-playable folder, then tested it by
held-out next-message prediction under three conditions (own folder / no folder / wrong user's
folder) and two generation modes (inline = folder pasted into the prompt; folder-access = the agent
reads users/<slug>/ itself). We score with an LLM judge
(claude-haiku-4-5-20251001) on three axes — content, style
(surface recognizability) and realism (plausible, in-character message judged by
intent and substance, not catchphrase mimicry) — plus a 2-alternative forced choice
(own vs. wrong folder) run under both a style and a realism criterion. 9 users; chance = 50%.
2AFC · style criterion: 83.9%
How often the judge picks the own-folder message as more recognizably this user than a wrong user's message (inline mode). Inline scores high here because it reproduces the user's signature phrases.
2AFC · realism criterion: 76.8%
The same test, but rewarding the more plausible in-character message and discounting phrase-parroting. The signal survives (76.8% > 50%) but shrinks, indicating that part of the style win was caricature.
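One way to check that a 2AFC rate really sits above the 50% chance level is an exact one-sided binomial test. The sketch below is illustrative only: it assumes the 2AFC ran on 56 trials (the N reported for the judge means; the report does not state the 2AFC trial count) and back-computes win counts from the quoted percentages, so those counts are approximations rather than reported figures.

```python
from math import comb


def binom_tail(k: int, n: int) -> float:
    """P(X >= k) for X ~ Binomial(n, 0.5): one-sided p-value against chance."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


N_TRIALS = 56  # assumption: same N as the judge-means tables

# Inline-mode 2AFC rates quoted above (fraction of trials won by the own-folder message).
rates = {
    "inline / style": 0.839,
    "inline / realism": 0.768,
}

p_values = {}
for name, rate in rates.items():
    wins = round(rate * N_TRIALS)  # back-computed from the percentage, hence approximate
    p_values[name] = binom_tail(wins, N_TRIALS)
    print(f"{name}: ~{wins}/{N_TRIALS} wins, one-sided p = {p_values[name]:.2g}")
```

Under these assumptions both inline rates are far from what chance alone would produce; the same function can be applied to any other rate and trial count.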
The headline. Predicting a user's exact next message is intrinsically hard, so the signal is in the comparisons. Two things hold up:
(1) The distilled folder beats a wrong user's folder under every criterion in the pooled results, so the distillation encodes genuinely user-specific behaviour.
(2) Inline beats folder-access on both discrimination criteria, but this is largely a recognizability effect: inline reproduces signature catchphrases (e.g. one user's "looks good whats next" verbatim across unrelated turns), which a discrimination judge rewards.
On the per-record realism axis the ordering flips: folder-access produces more realistic distilled messages (30.4 vs. 28.8 for inline). That realism is not user-discriminative, however: in folder-access mode even the wrong folder scores as realistic (29.5), because the agent grounds itself in the live conversation, so a 2AFC cannot separate own from wrong on realism alone.
Bottom line: inline is better at being recognizable as the user; folder-access is better at being realistic for the user. Discrimination tests measure the former; per-record realism/content measures the latter.
Headline comparison across modes
| Mode | 2AFC style | 2AFC realism | realism: own | realism: wrong |
| --- | --- | --- | --- | --- |
| Inline (folder pasted into prompt) | 83.9% | 76.8% | 28.8 | 24.3 |
| Folder access (agent reads users/<slug>/) | 62.5% | 58.9% | 30.4 | 29.5 |
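The claim that realism is not user-discriminative in folder-access mode comes down to the gap between the own-folder and wrong-folder realism means. A minimal sketch of that arithmetic, using only the four realism means from the headline comparison (the dictionary layout and names are illustrative):

```python
# Mean realism scores (0-100) from the headline comparison.
realism = {
    "inline": {"own": 28.8, "wrong": 24.3},
    "folder_access": {"own": 30.4, "wrong": 29.5},
}

# Own-minus-wrong gap per mode: a small gap means the folder's identity
# barely changes how realistic the generated message is judged to be.
gaps = {mode: round(s["own"] - s["wrong"], 1) for mode, s in realism.items()}

for mode, gap in gaps.items():
    print(f"{mode}: own - wrong realism gap = {gap:+.1f}")
# inline: own - wrong realism gap = +4.5
# folder_access: own - wrong realism gap = +0.9
```

The folder-access gap is about a fifth of the inline gap, which is consistent with a 2AFC having little to separate on realism alone in that mode.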
Inline (folder pasted into prompt)
2AFC discrimination
| Criterion | distilled chosen |
| --- | --- |
| Style / recognizability | 83.9% |
| Realism / fidelity | 76.8% |
Judge means by condition (0–100)
| Condition | N | Content | Style | Realism | Cosine |
| --- | --- | --- | --- | --- | --- |
| Own folder (distilled) | 56 | 23.8 | 24.3 | 28.8 | 0.2 |
| No folder (generic dev) | 56 | 24.4 | 27.8 | 34.2 | 0.3 |
| Wrong user's folder | 56 | 19.9 | 16.3 | 24.3 | 0.2 |
Per-user (realism, and both 2AFC criteria)
| User | own realism | wrong realism | 2AFC style | 2AFC realism |
| --- | --- | --- | --- | --- |
| marcus-sa | 22.5 | 15.1 | 87.5% | 100.0% |
| jeevanpillay | 27.8 | 31.3 | 83.3% | 83.3% |
| robouden | 34.8 | 13.1 | 44.4% | 33.3% |
| melagiri | 28.6 | 39.2 | 91.7% | 91.7% |
| dipree | 30.0 | 16.4 | 100.0% | 100.0% |
| ujuc | 45.0 | 3.0 | 100.0% | 100.0% |
| asragab | 28.0 | 35.0 | 100.0% | 66.7% |
| pavel401 | 25.0 | 27.0 | 83.3% | 66.7% |
| roo-oliv | 30.7 | 20.7 | 100.0% | 33.3% |
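The pooled figures hide considerable per-user variation. As a rough illustration, the sketch below copies the inline per-user rows and counts how many users individually show the own-folder advantage on each measure; the tuple layout and variable names are illustrative and not part of the evaluation code.

```python
# (user, own realism, wrong realism, 2AFC style %, 2AFC realism %), inline mode
rows = [
    ("marcus-sa", 22.5, 15.1, 87.5, 100.0),
    ("jeevanpillay", 27.8, 31.3, 83.3, 83.3),
    ("robouden", 34.8, 13.1, 44.4, 33.3),
    ("melagiri", 28.6, 39.2, 91.7, 91.7),
    ("dipree", 30.0, 16.4, 100.0, 100.0),
    ("ujuc", 45.0, 3.0, 100.0, 100.0),
    ("asragab", 28.0, 35.0, 100.0, 66.7),
    ("pavel401", 25.0, 27.0, 83.3, 66.7),
    ("roo-oliv", 30.7, 20.7, 100.0, 33.3),
]

CHANCE = 50.0  # 2AFC chance level, in percent

# Users whose own folder scores higher than a wrong folder on mean realism.
own_beats_wrong = [u for u, own, wrong, _, _ in rows if own > wrong]
# Users whose per-user 2AFC rate is above chance under each criterion.
style_above_chance = [u for u, _, _, style, _ in rows if style > CHANCE]
realism_above_chance = [u for u, _, _, _, real in rows if real > CHANCE]

print(f"own > wrong on realism: {len(own_beats_wrong)} of {len(rows)} users")
print(f"2AFC style above chance: {len(style_above_chance)} of {len(rows)} users")
print(f"2AFC realism above chance: {len(realism_above_chance)} of {len(rows)} users")
```

Per-user 2AFC rates rest on a handful of trials each, so these counts are best read as a description of spread rather than as per-user significance claims.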
Examples (best / median / worst by realism)
robouden (realism 75, style 10)
real: [Request interrupted by user]
simulated: can you check what secret was leaked and fix it?

jeevanpillay (realism 25, style 45)
real: whats last phase?
simulated: yes proceed

dipree (realism 0, style 0)
real: [Request interrupted by user for tool use]
simulated: commit and push
Folder access (agent reads users/<slug>/)
2AFC discrimination
| Criterion | distilled chosen |
| --- | --- |
| Style / recognizability | 62.5% |
| Realism / fidelity | 58.9% |
Judge means by condition (0–100)
| Condition | N | Content | Style | Realism | Cosine |
| --- | --- | --- | --- | --- | --- |
| Own folder (distilled) | 56 | 25.4 | 25.9 | 30.4 | 0.2 |
| No folder (generic dev) | 56 | 24.7 | 29.2 | 34.3 | 0.3 |
| Wrong user's folder | 56 | 23.3 | 21.7 | 29.5 | 0.2 |
Per-user (realism, and both 2AFC criteria)
| User | own realism | wrong realism | 2AFC style | 2AFC realism |
| --- | --- | --- | --- | --- |
| marcus-sa | 21.9 | 13.5 | 62.5% | 62.5% |
| jeevanpillay | 37.8 | 38.3 | 66.7% | 83.3% |
| robouden | 25.7 | 30.2 | 44.4% | 44.4% |
| melagiri | 41.6 | 36.2 | 41.7% | 50.0% |
| dipree | 24.4 | 29.6 | 100.0% | 75.0% |
| ujuc | 55.0 | 18.0 | 100.0% | 100.0% |
| asragab | 26.0 | 29.0 | 66.7% | 33.3% |
| pavel401 | 29.2 | 20.5 | 50.0% | 50.0% |
| roo-oliv | 22.0 | 47.7 | 100.0% | 66.7% |
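The pooled realism advantage of folder access over inline (30.4 vs. 28.8) can be broken down per user by differencing the own-realism columns of the two per-user tables. The sketch below does only that arithmetic; the dictionary and variable names are illustrative, and the values are copied from the inline and folder-access per-user tables.

```python
# Per-user mean realism of own-folder messages (0-100): (inline, folder access).
own_realism = {
    "marcus-sa": (22.5, 21.9),
    "jeevanpillay": (27.8, 37.8),
    "robouden": (34.8, 25.7),
    "melagiri": (28.6, 41.6),
    "dipree": (30.0, 24.4),
    "ujuc": (45.0, 55.0),
    "asragab": (28.0, 26.0),
    "pavel401": (25.0, 29.2),
    "roo-oliv": (30.7, 22.0),
}

# Positive delta: folder access produced more realistic messages for that user.
delta = {u: round(fa - inl, 1) for u, (inl, fa) in own_realism.items()}
improved = sorted(u for u, d in delta.items() if d > 0)

for user, d in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{user:<13} {d:+.1f}")
print(f"{len(improved)} of {len(delta)} users score higher under folder access")
```

On these numbers the shift is uneven across users, so the pooled difference should not be read as a uniform per-user improvement.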
Examples (best / median / worst by realism)
melagiri (realism 78, style 8)
real: <task-notification> <task-id>ad5e3d5088512e329</task-id> <tool-use-id>REDACTED</tool-use-id> <output-file>/private/tmp/claude-501/-Users-melagiri-Workspace-codeInsights-code-insights/2fe9795d-e760-465b-a19a-ba0898e8dad0/tasks/ad5e3d5088512e329.output</output-file> <status>completed</status> <summary>Agent "Advise on test coverage priorities" completed</summary> <result> Let me ground myself in the current product state before advising. ## Strategic Assessment: Test Coverage Prioritization ### 1.
simulated: ok, continue

jeevanpillay (realism 25, style 45)
real: nope not seeing it. do you understand the outcomes i want? use AskUserQuestions tool
simulated: yeah showing now. also sheet stays open after nav

marcus-sa (realism 2, style 5)
real:
<command-message>nw-research</command-message>
<command-name>/nw-research</command-name>
<command-args>how are we supposed to write migrations then ?</command-args>