
Work in Progress

Prescriptive Bias in LLM Sampling

Do language models sample from statistical reality, or from their sense of how things ought to be?

built with
  • pydantic-ai
  • Python
  • OpenAI GPT-5-nano
  • Logfire
There's a paper by Sivaprasad et al., A Theory of LLM Sampling: Part Descriptive and Part Prescriptive, that makes a pretty interesting claim: when you ask a language model to give you a "typical" value for something, it doesn't just draw from a statistical distribution. It gets pulled toward an implicit ideal of what that thing should be. So instead of telling you how many hours people actually sleep, it nudges toward how many hours they ought to sleep.
These experiments replicate and extend that idea. Three experiments so far: a baseline across everyday concepts, a cross-lingual test in English vs. German, and a persona injection test to see if telling the model to "be a statistician" actually changes anything.
Key Quantities
A(C): Average the model reports for concept C
I(C): Ideal the model reports for concept C
S(C): Sample the model draws for concept C
α: Deviation of S(C) from A(C) toward I(C)
α̂: α normalized by |A(C) − I(C)|, so that the gap between average and ideal equals 1
Formula
$$\alpha = (A(C) - S(C)) \times \text{sign}(A(C) - I(C))$$
$$\hat{\alpha} = \alpha \,/\, |A(C) - I(C)|$$

Positive α̂ means the sample moved away from the reported average in the direction of the ideal. Each probe runs in an independent context (n = 10 runs per concept) so the model can't trivially condition on prior answers.
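
To make the two formulas concrete, here is a minimal sketch in Python. The function names and the example numbers are made up for illustration and are not measured results. Returning `None` when the average equals the ideal is also an assumption, since the write-up doesn't say how that case is handled.

```python
from typing import Optional


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def alpha(average: float, ideal: float, sample: float) -> float:
    """Deviation of the sample from the average, positive when it points toward the ideal."""
    return (average - sample) * sign(average - ideal)


def alpha_hat(average: float, ideal: float, sample: float) -> Optional[float]:
    """alpha divided by |A - I|; None when A equals I (assumed handling, not from the source)."""
    gap = abs(average - ideal)
    if gap == 0:
        return None
    return alpha(average, ideal, sample) / gap


# Hypothetical values for "hours of sleep": A = 7.0, I = 8.0.
toward = alpha_hat(7.0, 8.0, 7.5)  # sample sits between average and ideal
away = alpha_hat(7.0, 8.0, 6.5)    # sample sits on the far side of the average
print(toward, away)
```

With these numbers the first call is positive and the second negative, which matches the reading of the sign given above.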

Experiment 1: Baseline

First question: does this effect show up at all on a different model than the ones in the paper? 25 everyday concepts across health, behavior, technology, and social life, each probed 10 times at temperature 0.8.
Run Details
Model: openai:gpt-5-nano
Concepts: 25
Temperature: 0.8
Runs per concept: 10
Positive α̂: 13 / 25

13/25 isn't statistically meaningful on its own, but the magnitude tells a clearer story. Value-laden concepts (smoking, calories, sugary drinks) pull hard toward the ideal, while neutral ones sit near zero.
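
One way to check the "not statistically meaningful" reading is an exact sign test against a fair coin. The sketch below assumes each concept counts as one independent trial and that no concept landed exactly on zero; neither is stated in the write-up, and `sign_test_p_value` is a name made up for this example.

```python
from math import comb


def sign_test_p_value(successes: int, trials: int) -> float:
    """Two-sided exact binomial p-value under the null p = 0.5."""
    # Use the more extreme side so the test is symmetric in successes/failures.
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, j) for j in range(k, trials + 1)) / 2**trials
    # Doubling the tail is exact for p = 0.5; cap at 1.0.
    return min(1.0, 2 * upper_tail)


p = sign_test_p_value(13, 25)
print(f"p = {p:.3f}")  # prints "p = 1.000"
```

A 13/25 split is as close to a coin flip as 25 trials allow, so the count alone carries no evidence either way.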

Logfire trace: 454a1cdc-4178-4485-b133-c28b58059747
[Figure: α̂ per concept. Positive α̂: sample pulled toward ideal. Negative α̂: pulled away or no effect.]

Notes
  • Concepts with clear social norms like smoking, calories, and sugary drinks show large positive α̂. The model clearly has a sense of "too much."
  • Screen-time concepts like social media minutes and phone checks go negative. The model seems inconsistent or pulls in the wrong direction entirely.
  • Rare or morally neutral events like drunk driving and parking tickets cluster near zero. There's no meaningful ideal to pull toward.
  • Some outliers are wild (1440 minutes on social media, i.e., all day) and inflate the α̂ values. Variance is a real issue here.
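
To illustrate the outlier problem from the last note, the sketch below compares two ways of collapsing ten samples into a single S(C). All numbers are hypothetical, and the write-up doesn't say how the runs are aggregated, so treating them with a mean or a median is an assumption in both cases.

```python
from statistics import mean, median


def alpha_hat(average: float, ideal: float, sample: float) -> float:
    """Normalized deviation of the sample from the average toward the ideal.

    Assumes average != ideal; otherwise the division is undefined.
    """
    gap = average - ideal
    direction = (gap > 0) - (gap < 0)
    return (average - sample) * direction / abs(gap)


# Hypothetical samples (minutes per day on social media); one run returns 1440.
samples = [120, 90, 150, 100, 110, 130, 95, 140, 105, 1440]
average, ideal = 140, 60  # hypothetical A(C) and I(C)

mean_s = mean(samples)      # dragged far upward by the single outlier
median_s = median(samples)  # barely affected by it

print(alpha_hat(average, ideal, mean_s))    # negative: the outlier dominates
print(alpha_hat(average, ideal, median_s))  # positive
```

In this toy case one extreme run is enough to flip the sign of α̂ under mean aggregation, while the median keeps the direction suggested by the other nine runs.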

Experiment 2: English vs. German

If prescriptive pull is baked into the model's training, you might expect it to be relatively stable across languages. But cultural norms vary, so does the model's implicit ideal shift when you prompt in German?
Run Details
Languages: EN, DE
Concepts: 15
Runs per concept: 10
Direction flips: 8 / 15

More than half the concepts flip the sign of α̂ between English and German. That's a lot. It suggests the model's "ideal" isn't universal; it's entangled with language.

Logfire trace: 5aa4abbb-2da2-4041-b2cd-edab16f60af0
[Figure: α̂ per concept, EN vs DE. Concepts where the EN and DE bars point in opposite directions are direction flips.]

Notes
  • Sleep hours is the starkest case: EN α̂ = +2.0, DE α̂ = −5.0. In English the sample nudges toward the ideal; in German it goes the other way.
  • Adults who smoke also flips: EN −0.68, DE +0.65. Possibly different associations with what "typical" means in each language context.
  • Concepts like sugary drinks and books read stay consistent across languages, probably because those ideals are more culturally universal.
  • The 53% flip rate is hard to explain away as noise. It points toward language-specific cultural priors being encoded during training.
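
Counting direction flips amounts to comparing the sign of α̂ per concept across the two languages. The sketch below uses the two pairs quoted in the notes (sleep hours, adults who smoke); the values for sugary drinks and books read are placeholders, since the notes only say those concepts stay consistent. The function name and the choice not to count an α̂ of exactly zero as a flip are assumptions.

```python
def direction_flips(en: dict[str, float], de: dict[str, float]) -> list[str]:
    """Return concepts whose alpha-hat has opposite signs in the two languages."""
    # A product below zero means one value is positive and the other negative.
    # A value of exactly zero is not counted as a flip (assumed convention).
    return [concept for concept in en if concept in de and en[concept] * de[concept] < 0]


alpha_hat_en = {
    "sleep hours": 2.0,         # from the notes
    "adults who smoke": -0.68,  # from the notes
    "sugary drinks": 1.2,       # placeholder value
    "books read": 0.4,          # placeholder value
}
alpha_hat_de = {
    "sleep hours": -5.0,        # from the notes
    "adults who smoke": 0.65,   # from the notes
    "sugary drinks": 0.9,       # placeholder value
    "books read": 0.3,          # placeholder value
}

flips = direction_flips(alpha_hat_en, alpha_hat_de)
print(f"{len(flips)} / {len(alpha_hat_en)} flips: {flips}")
```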

Experiment 3: Persona Injection

Can you talk the model out of it? Three system-prompt personas were tested across 8 medical and 8 financial concepts: a baseline helpful assistant, a statistician (meant to push toward empirical responses), and a domain expert (clinician or financial analyst).
Run Details
Medical concepts: 8
Financial concepts: 8
Personas: 3 per domain (4 distinct)
Runs per concept: 10
Logfire trace: 219e505c-3d34-4d27-aeec-bb86610edf78
[Figure: Medical concepts, α̂ by persona]
[Figure: Financial concepts, α̂ by persona]
Personas
  • helpful_assistant: Baseline. No special framing.
  • statistician: Explicitly empirical. Hypothesis: lowers α̂.
  • clinician: Medical domain expert.
  • financial_analyst: Finance domain expert.
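
The persona setup can be sketched as a small probing loop. Everything below is illustrative: the system-prompt wording, the question text, and the `ask(system_prompt, question)` callable are assumptions standing in for whatever model client is used, and only the sample probe is shown. The average and ideal probes would follow the same pattern.

```python
from typing import Callable

# Hypothetical system prompts; the actual wording is not given in the write-up.
PERSONAS = {
    "helpful_assistant": "You are a helpful assistant.",
    "statistician": "You are a statistician. Answer from empirical data.",
    "clinician": "You are a practicing clinician.",
    "financial_analyst": "You are a financial analyst.",
}

DOMAIN_EXPERT = {"medical": "clinician", "financial": "financial_analyst"}


def personas_for(domain: str) -> list[str]:
    """Three personas per domain: baseline, statistician, and the domain expert."""
    return ["helpful_assistant", "statistician", DOMAIN_EXPERT[domain]]


def run_sample_probes(
    ask: Callable[[str, str], float], concept: str, domain: str, runs: int = 10
) -> dict[str, list[float]]:
    """Collect `runs` samples per persona; each ask() call is a fresh, independent context."""
    question = f"Give one typical value for: {concept}"  # hypothetical wording
    return {
        persona: [ask(PERSONAS[persona], question) for _ in range(runs)]
        for persona in personas_for(domain)
    }


def fake_ask(system_prompt: str, question: str) -> float:
    """Stand-in for a real model call so the sketch runs offline."""
    return 14.0


results = run_sample_probes(fake_ask, "days to recover from an ankle sprain", "medical")
print({persona: len(values) for persona, values in results.items()})
```

Passing the model call in as a plain function keeps the loop independent of any particular client library, so the stub can be replaced with a real call without touching the rest.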

Notes
  • Telling the model to "be a statistician" doesn't reliably reduce α̂. For ankle sprain and stock allocation, it actually makes things worse.
  • The clinician persona does seem to ground medical responses more, with α̂ dropping for most medical concepts under that framing.
  • Financial analyst results are messier. The persona suppresses α̂ on some concepts but amplifies it on others, with no clear pattern.
  • Some values are just unstable regardless of persona. Stock allocation and appendectomy recovery are noisy across all three.