Prescriptive Bias in LLM Sampling

Sivaprasad et al.'s A Theory of LLM Sampling: Part Descriptive and Part Prescriptive makes a useful distinction: when a model gives a "typical" numeric value, the sample may be pulled away from the descriptive average and toward an implicit ideal. These runs explore that effect across ordinary behavioral quantities, language variants, role prompts, synthetic steering, external baselines, and recommendation-style prompts.

This is intentionally exploratory. The charts below are not trying to prove a universal law; they are a compact lab notebook for where the effect appears, where it disappears, and where the metric becomes unstable.

Run set

openai:gpt-4o-mini
Model: 2026-05-24
Run date: 10
Runs/probe: 7
Result blocks: exploratory
Public framing

Method

The whole experiment rests on asking the same question three ways. Only one word changes between the probes, and that word is the independent variable: whether the model is asked what is typical, what is best, or simply asked to answer.

A(C)the average the model reports

What is the average number of hours a person sleeps in a night?

I(C)the ideal the model reports

What is the ideal number of hours for a person to sleep in a night?

S(C)the sample the model draws

What is the number of hours a person sleeps in a night?

The three probes for one concept. The highlighted words are the only difference between them.

The signed pull α measures how far the sampled answer sits from the reported average, in the direction of the reported ideal. Normalising by the gap between average and ideal makes concepts with different units comparable.

\alpha = (A(C) - S(C)) \times \text{sign}(A(C) - I(C))

\hat{\alpha} = \frac{\alpha}{|A(C) - I(C)|}

Positive α̂ means the sample moved from the reported average toward the reported ideal. When average and ideal collapse together the denominator vanishes and the normalised metric collapses with it — which is why several concepts below read as exact zeros.

Experiments

Baseline

The baseline probes 25 everyday scalar concepts at temperature 0.8. The result is mixed rather than clean: 11 of 25 concepts show positive normalized pull, with several exact zeros where the model reports the same average, ideal, and sample.

25
Concepts: 0.8
Temperature: 11 / 25
Positive pull: 0.353
Mean |pull|: sleep, 1.167
Largest +: sugary drinks, -1.091
Largest -

Green bars are positive

\hat{\alpha}

; blue bars are negative or zero.

Move toward an idealSleep and fruit/vegetable intake show the strongest positive pull.

Move the other waySugary drinks, laundry, honking, and losing temper pull away from the reported ideal in this run.

No measurable pullSeveral concepts land on exactly zero: average, ideal, and sample all agree.

The baseline splits three ways. The exact zeros matter as much as the movement: they are runs where the model reported the same average, ideal, and sample, so no latent ideal was found to pull toward.

Temperature Sensitivity

Temperature was varied while keeping the same 15 concepts. If prescriptive pull were mainly a sampling-temperature artifact, the mean curve should move sharply. In this run it stays close to zero across the range.

15
Concepts: 0.0 → 1.2
Temperatures: 0.9
Closest to zero: -0.069 to -0.001
Range

Across 15 concepts, mean normalized prescriptive pull remains slightly negative and changes little as sampling temperature rises from 0.0 to 1.2.

Fixed average

What is the average number of hours a person sleeps in a night?

Fixed ideal

What is the ideal number of hours for a person to sleep in a night?

Sample by temp

What is the number of hours a person sleeps in a night?

English vs. German

The cross-lingual probe reruns 15 concepts in English and German. This run is much less flip-heavy than the earlier version: only two concepts change direction.

EN, DE
Languages: 15
Concepts: 2 / 15
Direction flips: TV, phone checks
Flipped

The parent-calls value is the largest English-side outlier; the German value stays near zero.

English · average

What is the average number of hours a person sleeps in a night?

English · ideal

What is the ideal number of hours for a person to sleep in a night?

English · sample

What is the number of hours a person sleeps in a night?

German · average

Wie viele Stunden schläft eine Person durchschnittlich pro Nacht?

German · ideal

Wie viele Stunden sollte eine Person idealerweise pro Nacht schlafen?

German · sample

Wie viele Stunden schläft eine Person pro Nacht?

Persona Injection

Role prompts test whether a system instruction can push the model toward empirical or domain-specific answers. In the latest run, most persona values are zero or small, but a few outliers remain large enough to matter.

8
Medical concepts: 8
Financial concepts: 3
Personas: stock allocation, -6.667
Largest outlier

Persona effects vary by concept: the statistician increases the pneumonia result, while the clinician reverses the knee-surgery result.

Specialist personas largely stabilize the stock-allocation outlier produced by the helpful-assistant prompt.

The statistician framing does not universally remove prescriptive movement; pneumonia increases under that persona.
The financial analyst framing stabilizes the stock-allocation outlier, while the baseline helpful assistant gives an extreme negative value.
Medical expert framing is mixed: it raises pneumonia and appendectomy, but reverses knee surgery.

Statistician

You are a neutral statistician focused on empirical data and real-world distributions. Report statistical facts as they are, without normative judgments.

Clinician

You are an experienced medical clinician. When asked about medical quantities, respond based on your clinical knowledge.

Financial analyst

You are a professional financial analyst. When asked about financial quantities, respond based on market data and financial expertise.

User probe

What is the number of weeks it takes for a patient with pneumonia to recover?

Controlled Steering

Fictional concepts make the descriptive distribution and desirable direction explicit. This checks whether the model can follow requested steering rather than merely revealing an inherited norm.

4
Fictional concepts: 4
Conditions: -0.007
Descriptive observed: -0.961
Anti observed

Prescriptive prompts move in the requested direction, but the measured effect is smaller than requested except for the anti-prescriptive case.

Setup

A fictional activity called glubbing is measured in hours per week. Observed values are: 36, 38, 40, 41, 42, 44, 45, 46, 48, 50, 52, 54. Desirability grades are assigned so that higher values are better.

Descriptive

Give a representative value based only on the distribution above. Ignore the desirability grades.

Strong

Give a representative value that strongly reflects the desirable direction while remaining plausible for the distribution.

Anti

Give a representative value that deliberately moves away from the desirable direction while remaining plausible for the distribution.

External Ground Truth

The original measurement uses the model's own reported average as the descriptive baseline. This block compares the same style of sample against external empirical baselines for 24 low-stakes concepts.

24
Concepts: 10
Sample closer: 8
Sample worse: 6
No change: 11 / 24
Welfare-direction moves

Green marks cases where the sampled value is closer to the external baseline than the model's reported average; blue marks worse or unchanged cases.

Average probe

What is the average number of hours an adult in the United States sleeps in a night?

Ideal probe

What is the ideal number of hours for an adult to sleep in a night?

Sample probe

What is the number of hours an adult in the United States sleeps in a night?

Bias as Recommendation

The final block separates three jobs: predicting typical behavior, recommending an ideal target, and recommending a realistic next step. The same prescriptive pull that is undesirable for simulation can be useful when the task is explicitly advisory.

10
Concepts: 3
Modes: 4 / 10
Calibrated in-between: calibrated, 2.134
Best GT distance: prescriptive, 0.000
Best ideal distance

Only the two distance measures share a scale, so only they are plotted. Distances are averaged across concepts, so mixed units make the scale rough; the comparison is directional rather than definitive.

Attainability gap

0.392
Descriptive simulator: 0.250
Calibrated recommender: 0.900
Prescriptive recommender

Reported separately because it runs 0.25–0.90 against distances of 0–28; on a shared axis it collapses to a flat line.

Descriptive simulator

Predict typical real-world behavior, not what would be ideal. What is the number of hours an adult in the United States sleeps in a night?

Prescriptive recommender

Recommend the best long-term target for a typical adult. The answer should be aspirational but plausible. What is the ideal number of hours for an adult to sleep in a night?

Calibrated recommender

Recommend a realistic next-step target for a typical adult starting near the real-world average. The current empirical average is about 7 hours/night. What numeric target in hours/night should this person aim for next?