Benchmarks

Accuracy against real survey data

Every estimation method HiveSight considers is scored against human survey targets before it ships. This page reports the full comparison — including where our estimator loses. Current suite: four questions from the Federal Reserve's 2024 Survey of Household Economics and Decisionmaking (SHED, n=12,295), national audience, scored on weighted human response shares.
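As a minimal sketch of what "weighted human response shares" means here: each respondent carries a survey weight, and the target is the weighted fraction giving a positive response. The function name and the toy data below are hypothetical and are not drawn from SHED microdata.

```python
from typing import Sequence


def weighted_share(responses: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted percentage of respondents whose response counts as positive."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    positive = sum(w for r, w in zip(responses, weights) if r)
    return 100.0 * positive / total


# Hypothetical toy data (assumption): five respondents with unequal weights.
toy_responses = [True, True, False, True, False]
toy_weights = [1.0, 0.5, 2.0, 1.5, 1.0]
toy_share = weighted_share(toy_responses, toy_weights)
print(f"weighted share: {toy_share:.1f}%")
```

In this toy case the unweighted share is 60% (three of five), while the weighted share is 50%, which is why the targets on this page are computed with weights.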

Registered anchor-bank study — 63 items

Pre-registered by commit before any model runs: 52 GSS 2024 items and 11 SHED 2024 items, scored on weighted human targets for toplines and subgroups. Persona roleplay is not competitive: its MAE is 25.0 pts on both toplines and subgroups, against under 10 pts for every other row in the table below. On gpt-5-mini, cells and direct estimation roughly tie on marginal accuracy (topline MAE 9.2 vs 8.6 pts; subgroup MAE 9.8 pts for both), while cells order subgroups better (median Spearman 0.62 vs 0.48) and stay coherent and composable. Full method, hypotheses, and misses are in the research paper, which renders from these same artifacts.

Method · model	Topline MAE (pts)	Subgroup MAE (pts)	Subgroup rank corr (median Spearman)
Population cells · claude-haiku-4.5 (20 items)	8.4	9.8	0.74
Population cells · gpt-5-mini	9.2	9.8	0.62
Population cells · gpt-5.2 (20 items)	7.4	8.5	0.77
Direct estimate · gpt-5-mini	8.6	9.8	0.48
Persona roleplay · gpt-5-mini (20 items)	25.0	25.0	0.43
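The subgroup rank correlation column can be read as follows: for each item, rank the subgroups by human share and by estimated share, correlate the two rankings (Spearman), then take the median across items. The sketch below is one way to compute that, assuming average ranks for ties; the three toy items are hypothetical and are not study data.

```python
from statistics import median
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical items (assumption): (human shares, estimated shares) per subgroup.
toy_items = [
    ([40.0, 60.0, 80.0, 90.0], [35.0, 70.0, 65.0, 85.0]),
    ([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 41.0]),
    ([50.0, 40.0, 30.0, 20.0], [30.0, 45.0, 25.0, 35.0]),
]
per_item_rho = [spearman(human, est) for human, est in toy_items]
median_rho = median(per_item_rho)
print(per_item_rho, median_rho)
```

A method can have a large level error and still score well here, because only the ordering of subgroups matters for this column.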

Pilot: SHED powered comparison

Method	Topline MAE (pts)	Subgroup MAE (pts)
Direct model estimate	9.4	6.8
Persona roleplay (n=150 per question)	36.6	23.0
HiveSight population cells	17.0	6.2

What to take from this: on subgroups, cell-based estimation is the most accurate method tested (6.2 pts MAE, against 6.8 for a direct model estimate). On national toplines it loses to the direct estimate (17.0 vs 9.4 pts), where the model can lean on memorized aggregates. Persona roleplay, the approach most synthetic-respondent products use, is far behind on both (36.6 and 23.0 pts).
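The topline MAE column in the pilot table is the mean absolute gap between each method's estimate and the human target, averaged over the four questions. The sketch below recomputes it from the estimates and targets listed in the per-question detail further down this page; the short question keys are hypothetical labels introduced only for this example.

```python
from typing import Dict

# Human targets (percent), copied from the per-question detail on this page.
targets: Dict[str, float] = {
    "doing_okay": 72.9,
    "cover_400": 62.7,
    "better_than_year_ago": 47.1,
    "housing_hardship": 18.7,
}

# Topline estimates (percent) per method, copied from the same tables.
estimates: Dict[str, Dict[str, float]] = {
    "Direct model estimate": {
        "doing_okay": 55.0, "cover_400": 64.0,
        "better_than_year_ago": 56.0, "housing_hardship": 28.0,
    },
    "Persona roleplay": {
        "doing_okay": 51.7, "cover_400": 11.4,
        "better_than_year_ago": 30.6, "housing_hardship": 75.9,
    },
    "HiveSight population cells": {
        "doing_okay": 42.3, "cover_400": 41.4,
        "better_than_year_ago": 46.8, "housing_hardship": 34.7,
    },
}


def topline_mae(est: Dict[str, float], tgt: Dict[str, float]) -> float:
    """Mean absolute error in percentage points across questions."""
    return sum(abs(est[q] - tgt[q]) for q in tgt) / len(tgt)


topline = {method: topline_mae(est, targets) for method, est in estimates.items()}
for method, mae in topline.items():
    print(f"{method}: {mae:.2f} pts")
```

Because the inputs here are the published one-decimal figures, the recomputed values match the table only to within rounding.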

The topline gap is a systematic level bias on self-reported wellbeing scales: the model under-rates how positively people describe their own finances, while the subgroup structure is largely correct. A single-parameter calibration fit on these questions did not generalize under leave-one-question-out validation, so no silent correction is applied; results instead carry measured error context. The 63-item anchor bank above is the multi-domain follow-up this pilot called for; the paper carries the full robustness program.
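To illustrate what leave-one-question-out validation of a one-parameter calibration looks like, here is a sketch that assumes the parameter is a single additive offset in percentage points. That form is an assumption made for this example; the page does not state how the calibration was parameterized, and this is not the study's code. The inputs are the population-cell toplines and human targets from the per-question detail on this page.

```python
from statistics import mean
from typing import List, Sequence

# Human targets and population-cell toplines (percent), from this page.
human_targets = [72.9, 62.7, 47.1, 18.7]
cell_estimates = [42.3, 41.4, 46.8, 34.7]


def fit_offset(est: Sequence[float], tgt: Sequence[float]) -> float:
    """Assumed calibration: one additive shift minimizing squared error."""
    return mean(t - e for e, t in zip(est, tgt))


def loo_abs_errors(est: Sequence[float], tgt: Sequence[float]) -> List[float]:
    """Fit the offset on all questions but one, score on the held-out one."""
    errors = []
    for i in range(len(est)):
        train_est = list(est[:i]) + list(est[i + 1:])
        train_tgt = list(tgt[:i]) + list(tgt[i + 1:])
        offset = fit_offset(train_est, train_tgt)
        errors.append(abs(est[i] + offset - tgt[i]))
    return errors


raw_mae = mean(abs(e - t) for e, t in zip(cell_estimates, human_targets))
loo_mae = mean(loo_abs_errors(cell_estimates, human_targets))
print(f"uncalibrated MAE: {raw_mae:.2f} pts, held-out calibrated MAE: {loo_mae:.2f} pts")
```

Under this assumed additive form, held-out error rises from about 17.1 to about 22.5 pts, because the offset learned from three questions has the wrong sign for the housing-hardship item. That is in line with the statement above, but it should be read as an illustration of the validation scheme only.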

Per-question detail

“I am doing okay financially.”

human target 72.9% · scoring positive_agreement · slices by income band

Method	Estimate	Error	Slice errors (pts)
Direct model estimate	55.0%	17.9	<$25k 4.9 · $25k-$74,999 2.1 · $75k-$149,999 4.5 · $150k+ 9.5
Persona roleplay (n=150 per question)	51.7%	21.2	<$25k 21.0 · $25k-$74,999 22.1 · $75k-$149,999 11.7 · $150k+ 10.5
HiveSight population cells	42.3%	30.6	<$25k 3.6 · $25k-$74,999 13.1 · $75k-$149,999 17.8 · $150k+ 7.9

prompt-paraphrase stability: Δ2.6 pts

“I could cover a $400 emergency expense using cash or its equivalent.”

human target 62.7% · scoring positive_agreement · slices by income band

Method	Estimate	Error	Slice errors (pts)
Direct model estimate	64.0%	1.3	<$25k 3.1 · $25k-$74,999 10.3 · $75k-$149,999 9.4 · $150k+ 13.6
Persona roleplay (n=150 per question)	11.4%	51.3	<$25k 20.0 · $25k-$74,999 38.2 · $75k-$149,999 46.1 · $150k+ 16.4
HiveSight population cells	41.4%	21.3	<$25k 10.5 · $25k-$74,999 1.3 · $75k-$149,999 5.1 · $150k+ 1.4

“My finances are better than they were a year ago.”

human target 47.1% · scoring ordered_mean · slices by age band

Method	Estimate	Error	Slice errors (pts)
Direct model estimate	56.0%	8.9	18-29 8.5 · 30-44 8.9 · 45-64 1.6 · 65+ 2.1
Persona roleplay (n=150 per question)	30.6%	16.5	18-29 22.3 · 30-44 3.0 · 45-64 4.1 · 65+ 39.0
HiveSight population cells	46.8%	0.3	18-29 8.2 · 30-44 0.2 · 45-64 1.7 · 65+ 3.1

“Housing costs caused a serious hardship for my household, such as falling behind on rent or mortgage, facing foreclosure or eviction risk, or needing housing assistance.”

human target 18.7% · scoring positive_agreement · slices by income band

Method	Estimate	Error	Slice errors (pts)
Direct model estimate	28.0%	9.3	<$25k 0.4 · $25k-$74,999 1.0 · $75k-$149,999 15.5 · $150k+ 12.6
Persona roleplay (n=150 per question)	75.9%	57.2	<$25k 34.6 · $25k-$74,999 68.9 · $75k-$149,999 4.9 · $150k+ 5.4
HiveSight population cells	34.7%	16.0	<$25k 10.2 · $25k-$74,999 4.5 · $75k-$149,999 6.7 · $150k+ 3.9
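The subgroup MAE figures in the pilot table can be recovered by averaging the slice errors in the four tables above: four slices per question, sixteen values per method. A small sketch, with a hypothetical helper that parses the slice-error strings exactly as they are printed on this page:

```python
from typing import Dict, List


def parse_slice_errors(cell: str) -> List[float]:
    """Turn '<$25k 4.9 · $25k-$74,999 2.1' into [4.9, 2.1]."""
    return [float(part.rsplit(" ", 1)[1]) for part in cell.split(" · ")]


# Slice-error strings copied from the four per-question tables, in page order.
slice_cells: Dict[str, List[str]] = {
    "Direct model estimate": [
        "<$25k 4.9 · $25k-$74,999 2.1 · $75k-$149,999 4.5 · $150k+ 9.5",
        "<$25k 3.1 · $25k-$74,999 10.3 · $75k-$149,999 9.4 · $150k+ 13.6",
        "18-29 8.5 · 30-44 8.9 · 45-64 1.6 · 65+ 2.1",
        "<$25k 0.4 · $25k-$74,999 1.0 · $75k-$149,999 15.5 · $150k+ 12.6",
    ],
    "Persona roleplay": [
        "<$25k 21.0 · $25k-$74,999 22.1 · $75k-$149,999 11.7 · $150k+ 10.5",
        "<$25k 20.0 · $25k-$74,999 38.2 · $75k-$149,999 46.1 · $150k+ 16.4",
        "18-29 22.3 · 30-44 3.0 · 45-64 4.1 · 65+ 39.0",
        "<$25k 34.6 · $25k-$74,999 68.9 · $75k-$149,999 4.9 · $150k+ 5.4",
    ],
    "HiveSight population cells": [
        "<$25k 3.6 · $25k-$74,999 13.1 · $75k-$149,999 17.8 · $150k+ 7.9",
        "<$25k 10.5 · $25k-$74,999 1.3 · $75k-$149,999 5.1 · $150k+ 1.4",
        "18-29 8.2 · 30-44 0.2 · 45-64 1.7 · 65+ 3.1",
        "<$25k 10.2 · $25k-$74,999 4.5 · $75k-$149,999 6.7 · $150k+ 3.9",
    ],
}


def subgroup_mae(cells: List[str]) -> float:
    """Mean of all slice errors across questions, in percentage points."""
    errors = [err for cell in cells for err in parse_slice_errors(cell)]
    return sum(errors) / len(errors)


subgroup = {method: subgroup_mae(cells) for method, cells in slice_cells.items()}
for method, mae in subgroup.items():
    print(f"{method}: {mae:.2f} pts")
```

This pools all sixteen slices with equal weight, which is an assumption about how the summary was aggregated; it agrees with the pilot table to within rounding.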

Caveats

SHED income slices use household income, while population cells band personal earned income, so income-slice errors include a construct mismatch that applies identically across microdata arms. SHED 2024 was published in May 2025 and may appear in model training data; contamination would flatter all arms equally, and the next suite adds post-cutoff questions to test for it. Four questions is a small suite: treat rankings as directional and see the raw artifact for full detail.

model gpt-5-mini · seed 20260707 · 149 cells · generated 2026-07-07