HiveSight benchmark program

A public benchmark plan for direct inference on calibrated local synthetic populations.

The goal is not to prove that LLM surveys work in the abstract. It is to test whether HiveSight's geography-assigned calibrated population outperforms weaker prompting baselines, including rich nonlocal microdata, on stable, recent, human-collected benchmarks.

Updated 2026-04-28

Current status

Benchmark scaffolding is live, the first measured human target snapshot is published for SHED 2024, and a small first-pass synthetic-response comparison now covers the starter SHED household-finance questions. The first expanded run is a useful miss: at n=12, the naive direct estimate had a lower average absolute error than both persona arms, so the next benchmark priority is larger samples and better respondent calibration.

3 starter suites · 5 comparison arms · 5 core metrics

Not another abstract LLM benchmark

The target is not generic plausibility. Each suite tests whether HiveSight's calibrated local population improves on weaker prompting baselines.

Direct inference on a local population

HiveSight does not post-stratify a generic national sample after the fact. It filters the geography-assigned synthetic population first, then simulates.
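The ordering described above (filter by geography first, then simulate) can be sketched as follows. This is a minimal illustration, not HiveSight's actual pipeline: the `Household` record, its fields, and `filter_then_sample` are hypothetical names, and the weights are made up.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    """Hypothetical synthetic record; the real schema is not shown in this document."""
    household_id: int
    state: str
    weight: float


def filter_then_sample(population, state, n, seed):
    """Restrict to the target geography first, then draw a weighted sample to simulate."""
    local = [h for h in population if h.state == state]
    if not local:
        raise ValueError(f"no synthetic households assigned to {state!r}")
    rng = random.Random(seed)
    # Sampling with replacement, proportional to each record's calibration weight.
    return rng.choices(local, weights=[h.weight for h in local], k=n)


population = [
    Household(1, "CA", 1.5),
    Household(2, "TX", 0.8),
    Household(3, "CA", 1.0),
    Household(4, "NY", 1.2),
]
sample = filter_then_sample(population, "CA", n=5, seed=0)
print([h.household_id for h in sample])
```

Every sampled respondent already belongs to the target geography, so no after-the-fact reweighting of a national sample is needed in this sketch.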

Built to surface representativeness gains

Topline accuracy alone is not enough. The important checks are subgroup error, local error, and stability under prompt variation.

First measured result

SHED 2024 human target snapshot

First checked-in benchmark result snapshot: weighted human target means from the normalized SHED 2024 public-use file. This is not yet a synthetic-response accuracy result; it is the measured target table the comparison arms will be scored against.

Rows normalized

12,295

Generated 2026-04-25

I am doing okay financially.

financial_wellbeing · 12,295 respondents · weighted target mean 72.9%

By income: <$25k 39.9% · $25k-$74,999 59.9% · $75k-$149,999 79.5% · $150k+ 89.5%

I could cover a $400 emergency expense using cash or its equivalent.

can_cover_400_expense · 12,295 respondents · weighted target mean 62.7%

By income: <$25k 23.1% · $25k-$74,999 47.7% · $75k-$149,999 69.4% · $150k+ 83.6%

My finances are better than they were a year ago.

financial_change_vs_last_year · 12,295 respondents · weighted target mean 47.1%

By age: 18-29 50.5% · 30-44 47.1% · 45-64 44.4% · 65+ 48.1%

Housing costs caused a serious hardship for my household, such as falling behind on rent or mortgage, facing foreclosure or eviction risk, or needing housing assistance.

housing_cost_stress · 12,295 respondents · weighted target mean 18.7%

By income: <$25k 48.4% · $25k-$74,999 29.0% · $75k-$149,999 12.5% · $150k+ 5.4%
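For readers unfamiliar with weighted target means, the sketch below shows one way such a figure and its subgroup breakdown could be computed from respondent rows. The rows and weights are invented for illustration; they are not SHED data, and the function name is hypothetical.

```python
from collections import defaultdict


def weighted_shares(rows):
    """Weighted share agreeing, per group.

    rows: iterable of (group, agrees, weight) tuples, where `agrees` is a bool
    and `weight` is the respondent's survey weight.
    """
    agree_weight = defaultdict(float)
    total_weight = defaultdict(float)
    for group, agrees, weight in rows:
        total_weight[group] += weight
        if agrees:
            agree_weight[group] += weight
    return {g: agree_weight[g] / total_weight[g] for g in total_weight if total_weight[g] > 0}


# Illustrative rows only (not real respondents).
rows = [
    ("<$25k", True, 1.0),
    ("<$25k", False, 3.0),
    ("$150k+", True, 2.0),
    ("$150k+", False, 0.5),
]

by_income = weighted_shares(rows)
# Topline: same computation with every row placed in a single group.
overall = weighted_shares([("all", agrees, w) for _, agrees, w in rows])["all"]
print(by_income, round(overall, 4))
```

The topline is not the simple average of the subgroup shares; each subgroup contributes in proportion to its total weight.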

Small model comparison

First synthetic-response pass against SHED targets

Small first-pass model comparison on 4 SHED 2024 household-finance questions. The run is intentionally directional: 12 simulated respondents per persona arm, one naive direct estimate per question, and calibrated microdata sampled from CA, TX, NY, and FL rather than the full national file. Average absolute errors in this run: naive direct estimate 6.9 pts, basic persona 8.6 pts, HiveSight microdata 13.9 pts. The HiveSight microdata arm has the highest error of the three and does not yet beat the naive direct estimate, which has the lowest.

Execution

gpt-5-mini

seed 20260426 · n=12

4 questions · 100 model calls

Microdata states: CA, TX, NY, FL
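A rough sense of why n=12 is only directional: the sketch below computes the textbook standard error of a proportion, assuming independent draws under simple random sampling (an assumption that simulated respondents may not satisfy). It also checks that the reported 100 model calls is consistent with two respondent-level arms of 12 plus one naive call per question; that breakdown is an inference from the description, not a documented fact.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Call-count check (assumed breakdown: basic persona + HiveSight microdata arms).
questions = 4
respondents_per_arm = 12
respondent_level_arms = 2
naive_calls_per_question = 1
total_calls = questions * (respondents_per_arm * respondent_level_arms + naive_calls_per_question)

se_12 = proportion_standard_error(0.5, 12)      # worst case p = 0.5
se_1200 = proportion_standard_error(0.5, 1200)  # same share, 100x the sample
print(total_calls, f"{100 * se_12:.1f} pts", f"{100 * se_1200:.1f} pts")
```

At n=12 the sampling noise on a single share is on the order of 14 points, comparable in size to the errors reported for the persona arms, which is why larger samples are the stated next priority.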

Naive LLM direct estimate: 6.9 pts average absolute error

Basic demographic persona: 8.6 pts average absolute error

HiveSight microdata prompt: 13.9 pts average absolute error

I am doing okay financially.

Human target: 72.9%

Naive LLM direct estimate: 55.0% · absolute error 17.9 pts
Basic demographic persona: 75.0% · absolute error 2.1 pts
HiveSight microdata prompt: 58.3% · absolute error 14.6 pts

I could cover a $400 emergency expense using cash or its equivalent.

Human target: 62.7%

Naive LLM direct estimate: 60.0% · absolute error 2.7 pts
Basic demographic persona: 66.7% · absolute error 4.0 pts
HiveSight microdata prompt: 50.0% · absolute error 12.7 pts

My finances are better than they were a year ago.

Human target: 47.1%

Naive LLM direct estimate: 43.0% · absolute error 4.1 pts
Basic demographic persona: 41.7% · absolute error 5.4 pts
HiveSight microdata prompt: 33.3% · absolute error 13.8 pts

Housing costs caused a serious hardship for my household, such as falling behind on rent or mortgage, facing foreclosure or eviction risk, or needing housing assistance.

Human target: 18.7%

Naive LLM direct estimate: 16.0% · absolute error 2.7 pts
Basic demographic persona: 41.7% · absolute error 23.0 pts
HiveSight microdata prompt: 33.3% · absolute error 14.6 pts
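The per-arm averages can be reproduced from the per-question figures above. The sketch below uses only numbers published on this page; the dictionary keys for the arms are shorthand chosen here, not identifiers from the benchmark code.

```python
# Human targets (weighted SHED 2024 means, in percent).
targets = {
    "financial_wellbeing": 72.9,
    "can_cover_400_expense": 62.7,
    "financial_change_vs_last_year": 47.1,
    "housing_cost_stress": 18.7,
}

# Synthetic estimates per arm (percent), in the same question order.
estimates = {
    "naive_direct": dict(zip(targets, [55.0, 60.0, 43.0, 16.0])),
    "basic_persona": dict(zip(targets, [75.0, 66.7, 41.7, 41.7])),
    "hivesight_microdata": dict(zip(targets, [58.3, 50.0, 33.3, 33.3])),
}


def mean_abs_error(arm_estimates, human_targets):
    """Average of |estimate - target| across questions, in percentage points."""
    errors = [abs(arm_estimates[q] - human_targets[q]) for q in human_targets]
    return sum(errors) / len(errors)


average_error = {arm: mean_abs_error(est, targets) for arm, est in estimates.items()}
best_arm = min(average_error, key=average_error.get)

for arm, err in average_error.items():
    print(f"{arm}: {err:.3f} pts")
print("lowest error:", best_arm)
```

The unrounded averages come out at about 6.85, 8.63, and 13.93 points, matching the reported 6.9, 8.6, and 13.9 after rounding to one decimal.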

Comparison arms

Generic persona prompting

Sparse synthetic personas with no explicit local population conditioning.

Geography-only prompting

Audience prompts that know the place but do not use richer local household records.

Rich nonlocal microdata

Richer household and policy-linked records drawn from calibrated microdata without geography-specific local assignment.

HiveSight local population

Direct inference over geography-assigned calibrated synthetic microdata.

Fallback profiles

The lightweight fallback population used when calibrated local microdata is unavailable.

Core metrics

Topline MAE

Absolute error on overall support or preference levels.

Subgroup MAE

Absolute error on slices like age, sex, income, tenure, and race/ethnicity where the benchmark supports them.

Local MAE

Absolute error across state, district, or other available geography cuts.

Rank accuracy

How often models preserve the ordering between answer options, messages, or geographies.

Prompt stability

How sensitive results are to small wording and formatting changes.
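The metric descriptions above leave the exact formulas open. The sketch below shows one plausible reading of three of them: MAE over subgroup cells, rank accuracy as the share of pairs whose ordering matches the target, and prompt stability as the spread of an estimate across prompt variants. The function names and formulas are assumptions, the subgroup targets are the SHED income breakdown for financial_wellbeing from this page, and the estimates and variant values are invented.

```python
from itertools import combinations
from statistics import pstdev


def mean_abs_error(estimates, targets):
    """Mean absolute error across cells (questions, subgroups, or geographies)."""
    return sum(abs(estimates[k] - targets[k]) for k in targets) / len(targets)


def rank_accuracy(estimates, targets):
    """Share of cell pairs ordered the same way in estimates and targets.

    Ties on either side count as misses in this sketch.
    """
    pairs = list(combinations(targets, 2))
    if not pairs:
        raise ValueError("need at least two cells to compare orderings")
    concordant = sum(
        1 for a, b in pairs
        if (estimates[a] - estimates[b]) * (targets[a] - targets[b]) > 0
    )
    return concordant / len(pairs)


def prompt_spread(variant_estimates):
    """Population standard deviation of one estimate across prompt variants."""
    return pstdev(variant_estimates)


targets = {"<$25k": 39.9, "$25k-$74,999": 59.9, "$75k-$149,999": 79.5, "$150k+": 89.5}
estimates = {"<$25k": 45.0, "$25k-$74,999": 65.0, "$75k-$149,999": 60.0, "$150k+": 85.0}  # hypothetical

subgroup_mae = mean_abs_error(estimates, targets)
ordering = rank_accuracy(estimates, targets)
spread = prompt_spread([58.0, 60.0, 62.0])  # hypothetical toplines from three wordings
print(f"{subgroup_mae:.2f} pts, {ordering:.3f}, {spread:.2f} pts")
```

In this made-up example the two middle income bands are swapped, so five of the six pairs keep the target ordering even though the subgroup MAE is dominated by one badly missed cell. That is the kind of difference the separate metrics are meant to expose.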

Initial suites

Recent, stable, public benchmarks

These first suites bias toward evergreen attitudes, household economics, and consumer behavior rather than election swings or weekly headline cycles.

GSS 2024 evergreen attitudes

planned

Stable social attitudes with low week-to-week news sensitivity.

Why this suite

This is the best broad benchmark for trust, happiness, redistribution, family, speech, and other durable dispositions.

Starter manifest

4 starter questions wired to benchmark fields from GSS 2024 Cross-section.

Source freshness

GSS 2024 Cross-section was fielded in 2024 and released in May 2025. That is fresh enough to reduce training-set contamination risk while still covering evergreen attitudes.

Question families

General trust and fairness

Work, redistribution, and role-of-government attitudes

Family, religion, and speech norms

Evaluation focus

Topline MAE · Subgroup MAE · Prompt stability

Start with questions that do not depend on a specific political controversy or headline cycle.

Starter prompts

Most people can be trusted.

TRUST · trust_general

Classic evergreen trust item with low dependence on the weekly news cycle.

Most people are helpful rather than mostly looking out for themselves.

HELPFUL · helpfulness_general

General social perception item that should expose broad worldview differences without headline dependence.

Most people try to be fair.

FAIR · fairness_general

Pairs naturally with trust and helpfulness as a durable social-attitudes battery.

The government should reduce income differences between rich and poor.

EQWLTH · redistribution_support

Good test of whether richer household context improves class- and income-linked attitudes.

SHED 2024 household finance

in progress

Economic perceptions and household stress that map to real disposable-income context.

Why this suite

These questions should benefit from HiveSight's richer household and policy-linked attributes rather than pure demographics.

Starter manifest

4 starter questions wired to benchmark fields from SHED 2024.

Source freshness

Survey of Household Economics and Decisionmaking 2024 was fielded in October 2024 and released on May 28, 2025. It is recent enough to be useful, but centered on durable household economics rather than campaign events.

Question families

Financial well-being

Ability to absorb shocks

Housing, debt, and retirement expectations

Evaluation focus

Topline MAE · Subgroup MAE · Rank accuracy · Prompt stability

This suite is especially relevant to the claim that policy-linked household records matter for marketing and consumer research.

Starter prompts

I am doing okay financially.

FINWELL · financial_wellbeing

Core household-finance benchmark that should benefit from taxes, transfers, tenure, and family context.

I could cover a $400 emergency expense using cash or its equivalent.

EMERGCASH · can_cover_400_expense

Directly tests disposable-income realism and economic fragility rather than generic ideology.

My finances are better than they were a year ago.

FINCHG · financial_change_vs_last_year

Useful for directional economic sentiment without tying the benchmark to campaign news.

Housing costs caused a serious hardship for my household, such as falling behind on rent or mortgage, facing foreclosure or eviction risk, or needing housing assistance.

HOUSCOST · housing_cost_stress

A natural test of the claim that local policy-linked household attributes matter for marketing and consumer inference.
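A starter manifest like the one above could be represented as a small list of typed records. This is a sketch only: the `ManifestQuestion` class and `by_field` helper are hypothetical, and the source labels are copied from this page rather than checked against the SHED codebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestQuestion:
    """One benchmark question: source label, normalized field, and prompt wording."""
    source_label: str
    field: str
    prompt: str


SHED_2024_STARTER = (
    ManifestQuestion("FINWELL", "financial_wellbeing", "I am doing okay financially."),
    ManifestQuestion(
        "EMERGCASH",
        "can_cover_400_expense",
        "I could cover a $400 emergency expense using cash or its equivalent.",
    ),
    ManifestQuestion(
        "FINCHG",
        "financial_change_vs_last_year",
        "My finances are better than they were a year ago.",
    ),
    ManifestQuestion(
        "HOUSCOST",
        "housing_cost_stress",
        "Housing costs caused a serious hardship for my household, such as falling "
        "behind on rent or mortgage, facing foreclosure or eviction risk, or needing "
        "housing assistance.",
    ),
)


def by_field(manifest):
    """Index questions by benchmark field, rejecting duplicate field names."""
    index = {q.field: q for q in manifest}
    if len(index) != len(manifest):
        raise ValueError("duplicate benchmark field in manifest")
    return index


index = by_field(SHED_2024_STARTER)
print(index["can_cover_400_expense"].source_label)
```

Keeping the prompt wording next to the field name means every comparison arm is scored against the same statement that the human target was normalized to.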

SDCPC 2024 consumer choice

planned

Stable consumer preference and payment-choice behavior outside politics.

Why this suite

This is the cleanest public non-political consumer benchmark available without paying for new fieldwork.

Starter manifest

3 starter questions wired to benchmark fields from Survey and Diary of Consumer Payment Choice 2024.

Source freshness

Survey and Diary of Consumer Payment Choice 2024 was fielded in 2024 and released in May 2025. These are recent public-use files with topics that are much less exposed to weekly media swings than election sentiment.

Question families

Payment preference

Convenience and security tradeoffs

Consumer behavior by income and household characteristics

Evaluation focus

Topline MAE · Subgroup MAE · Rank accuracy

Use this suite to pressure-test the non-civics positioning and the claim that local household calibration matters in consumer settings too.

Starter prompts

Credit cards are my preferred way to pay for purchases.

PREFCREDIT · prefers_credit

Consumer-choice item that should reveal whether HiveSight generalizes beyond civics into behavior and preference.

Cash is a convenient payment method for my day-to-day life.

CONVCASH · cash_convenience

Stable, non-political preference with meaningful subgroup structure.

Security matters more than convenience when I choose how to pay.

PAYSEC · security_priority

Tests consumer tradeoff judgments instead of pure opinion statements.

Roadmap

Now

Lock the benchmark schema, page, and CLI so the methodology has one source of truth.

Next

Add the first reproducible dataset adapters and question manifests for GSS 2024 and SHED 2024.

Later

Publish benchmark runs comparing generic personas, geography-only prompts, rich nonlocal microdata, HiveSight local populations, and fallback profiles.