
LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings (2510.08338v1)

Published 9 Oct 2025 in cs.AI

Abstract: Consumer research costs companies billions annually yet suffers from panel biases and limited scale. LLMs offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.

Summary

  • The paper presents the SSR method that maps free-text responses from LLMs to Likert-scale distributions, reaching 90% of human test–retest reliability.
  • It demonstrates that SSR outperforms direct and follow-up Likert rating approaches with high distributional similarity (KS > 0.85) and robust demographic alignment.
  • The approach provides scalable, interpretable synthetic consumer surveys that capture both quantitative fidelity and qualitative insights for market research.

Semantic Similarity Elicitation Enables LLMs to Reproduce Human Purchase Intent Distributions

Introduction

The paper "LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings" (2510.08338) addresses a critical limitation in the application of LLMs for synthetic consumer research: the inability of LLMs to generate realistic Likert-scale response distributions when directly prompted for numerical ratings. The authors introduce the semantic similarity rating (SSR) method, which leverages free-text responses from LLMs and maps them to Likert-scale distributions using embedding-based semantic similarity to reference anchor statements. This approach is evaluated on a substantial dataset of 57 personal care product surveys (9,300 human responses), demonstrating that SSR achieves 90% of human test–retest reliability and high distributional similarity (KS similarity > 0.85) to real survey data.

Methodology

Synthetic Consumer Construction and Response Elicitation

Synthetic consumers are instantiated by prompting LLMs (GPT-4o and Gemini-2.0-flash) with demographic attributes and product concept stimuli (text and/or image). Three response generation strategies are compared:

  1. Direct Likert Rating (DLR): LLMs are asked to respond with an integer (1–5) directly.
  2. Follow-up Likert Rating (FLR): LLMs first generate a free-text statement, which is then mapped to a Likert score by a second LLM instance acting as a "Likert rating expert."
  3. Semantic Similarity Rating (SSR): Free-text responses are embedded and compared to five reference anchor statements (one per Likert category) using cosine similarity. The resulting similarities are normalized to produce a probability mass function (pmf) over the Likert scale. A minimal elicitation sketch follows Figure 1.

Figure 1: Overview of response generation procedures and SSR mapping, illustrating the construction of synthetic consumers and the embedding-based mapping to Likert distributions.
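As an illustration of the elicitation step, the sketch below builds a persona-conditioned prompt and requests a short free-text purchase-intent response. It is a minimal sketch only: the persona fields, prompt wording, and use of the openai Python package (v1+) with GPT-4o are illustrative assumptions, not the authors' exact prompts.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical persona attributes and product concept (not from the paper)
persona = {"age": 34, "gender": "female", "income": "$50k-$75k", "region": "Midwest"}
concept = "A sulfate-free shampoo with biodegradable packaging, priced at $8.99."

system_prompt = (
    f"You are a {persona['age']}-year-old {persona['gender']} consumer from the "
    f"{persona['region']} with a household income of {persona['income']}. "
    "Answer survey questions in character, in one or two sentences."
)
user_prompt = (
    "Product concept: " + concept + "\n"
    "How likely are you to purchase this product? Explain briefly in your own words."
)

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
free_text_response = completion.choices[0].message.content
# For SSR, this free-text response is then mapped to a Likert pmf
# (see the implementation in the "SSR Algorithm" section below).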

Success Metrics

Two primary metrics are used to evaluate synthetic panels:

  • Distributional Similarity: Kolmogorov–Smirnov (KS) similarity between synthetic and real Likert distributions.
  • Correlation Attainment (ρ): Pearson correlation between mean purchase intents of synthetic and real surveys, normalized by the maximum attainable correlation implied by human test–retest reliability (a minimal sketch of both metrics follows this list).
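To make these definitions concrete, here is a minimal sketch following the descriptions above and in the glossary: KS similarity as 1 minus the maximum distance between the two CDFs, and correlation attainment as the Pearson correlation of survey-level mean purchase intents divided by the human test–retest correlation. The paper computes the attainment ratio from expectations over resampled human panels; the single-ratio version and all numbers below are illustrative simplifications.

import numpy as np

def ks_similarity(pmf_synthetic, pmf_real):
    """1 minus the Kolmogorov-Smirnov distance between two Likert pmfs."""
    cdf_s, cdf_r = np.cumsum(pmf_synthetic), np.cumsum(pmf_real)
    return 1.0 - np.max(np.abs(cdf_s - cdf_r))

def correlation_attainment(mean_pi_synthetic, mean_pi_real, test_retest_r):
    """Pearson correlation of survey-level means, normalized by test-retest reliability."""
    r_xy = np.corrcoef(mean_pi_synthetic, mean_pi_real)[0, 1]
    return r_xy / test_retest_r

# Illustrative (made-up) values: per-survey Likert pmfs and survey-level mean ratings.
print(ks_similarity([0.05, 0.15, 0.30, 0.35, 0.15], [0.04, 0.16, 0.28, 0.37, 0.15]))
print(correlation_attainment([3.1, 3.8, 2.9, 4.0], [3.2, 3.7, 3.0, 4.1], test_retest_r=0.9))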

Results

Direct Likert Rating Baseline

DLR yields high correlation attainment (ρ ≈ 80%) but poor distributional similarity (KS similarity: 0.26 for GPT-4o, 0.39 for Gemini-2.0-Flash, hereafter Gem-2f). LLMs predominantly regress to the center of the scale (response '3'), rarely producing extreme ratings ('1' or '5'), resulting in unrealistically narrow distributions.

Figure 2: Comparison of real and synthetic survey distributions for GPT-4o, showing the limited dynamic range of DLR responses.

Figure 3: KS similarity comparison for GPT-4o, highlighting the distributional mismatch of DLR versus SSR and FLR.

SSR and FLR Performance

SSR markedly improves both metrics: for GPT-4o, SSR achieves ρ = 90% with a KS similarity of 0.88; for Gem-2f, ρ = 90% with a KS similarity of 0.80. FLR also improves over DLR but is consistently outperformed by SSR in distributional similarity. SSR-generated distributions closely match human data, and product concept rankings by mean purchase intent are robust.

Figure 4: Gem-2f results showing SSR and FLR distributions compared to real data, with SSR achieving superior alignment.

Figure 5: KS similarity for Gem-2f, demonstrating SSR's advantage in distributional fidelity.

Demographic and Product Feature Conditioning

SSR enables LLMs to replicate demographic and product feature effects observed in human data. For example, mean purchase intent exhibits a concave relationship with age, and income-level conditioning produces lower purchase intent for budget-constrained personas. Product category and price tier effects are also mirrored.

Figure 6: Mean purchase intent stratified by demographic and product features, showing SSR's ability to capture nuanced subgroup effects.

Figure 7: Gender and region stratification, indicating weaker alignment for these features but overall low influence on purchase intent.

Qualitative Feedback

SSR preserves the richness of free-text responses, enabling qualitative analysis. Synthetic consumers provide detailed rationales for their ratings, often surpassing the depth of human survey responses. This qualitative data can be mined for actionable insights in product development.

Generalization and Ablation

SSR generalizes to other Likert-based constructs (e.g., concept relevance) with high correlation attainment and distributional similarity. Removing demographic conditioning increases distributional similarity but reduces correlation attainment, indicating that persona prompts are essential for meaningful product differentiation.

Comparison to Supervised ML

A LightGBM classifier trained on demographic and product features achieves lower correlation attainment (ρ = 65%) than SSR (ρ = 88%), despite moderate distributional similarity. This demonstrates the advantage of zero-shot LLM elicitation for capturing human-like response behavior without task-specific training data.
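For context, a minimal sketch of such a supervised baseline on toy tabular data is shown below. The feature names and values are hypothetical, and the configuration is not the authors' exact setup, which was trained on the full survey dataset:

import lightgbm as lgb
import pandas as pd

# Toy data with hypothetical demographic and product features.
train = pd.DataFrame({
    "age":             [24, 37, 52, 61, 29, 45, 33, 58],
    "income_bracket":  [1, 2, 3, 2, 1, 3, 2, 2],
    "price_tier":      [0, 1, 2, 1, 0, 2, 1, 1],
    "category_id":     [3, 3, 1, 2, 3, 1, 2, 2],
    "purchase_intent": [4, 3, 2, 5, 4, 2, 3, 3],  # 1-5 Likert label
})

X = train.drop(columns=["purchase_intent"])
y = train["purchase_intent"]

# Multiclass classifier over the observed Likert categories.
clf = lgb.LGBMClassifier(min_child_samples=1)  # relaxed only for this toy dataset
clf.fit(X, y)

# predict_proba yields a per-respondent pmf over the categories in clf.classes_.
pmf = clf.predict_proba(X)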

Implementation Details

SSR Algorithm

The SSR mapping is implemented as follows:

import numpy as np

def ssr(response_text, reference_texts, embedding_model, temperature=1.0, epsilon=0.0):
    """Map a free-text response to a pmf over the Likert scale via anchor similarity."""
    # Embed the response and the reference (anchor) statements
    response_vec = np.asarray(embedding_model.embed(response_text), dtype=float)
    ref_vecs = [np.asarray(embedding_model.embed(ref), dtype=float) for ref in reference_texts]
    # Compute cosine similarity between the response and each anchor
    sims = np.array([
        np.dot(response_vec, ref_vec) / (np.linalg.norm(response_vec) * np.linalg.norm(ref_vec))
        for ref_vec in ref_vecs
    ])
    # Shift so the least-similar anchor has similarity zero
    sims -= sims.min()
    # Add epsilon (> 0) so no Likert category receives exactly zero probability
    sims += epsilon
    # Temperature scaling: lower temperature sharpens the distribution
    probs = sims ** (1.0 / temperature)
    # Normalize to obtain the probability mass function over the Likert scale
    pmf = probs / probs.sum()
    return pmf

Reference anchor sets should be carefully constructed to span the semantic range of Likert categories. Averaging over multiple anchor sets mitigates mapping variance.
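As a usage illustration, the following sketch averages the resulting pmf over two anchor sets to mitigate mapping variance. The anchor wordings are hypothetical (not the paper's exact anchors), and an embedding_model object exposing the embed method required by ssr above is assumed:

import numpy as np

# Hypothetical anchor sets (illustrative wordings, one statement per Likert point 1-5).
anchor_sets = [
    ["I would definitely not buy this.", "I probably would not buy this.",
     "I might or might not buy this.", "I would probably buy this.",
     "I would definitely buy this."],
    ["There is no chance I would purchase this.", "It is unlikely I would purchase this.",
     "I am unsure whether I would purchase this.", "It is likely I would purchase this.",
     "I am certain I would purchase this."],
]

response = "The scent sounds nice, but the price feels a bit high for me."

# Average the pmf over anchor sets to reduce sensitivity to any single wording.
pmfs = [ssr(response, anchors, embedding_model, temperature=1.0, epsilon=0.01)
        for anchors in anchor_sets]
mean_pmf = np.mean(pmfs, axis=0)

# Expected (mean) purchase intent implied by the averaged pmf.
expected_rating = float(np.dot(mean_pmf, np.arange(1, 6)))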

Resource and Scaling Considerations

SSR requires access to a high-quality embedding model (e.g., OpenAI's text-embedding-3-small). The computational cost is dominated by embedding inference, which is tractable for survey-scale applications. SSR is model-agnostic and does not require fine-tuning, making it suitable for rapid deployment and scaling across domains with sufficient LLM training coverage.
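For concreteness, one possible way to expose such an embedding model through the embed interface assumed by ssr above is a thin wrapper around the OpenAI Python SDK (v1+); the class name and defaults below are illustrative assumptions, not part of the paper:

from openai import OpenAI

class OpenAIEmbedder:
    """Thin wrapper exposing the embed() interface assumed by ssr()."""

    def __init__(self, model="text-embedding-3-small"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def embed(self, text):
        result = self.client.embeddings.create(model=self.model, input=text)
        return result.data[0].embedding

embedding_model = OpenAIEmbedder()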

Trade-offs and Limitations

  • Reference Set Design: Manual optimization of anchor statements is required; dynamic or LLM-generated anchors may further improve alignment.
  • Demographic Conditioning: SSR captures some but not all subgroup effects; caution is warranted in interpreting synthetic subgroup analyses.
  • Domain Coverage: SSR's validity depends on LLM exposure to relevant product categories in training data.
  • Embedding Model Choice: Alternative embedding spaces may yield improved results; benchmarking is recommended for new domains.

Implications and Future Directions

SSR establishes a robust framework for synthetic consumer research, enabling scalable, interpretable, and cost-effective simulation of human survey panels. The method preserves both quantitative and qualitative fidelity, facilitating richer product concept evaluation. Potential extensions include:

  • Generalization to other survey constructs (e.g., satisfaction, trust)
  • Automated optimization of SSR parameters (temperature, epsilon)
  • Multi-stage LLM pipelines for enhanced interpretability
  • Hybrid approaches combining SSR with light fine-tuning or calibration

SSR can accelerate early-stage product screening, reduce research costs, and democratize access to consumer insights. However, it should be viewed as an augmentation rather than a replacement for human panels, especially in domains with limited LLM training data.

Conclusion

Semantic similarity elicitation via SSR enables LLMs to reproduce human purchase intent distributions and product rankings with high fidelity, overcoming the limitations of direct Likert-scale elicitation. The approach is computationally efficient, interpretable, and preserves qualitative richness, making it a valuable tool for synthetic consumer research. While further optimization and validation are warranted, SSR represents a significant advance in the practical application of LLMs for survey simulation and market research.


Explain it Like I'm 14

What is this paper about?

This paper shows a new way to use LLMs to act like “synthetic consumers” in product surveys. Instead of asking the AI to pick a number on a 1–5 scale (like humans do), the authors ask it to explain its opinion in words, then convert that text into a 1–5 score using a smart matching trick. This method makes the AI’s answers look and behave much more like real human survey results.

What questions did the researchers ask?

They focused on a few simple, big questions:

  • Can AI “pretend” to be groups of consumers and rate products in a way that matches real people?
  • Why do AIs do badly when we force them to choose numbers (1–5) directly?
  • If we let AIs write short answers first, then translate those answers into 1–5 ratings, do we get more human-like results?
  • Can this approach keep the good parts of surveys (like rankings of best products) and also give helpful written comments?
  • Do AI “personas” (like age or income) change ratings in ways that mirror real human differences?

How did they do the research?

First, here’s the real-world setup:

  • The team used data from 57 real product surveys about personal care items (like toothpaste or shampoo).
  • About 9,300 people took these surveys in total.
  • Everyone rated “purchase intent” (how likely they were to buy) on a 1–5 scale (1 = “definitely not,” 5 = “definitely yes”).

Then they created “synthetic consumers” using LLMs (two well-known models). Each synthetic consumer got a short persona (like age or income), saw the same product description (often with an image), and answered the same question: “How likely are you to purchase this?”

They tested three ways to get ratings from the AI:

  • Direct Likert Rating (DLR): Ask for a single number (1–5).
  • Follow-up Likert Rating (FLR): First the AI writes a short opinion, then another prompt asks it to convert that text to 1–5.
  • Semantic Similarity Rating (SSR): The AI writes a short opinion. That text is then converted to a 1–5 score using “semantic similarity.”

What is “semantic similarity”? Think of it as measuring how close two sentences are in meaning. The method works like this:

  • The researchers prepared five short “anchor” sentences, one for each rating (1–5). Example: “I will definitely buy this” (5), “I probably won’t buy this” (2), etc.
  • They turned both the AI’s text and the anchor sentences into “embeddings,” which you can imagine as meaning fingerprints: numbers that capture the meaning of a sentence.
  • They then measured how similar the AI’s text fingerprint was to each anchor fingerprint (using a standard measure called cosine similarity).
  • The closer it was to, say, the “definitely buy” anchor, the higher the odds of a 5. This gives a full probability distribution across 1–5, not just a single number.

How did they check if it worked?

  • Distribution similarity: Do the shapes of the AI’s rating distributions match humans’? (They used a statistic called KS similarity; higher means more similar.)
  • Product rankings: Do the AI’s average ratings put the products in about the same order as humans do? They used “correlation attainment,” which is like asking: “How close are we to the best possible match, given that even two human groups won’t agree perfectly?” Hitting 100% would mean as good as human–human agreement.

What did they find?

  • Direct numbers don’t work well. When forced to pick 1–5 directly, AIs usually choose the safe middle (3) too often. That makes the distribution unrealistic, even if the average product ranking is okay.
  • Letting the AI talk first helps a lot. Both FLR and SSR improved results, but SSR was best overall.
  • SSR produced human-like distributions and strong rankings. It matched 90% of the “best possible” human-level agreement (correlation attainment ≈ 90%) and had high distribution similarity (KS similarity > 0.85). In plain terms: the AI’s answers looked like real survey data and ranked products much like humans do.
  • Personas matter. When the AI was asked to “be” certain types of people (like different ages or income levels), its ratings shifted in ways similar to how real humans differ. Age and income patterns were especially well-captured.
  • But don’t remove personas entirely. If the AI wasn’t given any demographic persona, its ratings looked superficially similar to human distributions but did a worse job ranking which products were better—so the results were less useful.
  • Extra bonus: rich comments. Because SSR starts with a short written answer, companies also get clear reasons: what people liked, what worried them, and what could be improved.
  • It generalizes to other questions. The same method worked reasonably well for a different survey question (“How relevant was the concept?”).
  • It beat a trained machine learning baseline. SSR (and even FLR) outperformed a traditional model trained on the survey data for ranking products—despite the LLMs using no special training on this dataset.

Why does it matter?

This approach could make product testing faster, cheaper, and more detailed:

  • Companies can screen many early ideas with synthetic surveys, then spend money on human studies for the most promising ones.
  • The method keeps familiar metrics (1–5 ratings and averages) but adds richer written feedback.
  • It’s “plug-and-play”: no costly fine-tuning needed.

At the same time, there are important cautions:

  • The quality depends on good “anchor” sentences and the text embedding model used.
  • Not all demographic patterns are perfectly captured (some, like gender or region, were less consistent).
  • LLMs work best in domains they “know” from training (like consumer products). In niche areas, results may be weaker.
  • Synthetic consumers should complement, not fully replace, real people—especially for final decisions or sensitive subgroups.

In short: By asking AIs to explain themselves first and then translating those explanations into 1–5 scores using semantic similarity, this paper gets AI survey results that look and behave a lot more like the real thing—bringing speed, scale, and useful insights to early-stage consumer research.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, articulated so future researchers can act on each item:

  • External validity to market behavior: Assess whether SSR-based rankings predict real-world outcomes (e.g., in-market sales, simulated test markets) rather than only reproducing human survey panels.
  • Domain generalization: Replicate SSR across product categories (e.g., food, durables, services), brands, and decision contexts (e.g., risk, credence goods) to test robustness beyond personal care.
  • Cross-cultural and multilingual validity: Evaluate performance in non-U.S. populations and other languages; test whether anchors and embeddings transfer across cultures and linguistic contexts.
  • Anchor sensitivity and optimization: Quantify how results vary with different anchor statements (content, wording, number, polarity), and develop systematic anchor search/optimization (e.g., automated anchor generation, data-driven calibration on held-out surveys).
  • Risk of overfitting anchors to the studied corpus: The anchor sets were manually optimized on the 57 surveys; test out-of-corpus performance and establish protocols to prevent anchor overfitting.
  • Mapping function design: Compare cosine-similarity normalization to alternative mappings (e.g., softmax with temperature, ordinal regression, IRT-based mappings, kernel methods) and assess calibration, monotonicity, and uncertainty properties.
  • Embedding model dependence: Benchmark different embedding models (general vs domain-specific, multilingual, open-source vs closed) and similarity metrics (cosine vs alternatives), including sensitivity to embedding drift over time.
  • Model and version drift reproducibility: Track performance stability across LLM/embedding model updates, document versioning effects, and propose procedures for periodic re-calibration.
  • Persona construction and ablations: Systematically ablate and test which demographic or psychographic attributes drive alignment (age, income, gender, region, values); explore the utility of attitudinal/behavioral personas beyond demographics.
  • Subgroup fidelity and fairness: Rigorously evaluate subgroup validity (e.g., gender, region, ethnicity) where replication was weaker; test for bias propagation, stereotype amplification, and disparate error rates.
  • Causal responsiveness to controlled manipulations: Validate that SSR responds correctly (direction and magnitude) to experimental changes in concept attributes (e.g., price, claims, format) consistent with known human causal effects.
  • Panel composition and weighting: Study how to synthesize respondents without mapping 1:1 to human participants; evaluate stratified sampling and post-stratification weighting to match target populations.
  • Sample size and stability: Determine the number of LLM samples per persona needed for stable distributions and rankings; quantify variance reduction vs cost trade-offs.
  • Metric choice for ordinal distributions: Replace or complement KS similarity with ordinal-aware distances (e.g., Earth Mover’s/Wasserstein, Cramér–von Mises for ordinal data) and report sensitivity to metric choice.
  • Multi-item scales and latent constructs: Extend beyond single-item PI to multi-item scales (e.g., satisfaction, trust) and test internal consistency, factor structure, and convergent/discriminant validity.
  • Generalization to other question types: Systematically evaluate SSR on binary, multiple-choice, continuous, and open-ended coding tasks beyond PI and “relevance.”
  • Vision vs text-only stimuli: More deeply quantify the added value and failure modes of multimodal stimuli (vision models’ comprehension limits, artifacts) relative to text-only descriptions.
  • Data contamination and brand familiarity: Test whether performance depends on LLM pretraining exposure to specific brands/categories; construct contamination-controlled benchmarks with synthetic or obfuscated brands.
  • Stronger baselines: Compare SSR to advanced supervised baselines (e.g., ordinal regression with textual features, SBERT/LLM-encoder + calibrated classifiers, fine-tuned small LMs) under strict out-of-sample protocols.
  • Calibration of dynamic range: LLMs produced more dispersed mean PIs than humans for low-appeal products; develop post-hoc calibration methods (e.g., monotonic transformations) to match human scale use without degrading rankings.
  • Temporal stability of SSR panels: Assess test–retest reliability of synthetic respondents across days/weeks and under different random seeds/temperatures; quantify within-persona stability.
  • Qualitative feedback validation: Develop methods to score the informativeness, specificity, and actionability of synthetic rationales versus human feedback (e.g., human-coded benchmarks, content validity, redundancy measures).
  • Privacy and ethics of persona impersonation: Analyze risks of replicating sensitive subgroup traits and biases, and propose governance for responsible use in decision-making.
  • Open-data and reproducibility: Results rely on proprietary surveys and closed models; provide public benchmarks, share anchors/protocols, and replicate with open-source models to enable independent verification.

Practical Applications

Immediate Applications

Below are actionable uses that can be deployed now, leveraging the paper’s SSR pipeline (free-text elicitation + embedding-based mapping to Likert distributions), the demographic persona conditioning, and the qualitative rationales produced by LLMs.

  • Sector: Consumer Packaged Goods (CPG) and Market Research — Synthetic concept pre-screening
    • Use case: Screen early-stage product concepts with SSR to rank ideas and approximate Likert distributions before commissioning large human panels; reserve human studies for finalists.
    • Tools/workflow: Persona-conditioned prompts → short free-text PI rationales → embeddings → cosine similarity to anchor statements → per-respondent Likert pmf → survey-level distribution and concept ranking dashboard.
    • Assumptions/dependencies: Anchor statements must be curated; embedding model quality matters; best performance when personas include age/income; domain should be well represented in LLM training data.
  • Sector: Advertising and Creative Testing — Rapid copy and claim iteration
    • Use case: Test alternative headlines, claims, packaging copy, and imagery; select creatives that maximize SSR-derived PI or “relevance” (demonstrated generalization) by persona and price tier.
    • Tools/workflow: Batch runs across variations; heatmaps of Likert pmfs; rationale mining to surface objections and value drivers.
    • Assumptions/dependencies: Image/text stimuli fidelity; ensure anchors are tuned to construct (PI vs relevance vs trust).
  • Sector: Product Management / UX — Feature prioritization with Likert-compatible outputs
    • Use case: Map free-text user comments on features (from interviews or forums) to Likert-like “importance” or “satisfaction” using SSR anchors; prioritize roadmap items.
    • Tools/workflow: Qual → SSR quantization → weighted prioritization by target personas.
    • Assumptions/dependencies: Requires construct-specific anchors (e.g., “importance,” “ease of use”); may need light calibration.
  • Sector: E-commerce and Performance Marketing — Persona-targeted messaging optimization
    • Use case: Estimate PI/relevance by demographic persona (e.g., age, income) to tailor messaging and channel mix.
    • Tools/workflow: Persona grid testing with SSR; uplift charts comparing segments; rationale-based messaging guidelines.
    • Assumptions/dependencies: Demographic conditioning improves ranking fidelity; be cautious on attributes the paper found weaker (gender/region).
  • Sector: Academia and Survey Methods — Pilot study replacement and instrument design
    • Use case: Use SSR to approximate Likert distributions for pilot surveys; test question wording, anchoring vignettes, and scale labels prior to fielding.
    • Tools/workflow: Multi-wording A/Bs → SSR outcomes; select forms with best distributional properties; power analysis using synthetic variance.
    • Assumptions/dependencies: Validity depends on domain alignment; finalize with small human validation.
  • Sector: Public Health Communication and Policy — Message pre-testing
    • Use case: Pre-test public-facing messages (e.g., on hygiene, OTC products) to gauge intent/relevance among personas before running field surveys.
    • Tools/workflow: SSR over message variants; rationale clustering to identify misunderstandings and barriers.
    • Assumptions/dependencies: LLMs mirror known topics better; do not substitute for population-representative polling.
  • Sector: Finance / Corporate Strategy — Early-stage demand sensing
    • Use case: Use SSR-based rankings as a leading indicator for concept appeal in due diligence or product portfolio reviews.
    • Tools/workflow: “Synthetic concept OS” dashboard aggregating PI distributions, rank order, and rationales by consumer segment.
    • Assumptions/dependencies: Treat as directional signal; pair with small human panels for critical go/no-go decisions.
  • Sector: Customer Insights Operations — Mixed-methods synthesis at scale
    • Use case: Combine SSR distributions with qualitatively rich rationales to produce “synthetic focus group” summaries for each concept.
    • Tools/workflow: Topic modeling/salience detection over rationales; issue/benefit heatmaps aligned to Likert pmfs; auto-generated insight decks.
    • Assumptions/dependencies: Maintain anchor libraries and versioning; ensure prompt hygiene.
  • Sector: Survey Platforms / Software — Plug-in SSR module
    • Use case: Add SSR as a service to Qualtrics/SurveyMonkey-type tools: upload stimuli, define personas, get Likert distributions and rationales.
    • Tools/workflow: API microservice (LLM + embeddings + anchor sets), UI for anchor selection, persona templates, drift monitoring.
    • Assumptions/dependencies: API access to robust LLM and embedding models; governance for data privacy and audit logs.
  • Sector: Small Businesses / Indie Creators — Quick concept gut-checks
    • Use case: Evaluate product ideas, price tiers, and packaging options without panel budgets.
    • Tools/workflow: Lightweight web app with prebuilt anchors, common persona presets, and ranked recommendations.
    • Assumptions/dependencies: Results are indicative, not substitutes for market tests; ensure domain is within LLM familiarity.

Long-Term Applications

These require further validation, scaling, domain adaptation, or methodological development beyond the paper’s current scope.

  • Sector: Cross-Domain Surveying — Generalization to other Likert constructs
    • Use case: Extend SSR to satisfaction, trust, safety, usability, perceived risk, fairness, etc., across healthcare, education, public services.
    • Enablers: Construct-specific anchor libraries; domain-tuned embeddings; multilingual/cross-cultural anchors.
    • Dependencies: Calibration against human benchmarks in each construct/domain.
  • Sector: Policy and Governance — Rapid policy prototyping and barometers
    • Use case: Maintain “synthetic panels” to pre-test public policies, health advisories, or climate programs; identify segments with low acceptance.
    • Enablers: Persona populations representing geographies and socioeconomics; drift detection and periodic human recalibration.
    • Dependencies: Ethical oversight; transparency of limitations; non-substitution for official polling.
  • Sector: Energy / Automotive / IoT Hardware — Adoption and feature acceptance modeling
    • Use case: Test willingness to adopt EV features, smart home devices, or dynamic tariffs; identify barriers via rationales.
    • Enablers: Multimodal stimuli (video, 3D renders); price-sensitivity scripts; domain-specific anchors.
    • Dependencies: Model exposure to domain knowledge; integration with discrete choice/pivot to conjoint for pricing realism.
  • Sector: Healthcare and Biopharma — Patient adherence and education materials
    • Use case: Simulate message acceptance for adherence programs or OTC innovations; refine language for diverse literacy levels.
    • Enablers: Health literacy-aware anchors; co-design with clinicians; multilingual personas.
    • Dependencies: Clinical validation; stringent bias monitoring; regulatory compliance.
  • Sector: Education and EdTech — Curriculum and product adoption intent
    • Use case: Predict educator/parent/student receptivity to new curricula or tools; tailor rollout communications by segment.
    • Enablers: Role-specific persona conditioning; anchors for “relevance,” “appropriateness,” “ease of integration.”
    • Dependencies: Cultural/region-specific anchors; validation with pilot districts.
  • Sector: Platforms and Tooling — Auto-optimized SSR and adaptive anchoring
    • Use case: Learn anchor statements and similarity thresholds that maximize alignment with human data; adapt anchors per domain.
    • Enablers: Bayesian calibration with small human holdouts; meta-learning over anchor sets; ensemble of embedding models.
    • Dependencies: Continuous evaluation pipelines; versioned anchors and embeddings; governance.
  • Sector: ML Ops and Research Methods — Human-in-the-loop calibration and fairness controls
    • Use case: Combine SSR with small validation panels to correct drift and enforce fairness (e.g., align subgroup behavior to real data).
    • Enablers: Hierarchical models to reweight subgroup outputs; bias audits across gender/region; uncertainty quantification.
    • Dependencies: Access to periodic ground-truth data; documented thresholds for acceptability.
  • Sector: Creative/Design Automation — Closed-loop generative optimization
    • Use case: LLM generates concept variants; SSR evaluates; evolutionary search iterates toward high-PI variants under constraints (cost, sustainability).
    • Enablers: Generative design + SSR evaluator + constraint solvers; multi-objective optimization.
    • Dependencies: Guardrails to prevent mode collapse; ensure design diversity and compliance.
  • Sector: Standards and Regulation — Validation protocols and acceptance criteria
    • Use case: Develop industry standards for synthetic survey validity (distributional similarity thresholds, correlation attainment targets) and audit trails.
    • Enablers: Cross-industry consortia; benchmark datasets; reproducibility kits.
    • Dependencies: Consensus on metrics; periodic re-validation as models change.
  • Sector: Privacy-Preserving Analytics — On-device or private-cloud SSR
    • Use case: Run SSR pipelines where proprietary concepts cannot leave secure environments.
    • Enablers: Private embeddings; local LLMs; confidential computing.
    • Dependencies: Performance parity with public models; cost of secure infrastructure.

Notes on Feasibility and Risk

  • Domain dependence: SSR is strongest where LLMs have rich prior exposure (e.g., personal care). Expect weaker fidelity in niche domains; calibrate with small human samples.
  • Anchor sensitivity: Results depend on anchor design; maintain libraries, run A/Bs, and consider averaging across sets as in the paper.
  • Demographic conditioning: Improves ranking fidelity (age/income worked best); treat subgroup outputs cautiously where alignment was weaker (gender/region).
  • Embeddings and similarity metrics: Cosine similarity with general-purpose embeddings worked; domain-specific encoders may improve performance but require validation.
  • Ethical and legal constraints: Do not replace human research for high-stakes decisions; disclose synthetic nature; comply with claims substantiation and data privacy.
  • Not a direct proxy for conversion: SSR yields Likert-like intent distributions and relative rankings, not realized purchasing behavior; triangulate with experiments/market tests.

Glossary

  • Acquiescence: A survey response bias where participants tend to agree with statements regardless of content. "responses may be distorted by satisficing, acquiescence, and positivity biases"
  • Anchor statements: Predefined reference texts used as scale anchors to map semantic similarities to Likert points. "predefined anchor statements"
  • Anchoring vignettes: A survey methodology technique using standardized scenarios to adjust for differences in how respondents use rating scales. "(anchoring vignettes)"
  • Angular distance: The angle between embedding vectors, indicating how similar two texts are in an embedding space. "In an embedding space, the synthetic response will have a certain angular distance to any other statement."
  • Conjoint-style willingness-to-pay estimation: A discrete-choice approach to infer how much respondents are willing to pay by varying product attributes. "conjoint-style willingness-to-pay estimation"
  • Correlation attainment: A metric comparing the correlation between synthetic and real outcomes to the maximum correlation achievable given human test–retest limits. "Correlation attainment is then quantified as ρ = E[R^xy] / E[R^xx]"
  • Cosine similarity: A measure of similarity between two vectors based on the cosine of the angle between them. "computing the cosine similarity of embeddings with those of predefined anchor statements"
  • Cumulative distribution function (CDF): A function giving the probability that a variable is less than or equal to a value; used in KS distance. "as the maximum distance between two CDFs"
  • Demographic conditioning: Prompting LLMs with socio-demographic attributes or backstories to influence their responses. "Another focus of some studies is demographic conditioning, where prompts embed socio-demographic backstories."
  • Distributional cosine similarity: Cosine similarity applied to probability distribution vectors, which does not account for ordinal scales. "distributional cosine similarity defined as"
  • Distributional similarity: The degree to which two response distributions match, here assessed via KS similarity. "distributional similarity was poor"
  • Embedding space: A vector space where texts are represented as numerical embeddings for similarity computations. "In an embedding space, the synthetic response will have a certain angular distance to any other statement."
  • Feeling thermometer: A survey measure where respondents rate their feelings on a temperature-like scale. "provide “feeling thermometer” scores"
  • Kolmogorov–Smirnov (KS) similarity: A statistic based on the KS distance used to compare two distributions; here defined as 1 minus the KS distance. "We measure per-survey similarity between synthetic and real purchase intent distributions via Kolmogorov–Smirnov (KS) similarity"
  • Kronecker delta function: An indicator function equal to 1 when indices match and 0 otherwise. "where δ_{ir_c} is the Kronecker delta function."
  • LightGBM: A gradient boosting decision tree framework optimized for efficiency and speed. "we trained 300 LightGBM classifiers"
  • Likert scale: An ordinal survey scale (often 1–5) used to measure attitudes or intentions. "Standard practice is to elicit purchase intent on a Likert scale"
  • Ordinality: The property of ordered categories where relative order matters but not exact distances. "because it respects the ordinality of the scale."
  • Pearson correlation: A measure of linear association between two variables. "We compute Pearson correlations between mean purchase intents of real and synthetic surveys"
  • Personas: Demographic or attitudinal profiles used to condition LLM responses to mimic specific subgroups. "demographic or attitudinal personas"
  • Probability mass function (pmf): A function mapping discrete outcomes to their probabilities. "yielding a response probability mass function (pmf)"
  • Prompt engineering: Designing and refining prompts to elicit desired behaviors from LLMs. "stays with zero-shot elicitation or prompt engineering."
  • Regression-to-the-mean: The tendency of extreme values to move toward the average upon repeated measurement. "such as skewed distributions, over-positivity, or regression-to-the-mean"
  • Semantic similarity mapping: An NLP method that aligns texts by comparing their embeddings for semantic similarity. "(semantic similarity mapping)"
  • Semantic similarity rating (SSR): A method that maps free-text LLM responses to Likert distributions using embedding-based similarity to anchors. "We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions"
  • Test–retest reliability: The consistency of measurements across repeated survey administrations. "SSR achieves 90% of human test–retest reliability"
  • Zero-shot elicitation: Obtaining model outputs for a task without task-specific training or fine-tuning. "stays with zero-shot elicitation or prompt engineering."