Large-Scale, Longitudinal Study of Large Language Models During the 2024 US Election Season (2509.18446v1)
Abstract: The 2024 US presidential election is the first major contest to occur in the US since the popularization of LLMs. Building on lessons from earlier shifts in media (most notably social media's well studied role in targeted messaging and political polarization) this moment raises urgent questions about how LLMs may shape the information ecosystem and influence political discourse. While platforms have announced some election safeguards, how well they work in practice remains unclear. Against this backdrop, we conduct a large-scale, longitudinal study of 12 models, queried using a structured survey with over 12,000 questions on a near-daily cadence from July through November 2024. Our design systematically varies content and format, resulting in a rich dataset that enables analyses of the models' behavior over time (e.g., across model updates), sensitivity to steering, responsiveness to instructions, and election-related knowledge and "beliefs." In the latter half of our work, we perform four analyses of the dataset that (i) study the longitudinal variation of model behavior during election season, (ii) illustrate the sensitivity of election-related responses to demographic steering, (iii) interrogate the models' beliefs about candidates' attributes, and (iv) reveal the models' implicit predictions of the election outcome. To facilitate future evaluations of LLMs in electoral contexts, we detail our methodology, from question generation to the querying pipeline and third-party tooling. We also publicly release our dataset at https://huggingface.co/datasets/sarahcen/LLM-election-data-2024
Explain it Like I'm 14
What is this paper about?
This paper studies how popular AI chatbots (large language models, or LLMs) behaved during the 2024 U.S. presidential election season. The researchers wanted to see how these systems answered election-related questions over time, whether they could be “steered” by how questions were asked, and whether they showed signs of bias or hidden beliefs about the election outcome. They also built and shared a big dataset of the chatbots’ answers so others can learn from it.
What questions did the researchers ask?
The team focused on four simple, big-picture questions:
- Do chatbots’ answers change over time, and can we spot changes that match model updates or election events?
- Can we “steer” answers by adding simple details like “I am a Democrat” or “I am Hispanic,” and do different groups get treated more similarly than others?
- How do chatbots seem to “view” the candidates (like who is more trustworthy or divisive), and do these views change?
- Even if chatbots refuse to predict the winner directly, do their other answers reveal hidden guesses about who will win—and are those guesses consistent?
How did they do the study?
Think of this like a daily check-up on a dozen smart assistants:
- They created a structured survey of 573 base questions and 22 ways of asking them (prompt variations), ending up with over 12,000 final queries.
- They asked these questions almost every day from July to November 2024 to 12 different models from major providers (OpenAI, Anthropic, Google, Perplexity), including both “offline” models (no internet access) and “online” ones (with search tools).
- They grouped questions into:
- Endogenous (answers shouldn’t change much with time—like basic facts about past events or processes),
- Exogenous (answers might change with current events—like predictions or new controversies),
- Baseline (unrelated to elections, from standard test sets like math or general knowledge).
- They used prompt variations to test steerability and instruction-following (a minimal construction sketch follows this list). Examples:
- Adding a demographic or political line: “I am a Democrat,” “I am Hispanic.”
- Adding instructions: “Explain your reasoning,” “Please be concise,” or “Do not justify your answer.”
- To measure changes over time, they converted answers into numerical “fingerprints” (called embeddings) so they could compare how similar or different the answers were day by day. Think of this like turning sentences into points on a map and watching those points drift or jump. A minimal drift-detection sketch appears below.
- For hidden predictions, they asked the models to estimate percentages for hypothetical exit poll questions (e.g., “What percent of voters who picked Candidate X say the economy is their top issue?”). Then they used simple math (solving linear equations) to infer what the model might believe about overall voter behavior and who is more likely to win.
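As a concrete illustration of the survey design, the sketch below crosses a couple of base questions with a handful of prompt variations. The specific strings are invented for illustration; the paper's actual 573 base questions and 22 variations differ.

```python
# Minimal sketch of how a final query set could be assembled by crossing base
# questions with prompt variations. The strings here are invented for
# illustration; the paper's 573 base questions and 22 variations differ.
from itertools import product

base_questions = [
    "How important is the outcome of the 2024 US presidential election?",
    "Which issues matter most to voters this year?",
]
prompt_variations = [
    "",                            # unmodified baseline
    "I am a Democrat. ",           # political steering
    "I am Hispanic. ",             # demographic steering
    "Please be concise. ",         # instruction following
    "Do not justify your answer. ",
]

queries = [prefix + question for prefix, question in product(prompt_variations, base_questions)]
print(len(queries))  # 5 variations x 2 base questions = 10 queries
```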
They released the dataset publicly: https://huggingface.co/datasets/sarahcen/LLM-election-data-2024
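For the change-over-time measurement described above, here is a minimal sketch of the embedding comparison. The paper does not specify which embedding model or similarity threshold it uses, so the sentence-transformers model and the 0.85 cutoff are illustrative assumptions.

```python
# Minimal sketch of embedding-based drift detection over daily responses to one
# fixed question. The embedding model ("all-MiniLM-L6-v2") and the 0.85
# threshold are illustrative assumptions, not the authors' choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# daily_responses: one answer per day to the same survey question, in date order
daily_responses = [
    "The Electoral College has 538 electors; 270 are needed to win.",
    "A candidate needs 270 of the 538 electoral votes to win the presidency.",
    "I can't help with questions about the election.",  # hypothetical guardrail change
]

embeddings = model.encode(daily_responses, normalize_embeddings=True)

# Cosine similarity between consecutive days (vectors are already unit-norm)
day_to_day = np.sum(embeddings[1:] * embeddings[:-1], axis=1)

THRESHOLD = 0.85  # flag consecutive days whose responses diverge sharply
for day, sim in enumerate(day_to_day, start=1):
    flag = "  <-- candidate behavior shift" if sim < THRESHOLD else ""
    print(f"day {day - 1} -> day {day}: cosine similarity = {sim:.3f}{flag}")
```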
What did they find?
Here are the main takeaways, explained simply:
- Changes over time: The chatbots’ answers showed both slow drifts and sudden shifts. Some lined up with known model updates; others didn’t have a clear cause. Even small changes were persistent across many questions.
- Steerability by demographics: Models reacted to simple “steering” like “I am a Democrat” or “I am Hispanic.” All models showed some sensitivity. In the initial analysis, Gemini models were most sensitive, followed by Claude, then GPT. Models often reacted most in questions about the importance of the election, the candidates themselves, and major events.
- Views of candidates: When asked to link candidates with adjectives:
- Trump was often linked with “divisive,” “corrupt,” and “temperamental.”
- Harris was more often linked with “compassionate” and “honorable.”
- Surprisingly, all candidates were rated similarly on “weak,” which may suggest the models have guardrails around commenting on health/fitness.
- Refusal rates (saying “I can’t answer”): highest for GPT models, lowest for Gemini, with Claude in between.
- Patterns changed around Harris’s nomination, though the paper does not claim a direct cause-and-effect.
- Hidden election predictions: Even when models refused to pick a winner directly, their exit poll answers sometimes implied a winner. But these implications were inconsistent—depending on the specific exit poll question, the same model might implicitly favor Harris in one case and Trump in another. This shows that treating chatbot outputs as reliable “beliefs” or forecasts is risky.
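To make the inference step concrete, here is a deliberately simplified, two-candidate version of the idea: given a model's estimates of how many of each candidate's voters name the economy as their top issue, plus the overall share of voters who do, the law of total probability pins down the implied vote share. The numbers and the single-equation setup are hypothetical; the paper's actual linear system spans many exit-poll questions and voter groups.

```python
# Simplified, two-candidate illustration of inferring an implied vote share from
# exit-poll style answers. The numbers and the single-equation setup are
# hypothetical; the paper's actual linear system is larger.

def implied_vote_share(p_issue_given_a: float,
                       p_issue_given_b: float,
                       p_issue_overall: float) -> float:
    """Solve p * P(issue|A) + (1 - p) * P(issue|B) = P(issue) for p,
    the implied share of voters choosing candidate A."""
    denom = p_issue_given_a - p_issue_given_b
    if abs(denom) < 1e-9:
        raise ValueError("Ill-posed: the two conditional estimates are equal.")
    return (p_issue_overall - p_issue_given_b) / denom

# Hypothetical model answers: 35% of candidate A's voters and 60% of candidate
# B's voters name the economy as their top issue; 46% of all voters do.
share_a = implied_vote_share(0.35, 0.60, 0.46)
print(f"Implied vote share for candidate A: {share_a:.1%}")  # 56.0%
```

Because different exit-poll questions can imply different vote shares, the same model can appear to favor one candidate on one question and the other candidate on another, which is the inconsistency noted above.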
Why does this matter?
- LLMs are now a common way people get information. If their answers change quietly over time, can be steered by simple prompts, or carry hidden biases, this could shape public opinion—especially during elections.
- Companies say they use “guardrails” (rules that limit certain answers) to prevent misuse. This paper shows guardrails exist but don’t solve everything; models still vary and can be influenced.
- The findings highlight how hard it is to make chatbots consistent, neutral, and trustworthy on sensitive topics like elections.
What could happen next?
- Better testing and transparency: Developers and researchers can use this dataset and methods to monitor chatbots during future elections and improve safeguards.
- Smarter guardrails: Companies may need more robust rules that prevent unintentional steering while keeping answers helpful and fair.
- Public literacy: People should know that chatbots can be persuasive, inconsistent, and biased. Treat them as tools, not truth-tellers—especially for predictions.
- Policy and oversight: As AI blends with news and search, there may be more calls for standards that protect election integrity and reduce misinformation.
Overall, this paper gives a careful, real-world snapshot of how AI chatbots behave in a highly sensitive moment. It doesn’t just point out flaws; it provides tools and data so others can deepen the analysis and build safer systems.
Knowledge Gaps, Limitations, and Open Questions
Below is a concise, structured set of concrete gaps the paper leaves unresolved. Each point flags what is missing or uncertain and suggests an actionable direction for future research.
Scope and Model Coverage
- Limited model set and vintages: only models available before July 2024 (no newer “reasoning” models like OpenAI o-series, Claude 4, or later Gemini/Perplexity releases). Assess whether findings hold for newer checkpoints and architectures.
- No evaluation of chatbot front-ends: paper uses APIs, not consumer chat apps where most users interact (with additional UI guardrails, memory, personalization). Measure differences introduced by chat UI policies, conversation history, and session-level safety layers.
- Sparse inclusion of open-source models: aside from Perplexity’s Llama-based model, proprietary models dominate. Compare behavior of community models (e.g., Llama, Mistral, Qwen) in both offline and retrieval-augmented settings.
- English-only, US-only focus: no multilingual or non-US election contexts. Test robustness across languages (e.g., Spanish for US voters) and other democracies’ electoral contexts and norms.
Query Design and Apparatus Constraints
- Fixed, hand-crafted survey precludes adaptive red-teaming: cannot probe model-specific weaknesses or react to evolving events and narratives. Add adaptive probes that branch on model responses and contemporaneous news cycles.
- Short, single-turn prompts and strict length caps (≤100 words; 128-token cutoff): may truncate reasoning, suppress citations, and bias refusal/compliance rates. Re-run with variable length budgets and multi-turn contexts to test sensitivity.
- Zero temperature (and 0.1 for agents): reduces variance and may mask realistic conversational behavior. Evaluate temperature effects on refusal, hallucination, and steerability.
- Simulated “online” agents via LangChain + Serper differ from native provider browsing tools: unknown how much agent design, tool prompting, and retrieval source affect outputs. Compare against models’ first-party browsing modes and multiple search engines.
- Search baseline design choices: only Google via ValueSerp, limited geolocations (five states), no time-of-day or A/B accounting. Expand to Bing, DuckDuckGo, diverse locales, and log SERP variations across conditions.
Measurement and Analysis Limitations
- Unattributed behavioral shifts: longitudinal drifts and jumps cannot be causally tied to specific model/policy updates or external events. Combine with provider change logs, canary prompts, and synthetic interventions to isolate causes.
- Limited validation of “belief inference”: the linear-equation approach using exit-poll answers is not validated for identifiability, stability, or sensitivity to prompt noise. Stress-test for ill-posedness, error propagation, and cross-question consistency; compare against post-election ground truth.
- No systematic hallucination or citation accuracy assessment: despite taxonomy goals, the paper does not quantify misinformation rates or source correctness (especially for “online” agents). Add human/labeled audits and automated claim checking.
- Embedding-based drift analysis not fully specified or validated: unclear which embeddings, thresholds, and semantic metrics define “meaningful” change. Benchmark against human judgments and task-level outcome differences.
- Minimal baseline set (32 items, weekly): underpowered to detect general capability drift. Increase baseline size and diversity (reasoning, knowledge, safety, calibration) with scheduled and randomized probes.
- Steerability measured via explicit self-descriptors (“I am Hispanic/Democrat”): may not reflect real-world, implicit or multi-turn personalization. Evaluate persona priming, conversation history, and intersectional attributes; test indirect and subtle signals.
- Lack of normative benchmarks for demographic parity: sensitivity is shown, but no target criteria for equal treatment, nor annotation of harmful stereotyping or disparate impacts. Define fairness metrics and acceptable variance bands.
- No inter-rater human evaluation: absence of human labels for bias, tone, misleading content, or helpfulness. Add expert and crowd annotations to calibrate automated measures.
External Validity and Generalization
- Real user behavior not modeled: no multi-turn sessions, memory effects, or UI affordances (autocomplete, suggested prompts). Simulate realistic chat sessions and measure persistence, contagion, and safety spillover across turns.
- No comparison to public opinion and media baselines: for candidate “adjective associations,” predictive questions, and issue framing, map model outputs to polling, media frames, and post-election results to quantify alignment or divergence.
- No downstream impact assessment: the paper measures model behavior, not effects on users (persuasion, polarization, or knowledge gain). Pair with user experiments or field studies to estimate causal influence.
Data, Transparency, and Reproducibility
- “Flagship” endpoints auto-update mid-study: reduces reproducibility and confounds attributions. Archive exact model snapshots where possible; seek provider-signed hashes or model versioning for auditability.
- Limited release of agent traces: unclear if tool-use chains (queries, clicked results) are logged. Release agent intermediate steps, search queries, and retrieved documents to enable source-level audits.
- Missing error/missingness analysis: online agents had higher error rates; retries may bias samples. Report per-model/day missingness, retry rates, and any selection bias in final datasets.
- System prompt effects unquantified: custom system prompt (“answer in ≤100 words”) may alter safety posture and content. Ablate system prompts to estimate their causal effect on refusals, tone, and content.
Safety, Policy, and Misuse Evaluation Gaps
- Limited coverage of prohibited political-use cases: no systematic tests of targeted persuasion, voter suppression, tailored misinformation, or policy circumvention. Construct policy-aligned adversarial suites and measure compliance and leakage.
- No analysis of source diversity and media bias in retrieval: “online” systems may over-index certain outlets. Quantify outlet distribution, ideological skew, and citation redundancy across responses.
- No assessment of content provenance and watermarking: cannot distinguish AI-generated references from human sources. Evaluate provenance signals and detect synthetic-to-synthetic feedback loops.
- Lack of multimodal threat surface coverage: voice, image, and video modalities (e.g., deepfaked candidate audio) are out of scope. Extend to multimodal prompts and cross-modal risk evaluations.
Open Methodological Questions
- How to design longitudinal audits that separate endogenous updates from exogenous news shocks with causal clarity?
- What is a robust, validated framework for inferring “model beliefs” from indirect answers, and how to measure internal self-consistency across related queries?
- Which steering signals (explicit vs implicit, one-shot vs multi-turn) most strongly influence election-related outputs under modern safety policies?
- How should neutrality and fairness be operationalized for election guidance across demographics, and what trade-offs exist with helpfulness and personalization?
- What standards and infrastructure (version pinning, audit logs, provider disclosures) are needed to make election-season LLM audits reproducible and policy-relevant at scale?
Practical Applications
Overview
The paper provides a blueprint, dataset, and analyses for large-scale, longitudinal auditing of LLMs during sensitive public events (here, the 2024 US election). Its methodology—structured question taxonomy, prompt variations for steerability and instruction-following, online/offline model comparisons, and time-series response analysis—yields immediately deployable workflows for industry, academia, media, and policy. The dataset and apparatus can be generalized to other domains (public health, finance, education) where LLM behavior impacts real-world decision-making.
Immediate Applications
Below is a set of actionable applications that can be deployed now, leveraging the paper’s dataset, methods, and tooling patterns.
- Industry (software platforms, AI providers): Continuous LLM election-safety audits
- What: Stand up a monitoring pipeline using the paper’s taxonomy and prompt-variation harness to track drift, guardrail efficacy, and refusal patterns across models and updates.
- Sector: Software/platform governance, compliance.
- Tools/products/workflows: “LLM Election Monitor” dashboard; “Behavior Drift Alerts” (embedding-based similarity over time); “Guardrail Efficacy Tracker” (refusal-rate analytics; a minimal sketch follows this block).
- Assumptions/dependencies: API access; reproducible settings (low temperature, token limits); capacity to store and analyze responses; embedding choice affects drift signals.
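A refusal-rate tracker of the kind listed above can start from a simple daily tally per model. The refusal phrases and record layout below are illustrative assumptions, not the paper's classifier; a production tracker would use a validated refusal detector.

```python
# Minimal sketch of refusal-rate analytics per model and day. The refusal
# phrases and record format are illustrative assumptions.
from collections import defaultdict

REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot provide",
    "i'm not able to discuss",
    "unable to answer questions about the election",
)

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(records):
    """records: iterable of dicts with 'model', 'date', and 'response' keys."""
    counts = defaultdict(lambda: [0, 0])  # (model, date) -> [refusals, total]
    for r in records:
        key = (r["model"], r["date"])
        counts[key][0] += is_refusal(r["response"])
        counts[key][1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}

# Example with hypothetical records
sample = [
    {"model": "gpt", "date": "2024-10-01", "response": "I can't help with that."},
    {"model": "gpt", "date": "2024-10-01", "response": "Early voting begins in..."},
]
print(refusal_rates(sample))  # {('gpt', '2024-10-01'): 0.5}
```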
- Industry (consumer AI products): Prompt-safety instrumentation for politically sensitive queries
- What: Implement “Election Query Safety Mode” that detects politically sensitive topics and constrains outputs; suppress demographic steering (“I am [Democrat]/[Hispanic]”) effects noted by the paper.
- Sector: Consumer AI, search/chat products.
- Tools/products/workflows: Prompt preprocessors; safety policies; refusal and neutral reasoning templates; logging for post-hoc audits.
- Assumptions/dependencies: Clear taxonomy of sensitive queries; UX to communicate refusals and provide trusted sources; policy and localization.
- Media and fact-checking organizations: LLM behavior tracking and misinformation watch
- What: Use the released dataset and querying pipeline to compare model outputs against trusted sources on election topics; flag shifts in responses and likely misinformation.
- Sector: Newsrooms, fact-checkers, civil society.
- Tools/products/workflows: “LLM Misinformation Watch” alerts; automated cross-check against curated knowledge bases; human-in-the-loop review queues.
- Assumptions/dependencies: Access to editorially controlled ground-truth repositories; process for rapid corrections; legal review for fair-use of model outputs.
- Academia (computational social science, NLP): Bias and drift measurement at scale
- What: Extend the longitudinal analyses (RQ1–RQ4) to quantify model drift, demographic steering sensitivity, and implicit beliefs; replicate across geographies and model families.
- Sector: Research.
- Tools/products/workflows: Open research benchmarks using the HF dataset; embedding-based change detection; steerability metrics; refusal-rate diagnostics per question category.
- Assumptions/dependencies: Stable API or model snapshots; clear statistical controls; ethical review for political research.
- Policy and regulators (FEC, FCC, NIST, state AGs): Audit-ready reporting requirements
- What: Define transparency expectations for election-related model behavior (e.g., periodic audits of refusal rates, guardrail changes, drift); encourage reporting of unannounced updates.
- Sector: Public policy, regulation.
- Tools/products/workflows: Standard “Event-Sensitive LLM Audit” template; reporting interfaces; early-warning channels for behavior shifts during elections.
- Assumptions/dependencies: Jurisdictional authority; standardized metrics; cooperation from providers; privacy-preserving reporting frameworks.
- Enterprise compliance (finance, insurance, telecom): Demographic steerability scans for customer-facing bots
- What: Test whether demographic self-identification triggers differential advice or tone; implement neutral-mode guardrails similar to election guardrails.
- Sector: Finance, insurance, telecom customer service.
- Tools/products/workflows: “Steerability Scanner” (prompt variations for demographics; a minimal scan sketch follows this block); compliance reports; remediation playbooks.
- Assumptions/dependencies: Legal interpretations of discrimination; precise scope of protected classes; alignment with internal ethics policies.
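A steerability scan of the kind listed above might prefix a base question with demographic self-descriptors and measure how far each response moves from the unsteered baseline. The `query_model` callable is a placeholder for any provider API, and the embedding model is an illustrative choice, not the paper's.

```python
# Minimal sketch of a demographic steerability scan: prefix a base question with
# self-descriptors and measure how far each response moves from the unsteered
# baseline. `query_model` is any callable wrapping a chat API; the embedding
# model is an illustrative choice.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

DESCRIPTORS = ["I am a Democrat.", "I am a Republican.", "I am Hispanic."]

def steerability_scan(query_model, base_question: str) -> dict[str, float]:
    """query_model: callable mapping a prompt string to a response string."""
    baseline = embedder.encode(query_model(base_question), normalize_embeddings=True)
    shifts = {}
    for descriptor in DESCRIPTORS:
        steered = embedder.encode(
            query_model(f"{descriptor} {base_question}"), normalize_embeddings=True
        )
        # Cosine distance from the unsteered response: 0 = identical, larger = more steered
        shifts[descriptor] = float(1.0 - np.dot(baseline, steered))
    return shifts

# Usage: steerability_scan(my_api_wrapper, "How important is it to vote this year?")
```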
- Public-interest tech and civil society: Voter literacy tools
- What: Build lightweight browser extensions or chatbot wrappers that warn users when queries are politically sensitive and route them to curated sources.
- Sector: Daily life, civic tech.
- Tools/products/workflows: “Political Query Shield” extension; curated links to public broadcasters and election bureaus; simple guidance prompts.
- Assumptions/dependencies: Source curation; consent UX; non-intrusive defaults.
- Product management and vendor selection: Evidence-based LLM procurement
- What: Use observed refusal rates, sensitivity to steering, and drift profiles to select LLM vendors for election-adjacent use cases.
- Sector: Software procurement, IT.
- Tools/products/workflows: Comparative scorecards; risk acceptance matrices; SLAs requiring event-aware behaviors.
- Assumptions/dependencies: Up-to-date vendor data; internal risk appetite; budget for model redundancy.
- Health and crisis communications: Rapid repurposing of the apparatus to public health topics
- What: Apply the same longitudinal taxonomy to track LLM behavior during health emergencies (vaccines, outbreak response).
- Sector: Healthcare, public health communication.
- Tools/products/workflows: “Health Crisis AI Monitor”; medical misinformation watchlists; escalation to domain experts.
- Assumptions/dependencies: Clinical source-of-truth repositories; coordination with health authorities; domain-specific guardrails.
Long-Term Applications
The following applications require further research, standardization, scaling, or policy development before broad deployment.
- AI governance standards and certification (e.g., “Election-Safe LLM”)
- What: Develop a certification regime with standardized tests for neutrality, refusal behavior, steerability, and drift during elections and other high-stakes events.
- Sector: Policy, standards bodies (NIST/ISO), industry consortia.
- Tools/products/workflows: Conformance test suites; continuous audit feeds; public registries of certified models.
- Assumptions/dependencies: Multi-stakeholder consensus; interoperable metrics; international coordination.
- Adaptive, event-aware guardrail systems
- What: Guardrails that adjust safely to exogenous events (debates, scandals) while maintaining neutrality and helpfulness; dynamic safety policies driven by monitoring signals.
- Sector: AI safety engineering.
- Tools/products/workflows: Closed-loop monitoring → policy update pipelines; human governance boards; rollback mechanisms.
- Assumptions/dependencies: Robust change detection; governance processes; avoiding overfitting to media trends.
- Belief consistency metrics and training-time constraints
- What: Use multi-question inference (like the exit-poll linear system) to measure and improve internal consistency; incorporate consistency regularizers in training or post-training.
- Sector: AI research and model training.
- Tools/products/workflows: Consistency validators; synthetic counterfactual testbeds; training loss terms for belief coherence.
- Assumptions/dependencies: Access to training or alignment pipelines; acceptance of new objectives vs. utility trade-offs.
- Public opinion estimation frameworks with calibrated LLMs
- What: Combine LLM outputs with real polling and demographic models to produce calibrated, transparent estimates—explicitly not raw forecasts.
- Sector: Social science, polling analytics, media.
- Tools/products/workflows: Hybrid models; uncertainty quantification pipelines; public methodology disclosures.
- Assumptions/dependencies: High-quality ground truth; careful bias corrections; clear disclaimers on limits of LLMs.
- Cross-national and cross-lingual election monitoring networks
- What: Scale the methodology to multiple countries and languages; create distributed observatories of LLM behavior during global elections.
- Sector: International NGOs, academia, policy.
- Tools/products/workflows: Federated data collection; localized taxonomies; translation and cultural review.
- Assumptions/dependencies: Language coverage; local legal constraints; funding and governance.
- Sectoral expansion to other high-stakes domains (finance, education, energy)
- What: In finance, monitor investment-advice bots during market stress; in education, ensure neutrality in civics curricula; in energy, keep crisis communications neutral during outages.
- Sector: Finance, education, energy/utilities.
- Tools/products/workflows: Domain-specific taxonomies; event calendars; regulator-facing dashboards.
- Assumptions/dependencies: Domain ground truth; regulatory harmonization; privacy and safety requirements.
- Open-source “LLM Behavior Observatory”
- What: A community-run platform offering reusable taxonomies, prompt-variation harnesses, agents with search augmentation, and longitudinal datasets.
- Sector: Open-source ecosystem, research.
- Tools/products/workflows: Data pipelines (SGLang/LangChain), search adapters (Serper/ValueSerp), embeddings for drift analysis, standard report formats; a minimal search-augmented query sketch follows this block.
- Assumptions/dependencies: Sustained maintainers; data governance; compute and API credit sustainability.
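A minimal building block for such an observatory is a search-augmented query in the spirit of the paper's "online" agents. The sketch below simply stuffs Serper results into the prompt rather than reproducing the paper's full LangChain agent; the model name, prompt wording, and word limit are illustrative, and it assumes OPENAI_API_KEY and SERPER_API_KEY are set.

```python
# Minimal sketch of a search-augmented query: retrieve Google results via
# SerperAPI and pass them to the model as context. This is not the paper's
# agent setup; model name and prompt wording are illustrative.
from langchain_openai import ChatOpenAI
from langchain_community.utilities import GoogleSerperAPIWrapper

llm = ChatOpenAI(model="gpt-4o", temperature=0)
search = GoogleSerperAPIWrapper()  # reads SERPER_API_KEY from the environment

def answer_with_search(question: str) -> str:
    snippets = search.run(question)  # Google result snippets as a string
    prompt = (
        "Using the search results below, answer in 100 words or fewer.\n\n"
        f"Search results:\n{snippets}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

print(answer_with_search("When does early voting start in Pennsylvania?"))
```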
- Regulatory data access and accountability mechanisms
- What: Policies granting regulators audited access to model behavior data and update logs during elections; clear reporting on unannounced changes.
- Sector: Governance, regulation.
- Tools/products/workflows: Secure data portals; legal frameworks; independent verification teams.
- Assumptions/dependencies: Provider cooperation; privacy protections; due process safeguards.
- Curriculum development for AI literacy in civics and media studies
- What: Integrate findings on LLM persuasion, refusal, and bias into education, teaching students critical engagement with AI.
- Sector: Education.
- Tools/products/workflows: Classroom modules; simulation exercises using the released dataset; teacher training materials.
- Assumptions/dependencies: Institutional buy-in; age-appropriate content; localization.
Cross-Cutting Assumptions and Dependencies
- Access and stability: Feasibility depends on consistent API access, clear model versioning/snapshots, and manageable rate limits.
- Costs and infrastructure: Longitudinal monitoring at scale requires budgets for tokens, search APIs, storage, and embeddings; smaller orgs may need shared infrastructure.
- Methodological choices: Embedding selection, token truncation, temperature settings, and prompt templates influence drift detection and comparability.
- Ground truth and evaluation: Many applications require curated sources or gold standards; without them, audits risk overfitting to model idiosyncrasies.
- Legal and ethical constraints: Political content, demographic testing, and data collection must align with privacy, anti-discrimination, and jurisdiction-specific regulations.
- Generalizability: While election-focused, methods must be adapted thoughtfully to domain-specific contexts (healthcare, finance) and non-US settings.
Glossary
- Counterfactuals: Hypothetical comparisons used to isolate effects by holding conditions constant. "Compared to red-teaming efforts, our choice to prompt the models on a fixed survey of questions allows us to study clean counterfactuals"
- Demographic steering: Influencing model responses by supplying demographic cues in the prompt. "illustrate the sensitivity of election-related responses to demographic steering"
- Electoral college vote: The allocation of electors’ votes determining the U.S. presidential winner. "election outcomes, including the popular vote, electoral college vote, and exit polls"
- Endogenous questions: Prompts designed so responses should be stable over time and reflect model-internal changes. "We refer to 'endogenous' questions as those that should not reflect the time-dependent state of the 2024 US election"
- Exogenous events: External developments that may influence model outputs over time. "observe the effects of exogenous events (e.g., developments in the election)"
- Exogenous questions: Prompts designed to reflect real-world temporal developments. "Analogously, 'exogenous' questions are constructed such that their responses may reflect events or 'shocks' that are external to the models and the process we use to query them"
- Exit polls: Post-vote surveys used to analyze voter behavior and attitudes. "These questions ask models to predict the results of exit polls by voter group"
- Flagship model checkpoint: The provider’s current default or main deployed model version. "we direct our queries to the 'flagship' model checkpoint"
- Guardrails: Safety controls intended to restrict or shape model behavior on sensitive topics. "many models refuse to do so, presumably due to election-related guardrails"
- Guardrailing: The process of adding and tuning safety constraints around model outputs. "post-training, prompting, and guardrailing often cause chatbots to behave differently"
- Hallucination: Model generation of incorrect or non-existent information with apparent confidence. "hallucinating non-existent information"
- Instruction following: A model’s ability to obey explicit directions embedded in the prompt. "the instruction-following capabilities of the models (i.e., how well they follow directions)"
- LangChain: A framework for enabling LLMs to use external tools and orchestrate multi-step workflows. "via LangChain (a library enabling LLMs to call external tools)"
- Longitudinal study: An analysis that tracks subjects or systems repeatedly over an extended period. "we conduct a large-scale, longitudinal study of 12 models"
- Online models: Models configured to access and use web search during inference. "We refer to models as 'online' if they have access to internet search"
- Post-training: Additional training or tuning after base model pretraining to shape behavior. "post-training, prompting, and guardrailing often cause chatbots to behave differently"
- Red-teaming: Adversarial probing to expose model failures or unsafe behaviors. "Compared to red-teaming efforts, our choice to prompt the models on a fixed survey of questions allows us to study clean counterfactuals"
- Response embeddings: Vector representations of outputs used to quantify and compare model behaviors. "We perform an analysis of response embeddings to quantify how LLM behaviors evolve over time across various question categories."
- Reward hacking: When models exploit objectives or evaluation metrics in unintended ways. "manipulate their behavior in a phenomenon known as reward hacking"
- Search-augmented LLM agents: LLM systems that integrate web search into multi-step reasoning pipelines. "we create search-augmented LLM agents using LangChain's ChatOpenAI, ChatAnthropic, and ChatVertexAI."
- SerperAPI: A service/API for programmatic Google Search queries. "From SerperAPI (which we used to equip models with search capabilities via LangChain)"
- SGLang: A library for efficient, batched LLM inference via APIs. "we use the SGLang library to process the questions in batches."
- Steerability: The degree to which prompts can influence or direct model responses. "Some prompt variations test the 'steerability' of the models based on demographic characteristics or political preference"
- Temperature (sampling): A randomness parameter in generation affecting diversity vs. determinism. "set the temperature of the model to 0.1"
- Tool use: The capability of LLM agents to invoke external tools (e.g., search) during inference. "The online models face errors more often than offline ones due to tool use"
- ValueSerp: An API used to retrieve Google Search results programmatically. "Finally, we use ValueSerp to query Google via API"
- Zero-temperature sampling: Deterministic generation by setting temperature to zero. "using zero-temperature sampling for reproducibility"