Large-Scale, Longitudinal Study of Large Language Models During the 2024 US Election Season (2509.18446v1)
Abstract: The 2024 US presidential election is the first major contest to occur in the US since the popularization of LLMs. Building on lessons from earlier shifts in media (most notably social media's well studied role in targeted messaging and political polarization) this moment raises urgent questions about how LLMs may shape the information ecosystem and influence political discourse. While platforms have announced some election safeguards, how well they work in practice remains unclear. Against this backdrop, we conduct a large-scale, longitudinal study of 12 models, queried using a structured survey with over 12,000 questions on a near-daily cadence from July through November 2024. Our design systematically varies content and format, resulting in a rich dataset that enables analyses of the models' behavior over time (e.g., across model updates), sensitivity to steering, responsiveness to instructions, and election-related knowledge and "beliefs." In the latter half of our work, we perform four analyses of the dataset that (i) study the longitudinal variation of model behavior during election season, (ii) illustrate the sensitivity of election-related responses to demographic steering, (iii) interrogate the models' beliefs about candidates' attributes, and (iv) reveal the models' implicit predictions of the election outcome. To facilitate future evaluations of LLMs in electoral contexts, we detail our methodology, from question generation to the querying pipeline and third-party tooling. We also publicly release our dataset at https://huggingface.co/datasets/sarahcen/LLM-election-data-2024
Explain it Like I'm 14
What is this paper about?
This paper studies how popular AI chatbots (large language models, or LLMs) behaved during the 2024 U.S. presidential election season. The researchers wanted to see how these systems answered election-related questions over time, whether they could be “steered” by how questions were asked, and whether they showed signs of bias or hidden beliefs about the election outcome. They also built and shared a big dataset of the chatbots’ answers so others can learn from it.
What questions did the researchers ask?
The team focused on four simple, big-picture questions:
- Do chatbots’ answers change over time, and can we spot changes that match model updates or election events?
- Can we “steer” answers by adding simple details like “I am a Democrat” or “I am Hispanic,” and do different groups get treated more similarly than others?
- How do chatbots seem to “view” the candidates (like who is more trustworthy or divisive), and do these views change?
- Even if chatbots refuse to predict the winner directly, do their other answers reveal hidden guesses about who will win—and are those guesses consistent?
How did they do the study?
Think of this like a daily check-up on a dozen smart assistants:
- They created a structured survey of 573 base questions and 22 ways of asking them (prompt variations), ending up with over 12,000 final queries.
- They asked these questions almost every day from July to November 2024 to 12 different models from major providers (OpenAI, Anthropic, Google, Perplexity), including both “offline” models (no internet access) and “online” ones (with search tools).
- They grouped questions into:
- Endogenous (answers shouldn’t change much with time—like basic facts about past events or processes),
- Exogenous (answers might change with current events—like predictions or new controversies),
- Baseline (unrelated to elections, from standard test sets like math or general knowledge).
- They used prompt variations to test steerability and instruction-following (a minimal construction sketch follows this list). Examples:
- Adding a demographic or political line: “I am a Democrat,” “I am Hispanic.”
- Adding instructions: “Explain your reasoning,” “Please be concise,” or “Do not justify your answer.”
- To measure changes over time, they converted answers into numerical “fingerprints” (called embeddings) so they could compare how similar or different the answers were day by day. Think of this like turning sentences into points on a map and watching those points drift or jump. A minimal drift-detection sketch appears below.
- For hidden predictions, they asked the models to estimate percentages for hypothetical exit poll questions (e.g., “What percent of voters who picked Candidate X say the economy is their top issue?”). Then they used simple math (solving linear equations) to infer what the model might believe about overall voter behavior and who is more likely to win.
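As a concrete illustration of the survey design, the sketch below crosses a couple of base questions with a handful of prompt variations. The specific strings are invented for illustration; the paper's actual 573 base questions and 22 variations differ.

```python
# Minimal sketch of how a final query set could be assembled by crossing base
# questions with prompt variations. The strings here are invented for
# illustration; the paper's 573 base questions and 22 variations differ.
from itertools import product

base_questions = [
    "How important is the outcome of the 2024 US presidential election?",
    "Which issues matter most to voters this year?",
]
prompt_variations = [
    "",                            # unmodified baseline
    "I am a Democrat. ",           # political steering
    "I am Hispanic. ",             # demographic steering
    "Please be concise. ",         # instruction following
    "Do not justify your answer. ",
]

queries = [prefix + question for prefix, question in product(prompt_variations, base_questions)]
print(len(queries))  # 5 variations x 2 base questions = 10 queries
```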
They released the dataset publicly: https://huggingface.co/datasets/sarahcen/LLM-election-data-2024
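For the change-over-time measurement described above, here is a minimal sketch of the embedding comparison. The paper does not specify which embedding model or similarity threshold it uses, so the sentence-transformers model and the 0.85 cutoff are illustrative assumptions.

```python
# Minimal sketch of embedding-based drift detection over daily responses to one
# fixed question. The embedding model ("all-MiniLM-L6-v2") and the 0.85
# threshold are illustrative assumptions, not the authors' choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# daily_responses: one answer per day to the same survey question, in date order
daily_responses = [
    "The Electoral College has 538 electors; 270 are needed to win.",
    "A candidate needs 270 of the 538 electoral votes to win the presidency.",
    "I can't help with questions about the election.",  # hypothetical guardrail change
]

embeddings = model.encode(daily_responses, normalize_embeddings=True)

# Cosine similarity between consecutive days (vectors are already unit-norm)
day_to_day = np.sum(embeddings[1:] * embeddings[:-1], axis=1)

THRESHOLD = 0.85  # flag consecutive days whose responses diverge sharply
for day, sim in enumerate(day_to_day, start=1):
    flag = "  <-- candidate behavior shift" if sim < THRESHOLD else ""
    print(f"day {day - 1} -> day {day}: cosine similarity = {sim:.3f}{flag}")
```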
What did they find?
Here are the main takeaways, explained simply:
- Changes over time: The chatbots’ answers showed both slow drifts and sudden shifts. Some lined up with known model updates; others didn’t have a clear cause. Even small changes were persistent across many questions.
- Steerability by demographics: Models reacted to simple “steering” like “I am a Democrat” or “I am Hispanic.” All models showed some sensitivity. In the initial analysis, Gemini models were most sensitive, followed by Claude, then GPT. Models often reacted most in questions about the importance of the election, the candidates themselves, and major events.
- Views of candidates: When asked to link candidates with adjectives:
- Trump was often linked with “divisive,” “corrupt,” and “temperamental.”
- Harris was more often linked with “compassionate” and “honorable.”
- Surprisingly, all candidates were rated similarly on “weak,” which may suggest the models have guardrails around commenting on health/fitness.
- Refusal rates (saying “I can’t answer”): highest for GPT models, lowest for Gemini, with Claude in between.
- Patterns changed around Harris’s nomination, though the paper does not claim a direct cause-and-effect.
- Hidden election predictions: Even when models refused to pick a winner directly, their exit poll answers sometimes implied a winner. But these implications were inconsistent—depending on the specific exit poll question, the same model might implicitly favor Harris in one case and Trump in another. This shows that treating chatbot outputs as reliable “beliefs” or forecasts is risky.
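To make the inference step concrete, here is a deliberately simplified, two-candidate version of the idea: given a model's estimates of how many of each candidate's voters name the economy as their top issue, plus the overall share of voters who do, the law of total probability pins down the implied vote share. The numbers and the single-equation setup are hypothetical; the paper's actual linear system spans many exit-poll questions and voter groups.

```python
# Simplified, two-candidate illustration of inferring an implied vote share from
# exit-poll style answers. The numbers and the single-equation setup are
# hypothetical; the paper's actual linear system is larger.

def implied_vote_share(p_issue_given_a: float,
                       p_issue_given_b: float,
                       p_issue_overall: float) -> float:
    """Solve p * P(issue|A) + (1 - p) * P(issue|B) = P(issue) for p,
    the implied share of voters choosing candidate A."""
    denom = p_issue_given_a - p_issue_given_b
    if abs(denom) < 1e-9:
        raise ValueError("Ill-posed: the two conditional estimates are equal.")
    return (p_issue_overall - p_issue_given_b) / denom

# Hypothetical model answers: 35% of candidate A's voters and 60% of candidate
# B's voters name the economy as their top issue; 46% of all voters do.
share_a = implied_vote_share(0.35, 0.60, 0.46)
print(f"Implied vote share for candidate A: {share_a:.1%}")  # 56.0%
```

Because different exit-poll questions can imply different vote shares, the same model can appear to favor one candidate on one question and the other candidate on another, which is the inconsistency noted above.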
Why does this matter?
- LLMs are now a common way people get information. If their answers change quietly over time, can be steered by simple prompts, or carry hidden biases, this could shape public opinion—especially during elections.
- Companies say they use “guardrails” (rules that limit certain answers) to prevent misuse. This paper shows guardrails exist but don’t solve everything; models still vary and can be influenced.
- The findings highlight how hard it is to make chatbots consistent, neutral, and trustworthy on sensitive topics like elections.
What could happen next?
- Better testing and transparency: Developers and researchers can use this dataset and methods to monitor chatbots during future elections and improve safeguards.
- Smarter guardrails: Companies may need more robust rules that prevent unintentional steering while keeping answers helpful and fair.
- Public literacy: People should know that chatbots can be persuasive, inconsistent, and biased. Treat them as tools, not truth-tellers—especially for predictions.
- Policy and oversight: As AI blends with news and search, there may be more calls for standards that protect election integrity and reduce misinformation.
Overall, this paper gives a careful, real-world snapshot of how AI chatbots behave in a highly sensitive moment. It doesn’t just point out flaws; it provides tools and data so others can deepen the analysis and build safer systems.
Knowledge Gaps, Limitations, and Open Questions
Below is a concise, structured set of concrete gaps the paper leaves unresolved. Each point flags what is missing or uncertain and suggests an actionable direction for future research.
Scope and Model Coverage
- Limited model set and vintages: only models available before July 2024 (no newer “reasoning” models like OpenAI o-series, Claude 4, or later Gemini/Perplexity releases). Assess whether findings hold for newer checkpoints and architectures.
- No evaluation of chatbot front-ends: paper uses APIs, not consumer chat apps where most users interact (with additional UI guardrails, memory, personalization). Measure differences introduced by chat UI policies, conversation history, and session-level safety layers.
- Sparse inclusion of open-source models: aside from Perplexity’s Llama-based model, proprietary models dominate. Compare behavior of community models (e.g., Llama, Mistral, Qwen) in both offline and retrieval-augmented settings.
- English-only, US-only focus: no multilingual or non-US election contexts. Test robustness across languages (e.g., Spanish for US voters) and other democracies’ electoral contexts and norms.
Query Design and Apparatus Constraints
- Fixed, hand-crafted survey precludes adaptive red-teaming: cannot probe model-specific weaknesses or react to evolving events and narratives. Add adaptive probes that branch on model responses and contemporaneous news cycles.
- Short, single-turn prompts and strict length caps (≤100 words; 128-token cutoff): may truncate reasoning, suppress citations, and bias refusal/compliance rates. Re-run with variable length budgets and multi-turn contexts to test sensitivity.
- Zero temperature (and 0.1 for agents): reduces variance and may mask realistic conversational behavior. Evaluate temperature effects on refusal, hallucination, and steerability.
- Simulated “online” agents via LangChain + Serper differ from native provider browsing tools: unknown how much agent design, tool prompting, and retrieval source affect outputs. Compare against models’ first-party browsing modes and multiple search engines.
- Search baseline design choices: only Google via ValueSerp, limited geolocations (five states), no time-of-day or A/B accounting. Expand to Bing, DuckDuckGo, diverse locales, and log SERP variations across conditions.
Measurement and Analysis Limitations
- Unattributed behavioral shifts: longitudinal drifts and jumps cannot be causally tied to specific model/policy updates or external events. Combine with provider change logs, canary prompts, and synthetic interventions to isolate causes.
- Limited validation of “belief inference”: the linear-equation approach using exit-poll answers is not validated for identifiability, stability, or sensitivity to prompt noise. Stress-test for ill-posedness, error propagation, and cross-question consistency; compare against post-election ground truth.
- No systematic hallucination or citation accuracy assessment: despite taxonomy goals, the paper does not quantify misinformation rates or source correctness (especially for “online” agents). Add human/labeled audits and automated claim checking.
- Embedding-based drift analysis not fully specified or validated: unclear which embeddings, thresholds, and semantic metrics define “meaningful” change. Benchmark against human judgments and task-level outcome differences.
- Minimal baseline set (32 items, weekly): underpowered to detect general capability drift. Increase baseline size and diversity (reasoning, knowledge, safety, calibration) with scheduled and randomized probes.
- Steerability measured via explicit self-descriptors (“I am Hispanic/Democrat”): may not reflect real-world, implicit or multi-turn personalization. Evaluate persona priming, conversation history, and intersectional attributes; test indirect and subtle signals.
- Lack of normative benchmarks for demographic parity: sensitivity is shown, but no target criteria for equal treatment, nor annotation of harmful stereotyping or disparate impacts. Define fairness metrics and acceptable variance bands.
- No inter-rater human evaluation: absence of human labels for bias, tone, misleading content, or helpfulness. Add expert and crowd annotations to calibrate automated measures.
External Validity and Generalization
- Real user behavior not modeled: no multi-turn sessions, memory effects, or UI affordances (autocomplete, suggested prompts). Simulate realistic chat sessions and measure persistence, contagion, and safety spillover across turns.
- No comparison to public opinion and media baselines: for candidate “adjective associations,” predictive questions, and issue framing, map model outputs to polling, media frames, and post-election results to quantify alignment or divergence.
- No downstream impact assessment: the paper measures model behavior, not effects on users (persuasion, polarization, or knowledge gain). Pair with user experiments or field studies to estimate causal influence.
Data, Transparency, and Reproducibility
- “Flagship” endpoints auto-update mid-study: reduces reproducibility and confounds attributions. Archive exact model snapshots where possible; seek provider-signed hashes or model versioning for auditability.
- Limited release of agent traces: unclear if tool-use chains (queries, clicked results) are logged. Release agent intermediate steps, search queries, and retrieved documents to enable source-level audits.
- Missing error/missingness analysis: online agents had higher error rates; retries may bias samples. Report per-model/day missingness, retry rates, and any selection bias in final datasets.
- System prompt effects unquantified: custom system prompt (“answer in ≤100 words”) may alter safety posture and content. Ablate system prompts to estimate their causal effect on refusals, tone, and content.
Safety, Policy, and Misuse Evaluation Gaps
- Limited coverage of prohibited political-use cases: no systematic tests of targeted persuasion, voter suppression, tailored misinformation, or policy circumvention. Construct policy-aligned adversarial suites and measure compliance and leakage.
- No analysis of source diversity and media bias in retrieval: “online” systems may over-index certain outlets. Quantify outlet distribution, ideological skew, and citation redundancy across responses.
- No assessment of content provenance and watermarking: cannot distinguish AI-generated references from human sources. Evaluate provenance signals and detect synthetic-to-synthetic feedback loops.
- Lack of multimodal threat surface coverage: voice, image, and video modalities (e.g., deepfaked candidate audio) are out of scope. Extend to multimodal prompts and cross-modal risk evaluations.
Open Methodological Questions
- How to design longitudinal audits that separate endogenous updates from exogenous news shocks with causal clarity?
- What is a robust, validated framework for inferring “model beliefs” from indirect answers, and how to measure internal self-consistency across related queries?
- Which steering signals (explicit vs implicit, one-shot vs multi-turn) most strongly influence election-related outputs under modern safety policies?
- How should neutrality and fairness be operationalized for election guidance across demographics, and what trade-offs exist with helpfulness and personalization?
- What standards and infrastructure (version pinning, audit logs, provider disclosures) are needed to make election-season LLM audits reproducible and policy-relevant at scale?
Practical Applications
Overview
The paper provides a blueprint, dataset, and analyses for large-scale, longitudinal auditing of LLMs during sensitive public events (here, the 2024 US election). Its methodology—structured question taxonomy, prompt variations for steerability and instruction-following, online/offline model comparisons, and time-series response analysis—yields immediately deployable workflows for industry, academia, media, and policy. The dataset and apparatus can be generalized to other domains (public health, finance, education) where LLM behavior impacts real-world decision-making.
Immediate Applications
Below is a set of actionable applications that can be deployed now, leveraging the paper’s dataset, methods, and tooling patterns.
- Industry (software platforms, AI providers): Continuous LLM election-safety audits
- What: Stand up a monitoring pipeline using the paper’s taxonomy and prompt-variation harness to track drift, guardrail efficacy, and refusal patterns across models and updates.
- Sector: Software/platform governance, compliance.
- Tools/products/workflows: “LLM Election Monitor” dashboard; “Behavior Drift Alerts” (embedding-based similarity over time); “Guardrail Efficacy Tracker” (refusal-rate analytics; a minimal sketch follows this block).
- Assumptions/dependencies: API access; reproducible settings (low temperature, token limits); capacity to store and analyze responses; embedding choice affects drift signals.
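A refusal-rate tracker of the kind listed above can start from a simple daily tally per model. The refusal phrases and record layout below are illustrative assumptions, not the paper's classifier; a production tracker would use a validated refusal detector.

```python
# Minimal sketch of refusal-rate analytics per model and day. The refusal
# phrases and record format are illustrative assumptions.
from collections import defaultdict

REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot provide",
    "i'm not able to discuss",
    "unable to answer questions about the election",
)

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rates(records):
    """records: iterable of dicts with 'model', 'date', and 'response' keys."""
    counts = defaultdict(lambda: [0, 0])  # (model, date) -> [refusals, total]
    for r in records:
        key = (r["model"], r["date"])
        counts[key][0] += is_refusal(r["response"])
        counts[key][1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}

# Example with hypothetical records
sample = [
    {"model": "gpt", "date": "2024-10-01", "response": "I can't help with that."},
    {"model": "gpt", "date": "2024-10-01", "response": "Early voting begins in..."},
]
print(refusal_rates(sample))  # {('gpt', '2024-10-01'): 0.5}
```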
- Industry (consumer AI products): Prompt-safety instrumentation for politically sensitive queries
- What: Implement “Election Query Safety Mode” that detects politically sensitive topics and constrains outputs; suppress demographic steering (“I am [Democrat]/[Hispanic]”) effects noted by the paper.
- Sector: Consumer AI, search/chat products.
- Tools/products/workflows: Prompt preprocessors; safety policies; refusal and neutral reasoning templates; logging for post-hoc audits.
- Assumptions/dependencies: Clear taxonomy of sensitive queries; UX to communicate refusals and provide trusted sources; policy and localization.
- Media and fact-checking organizations: LLM behavior tracking and misinformation watch
- What: Use the released dataset and querying pipeline to compare model outputs against trusted sources on election topics; flag shifts in responses and likely misinformation.
- Sector: Newsrooms, fact-checkers, civil society.
- Tools/products/workflows: “LLM Misinformation Watch” alerts; automated cross-check against curated knowledge bases; human-in-the-loop review queues.
- Assumptions/dependencies: Access to editorially controlled ground-truth repositories; process for rapid corrections; legal review for fair-use of model outputs.
- Academia (computational social science, NLP): Bias and drift measurement at scale
- What: Extend the longitudinal analyses (RQ1–RQ4) to quantify model drift, demographic steering sensitivity, and implicit beliefs; replicate across geographies and model families.
- Sector: Research.
- Tools/products/workflows: Open research benchmarks using the HF dataset; embedding-based change detection; steerability metrics; refusal-rate diagnostics per question category.
- Assumptions/dependencies: Stable API or model snapshots; clear statistical controls; ethical review for political research.
- Policy and regulators (FEC, FCC, NIST, state AGs): Audit-ready reporting requirements
- What: Define transparency expectations for election-related model behavior (e.g., periodic audits of refusal rates, guardrail changes, drift); encourage reporting of unannounced updates.
- Sector: Public policy, regulation.
- Tools/products/workflows: Standard “Event-Sensitive LLM Audit” template; reporting interfaces; early-warning channels for behavior shifts during elections.
- Assumptions/dependencies: Jurisdictional authority; standardized metrics; cooperation from providers; privacy-preserving reporting frameworks.
- Enterprise compliance (finance, insurance, telecom): Demographic steerability scans for customer-facing bots
- What: Test whether demographic self-identification triggers differential advice or tone; implement neutral-mode guardrails similar to election guardrails.
- Sector: Finance, insurance, telecom customer service.
- Tools/products/workflows: “Steerability Scanner” (prompt variations for demographics; a minimal scan sketch follows this block); compliance reports; remediation playbooks.
- Assumptions/dependencies: Legal interpretations of discrimination; precise scope of protected classes; alignment with internal ethics policies.
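A steerability scan of the kind listed above might prefix a base question with demographic self-descriptors and measure how far each response moves from the unsteered baseline. The `query_model` callable is a placeholder for any provider API, and the embedding model is an illustrative choice, not the paper's.

```python
# Minimal sketch of a demographic steerability scan: prefix a base question with
# self-descriptors and measure how far each response moves from the unsteered
# baseline. `query_model` is any callable wrapping a chat API; the embedding
# model is an illustrative choice.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

DESCRIPTORS = ["I am a Democrat.", "I am a Republican.", "I am Hispanic."]

def steerability_scan(query_model, base_question: str) -> dict[str, float]:
    """query_model: callable mapping a prompt string to a response string."""
    baseline = embedder.encode(query_model(base_question), normalize_embeddings=True)
    shifts = {}
    for descriptor in DESCRIPTORS:
        steered = embedder.encode(
            query_model(f"{descriptor} {base_question}"), normalize_embeddings=True
        )
        # Cosine distance from the unsteered response: 0 = identical, larger = more steered
        shifts[descriptor] = float(1.0 - np.dot(baseline, steered))
    return shifts

# Usage: steerability_scan(my_api_wrapper, "How important is it to vote this year?")
```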
- Public-interest tech and civil society: Voter literacy tools
- What: Build lightweight browser extensions or chatbot wrappers that warn users when queries are politically sensitive and route them to curated sources.
- Sector: Daily life, civic tech.
- Tools/products/workflows: “Political Query Shield” extension; curated links to public broadcasters and election bureaus; simple guidance prompts.
- Assumptions/dependencies: Source curation; consent UX; non-intrusive defaults.
- Product management and vendor selection: Evidence-based LLM procurement
- What: Use observed refusal rates, sensitivity to steering, and drift profiles to select LLM vendors for election-adjacent use cases.
- Sector: Software procurement, IT.
- Tools/products/workflows: Comparative scorecards; risk acceptance matrices; SLAs requiring event-aware behaviors.
- Assumptions/dependencies: Up-to-date vendor data; internal risk appetite; budget for model redundancy.
- Health and crisis communications: Rapid repurposing of the apparatus to public health topics
- What: Apply the same longitudinal taxonomy to track LLM behavior during health emergencies (vaccines, outbreak response).
- Sector: Healthcare, public health communication.
- Tools/products/workflows: “Health Crisis AI Monitor”; medical misinformation watchlists; escalation to domain experts.
- Assumptions/dependencies: Clinical source-of-truth repositories; coordination with health authorities; domain-specific guardrails.
Long-Term Applications
The following applications require further research, standardization, scaling, or policy development before broad deployment.
- AI governance standards and certification (e.g., “Election-Safe LLM”)
- What: Develop a certification regime with standardized tests for neutrality, refusal behavior, steerability, and drift during elections and other high-stakes events.
- Sector: Policy, standards bodies (NIST/ISO), industry consortia.
- Tools/products/workflows: Conformance test suites; continuous audit feeds; public registries of certified models.
- Assumptions/dependencies: Multi-stakeholder consensus; interoperable metrics; international coordination.
- Adaptive, event-aware guardrail systems
- What: Guardrails that adjust safely to exogenous events (debates, scandals) while maintaining neutrality and helpfulness; dynamic safety policies driven by monitoring signals.
- Sector: AI safety engineering.
- Tools/products/workflows: Closed-loop monitoring → policy update pipelines; human governance boards; rollback mechanisms.
- Assumptions/dependencies: Robust change detection; governance processes; avoiding overfitting to media trends.
- Belief consistency metrics and training-time constraints
- What: Use multi-question inference (like the exit-poll linear system) to measure and improve internal consistency; incorporate consistency regularizers in training or post-training.
- Sector: AI research and model training.
- Tools/products/workflows: Consistency validators; synthetic counterfactual testbeds; training loss terms for belief coherence.
- Assumptions/dependencies: Access to training or alignment pipelines; acceptance of new objectives vs. utility trade-offs.
- Public opinion estimation frameworks with calibrated LLMs
- What: Combine LLM outputs with real polling and demographic models to produce calibrated, transparent estimates—explicitly not raw forecasts.
- Sector: Social science, polling analytics, media.
- Tools/products/workflows: Hybrid models; uncertainty quantification pipelines; public methodology disclosures.
- Assumptions/dependencies: High-quality ground truth; careful bias corrections; clear disclaimers on limits of LLMs.
- Cross-national and cross-lingual election monitoring networks
- What: Scale the methodology to multiple countries and languages; create distributed observatories of LLM behavior during global elections.
- Sector: International NGOs, academia, policy.
- Tools/products/workflows: Federated data collection; localized taxonomies; translation and cultural review.
- Assumptions/dependencies: Language coverage; local legal constraints; funding and governance.
- Sectoral expansion to other high-stakes domains (finance, education, energy)
- What: In finance, monitor investment-advice bots during market stress; in education, ensure neutrality in civics curricula; in energy, keep crisis communications neutral during outages.
- Sector: Finance, education, energy/utilities.
- Tools/products/workflows: Domain-specific taxonomies; event calendars; regulator-facing dashboards.
- Assumptions/dependencies: Domain ground truth; regulatory harmonization; privacy and safety requirements.
- Open-source “LLM Behavior Observatory”
- What: A community-run platform offering reusable taxonomies, prompt-variation harnesses, agents with search augmentation, and longitudinal datasets.
- Sector: Open-source ecosystem, research.
- Tools/products/workflows: Data pipelines (SGLang/LangChain), search adapters (Serper/ValueSerp), embeddings for drift analysis, standard report formats; a minimal search-augmented query sketch follows this block.
- Assumptions/dependencies: Sustained maintainers; data governance; compute and API credit sustainability.
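A minimal building block for such an observatory is a search-augmented query in the spirit of the paper's "online" agents. The sketch below simply stuffs Serper results into the prompt rather than reproducing the paper's full LangChain agent; the model name, prompt wording, and word limit are illustrative, and it assumes OPENAI_API_KEY and SERPER_API_KEY are set.

```python
# Minimal sketch of a search-augmented query: retrieve Google results via
# SerperAPI and pass them to the model as context. This is not the paper's
# agent setup; model name and prompt wording are illustrative.
from langchain_openai import ChatOpenAI
from langchain_community.utilities import GoogleSerperAPIWrapper

llm = ChatOpenAI(model="gpt-4o", temperature=0)
search = GoogleSerperAPIWrapper()  # reads SERPER_API_KEY from the environment

def answer_with_search(question: str) -> str:
    snippets = search.run(question)  # Google result snippets as a string
    prompt = (
        "Using the search results below, answer in 100 words or fewer.\n\n"
        f"Search results:\n{snippets}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

print(answer_with_search("When does early voting start in Pennsylvania?"))
```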
- Regulatory data access and accountability mechanisms
- What: Policies granting regulators audited access to model behavior data and update logs during elections; clear reporting on unannounced changes.
- Sector: Governance, regulation.
- Tools/products/workflows: Secure data portals; legal frameworks; independent verification teams.
- Assumptions/dependencies: Provider cooperation; privacy protections; due process safeguards.
- Curriculum development for AI literacy in civics and media studies
- What: Integrate findings on LLM persuasion, refusal, and bias into education, teaching students critical engagement with AI.
- Sector: Education.
- Tools/products/workflows: Classroom modules; simulation exercises using the released dataset; teacher training materials.
- Assumptions/dependencies: Institutional buy-in; age-appropriate content; localization.
Cross-Cutting Assumptions and Dependencies
- Access and stability: Feasibility depends on consistent API access, clear model versioning/snapshots, and manageable rate limits.
- Costs and infrastructure: Longitudinal monitoring at scale requires budgets for tokens, search APIs, storage, and embeddings; smaller orgs may need shared infrastructure.
- Methodological choices: Embedding selection, token truncation, temperature settings, and prompt templates influence drift detection and comparability.
- Ground truth and evaluation: Many applications require curated sources or gold standards; without them, audits risk overfitting to model idiosyncrasies.
- Legal and ethical constraints: Political content, demographic testing, and data collection must align with privacy, anti-discrimination, and jurisdiction-specific regulations.
- Generalizability: While election-focused, methods must be adapted thoughtfully to domain-specific contexts (healthcare, finance) and non-US settings.
Glossary
- Counterfactuals: Hypothetical comparisons used to isolate effects by holding conditions constant. "Compared to red-teaming efforts, our choice to prompt the models on a fixed survey of questions allows us to study clean counterfactuals"
- Demographic steering: Influencing model responses by supplying demographic cues in the prompt. "illustrate the sensitivity of election-related responses to demographic steering"
- Electoral college vote: The allocation of electors’ votes determining the U.S. presidential winner. "election outcomes, including the popular vote, electoral college vote, and exit polls"
- Endogenous questions: Prompts designed so responses should be stable over time and reflect model-internal changes. "We refer to 'endogenous' questions as those that should not reflect the time-dependent state of the 2024 US election"
- Exogenous events: External developments that may influence model outputs over time. "observe the effects of exogenous events (e.g., developments in the election)"
- Exogenous questions: Prompts designed to reflect real-world temporal developments. "Analogously, 'exogenous' questions are constructed such that their responses may reflect events or 'shocks' that are external to the models and the process we use to query them"
- Exit polls: Post-vote surveys used to analyze voter behavior and attitudes. "These questions ask models to predict the results of exit polls by voter group"
- Flagship model checkpoint: The provider’s current default or main deployed model version. "we direct our queries to the 'flagship' model checkpoint"
- Guardrails: Safety controls intended to restrict or shape model behavior on sensitive topics. "many models refuse to do so, presumably due to election-related guardrails"
- Guardrailing: The process of adding and tuning safety constraints around model outputs. "post-training, prompting, and guardrailing often cause chatbots to behave differently"
- Hallucination: Model generation of incorrect or non-existent information with apparent confidence. "hallucinating non-existent information"
- Instruction following: A model’s ability to obey explicit directions embedded in the prompt. "the instruction-following capabilities of the models (i.e., how well they follow directions)"
- LangChain: A framework for enabling LLMs to use external tools and orchestrate multi-step workflows. "via LangChain (a library enabling LLMs to call external tools)"
- Longitudinal study: An analysis that tracks subjects or systems repeatedly over an extended period. "we conduct a large-scale, longitudinal study of 12 models"
- Online models: Models configured to access and use web search during inference. "We refer to models as 'online' if they have access to internet search"
- Post-training: Additional training or tuning after base model pretraining to shape behavior. "post-training, prompting, and guardrailing often cause chatbots to behave differently"
- Red-teaming: Adversarial probing to expose model failures or unsafe behaviors. "Compared to red-teaming efforts, our choice to prompt the models on a fixed survey of questions allows us to study clean counterfactuals"
- Response embeddings: Vector representations of outputs used to quantify and compare model behaviors. "We perform an analysis of response embeddings to quantify how LLM behaviors evolve over time across various question categories."
- Reward hacking: When models exploit objectives or evaluation metrics in unintended ways. "manipulate their behavior in a phenomenon known as reward hacking"
- Search-augmented LLM agents: LLM systems that integrate web search into multi-step reasoning pipelines. "we create search-augmented LLM agents using LangChain's ChatOpenAI, ChatAnthropic, and ChatVertexAI."
- SerperAPI: A service/API for programmatic Google Search queries. "From SerperAPI (which we used to equip models with search capabilities via LangChain)"
- SGLang: A library for efficient, batched LLM inference via APIs. "we use the SGLang library to process the questions in batches."
- Steerability: The degree to which prompts can influence or direct model responses. "Some prompt variations test the 'steerability' of the models based on demographic characteristics or political preference"
- Temperature (sampling): A randomness parameter in generation affecting diversity vs. determinism. "set the temperature of the model to 0.1"
- Tool use: The capability of LLM agents to invoke external tools (e.g., search) during inference. "The online models face errors more often than offline ones due to tool use"
- ValueSerp: An API used to retrieve Google Search results programmatically. "Finally, we use ValueSerp to query Google via API"
- Zero-temperature sampling: Deterministic generation by setting temperature to zero. "using zero-temperature sampling for reproducibility"