
Search Arena: LLM Benchmark

Updated 21 November 2025
  • Search Arena is a large-scale, open-source benchmark for evaluating search-augmented LLMs using rich, multi-turn conversational interactions.
  • It integrates extensive multilingual data, human preference votes, and detailed system traces to assess the groundedness and credibility of retrieval-augmented generation.
  • The dataset supports research on grounding, intent-aware response generation, and trust in LLM-mediated answer citation, drawing on real-world usage data.

Search Arena is a large-scale, open-source benchmark designed for systematic evaluation and analysis of search-augmented LLMs. The dataset uniquely combines extensive, multilingual multi-turn human–model interactions, explicit human preference annotations, detailed system traces, and rich metadata, addressing key gaps in previous LLM evaluation resources. Its explicit focus is on the groundedness, credibility, and usability of LLM outputs in information-seeking settings where web search is integrated into the model's inference loop. Through its scale, diversity, and annotation protocols, Search Arena enables principled studies of retrieval-augmented generation and human trust in LLM-mediated answer citation (Miroyan et al., 5 Jun 2025).

1. Composition, Scale, and Diversity

Search Arena comprises 24,069 user–model conversation sessions collected over a seven-week period, engaging 11,650 unique anonymized users from 136 countries. Of these, 22.4% are multi-turn conversations, with a total of approximately 60,000 user–system turns, reflecting realistic, dialogic interactions beyond static single-turn fact-checking.

The dataset encompasses 12,652 paired human preference votes ("battles") across 13 search-augmented LLM variants. Topic coverage is broad, with nine annotated user intent categories (Factual Lookup, Information Synthesis, Analysis, Recommendation, Explanation, Creative Generation, Guidance, Text Processing, Other) and clusters such as Technology Comparisons (22%), Market Analysis (12%), and Entertainment (10%) emerging from topic modeling (BERTopic + GPT-4). Multilinguality is a core attribute, with 71 prompt languages identified (English 58.3%, Russian 11.8%, Chinese 7.0%, and 11% as code-mixed/multilingual) (Miroyan et al., 5 Jun 2025).

2. Data Structure, Annotation Fields, and Example

Each JSON record in Search Arena is anchored by the conversation-level context, preference annotation, intent taxonomy, and multi-turn transcript with granular trace logs. The essential fields are as follows:

  • conversation_id (string)
  • user_id (anonymized string)
  • models: {A: model_name, B: model_name}
  • vote: {"A", "B", "Tie", "Both_bad"}
  • timestamp_start / timestamp_end (ISO 8601)
  • intent: {primary: label, secondary: label or "Unassigned"}
  • turns: array of {turn_index, speaker, message, search_results, citations, trace_log}
  • metadata: {prompt_language, prompt_length, topic_cluster, model_configuration}

Each model response includes the raw message, an array of search results ({url, title, snippet}), extracted citations (URL list), and, where available, a reasoning process log.

Example snippet:

{
  "conversation_id": "1234-abcd",
  "models": { "A": "sonar-pro-high", "B": "gemini-2.5-pro-grounding" },
  "vote": "A",
  "intent": { "primary": "Guidance", "secondary": "Unassigned" },
  "turns": [
    { "turn_index": 1, "speaker": "user", "message": "How do I install Docker on Ubuntu?" },
    { "turn_index": 2, "speaker": "A",
      "message": "First update... [1][2]...",
      "search_results": [ { "url":"https://docs.docker.com/engine/install/ubuntu/", "title":"Install Docker on Ubuntu", "snippet":"Step-by-step..." } ],
      "citations": ["https://docs.docker.com/..."],
      "trace_log": "Retrieved 5 pages → filtered → synthesized"
    },
    { "turn_index": 2, "speaker": "B", "...": "..." }
  ]
}
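
As an illustration, the following minimal Python sketch walks one such record and collects the cited URLs per assistant side. It assumes the field names shown in the example above; the released dataset's exact column layout may differ.

import json

# One conversation record in the schema illustrated above (abbreviated, illustrative values).
raw_json = """
{ "conversation_id": "1234-abcd",
  "vote": "A",
  "turns": [
    {"turn_index": 1, "speaker": "user", "message": "How do I install Docker on Ubuntu?"},
    {"turn_index": 2, "speaker": "A", "message": "First update... [1][2]...",
     "citations": ["https://docs.docker.com/engine/install/ubuntu/"]},
    {"turn_index": 2, "speaker": "B", "message": "...", "citations": []}
  ]
}
"""
record = json.loads(raw_json)

# Collect cited URLs per assistant side ("A" / "B"), skipping user turns.
citations_by_model = {}
for turn in record["turns"]:
    if turn["speaker"] == "user":
        continue
    citations_by_model.setdefault(turn["speaker"], []).extend(turn.get("citations", []))

print(citations_by_model)
# {'A': ['https://docs.docker.com/engine/install/ubuntu/'], 'B': []}
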
All preference modeling is formalized using the Bradley–Terry model to infer system rankings:

P(\text{Model } i \succ j) = \frac{\exp(\theta_i)}{\exp(\theta_i) + \exp(\theta_j)}
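
A minimal sketch of how such strengths can be estimated from raw win/loss records with the standard MM (Zermelo) fixed-point iteration is shown below. The battles and model names are invented for illustration, and the released leaderboard utilities may use a different fitting procedure (e.g., a logistic-regression formulation):

from collections import defaultdict

# Illustrative (winner, loser) battles; ties and "Both_bad" votes are simply dropped here.
battles = [
    ("sonar-pro-high", "gemini-2.5-pro-grounding"),
    ("sonar-pro-high", "gpt-4o-search"),
    ("gemini-2.5-pro-grounding", "gpt-4o-search"),
    ("gemini-2.5-pro-grounding", "sonar-pro-high"),
]

models = sorted({m for pair in battles for m in pair})
wins = defaultdict(int)          # total wins per model
games = defaultdict(int)         # battles per unordered model pair
for winner, loser in battles:
    wins[winner] += 1
    games[frozenset((winner, loser))] += 1

strength = {m: 1.0 for m in models}              # p_i = exp(theta_i), identified up to scale
for _ in range(200):                             # MM (Zermelo) fixed-point iteration
    new_strength = {}
    for i in models:
        denom = sum(games[frozenset((i, j))] / (strength[i] + strength[j])
                    for j in models if j != i and games[frozenset((i, j))] > 0)
        new_strength[i] = wins[i] / denom if denom > 0 else strength[i]
    total = sum(new_strength.values())
    strength = {m: s / total for m, s in new_strength.items()}

print(sorted(strength.items(), key=lambda kv: kv[1], reverse=True))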

3. Annotation Methodology and Human Preference Protocol

Human annotation in Search Arena leverages crowd-sourcing via a web interface (Chatbot Arena UI), presenting users with two anonymized model outputs per session. Users vote "A", "B", "Tie", or "Both bad" at any point in the conversation, with side randomization to suppress positional bias. Intent annotation was initially seeded by co-author agreement (Cohen's κ ≈ 0.65 for single labels and 0.79 for top-2 labels on 100 prompts), then expanded to all ~24,000 prompts via GPT-4.1, with validation on multilingual samples (Cohen's κ = 0.812 for top-2 labels). About 50% of conversations received at least one vote, yielding 12,652 human preference judgments. All user data undergoes strict de-identification (Google DLP compliance), and no pre-release data was shared with model providers (Miroyan et al., 5 Jun 2025).
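
For reference, agreement statistics of this kind can be computed with scikit-learn's cohen_kappa_score; the two label lists below are invented for illustration and are not taken from the dataset:

from sklearn.metrics import cohen_kappa_score

# Intent labels assigned by two annotators to the same five prompts (illustrative only).
annotator_1 = ["Factual Lookup", "Guidance", "Analysis", "Recommendation", "Guidance"]
annotator_2 = ["Factual Lookup", "Guidance", "Explanation", "Recommendation", "Guidance"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")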

4. Citation Analysis, Attribution, and Trustworthiness

Search Arena systematically records model-generated citations—average 5–7 per response—categorizing cited domains into nine groups (news-US, foreign news, wiki, tech/code, community blogs, social media, gov/edu, academic, retail).
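
A simple way to bucket cited URLs into coarse source groups of this kind is sketched below. The keyword-to-category table is an illustrative approximation, not the paper's exact categorization rules:

from urllib.parse import urlparse

# Illustrative mapping from registrable domains to coarse source categories.
DOMAIN_CATEGORIES = {
    "wikipedia.org": "wiki",
    "github.com": "tech/code",
    "stackoverflow.com": "tech/code",
    "reddit.com": "social media",
    "nytimes.com": "news-US",
    "arxiv.org": "academic",
    "amazon.com": "retail",
}

def categorize(url: str) -> str:
    host = urlparse(url).netloc.lower()
    for domain, category in DOMAIN_CATEGORIES.items():
        if host == domain or host.endswith("." + domain):
            return category
    if host.endswith(".gov") or host.endswith(".edu"):
        return "gov/edu"
    return "other"

print(categorize("https://en.wikipedia.org/wiki/Docker_(software)"))  # wiki
print(categorize("https://docs.docker.com/engine/install/ubuntu/"))   # other (not in the table)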

For approximately 100 conversations per intent category, cited URLs were automatically scraped. An LLM-driven pipeline decomposed each model output into claim–citation pairs, each labeled as Support, Irrelevant, or Contradict. Metrics from Bradley–Terry analyses quantify the effect of citation presence and domain:

Metric                     Effect coefficient (β)    Statistical significance
Num. citations             +0.209                    p < 0.01
Tech/platform              +0.073                    95% CI (bootstrapped)
Community blogs            +0.061                    95% CI (bootstrapped)
Social media               +0.057                    95% CI (bootstrapped)
Wikipedia                  –0.071                    95% CI (bootstrapped)
Supporting attribution     +0.29                     95% CI (bootstrapped)
Irrelevant attribution     +0.27                     95% CI (bootstrapped)

Presence of any citation substantially boosts perceived credibility—regardless of actual support for the claim—highlighting a key trustworthiness concern intrinsic to retrieval-augmented LLMs (Miroyan et al., 5 Jun 2025). This effect varies by domain, with a notable preference for community-centric sources and a relative penalty for Wikipedia.
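
The bootstrapped-CI style of analysis behind the table above can be approximated as follows. This sketch fits a logistic model on the difference in citation counts between the two responses and bootstraps the coefficient over battles; the data are synthetic, and the paper's actual Bradley–Terry model with domain and style controls is richer:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# Difference in citation counts between the two responses in each battle (A minus B).
delta_citations = rng.integers(-5, 6, size=n).astype(float).reshape(-1, 1)
# Synthetic outcomes: more citations modestly increase the chance that A is preferred.
p_a_wins = 1.0 / (1.0 + np.exp(-0.2 * delta_citations[:, 0]))
y = rng.binomial(1, p_a_wins)                    # 1 = vote "A", 0 = vote "B"

def citation_coef(X, y):
    return LogisticRegression(C=1e6).fit(X, y).coef_[0, 0]   # near-unregularized fit

point_estimate = citation_coef(delta_citations, y)
boot = []
for _ in range(200):                             # bootstrap resampling over battles
    idx = rng.integers(0, n, size=n)
    boot.append(citation_coef(delta_citations[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"citation-count effect: {point_estimate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")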

5. Experimental Protocols and Key Evaluation Findings

To evaluate model robustness across environments, a "cross-arena" experiment compared search-augmented and standard LLMs in both search- and chat-oriented settings ("Search Arena" and "Text Arena"). In both arenas, Gemini-2.5 Pro Experimental was deployed with or without web search.

Text Arena (544 battles):

  • 45% ties, 26% search-preferred, 28% non-search (overall p = 0.244)
  • Factual Lookup: search advantage (p = 0.012)
  • Information Synthesis: search trended positive (p = 0.095)
  • Text Processing: non-search favored (p = 0.077)

Search Arena (315 battles):

  • 31% ties, 40% search-preferred, 29% non-search (p = 0.009)
  • Factual Lookup: strong search advantage (p = 5.8 × 10⁻⁵)
  • Information Synthesis: search advantage (p = 0.092)

Results indicate that search-augmented LLMs do not suffer in standard chat settings and benefit substantially for information-seeking tasks where recent or authoritative content is crucial. By contrast, LLMs relying only on parametric memory underperform in search-intensive settings.
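
Per-category significance of this kind can be approximated with a two-sided binomial (sign) test over non-tie battles, as sketched below. The counts are illustrative round numbers derived from the reported percentages, and the paper's exact statistical procedure may differ:

from scipy.stats import binomtest

# Approximate non-tie counts from the Search Arena split (~40% vs. ~29% of 315 battles).
search_wins, non_search_wins = 126, 91
result = binomtest(search_wins, search_wins + non_search_wins, p=0.5)  # two-sided by default
print(f"two-sided p-value: {result.pvalue:.4f}")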

6. Access, Data Availability, and Licensing

Search Arena is fully open-sourced: the dataset is released under a CC-BY license, and the accompanying code is publicly available.

Researchers can load the dataset as follows:

# Load the full Search Arena dataset from the Hugging Face Hub
from datasets import load_dataset
ds = load_dataset("lmarena-ai/search-arena-24k")

Python utilities for parsing, leaderboard analysis (Elo, Bradley–Terry), and citation trace extraction are bundled. All records are anonymized and follow the schema described above (Miroyan et al., 5 Jun 2025).

7. Applications and Limitations

Search Arena enables benchmarking of search-augmented, multilingual LLMs in context-rich, multi-turn, and user-grounded settings. It supports studies of human preference drivers (citation count, domain, answer style), modeling of trust in LLM output, and development of more robust attribution protocols. Its comprehensive traces (search results, reasoning logs, user intents) facilitate research in LLM grounding, intent-aware response generation, and fine-tuning for improved information quality.

Limitations include possible self-selection bias in crowd-sourced voting, incomplete coverage of all domain-specific intents, and reliance on human click/vote proxies rather than direct factual correctness audits. Nevertheless, as the first large-scale, multi-intent, multi-language arena for search-augmented LLMs, it provides an authoritative foundation for further retrieval-augmented NLU research (Miroyan et al., 5 Jun 2025).

References

  • Miroyan et al. (5 Jun 2025).