Synthetic Media Literacy Test

Updated 6 August 2025
  • The paper demonstrates that LLMs frequently fail to privilege credible sources, with models showing high hallucination rates even under explicit instructions.
  • The test evaluates source filtering by instructing models to ignore dubious inputs and resolve contradictions using metadata from both high-quality and synthetic sources.
  • The benchmark highlights that neither increased model size nor reasoning capability guarantees improved media literacy, underscoring a gap in effective source integration.

The Synthetic Media Literacy Test (SMeL Test) is a targeted benchmark designed to evaluate the capacity of LLMs to actively filter out untrustworthy information presented in a retrieval-augmented context. It operationalizes "media literacy" for artificial intelligence systems, specifically testing whether models can identify and privilege credible sources over unreliable or fabricated content, mirroring the process by which human researchers evaluate web-based information sources (Ahdritz et al., 4 Aug 2025).

1. Definition and Purpose

The SMeL Test is conceptualized as a minimal benchmark that probes an LLM’s ability to distinguish between reliable and unreliable or synthetic information in multi-source contexts. It is motivated by the observation that the internet contains a high prevalence of unattributed, deliberately misleading, or otherwise untrustworthy material. The test evaluates whether instruction-tuned models can apply media literacy heuristics, such as source assessment and information filtering, in tasks that simulate real-world retrieval-augmented generation (RAG) workflows.

The stated objective is to determine the extent to which models can perform selective information integration when exposed to both high-quality sources (e.g., encyclopedias, reputable news organizations) and low-quality or synthetic sources (e.g., fan fiction, manifestos, unmoderated forums).

2. Benchmark Structure and Methodology

The SMeL Test comprises three primary subtasks, each focusing on a distinct facet of synthetic media literacy:

Task | Challenge | Desired LLM Response
Ignoring dubious sources | Provided with an untrustworthy document and filler sources; a factual question is posed | Abstain (e.g., "I don't know"); avoid parroting the poor source
Resolving contradictions | Given a trustworthy and an unreliable source with conflicting factual claims | Choose the answer from the trustworthy source
Active filtering | Asked to summarize a topic from a set with decoys and red herrings | Summary reflects only information from reliable sources

For each task, prompts include explicit source metadata (e.g., URL, origin domain, or explicit reliability indicators) and instruct the model to disregard unreliable information. The test corpus is constructed from synthetic documents mimicking a range of domain styles, including Encyclopedia Britannica, New York Times, Wikipedia, Reddit, 4chan, fanfiction.net, and an "Unknown" uncontrolled category. Cross-validation is performed with real-world news articles from the ISOT Fake News Dataset to test generalization beyond synthetic examples.
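The paper's exact prompt templates are not reproduced here; the following is a minimal sketch, with hypothetical sources, URLs, and document text, of how a single "Ignoring dubious sources" trial could be assembled so that each passage carries explicit source metadata along with the instruction to disregard unreliable material:

```python
# Hypothetical trial; the sources, URLs, and text below are illustrative, not drawn from the benchmark.
documents = [
    {"source": "anonymous imageboard (unmoderated)",
     "url": "https://boards.example.net/x/thread/123",
     "text": "Insider post: the observatory was actually completed in 1979."},
    {"source": "filler source (off-topic travel blog)",
     "url": "https://blog.example.org/hilltop-walks",
     "text": "A diary entry describing a sunset walk past the observatory."},
]

question = "In what year was the observatory completed?"

# Assemble one prompt: an explicit filtering instruction, metadata-tagged passages, then the question.
prompt = (
    "You will be shown several retrieved passages, each tagged with source metadata.\n"
    "Ignore information from unreliable or dubious sources. If no trustworthy passage\n"
    'answers the question, reply "I don\'t know" rather than guessing.\n\n'
    + "\n\n".join(f"[Source: {d['source']} | URL: {d['url']}]\n{d['text']}" for d in documents)
    + f"\n\nQuestion: {question}"
)
print(prompt)
```

Under the benchmark's scoring, a completion that answers "1979" here would count toward the hallucination rate, since the only passage containing a year comes from the untrustworthy document; the desired behavior is to abstain.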

Performance is evaluated using "hallucination rates"—the percentage of cases in which a model incorporates or repeats untrustworthy information, fails to abstain, or answers in a way inconsistent with the authoritative source. The error metric is reported alongside 95% confidence intervals.

3. Model Performance and Key Results

The principal findings from the benchmark include:

  • No model consistently privileges trustworthy sources: Across all subtasks, LLMs, including both conventional and reasoning-enabled models, frequently hallucinate or repeat unreliable information. In "Ignoring dubious sources" tasks, many models exhibit hallucination rates close to or exceeding 90% (e.g., Llama 3.3 70B: 90.5% ± 2.3%), while the best-performing model (Gemini 2.5 Pro) reduced the rate to 37.3% ± 7.7%, still inadequate given the explicit instruction to ignore unreliable sources.
  • Reasoning capability improves performance but does not guarantee it: Models that produce explicit reasoning traces ("system 2" models) achieve superior performance on contradiction-resolution subtasks but still make systematic errors. For instance, in "Resolving contradictions," reasoning models consistently outperformed non-reasoning variants, yet hallucination rates remained non-negligible.
  • Discrepancy between explicit source knowledge and answer behavior: A measurable and persistent gap exists between the model's internal source assessments ("system 2" judgments), where it can articulate which document is most trustworthy, and its actual answer, which often relies on snippets from lower-quality or manipulated sources. This reflects an incomplete synthesis of source evaluation and final answer generation.
  • Model scaling does not guarantee improved media literacy: Increasing model size within a family does not systematically reduce error rates. There are cases where larger, more capable models hallucinate as frequently as smaller models, contradicting expectations regarding scaling laws for media literacy behaviors.

A representative performance summary for "Ignoring dubious sources" is shown in the following table:

Model | Hallucination Rate (4chan-like sources)
Llama 3.3 70B | 90.5% ± 2.3%
Gemini 2.5 Pro | 37.3% ± 7.7%

4. Statistical and Computational Characteristics

The SMeL Test employs standard statistical procedures for proportion estimation, reporting error rates and confidence intervals for each model/task pairing:

  • Error/hallucination rate: Measured as the number of erroneous completions divided by the total number of trials.
  • Confidence intervals: 95% CIs based on the standard error of the estimated proportion (a minimal computation sketch follows this list).
  • No complex mathematical formalism is required for scoring; the paper reports rates and confidence intervals in tabular form for empirical rigor.
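As a concrete illustration of these two quantities (not code from the paper), the sketch below computes an error rate and a normal-approximation 95% confidence interval from a list of per-trial pass/fail judgments; the trial counts are invented for the example:

```python
import math

def hallucination_rate_with_ci(outcomes, z=1.96):
    """Return the hallucination rate and a 95% normal-approximation CI.

    `outcomes` holds one boolean per trial: True if the model incorporated or
    repeated untrustworthy information (or otherwise failed the trial).
    """
    n = len(outcomes)
    p = sum(outcomes) / n            # erroneous completions / total trials
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p, (p - z * se, p + z * se)

# Invented example: 30 hallucinated trials out of 100.
rate, (lo, hi) = hallucination_rate_with_ci([True] * 30 + [False] * 70)
print(f"{rate:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```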

The paper also uses argmax/argmin operators to formally define answer selection in the presence of conflicting documents, although these expressions are not themselves evaluation metrics.
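The paper's exact expressions are not reproduced here; one plausible rendering of such a selection rule, writing trust(d) for an assumed credibility score over the retrieved set D, is:

```latex
\hat{a} \;=\; \mathrm{answer}\!\left( \operatorname*{arg\,max}_{d \in D} \, \mathrm{trust}(d) \right)
```

That is, the desired completion reports the answer supported by the most trustworthy retrieved document, with abstention as the desired behavior when even that document is unreliable (as in the first subtask).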

5. Implications for Media Literacy, LLM Development, and Application

The SMeL Test provides evidence that instruction-tuned LLMs—even when prompted explicitly—struggle to integrate source-aware reasoning into their final outputs. This has several implications:

  • Hallucination is a persistent failure mode: Even when models "know" which sources are better, they often fail to integrate this knowledge into their generative pipeline, leading to dangerous propagation of misinformation or synthetic content. This form of hallucination is critical in retrieval-augmented generation and agentic browsing applications.
  • Reasoning mechanisms ameliorate but do not fully solve the problem: Models generating explicit reasoning traces are more likely to articulate correct source assessments, yet this is not reliably reflected in answers, highlighting the need for deeper alignment between model reasoning and generation modules.
  • Scaling and capability alone are insufficient: Improvements in scale or fine-tuning do not guarantee improvements in media literacy skills as measured by selective information integration or filtering capacity.
  • Benchmark as design guidance: The SMeL Test can reveal whether LLM interventions for media literacy (such as metadata conditioning, long-context anchoring, or source-aware retrievers) actually reduce error rates in a robust, quantifiable fashion.
  • Directions for future research: The authors suggest that future model training should directly address the synthesis of explicit source knowledge with final output, possibly via conditional pretraining on document metadata or improved multi-step instruction following. Extending the SMeL Test to more complex RAG settings and live web search scenarios is also a possible avenue.

6. Limitations and Considerations

Several limitations are identified in the test and its application:

  • Synthetic test corpus caveat: The benchmark relies heavily on synthetic documents emulating different domain styles, raising the question of whether models are sensitive to style rather than content. Results on real news datasets confirm similar patterns but cannot eliminate all domain adaptation concerns.
  • Simplicity of tasks: The benchmark focuses on isolated factual queries and brief document sets with a maximum of two conflicting sources. It has yet to be validated for more complex, multi-document, context-rich tasks that characterize actual web research or open-ended summarization.
  • Discrepancy between explicit source assessments and answer selection: The benchmark exposes a unique failure mode ("system 1" hallucination even in the presence of "system 2" knowledge) not directly addressed by conventional QA or summarization metrics.

7. Significance and Broader Impact

The introduction of the SMeL Test marks a step toward systematic and mechanistic evaluation of an AI agent's synthetic media literacy, operationalizing critical reading and source assessment at scale for LLMs. It specifically measures whether LLMs can:

  • Ignore or abstain from using information from low-quality or dubious sources,
  • Correctly resolve factual conflicts in the presence of contradictory evidence based on source credibility,
  • Actively filter contributions from unreliable documents in multi-source summaries.

This targeted evaluation is of central importance in designing safer and more robust AI assistants, particularly those deployed in RAG configurations, automated research assistants, and social media monitoring. Its design highlights the fundamental gap between explicit reasoning and output behavior in current models and establishes an empirical basis for future training and evaluation schemes focused on trustworthy information integration.

By exposing this form of hallucination, the SMeL Test provides a critical diagnostic tool as well as a development target for the next generation of trustworthy, source-aware LLMs (Ahdritz et al., 4 Aug 2025).

