OLMoTrace: Tracing LM Outputs

Updated 31 July 2025
  • OLMoTrace is an open-source framework that maps language model outputs back to their original training data using maximal span matching.
  • The system employs a scalable extension of infini-gram indexing and parallel search techniques to achieve interactive tracing latency even at massive corpus scales.
  • OLMoTrace supports practical applications like fact checking, hallucination detection, and data provenance studies to enhance transparency in language models.

OLMoTrace is an open-source system built to trace the outputs of large LMs back to their full multi-trillion-token training data in real time. By efficiently locating verbatim spans of generated text in the training corpus and surfacing their sources, OLMoTrace provides unprecedented transparency into LM behavior, supporting fact verification, analysis of hallucination and creativity, and data provenance studies. The system is powered by an advanced, scalable extension of the infini-gram search engine, optimized for interactive latencies despite massive corpus scale (Liu et al., 9 Apr 2025).

1. System Architecture and Overview

OLMoTrace is an end-to-end framework that receives LM-generated text and returns span-level mappings to the original training data. The main operational steps are:

  1. Tokenization: Each LM output is parsed using a compatible tokenizer such as Llama-2’s.
  2. Maximal Span Matching: For every word-boundary output position, OLMoTrace computes the longest contiguous sequence (span) present verbatim in the indexed corpus.
  3. Corpus Retrieval: Each maximal span is used as a search key in the corpus index to retrieve all matching source document segments.
  4. Merging and Reranking: Overlapping and redundant matches from the training set are merged, and the most relevant sources are selected using BM25 scoring, ensuring high-quality provenance links.
  5. User Presentation: The interface, exemplified by the Ai2 Playground, allows users to inspect and highlight portions of generated text and navigate directly to supporting documents in the training dataset.
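The five steps above can be sketched end to end on a toy word-level corpus. This is a hypothetical simplification (naive substring search stands in for the suffix-array index, and reranking is reduced to keeping the longest spans), not the actual OLMoTrace implementation:

```python
def trace(output_words, corpus_words, top_k=3):
    """Toy trace: find maximal verbatim spans of the output that occur
    in the corpus, drop contained spans, and keep the longest matches."""
    corpus_text = " ".join(corpus_words)
    spans = []
    for start in range(len(output_words)):          # word-boundary positions
        end = start
        # extend the span while it still occurs verbatim in the corpus
        while end < len(output_words) and \
                " ".join(output_words[start:end + 1]) in corpus_text:
            end += 1
        if end - start >= 2:                        # keep multi-word matches only
            spans.append((start, end, " ".join(output_words[start:end])))
    # merge step: drop spans contained in a longer overlapping span
    spans.sort(key=lambda s: s[1] - s[0], reverse=True)
    kept = []
    for s in spans:
        if not any(s[0] >= k[0] and s[1] <= k[1] for k in kept):
            kept.append(s)
    return kept[:top_k]
```

For example, tracing "the space needle was built in seattle" against a corpus containing "the space needle was built for the 1962 world fair" surfaces the maximal shared span "the space needle was built".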

This workflow achieves clear, actionable mappings between generated content and its provenance, exposing the grounding or originality of a model’s output.

2. Scalable Indexing and Search

The core technical advance of OLMoTrace is its extension of the infini-gram search engine to efficiently index and search trillions of training tokens:

  • Suffix Array Indexing: The corpus is preprocessed into a global suffix array, sorting all suffixes lexicographically to enable fast exact match queries for arbitrary strings.
  • Parallel Maximal Prefix Search: For each output token boundary, a novel parallel algorithm issues a single suffix array query to find the maximal matching prefix. This reduces the naïve O(L²) search cost (for an output of length L) to O(L log N), where N is the size of the training corpus.
  • Efficient I/O and Storage: The suffix array and required metadata are deployed on high-IOPS SSDs, mitigating I/O bottlenecks and supporting fast query response.
  • Span Filtering: Only spans starting at word boundaries are eligible. The span unigram probability P(span) = ∏_{t ∈ span} p(t) (where p(t) is the pretrained unigram probability of token t) is used to prefer longer, less common spans. The system retains the top K ≈ 0.05·L spans for presentation.

This design delivers interactive tracing latency (about 4.5 seconds for a 450-token output), scaling to multi-trillion-token training sets.

3. Real-time Tracing Performance

OLMoTrace’s real-time capability arises from several synergistic optimizations:

  • Tokenwise Parallelization: Each candidate starting position in the LM output is traced for maximal matching independently and in parallel, amortizing processing time.
  • Optimized Data Structures: The suffix array structure enables rapid binary search for span matching, while SSD-backed storage allows for high-throughput, low-latency access.
  • Fast Reranking: After matches are surfaced, BM25 scoring is applied to select and prioritize relevant source documents for each matched span. BM25 scores are normalized to the LM output length (maximum score ≈ 0.18 × the number of response characters) and bucketed for interpretability.
  • End-to-end Latency: The aggregate of parallelization, efficient indexing, and lightweight reranking yields consistently short tracing times even at scale.
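The normalization-and-bucketing step above can be sketched as follows. The 0.18 × response-length ceiling comes from the paper; the three bucket labels and the 0.33/0.66 thresholds are illustrative assumptions, not the system's actual cutoffs:

```python
def bucket_scores(raw_scores, response_chars):
    """Normalize raw BM25 scores against the maximum attainable score
    (~0.18 x response length in characters) and map each to a coarse
    relevance bucket. Bucket thresholds here are hypothetical."""
    max_score = 0.18 * response_chars
    out = []
    for s in raw_scores:
        frac = min(s / max_score, 1.0) if max_score > 0 else 0.0
        if frac < 0.33:
            out.append((frac, "low"))
        elif frac < 0.66:
            out.append((frac, "medium"))
        else:
            out.append((frac, "high"))
    return out
```

Bucketing hides raw score magnitudes from the user while still conveying how strongly each retrieved document supports the matched span.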

This enables interactive inspection of model responses in practical, research, and deployment settings.

4. Applications: Fact Checking, Hallucination Detection, and Model Analysis

OLMoTrace is designed for a broad range of diagnostic and interpretability-focused applications:

  • Fact Checking: When the model produces factual statements, OLMoTrace can highlight supporting (or absent) evidence from the training set, facilitating trustworthiness assessment (e.g., tracing “The Space Needle was built for the 1962 World's Fair” to its original context).
  • Creativity and Hallucination Analysis: The system identifies whether stylistically novel or “creative” generations are uniquely composed or merely regurgitations of training data, thus clarifying the boundaries of LM generalization.
  • Mathematical Solution Tracing: Steps in mathematical problem-solving, such as equations or combinatorial formulae, may be directly mapped to their appearances in the training corpus, revealing process reuse or independence.
  • Attribution and Data Provenance: Comprehensive tracking of generated output to source material supports attribution, auditing, and the study of dataset influence on LM responses.

This breadth of use-cases empowers both downstream application developers and researchers seeking to understand LM internals.

5. Technical Specifications

Key technical details, all instantiated or referenced in the original system design, structure the trace process:

Component           | Role                                     | Complexity / Formula
Tokenization        | Segment LM output for matching           | —
Suffix Array Search | Find maximal span matches                | O(L log N)
Span Probability    | Rank rarity/length of matched spans      | P(span) = ∏_{t ∈ span} p(t)
BM25 Reranking      | Score and bucket source document matches | score ≈ 0.18 × response chars
  • Eligibility: Only candidates at word boundaries are considered for matching.
  • Span Filtering and Thresholding: Lower span unigram probability (i.e., longer, less frequent) is preferred; only a small fixed fraction of spans are retained.
  • Parallelism: The design allows each output token (or word-boundary position) to be traced independently.
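The filtering and thresholding behavior can be sketched as below. Scoring spans by summed log unigram probability (so that lower probability, i.e., longer or rarer, spans rank first) and the K ≈ 0.05·L retention fraction follow the paper; the fallback probability for unseen tokens is an assumption of this sketch:

```python
import math

def filter_spans(spans, unigram_p, output_len):
    """Keep the rarest spans: rank each span by its log unigram
    probability (sum of per-token log probabilities; lower = rarer)
    and retain the top K ~= 0.05 * output length."""
    k = max(1, int(0.05 * output_len))

    def span_logp(span):
        # unseen tokens get a tiny fallback probability (assumption)
        return sum(math.log(unigram_p.get(t, 1e-9)) for t in span)

    return sorted(spans, key=span_logp)[:k]
```

With this scoring, a span of common function words is discarded in favor of one containing rare content words, matching the stated preference for longer, less frequent spans.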

These specifications make OLMoTrace applicable for scalable deployment and experimentation in both research and operational settings.

6. Open Source Availability and Integration

OLMoTrace is released under Apache 2.0, fully open-source and publicly available. This supports:

  • Reproducibility: The implementation can be inspected, tested, and modified.
  • Extensibility: Developers can integrate OLMoTrace as a module in their LM deployments, provided access to compatible training data.
  • Transparency: The community gains tools for independent scrutiny of LM outputs, fostering robust attribution methods and bias detection.

The public release encourages collaborative improvement, extension to custom LLMs, and adaptation for varied corpus structures.

7. Future Directions

Current and anticipated lines for further development include:

  • Hyperparameter Optimization: Refinement of span selection filters, BM25 scoring, and merging strategies to optimize mappings based on human judgment and formal evaluation.
  • Modality and Domain Extensions: Applying the tracing paradigm to non-textual modalities or domain-specific corpora, or generalizing to LMs with divergent architectures or training setups.
  • Advanced Attribution: Possibility of blending span-tracing with causal methods, such as influence functions, for deeper mechanistic understanding.
  • Ethical Safeguards: Deeper integration with PII/redaction, copyright recognition, and toxicity filtering to ensure responsible reporting of training data segments.
  • Interpretability Research: Incorporation into broader toolkits for LM transparency, trust, and explainable AI.

These directions position OLMoTrace as a diagnostic foundation for deepening empirical and theoretical understanding of LLMs and their training influences.
