Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries (2409.12640v2)

Published 19 Sep 2024 in cs.CL and cs.LG

Abstract: We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for LLMs which is also easy to automatically score. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries framework (LSQ) is to construct tasks which require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context LLM capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and b) that there is significant room for improvement in synthesizing long-context information.

Summary

  • The paper introduces a synthetic evaluation framework that uses Latent Structure Queries to rigorously test long-context reasoning in large language models.
  • It presents three novel tasks—Latent List, MRCR, and IDK—that go beyond simple retrieval to assess complex synthesis and co-reference capabilities.
  • Empirical results reveal varied performance trends with early degradation in many models, highlighting opportunities for enhancing long-context model architectures.

Analyzing "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries"

In "Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries," Kiran Vodrahalli et al. introduce Michelangelo, a synthetic evaluation designed to assess the long-context reasoning capabilities of LLMs. The benchmark is minimal, robust against leakage from pretraining data, and provides high-resolution diagnostics for several well-known state-of-the-art LLMs.

Overview of Long-Context Evaluation

The primary motivation behind this work is to move beyond traditional needle-in-a-haystack retrieval tasks and establish benchmarks that rigorously test models' abilities to synthesize and reason across lengthy contexts. Michelangelo introduces three novel evaluations based on the Latent Structure Queries (LSQ) framework: Latent List, Multi-Round Co-reference Resolution (MRCR), and IDK. Each of these tasks is constructed to demand more than just simple retrieval, addressing the complex relationships between tokens spread out across lengthy contexts.

Latent Structure Queries (LSQ) Framework

The LSQ framework builds tasks by treating long context sequences as latent structures that models must internalize and query effectively. The central premise is analogous to Michelangelo's practice of chiseling away excess marble to reveal the figure within: models must discard irrelevant context to extract and synthesize the relevant information. A toy construction in this spirit is sketched after the property list below.

Key Properties of LSQ:

  • Extendability: Tasks can be extended to arbitrary lengths without altering the inherent complexity.
  • Complexity-Indexed: Complexity is measured by the number of relevant updates embedded within the context.
  • Irrelevant Filler Similarity: Context includes realistic irrelevant fillers, minimizing out-of-distribution artifacts.
  • Orthogonality: The tasks measure orthogonal dimensions of synthesis capability, thus capturing broad aspects of long-context understanding.
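
To make these properties concrete, here is a minimal Python sketch of how an LSQ-style instance could be generated. It is an illustration under assumed details, not the authors' released generator: a handful of relevant updates to a simple, hypothetical latent key-value structure are interleaved with irrelevant filler, so context length can be extended arbitrarily (extendability) while complexity is indexed by the number of relevant updates.

```python
import random

# Hypothetical sketch of an LSQ-style task instance (not the authors' code):
# a few relevant updates to a simple latent key-value structure are buried
# in arbitrary amounts of irrelevant filler text.

def build_lsq_example(num_updates: int, num_filler: int, seed: int = 0):
    rng = random.Random(seed)
    keys = ["alpha", "beta", "gamma", "delta"]

    # Relevant updates define the latent structure; filler lines extend the
    # context without changing task complexity.
    items = [("update", rng.choice(keys), rng.randint(0, 99)) for _ in range(num_updates)]
    items += [("filler", None, rng.randint(1000, 9999)) for _ in range(num_filler)]
    rng.shuffle(items)

    latent = {}  # ground-truth latent structure, built in context order
    lines = []
    for kind, key, value in items:
        if kind == "update":
            latent[key] = value
            lines.append(f"Set the value of {key} to {value}.")
        else:
            lines.append(f"Note {value}: nothing relevant to record here.")

    query_key = rng.choice(sorted(latent))
    question = f"After all instructions, what is the value of {query_key}?"
    return "\n".join(lines), question, latent[query_key]  # answer is auto-scorable

# Example: 16 relevant updates hidden among 2,000 filler lines.
context, question, answer = build_lsq_example(num_updates=16, num_filler=2000)
```

Because the answer is derived from the generator's own bookkeeping, such an instance can be scored automatically at any context length.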

Evaluation Tasks and Their Insights

  1. Latent List: The model must interpret a sequence of Python list operations. This task gauges a model's understanding of code and its capability to maintain and update intermediate state within very large contexts. Complexity is modulated by varying the number of operations that affect the final result (a toy construction is sketched after this list).
  2. MRCR: This task evaluates a model's memory and understanding of conversational context by requiring it to reproduce a specific earlier turn from a long multi-turn dialogue containing many similar requests. It is particularly valuable for assessing the synthesis and ordering capabilities of models. On this task, the authors observed significant performance degradation across models once context lengths exceeded certain thresholds (e.g., 32K tokens).
  3. IDK: Designed to check whether models can identify unanswerable questions, IDK juxtaposes answerable retrieval tasks with unanswerable distractors. This tests whether a model can recognize when the provided context does not contain enough information to give an appropriate response.
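
For the Latent List task, a simplified generator in the same spirit (construction details here are assumed, not copied from the paper) might emit a long sequence of Python list operations in which only a few affect the final list, with the ground truth obtained by executing the program:

```python
import random

# Simplified sketch of a Latent List-style instance (construction details
# are assumed here, not copied from the paper's generator): the model reads
# a long sequence of Python list operations, only some of which affect the
# final list, and must describe the resulting list.

def build_latent_list_example(num_relevant: int, num_irrelevant: int, seed: int = 0):
    rng = random.Random(seed)

    # Relevant chunks change the final list; irrelevant chunks append a
    # value and immediately pop it again, so they only pad the context.
    chunks = [[f"lst.append({rng.randint(0, 9)})"] for _ in range(num_relevant)]
    chunks += [[f"lst.append({rng.randint(100, 999)})", "lst.pop()"]
               for _ in range(num_irrelevant)]
    rng.shuffle(chunks)
    program = [op for chunk in chunks for op in chunk]

    # Ground truth is obtained by executing the same operations, which makes
    # instances of any length automatically scorable by exact match.
    namespace = {"lst": []}
    exec("\n".join(program), namespace)
    answer = namespace["lst"]

    prompt = ("lst = []\n" + "\n".join(program)
              + "\nWhat does lst contain after these operations?")
    return prompt, answer

# Example: 10 relevant operations hidden among 5,000 no-op append/pop pairs.
prompt, answer = build_latent_list_example(num_relevant=10, num_irrelevant=5000)
```

Because the ground truth is computed by execution, the number of relevant operations indexes complexity while the no-op pairs extend context length arbitrarily, in line with the paper's goal of easy automatic scoring.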

Empirical Analysis and Observations

The paper evaluates several top-performing models, including variants from the Gemini, GPT, and Claude families. Results are presented for both 128K and 1M context lengths, with a focus on model degradation patterns. Key findings include:

  • Task Performance Trends: Different models excel in different evaluations. For instance, GPT models perform notably well on Latent List, while Gemini models excel in MRCR tasks when context length grows beyond 8K tokens.
  • Early Degradation: Many models show significant performance drop-offs relatively early in the context window, often around 32K tokens.
  • Model Families and Performance: Several model families, particularly Gemini, demonstrated non-decreasing performance between 128K and 1M context lengths, suggesting robust long-context reasoning at the largest scales.

Implications and Future Directions

The Michelangelo benchmark presents significant practical and theoretical implications:

  • Better Model Comparison: Michelangelo's high-dimensional evaluation allows for a more nuanced comparison of LLM capabilities. This can better inform model selection for deployments requiring long-context reasoning.
  • Innovation in Model Architecture: The findings can drive innovation in model architectures aimed at improving reasoning over extended contexts.
  • Future Benchmark Extensions: The LSQ framework is generic and can be extended beyond the tasks introduced in this work. This creates a robust foundation for future comprehensive long-context evaluations.

Conclusion

Michelangelo marks a substantial step forward in the assessment of long-context reasoning capabilities in LLMs. By introducing the LSQ framework and a suite of corresponding tasks, the authors present a structured approach to evaluating synthesis and reasoning beyond mere information retrieval. The distinct performance patterns across tasks highlight the diversity of model capabilities and expose areas for improvement in modern LLMs. These contributions are instrumental for advancing the development and evaluation of models that must handle increasingly sophisticated tasks requiring long-context understanding.
