LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation (2501.05414v2)

Published 9 Jan 2025 in cs.CL

Abstract: Existing benchmarks for evaluating long-context LLMs (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluated 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models achieve stronger overall performance in long-form generation, benefiting from long CoT training. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc.

Summary

  • The paper introduces LongProc, a benchmark designed to assess procedural generation across diverse, long-context tasks.
  • The paper shows that language models excel on short outputs but struggle with coherent, stateful reasoning on tasks up to 8K tokens.
  • The paper advocates refining attention mechanisms and state-tracking to enhance long-context model performance.

Benchmarking Long-Context LLMs on Procedural Generation Tasks: An Evaluation with LongProc

Advances in the context window capacities of long-context language models (LCLMs) necessitate a reevaluation of current benchmarking methodologies, which primarily emphasize long-context recall. These benchmarks tend to neglect the integration of dispersed information and long-form generation, both integral to deploying LCLMs in practical applications. This paper introduces LongProc, a benchmark designed to fill this evaluation gap by focusing on long procedural generation tasks.

Overview of LongProc

LongProc comprises six diverse tasks: HTML to TSV, Pseudocode to Code, Path Traversal, Theory-of-Mind (ToM) Tracking, Countdown, and Travel Planning. All of them require procedural execution: models must follow deterministic steps and produce structured outputs, which enables reliable, rule-based evaluation. The nature of these tasks varies widely:

  • HTML to TSV involves extracting structured data from HTML pages.
  • Pseudocode to Code requires translating pseudocode into functional C++ code.
  • Path Traversal and ToM Tracking involve tracing paths or tracking beliefs in dynamic environments.
  • Countdown and Travel Planning simulate exhaustive search challenges, requiring models to explore multiple possible actions at each decision point.

Each task is tested across three difficulty levels, spanning outputs of 0.5K, 2K, and 8K tokens, to gauge the robustness of LCLMs under increasing complexity.
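
As a concrete illustration of the kind of rule-based evaluation that these structured outputs make possible, the sketch below scores a generated TSV table against a reference with a row-level F1. This is a minimal, hypothetical metric written for illustration; the function names and the exact-match row comparison are assumptions, not the scoring code released with LongProc.

```python
# Minimal sketch of rule-based scoring for a structured (TSV) output.
# Illustrative only: the matching rule (exact row match after whitespace
# normalization) and the function names are assumptions, not LongProc's code.

def parse_tsv(text: str) -> list[tuple[str, ...]]:
    """Split TSV text into rows of whitespace-normalized cell tuples."""
    rows = []
    for line in text.strip().splitlines():
        cells = tuple(cell.strip() for cell in line.split("\t"))
        rows.append(cells)
    return rows

def tsv_row_f1(predicted: str, reference: str) -> float:
    """Row-level F1 between a predicted and a reference TSV table."""
    pred_rows = parse_tsv(predicted)
    ref_rows = set(parse_tsv(reference))
    if not pred_rows or not ref_rows:
        return 0.0
    correct = sum(1 for row in pred_rows if row in ref_rows)
    precision = correct / len(pred_rows)
    recall = correct / len(ref_rows)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example usage:
# gold = "Alice\t30\nBob\t25"
# pred = "Alice\t30\nBob\t26"
# print(tsv_row_f1(pred, gold))  # 0.5
```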

Empirical Findings and Analysis

The evaluation of 23 LCLMs, including both proprietary and open-weight models, reveals significant discrepancies in performance. While leading models like GPT-4o and Gemini-1.5-Pro demonstrate near-complete mastery on 0.5K-token tasks, their effectiveness wanes substantially at the 8K level. Notably, although all tested models claim context windows above 32K tokens, open-weight models typically falter on 2K-token tasks, and even closed-source models such as GPT-4o degrade significantly on 8K-token tasks, indicating difficulties in maintaining coherence over extended output sequences.

This degradation is particularly stark in tasks requiring nuanced deductive reasoning or stateful search, such as ToM Tracking and Travel Planning. Per-entry error rates grow as output sequences lengthen, pointing to inherent challenges in sustaining procedural fidelity over longer generations. These results underscore the need for improvements in both training methodology and model architecture to support longer procedural generation effectively.
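
One way to make the per-entry degradation concrete is to bucket each generated entry by its position in the output and track accuracy per bucket. The sketch below is a hypothetical version of such an analysis; the bucketing scheme and names are assumptions rather than the paper's analysis code.

```python
# Sketch of a positional error analysis: accuracy of generated entries
# bucketed by their position in the output sequence. Bucket size and
# function names are illustrative assumptions.

from collections import defaultdict

def accuracy_by_position(entries: list[tuple[bool, int]],
                         bucket_size: int = 10) -> dict[int, float]:
    """entries: (is_correct, position) pairs; returns {bucket_start: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for is_correct, position in entries:
        bucket = (position // bucket_size) * bucket_size
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# If accuracy falls off in later buckets, errors compound with output length.
```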

Implications and Future Directions

LongProc establishes a new benchmark standard that emphasizes procedural execution over conventional recall tasks, aligning more closely with real-world applications and end-user requirements. Its findings expose the limitations of current LCLM strategies and highlight how existing models struggle to integrate information and reason across extensive text spans.

Future progress could focus on refining model architectures and training methodologies explicitly for complex procedural tasks. In particular, improvements in attention mechanisms, state tracking, and memory integration could advance LCLM capabilities in compound problem-solving settings.

In summary, LongProc offers an insightful evaluation paradigm for LCLMs, underscoring the value of procedural generation tests over traditional recall-focused assessments and pointing to substantial research opportunities in improving long-context understanding and generation in LLMs.
