- The paper presents a benchmark evaluating large language models on generating integer sequences by synthesizing efficient Python code, integrating a novel automated mechanism to detect disallowed lookup tables using OEIS data.
- Experiments demonstrate reasoning-optimized models achieve superior accuracy (at least 63% on easy sequences) compared to general-purpose models (~57%), though all tested models show significant performance drops on harder sequences.
- A novel automated cheating detection system, validated with 86% human agreement, identifies lookup tables in generated code to ensure evaluated models rely on algorithmic reasoning rather than data memorization.
The paper presents a comprehensive benchmark designed to evaluate the capabilities of LLMs on integer sequence generation tasks. The main contributions involve both a novel task formulation—deriving efficient Python code to compute a given integer sequence—and the integration of an automated mechanism to detect cheating, specifically the use of disallowed lookup tables.
The benchmark uses integer sequences sourced from the Online Encyclopedia of Integer Sequences (OEIS), grouping them into two categories:
- Easy sequences: The first 250 sequences labeled as “easy” in OEIS.
- Hard sequences: The first 250 sequences labeled as “hard” in OEIS.
For each sequence s, the task is defined as generating a Python function fs that computes the n-th term of the sequence with the constraint fs(n)=s(n) for all n above a given offset i0. The function must be implemented without resorting to a lookup table, and its execution time must remain below a predefined limit T (e.g., T∈{0.5,4} seconds). This is formalized as follows:
fs:{i0+j}j=0∞→Z,withfs(n)=s(n)∀n≥i0
- fs(n) denotes the model-generated function.
- s(n) represents the true sequence value.
- T is the time limit for execution.
Benchmark Design and Evaluation Metrics
The benchmark assesses models based on three factors:
- Accuracy: Correctness of the produced sequence values.
- Efficiency: Compliance with the execution time constraint.
- Cheating Avoidance: Verification that the generated code does not contain a lookup table for sequence terms.
For each sequence s, an accuracy metric As(n) is recorded as:
As(n)={0if fs(n)=s(n), or ts>T, or if cheating is detected 1otherwise
The average accuracy is computed over all evaluations in the easy and hard sequence sets.
Experimental Setup and Results
A total of nine frontier models were evaluated, including:
- Reasoning-optimized models (o1-mini, o1-preview)
- General-purpose models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
- Other competitive models (Llama 3.1 405b and 70b, Gemini 1.5 Pro and Flash)
The experiments compared performance under different time constraints (0.5s vs. 4s), with the discussion primarily emphasizing the 4s time limit due to minimal performance variance between the two. The summary of the findings includes:
- o1 Models: Achieved at least 63% accuracy on the easy sequences and 18% on the hard sequences. Notably, the o1-mini model demonstrated lower cheating rates (approximately 2% for easy sequences and 15.2% for hard sequences).
- General-Purpose Models: Models such as Claude 3.5 Sonnet and GPT-4o scored around 57% on easy sequences, with their performance dropping significantly on hard sequences (down to 10–11%).
- Other Models: Certain models like Llama 405b and Gemini 1.5 Pro reported performance below 50% on easy sequences and below 10% on hard sequences. The Gemini API also faced issues related to content filtering, potentially increasing its reliance on lookup tables.
Cheating Detection Mechanism
A novel component of the benchmark is its automated cheating detection mechanism, which leverages GPT-4o’s structured output capabilities (with a temperature setting of 0 for reproducibility). This system checks the generated code for disallowed elements, particularly lookup tables. The mechanism was validated through a comparative evaluation with human judgments, achieving an overall agreement rate of 86% and 94% on hard sequences. Notably:
- False Positives: GPT-4o tended to flag more codes as cheating compared to the human evaluator.
- False Negatives: No instances were noted where human evaluators flagged cheating that went undetected by GPT-4o.
Discussion and Limitations
The results underscore the superior performance of reasoning-optimized models in tasks that demand both algorithmic reasoning and efficient code synthesis. However, all evaluated models struggled with the hard sequences, indicating that significant challenges remain in generating complex algorithms under strict efficiency constraints. Key limitations include:
- Dataset Bias: Dependence solely on OEIS may introduce biases based on the types of sequences and their difficulty labels.
- Language Constraints: Restricting the task to Python limits the scope, particularly given Python’s inherent execution speed constraints.
- Cheating Detection Imperfections: While effective, the automated cheating detection mechanism is not infallible, as indicated by the 86% overall agreement with human evaluations.
- Resource Constraints: The imposed time limits may penalize algorithms that are correct but computationally intensive, especially for sequences with inherently high computational cost.
Future Research Directions
Several enhancements and extensions are suggested:
- Tool Integration: Incorporating retrieval-augmented generation (RAG) and external tools to allow models to access additional resources.
- Benchmark Expansion: Updating the benchmark with newly added OEIS sequences to continuously challenge models with sequences not seen during training.
- Cross-Language Evaluation: Extending the approach to other programming languages may provide a richer understanding of models’ computational reasoning abilities.
- Refinement of Cheating Detection: Further aligning automated methods with human judgment through few-shot in-context learning or alternative verification strategies.
In summary, the paper rigorously evaluates LLMs on integer sequence generation, demonstrating that while models with enhanced reasoning capabilities excel in many instances, considerable improvements are still needed to tackle algorithmically complex sequences efficiently and in compliance with execution constraints.