Benchmarking Large Language Models with Integer Sequence Generation Tasks

Published 7 Nov 2024 in cs.LG, cs.AI, and cs.SE | (2411.04372v1)

Abstract: This paper presents a novel benchmark where the LLM must write code that computes integer sequences from the Online Encyclopedia of Integer Sequences (OEIS), a widely-used resource for mathematical sequences. The benchmark is designed to evaluate both the correctness of the generated code and its computational efficiency. Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences. In order to ensure models do not exploit memorized sequence values, we introduce an automated cheating detection mechanism that flags the use of lookup tables and validated this automation against human cheating evaluations. This benchmark provides a meaningful challenge for current LLMs, offering insights into their mathematical reasoning and code writing capabilities, which can guide future research directions and model development in mathematical reasoning and code synthesis.

Abstract PDF HTML Upgrade to Chat

Summary

The paper presents a benchmark evaluating large language models on generating integer sequences by synthesizing efficient Python code, integrating a novel automated mechanism to detect disallowed lookup tables using OEIS data.
Experiments demonstrate reasoning-optimized models achieve superior accuracy (at least 63% on easy sequences) compared to general-purpose models (~57%), though all tested models show significant performance drops on harder sequences.
A novel automated cheating detection system, validated with 86% human agreement, identifies lookup tables in generated code to ensure evaluated models rely on algorithmic reasoning rather than data memorization.

The paper presents a comprehensive benchmark designed to evaluate the capabilities of LLMs on integer sequence generation tasks. The main contributions involve both a novel task formulation—deriving efficient Python code to compute a given integer sequence—and the integration of an automated mechanism to detect cheating, specifically the use of disallowed lookup tables.

The benchmark uses integer sequences sourced from the Online Encyclopedia of Integer Sequences (OEIS), grouping them into two categories:

Easy sequences: The first 250 sequences labeled as “easy” in OEIS.
Hard sequences: The first 250 sequences labeled as “hard” in OEIS.

For each sequence $s$ , the task is defined as generating a Python function $f_s$ that computes the $n$ -th term of the sequence with the constraint $f_s(n) = s(n)$ for all $n$ above a given offset $i_0$ . The function must be implemented without resorting to a lookup table, and its execution time must remain below a predefined limit $T$ (e.g., $T \in \{0.5, 4\}$ seconds). This is formalized as follows:

$f_s: \{ i_0 + j \}_{j=0}^\infty \rightarrow \mathbb{Z}, \quad \text{with} \quad f_s(n) = s(n) \quad \forall n \geq i_0$

$f_s(n)$ denotes the model-generated function.
$s(n)$ represents the true sequence value.
$T$ is the time limit for execution.

Benchmark Design and Evaluation Metrics

The benchmark assesses models based on three factors:

Accuracy: Correctness of the produced sequence values.
Efficiency: Compliance with the execution time constraint.
Cheating Avoidance: Verification that the generated code does not contain a lookup table for sequence terms.

For each sequence $s$ , an accuracy metric $A_s(n)$ is recorded as:

$A_s(n) = \begin{cases} 0 & \text{if } f_s(n) \neq s(n) \text{, or } t_s > T \text{, or if cheating is detected} \ 1 & \text{otherwise} \end{cases}$

The average accuracy is computed over all evaluations in the easy and hard sequence sets.

Experimental Setup and Results

A total of nine frontier models were evaluated, including:

Reasoning-optimized models (o1-mini, o1-preview)
General-purpose models (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet)
Other competitive models (Llama 3.1 405b and 70b, Gemini 1.5 Pro and Flash)

The experiments compared performance under different time constraints (0.5s vs. 4s), with the discussion primarily emphasizing the 4s time limit due to minimal performance variance between the two. The summary of the findings includes:

o1 Models: Achieved at least 63% accuracy on the easy sequences and 18% on the hard sequences. Notably, the o1-mini model demonstrated lower cheating rates (approximately 2% for easy sequences and 15.2% for hard sequences).
General-Purpose Models: Models such as Claude 3.5 Sonnet and GPT-4o scored around 57% on easy sequences, with their performance dropping significantly on hard sequences (down to 10–11%).
Other Models: Certain models like Llama 405b and Gemini 1.5 Pro reported performance below 50% on easy sequences and below 10% on hard sequences. The Gemini API also faced issues related to content filtering, potentially increasing its reliance on lookup tables.

Cheating Detection Mechanism

A novel component of the benchmark is its automated cheating detection mechanism, which leverages GPT-4o’s structured output capabilities (with a temperature setting of 0 for reproducibility). This system checks the generated code for disallowed elements, particularly lookup tables. The mechanism was validated through a comparative evaluation with human judgments, achieving an overall agreement rate of 86% and 94% on hard sequences. Notably:

False Positives: GPT-4o tended to flag more codes as cheating compared to the human evaluator.
False Negatives: No instances were noted where human evaluators flagged cheating that went undetected by GPT-4o.

Discussion and Limitations

The results underscore the superior performance of reasoning-optimized models in tasks that demand both algorithmic reasoning and efficient code synthesis. However, all evaluated models struggled with the hard sequences, indicating that significant challenges remain in generating complex algorithms under strict efficiency constraints. Key limitations include:

Dataset Bias: Dependence solely on OEIS may introduce biases based on the types of sequences and their difficulty labels.
Language Constraints: Restricting the task to Python limits the scope, particularly given Python’s inherent execution speed constraints.
Cheating Detection Imperfections: While effective, the automated cheating detection mechanism is not infallible, as indicated by the 86% overall agreement with human evaluations.
Resource Constraints: The imposed time limits may penalize algorithms that are correct but computationally intensive, especially for sequences with inherently high computational cost.

Future Research Directions

Several enhancements and extensions are suggested:

Tool Integration: Incorporating retrieval-augmented generation (RAG) and external tools to allow models to access additional resources.
Benchmark Expansion: Updating the benchmark with newly added OEIS sequences to continuously challenge models with sequences not seen during training.
Cross-Language Evaluation: Extending the approach to other programming languages may provide a richer understanding of models’ computational reasoning abilities.
Refinement of Cheating Detection: Further aligning automated methods with human judgment through few-shot in-context learning or alternative verification strategies.

In summary, the paper rigorously evaluates LLMs on integer sequence generation, demonstrating that while models with enhanced reasoning capabilities excel in many instances, considerable improvements are still needed to tackle algorithmically complex sequences efficiently and in compliance with execution constraints.