
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering (2506.09050v1)

Published 10 Jun 2025 in cs.AI

Abstract: How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities. This highlights the need for this benchmark to foster future AI advancements.

Summary

  • The paper presents ALE-Bench, a new benchmark designed to evaluate AI systems on long-horizon, score-based algorithm engineering tasks.
  • It offers a rich dataset from AtCoder Heuristic Contests, complete with problem statements, visualization tools, and a Docker-based code sandbox.
  • Experiments demonstrate that iterative refinement boosts AI performance, though a significant gap remains compared to human experts.

ALE-Bench (2506.09050) is introduced as a new benchmark designed to evaluate AI systems on complex, score-based algorithmic programming tasks that require iterative solution refinement over long time horizons. Unlike traditional coding benchmarks that focus on short-duration, pass/fail problems, ALE-Bench addresses computationally hard optimization problems commonly found in algorithm engineering domains such as logistics, scheduling, and planning. The benchmark is based on problems from the AtCoder Heuristic Contest (AHC), a large score-based competition where participants iteratively improve their solutions over days or weeks.

The core goal of ALE-Bench is to simulate the human AHC contestant experience, allowing AI agents to receive a task, develop code, test it against example inputs, visualize results, and refine their approach based on feedback. This process emphasizes long-horizon reasoning and continuous improvement, which are critical capabilities for tackling real-world optimization challenges.

ALE-Bench Dataset and Implementation

The benchmark dataset comprises 40 past AHC problems released on Hugging Face. These problems cover diverse domains like routing, planning, and puzzle-solving. Each problem package includes:

  • Problem Statement: Markdown description with images.
  • Scorer: Rust program to evaluate solution code on input cases and calculate scores (invocation sketched below).
  • Visualizer: Tools (Rust-based for static images and web-based for interactive visualization) to display code behavior on inputs.
  • Leaderboard: Data from original contests to rank AI submissions against human performance.

A "lite" version with 10 representative problems is also provided for faster evaluation cycles. All data is officially licensed from AtCoder Inc.

The benchmark is implemented as a Python library providing an AlebenchSession object to orchestrate AI participation. Key actions available to an AI agent within a timed session include (a usage sketch follows the list):

  • View Problem: Access problem statement and metadata (time/memory limits, objective, etc.).
  • Test Run: Execute code in a Docker-based Code Sandbox that replicates the AtCoder environment (supporting C++20, Python3, Rust) to obtain scores and feedback (compilation errors, TLE, MLE, runtime errors, score). Agents can test against predefined public test cases or generate and test custom inputs using the problem's visualizer/generator tool.
  • Visualization: Use provided tools to visualize solution execution on specific inputs, either via static images or an interactive web interface.
  • Submission: Submit the final solution code for private evaluation against a larger, hidden test set, which determines the final score, rank, and performance metric.
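
A minimal sketch of how an agent loop could drive such a session is shown below. The session method names and signatures (get_problem, run_tests, time_remaining, submit) and the llm_refine helper are assumptions made for illustration; the actual AlebenchSession API may differ.

```python
# Hypothetical sketch of an agent loop driving a timed ALE-Bench session.
# Method names (get_problem, run_tests, time_remaining, submit) and the
# llm_refine helper are assumptions for illustration, not the documented API.
def solve(session, llm_refine):
    """Iteratively refine a solution until the session clock runs out.

    llm_refine(problem, code, feedback) -> new solution code (assumed helper).
    """
    problem = session.get_problem()                        # statement, limits, metadata
    code = llm_refine(problem, code=None, feedback=None)   # first draft

    while session.time_remaining() > 0:
        # Sandboxed run on public test cases; feedback carries per-case scores
        # and failure modes (compilation error, TLE, MLE, runtime error).
        feedback = session.run_tests(code, language="cpp20")
        code = llm_refine(problem, code=code, feedback=feedback)

    # Final submission: private evaluation on the hidden test set determines
    # score, rank, and performance.
    return session.submit(code)
```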

The Code Sandbox imposes standard resource limits (1 CPU, 2GiB RAM) per execution and allows parallel test case evaluation to speed up the process. Standardized AWS EC2 C6i instances are recommended for evaluation to ensure fair comparisons.

Evaluation Metrics

ALE-Bench uses metrics aligned with AHC to allow direct comparison with human performance (an aggregation sketch follows the list):

  • Per-Problem:
    • Scorer score for each private test case.
    • Overall private evaluation score (sum of raw or normalized scores).
    • Rank relative to human participants in the original contest.
    • Performance: An Elo-like score (typically 0-3500+) derived from rank, providing a problem-agnostic measure.
  • Aggregated:
    • Average Performance: Mean performance score across all problems.
    • Performance Distribution: Percentage of problems achieving certain performance tiers (e.g., ≥400 for Brown, ≥1600 for Blue, ≥2000 for Yellow, ≥2400 for Orange, ≥2800 for Red).
    • Rating: AtCoder's cumulative skill indicator, though Average Performance is recommended for AI evaluation as Rating can be skewed by a few high scores.
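
Given a list of per-problem performance scores, the aggregated metrics above reduce to a simple computation, sketched below with the tier thresholds from the list; the helper is illustrative and not part of the benchmark library.

```python
# Sketch: aggregate per-problem performance scores into the reported metrics.
# Tier cut-offs follow the list above; this helper is illustrative only.
TIERS = {"Brown": 400, "Blue": 1600, "Yellow": 2000, "Orange": 2400, "Red": 2800}

def aggregate(performances: list[int]) -> dict:
    n = len(performances)
    average = sum(performances) / n                    # Average Performance
    distribution = {                                   # Performance Distribution
        name: sum(p >= cutoff for p in performances) / n
        for name, cutoff in TIERS.items()
    }
    return {"average_performance": average, "distribution": distribution}

# Example: performances of 900, 1700, and 2100 give an average of ~1567,
# with 100% of problems at Brown, ~67% at Blue, and ~33% at Yellow.
print(aggregate([900, 1700, 2100]))
```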

Regulations standardize the environment, permitted languages, execution time measurement, allowed external libraries, and prohibit human intervention or access to specific contest data (private test cases, detailed human scores) to ensure fair play.

ALE-Agent Prototype and Experiments

The paper introduces ALE-Agent as a prototype agent designed for the benchmark, combining LLMs with systematic search. Its core strategy uses a best-first search approach with beam expansion, where each node is a code solution. Nodes are prioritized based on acceptance ratio and score on public tests. The LLM refines solutions through dialogue, receiving context including current/best code, feedback, and targeted guidance prompts (e.g., for simulated annealing or beam search). Parallel execution of LLM calls and evaluations enhances throughput.
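
The sketch below conveys the flavor of this search strategy: a priority queue over candidate solutions ordered by public-test results, with LLM-driven beam expansion at each step. The priority function and helper callables are placeholders rather than the paper's exact implementation.

```python
import heapq

def best_first_search(initial_code, evaluate, expand_with_llm,
                      beam_width=4, budget=100):
    """Illustrative best-first search with beam expansion over code solutions.

    evaluate(code) -> (acceptance_ratio, score) on the public test cases.
    expand_with_llm(code, k) -> k refined candidates; in ALE-Agent the prompt
    also carries test feedback and targeted guidance (e.g., simulated annealing
    or beam-search hints).
    """
    def priority(acc_ratio, score):
        # heapq pops the smallest key, so negate: higher acceptance ratio and
        # higher score are expanded first.
        return (-acc_ratio, -score)

    acc, score = evaluate(initial_code)
    best_code, best_score = initial_code, score
    frontier = [(priority(acc, score), 0, initial_code)]
    counter = 1  # tie-breaker so heapq never compares code strings directly

    for _ in range(budget):
        if not frontier:
            break
        _, _, code = heapq.heappop(frontier)
        # Beam expansion: ask the LLM for several refined children of this node.
        for child in expand_with_llm(code, k=beam_width):
            acc, score = evaluate(child)
            if score > best_score:
                best_code, best_score = child, score
            heapq.heappush(frontier, (priority(acc, score), counter, child))
            counter += 1

    return best_code, best_score
```

The lexicographic ordering (acceptance ratio first, then score) is one way to combine the two prioritization signals mentioned above.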

Experiments were conducted on Amazon EC2 C6i instances using C++20, Python3, and Rust. A total of 22 frontier LLMs were evaluated in two main settings, with an additional scaffolding evaluation:

  1. One-Shot Setting: Models were given up to five attempts per problem based on public test feedback, without extended iterative refinement.
    • Results (summarized in Tables 1, 2, A.1): The reasoning model o3-high achieved the highest average performance (1044), surpassing other models. Reasoning models generally outperformed non-reasoning ones. C++20 showed the highest average performance among languages. However, even the best models rarely achieved performance above 1600 (Blue tier), indicating a significant gap compared to human experts who consistently reach higher tiers. Costs varied significantly between models.
  2. Iterative-Refinement Setting: Models refined solutions over a longer period (up to 4 hours or the original contest duration), continuously receiving public test feedback. A summarization strategy was used to manage context length (see the sketch after this list).
    • Results (summarized in Table 3, Table A.4): Average performance across all models improved by over 400 points compared to the one-shot setting, demonstrating the value of iterative refinement. o4-mini-high performed best (Average Performance 1520, Rating 2104, placing in the top 11.8% of human participants). All evaluated models achieved at least one performance score of 2000+ (Yellow tier), with o4-mini-high reaching 2000+ on 15% of problems and 2400+ (Orange tier) on 5%. Analysis of score progression (Figure 4, Figure A.2) showed continuous improvement and increasing code size, similar to human contestants.
  3. Scaffolding Evaluation: OpenHands (general-purpose agent) and ALE-Agent ablations were tested on the lite version.
    • Results (summarized in Table 4, Table A.3): OpenHands showed marginal improvement compared to a sequential refinement baseline, suggesting difficulties adapting to the iterative optimization process. ALE-Agent with its full configuration (+ Method 1 & 2, incorporating domain knowledge and search) achieved substantially higher performance than its base or + Method 1 configurations, reaching an average performance of 1879 and achieving 2000+ performance on 30% of lite problems. Notably, ALE-Agent with this configuration achieved a score equivalent to 5th place (Red tier) on AHC039 in the lite version evaluation (Table A.4).
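
As noted in the iterative-refinement setting above, a summarization step keeps the dialogue within the model's context window. The sketch below shows one plausible structure for such a loop; all callables and the token budget are assumed for illustration and do not reflect the paper's exact implementation.

```python
# Illustrative sketch of long-horizon refinement with history summarization.
# The llm(), summarize(), run_public_tests(), count_tokens(), and
# time_remaining() callables and the token budget are assumptions.
MAX_HISTORY_TOKENS = 50_000  # assumed budget before compacting the dialogue

def refine_loop(problem, llm, summarize, run_public_tests,
                count_tokens, time_remaining):
    history = [f"Problem statement:\n{problem}"]
    best_code, best_score = None, float("-inf")

    while time_remaining() > 0:
        prompt = "\n\n".join(history) + "\n\nImprove the current best solution."
        code = llm(prompt)                          # propose or refine a solution
        score, feedback = run_public_tests(code)    # sandboxed public-test run
        if score > best_score:
            best_code, best_score = code, score
        history.append(f"Attempt scored {score}.\nFeedback:\n{feedback}")

        # Summarization strategy: compact older turns once the context exceeds
        # the budget, keeping the problem statement verbatim.
        if count_tokens("\n\n".join(history)) > MAX_HISTORY_TOKENS:
            history = [history[0], summarize(history[1:])]

    return best_code
```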

Analysis and Comparison with Humans

While top AI models like o4-mini-high achieved ratings in the top human percentiles (e.g., 11.8%), a deeper analysis of performance distributions reveals a notable gap in consistency. AI excels on problems amenable to standard heuristic techniques (such as simulated annealing), particularly in short contests where rapid iteration is beneficial. However, humans demonstrate greater consistency across diverse problem types and superior long-horizon reasoning for the complex challenges often found in long contests. The rating system, designed for human contest patterns, may overstate AI capabilities relative to average performance and performance-distribution analysis. Investigations found no significant evidence of training-data contamination or plagiarism affecting the results (Figure 5). A correlation analysis confirmed that the lite version is a valid proxy for comparing AI systems, although it appears slightly harder on average than the full set (Figure A.3). An ALE-Agent prototype also participated in AHC046 in real time, achieving 154th place (performance 1915).

The benchmark provides a platform for advancing AI in algorithm engineering, highlighting the need for further research to close the gap with human experts in long-horizon reasoning and broad problem-solving consistency. The paper acknowledges limitations, including reliance on single runs (which limits statistical robustness) and the need for further verification of the multimodal features and tool use available to agents.
