AlphaResearchComp Benchmark

Updated 19 November 2025

AlphaResearchComp is a rigorous benchmark that defines a reproducible evaluation environment for autonomous algorithm discovery on nontrivial mathematical and algorithmic tasks.
It aggregates eight diverse problems with executable verification and human-best reference metrics to objectively gauge innovation and performance.
The platform emphasizes transparency through open-source code, deterministic test cases, and precise measurement scripts to ensure reliable head-to-head comparisons.

AlphaResearchComp is a rigorous benchmark for evaluating autonomous algorithm discovery agents, designed to promote transparency, reproducibility, and competitiveness on challenging open-ended mathematical and algorithmic tasks. Introduced in conjunction with the AlphaResearch agent, the benchmark emphasizes executable correctness, human baseline comparability, and objective scoring across a suite of diverse algorithmic problems (Yu et al., 11 Nov 2025).

1. Motivation and Benchmarking Philosophy

AlphaResearchComp addresses a critical deficiency in prior AI-for-science benchmarks: the absence of a public, open-ended evaluation suite enabling head-to-head comparison between LLM-based researchers and established human-best algorithms. Previous systems enforced either pure execution-based verification, which limits exploration to variants of known solutions, or LLM-as-critic protocols, prone to feasibility violations. AlphaResearchComp's central objective is to establish a transparent, reproducible, and rigorous evaluation environment where superiority over human baselines is both meaningful and verifiable.

Key design goals include:

Enabling automatic, objective assessment on algorithmically nontrivial tasks.
Providing reference human-best records for true innovation detection.
Ensuring all problems are supported by open-source starter code, deterministic test cases, and executable measurement scripts.
Allowing results to be documented and externally reproducible down to the bit level.

2. Composition of the Problem Suite

AlphaResearchComp aggregates eight optimization and inference problems chosen for their mathematical depth, diversity of domains, and suitability for automated scoring. Each problem offers:

A clearly specified input/output protocol.
Executable correctness constraints.
A cited human-best score with a precise metric.

The problem suite spans geometric, combinatorial, harmonic analysis, and polynomial design challenges. Representative tasks include:

Packing Circles in a Unit Square: Maximize the sum of radii for $n$ disjoint circles with variable radii within $[0,1]^2$ ; e.g., for $n=26$ , the best-known human score is $2.634$.
Minimize Max-Min Distance Ratio: Arrange $n=16$ points in $[0,1]^2$ to minimize the ratio $R$ of maximum to minimum pairwise Euclidean distance (human best $R=12.89$ ).
Third-Order Autocorrelation Inequality: Find compactly supported real functions $f$ minimizing $C_3$ , the normalized maximum of $(f*f)(t)$ .
Spherical Code Construction: Place $n=30$ unit vectors on $S^2$ to maximize the minimum pairwise angle ( $\theta_{\min} \approx 0.673651$ rad).
Autoconvolution Peak Minimization: Design $f$ supported on $[-1/4,1/4]$ with unit $L^1$ norm to minimize the peak of its autoconvolution ( $C\approx0.755$ ).
Littlewood Polynomials: Choose coefficients $c_k\in\{\pm1\}$ to minimize $\|P_n\|_\infty$ for $n=512$ (Rudin–Shapiro: $\|P_{512}\|_\infty\le32$ ).
MSTD Sets: For $n=30$ , select $A\subset\{0,\ldots,N-1\}$ maximizing $|A+A|/|A-A|$ (human best $R\approx1.04$ ).

All problems are distributed with input generators and canonical baseline scripts (Yu et al., 11 Nov 2025).

Problem Name	Objective (max/min)	Human Best
Packing Circles	$\sum r_i$ (max)	$2.634$
Min Max-Min Ratio	$R$ (min)	$12.89$
Spherical Code	$\theta_{\min}$ (max)	$0.673651$
Littlewood Polynomials	$\\|P_n\\|_\infty^{-1}$ (max)	$0.03125$

3. Verification and Scoring Pipeline

Each problem is equipped with an executable "evaluation program" $\mathcal{E}(\cdot)$ comprising:

Verification module: Enforces all problem constraints strictly, e.g., geometric non-overlap, support and range checks, integer or real-valued restrictions. Invalid submissions (failures, constraint violations, or runtime errors) receive a failure score (usually zero or $-1$ ).
Measurement module: For valid outputs, computes the native objective (e.g., sum of radii, minimum angle, supremum norm) deterministically and exactly.

Candidate solutions are executed under a sandboxed Python interpreter to enforce safety and determinism. All scripts, example inputs, and validators are open-sourced, with fixed random seeds and explicit test instance splits.

4. Unified and Native Objective Metrics

Beyond native per-problem metrics, AlphaResearchComp defines a composite "excel@best" score to quantify progress over human-known reference solutions. The metric is:

$\text{excel@best} = \mathbb{E}_{\text{problems}} \left[\frac{\left|r_{\text{best}} - r_{\text{human}}\right| \cdot \mathbb{I}_d}{|r_{\text{human}}|}\right]$

where $\mathbb{I}_d=+1$ if higher is better and $\mathbb{I}_d=-1$ if lower is better.

Examples of native metrics:

Packing circles: $\sum_{i=1}^n r_i$ (maximize)
Spherical code: $\min_{i<j} \arccos \langle p_i, p_j \rangle$ (maximize)
Littlewood polynomials: $1/\|P_n\|_\infty$ (maximize)
MSTD sets: $|A+A|/|A-A|$ (maximize)

This structure enables precise, quantitative monitoring of innovation, restricting claims of outperformance strictly to cases where agent-generated solutions exceed rigorously vetted human baselines.

5. Reproducibility, Documentation, and Open-Source Protocols

AlphaResearchComp prioritizes bit-level reproducibility:

All test data, random seeds, code for evaluation and reward model training, and problem prompts are hosted on a public repository (github.com/answers111/alpha-research).
Complete input/output specifications are provided for each challenge.
GPU/CPU determinism flags and parameterization files are standardized and included.
The benchmark appendix documents all prompts, reward-model hyperparameters, and reviewer instructions, ensuring transparent peer review and evaluation.

6. Empirical Evaluation and Failure Analysis

AlphaResearchComp has been used to benchmark the AlphaResearch agent in direct head-to-head competition with human-best results (Yu et al., 11 Nov 2025):

AlphaResearch achieved a $2/8$ win-rate (excel@best $>0$ on two challenges), specifically producing new best-known solutions for "packing circles" ( $n=26$ : $2.634\to2.636$ , $+0.32\%$ ; $n=32$ : $2.936\to2.939$ , $+0.10\%$ ), exceeding both published human and prior AlphaEvolve baselines.
For the six remaining problems, the agent generated solutions that improved substantially relative to random initialization (e.g., min max-min ratio $15.55 \to 12.92$ ), though not surpassing human-best values.
Failure analysis highlighted that local search exhaustion, multimodal objective landscapes, precision demands, and reward-model pruning (rejecting 30–40% of candidate solutions of which $\sim$ 70% were genuinely infeasible) present persistent barriers to surpassing highly optimized human solutions.

7. Impact and Research Directions

AlphaResearchComp sets a methodological precedent for evaluating autonomous research agents, emphasizing genuine innovation over rote recombination or optimization within well-trod domains. By open-sourcing both bench and evaluation pipelines, it enables independent reproducibility and standardization of claims in algorithmic discovery with LLMs. The benchmark reveals current LLM-based methods' strengths in geometric packing and their limitations in higher-precision or combinatorially complex domains, clarifying future research requirements for discovery on intractable or non-differentiable search spaces (Yu et al., 11 Nov 2025).

PDF Markdown Chat (Pro)

References (1)

AlphaResearch: Accelerating New Algorithm Discovery with Language Models (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to AlphaResearchComp Benchmark.