Comparison of hand-engineered inference strategies versus learned model-generated strategies

Determine how domain-specific, hand-engineered inference strategies for competitive programming (exemplified by the o1-ioi test-time pipeline that partitions IOI problems into subtasks, samples large candidate sets, clusters outputs on model-generated test inputs, and reranks submissions) compare to learned test-time reasoning strategies that are autonomously generated and executed by large reasoning models trained end-to-end via reinforcement learning (such as OpenAI o1 and o3).

Background

The paper examines the performance of large reasoning models (LRMs) on competitive programming, contrasting specialized pipelines with general-purpose reinforcement learning approaches. OpenAI o1-ioi incorporates domain-specific, hand-engineered test-time strategies, such as subtask-based sampling, clustering of candidate outputs on model-generated tests, and reranking, to compete in IOI 2024.
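To make the hand-engineered side concrete, the following is a minimal sketch of a cluster-and-rerank selection step of the kind described for o1-ioi. It is not the paper's implementation; the inputs (`candidates`, `test_inputs`, `run_program`, `score`) are hypothetical placeholders for the model-sampled programs, model-generated tests, an execution harness, and a reranking score.

```python
from collections import defaultdict

def select_submissions(candidates, test_inputs, run_program, score, k_submissions=50):
    """Sketch of a hand-engineered cluster-and-rerank test-time strategy.

    Assumed inputs (hypothetical, not the paper's actual interface):
      candidates   -- candidate programs sampled from the model for a (sub)task
      test_inputs  -- model-generated test inputs
      run_program  -- callable(program, test_input) -> output
      score        -- callable(program) -> heuristic reranking score
    """
    # Cluster candidates by their output "signature" on the generated tests:
    # programs that agree on every input are treated as the same behavior.
    clusters = defaultdict(list)
    for program in candidates:
        signature = tuple(run_program(program, t) for t in test_inputs)
        clusters[signature].append(program)

    # Rerank: prefer larger clusters (agreement as weak evidence of correctness),
    # then submit the best-scoring representative from each selected cluster.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [max(cluster, key=score) for cluster in ranked[:k_submissions]]
```

The design choice the open question probes is whether such manually specified sampling, clustering, and reranking stages outperform behaviors the model learns to execute on its own.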

In contrast, the later o3 model relies on end-to-end RL to develop its own test-time reasoning behaviors (e.g., generating brute-force validators), avoiding manual inference heuristics. The authors explicitly raise the question of how these two paradigms compare, motivating their evaluations across CodeForces and IOI 2024.
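As an illustration of the learned behavior the paper mentions (writing a brute-force validator), here is a hedged sketch of cross-checking an optimized candidate against a simple exhaustive reference on small random inputs. The callables `fast_solve`, `brute_solve`, and `gen_small_input` are hypothetical stand-ins, not artifacts from the paper.

```python
import random

def cross_check(fast_solve, brute_solve, gen_small_input, trials=200, seed=0):
    """Sketch of brute-force validation: compare an optimized solution
    against a slow but obviously-correct reference before submitting.

    Assumed callables (hypothetical):
      fast_solve      -- candidate optimized solution
      brute_solve     -- exhaustive reference solution
      gen_small_input -- callable(rng) -> a small random test case
    """
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen_small_input(rng)
        if fast_solve(case) != brute_solve(case):
            return case  # counterexample found: reject or repair the candidate
    return None  # no mismatch observed on the sampled cases
```

The contrast with the previous sketch is that here the strategy is something the model generates and executes itself at test time, rather than a stage wired into an external inference pipeline.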

References

An open question is how domain-specific, hand-engineered inference strategies compare to learned approaches that models generate and execute on their own.

Competitive Programming with Large Reasoning Models (2502.06807 - OpenAI et al., 3 Feb 2025) in Section 1 (Introduction)