Create a Video View Paper

Gold-Medal AI: Scaling Olympiad Reasoning with SU-01

This lightning talk explores how the SU-01 framework transforms a 30-billion-parameter language model into a gold-medal-level olympiad problem solver through a unified three-stage pipeline: reverse-perplexity curriculum learning, two-tier reinforcement learning with verifiable and proof-level rewards, and test-time scaling via iterative self-verification. We examine the architectural choices that enable compact models to achieve expert-level mathematical and scientific reasoning, culminating in performance matching top human competitors on International Mathematical Olympiad and USA Mathematical Olympiad problems.

Script

A 30 billion parameter model just matched the top human score on the USA Mathematical Olympiad, solving 10 of 12 problems and earning gold-medal status among 340 competitors. The SU-01 framework achieves this through simple, unified scaling rather than massive compute or exotic architectures.

The breakthrough begins with reverse-perplexity curriculum learning. Instead of training on easy examples first, the model starts with the most unfamiliar problems, those with highest perplexity, then consolidates on more familiar territory. This ordering prevents the long-chain-of-thought degradation that cripples conventional supervised fine-tuning.

Reinforcement learning then unfolds in two tiers. Coarse RL optimizes for verifiable answer correctness using hierarchical reward assignment, while refined RL shifts focus to holistic proof quality, using a generative reward model to score entire derivations. This two-stage design produces solutions that are both correct and auditable.

At inference time, the model enters iterative solve, verify, refine loops, allocating up to 100,000 tokens per problem. A solution is accepted only after surviving repeated model-driven verifications. This test-time scaling protocol transforms bronze-level 21-point direct generations into gold-level 35-point solutions.

The results reveal cross-domain generalization even where training signals were absent. SU-01 achieves 69 percent on chemistry olympiad problems and 25 percent on biology, despite reinforcement learning focused exclusively on mathematics and physics. The unified behavioral pipeline transfers reasoning structure across scientific domains.

This work dismantles the assumption that only frontier-scale models can achieve expert reasoning. Through curriculum-optimized learning, hierarchical reinforcement, and structured test-time computation, a compact 30 billion parameter model reaches human gold-medal performance. Explore the full technical pipeline and create your own video summaries at EmergentMind.com.