OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation
Abstract: Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
Explain it Like I'm 14
OpenDeepThink: A Simple Explanation
Overview
This paper introduces OpenDeepThink, a way to help AI models “think” better by trying many solution ideas in parallel and using head-to-head comparisons to find and improve the best ones. Instead of stretching one long chain of thought (which can go off-track early), OpenDeepThink grows a whole “population” of ideas, compares them like a tournament, and uses feedback from those comparisons to rewrite the weaker ones. It doesn’t need special test cases or a separate grading system to work.
The Key Questions
The authors focus on three simple questions:
- If an AI generates many possible answers, how can we pick the best one without a perfect answer key?
- Can we both select good ideas and improve (rewrite) the others at the same time?
- Can this be done quickly by running steps in parallel, rather than one-after-another?
How the Method Works
The big idea in plain words
Imagine a science fair where lots of students submit projects. Instead of one judge scoring each project alone, judges compare pairs of projects and say which one is better. From many such pairwise matches, you can build a fair ranking of all projects. Then you keep the top ones, give feedback to most of the rest, ask them to revise, and drop the worst. Repeat this for a few rounds, and you’ll likely end up with a much better winner.
Step-by-step pipeline
Here’s the process, simplified:
- Start: The AI creates several different candidate solutions (for example, 20).
- Compare: The same AI acts as a judge and compares random pairs of candidates, explaining which one seems more correct and why.
- Rank with Bradley–Terry: Using all those head-to-head results, a simple math model called Bradley–Terry turns pairwise wins/losses into a global ranking (like ranking teams based on match outcomes).
- Evolve:
  - Keep the top 25% unchanged as “elites.”
  - Take the top 75% (including elites) and rewrite them using the judge’s feedback (the AI is allowed to completely switch strategies, not just make tiny fixes).
  - Drop the bottom 25%.
- Repeat: Do a few generations of compare → rank → evolve.
- Final pick: Run a denser round of pairwise comparisons and choose the top-ranked solution.
All the comparisons within a round happen in parallel, so the total “waiting time” is limited even if many candidates are involved.
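The evolve step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `split_population` and `next_generation` and the `mutate` callback are hypothetical, and the sketch assumes that unchanged elites and rewritten candidates are pooled to form the next generation.

```python
import math

def split_population(ranked, elite_frac=0.25, drop_frac=0.25):
    """Split a best-first ranking into elites, candidates to rewrite, and dropped ones.

    The fractions follow the summary above (top 25% kept, bottom 25% dropped).
    """
    n = len(ranked)
    n_elite = math.floor(n * elite_frac)
    n_drop = math.floor(n * drop_frac)
    elites = ranked[:n_elite]            # carried forward unchanged
    to_mutate = ranked[:n - n_drop]      # top 75%, elites included
    dropped = ranked[n - n_drop:]        # bottom 25%, discarded
    return elites, to_mutate, dropped

def next_generation(ranked, mutate):
    """Assumed pooling rule: unchanged elites plus one rewrite per surviving candidate."""
    elites, to_mutate, _ = split_population(ranked)
    return elites + [mutate(c) for c in to_mutate]

# Hypothetical usage: candidates are labelled 0..19, best first,
# and the "rewrite" is a stand-in that just tags the candidate.
population = next_generation(list(range(20)), mutate=lambda c: f"rewrite-of-{c}")
print(len(population))
```

Under this reading, a population of 20 yields 5 elites plus 15 rewrites, so the population size stays at 20 from one generation to the next.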
What is the Bradley–Terry model?
- Think of sports: if Team A beats Team B and Team B beats Team C, you get hints about their strengths. Bradley–Terry uses many such pairwise results to assign a score to each “player” (here, a candidate solution) and produce a fair ranking, even if not everyone plays everyone.
- This is better than just counting raw wins, because it adjusts for how strong your opponents were.
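To make the ranking step concrete, here is a small sketch of a Bradley–Terry fit. The glossary quotes the paper as fitting with L-BFGS and a small penalty; this sketch swaps in plain gradient ascent to stay dependency-free. The function name, the step size, the penalty weight, and the treatment of ties as half-wins are all assumptions made for illustration.

```python
import math

def fit_bradley_terry(n, outcomes, l2=0.01, lr=0.1, steps=2000):
    """Fit one score per candidate from pairwise outcomes.

    outcomes: list of (i, j, y) with y = 1.0 if i beat j, 0.0 if j beat i,
    and 0.5 for a tie (an assumed convention).
    Maximizes sum[y*log p + (1-y)*log(1-p)] - (l2/2)*sum(s^2),
    where p = sigmoid(s[i] - s[j]).
    """
    s = [0.0] * n
    for _ in range(steps):
        grad = [-l2 * v for v in s]              # gradient of the penalty term
        for i, j, y in outcomes:
            p = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))
            grad[i] += y - p                     # d(log-likelihood)/d s[i]
            grad[j] -= y - p                     # d(log-likelihood)/d s[j]
        s = [v + lr * g for v, g in zip(s, grad)]
    return s

# Hypothetical results: 0 beats 1, 1 beats 2, 0 beats 2.
scores = fit_bradley_terry(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)])
ranking = sorted(range(3), key=lambda c: scores[c], reverse=True)
print(ranking)
```

Because each comparison moves a candidate's score by the gap between the observed result and the predicted win probability, beating a highly rated opponent counts for more than beating a weak one. The penalty also pins down the otherwise arbitrary additive offset of the scores.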
Why pairwise judging instead of scoring each solution alone?
When an AI judges a single answer by itself (pointwise), it tends to be too positive and lets weak answers pass. But when it compares two answers (pairwise), it’s better at spotting differences. The paper shows that pairwise judging is much more accurate on their tests.
What does “mutation” mean here?
“Mutation” just means “rewrite with feedback.” The AI reads the critiques from the pairwise comparisons and uses them to fix or completely redo the solution—like revising an essay after comments from a teacher.
Time and compute budget (in simple terms)
- Per problem, the system makes about 285 AI calls total.
- These calls are grouped into 8 sequential rounds (everything inside each round runs at the same time), which takes about 27 minutes in their setup.
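The two figures above can be reproduced with simple arithmetic. The sketch below is one possible accounting, not the paper's own breakdown: it assumes a population of 20, 4 comparisons per candidate per generation, 3 generations, and 10 comparisons per candidate in the final round (values this summary lists elsewhere as the baseline), and it assumes each pairwise comparison costs one call.

```python
def call_budget(n=20, K=4, T=3, M=10, mutate_frac=0.75):
    """Count LLM calls and sequential rounds under the assumed accounting."""
    initial = n                              # one call per starting candidate
    compare_per_gen = n * K // 2             # each comparison involves two candidates
    mutate_per_gen = int(n * mutate_frac)    # rewrites of the top 75%
    final_compare = n * M // 2               # denser final round
    total = initial + T * (compare_per_gen + mutate_per_gen) + final_compare
    depth = 1 + 2 * T + 1                    # generate, T x (compare, mutate), final compare
    return total, depth

print(call_budget())  # (285, 8)
```

With these values the count is 20 + 3 × (40 + 15) + 100 = 285 calls across 8 sequential rounds, which matches the reported budget.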
Main Findings and Why They Matter
Here are the main results, with a quick note on why they’re important:
- Big boost on coding problems: On a mix of 192 competitive programming tasks, OpenDeepThink raises Gemini 3.1 Pro’s effective Codeforces rating by about +405 Elo points. Elo ratings (like chess ratings) are a way to measure “skill level,” so +405 is a large jump.
- Works across models: The same settings help both weaker and stronger versions of the Gemini model without extra tuning. Weaker models benefit more from the “evolution” (rewriting), while stronger models benefit more from better selection.
- Best on objective tasks: On a cross-domain benchmark (HLE), the method improves in areas with clear right-or-wrong answers (math, biology, physics), but can get worse in subjective areas (humanities, social sciences). This shows the method depends on the judge being reliable in pairwise comparisons.
- Pairwise beats pointwise: In a controlled test of 500 pairs of solutions, pairwise judging was 86% accurate, while pointwise judging was only 59%. This confirms that comparing two answers is much better than scoring one in isolation.
- Feedback that points out mistakes is key: Negative feedback (what’s wrong) drives most of the improvement. Positive feedback doesn’t add much beyond what the model already knows from its own solution.
Implications and Potential Impact
What this could change
- Better reasoning without special graders: OpenDeepThink can select and improve solutions without hidden test cases or custom reward models, which makes it easier to apply in new domains (as long as correctness is judgeable).
- Parallel thinking as a new default: Instead of stretching one long chain of thought (which is brittle), future systems may regularly use populations of ideas, do head-to-head comparisons, and evolve solutions with targeted feedback.
- Practical guardrails: The method shines where “correctness” is objective. In subjective tasks, the soft “in-model” judge can mislead the system, so human oversight or external checks may be needed.
Limitations to keep in mind
- Costly: About 285 calls per problem is expensive and may be too slow for real-time use.
- Judge bias: The method is only as good as the AI judge. If the judge struggles (especially on subjective questions), the system can amplify errors.
- Model scope: The paper tests on Gemini-family models; results may differ on other AI models.
- Some settings were informally chosen: For example, keeping the top 25% as elites and allowing full rewrites helped in practice, but weren’t deeply ablated.
Bottom Line
OpenDeepThink is like running an idea tournament: many ideas compete head-to-head, a fair ranking system picks leaders, and focused feedback helps most of the rest improve. This approach delivers big gains on tasks with clear right-or-wrong answers (like competitive programming), transfers across model strengths, and avoids needing a separate verifier. Its success depends on reliable judging and comes with a compute cost, but it points to a promising future where AI “thinks in parallel,” compares, and evolves its answers to reach stronger reasoning.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and unanswered questions that future work could address:
- Model generalization: Do the gains transfer beyond the Gemini family to architecturally and vendor-diverse models (e.g., GPT, Claude, Llama, Qwen), and under what judge/generator pairings?
- Same-model judge and generator: How much does using the same LLM for both generation and judging bias outcomes? Evaluate cross-judge setups, judge ensembling, and asymmetric judge/generator capabilities.
- Subjective-domain reliability: Can we detect, at runtime, when pairwise judgment is unreliable (e.g., humanities/social sciences) and automatically switch to alternate strategies (e.g., majority vote, abstain, human-in-the-loop)?
- Lightweight verifier integration: What minimal, domain-agnostic verifier signals (unit tests, fuzzing, symbolic checks, calculators, retrieval, tool use) reduce judge bias while preserving the training-free setup?
- Active pairing policies: Replace random K-regular matchings with information-efficient pairing (e.g., adaptive tournaments, Swiss pairings, dueling bandits) to maximize Bradley–Terry (BT) information gain per comparison.
- Aggregation alternatives and uncertainty: Compare BT to Plackett–Luce, TrueSkill, Thurstone, spectral ranking, and BT variants with explicit tie models (Davidson, Rao–Kupper) and uncertainty quantification for confidence-aware stopping.
- Theoretical guarantees: Derive sample-complexity bounds and recovery conditions for selecting the true best candidate under noisy, biased pairwise judgments and finite comparisons.
- Hyperparameter sensitivity: Systematically ablate n, K, T, M (population size, comparisons per candidate, generations, final density), including elite fraction and discard ratio, to find compute-optimal regimes under fixed latency.
- Early stopping and budget adaptivity: Design confidence-based stopping rules and adaptive allocation of comparisons/mutations per problem to cut cost without harming accuracy.
- Final-round densification: Analyze whether partial densification earlier (e.g., mid-run dense comparison) improves selection vs. deferring all density to the final round; optimize M relative to n.
- Mutation operator ablations: Quantify the causal effect of “license-to-abandon,” elite preservation ratio, and bottom-quartile dropping on diversity and performance; identify regimes where local repair beats full rewrites.
- Critique quality and reliability: Measure factuality/precision of pairwise critiques used for mutation; develop filters or verification (e.g., tests, static analysis) to suppress hallucinated or misleading feedback.
- Negative vs. positive feedback: Validate at larger scale and across domains the finding that negative feedback drives rescues while positive adds little; test graded/targeted critique formats and opponent-weighted critique aggregation.
- Information overload in mutation: Design mechanisms (e.g., weighting by opponent BT score, summarization, top-k critique selection) to handle K>4 without degrading mutation quality.
- Diversity maintenance: Track and control population diversity (code edit distance, algorithmic novelty) to prevent mode collapse; explore diversity-promoting mutation or novelty search.
- Adversarial style gaming: Diagnose and mitigate solutions that optimize for judge preference (style) rather than correctness; penalize stylistic artifacts and reward evidence (tests passed, invariants).
- Cold-start failures: Develop methods (retrieval, curriculum decomposition, self-generated tests) that create new solution approaches when gen-0 has zero solves, since current evolution “amplifies partial competence” rather than induces new capabilities.
- Cross-domain breadth: Test on more languages (Python/Java/Rust), non-programming reasoning (formal math, proofs), multimodal tasks, and long-horizon agentic settings with partial verifiability.
- Tool-use constraints: Relax the “no tools” assumption to study how minimal tool access (compilers with sanitizers, input generators) changes selection and mutation efficacy and cost.
- Judge prompt design and bias: Systematically study prompt choices for pairwise judging (position bias, rationale requirements, calibration tokens) and their impact; quantify residual position bias despite randomization.
- Robustness to judge failures: Move beyond treating malformed judge outputs as ties; assess fallback strategies (re-ask with stricter format, alternate judge) and their effect on BT stability.
- BT fitting details: Explore regularization strength, convergence diagnostics, and warm-starts for faster, more stable L-BFGS fits; report confidence intervals over BT scores to guide stopping/mutation.
- Compute and systems trade-offs: Provide end-to-end cost–latency curves, carbon estimates, and parallelization/over-provisioning strategies; study failure tolerance and straggler mitigation in real deployments.
- Comparison to strong baselines under equal budget: Evaluate against trained reward models, verifier-guided best-of-N, and learned aggregators (e.g., SSA) with matched call budgets on the same benchmarks.
- HLE statistical power: Increase per-category sample sizes; construct controlled subsets with graded objectivity to more precisely map when pairwise judging helps or harms.
- Dataset reproducibility: Replace/augment NOI-119 (hidden judge) with public alternatives; expand CF-73 beyond 73 items; quantify and mitigate pretraining contamination more rigorously.
- Elo estimation validity: Stress-test the mapping from BT top-1 to Codeforces Elo (prior choice, independence assumptions); validate with live contests or alternative rating models.
- Temperature/decoding sensitivity: Ablate decoding parameters for both generation and judging; assess stability across seeds and reproducibility under stochasticity.
- Asynchronous pipelines: Test whether discarding late responses (as suggested) affects selection reliability and fairness; design schedulers that balance latency, diversity, and selection quality.
- Failure-mode taxonomy: Provide qualitative analyses of where BT selection errs (e.g., off-by-one logic, complexity misestimation, edge-case handling) to inform targeted mutation heuristics.
- Safety/fairness audits: Check whether judging amplifies biases (e.g., code style, naming conventions, language of comments) and propose mitigations (style-blind judging, code normalization).
- Standardized TLE handling: Address the 1% CF local-vs-official TLE discrepancy by standardizing timeouts/hardware or normalizing runtime, and measure sensitivity of conclusions to TLE thresholds.
Practical Applications
Immediate Applications
These applications can be deployed with current LLMs and infrastructure, especially in domains with objective correctness. They derive directly from OpenDeepThink’s population-based, pairwise Bradley–Terry (BT) selection and feedback-driven mutation.
- Sector: Software engineering (program synthesis, competitive programming)
  - Use case: Backend “parallel reasoning” module for AI coding assistants that improves acceptance of algorithmic solutions without hidden test cases.
  - Tools/workflows:
    - Orchestrate n-way sampling → K pairwise comparisons per candidate with randomized order → BT ranking → feedback-driven mutation of top 75% → dense final comparisons for selection.
    - Provide a microservice for BT aggregation and a “critique aggregator” to feed negative feedback into mutation prompts.
    - Default budget: ~285 API calls/problem, 8 sequential rounds, ~27 minutes wall-clock under full parallelization (tunable).
  - Assumptions/dependencies:
    - Objective tasks with reliable LLM pairwise judging (e.g., data structures/algorithms, competitive programming).
    - Base model must have non-trivial pass@1; the method amplifies partial competence more than it creates new capabilities.
    - Parallel compute available; latency tolerance ≥ tens of minutes or use over-provisioning to cut tails.
- Sector: Software engineering (code review & patch triage)
  - Use case: Rank and select among multiple LLM-generated bug fixes or micro-optimizations when tests are sparse or incomplete.
  - Tools/workflows:
    - Submit multiple patch proposals; run pairwise comparisons framed as “likelihood to pass code review/performance constraints”; aggregate via BT; rewrite top patches with aggregated negative critiques.
  - Assumptions/dependencies:
    - Works best when correctness/performance constraints can be judged relatively (e.g., clearer asymptotic/invariants).
    - Introduce lightweight human-in-the-loop gating for production merges.
- Sector: Data/ML engineering
  - Use case: Ranking and refining SQL/query generation, data transformation scripts, or ETL snippets when gold outputs are not fully available.
  - Tools/workflows:
    - Generate N candidate scripts; pairwise judge which is more likely to satisfy schema constraints/performance; BT rank; mutate top candidates with critique.
  - Assumptions/dependencies:
    - Objective criteria (schema conformance, runtime, deterministic outputs) are expressible in comparison prompts; add limited executable checks where safe.
- Sector: Education (CS/algorithms)
  - Use case: Tutor that presents multiple solution strategies for a problem, explains pairwise critiques, and converges to an improved solution.
  - Tools/workflows:
    - Classroom/assignment assistant that shows students BT-ranked solutions and the negative feedback that guided mutations; supports reflection on failure modes (e.g., “TLE due to O(k^2)”).
  - Assumptions/dependencies:
    - Problems with objective evaluation (programming, math). Keep subjective writing tasks out of this mode.
- Sector: Evaluation and benchmarking
  - Use case: Model selection with limited labels using pairwise judgments and BT, a portable “soft verifier” for open-ended outputs.
  - Tools/workflows:
    - Arena-style evaluation harness using pairwise LLM judgments with BT aggregation (similar to Chatbot Arena) to compare model variants or prompts.
    - Deploy CF-73 dataset for competitive-programming evaluation with near-official agreement; integrate into CI for regular scorecards.
  - Assumptions/dependencies:
    - Pairwise judgement reliability varies by domain; objective tasks favored.
- Sector: Operations/Platform engineering
  - Use case: Parallel inference scheduler for “8-stage” population reasoning.
  - Tools/workflows:
    - Batch-parallel orchestration with fixed sequential depth (8 rounds); early heavy mutation (gen-0→gen-1), followed by a denser final comparison (M≈10) to extract residual gains.
  - Assumptions/dependencies:
    - API rate limits, cost budgets, and timeout handling (retry invalid JSON once; otherwise tie). Randomize presentation order to reduce position bias.
- Sector: Research & academia
  - Use case: Study of LLM-as-judge reliability and selection mechanisms; teaching demos on pairwise vs pointwise judging.
  - Tools/workflows:
    - Run the released pipeline and CF-73 to replicate results; ablate K, n, T, M; analyze when pairwise helps vs hurts across domains.
  - Assumptions/dependencies:
    - Availability of base models and compute; careful domain curation to avoid subjective tasks where degradation is likely.
- Sector: Product design for LLM UX
  - Use case: Multi-draft email/answer assistance with “debate-style” pairwise selection in objective mini-tasks (e.g., math steps, structured answers).
  - Tools/workflows:
    - Generate several drafts; pairwise compare on explicit rubrics (“mathematical correctness,” “constraint satisfaction”), use BT to select and refine.
  - Assumptions/dependencies:
    - Keep subjective quality judgments minimal or add human confirmation; rely on negative feedback to drive rewrite prompts.
- Cross-cutting guidance for self-improvement loops
  - Use case: Replace pointwise self-refinement with negative-feedback-driven mutation across a population.
  - Tools/workflows:
    - Incorporate only negative critiques into rewrite prompts to maximize rescue rate; permit “license to abandon” the current approach in mutation prompts.
  - Assumptions/dependencies:
    - Negative feedback dominates; positive feedback adds little; K≈4 pairwise contrasts per candidate is a good operating point before diminishing returns.
Long-Term Applications
These require further research, scaling, better judges, or hybrid verifiers—especially beyond programming or where correctness is ambiguous.
- Sector: General-purpose agentic systems (planning, robotics, operations research)
  - Use case: Population-based planning with BT selection when dense, true reward signals are sparse or expensive.
  - Tools/workflows:
    - Parallel generation of plans/policies; pairwise judge with domain-specific rubrics; BT rank; mutate using critique to explore drastically different strategies.
  - Assumptions/dependencies:
    - Requires reliable comparative judging proxies or partial simulators; integrate external sensors/simulations as partial verifiers to regularize the soft verifier.
- Sector: Scientific computing and discovery
  - Use case: Evolving programs/conjectures (e.g., symbolic math, algorithm discovery) with “soft verification” via pairwise assessment where ground truth is costly.
  - Tools/workflows:
    - Combine pairwise-BT with targeted simulations or spot-checks; escalate promising candidates to stronger/expensive evaluators.
  - Assumptions/dependencies:
    - Judge reliability must track objective criteria; hybridize with programmatic checks to avoid reinforcing spurious patterns.
- Sector: Content generation (reports, policy drafts, legal memos)
  - Use case: Multi-draft refinement with human-in-the-loop pairwise judging and BT to converge on higher-quality outputs.
  - Tools/workflows:
    - Human reviewers provide pairwise preferences or rubric-based comparisons; BT aggregator ranks drafts; LLM mutates using human critiques.
  - Assumptions/dependencies:
    - Pure LLM-as-judge degrades on subjective tasks; require human preferences or calibrated reward models to guide selection safely.
- Sector: Model training and distillation
  - Use case: Train compact “selector” or reward models from BT-aggregated pairwise judgments; distill population reasoning into single-pass models.
  - Tools/workflows:
    - Log pairwise comparisons, BT scores, and mutation histories; train learned aggregators or RLHF-style rewards; fine-tune models to emulate BT-selected outputs.
  - Assumptions/dependencies:
    - Needs large-scale, high-quality pairwise logs; care to prevent bias amplification.
- Sector: Governance, policy, and AI assurance
  - Use case: Procurement and audit frameworks recommending pairwise/BT evaluation for objective AI tasks; red lines for subjective domains.
  - Tools/workflows:
    - Standardized evaluation protocols: require pairwise preference tests with BT aggregation for objective benchmarks; mandate human oversight when tasks are subjective or safety-critical.
  - Assumptions/dependencies:
    - Evidence base correlating judge reliability with domain objectivity; compute budgets and transparency around selection pipelines.
- Sector: Systems and hardware
  - Use case: Specialized orchestration/accelerators for population-based, parallel test-time compute (PTC) with shallow sequential depth.
  - Tools/workflows:
    - Scheduler libraries (8-round pipelines), model-parallel backends, caching/early-stopping heuristics; potential hardware-aware batching for pairwise judging.
  - Assumptions/dependencies:
    - Economic viability (API cost, latency); support for high fan-out parallel calls.
- Sector: Data-centric AI
  - Use case: BT-driven de-duplication and quality selection of synthetic datasets (e.g., instruction data), using pairwise comparisons between candidate data points.
  - Tools/workflows:
    - Generate multiple synthetic candidates per prompt; pairwise judge for correctness/consistency; keep elites and mutate middle ranks.
  - Assumptions/dependencies:
    - Strong rubrics for objective correctness; risk of bias propagation if judges are unreliable.
- Sector: Hybrid verification pipelines
  - Use case: Two-stage selection where BT acts as a “soft triager” to narrow to a small set, followed by expensive/verifiable checks (tests, SMT solvers, simulators).
  - Tools/workflows:
    - Stage 1: Broad population + BT selection; Stage 2: Run rigorous verifiers on BT top-k; iterate with critiques.
  - Assumptions/dependencies:
    - Availability of partial verifiers; careful budgeting to maximize coverage under compute constraints.
- Sector: Education (beyond programming)
  - Use case: Structured reasoning tutors in math/physics that evolve solutions and surface critiques; limited trials in biology.
  - Tools/workflows:
    - Multi-solution generation, pairwise critiques focused on objective rubrics, BT ranking, mutation with license to restart.
  - Assumptions/dependencies:
    - Maintain domain boundaries to objective topics; evaluate with ground-truth solutions for safety.
Notes on feasibility across all applications:
- Domain reliability is key: Gains concentrate where pairwise LLM judgment aligns with ground truth (mathematics, programming, physics); performance can degrade in subjective domains (humanities, social sciences) unless humans or calibrated reward models are in the loop.
- Cost/latency: ~285 API calls per problem in the reference configuration; requires parallelization and tolerance for ~27 minutes wall-clock or engineering to reduce latency (e.g., over-provisioning, early-drop of stragglers).
- Model dependence: Validated on Gemini-family models; transfer to other architectures likely but unproven without retuning.
- Population dynamics: The method amplifies existing capability; problems never solved at gen-0 are rarely “rescued” by evolution alone.
- Prompting details matter: Randomize presentation to reduce position bias; route negative critiques; preserve top quartile as elites; allow “abandon strategy” mutations. Hyperparameters (n, K≈4, T≈3, M≈10) provide a strong baseline but may need task-specific adjustment.
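The pairing and order-randomization advice can be illustrated with a short sketch. This is one simple way to draw a random K-regular set of pairs (a shuffled ring with offsets 1..K/2), not necessarily the sampler used in the paper. The function name is hypothetical, and the construction assumes K is even and smaller than the population size.

```python
import random

def random_k_regular_pairs(n, K, rng):
    """Return n*K/2 ordered pairs in which every candidate appears exactly K times.

    Construction (an assumption, not the paper's sampler): shuffle the
    candidates onto a ring and link each one to its next K/2 neighbours.
    The order inside each pair is flipped at random so that neither
    candidate is systematically shown first to the judge.
    """
    if K % 2 != 0 or K >= n:
        raise ValueError("this sketch needs an even K with K < n")
    ring = list(range(n))
    rng.shuffle(ring)
    pairs = []
    for offset in range(1, K // 2 + 1):
        for i in range(n):
            a, b = ring[i], ring[(i + offset) % n]
            if rng.random() < 0.5:               # randomize presentation order
                a, b = b, a
            pairs.append((a, b))
    return pairs

pairs = random_k_regular_pairs(20, 4, random.Random(0))
print(len(pairs))  # 40
```

The ring construction guarantees no self-pairs and no repeated pairs, at the cost of not sampling uniformly over all K-regular pairings. An adaptive pairing policy would replace this function.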
Glossary
- AC/WA: Abbreviations for Accepted and Wrong Answer verdicts returned by programming judges. "Verdict labels (AC/WA) are shown for post-hoc analysis only; the pipeline operates without access to any ground-truth signal."
- additive shift invariance: A property where adding a constant to all parameters does not change model likelihood or outcomes. "the BT log-likelihood is invariant under additive shifts of s(t)"
- Bernoulli: A probability model for binary outcomes (success/failure) used as a likelihood in inference. "the per-problem likelihood is Bernoulli"
- best-of-N sampling: Strategy that draws multiple candidates and picks the best by some selector. "Best-of-N sampling parallelizes naturally but shifts the bottleneck to selection."
- Binomial: A discrete distribution modeling the number of successes in a fixed number of independent trials. "the per-problem likelihood is Binomial with n = 20 independent gen-0 samples and k accepted"
- bootstrap resampling: A statistical method that estimates uncertainty by resampling data with replacement. "Confidence intervals are obtained by bootstrap resampling."
- Bradley-Terry aggregation: A paired-comparison model that converts pairwise win/loss/tie data into a global ranking. "Bradley-Terry aggregation [Bradley and Terry, 1952] of the comparison outcomes into a global ranking"
- Bradley-Terry score vector: The set of latent skill/quality scores estimated by the Bradley–Terry model. "we fit the Bradley-Terry (BT) score vector s(t) [Bradley and Terry, 1952] under"
- BT top-1: The accuracy of the highest-ranked candidate under Bradley–Terry aggregation. "while BT top-1 measures whether the Bradley-Terry winner is accepted"
- chain of thought: An explicit step-by-step reasoning trace generated by an LLM. "The dominant paradigm extends the model's chain of thought"
- Codeforces Elo: An Elo-style rating adapted to Codeforces problem difficulty and model solve rates. "OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points"
- Cognitive Well: A failure mode where iterative refinement converges to a confident but wrong solution that the grader fails to reject. "identify this failure mode as the 'Cognitive Well'"
- elite preservation: An evolutionary strategy where top-performing candidates are carried forward unchanged. "the top 25% of candidates are preserved as elites"
- feedback-driven mutation: Updating candidates using targeted critiques or feedback to produce improved variants. "feedback-driven mutation of non-discarded candidates"
- Gaussian prior: A normal distribution used as a prior over parameters in Bayesian estimation. "under a Gaussian prior N(3100, 500^2)"
- hyperparameters: Tunable configuration values controlling the algorithm’s behavior and budget. "The pipeline has four hyperparameters: population size n, per-generation comparisons per candidate K, number of evolution generations T, and final-round comparisons per candidate M."
- K-regular matching: A pairing scheme where each item is matched to exactly K others (degree K) without duplicates. "Pairwise comparisons sample a random K-regular matching without self-pairs"
- L-BFGS: A limited-memory quasi-Newton optimization algorithm for large-scale problems. "with L-BFGS [Liu and Nocedal, 1989]"
- ℓ2 penalty: Quadratic regularization term that discourages large parameter values for stability. "adding a small ℓ2 penalty"
- logistic sigmoid: The S-shaped function used to map score differences to probabilities in Bradley–Terry. "σ is the logistic sigmoid"
- majority vote: Aggregation by selecting the answer or label with the most votes among samples. "majority vote rises by +3.1 points"
- maximum a posteriori (MAP): Bayesian point estimation that maximizes the posterior probability. "We estimate Rmodel by maximum a posteriori (MAP) under a Gaussian prior"
- maximum-likelihood estimation: Parameter estimation by maximizing the likelihood of observed data. "Bradley-Terry maximum-likelihood estimation converts noisy pairwise votes into stable rankings at scale."
- Monte-Carlo simulation: Computational experiments using repeated random sampling to estimate metrics or dynamics. "Monte-Carlo simulation (500 trials per cell, 40 pre-judged candidates per problem)."
- online judge: An automated system that compiles and executes code against hidden tests, returning accept/reject. "a hidden online judge that returns binary accept/reject verdicts."
- oracle (pass@20): An upper bound based on whether any of N initial samples solve the problem. "Oracle, shown in gray, is the gen-0 pass@20 score"
- paired bootstrap: A bootstrap method resampling paired observations to assess confidence intervals on paired differences. "95% CI of gain: [16,39] pp, paired bootstrap"
- pairwise comparison: Evaluating two candidates against each other to decide which is better. "Pairwise comparison design."
- pass@1: Probability that a single random sample solves the problem (one-shot success rate). "Pass@1 is the empirical accept rate of a single unranked gen-0 sample"
- pass@20: Probability that at least one of 20 samples solves the problem (used as an oracle upper bound). "Oracle, shown in gray, is the gen-0 pass@20 score"
- pointwise scoring: Judging each candidate in isolation rather than comparatively, often with bias. "pointwise scores are noisy and positively biased"
- population-based: An approach maintaining and evolving multiple candidates in parallel rather than a single trajectory. "a population-based test-time compute framework"
- pretraining contamination: Leakage where evaluation items appear in training data, inflating apparent performance. "making pretraining contamination unlikely."
- regularized log-likelihood: A likelihood objective augmented with a penalty term to improve stability or generalization. "maximizing the regularized log-likelihood of the observed comparisons"
- Self-Refine: A self-improvement method where a model iteratively critiques and rewrites its own outputs. "six rounds of Self-Refine [Madaan et al., 2023]"
- self-consistency: A technique that samples multiple reasoning traces and selects the majority answer. "Self-consistency [Wang et al., 2022] parallelizes naturally by sampling multiple traces and selecting the majority answer"
- sequential depth: The number of serial model-calls (rounds) required, affecting wall-clock latency. "The pipeline's sequential depth is eight LLM calls."
- soft verifier: A learned or heuristic judging mechanism that approximates a verifier without ground-truth tests. "we exploit it here as a soft verifier"
- Time Limit Exceeded (TLE): A program fails due to exceeding the allowed execution time. "The 1% of disagreements between local and official verdicts are exclusively near-threshold TLE cases"
- value function: A learned estimator that scores partial reasoning states or steps in search. "searching over reasoning steps with a learned value function"
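As a small illustration of the pass@1 and oracle (pass@20) entries above, the sketch below computes both from per-problem counts of accepted gen-0 samples. The function names and the example counts are hypothetical; only the definitions come from the glossary.

```python
def pass_at_1(accepted_counts, n=20):
    """Average accept rate of a single unranked gen-0 sample across problems."""
    return sum(k / n for k in accepted_counts) / len(accepted_counts)

def oracle_pass_at_n(accepted_counts):
    """Fraction of problems where at least one gen-0 sample was accepted."""
    return sum(1 for k in accepted_counts if k > 0) / len(accepted_counts)

# Hypothetical data: accepted samples out of 20 for four problems.
counts = [0, 5, 20, 10]
print(pass_at_1(counts))         # 0.4375
print(oracle_pass_at_n(counts))  # 0.75
```

The gap between the two numbers is the headroom that a selection step can recover. A problem with zero accepted gen-0 samples contributes to neither, which ties in with the cold-start limitation noted earlier.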