LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
Abstract: As LLMs are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
Explain it Like I'm 14
Overview
This paper introduces LongCoT, a big test (a benchmark) designed to see how well today’s large language models (LLMs) can keep a clear, accurate chain of thought over a long stretch of reasoning. Think of it like checking whether someone can follow a complicated set of instructions, make plans, remember important details, notice mistakes, and fix them, all while solving a multi-part problem that takes many steps.
What were they trying to find out?
The main question is simple: How accurately can top AI models think through many connected steps without losing track?
More specifically, the authors wanted to isolate “long-horizon chain-of-thought” reasoning. That means:
- Planning ahead and sometimes backtracking when a plan fails
- Remembering important facts from earlier steps
- Checking their own work and fixing errors
- Connecting later mistakes to earlier choices
They wanted to measure this ability directly, not by giving models lots of tools or super long inputs, but by focusing on whether the model itself can keep a long, coherent train of thought.
How did they test it?
They built a benchmark called LongCoT with 2,500 expert-made problems across five areas: mathematics, chemistry, computer science, chess, and logic. Each problem:
- Starts with a short prompt
- Has a final answer that can be automatically checked
- Requires many small, connected steps to reach the final answer
The trick is that each individual step is manageable for current models, but the whole chain is long and interdependent—like solving a puzzle where every piece affects the next.
Two kinds of problem structures
To make models reason over many steps, the authors used two templates:
- Explicit (compositional): The problem shows a map of sub-steps with arrows (like a recipe or a flowchart). Types include:
  - Linear chains: Step A → B → C
  - Graphs with merging branches: later steps need answers from multiple earlier steps
  - Conditionals: different paths depending on earlier results (if/else)
  - Forced backtracking: you must try inputs until you find one that produces the required output
- Implicit (procedural): The rules are given, but the model must discover the structure (like playing a game). Examples:
  - Game trees (like chess, where each move branches)
  - Constraint puzzles (like Sudoku)
  - State-space searches (like navigating a maze)
  - General search problems (finding the best sequence of actions)
In both styles, the difficulty comes from keeping the long chain of reasoning organized and accurate, not from any single step being too hard.
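To make the explicit (compositional) template concrete, here is a minimal sketch of a tiny dependency graph that contains a linear chain, a merge, a conditional, and a forced-backtracking step. The node names, the arithmetic inside each step, and the helper functions `evaluate` and `backtrack` are all hypothetical illustrations, not problems or code from the benchmark.

```python
from graphlib import TopologicalSorter

# Hypothetical toy problem: node -> (dependencies, function of their values).
PROBLEM = {
    "a": ((), lambda: 7),                                  # source step
    "b": (("a",), lambda a: a * 3),                        # linear chain: a -> b
    "c": (("a",), lambda a: a + 5),                        # second branch from a
    "d": (("b", "c"), lambda b, c: b - c),                 # merge: needs b and c
    "e": (("d",), lambda d: d * 2 if d % 2 else d // 2),   # conditional on d
}

def evaluate(problem):
    """Solve every node in dependency order and return all intermediate values."""
    graph = {name: set(deps) for name, (deps, _) in problem.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = problem[name]
        values[name] = fn(*(values[d] for d in deps))
    return values

def backtrack(fn, target, candidates):
    """Forced backtracking: try inputs until one produces the required output."""
    for x in candidates:
        if fn(x) == target:
            return x
    return None

values = evaluate(PROBLEM)
# A non-invertible step: several inputs map to the same output, so the
# input has to be found by search rather than computed directly.
found = backtrack(lambda x: (x * x) % 11, target=3, candidates=range(11))
```

A program gets every intermediate value for free from the `values` dictionary. The benchmark asks a model to carry the same bookkeeping inside its own chain of thought, where one wrong intermediate value silently corrupts everything downstream.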
Domains covered
The problems span:
- Mathematics: competition-style subproblems chained together
- Chemistry: building and analyzing molecules over multiple reactions
- Computer Science: simulating programs and systems step-by-step
- Chess: best moves, planning, and retrograde puzzles
- Logic: planning, pathfinding, and constraint satisfaction puzzles
Important design choices
- Short inputs, long outputs: The starting prompt is short, but solving the problem usually requires tens of thousands of “tokens” (chunks of text), which is like writing dozens of pages of reasoning.
- No external tools in the main test: They wanted to test the model’s own reasoning, not its ability to use calculators, code runners, or special plugins.
- Automatic checking: Each final answer can be verified (e.g., does the molecule string match? Is the number right? Is the move sequence legal?).
- Controlled step difficulty: Each small step is solvable alone; the challenge is keeping everything straight over a long chain.
- Two tracks: The main benchmark (hard) and an easier subset called LongCoT-mini to compare more models.
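The automatic-checking idea can be sketched as a small outcome-only verifier. The version below is an assumption-laden illustration using only the Python standard library: the function names are hypothetical, and it handles only numeric answers and whitespace-normalized text. It is not the benchmark's grading code, which relies on domain tools (for example, cheminformatics and chess engines) that are not reproduced here.

```python
import math

def normalize(text: str) -> str:
    """Collapse runs of whitespace so formatting differences do not cause spurious failures."""
    return " ".join(text.split())

def verify_numeric(predicted: str, expected: float, rel_tol: float = 1e-9) -> bool:
    """Check a numeric final answer within a relative tolerance."""
    try:
        value = float(predicted.strip())
    except ValueError:
        return False  # unparseable answers count as wrong
    return math.isclose(value, expected, rel_tol=rel_tol)

def verify_exact(predicted: str, expected: str) -> bool:
    """Check a textual final answer (e.g., a move sequence) after normalization."""
    return normalize(predicted) == normalize(expected)

def accuracy(results) -> float:
    """Outcome-only accuracy: fraction of problems whose final answer verified."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0
```

Because only the final answer is checked, a response that gets 99 of 100 intermediate steps right still scores zero on that problem.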
What did they find?
- Current top models performed poorly on the full benchmark. The best model at release, GPT 5.2, got about 9.8% correct. Gemini 3 Pro got about 6.1%. Many other strong models were near zero.
- GPT 5.2 used about 62,000 output tokens per problem on average—that’s a very long chain of thought—yet still got fewer than 1 in 10 problems fully right.
- On the easier LongCoT-mini set, scores were higher (GPT 5.2 reached about 38.7%), which helps compare a wider range of models.
Why this matters: On many popular benchmarks, models score much higher. The sharp drop here suggests a core weakness: when the chain of thought gets very long, models often lose track, drift from the plan, forget earlier results, or fail to catch errors and backtrack.
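A rough way to see why long chains are fragile is to assume, purely for illustration, that every step succeeds independently with the same probability. The paper reports 95.7% subproblem accuracy for GPT 5.2 on math; the independence assumption and the chain lengths below are not from the paper, and real failures such as plan drift or forgotten results are not independent events.

```python
def chain_success(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is right, assuming independent steps."""
    return step_accuracy ** num_steps

# Illustrative chain lengths (assumed, not taken from the benchmark).
for n in (10, 50, 100):
    print(n, round(chain_success(0.957, n), 3))
```

Under this simplified model, a 10-step chain succeeds roughly 64% of the time, a 50-step chain about 11%, and a 100-step chain about 1%. So even high per-step accuracy leaves little margin once steps depend on each other.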
Why does this matter?
If we want AI to handle complex, real-world tasks—like helping with research, managing long projects, or solving multi-step problems—it needs to reliably think over long stretches. LongCoT shows that:
- Today’s models are good at short or medium chains of thought but struggle when the chain becomes very long and tightly connected.
- Simply giving models more context or tools doesn’t fully solve the problem; the model needs better “mental stamina” for planning, remembering, and self-checking.
- This benchmark gives researchers a clear, measurable way to track progress on this specific skill.
In short, LongCoT highlights a crucial gap: making AI that can keep its head straight through long, complicated reasoning—like a careful student who can plan, check work, and fix mistakes over a long assignment. It sets a clear target for future improvements in AI reasoning.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and open questions the paper leaves unresolved, framed so future researchers can act on them.
- External validity: How well does LongCoT performance predict success on real enterprise/scientific workflows that require long-horizon reasoning (e.g., multi-day projects with changing goals, noisy specifications, and partial observability)?
- Domain coverage: To what extent do the five domains (math, chemistry, chess, CS, logic) capture the breadth of long-horizon reasoning needed in other areas (e.g., biology, finance, law, multimodal tasks)? What domains are missing?
- Template representativeness: Are the explicit DAG and implicit procedural templates representative of the dependency structures found in real tasks, or do they privilege algorithmic/clean decompositions? How to validate representativeness empirically?
- Single-step tractability claim: The paper asserts atomic steps are tractable; outside of math (95.7% subproblem accuracy for GPT 5.2), there is no systematic evidence across all domains/models. Provide per-domain, per-template step-level solvability audits for multiple models.
- Step-level supervision and partial credit: Accuracy is outcome-only. How can releasing or securely gating intermediate ground truths enable partial-credit scoring, error localization, and better attribution of failures to specific steps without enabling overfitting?
- Contamination and leakage: Math subproblems were checked for contamination, but other domains (chemistry, CS, logic, chess) lack rigorous contamination audits. What is the measured overlap with public corpora used in training frontier models?
- Ambiguity handling: Some tasks can admit multiple correct outputs (e.g., tie best moves in chess, equivalent SMILES forms, non-unique schedules). Quantify ambiguity rates and expand verifiers to handle equivalence classes robustly.
- Verifier robustness: What are the false positive/negative rates of domain verifiers (RDKit, Stockfish, programmatic checks)? How are canonicalization, format normalization, and floating-point tolerances handled to prevent spurious failures?
- Closed-source trace access: With no access to CoT traces for closed models, how can one instrument or approximate step-level behaviors (e.g., via structured prompting, proxy telemetry) to compare failure modes fairly across models?
- Graph property scaling laws: How does accuracy vary with graph depth, width, branching factor, cyclic dependencies, and presence of forced backtracking/conditionals? Establish scaling laws and thresholds where models “phase transition” from success to failure.
- Token–accuracy trade-off: What are the returns to additional output tokens (e.g., diminishing returns past certain budgets)? Identify optimal token budgets per template/graph property and quantify under- and over-generation risks.
- Context window and budget ablations: Performance is confounded by provider output limits and cost. Systematically vary context/output limits to measure sensitivity and identify bottlenecks caused by truncation or budget ceilings.
- Sampling and self-consistency: Due to cost, pass@k was not explored. Quantify reliability gains from sampling (e.g., self-consistency, diverse decoding), the sample complexity for success, and cost-effective strategies.
- Planning/self-evaluation metrics: Beyond final accuracy, devise metrics for plan quality, error detection, backtracking success, and state summarization/compression effectiveness; instrument models to capture these at scale.
- Oracle intermediate ablations: Provide controlled experiments where a subset of intermediate answers is supplied (oracle) to distinguish failures due to long-horizon dependency management from domain knowledge gaps.
- Tool/agent track design: The paper notes code execution helps procedural domains but not compositional ones, without comprehensive data. Establish standardized tool-use regimes (memory, search, calculators) and quantify systematic gains and offloading boundaries.
- External memory scaffolds: Evaluate whether persistent memory, vector databases, or structured state representations reduce plan drift and state loss; identify minimal scaffolds that still isolate model reasoning capability.
- Prompt robustness: Measure sensitivity to prompt formatting, instructions (e.g., plan-then-execute, periodic summarization), and state-tracking prompts. Identify robust prompting patterns that measurably reduce long-horizon errors.
- Error taxonomy and frequencies: The paper mentions common errors (plan drift, context loss) but lacks a quantitative taxonomy. Produce a cross-domain error inventory with frequencies, triggers, and recoverability statistics.
- Baselines and upper bounds: Provide algorithmic solver baselines (with tools) and human expert baselines to calibrate problem difficulty and set achievable upper bounds; compare LLMs’ gap to these baselines.
- LongCoT-mini characterization: Clarify how LongCoT-mini differs (graph properties, difficulty scaling, domain distribution) and whether it is predictive of performance on the full benchmark; publish cross-benchmark correlation analyses.
- Sustainability against overfitting: As models train on LongCoT, how will the benchmark remain informative? Develop parameterized “evergreen” generators and holdout templates to mitigate memorization and targeted fine-tuning.
- Multilingual and multimodal generalization: Evaluate whether long-horizon failures persist across languages and modalities (e.g., diagrams, molecule images) and whether multimodal context aids or harms long-output reasoning.
- Fairness across providers: Different models have distinct output limits, hidden scaffolds, or proprietary reasoning policies. Define normalization protocols to ensure fair comparison (e.g., matched token budgets, standardized temperature/top-p).
- Realistic agent settings: Bridge pure reasoning and deployed agents by defining tool-limited but realistic tracks (e.g., calculators allowed, restricted code execution) that still stress long-horizon dependency management.
- Release of intermediate answers: Decide whether and how to release step-level ground truths (e.g., encrypted, access-controlled) to enable research on error detection, curriculum learning, and supervised training without compromising benchmark integrity.
- Mechanistic interpretability: What internal mechanisms (attention patterns, state representations) drive long-horizon failures? Instrument models to study state persistence, credit assignment, and error accumulation mechanistically.
- Transfer learning and fine-tuning: If models are fine-tuned on LongCoT-like templates, do improvements transfer to unrelated long-horizon tasks? Quantify cross-task transfer vs template-specific overfitting.
- Cost-aware evaluation: Given high costs, design lower-cost proxies or subsets that preserve ranking fidelity; evaluate stratified sampling schemes and their reliability.
- Impact of output token limits >128K: As output limits rise (e.g., 256K, 1M), do models naturally improve, or do failures persist due to cognitive/optimization limitations rather than truncation?
- Robust extraction of final answers: LLM-based extraction is a fallback. Quantify extraction errors and ensure that mis-extraction does not confound accuracy; release standardized parsers per domain.
- Parameter sensitivity of verifiers: Chess verifiers (engine depth, tablebase coverage) and chemistry verifiers (stereochemistry, tautomer handling) can alter correctness judgments. Document and ablate verifier parameter choices.
- Benchmark governance: Define versioning, licensing, and update policies to prevent unintended training leakage and maintain comparability across time; specify how new templates/problems are added without breaking historical results.
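For the sampling and self-consistency gap listed above, a follow-up study would need two small pieces of machinery, sketched here. The pass@k function uses the standard unbiased estimator from n samples with c correct, 1 − C(n−c, k)/C(n, k). The majority-vote helper is one simple form of self-consistency. Both function names are hypothetical, and neither experiment was run in the paper.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers):
    """Self-consistency: return the most frequent final answer among samples."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

With average outputs in the tens of thousands of tokens per problem, each extra sample is costly, so the choice of n and k would itself be a budget decision.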
Practical Applications
Immediate Applications
The paper introduces LongCoT and LongCoT-mini as direct tools for measuring and stress-testing long-horizon chain-of-thought (CoT) reasoning in LLMs. Below are immediate, deployable applications that leverage the benchmark, its methodology, and the findings.
Industry
- Model selection and procurement gating for long-running AI features
- Sectors: Software, healthcare, finance, energy, enterprise IT
- Use: Add a “LongCoT score threshold” to model RFPs and internal evaluation checklists for assistants/agents expected to plan, backtrack, and maintain state over many steps.
- Tools/workflows: “LongCoT Gate” CI step in MLOps pipelines; LongCoT-mini for low-cost triage of open-source models; per-domain dashboards (math/chem/CS/logic) to map to business tasks.
- Assumptions/dependencies: Access to model APIs with sufficient output token limits; evaluation compute budget; acceptance that no-tool tests approximate underlying capability even if production agents use tools.
- Pre-deployment red-teaming for agentic systems
- Sectors: Web agents, code agents, data processing pipelines
- Use: Detect drift, premature convergence, uncorrected errors, or loss of state across long CoT traces before enabling autonomous modes.
- Tools/workflows: “HorizonGuard” suite that runs LongCoT scenarios, measures plan coherence and error recovery; integrates error taxonomies (planning, context loss, backtracking failure).
- Assumptions/dependencies: Some vendors do not expose CoT traces; outcomes can still be verified, but deeper diagnosis is easier with trace access.
- Benchmark-driven inference policy and cost planning
- Sectors: Cloud AI platforms, AIOps
- Use: Use LongCoT token lengths and accuracy curves to set max-output-token policies, early stopping criteria, and retry strategies for long-running tasks.
- Tools/workflows: “Horizon Budgeter” to forecast compute spend vs. accuracy for long outputs; automated guardrails that abort unproductive long traces.
- Assumptions/dependencies: Accurate token metering; stable provider pricing.
- Curriculum generation for training and finetuning “long-horizon stamina”
- Sectors: Foundation model labs, enterprise model teams
- Use: Generate LongCoT-style compositional DAGs and implicit search tasks at increasing scales to train models on planning, state maintenance, and backtracking.
- Tools/products: Curriculum generators derived from templates; RL/CoT distillation on verifiable problems; data synthesis for memory/credit-assignment training.
- Assumptions/dependencies: Risk of overfitting to benchmark style must be managed via holdout templates and domain diversification.
- QA for chemistry- and code-adjacent features
- Sectors: Pharma/chemicals, software engineering
- Use: Use chemistry DAGs as surrogates for multi-step synthesis planning QA; use CS templates to test exact long-step simulation (schedulers, distributed systems, type inference) before enabling automated changes.
- Tools/products: “TraceScope” that checks intermediate invariants; cross-validation with domain verifiers (RDKit, Stockfish, program simulators).
- Assumptions/dependencies: In production, tools will be used; nevertheless, underlying long-CoT capability correlates with robustness even with tools.
- Vendor benchmarking and model portfolio management
- Sectors: Enterprises using multiple LLMs
- Use: Maintain a leaderboard of vendor models on LongCoT and LongCoT-mini to allocate tasks: short-form vs. long-horizon workloads; route complex cases to the few models that perform best.
- Tools/workflows: Routing policies based on “horizon complexity scores”; A/B evaluations on LongCoT-mini for frequent audits.
- Assumptions/dependencies: Routing logic and latency constraints; ongoing dataset integrity.
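The "LongCoT Gate" idea from the first item above could look roughly like the sketch below, which turns per-problem results into per-domain accuracy and compares each against a required minimum. The record format, the threshold values, and the function names are hypothetical. The benchmark does not prescribe thresholds, and choosing them is a policy decision for the adopting team.

```python
from collections import defaultdict

def per_domain_accuracy(records):
    """records: iterable of (domain, is_correct) pairs from one benchmark run."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, attempted]
    for domain, ok in records:
        totals[domain][0] += int(ok)
        totals[domain][1] += 1
    return {d: correct / attempted for d, (correct, attempted) in totals.items()}

def gate(records, thresholds):
    """Return (passed, failures); failures maps domain -> (score, required)."""
    scores = per_domain_accuracy(records)
    failures = {
        domain: (scores.get(domain, 0.0), required)
        for domain, required in thresholds.items()
        if scores.get(domain, 0.0) < required  # a missing domain counts as 0.0
    }
    return not failures, failures

# Hypothetical run and thresholds, for illustration only.
run = [("math", True), ("math", False), ("logic", False), ("logic", False)]
passed, failures = gate(run, {"math": 0.5, "logic": 0.25})
```

In a CI pipeline, the step would exit with a non-zero status when `passed` is false and print `failures`, so reviewers can see which domain blocked the release.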
Academia
- Measuring progress on fundamental long-horizon reasoning
- Use: Standardized, verifiable evaluation for research on memory architectures, planning modules, test-time search, and self-correction.
- Tools/workflows: Ablation suites that correlate architectural changes with LongCoT performance; research baselines and leaderboards by domain and horizon length.
- Assumptions/dependencies: Compute to run long-output evaluations; fair use of the benchmark.
- Diagnostics for failure modes and capability evaluations
- Use: Analyze failure types (plan drift, context loss, backtracking) across explicit/implicit templates to inform new methods (e.g., recurrent memory, credit assignment mechanisms).
- Tools/products: Error taxonomy tagging for open-source runs; visual analyzers for dependency graphs vs. predicted chains.
- Assumptions/dependencies: Access to reasoning traces preferred for granular analysis.
- Teaching and assessment in advanced problem solving
- Sectors: Education (CS, math, logic, chemistry)
- Use: Course modules on decomposition, planning, search; assignments with auto-verifiable answers; competitions.
- Tools/workflows: “LongCoT Classroom” subset with graded difficulty and instructor dashboards.
- Assumptions/dependencies: Ethical use in education; student originality safeguards.
Policy and Standards
- Evaluation standards for high-stakes, long-horizon deployments
- Sectors: Healthcare, finance, critical infrastructure, government
- Use: Include long-horizon reasoning metrics in safety certifications and procurement policies for agentic systems.
- Tools/workflows: “Long-horizon Reasoning Scorecard” attached to model attestations; minimum thresholds for autonomy levels.
- Assumptions/dependencies: Alignment between benchmark domains and target use cases; regulators accept long-output benchmarks as capability evidence.
- Incident reporting and risk categorization
- Use: Map deployment incidents to LongCoT failure categories (e.g., uncorrected long-range error propagation) to inform mitigation requirements.
- Tools/workflows: Taxonomy-aligned incident forms; postmortem checklists referencing explicit vs. implicit dependency failures.
- Assumptions/dependencies: Organizational maturity to capture and share incident data.
Daily Life
- Consumer assistant feature gating
- Use: Gate “auto-execute” or “long plan” features (trip planning, budgeting, home automation sequences) behind minimum LongCoT-mini performance.
- Tools/workflows: “Horizon Meter” UI that communicates the model’s confidence vs. plan length; automatic chunking into verifiable sub-goals.
- Assumptions/dependencies: Token limits in consumer plans; UX for communicating limitations.
- Study aids for deep problem-solving practice
- Use: Generate progressive, multi-step practice sets with verifiable answers for math/logic/CS learning apps.
- Tools/products: Adaptive “long-horizon drills” that increase dependency depth and require backtracking.
- Assumptions/dependencies: Content licensing; alignment with curricula.
Long-Term Applications
These applications require further research, scaling, integration into broader systems, or advances in model capability (given sub-10% performance on LongCoT today).
Industry
- Reliable autonomous research and engineering assistants
- Sectors: Pharma/chemicals (reaction planning), materials, software, hardware design
- Use: Agents that sustain multi-hundred-step plans with self-detection and correction of errors; long-term state tracking without tool crutches.
- Tools/products: “Self-correcting planners” trained on LongCoT-like curricula plus real domain data; hybrid systems that interleave internal long-horizon reasoning with judicious tool use.
- Assumptions/dependencies: Significant model gains in planning, memory, and credit assignment; better integration between internal reasoning and tools.
- End-to-end project orchestration agents
- Sectors: Enterprise IT, consulting, construction planning, supply chain
- Use: Agents that decompose projects into DAGs, monitor progress, backtrack on dead-ends, and adapt plans over weeks.
- Tools/workflows: DAG-aware agents with explicit dependency modeling; persistent memory and audit trails; “HorizonOps” monitors for plan drift.
- Assumptions/dependencies: Advances in long-term memory and verification; organizational readiness for partial autonomy.
- Long-horizon robotics task planning
- Sectors: Manufacturing, logistics, home robotics
- Use: Hierarchical planners that maintain invariants across extended action sequences; online backtracking and error recovery.
- Tools/products: Bridge LongCoT-style evaluations with simulated embodied tasks (e.g., long procedural puzzles → multi-stage manipulation).
- Assumptions/dependencies: Robust grounding from CoT to action; safety assurances.
- Grid, scheduling, and operations optimization
- Sectors: Energy, transportation, cloud orchestration
- Use: Agents solving large CSPs/optimizations over long horizons, managing contingencies and re-planning.
- Tools/products: Integrated planners blending implicit search templates with domain solvers; verifiable plan-execution loops.
- Assumptions/dependencies: Hybrid architectures; strong interfaces between planners and real-time operations.
Academia
- Architectures with persistent internal state and credit assignment
- Use: New model classes (e.g., recurrent memory, episodic recall, differentiable planners) validated against LongCoT scaling curves.
- Tools/workflows: Bench-driven research cycles that connect architectural innovations to measurable long-horizon gains.
- Assumptions/dependencies: Open benchmarks continue to evolve to avoid overfitting.
- Formal verification of long CoT
- Use: Proof-carrying CoT, intermediate invariant checks, and certifiable backtracking strategies for high-assurance reasoning.
- Tools/products: CoT verifiers that attach proofs or checkable traces to long reasoning outputs.
- Assumptions/dependencies: Formal methods that remain tractable for very long outputs.
- Cross-domain transfer and domain-specific LongCoT variants
- Use: Extend template methodology to law, clinical decision support, cybersecurity incident response, and policy analysis.
- Tools/workflows: Template libraries with verifiers for new sectors; shared evaluation hubs.
- Assumptions/dependencies: Availability of domain experts and verifiable ground truths.
Policy and Standards
- Autonomy tiering and certification frameworks
- Use: Map LongCoT scores to allowed autonomy levels in regulated sectors (e.g., higher autonomy requires higher long-horizon reasoning competence).
- Tools/workflows: Multi-benchmark certification regimes (LongCoT + agentic + domain-specific tests); periodic re-certification.
- Assumptions/dependencies: Regulatory adoption; consensus on thresholds.
- Compute governance informed by horizon risk
- Use: Set policy on compute budgets and oversight proportional to measured long-horizon capability and task criticality.
- Tools/workflows: Risk-weighted compute/cost ceilings, mandatory auditing for tasks surpassing horizon thresholds.
- Assumptions/dependencies: Agreement on risk models and monitoring infrastructure.
Daily Life
- Personal “life project” copilots
- Use: Assistants that manage multi-month goals (career changes, degrees, renovations) with dependable planning, backtracking, and progress tracking.
- Tools/products: DAG-based personal project managers, timeline-aware memory, checkpointing and rollback in plans.
- Assumptions/dependencies: Mature long-horizon reliability; privacy-preserving persistent memory.
- Transparent autonomy with user-in-the-loop controls
- Use: Assistants that expose plan dependencies, highlight uncertainty, and recommend when to backtrack or seek human input.
- Tools/workflows: Plan visualization and “explain-why” backtracking triggers; mixed-initiative controls.
- Assumptions/dependencies: Usable UX for complex dependency graphs; accurate uncertainty estimation.
Notes on feasibility across applications
- Core dependency: Current frontier models achieve <10% on LongCoT; many long-term applications require substantial capability advances.
- Token and cost limits: LongCoT tasks often demand tens of thousands of output tokens; deployment must consider budget and latency.
- Tool interaction: While LongCoT isolates model-internal reasoning, most real systems will combine internal CoT with tools; benchmarks can still serve as capability floors, but expect domain-specific adaptation.
- Generalization risk: Avoid overfitting to specific templates by using held-out variants, evolving benchmarks, and multi-benchmark evaluation.
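The token and cost point above can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the 2,500-problem benchmark size and the roughly 62,000 average output tokens per problem reported for GPT 5.2. The price per million output tokens is a placeholder assumption, not a real provider rate, and the estimate ignores input tokens, retries, and truncated runs.

```python
def estimate_run(num_problems: int, avg_output_tokens: int,
                 usd_per_million_output_tokens: float):
    """Return (total_output_tokens, estimated_cost_usd) for one full pass."""
    total_tokens = num_problems * avg_output_tokens
    cost = total_tokens / 1_000_000 * usd_per_million_output_tokens
    return total_tokens, cost

# 10.0 USD per million output tokens is a hypothetical placeholder price.
tokens, cost = estimate_run(2_500, 62_000, usd_per_million_output_tokens=10.0)
```

At these figures, a single pass produces about 155 million output tokens. That is why the smaller LongCoT-mini subset is the natural choice for frequent audits.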
Glossary
- Adversarial branching: Branching in a game/search tree caused by opposing agents choosing adversarial moves. "a game tree with adversarial branching (e.g., chess)"
- Agentic benchmarks: Evaluations that measure models acting as agents with tools and workflows, not just pure reasoning. "Agentic benchmarks evaluate complex multi-step workflows, but domain-specific tool use and scaffolding dominate improvements"
- Blocks World: A classic AI planning domain involving stacking and unstacking blocks to reach a goal configuration. "planning tasks (Sokoban, Blocks World)"
- Chain-of-Thought (CoT): The explicit, step-by-step intermediate reasoning produced by a model. "planning and managing a long, complex chain-of-thought (CoT)."
- Cheminformatics: The use of computational methods and software for chemical data and molecular structures. "These are verified using cheminformatics tools"
- Compositional template: A template that explicitly composes subproblems with dependencies to form larger tasks. "A compositional template specifies an explicit dependency DAG."
- Compiler passes: Sequential transformations over program representations performed by a compiler. "and stepping through type inference or compiler passes."
- Constraint graph: A graph where nodes/variables are connected by constraints that restrict joint assignments. "These graphs can be DAGs, search trees, cyclic graphs, constraint graphs, or execution traces."
- Constraint satisfaction problem (CSP): A problem of assigning values to variables to satisfy a set of constraints. "or a set of constraints for CSPs"
- Credit assignment: Identifying which prior steps caused success or failure in long reasoning chains. "perform credit assignment by linking errors or progress to specific prior steps."
- Directed acyclic graph (DAG): A directed graph with no cycles, often used to represent dependencies. "explicit dependency DAG."
- Distributed memory systems: Computing architectures where memory is partitioned across multiple nodes. "simulating distributed memory systems"
- Endgame tablebases: Exhaustive databases of solved chess endgame positions with perfect play. "endgame tablebases"
- Execution traces: Sequences of states or operations produced by running a program or process. "These graphs can be DAGs, search trees, cyclic graphs, constraint graphs, or execution traces."
- Factor graph: A bipartite graph representing factorization of constraints or functions over variables. "a constraint/factor graph (e.g., Sudoku)"
- Game tree: A tree representing all possible move sequences in a game from a given state. "The dependency graph is a game tree that emerges from the rules."
- Max-flow: The optimization problem of finding the maximum feasible flow in a network. "executing graph algorithms (max-flow)"
- Memoization: Caching results of computations to avoid repeated work in recursive/search procedures. "prunable via minimax with memoization."
- Minimax: A game-theoretic search algorithm optimizing against a worst-case (adversarial) opponent. "prunable via minimax with memoization."
- Non-invertible function: A function that cannot be uniquely inverted to recover inputs from outputs. "non-invertible function."
- Parameterized templates: Problem templates with adjustable parameters to systematically generate instances. "parameterized templates"
- Pass@k: An evaluation metric measuring success within k independent attempts/samples. "pass@k or self-consistency experiments."
- RCSB Protein Data Bank: A public repository of experimentally determined 3D biomolecular structures. "RCSB Protein Data Bank"
- Retrograde analysis: Inferring preceding game moves or states from a given position. "retrograde analysis (determining which moves could have led to a board state)"
- SMILES strings: A line-notation format for representing molecular structures as strings. "Final answers are either SMILES strings"
- Sokoban: A planning puzzle where an agent pushes boxes to targets in a grid with strict movement constraints. "planning tasks (Sokoban, Blocks World)"
- State-transition graph: A graph whose nodes are states and edges are allowable transitions under rules. "a state-transition graph (planning)"
- Stockfish: An open-source chess engine used for analysis and problem generation/verification. "using Stockfish, endgame tablebases, and exhaustive enumeration of board states."
- Stereochemistry: The 3D spatial arrangement of atoms in molecules and its chemical implications. "stereochemistry analysis"
- Transition relation: A formal relation specifying valid moves between states in a system. "a transition relation for planning/simulation"
- Type inference: Automatically deducing the types of expressions in a programming language. "type inference"
- USPTO forward synthesis data: Reaction data from U.S. patent records used to validate synthetic routes. "USPTO forward synthesis data"
- Verifier: A function that checks whether a model’s final answer is correct for a given problem. "a domain-specific verifier "
- VLIW processor: Very Long Instruction Word architecture that issues multiple operations in parallel per instruction. "scheduling instructions on parallel architectures (VLIW processor)"
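As an illustration of the "Minimax" and "Memoization" entries, the sketch below solves a toy two-player game by win/loss minimax and caches results per position. The game (players alternately remove 1 to 3 stones; whoever takes the last stone wins) is an assumption chosen for brevity. The paper's own adversarial example is chess, which is far too large to enumerate this way.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # memoization: each pile size is solved only once
def current_player_wins(stones: int) -> bool:
    """True if the player to move can force a win from this pile size."""
    if stones == 0:
        return False  # no stones left: the previous player took the last one
    # Minimax over win/loss outcomes: a position is winning if some move
    # leads to a position that is losing for the opponent.
    return any(
        not current_player_wins(stones - take)
        for take in (1, 2, 3)
        if take <= stones
    )
```

Without the cache, the same pile sizes would be re-explored along many move orders. With it, the work is linear in the pile size, which is the pruning effect the glossary quote refers to.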