
Self-Play & Iterative Conjecturing (STP)

Updated 31 May 2026
  • Self-Play and Iterative Conjecturing (STP) is a meta-learning framework where agents generate synthetic challenges at the edge of their abilities and iteratively refine their problem-solving skills.
  • It integrates probabilistic logic, reinforcement learning, and game-theoretic principles to drive self-play in automated theorem proving, vision-language reasoning, and LLM tasks.
  • Mechanisms such as Memory-Augmented Penalty (MAP), Skill-Aware Measurement (SAM), and Wasserstein re-weighting promote challenge diversity and curriculum adaptation, helping to prevent performance plateaus.

Self-Play and Iterative Conjecturing (STP) is a meta-learning paradigm inspired by human mathematical practice, where problem-solving agents iteratively propose new conjectures (subtasks, theorems, questions) at the edge of their current capability and simultaneously train themselves to solve these challenges, creating a closed reinforcement curriculum. This framework integrates ideas from probabilistic logic, reinforcement learning, curriculum generation, and self-improving AI, with formal instantiations in automated theorem proving, mathematical reasoning, vision-language models (VLMs), and general LLM reasoning tasks. Distinctive features of STP include dynamic creation of synthetic challenges tailored to current proficiency, feedback-driven iterative improvement, dual learning roles (conjecturer/solver or challenger/solver), and regime- or architecture-specific diversity control to avoid degeneracies and performance plateaus.

1. Foundational Principles and Probabilistic Formalism

The canonical formal model for STP is grounded in the belief calculus over Hintikka distributive normal forms for first-order logic, as introduced in "On Learning to Prove" (Huang, 2019). Every sentence $\varphi$ of quantifier depth $d$ is uniquely decomposed into a disjunction of depth-$d$ constituents $c$:

$$\varphi^{(d)}(\bar{y}) \equiv \bigvee_{c \in \text{Con}^{(d)}(\bar{y}),\: c \models \varphi} c(\bar{y})$$

Each constituent $c$ specifies a complete configuration of atomic and quantified witnesses (a "possible world"), with the collection at depth $d$ providing an exhaustive and mutually exclusive partition. A global probability measure $P$ is then assigned to this refinement tree (with branching given by existential/quasi-quantifier depth), satisfying $P(\text{root}) = 1$ and recursive mass conservation across refinements.

The belief distribution over sentences, $P(\varphi) = \sum_{c \models \varphi} P(c)$, supports nontrivial model uncertainty and plausible reasoning (distinct from logical omniscience). This state is embedded into a Hilbert space, with each sentence $\varphi$ mapped to an indicator function $\mathbf{1}_\varphi$ over infinite paths in the refinement tree. The induced inner product,

$$\langle \varphi, \psi \rangle = \int \mathbf{1}_\varphi \, \mathbf{1}_\psi \, dP = P(\varphi \wedge \psi)$$

reflects mutual exclusivity (mutually exclusive sentences are orthogonal) and is fundamental in subsequent learning objectives.

Bayesian update (via "refute and rescale") is realized by renormalizing probability mass when constituents are discovered inconsistent: zeroing out descendants and reweighting the minimal subtree, ensuring the sum-to-unity constraint and soundness (no support for refuted worlds). This mechanism underlies the fundamental regret-free learning dynamic of STP (Huang, 2019).
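The "refute and rescale" update can be illustrated on a toy refinement tree. The following is a minimal sketch, not the construction from (Huang, 2019): the `Node` class, the `refute` and `belief` helpers, and the choice to rescale the siblings under the nearest ancestor that still has live mass are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One constituent in a hypothetical refinement tree."""
    name: str
    mass: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, name: str, mass: float) -> "Node":
        child = Node(name, mass, parent=self)
        self.children.append(child)
        return child


def scale(node: Node, factor: float) -> None:
    """Multiply the mass of a node and all of its descendants by factor."""
    node.mass *= factor
    for child in node.children:
        scale(child, factor)


def refute(node: Node) -> None:
    """Zero out an inconsistent constituent, then rescale the minimal subtree."""
    scale(node, 0.0)
    child, ancestor = node, node.parent
    while ancestor is not None:
        live = sum(c.mass for c in ancestor.children if c is not child)
        if live > 0.0:
            # Siblings absorb the freed mass, so the ancestor's mass is unchanged.
            factor = ancestor.mass / live
            for c in ancestor.children:
                if c is not child:
                    scale(c, factor)
            return
        # No live sibling: the ancestor is inconsistent as well, so move up.
        ancestor.mass = 0.0
        child, ancestor = ancestor, ancestor.parent
    raise ValueError("every constituent has been refuted")


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]


def belief(root: Node, satisfying: set) -> float:
    """P(phi) as the total mass of the leaf constituents that entail phi."""
    return sum(leaf.mass for leaf in leaves(root) if leaf.name in satisfying)
```

After any sequence of `refute` calls that leaves at least one live constituent, the children of the root still sum to one and refuted constituents carry no mass, which mirrors the sum-to-unity and soundness conditions described above.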

2. Game-Theoretic and Reinforcement Learning Instantiations

Proof search and conjecture generation in STP are cast as a two-player perfect-information game, where:

  • Player C (Conjecturer/Challenger) selects existential extensions or poses a consistency challenge (claiming no model exists at a constituent).
  • Player V (Verifier) either refines further or defends via consistency, and the game proceeds until a successful challenge or defense is verified.

This setup translates to a self-play reinforcement learning loop where a joint policy-value network $(\pi_\theta, v_\theta)$ is updated from games between two agents playing from the root. Training alternates between trajectory generation, log-probability maximization for chosen moves (actions), and mean-squared error regression for state values:

$$\mathcal{L}(\theta) = \sum_t \left[ -\log \pi_\theta(a_t \mid s_t) + \left( v_\theta(s_t) - z \right)^2 \right]$$

where $z$ is the terminal game reward (Huang, 2019).
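As a rough numerical illustration of this objective, the sketch below combines a negative log-likelihood term for the chosen moves with a squared-error term for the value predictions. The function name `policy_value_loss`, the equal weighting of the two terms, and the averaging over steps are assumptions for illustration; the source does not specify them.

```python
import math
from typing import List, Tuple


def policy_value_loss(trajectory: List[Tuple[float, float]], z: float) -> float:
    """Average per-step loss for one self-play game.

    trajectory: (probability assigned to the chosen move, predicted state value)
                for each step of the game; probabilities must be positive.
    z:          terminal game reward used as the regression target.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one step")
    # Maximizing log-probability of chosen moves = minimizing its negative.
    nll = -sum(math.log(p) for p, _ in trajectory)
    # Mean-squared-error style regression of state values toward the outcome.
    mse = sum((v - z) ** 2 for _, v in trajectory)
    return (nll + mse) / len(trajectory)
```

A game in which every chosen move had probability 1 and every value prediction equals `z` gives a loss of zero; any uncertainty in the policy or error in the value estimate increases it.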

In high-dimensional, practical settings (e.g., LLM theorem proving), this idea is extended via an explicit challenger/solver or conjecturer/prover decomposition (Dong et al., 31 Jan 2025, Bailey et al., 22 Apr 2026, Li et al., 13 Feb 2026), and typically adopts roles such as:

  • Challenger (or Conjecturer): proposes new conjectures, questions, or constraints, often based on current solver failure modes or just-barely-proven cases.
  • Solver (or Prover): attempts to discharge both original and synthetic challenges, providing dense learning signal where standard expert iteration would yield only sparse (trivial) supervision.

The reward shaping is task-specific but generally targets proposals that maximize the expected learning signal for the current solver, often selecting for "barely provable" or maximally uncertain items to maintain curriculum pressure (Dong et al., 31 Jan 2025, Li et al., 13 Feb 2026).
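Two simple ways to express "barely provable" and "maximally uncertain" are sketched below. Both functions, the `max_rate` threshold, and the tent-shaped reward are hypothetical choices made for illustration, not the reward definitions used in the cited papers.

```python
from typing import Dict, List, Tuple


def select_barely_provable(
    pass_counts: Dict[str, int], attempts: int, max_rate: float = 0.25
) -> List[Tuple[str, float]]:
    """Keep conjectures that were proved at least once but only rarely."""
    selected = []
    for conjecture, passed in pass_counts.items():
        rate = passed / attempts
        if 0.0 < rate <= max_rate:
            selected.append((conjecture, rate))
    # Hardest (lowest pass rate) first.
    return sorted(selected, key=lambda item: item[1])


def uncertainty_reward(pass_rate: float) -> float:
    """Reward that peaks when the solver succeeds half of the time."""
    return 1.0 - abs(2.0 * pass_rate - 1.0)
```

With 8 proof attempts per conjecture, an item proved once is kept (pass rate 0.125), while items proved never or every time are discarded because they carry little training signal for the current solver.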

3. Iterative Conjecturing and Diversity-Preserving Mechanisms

A recurrent challenge in STP is avoiding performance plateaus, mode collapse, and the “diversity illusion,” in which generated conjectures or questions appear varied but exercise only a narrow set of reasoning modes (Li et al., 13 Feb 2026). Several mechanisms have been developed:

  • Memory-Augmented Penalty (MAP): Maintains a persistent memory bank of skill embeddings for generated conjectures, imposing penalties for within- and cross-iteration repetition as measured by similarity in a learned skill space. Dual penalties prevent both cyclical resurfacing and local batch collapse (Li et al., 13 Feb 2026).
  • Skill-Aware Measurement (SAM): Maps each generated conjecture to an abstracted code or "reasoning-skill" embedding (e.g., via code translation and embedding), ensuring that diversity is measured in terms of underlying cognitive skills exercised, not surface textual variety. This approach overcomes superficial diversity illusions (Li et al., 13 Feb 2026).
  • Wasserstein Re-Weighting: Enforces statistical balance and topical coverage in conjecture sampling by matching distribution over generated conjectures to that of the unresolved or original set (Dong et al., 31 Jan 2025).
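The dual penalty idea behind MAP can be sketched with cosine similarity over skill embeddings. This is an illustrative approximation only: the function names, the use of the maximum similarity, and the weights `w_batch` and `w_memory` are assumptions, and the embeddings are taken as given rather than produced by a SAM-style encoder.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def repetition_penalty(
    embedding: Sequence[float],
    batch: List[Sequence[float]],
    memory: List[Sequence[float]],
    w_batch: float = 0.5,
    w_memory: float = 0.5,
) -> float:
    """Penalize similarity to the current batch and to the memory bank."""
    # Within-iteration term: closest other conjecture in the same batch.
    in_batch = max((cosine(embedding, e) for e in batch), default=0.0)
    # Cross-iteration term: closest conjecture stored from earlier iterations.
    in_memory = max((cosine(embedding, e) for e in memory), default=0.0)
    return w_batch * max(0.0, in_batch) + w_memory * max(0.0, in_memory)
```

A conjecture whose skill embedding matches something already in the memory bank is penalized even if it is novel within its batch, which is the cross-iteration behavior that discourages old skills from cyclically resurfacing.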

Concretely, the STP curriculum loop can be framed as in the following schematic (see (Li et al., 13 Feb 2026)):

| Phase | Systemic Role | Diversity Control |
|---|---|---|
| Challenger | Generate new q | MAP, SAM, Uncertainty Reward |
| Solver | Train on (q, ŷ) | Curriculum Filter, Replay |
| Memory Update | Embed q to bank | Cross-iteration Penalty |

These mechanisms are reported to yield monotonic increases in task performance over repeated self-play iterations, avoiding the plateau typical of earlier self-play methods (Li et al., 13 Feb 2026).

4. Algorithmic Frameworks and Training Dynamics

STP has been concretely instantiated across domains via recurrent feedback algorithms:

  • Self-Play Theorem Provers (STP) (Dong et al., 31 Jan 2025): Iterate between a conjecturer (LLM conditioned on seed theorems, proofs, and lemmas) and a prover (LLM generating proof candidates), using the formal checker to produce correctness signals, pass-rate filtering, and elegance/diversity filtering for constructive curriculum.
  • Self-Guided Self-Play (SGS) (Bailey et al., 22 Apr 2026): A tripartite architecture with Solver, Conjecturer, and Guide; the Guide uses relevance and clarity scoring to prevent the Conjecturer from optimizing for degenerate hardness that does not benefit the Solver. Rewards combine solve-rate and Guide assessment, with strong empirical evidence for sustained superlinear improvement and effective scaling.
  • Iterative Self-Play Policy Optimization (Iterative-SPO) (Wang et al., 29 Sep 2025): For vision-language tasks, alternates self-play clue-generation phases with RL on verifiable voting, gating progression using empirical performance metrics to inject new challenge and avoid stagnant equilibria.
  • IRIS (Interpolative Rényi Iterative Self-play) (Liao et al., 22 Apr 2026): In domains such as instruction tuning, introduces a continuous interpolation (controlled by the Rényi order parameter $\alpha$) between classic divergence-based self-play regimes, adaptively sharpening importance weighting from KL (smoother, exploratory) toward higher-order divergences (sharper, exploitative) based on the empirical reward gap. This allows the algorithm to self-calibrate through different learning stages, achieving monotonic improvement in empirical performance and robust data efficiency.
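For readers unfamiliar with the Rényi family referenced in the IRIS entry, the sketch below computes the standard Rényi divergence between two discrete distributions. It illustrates only the textbook definition; it is not the IRIS training objective, and the function names are chosen here for illustration. Both inputs are assumed to be probability vectors of equal length with strictly positive entries in `q`.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def renyi_divergence(p: Sequence[float], q: Sequence[float], alpha: float) -> float:
    """Renyi divergence of order alpha > 0; alpha = 1 is the KL limit."""
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        return kl_divergence(p, q)
    total = sum(
        pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q) if pi > 0.0
    )
    return math.log(total) / (alpha - 1.0)
```

As the order approaches 1 the value approaches the KL divergence, and the divergence is non-decreasing in the order, so larger orders weight mismatches between the two distributions more heavily. At order 2 it equals the logarithm of one plus the chi-squared divergence.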

STP's self-play procedure is summarized as follows:

  1. Generate new conjectures/questions/sketches at the boundary of proven knowledge.
  2. Attempt to solve (prove) them, updating success/failure statistics.
  3. Select only the most informative (barely provable, highly uncertain, or maximally relevant) instances to update sub-models.
  4. Iteratively retrain both roles, controlling for diversity and relevance via explicit mechanisms.
  5. Repeat until external or internal convergence criteria are met (e.g., success rates, diversity, or a curriculum plateau).
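The five steps above can be sketched as a toy simulation in which the solver is reduced to a single scalar skill and each conjecture to a scalar difficulty. Everything here is a deliberate simplification and an assumption for illustration: the logistic `pass_rate` model, the fixed difficulty offsets, the informative band of pass rates, and the skill update rule do not come from any of the cited systems.

```python
import math
from typing import List, Tuple


def pass_rate(skill: float, difficulty: float) -> float:
    """Toy solver model: success probability falls as difficulty exceeds skill."""
    return 1.0 / (1.0 + math.exp(difficulty - skill))


def self_play(
    skill: float = 0.0,
    iterations: int = 5,
    band: Tuple[float, float] = (0.05, 0.5),
    step: float = 0.5,
) -> List[float]:
    """Return the solver's skill after each completed iteration."""
    history = [skill]
    for _ in range(iterations):
        # 1. Generate conjectures around the current frontier.
        conjectures = [skill + offset for offset in (-2, -1, 0, 1, 2, 3)]
        # 2. Attempt them and record (expected) success statistics.
        rates = {d: pass_rate(skill, d) for d in conjectures}
        # 3. Keep only informative items: provable, but not easy.
        informative = [d for d, r in rates.items() if band[0] <= r <= band[1]]
        # 5. Stop when the curriculum no longer offers anything informative.
        if not informative:
            break
        # 4. "Retrain": move skill toward the mean informative difficulty.
        target = sum(informative) / len(informative)
        skill += step * (target - skill)
        history.append(skill)
    return history
```

Because the selected conjectures always sit at or slightly above the current skill, the toy solver improves at every iteration. Shifting the band toward easy items (pass rates near 1) stalls or reverses progress in this model, a simplified analogue of the plateau that curriculum pressure is meant to avoid.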

5. Complexity, Scalability, and Trade-Offs

Theoretical models grounded in Hintikka trees reveal super-exponential scaling in the number of constituents (possible worlds) with quantifier depth and predicate arity (Huang, 2019). Practical implementations mitigate intractability using several strategies:

  • Filtration and Abstraction: Early stages use coarse partitions of possible worlds, only refining to finer distinctions as needed for soundness/completeness.
  • Bounded memory: Cross-iteration diversity penalization (e.g., MAP) is approximated with bounded memory banks and approximate embeddings (Li et al., 13 Feb 2026).
  • Data efficiency enhancements: E.g., IRIS achieves higher performance with as little as 26k labeled samples compared to hundreds of thousands in conventional fine-tuning (Liao et al., 22 Apr 2026).
  • Parallelization: Many STP models (e.g., VLMs, LLMs) run millions of self-generated proof attempts, amortizing compute and enabling scaling to large corpora (Dong et al., 31 Jan 2025).

A persistent challenge is the balancing of memory/search budget, coverage, and curriculum granularity; abstractions sacrifice completeness until refined, while aggressive diversity penalties can impede efficient exploration.

6. Empirical Results and Domain Instantiations

STP and its variants have achieved leading results across a variety of domains:

  • In formal mathematics (Lean, Isabelle), pass rates nearly doubled over expert iteration baselines, with sustained improvement over 24+ iterations and new SoTA on miniF2F, ProofNet, and PutnamBench (Dong et al., 31 Jan 2025).
  • Vision-language reasoning: Vision-Zero (Iterative-SPO variant) achieved a 3% absolute improvement on math/reasoning transfer tasks and up to 10% gains over RL-only baselines in vision-centric benchmarks, with a 10–100$\times$ reduction in human annotation cost (Wang et al., 29 Sep 2025).
  • General LLM reasoning benchmarks: IRIS produced average scores of 44.57% (Zephyr-7B) with only one quarter of the annotated data, exhibiting continuous gains over iterations, in contrast with the plateauing behavior of prior SPIN, SPACE, and SPIF methods (Liao et al., 22 Apr 2026).
  • Skills curriculum learning: R-Diverse yielded monotonic iteration-wise improvement in Math AVG task (+10pp over 5 iterations vs. plateau and degradation in R-Zero), with ablations ascribing major gains to MAP, SAM, and experience replay (Li et al., 13 Feb 2026).

Empirical ablations consistently show that diversity penalization, relevance guidance, and curriculum adaptation are necessary for persistent improvement; naive self-play or RL either plateaus or degenerates into trivial tasks.

7. Theoretical Guarantees and Limitations

Key soundness and completeness properties rely on the underlying probabilistic framework:

  • Soundness: If the belief model does not collapse prematurely (never assigns zero probability to live constituents), the acceptance criterion (all mass on $\varphi$-worlds) never admits inconsistency (Huang, 2019).
  • Completeness: Upon full convergence (all inconsistent subworlds zeroed), every valid formula is eventually proved.
  • Self-play convergence: Under idealized self-play, the joint distribution converges to the limiting belief model that recovers a complete and sound theorem prover (Huang, 2019).

Scalability is always constrained by the representational explosion in the underlying logic (super-exponential with quantifier depth and arity). In practice, empirical gains depend crucially on maintaining a curriculum of useful, non-degenerate challenges and effectively controlling surface and skill-based diversity.


STP, in sum, formalizes a curriculum-based meta-learning paradigm where process and representation co-evolve: agents self-generate conjectures and reasoning tasks targeted at their own cognitive frontier, learn from them via tightly coupled solver/conjecturer reinforcement, and enforce skill-based diversity to sustain progress. Contemporary approaches span mathematical logic, vision-language reasoning, and abstract problem solving, with formal and empirical foundations now deeply interconnected across the field (Huang, 2019, Dong et al., 31 Jan 2025, Bailey et al., 22 Apr 2026, Wang et al., 29 Sep 2025, Li et al., 13 Feb 2026, Liao et al., 22 Apr 2026).
