Self-Questioning Language Models (SQLM)
- Self-Questioning Language Models (SQLM) are a class of models that autonomously generate, solve, and verify their own challenges using a closed-loop reinforcement learning framework.
- The framework employs an asymmetric self-play mechanism where a proposer creates problems and a solver answers them, utilizing techniques like majority voting and unit tests for verification.
- Experimental results show significant improvements in arithmetic (79% to 95%), algebra (44% to 60%), and coding (32% to 39%) accuracy without relying on external curated datasets.
Self-Questioning LLMs (SQLM) are a class of LLMs that leverage internal self-generated questions and answers as a mechanism for autonomous learning, self-improvement, and robust reasoning. These models dispense with the need for additional curated supervision by generating and solving their own problems, judgments, or verification tasks in a closed-loop framework. The SQLM paradigm encompasses reinforcement learning–driven asymmetric self-play, dynamic curriculum creation, and robust self-verification strategies, all targeting the enhancement of reasoning skills across mathematical, scientific, and algorithmic domains.
1. Asymmetric Self-Play Framework
SQLM is instantiated as an asymmetric self-play setup in which an LLM operates simultaneously as a proposer and a solver. The proposer is given a high-level topic prompt (e.g., "algebra word problems") and is required to generate a nontrivial question or problem. The solver, either an independent copy or the same model, is presented with this generated problem and tasked with producing an answer.
Interactions are structured as follows:
- The proposer generates a diverse set of problems conditioned on the task prompt.
- The solver attempts to solve each generated problem.
- The quality of both the question and answer is assessed through automated criteria: for instance, majority voting over solutions (to ensure consensus), or using unit tests for executable domains such as code generation.
The system forms a closed self-improvement loop in which rewards feed back to both roles, enabling iterative enhancement via reinforcement learning.
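A minimal Python sketch of one such closed-loop iteration is given below. The names `propose`, `solve`, `score`, and the sample count `n_samples` are placeholders introduced here for illustration only; in practice each callable would wrap sampling from the underlying LLM, and the scoring function would implement the verification schemes described in the following sections.

```python
from typing import Callable, Sequence

def self_play_round(
    topic_prompt: str,
    propose: Callable[[str], str],                  # proposer role: topic prompt -> problem text
    solve: Callable[[str], str],                    # solver role: problem text -> one sampled answer
    score: Callable[[str, Sequence[str]], tuple[list[float], float]],  # verifier -> (solver rewards, proposer reward)
    n_samples: int = 8,
) -> tuple[str, list[str], list[float], float]:
    """One closed-loop self-play iteration: propose a problem, sample several
    solver answers, score both roles, and return the transcripts and rewards
    needed for the subsequent reinforcement-learning updates."""
    problem = propose(topic_prompt)                        # proposer generates a problem from the topic
    answers = [solve(problem) for _ in range(n_samples)]   # solver samples multiple candidate answers
    solver_rewards, proposer_reward = score(problem, answers)  # e.g., majority voting or unit tests
    return problem, answers, solver_rewards, proposer_reward
```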
2. Self-Generated Data and Mechanisms of Self-Questioning
The core mechanism is fully autonomous data generation:
- Question Generation: The proposer outputs a candidate problem $q$ sampled from its policy $\pi_{\theta_P}(q \mid \text{prompt})$, conditioned on the task prompt.
- Answer Generation: The solver answers the question with samples $a_1, \dots, a_N$ drawn from its policy $\pi_{\theta_S}(a \mid q)$, providing multiple candidates for majority voting or solution diversity.
- Verification: In domains with low generator-verifier gap (e.g., arithmetic, algebra), the solver's correctness is heuristically determined via majority voting; for domains such as code, proposers generate unit tests and the solution's validity is checked by test-passing fraction.
Iteration over these steps yields a large corpus of self-labeled examples, which are then used to continually update the underlying models.
This approach is domain-agnostic and can be applied to arithmetic, algebraic, and programming tasks, among others.
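The two verification modes can be sketched as follows. The sketch assumes answers are compared as normalized strings and that proposer-written unit tests are represented as callables returning True/False; both representations are simplifying assumptions made here for illustration.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_vote(answers: Sequence[str]) -> tuple[str, float]:
    """Heuristic verifier for low generator-verifier-gap domains (arithmetic,
    algebra): return the consensus answer and the fraction of samples agreeing."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def unit_test_pass_fraction(
    solution: Callable,
    tests: Sequence[Callable[[Callable], bool]],
) -> float:
    """Verifier for code domains: fraction of proposer-written unit tests the
    candidate solution passes (each test takes the solution and returns a bool)."""
    if not tests:
        return 0.0
    return sum(1 for test in tests if test(solution)) / len(tests)

# Toy usage (illustrative only):
consensus, agreement = majority_vote(["12", "12", "13", "12"])               # -> ("12", 0.75)
score = unit_test_pass_fraction(lambda x: x * 2,
                                [lambda f: f(3) == 6, lambda f: f(0) == 1])  # -> 0.5
```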
3. Reward Structure and Reinforcement Training
The reward system is designed to both shape the difficulty of proposed problems and enforce correctness:
- For arithmetic/algebra (small generator-verifier gap):
  - The solver receives a reward of $1$ if its answer equals the majority answer among the $N$ samples, otherwise $0$.
  - The proposer receives a reward of $1$ if the solver's samples reach partial but not unanimous agreement, i.e., if the problem is neither so easy that every solver copy agrees nor so hard that no consensus is reached.
- For code (large generator-verifier gap):
  - The proposer supplies unit tests; the solver's reward is the fraction of unit tests its solution passes.
  - The proposer is rewarded only on "medium-difficulty" problems, i.e., those where the solver passes some but not all of the tests, which incentivizes generation of challenging but solvable tasks.
Each role updates its respective policy using these reward signals, employing standard reinforcement learning objectives.
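The reward rules above can be sketched as follows. The exact thresholds defining the "medium-difficulty" band for code (here 0.25-0.75) are assumptions chosen for illustration; this summary does not specify particular values.

```python
from collections import Counter
from typing import Sequence

def math_rewards(answers: Sequence[str]) -> tuple[list[float], float]:
    """Small generator-verifier gap (arithmetic/algebra): solver samples are
    rewarded for matching the majority answer; the proposer is rewarded only
    when the samples neither all agree nor all disagree."""
    majority_answer, majority_count = Counter(answers).most_common(1)[0]
    solver_rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    unanimous = majority_count == len(answers)   # problem too easy: every sample agrees
    no_consensus = majority_count == 1           # problem too hard: every sample differs
    proposer_reward = 0.0 if (unanimous or no_consensus) else 1.0
    return solver_rewards, proposer_reward

def code_rewards(pass_fraction: float, lo: float = 0.25, hi: float = 0.75) -> tuple[float, float]:
    """Large generator-verifier gap (code): the solver's reward is the fraction
    of unit tests passed; the proposer is rewarded only on medium-difficulty
    problems (the band [lo, hi] is an illustrative assumption)."""
    solver_reward = pass_fraction
    proposer_reward = 1.0 if lo <= pass_fraction <= hi else 0.0
    return solver_reward, proposer_reward
```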
4. Experimental Benchmarks and Reported Results
SQLM was evaluated on three benchmarks:
| Domain | Pre-SQLM Accuracy | Post-SQLM Accuracy |
|---|---|---|
| Arithmetic | ~0.79 | ~0.95 |
| Algebra | ~0.44 | ~0.60 |
| Coding | ~0.32 | ~0.39 |
- Arithmetic: Models trained exclusively with SQLM (no external examples) rose from approximately 79% to 95% accuracy on three-digit multiplication.
- Algebra: Competency on OMEGA-style linear equations improved from ~44% to 60% after SQLM training.
- Coding: The proposer's generation of unit tests enabled the solver to increase its accuracy on Codeforces problems from ~32% to 39%.
These results were obtained without external curated datasets.
5. Advantages and Methodological Implications
SQLM confers several critical advantages:
- Self-Sufficiency: Models generate their own data, eliminating the need for continual external annotation or hand-curated benchmarks.
- Dynamic Curriculum: Through reinforcement on intermediate-difficulty problems, the system is biased toward generating increasingly rich and non-trivial challenge distributions.
- Iterative Self-Improvement: Both proposer and solver adapt over time, driving gains in reasoning, generalization, and robustness.
- General Applicability: The mechanisms generalize across domains where correctness can be checked via consensus or concrete execution.
- Automated Safety and Relevance: Future work can leverage internal scoring and filtering to automatically downweight problematic or irrelevant questions.
Techniques such as majority voting, unit-test-based grading, and policy-gradient updates are all deployed within this framework.
6. Technical Formulation and Reinforcement Learning Details
The SQLM reward-driven loop can be written as a pair of coupled objectives,

$$\max_{\theta_P}\; \mathbb{E}_{q \sim \pi_{\theta_P}(\cdot \mid \text{prompt}),\; a_{1:N} \sim \pi_{\theta_S}(\cdot \mid q)}\big[\, R_P(q, a_{1:N}) \,\big] \qquad \text{and} \qquad \max_{\theta_S}\; \mathbb{E}_{a \sim \pi_{\theta_S}(\cdot \mid q)}\big[\, R_S(q, a) \,\big],$$

where $\pi_{\theta_P}$ and $\pi_{\theta_S}$ denote the proposer and solver policies, and the rewards $R_P$ and $R_S$ are as outlined in Section 3.
Policy optimization proceeds via RL algorithms suited to discrete sequential decision making, with proposer and solver updated alternately.
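The specific RL algorithm is not spelled out in this summary; as one illustrative possibility, a generic REINFORCE-style update over the summed log-probabilities of each sampled sequence could look like the following sketch (PyTorch assumed). In the self-play loop, a step of this form would be applied alternately to the proposer (with problem-level rewards) and to the solver (with answer-level rewards).

```python
import torch

def reinforce_step(
    sequence_log_probs: torch.Tensor,   # shape [batch]; sum of token log-probs per sampled sequence
    rewards: torch.Tensor,              # shape [batch]; per-sequence scalar reward (proposer or solver)
    optimizer: torch.optim.Optimizer,
) -> float:
    """One REINFORCE-style policy-gradient step: ascend E[(R - baseline) * log pi(sequence)].
    A simple batch-mean baseline is used for variance reduction."""
    baseline = rewards.mean()
    loss = -((rewards - baseline) * sequence_log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```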
7. Open Challenges and Prospective Directions
Several open challenges and research directions are identified:
- Prompt Automation: Presently, prompt design remains manual; meta-learning may enable models to autonomously discover optimal prompting strategies.
- Quality Safeguards: There is a need for automated mechanisms to filter, grade, and improve the diversity and safety of generated problems, particularly as domains broaden.
- Systematic Error Mitigation: Because consensus serves as a surrogate for ground-truth, systematic biases are a risk. Hybrid schemes with small quantities of external gold supervision may help correct persistent model errors.
- Scaling to Broader Domains: Extending SQLM to tasks without definitive correctness criteria (e.g., open-domain QA) will require new verification and reward mechanisms.
These issues highlight active research fronts for further enhancing SQLM and for ensuring its robust deployment in real-world reasoning contexts.
In summary, Self-Questioning LLMs (SQLM) are a reinforcement learning–driven asymmetric self-play method in which LLMs induce their own curriculum by generating, verifying, and iteratively refining problems and solutions. Experimental findings indicate that SQLM can significantly enhance model performance in arithmetic, algebra, and code-generation domains, all without reliance on external labeled datasets. This positions SQLM as a pivotal framework for enabling autonomous, self-improving artificial reasoning systems (Chen et al., 5 Aug 2025).