Self-Questioning Language Models

Updated 4 July 2026

Self-questioning language models are systems that generate internal queries to guide reasoning and enhance answer quality.
They employ asymmetric self-play by assigning proposer and solver roles, leading to significant improvements in arithmetic, coding, and multimodal tasks.
These models act as diagnostic tools for uncertainty and introspection while exposing challenges like bias in question selection and limited self-awareness.

Searching arXiv for the focal paper and closely related work on self-questioning LLMs. Searching arXiv for the focal paper and closely related work on self-questioning LLMs. Self-questioning LLMs are systems in which a LLM generates questions, sub-questions, clarifications, or self-tests and then uses the resulting interaction to improve reasoning, personalize responses, align modalities, or estimate whether it knows enough to answer. Recent work treats self-questioning not as a single algorithm but as a family of mechanisms: asymmetric self-play without curated data, clarification-oriented dialog for underspecified tasks, multimodal decomposition into image- or video-grounded sub-questions, and self-evaluation pipelines that test models on questions derived from their own outputs (Chen et al., 5 Aug 2025, Andukuri et al., 2024, Hu et al., 6 Jan 2025, Seo et al., 18 Sep 2025).

1. Conceptual scope and research lineage

In current usage, self-questioning denotes a model behavior in which the model itself decides what intermediate information should be elicited before a final answer is produced. In text-only settings, this may mean generating a problem for itself, asking clarifying questions to recover latent preferences, or producing background questions whose answers activate otherwise underutilized internal knowledge. In multimodal settings, it commonly means iteratively generating image-aware or video-aware sub-questions and then answering them before final reasoning. In self-evaluation settings, it means generating content, generating questions about that content, and then testing whether the same model can answer those questions without rereading the original explanation or artifact (Chen et al., 5 Aug 2025, Wu et al., 18 May 2025, Tan et al., 2024).

The recent literature shows a progression from self-evaluation and selective generation toward explicit self-generated supervision. “Self-Evaluation Improves Selective Generation in LLMs” reformulated open-ended generation into token-level self-evaluation tasks such as multi-way comparison and point-wise true/false judgment (Ren et al., 2023). “STaR-GATE: Teaching LLMs to Ask Clarifying Questions” turned question-asking into a learnable skill for eliciting hidden user preferences (Andukuri et al., 2024). “SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant” and “Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild” treated question generation as an auxiliary or multi-round objective for visual grounding (Sun et al., 2024, Hu et al., 6 Jan 2025). “Self-Questioning LLMs” then proposed a full asymmetric self-play framework in which the model improves without curated datasets, starting from only a topic prompt (Chen et al., 5 Aug 2025).

This progression suggests that self-questioning is best understood as a control mechanism over intermediate supervision. The question itself can serve as curriculum, probe, retrieval query, decomposition step, or confidence diagnostic, depending on the training and verification regime.

2. Self-generated supervision and asymmetric training

The clearest formulation of self-questioning as self-generated supervision is Self-Questioning LLMs (SQLM). SQLM uses asymmetric self-play with two roles that share weights from the same pretrained LLM: a proposer $P$ , which is given a topic $t$ and generates a question $x$ , and a solver $S$ , which answers it with $y$ . The interaction is a one-step RL loop: $x \sim \pi_{P_t}(x)$ , $y \sim \pi_S(y \mid x)$ , a reward $r$ is computed without ground-truth labels, and both policies are optimized with a PPO-like RL setup via verl, with KL regularization, clipped ratio, short horizons, and alternating updates. The proposer is rewarded for questions that are solvable but not trivial, while the solver is rewarded by majority voting in arithmetic and algebra, or by unit-test pass rate in coding. This is organized around a generator–verifier gap: when verifying is about as hard as generation, SQLM uses internal agreement; when checking is easier, it uses external unit tests (Chen et al., 5 Aug 2025).

For coding, SQLM makes the verification channel explicit:

$\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$

The proposer receives reward only when a task is neither solved perfectly nor failed completely:

$\mathcal{R}_P(x,y) = \begin{cases} 1 & \text{if } 0 < \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big) < 1,\ 0 & \text{otherwise.} \end{cases}$

The reported gains are substantial on the three studied domains. For three-digit multiplication, performance rises from a base of $t$ 0 to $t$ 1; for OMEGA linear equations, from $t$ 2 to $t$ 3; and for Codeforces on the Eurus-2 subset, from $t$ 4 to $t$ 5. The paper also reports that reinforcing only output formatting yields significantly smaller gains on arithmetic and algebra, and that generating a full dataset upfront reduces diversity and hurts learning relative to online proposer updates.

A different self-generated supervision loop appears in STaR-GATE. There, the trained policy is a Questioner $t$ 6 that asks open-ended clarifying questions to a Roleplayer $t$ 7 whose persona is hidden, while an Oracle $t$ 8 with access to the persona produces a gold personalized response used only for scoring and evaluation. Question quality is defined by whether a dialog trajectory increases the log probability that the original base model assigns to the Oracle’s gold response. The training loop samples up to $t$ 9 turns, filters the best of $x$ 0 simulated conversations per task–persona pair, appends a self-generated final answer for response regularization, masks roleplayer answers in the loss, and fine-tunes from the original base model at each iteration. After two iterations, answers from the STaR-GATE model are preferred over the initial model on $x$ 1 of tasks (Andukuri et al., 2024).

“Introspective Growth: Automatically Advancing LLM Expertise in Technology Judgment” applies self-questioning to pairwise differentiation among $x$ 2 pairs of post-2015 computer science patents. For each patent, the model generates six questions—three “surface-level” and three “deep”—and answers them either from its own knowledge or from scientific text retrieved by selecting the top-3 SPECTER2 chunks per question from up to 10 relevant arXiv CS papers. The final decision is made from the abstracts plus the QA pairs, with three independent judgments combined by confidence-weighted voting. The paper reports that self-talk improves over baseline, scientific QA improves further, and smaller models often generate more fundamental, more open-ended, better-aligned questions for mid-sized models than large models do (Wu et al., 18 May 2025).

3. Introspection, self-knowledge, and self-evaluation

A major line of work distinguishes self-questioning for improvement from self-questioning for introspection. “Quantifying Self-Awareness of Knowledge in LLMs” argues that hallucination prediction is only a valid proxy for self-awareness when it relies on model-side information rather than question-side shortcuts. The paper formalizes self-awareness as

$x$ 3

where $x$ 4 is the component of the internal signal attributable to the model’s own state rather than to question-side cues. It further proposes the Approximate Question-side Effect (AQE), based on

$x$ 5

so that

$x$ 6

Across datasets, AQE $x$ 7 is typically high—for example, ParaRel $x$ 8, HotpotQA $x$ 9, Mintaka $S$ 0, and Explain $S$ 1—showing that strong hallucination-prediction performance can be achieved by question-awareness alone. The same paper introduces SCAO, “Semantic Compression by Answering in One word,” which uses the thresholded mean of the top- $S$ 2 next-token probabilities for the first answer token and improves the use of model-side signals, especially under reduced question-side cues (Seo et al., 18 Sep 2025).

The Self-Execution Benchmark studies a related but more demanding question: whether a model can anticipate properties of its own future outputs without actually producing them. It evaluates self-association prediction, self-refusal prediction, and self-difficulty ranking. The results are weak. On the Association Test, most models are near chance, with the best model, o4-mini, at about $S$ 3 accuracy. On the Difficulty Assessment Test, many frontier models are around $S$ 4 ordering accuracy versus around $S$ 5 for smaller models, but reasoning-oriented models do not outperform their non-reasoning counterparts. Across the 15 models, only three surpassed $S$ 6 average across self-execution tasks, and the correlation between Rasch-modeled general ability and average self-execution score is only moderate, at approximately $S$ 7 (Ezra et al., 17 Aug 2025). This directly counters the misconception that stronger task performance automatically entails accurate introspective self-prediction.

Other work evaluates whether models can understand what they themselves create. “Can I understand what I create? Self-Knowledge Evaluation of LLMs” uses a two-step generate-then-verify framework and finds severe deficits on counting and indexing-like tasks. On total word count self-knowledge, the reported accuracies are GPT-4 $S$ 8, GPT-3.5 $S$ 9, Llama3 $y$ 0, Llama2 $y$ 1, Mistral $y$ 2, Gemma $y$ 3, and Qwen $y$ 4, while designated word count remains poor to modest. The same paper reports that fine-tuning on self-generated math data can improve GSM-8k for most tested models, especially when the model already has reasonable competence (Tan et al., 2024). “Explain-Query-Test” formalizes a related loop—explain, generate multiple-choice questions and paraphrases, then answer them without access to the explanation—and reports a statistically significant moderate correlation with MMLU-Pro performance, $y$ 5 with $y$ 6, together with category-wise evidence for an explanation–comprehension gap (Taghanaki et al., 20 Jan 2025).

A more deployment-oriented formulation appears in “Self-Evaluation Improves Selective Generation in LLMs.” There, self-questioning is turned into token-level self-evaluation: the model either selects the best candidate among sampled answers or judges candidate correctness with point-wise “Yes/No” prompts, optionally with a “None of the above” option to express uncertainty explicitly. On TruthfulQA with PaLM-2 Large, Sample and Select has Cal-AUC $y$ 7, while Sample and Select + NOTA reaches $y$ 8; Sample and Eval attains $y$ 9; and Hybrid + NOTA reaches $x \sim \pi_{P_t}(x)$ 0. The paper’s central claim is not that the model understands itself deeply, but that token-level self-evaluation is better calibrated than sequence-level likelihood for selective generation (Ren et al., 2023).

4. Multimodal self-questioning and grounded reasoning

Multimodal self-questioning extends the same idea to settings where the intermediate questions must be grounded in visual or video evidence. “Socratic Questioning: Learn to Self-guide Multimodal Reasoning in the Wild” operationalizes self-guided reasoning as a four-stage loop: self-ask, self-answer, consolidate and organize, and summarize and condense. The model uses ViT-L/14 as visual encoder, Vicuna as LLM, and a two-layer MLP adapter; a single shared LLM plays the roles of Question Generator, Question Answerer, and Visual Summarizer. CapQA, the paper’s multimodal mini-dataset, contains 982 images, with 882 train samples and 100 test samples. Relative to the LLaVA-1.5 baseline, LLaVA-1.5 + SQ training improves the hallucination score by $x \sim \pi_{P_t}(x)$ 1 and raises the question quality score from $x \sim \pi_{P_t}(x)$ 2 to $x \sim \pi_{P_t}(x)$ 3 on CapQA tests. The 3-turn inference mode further reduces hallucination by $x \sim \pi_{P_t}(x)$ 4 compared to 1-turn inference (Hu et al., 6 Jan 2025).

“Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment” proposes BoViLA, a self-training framework for VideoQA in which the same model alternates as Questioner and Answerer. The architecture combines a frozen LLaMA-7B decoder, ViT-L/14 visual encoder, a learnable linear map from visual features to the text embedding space, temporal embeddings, and an Evidential Deep Learning head for uncertainty-based soft filtering, with about $x \sim \pi_{P_t}(x)$ 5M trainable parameters. The key technical move is differentiable question generation via Gumbel-Softmax, so gradients from answering self-generated questions flow back to the Questioner. The total loss is

$x \sim \pi_{P_t}(x)$ 6

where $x \sim \pi_{P_t}(x)$ 7 is the EDL uncertainty. The paper reports strong or best performance across five VideoQA benchmarks, including STAR total $x \sim \pi_{P_t}(x)$ 8, DramaQA $x \sim \pi_{P_t}(x)$ 9, VLEP $y \sim \pi_S(y \mid x)$ 0, TVQA $y \sim \pi_S(y \mid x)$ 1, and How2QA $y \sim \pi_S(y \mid x)$ 2 (Chen et al., 2024).

“Instruction-tuned Self-Questioning Framework for Multimodal Reasoning” introduces SQ-InstructBLIP, a Questioner–Answerer–Reasoner system built on InstructBLIP-vicuna7b. The main question is $y \sim \pi_S(y \mid x)$ 3, the $y \sim \pi_S(y \mid x)$ 4-th sub-question is $y \sim \pi_S(y \mid x)$ 5, and the $y \sim \pi_S(y \mid x)$ 6-th sub-answer is $y \sim \pi_S(y \mid x)$ 7. In the main experiments, 3 sub-QAs are generated per question. On VQA-Introspect validation, open-ended accuracy rises from InstructBLIP $y \sim \pi_S(y \mid x)$ 8 to SQ-InstructBLIP $y \sim \pi_S(y \mid x)$ 9; on A-OKVQA validation, multiple-choice accuracy rises from $r$ 0 to $r$ 1. When the Reasoner is given ground-truth sub-QAs, VQA-Introspect reaches $r$ 2, which the paper presents as evidence of strong headroom if intermediate facts are accurate (Jang et al., 25 Sep 2025).

“SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant” treats question generation as an auxiliary supervision signal rather than a multi-turn inference procedure. It introduces a special token $r$ 3 that instructs the model to ask a question about the image, and it mixes standard QA turns with self-questioning turns during instruction tuning. The architecture includes CLIP ViT, a prototype extractor with $r$ 4 visual prototypes, a projector into the LLM token space, and LoRA adapters on both the vision encoder and the LLM. SQ-LLaVA-7B trained on LLaVA-v1.5 data outperforms LLaVA-v1.5-7B on 9 of 10 tasks, with a $r$ 5 absolute point improvement on LLaVA (in-the-wild), while the full configuration reports about $r$ 6 average improvement across selected benchmarks (Sun et al., 2024).

5. Failure modes, misconceptions, and limits

A recurring misconception is that self-questioning is equivalent to reliable self-awareness. The literature does not support that equation. AQE shows that much apparent success in hallucination prediction can be explained by question-side shortcuts rather than model-side introspection (Seo et al., 18 Sep 2025). The Self-Execution Benchmark shows that models generally perform poorly when asked to predict whether they will refuse, what associations they will make, or which problems will be easy for themselves (Ezra et al., 17 Aug 2025). “What Am I Missing? Question-Answering as Hidden State Probing” sharpens this point by separating diagnosis from correction: a probe on the student’s hidden state before and after generating a question predicts final correctness reasonably well, yet interventions are equally likely to harm correct trajectories as they are to recover incorrect ones. The paper therefore reports a detection-versus-recovery gap rather than robust self-repair (Luo et al., 29 May 2026).

Another major limitation is that self-generated QA is not neutral preprocessing. “Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA” argues that the generation step is an implicit policy over both question selection and answer generation. Coverage saturates early; different prompt seeds converge on the same document regions; anchor selection is driven by headings, lists or tables, dense numerics, citations, key–value fields, and parser or markup artifacts; and even a minimal formatting perturbation can redirect anchors. In one probe, prepending “MOST IMPORTANT PASSAGE IN THIS EXCERPT” to a low-salience paragraph raises its question-support hit rate from $r$ 7 to $r$ 8, an increase of $r$ 9 percentage points. On the answering side, embedded instruction-like passages produce high compliance, and the paper reports that keyword–regex filtering reduces mean injection compliance from $\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$ 0 to $\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$ 1 with $\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$ 2 clean-text retention; the abstract summarizes this as reducing compliance from $\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$ 3 to $\mathcal{R}_S\big(x, y\big) = \mathrm{Pass}\big(y,\mathrm{Tests}(x)\big)\in[0,1].$ 4 while retaining nearly all clean text (Alimaskina et al., 30 Jun 2026).

Task-specific limitations also remain significant. SQLM still requires prompt engineering to constrain output formats, especially for unit tests in coding; majority-vote consensus can be wrong; no safety or relevance filters are built in; and the proposer can generate unreasonable or unsafe questions (Chen et al., 5 Aug 2025). In SQ-InstructBLIP, Answerer errors can mislead the Reasoner, and open-ended evaluation suffers from synonym mismatches such as “cell phone” versus “mobile phone,” which exact-match metrics penalize (Jang et al., 25 Sep 2025). In multimodal settings more broadly, stronger regional grounding and better objectives for effective questions are still identified as open needs rather than solved components (Hu et al., 6 Jan 2025).

6. Prospects and significance

The papers in this area collectively point toward several research directions. SQLM identifies automated prompt evolution, semi-supervised anchors to mitigate error drift, and principled safety or relevance filters as future directions for self-supervised post-training (Chen et al., 5 Aug 2025). The AQE framework argues for dataset refinement strategies that remove binary formats, enforce out-of-domain splits, and repair broken questions, while SCAO shows that instruction-level semantic compression can make model-side confidence signals more salient (Seo et al., 18 Sep 2025). The Self-Execution Benchmark proposes explicit training for self-prediction and even hierarchical self-execution tools in which smaller specialized models predict the base model’s outcomes (Ezra et al., 17 Aug 2025). Multimodal work emphasizes stronger region-level grounding, better objectives for question utility, and more robust intermediate supervision (Hu et al., 6 Jan 2025). Synthetic-QA studies recommend externalizing anchor choice and sanitizing source text before either question generation or answering (Alimaskina et al., 30 Jun 2026).

Taken together, these results suggest that self-questioning serves three distinct scientific roles. First, it can generate supervision and curriculum, as in asymmetric self-play and clarification-oriented training. Second, it can expose or retrieve latent knowledge, especially when explicit background questions are inserted before judgment. Third, it can function as a diagnostic probe of uncertainty, calibration, and internal state. The same literature also shows that these roles should not be conflated. A model may ask useful questions without accurately knowing whether it knows; it may detect uncertainty without recovering from it; and it may improve on downstream tasks while remaining fragile to shortcut cues, answer hijacking, or biased evidence selection. Self-questioning LLMs therefore mark not a solved capability, but a research program at the intersection of self-supervised training, introspective evaluation, and controllable reasoning.