Latent Skill–Driven Retrieval (LaRS)
- LaRS is a compute-efficient framework that uses unsupervised latent variable modeling to capture and align reasoning skills for in-context learning.
- It leverages a conditional variational autoencoder and cosine similarity for selecting chain-of-thought demonstrations without manual labeling.
- Empirical evaluations on benchmarks like GSM8K and COGS show that LaRS outperforms traditional retrieval methods while reducing computational overhead.
Latent Skill–Driven Retrieval (LaRS) is a compute-efficient framework for in-context learning (ICL) selection in LLMs, targeted specifically at complex reasoning tasks that require multi-step, chain-of-thought (CoT) prompting. Rather than selecting demonstrations by direct input-question similarity or requiring human- or LLM-designed skill taxonomies, LaRS introduces an unsupervised latent-variable approach: it induces a continuous space of “reasoning skills” by modeling how rationales (stepwise explanations) for questions are distributed, and retrieves ICL demonstrations whose latent skills best align with the target question. The method is theoretically grounded, requires no auxiliary LLM calls or manual labeling, and outperforms prior retrieval approaches, notably on tasks where question-level similarity fails to capture needed reasoning structure (Xu et al., 2023).
1. Latent Variable Modeling of Reasoning Skills
LaRS posits that each rationale $R$ for a question $Q$ is generated via a two-step process: sample a continuous latent skill vector $z \in \mathbb{R}^d$ given $Q$ under a reasoning-policy prior $\pi(z \mid Q)$, and then synthesize $R$ from $z$ and $Q$ according to $p(R \mid z, Q)$. This generative formulation leads to the marginal rationale distribution
$$p(R \mid Q) = \int p(R \mid z, Q)\, \pi(z \mid Q)\, dz.$$
A conditional variational autoencoder (CVAE) operationalizes this model, employing three neural subnetworks:
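- An encoder $q_\omega(z \mid Q, R)$,
- A decoder $p_\psi(R \mid z, Q)$,
- A reasoning-policy network $\pi_\phi(z \mid Q)$.
The function $f$ is a fixed pre-trained text encoder (e.g., SBERT, DeBERTa, or text-embedding-ada-002), mapping text to the embedding space $\mathbb{R}^{d_e}$. Both the encoder $q_\omega$ and the policy $\pi_\phi$ are 2-layer MLPs that parameterize Gaussians in $\mathbb{R}^d$; the decoder $p_\psi$ is a 2-layer MLP operating in $\mathbb{R}^{d_e}$. The CVAE parameters are fit by maximizing the evidence lower bound (ELBO):
$$\mathcal{L}(\omega, \psi, \phi) = \mathbb{E}_{q_\omega(z \mid Q, R)}\big[\log p_\psi(R \mid z, Q)\big] - \mathrm{KL}\big(q_\omega(z \mid Q, R) \,\|\, \pi_\phi(z \mid Q)\big).$$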
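As a rough illustration of the ELBO above, the following NumPy sketch evaluates the two terms for a single example. It assumes diagonal Gaussians for both the encoder and the policy, and it assumes a unit-variance Gaussian reconstruction likelihood in embedding space (so the reconstruction term reduces to a negative squared error, up to a constant). The function names `gaussian_kl` and `elbo` and all numeric values are hypothetical, not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))), summed over dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0,
        axis=-1,
    )

def elbo(r_emb, r_recon, mu_q, logvar_q, mu_p, logvar_p):
    """Reconstruction term minus KL term for one (question, rationale) pair."""
    # Assumption: unit-variance Gaussian likelihood over the rationale embedding,
    # so log p(R | z, Q) is -0.5 * squared error (additive constant dropped).
    recon_loglik = -0.5 * np.sum((r_emb - r_recon) ** 2, axis=-1)
    return recon_loglik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)

# Toy values: encoder posterior vs. policy prior in a 2-d skill space,
# and a 3-d rationale embedding with its decoder reconstruction.
mu_q, logvar_q = np.array([0.5, -0.5]), np.array([0.0, 0.0])
mu_p, logvar_p = np.zeros(2), np.zeros(2)
r_emb = np.array([1.0, 2.0, 3.0])
r_recon = np.array([1.0, 2.0, 2.0])

value = elbo(r_emb, r_recon, mu_q, logvar_q, mu_p, logvar_p)
print(value)
```

Because the KL term is taken against the policy output rather than a fixed standard normal, gradients of this objective reach the policy parameters as well as the encoder and decoder.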
2. Unsupervised Policy Learning for Skill Prediction
The reasoning-policy network $\pi_\phi(z \mid Q)$ is trained through the KL divergence term of the CVAE objective, learning to predict the distribution over skills $z$ given a new question $Q$. Architecturally, $\pi_\phi$ outputs a mean $\mu_\phi(Q)$ and log-variance $\log \sigma_\phi^2(Q)$. The reparameterization trick with $\epsilon \sim \mathcal{N}(0, I)$ allows sampling $z = \mu_\phi(Q) + \sigma_\phi(Q) \odot \epsilon$ during training. At inference, the mean $\mu_\phi(Q)$ is typically used as the estimated "target skill" of $Q$, though sampling is also possible to capture uncertainty.
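The policy head described above can be sketched in NumPy as a 2-layer MLP whose output is split into a mean and a log-variance. The tanh activation, layer sizes, weight initialization, and the names `init_policy`, `policy_forward`, `sample_skill`, and `target_skill` are illustrative assumptions, and the random question embeddings stand in for the output of the frozen text encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_policy(embed_dim, hidden_dim, skill_dim, rng):
    """Randomly initialized 2-layer MLP; the last layer emits mean and log-variance."""
    return {
        "W1": rng.normal(0.0, 0.1, (embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, (hidden_dim, 2 * skill_dim)),
        "b2": np.zeros(2 * skill_dim),
    }

def policy_forward(params, q_emb):
    """Map question embeddings to the Gaussian parameters (mu, logvar) over skills."""
    h = np.tanh(q_emb @ params["W1"] + params["b1"])  # tanh is an assumption
    out = h @ params["W2"] + params["b2"]
    mu, logvar = np.split(out, 2, axis=-1)
    return mu, logvar

def sample_skill(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def target_skill(params, q_emb):
    """Inference-time estimate: use the predicted mean as the target skill."""
    mu, _ = policy_forward(params, q_emb)
    return mu

params = init_policy(embed_dim=16, hidden_dim=32, skill_dim=8, rng=rng)
q_emb = rng.standard_normal((4, 16))       # stand-in for f(Q) on 4 questions
mu, logvar = policy_forward(params, q_emb)
z_train = sample_skill(mu, logvar, rng)    # stochastic, used during training
z_infer = target_skill(params, q_emb)      # deterministic, used for retrieval
```

Writing the sample as a deterministic function of `mu`, `logvar`, and external noise is what lets a gradient-based trainer differentiate through the sampling step.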
This mechanism allows LaRS to model the skill requirements of a novel question without resorting to costly LLM-based skill annotation or human-designed taxonomies.
3. Latent Skill–Driven In-Context Retrieval
LaRS retrieves in-context examples by skill-space alignment rather than traditional text similarity. Given an input question $Q$, the policy network computes its target skill $z_Q = \mu_\phi(Q)$. Each candidate example $(Q_i, R_i)$ in the demonstration bank $\mathcal{D}$ is assigned a posterior skill $z_i$, taken from the encoder $q_\omega(z \mid Q_i, R_i)$, based on its encoded rationale.
Candidates are scored via cosine similarity, $s_i = \dfrac{z_Q^\top z_i}{\lVert z_Q \rVert \, \lVert z_i \rVert}$, and the top-$k$ scoring examples are selected for ICL demonstration. This procedure directly aligns the retrieval process with the latent reasoning requirements of the new question.
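The scoring and selection step can be sketched as follows, assuming the target skill and the bank's posterior skills have already been computed and stored as NumPy arrays. The helper name `cosine_top_k` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_top_k(z_query, z_bank, k):
    """Return (indices, scores) of the k bank skills most cosine-similar to z_query."""
    # Normalize both sides so the dot product equals cosine similarity;
    # the small constant guards against division by zero.
    q = z_query / (np.linalg.norm(z_query) + 1e-12)
    b = z_bank / (np.linalg.norm(z_bank, axis=1, keepdims=True) + 1e-12)
    scores = b @ q
    order = np.argsort(-scores, kind="stable")[:k]  # highest score first
    return order, scores[order]

# Toy bank of four 2-d posterior skills and one target skill.
z_bank = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
z_query = np.array([2.0, 0.1])

idx, scores = cosine_top_k(z_query, z_bank, k=2)
print(idx.tolist())  # → [0, 2]
```

Since the bank skills do not depend on the query, they can be computed once and cached, leaving a single policy forward pass and one matrix-vector product per new question.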
4. Theoretical Foundations
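Under two assumptions, (A) the example bank is an unbiased sample from the optimal rationale distribution $P^*(R \mid Q)$, and (B) all rationales (human, example, LLM) are identically distributed conditioned on $(z, Q)$, the paper provides a convergence guarantee. Specifically, as the number of in-context examples $k \to \infty$, the retrieval rule
$$\mathcal{S}_k(Q) = \operatorname*{arg\,top\text{-}k}_{(Q_i, R_i) \in \mathcal{D}} \; \cos\big(z_Q, z_i\big),$$
where $z_Q$ is the target skill of $Q$ and $z_i$ is the posterior skill of example $i$, ensures that the in-context posterior prediction $P_M(R \mid \mathcal{S}_k(Q), Q)$ converges to the optimal $P^*(R \mid Q)$. This establishes that, in the idealized large-data regime, skill-based alignment delivers Bayes-optimal rationales (Xu et al., 2023).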
5. Empirical Performance and Comparative Analysis
LaRS was evaluated across diverse benchmarks: TabMWP (semi-structured math), GSM8K (grade-school math), Spider (text-to-SQL), and COGS (semantic parsing), using several LLMs including GPT-3.5-turbo, text-davinci-003, Claude-v2, and Falcon-40B-Instruct. Results are reported in answer accuracy or execution accuracy.
The following table summarizes representative results for GPT-3.5-turbo (2–8 shots):
| Method | TabMWP | GSM8K | Spider | COGS |
|---|---|---|---|---|
| Random | 62.4 | 75.7 | 46.8 | 67.5 |
| Q-Retrieval | 72.3 | 75.6 | 49.9 | 88.5 |
| LaRS | 78.1 | 76.8 | 53.0 | 94.6 |
| Oracle-R* | 77.4 | 75.5 | 64.4 | 95.7 |
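In the table, LaRS outperforms random selection by as much as 15.7 percentage points on TabMWP and 27.1 on COGS. Relative to Q-Retrieval, the gains are 5.8 points on TabMWP and 6.1 on COGS, with smaller margins on GSM8K (1.2) and Spider (3.1). Comparable gains were observed across other backbone models. Ablations confirm that LaRS is robust to the choice of embedding model and maintains a consistent edge as the number of ICL examples increases.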
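The margins can be recomputed directly from the table above. The short sketch below copies the GPT-3.5-turbo rows into a dictionary and reports the LaRS gain over each baseline; the helper name `gains` is hypothetical.

```python
# Accuracy values copied from the GPT-3.5-turbo results table.
results = {
    "Random":      {"TabMWP": 62.4, "GSM8K": 75.7, "Spider": 46.8, "COGS": 67.5},
    "Q-Retrieval": {"TabMWP": 72.3, "GSM8K": 75.6, "Spider": 49.9, "COGS": 88.5},
    "LaRS":        {"TabMWP": 78.1, "GSM8K": 76.8, "Spider": 53.0, "COGS": 94.6},
}

def gains(baseline):
    """Percentage-point gain of LaRS over the named baseline, per benchmark."""
    return {
        task: round(results["LaRS"][task] - results[baseline][task], 1)
        for task in results["LaRS"]
    }

print(gains("Random"))       # → {'TabMWP': 15.7, 'GSM8K': 1.1, 'Spider': 6.2, 'COGS': 27.1}
print(gains("Q-Retrieval"))  # → {'TabMWP': 5.8, 'GSM8K': 1.2, 'Spider': 3.1, 'COGS': 6.1}
```

The 15.7 and 27.1 point margins are measured against random selection; the margins over Q-Retrieval are smaller on every benchmark.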
LaRS also reports substantial computational advantages over prior skill-based selection methods: it processes example banks four times faster and halves the number of LLM inference calls during the selection stage.
6. Practical Considerations and Limitations
LaRS ignores the ordering of selected demonstrations; incorporating heuristics for diversity or logical sequencing could plausibly offer further improvements. The decoder $p_\psi$ operates on embeddings with a simple MLP; integrating prompt-tuning could strengthen the linkage between latent skill and rationale generation. The current approach uses a single global skill vector per rationale, which may be insufficient for modeling multi-step reasoning involving distinct sub-skills; refining the latent structure to handle step-wise or hierarchical skills is an open direction. Scalability to extremely large demonstration banks and to more granular or discrete skill spaces remains open for investigation.
7. Relation to Broader Retrieval-Augmented Generation Research
LaRS is conceptually distinct from failure-state-aware retrieval augmentation techniques, such as Skill-RAG (Wei et al., 17 Apr 2026), which addresses post-retrieval misalignment at the pipeline level through skill routing and failure diagnosis rather than latent skill representation. While Skill-RAG identifies types of failure states in RAG and invokes explicitly coded remedial skills, LaRS focuses on unsupervised, latent skill discovery and alignment for demonstration retrieval in ICL. A plausible implication is that these approaches could be synergistically combined, e.g., by routing hard queries to latent skill retrieval modules. Both lines of work support the emerging thesis that effective retrieval for LLMs depends crucially on modeling the reasoning space, not simply surface-level similarity.