Latent Skill–Driven Retrieval (LaRS)

Updated 6 May 2026

LaRS is a compute-efficient framework that uses unsupervised latent variable modeling to capture and align reasoning skills for in-context learning.
It leverages a conditional variational autoencoder and cosine similarity for selecting chain-of-thought demonstrations without manual labeling.
Empirical evaluations on benchmarks like GSM8K and COGS show that LaRS outperforms traditional retrieval methods while reducing computational overhead.

Latent Skill–Driven Retrieval (LaRS) is a compute-efficient framework for selecting in-context learning (ICL) demonstrations for large language models (LLMs), targeted specifically at complex reasoning tasks that require multi-step, chain-of-thought (CoT) prompting. Rather than selecting demonstrations by direct input-question similarity or relying on human- or LLM-designed skill taxonomies, LaRS takes an unsupervised latent-variable approach: it induces a continuous space of “reasoning skills” by modeling how rationales (stepwise explanations) for questions are distributed, and retrieves ICL demonstrations whose latent skills best align with the target question. The method is theoretically grounded, requires no auxiliary LLM calls or manual labeling, and outperforms prior retrieval approaches, notably on tasks where question-level similarity fails to capture the needed reasoning structure (Xu et al., 2023).

1. Latent Variable Modeling of Reasoning Skills

LaRS posits that each rationale $R$ for a question $Q$ is generated via a two-step process: first sample a continuous latent skill vector $z \in \mathbb{R}^d$ (with $d=128$ in experiments) given $Q$ under a reasoning-policy prior $P_H(z \mid Q)$, and then synthesize $R$ from $z$ and $Q$ according to $P_H(R \mid z, Q)$. This generative formulation leads to the marginal rationale distribution

$$P_H(R \mid Q) = \int P_H(R \mid z, Q)\, P_H(z \mid Q)\, dz.$$

A conditional variational autoencoder (CVAE) operationalizes this model, employing three neural subnetworks:

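An encoder $q_\omega(z \mid Q, R)$,
A decoder $p_\psi(R \mid z, Q)$,
A reasoning-policy network $\pi_\phi(z \mid Q)$.

The function $f$ is a fixed pre-trained text encoder (e.g., SBERT, DeBERTa, or text-embedding-ada-002), mapping text to the embedding space $\mathbb{R}^{m}$. Both $q_\omega$ and $\pi_\phi$ are 2-layer MLPs that parameterize Gaussians in $\mathbb{R}^d$; $p_\psi$ is a 2-layer MLP operating in $\mathbb{R}^{m}$. The CVAE parameters are fit by maximizing the evidence lower bound (ELBO):

$$\mathcal{L}(\omega, \psi, \phi) = \mathbb{E}_{q_\omega(z \mid Q, R)}\big[\log p_\psi(R \mid z, Q)\big] - \mathrm{KL}\big(q_\omega(z \mid Q, R) \,\|\, \pi_\phi(z \mid Q)\big).$$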
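As a rough illustration of the objective, the sketch below evaluates a negative ELBO for one example when both the encoder posterior and the policy prior are diagonal Gaussians, so the KL term has a closed form. The function names are hypothetical, and the squared-error reconstruction term is an assumption (it corresponds to a Gaussian decoder with fixed variance over rationale embeddings); the source does not specify the reconstruction loss.

```python
import math

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def negative_elbo(target_emb, recon_emb, mu_q, logvar_q, mu_p, logvar_p):
    """Reconstruction error (assumed squared error) plus KL(posterior || policy prior)."""
    recon = sum((t - r) ** 2 for t, r in zip(target_emb, recon_emb))
    return recon + gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

Minimizing this quantity pulls the policy prior toward the encoder posterior through the KL term, which is what lets the policy predict a skill from the question alone.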

2. Unsupervised Policy Learning for Skill Prediction

The reasoning-policy network $\pi_\phi(z \mid Q)$ is trained through the KL divergence term of the CVAE objective, learning to predict the distribution over skills $z$ given a new question $Q$. Architecturally, $\pi_\phi$ outputs a mean $\mu_\phi(Q)$ and log-variance $\log \sigma_\phi^2(Q)$. The reparameterization trick with $\epsilon \sim \mathcal{N}(0, I)$ allows sampling $z = \mu_\phi(Q) + \sigma_\phi(Q) \odot \epsilon$ during training. At inference, the mean $\mu_\phi(Q)$ is typically used as the estimated "target skill" of $Q$, though sampling is also possible to capture uncertainty.
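The sampling step can be sketched in a few lines. This is a minimal illustration, assuming the policy's mean and log-variance are already available as plain lists; `reparameterize` and `target_skill` are hypothetical helper names, not part of any published LaRS code.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, I) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def target_skill(mu, logvar, rng=None):
    """Inference-time skill: the mean by default, or a sample when an rng is supplied."""
    return list(mu) if rng is None else reparameterize(mu, logvar, rng)
```

Calling `target_skill(mu, logvar)` returns the mean, matching the typical inference setting described above; passing `random.Random(seed)` draws a sample instead.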

This mechanism allows LaRS to model the skill requirements of a novel question without resorting to costly LLM-based skill annotation or human-designed taxonomies.

3. Latent Skill–Driven In-Context Retrieval

LaRS retrieves in-context examples by skill-space alignment rather than traditional text similarity. Given an input question $Q$, the policy network computes its target skill $z_Q$. Each candidate example $(Q_i, R_i)$ in the demonstration bank $\mathcal{D}$ is assigned a posterior skill $z_i$ by the encoder, based on its encoded rationale.

Candidates are scored via cosine similarity,

$$s_i = \frac{z_Q^\top z_i}{\lVert z_Q \rVert \, \lVert z_i \rVert},$$

and the top-$k$ scoring examples are selected as ICL demonstrations. This procedure directly aligns the retrieval process with the latent reasoning requirements of the new question.
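The selection step amounts to a nearest-neighbor search in skill space. The sketch below assumes the target skill and the bank's posterior skills have already been computed and are given as plain lists; `cosine` and `retrieve_top_k` are illustrative names, and the two-dimensional vectors are toy values rather than real skill embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def retrieve_top_k(target, bank_skills, k):
    """Return indices of the k bank entries most similar to the target skill, best first."""
    ranked = sorted(range(len(bank_skills)),
                    key=lambda i: cosine(target, bank_skills[i]),
                    reverse=True)
    return ranked[:k]

# Toy demonstration bank of posterior skills (illustrative values only).
bank = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(retrieve_top_k([1.0, 0.1], bank, k=2))  # prints [0, 2]
```

For large banks, the linear scan could be replaced by an approximate nearest-neighbor index; the ranking criterion stays the same.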

4. Theoretical Foundations

Under two assumptions, (A) that the example bank is an unbiased sample from the optimal rationale distribution $P_H(R \mid Q)$ and (B) that all rationales (human, example, LLM) are identically distributed conditioned on $(z, Q)$, the paper provides a convergence guarantee. Specifically, as the number of in-context examples $k \to \infty$, retrieving the examples whose latent skills align with the skill predicted for the test question ensures that the LLM's in-context posterior prediction converges to the optimal rationale distribution $P_H(R \mid Q)$. This establishes that, in the idealized large-data regime, skill-based alignment delivers Bayes-optimal rationales (Xu et al., 2023).

5. Empirical Performance and Comparative Analysis

LaRS was evaluated across diverse benchmarks: TabMWP (semi-structured math), GSM8K (grade-school math), Spider (text-to-SQL), and COGS (semantic parsing), using several LLMs including GPT-3.5-turbo, text-davinci-003, Claude-v2, and Falcon-40B-Instruct. Results are reported in answer accuracy or execution accuracy.

The following table summarizes representative results for GPT-3.5-turbo (2–8 shots):

Method	TabMWP	GSM8K	Spider	COGS
Random	62.4	75.7	46.8	67.5
Q-Retrieval	72.3	75.6	49.9	88.5
LaRS	78.1	76.8	53.0	94.6
Oracle-R*	77.4	75.5	64.4	95.7

LaRS outperformed random selection by as much as 15.7 percentage points on TabMWP and 27.1 on COGS; over Q-Retrieval, the corresponding gains are 5.8 and 6.1 points. Comparable gains were observed across other backbone models. Ablations confirm that LaRS is robust to embedding model choice and maintains a consistent edge as the number of ICL examples increases.
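The margins can be recomputed directly from the GPT-3.5-turbo table above. The short sketch below copies the table's scores and reports the LaRS gain over each baseline in percentage points; `gain_over` is a hypothetical helper, and rounding to one decimal only hides floating-point noise.

```python
# Scores copied from the GPT-3.5-turbo results table.
scores = {
    "Random":      {"TabMWP": 62.4, "GSM8K": 75.7, "Spider": 46.8, "COGS": 67.5},
    "Q-Retrieval": {"TabMWP": 72.3, "GSM8K": 75.6, "Spider": 49.9, "COGS": 88.5},
    "LaRS":        {"TabMWP": 78.1, "GSM8K": 76.8, "Spider": 53.0, "COGS": 94.6},
}

def gain_over(baseline):
    """LaRS minus baseline, per benchmark, in percentage points."""
    return {task: round(scores["LaRS"][task] - scores[baseline][task], 1)
            for task in scores["LaRS"]}

print(gain_over("Random"))
print(gain_over("Q-Retrieval"))
```

The largest margins (15.7 on TabMWP, 27.1 on COGS) are relative to random selection; against Q-Retrieval the gains range from 1.2 (GSM8K) to 6.1 (COGS).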

LaRS also achieves substantial computational advantages over prior skill-based selection methods, processing example banks four times faster and halving the LLM inference calls needed during selection.

6. Practical Considerations and Limitations

LaRS ignores the ordering of selected demonstrations; incorporating heuristics for diversity or logical sequencing could plausibly offer further improvements. The decoder $p_\psi(R \mid z, Q)$ operates on embeddings with a simple MLP; integrating prompt-tuning could strengthen the linkage between latent skill and rationale generation. The current approach uses a single global skill vector per rationale, which may be insufficient for modeling multi-step reasoning that involves distinct sub-skills; refining the latent structure to handle step-wise or hierarchical skills is an open direction. Scalability to extremely large demonstration banks and to more granular or discrete skill spaces also remains to be investigated.

7. Relation to Broader Retrieval-Augmented Generation Research

LaRS is conceptually distinct from failure-state-aware retrieval augmentation techniques, such as Skill-RAG (Wei et al., 17 Apr 2026), which addresses post-retrieval misalignment at the pipeline level through skill routing and failure diagnosis rather than latent skill representation. While Skill-RAG identifies types of failure states in RAG and invokes explicitly coded remedial skills, LaRS focuses on unsupervised, latent skill discovery and alignment for demonstration retrieval in ICL. A plausible implication is that these approaches could be synergistically combined, e.g., by routing hard queries to latent skill retrieval modules. Both lines of work support the emerging thesis that effective retrieval for LLMs depends crucially on modeling the reasoning space, not simply surface-level similarity.


References (2)

LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning (2023)

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing (2026)
