SandwichR: Fast Query Correction Paradigm

Updated 14 January 2026

SandwichR is a query correction paradigm that decomposes outputs into initial correction, reasoning chain, and final correction to reduce latency.
It employs an Answer–Reasoning–Answer structure with a consistency-aware reinforcement learning objective to align rapid initial responses with in-depth reasoning.
The framework achieves state-of-the-art accuracy with a 40–70% reduction in inference latency, validated on multi-domain, error-injected query datasets.

Sandwich Reasoning (SandwichR) is a paradigm and training strategy for low-latency query correction in search pipelines. It addresses the latency bottleneck of Chain-of-Thought (CoT) reasoning, enabling LLMs to rapidly produce high-quality corrections while explicitly leveraging reasoning without incurring prohibitive computational costs. The framework is defined by its "Answer–Reasoning–Answer" output structure and a consistency-aware reinforcement learning objective, underpinned by targeted sampling and a newly constructed benchmark dataset. SandwichR achieves state-of-the-art accuracy with a 40–70% reduction in inference latency compared to standard reasoning-first approaches (Zhang et al., 7 Jan 2026).

1. "Answer–Reasoning–Answer" Inference Paradigm

SandwichR decomposes output generation for a noisy query $x$ into three sequential segments: initial correction ( $C_{\text{init}}$ ), explicit reasoning ( $R$ ), and final correction ( $C_{\text{final}}$ ), so that $y = [C_\text{init}; R; C_\text{final}]$ .

Initial Correction ( $C_\text{init}$ ): The model generates tokens $(y_1, \dots, y_k)$ conditioned solely on $x$ , yielding an immediate correction returned for downstream retrieval.
Reasoning Chain ( $R$ ): Tokens $(r_1, \dots, r_m)$ encode the model's analytic process, such as error location, type, and justification, marked for separate parsing within output tags (e.g., "<reasoning>…</reasoning>").
Final Correction ( $C_{\text{final}}$ ): Conditioned on both $C_{\text{init}}$ and the preceding reasoning $R$, the model emits $C_{\text{final}}$. This step aligns the initial answer with reasoning-derived insights and supervises the initial answer's quality.

During real-time inference under tight token budgets, only $C_{\text{init}}$ needs to be generated, ensuring low latency. The model's training aligns the initial answer with downstream reasoning, so the emitted correction maintains reasoning-level accuracy even when later output segments are omitted.
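The three-segment output described above can be split with a small parser. The following is a minimal sketch, assuming the generation is laid out as `C_init<reasoning>...</reasoning>C_final`; the source only gives the `<reasoning>` tags as an example, so the exact layout, the `SandwichOutput` type, and the `parse_sandwich` helper are illustrative assumptions.

```python
import re
from typing import NamedTuple, Optional

# ASSUMED layout (hypothetical): "<C_init><reasoning>...</reasoning><C_final>".
_PATTERN = re.compile(
    r"^(?P<init>.*?)<reasoning>(?P<reasoning>.*?)</reasoning>(?P<final>.*)$",
    re.DOTALL,
)


class SandwichOutput(NamedTuple):
    c_init: str
    reasoning: Optional[str]
    c_final: Optional[str]


def parse_sandwich(text: str) -> SandwichOutput:
    """Split a generation into (C_init, R, C_final).

    If the reasoning block is missing or unterminated (for example because
    decoding stopped early under a token budget), only C_init is returned.
    """
    match = _PATTERN.match(text)
    if match is None:
        # Everything before an opening tag (if any) is the initial correction.
        c_init = text.split("<reasoning>", 1)[0].strip()
        return SandwichOutput(c_init, None, None)
    return SandwichOutput(
        match.group("init").strip(),
        match.group("reasoning").strip(),
        match.group("final").strip(),
    )


full = parse_sandwich(
    "wireless headphones<reasoning>'hedphones' is a misspelling.</reasoning>wireless headphones"
)
truncated = parse_sandwich("wireless headphones<reasoning>'hedph")
```

In this sketch a truncated generation still yields a usable `c_init`, which mirrors the low-latency path where only the initial correction is consumed.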

2. Consistency-Aware Reinforcement Learning (RL) Objective

SandwichR employs a two-stage training procedure. After supervised fine-tuning (SFT) teaches the model to format outputs as $[C_{\text{init}}; R; C_{\text{final}}]$, RL is used to enforce consistency across segments and incentivize high-fidelity corrections.

Reward Composition:
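- Accuracy Reward: $r_{\text{acc}}$ is defined piecewise from the agreement between the predicted correction and the reference correction, using the F0.5-score $F_{0.5}$.
- Format × Consistency Reward: $r_{\text{fc}}$ is nonzero iff the output matches the SandwichR structure and $C_{\text{init}} = C_{\text{final}}$; 0 otherwise.
- Total reward: $r(x, y)$, which combines $r_{\text{acc}}$ and $r_{\text{fc}}$.

Gradient Update: The objective $J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y) \right]$ is optimized via REINFORCE gradients:

$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[ r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \right]$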
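One possible instantiation of the reward terms listed above is sketched below. The concrete values are assumptions made for illustration only: exact match earning 1.0, a caller-supplied F0.5 score as the fallback, an indicator value of 1.0 for the format × consistency term, and a plain sum as the total. The `Segments` type and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segments:
    c_init: str
    c_final: str
    well_formed: bool  # True if the output followed the Answer-Reasoning-Answer layout


def accuracy_reward(prediction: str, reference: str, f05: float) -> float:
    # ASSUMPTION: an exact match earns 1.0; otherwise fall back to a
    # caller-supplied F0.5 score in [0, 1].
    return 1.0 if prediction == reference else f05


def format_consistency_reward(seg: Segments) -> float:
    # Nonzero only when the format is valid AND both corrections agree.
    # ASSUMPTION: the nonzero value is 1.0.
    return 1.0 if seg.well_formed and seg.c_init == seg.c_final else 0.0


def total_reward(seg: Segments, reference: str, f05: float) -> float:
    # ASSUMPTION: the two terms are summed; the actual combination may differ.
    return accuracy_reward(seg.c_final, reference, f05) + format_consistency_reward(seg)


consistent_and_correct = total_reward(Segments("a b", "a b", True), "a b", f05=0.0)
inconsistent_but_correct = total_reward(Segments("a x", "a b", True), "a b", f05=0.0)
```

Under these assumptions, an output whose initial and final corrections disagree loses the consistency term even when the final correction is right, which is the pressure that pulls $C_{\text{init}}$ toward $C_{\text{final}}$.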

In practice, the policy gradient is stabilized using GRPO (Group Relative Policy Optimization, a PPO variant), rewarding high $F_{0.5}$, exact output format, and segment consistency. This alignment ensures that rapid initial corrections ( $C_{\text{init}}$ ) reflect post-hoc reasoning ( $R$, $C_{\text{final}}$ ), resolving the latency-accuracy trade-off.
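GRPO replaces a learned value baseline with a comparison inside a group of outputs sampled for the same query. The sketch below shows only that group-relative normalization step, assuming scalar rewards per sample; the clipped policy ratio and KL penalty of the full algorithm are omitted, and the use of the population standard deviation and the `eps` constant are illustrative choices.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one query, scored by some reward function.
advantages = group_relative_advantages([2.0, 1.0, 1.0, 0.0])
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so a group in which every sample earns the same reward contributes no learning signal.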

3. Margin-Based Rejection Sampling for RL Dataset

To concentrate RL signal on challenging ("borderline") samples, SandwichR adopts targeted instance filtering:

For each candidate input $x$ in the SFT pool, sample $K$ outputs from the initial model. Compute pass@$K$: the count of sampled outputs whose correction matches the reference.
Accept $x$ into the RL dataset iff pass@$K$ falls within the acceptance margin, which excludes queries the model always or never corrects; otherwise, reject.

This strategy restricts RL to queries where the model exhibits partial correctness, thus maximizing gradient leverage on inputs amenable to reasoning-driven improvement.
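The filtering rule above can be sketched as follows. The sample count `k`, the margin bounds `low` and `high`, and the `sample_corrections` callable are hypothetical placeholders, since the source does not state the thresholds; the stub sampler exists only to make the example deterministic.

```python
from typing import Callable, Iterable, List, Tuple


def filter_borderline(
    pool: Iterable[Tuple[str, str]],
    sample_corrections: Callable[[str, int], List[str]],
    k: int = 8,      # ASSUMED number of samples per query
    low: int = 1,    # ASSUMED lower margin on the pass count
    high: int = 7,   # ASSUMED upper margin on the pass count
) -> List[Tuple[str, str]]:
    """Keep (query, reference) pairs whose pass count lies in [low, high]."""
    kept = []
    for query, reference in pool:
        samples = sample_corrections(query, k)
        passes = sum(1 for s in samples if s == reference)
        if low <= passes <= high:
            kept.append((query, reference))
    return kept


def fake_sampler(query: str, k: int) -> List[str]:
    # Deterministic stand-in for sampling from the initial model.
    canned = {
        "easy": ["ok"] * k,
        "hard": ["bad"] * k,
        "borderline": ["ok"] * (k // 2) + ["bad"] * (k - k // 2),
    }
    return canned[query]


pool = [("easy", "ok"), ("hard", "ok"), ("borderline", "ok")]
rl_set = filter_borderline(pool, fake_sampler)
```

Only the query with a mixed outcome survives, which matches the intent of spending RL updates where sampled outputs disagree.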

4. High-Quality Multi-Domain Query Correction Dataset

Recognizing the absence of dedicated benchmarks for complex query correction in real-world pipelines, the authors established a multi-domain, error-injected dataset using noisy transformations of real user queries:

Domains: E-commerce, Video, Medical.
Error synthesis: Each noisy query $x$ is produced from a clean query via:
- Wrong Words (confusable substitution by phonetics/graphics),
- Missing Words (single word deletions),
- Disorder Words (swap of adjacent terms).
Each query contains exactly one error, consistent with short, realistic queries.
Reasoning annotations: Collected via GPT-4o to provide explicit reasoning traces.
Statistics:

Domain	#Train	#Dev	#Test	AvgLen
E-commerce	90,511	1,000	1,000	≈6.4
Video	88,736	1,000	1,000	≈7.1
Medical	94,176	1,000	1,000	≈16.1

This dataset enables systematic evaluation in realistic, multi-domain settings and supports the explicit reasoning trajectories integral to SandwichR.
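The three error types used to build the dataset can be illustrated with a small injector that applies exactly one perturbation per query. This is a sketch under stated assumptions: whitespace tokenization, a caller-supplied confusion table, and uniform choice among applicable operations are all illustrative, and `inject_error` is a hypothetical helper rather than the authors' pipeline.

```python
import random
from typing import Dict, List


def inject_error(query: str, confusions: Dict[str, List[str]], rng: random.Random) -> str:
    """Apply exactly one of: wrong word, missing word, disordered (swapped) words."""
    words = query.split()
    ops = []
    if any(w in confusions for w in words):
        ops.append("wrong")
    if len(words) >= 2:
        ops.extend(["missing", "disorder"])
    if not ops:
        return query  # nothing applicable; leave the query unchanged

    op = rng.choice(ops)
    if op == "wrong":
        # Substitute one word with a phonetically or visually confusable form.
        idx = rng.choice([i for i, w in enumerate(words) if w in confusions])
        words[idx] = rng.choice(confusions[words[idx]])
    elif op == "missing":
        # Delete a single word.
        del words[rng.randrange(len(words))]
    else:
        # Swap two adjacent words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


clean = "cheap wireless headphones"
noisy = inject_error(clean, {"headphones": ["hedphones"]}, random.Random(0))
```

Pairing each `noisy` query with its `clean` source gives the (input, reference) pairs that correction models are trained and scored on.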

5. Comparative Evaluation and Latency Analysis

SandwichR was evaluated using Qwen2.5-1.5B-Instruct with LoRA SFT and GRPO RL, compared against baselines: traditional seq2seq (mT5, BART SFT), CoT prompting (Ans–Rea, Rea–Ans), and larger LLMs (GPT-4o, QwQ-32B, DeepSeek-R1, GrammarGPT-7B). Metrics included $F_{0.5}$, accuracy, and query latency under strict token budgets (20 tokens for E-commerce/Video, 40 for Medical).

Model	E-com F0.5	E-com Acc	Video F0.5	Video Acc	Med F0.5	Med Acc
Ans–Rea SFT+RL	0.211	0.200	0.316	0.292	0.392	0.363
Rea–Ans SFT+RL	0.216	0.207	0.318	0.301	0.387	0.364
SandwichR SFT+RL	0.221	0.213	0.325	0.307	0.396	0.375
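The F0.5 columns above use the general F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below computes it from raw counts; how true positives, false positives, and false negatives are counted for a correction (for example at the edit level) is not specified in the source, so the counts are treated as caller-supplied inputs.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision 0.8 and recall 0.5: F0.5 sits closer to precision than F1 does.
score_f05 = f_beta(tp=8, fp=2, fn=8)
score_f1 = f_beta(tp=8, fp=2, fn=8, beta=1.0)
```

A precision-weighted metric suits query correction because a wrong rewrite of a correct query is usually more harmful than a missed fix.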

Latency evaluation (under limited token budgets):

Method	Budget	E-com Acc	E-com Time (s)	Video Acc	Video Time (s)	Med Acc	Med Time (s)
Rea–Ans	Full (256)	0.207	1.959	0.301	1.143	0.364	1.550
Rea–Ans	Limited	≈0.000	0.457	≈0.000	0.484	0.009	0.900
Ans–Rea	Limited	0.200	0.464	0.292	0.446	0.359	0.893
SandwichR	Limited	0.213	0.467	0.307	0.474	0.374	0.924

SandwichR achieves a 40–70% reduction in latency compared to standard CoT (Rea–Ans) with negligible drop in accuracy, thus resolving the fundamental trade-off in online query correction.
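The per-domain latency reduction can be recomputed directly from the latency table, comparing SandwichR under the limited budget with Rea–Ans under the full 256-token budget. The dictionary and function names below are illustrative; the timings are copied from the table.

```python
# Seconds per query, taken from the latency table above.
full_rea_ans = {"E-commerce": 1.959, "Video": 1.143, "Medical": 1.550}
limited_sandwichr = {"E-commerce": 0.467, "Video": 0.474, "Medical": 0.924}


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline latency that is saved."""
    return 1.0 - new / baseline


reductions = {
    domain: relative_reduction(full_rea_ans[domain], limited_sandwichr[domain])
    for domain in full_rea_ans
}
for domain, value in reductions.items():
    print(f"{domain}: {value:.1%}")
```

For the tabulated timings this gives roughly 76% for E-commerce, 58% for Video, and 40% for Medical.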

6. Training Procedure, Inference Pseudocode, and Complexity Profile

Training proceeds via SFT to learn the SandwichR format, followed by RL on a filtered dataset to maximize alignment of initial and reasoning-informed corrections. The computational bottleneck in autoregressive decoding is mitigated by early emission of $C_{\text{init}}$, enabling inference in $O(B)$ time under a token budget $B$.

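SandwichR Training:
- Stage 1 (SFT): fine-tune the model on reasoning-annotated examples so that it emits $[C_{\text{init}}; R; C_{\text{final}}]$.
- Data filtering: apply margin-based rejection sampling to the SFT pool to select borderline queries.
- Stage 2 (RL): optimize the policy with GRPO on the filtered set using the accuracy and format × consistency rewards.

SandwichR Inference:
- Decode from $x$ until $C_{\text{init}}$ is complete or the token budget is exhausted.
- Return $C_{\text{init}}$ to downstream retrieval.
- Optionally continue decoding $R$ and $C_{\text{final}}$ when the budget allows.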

Standard CoT (Rea–Ans) under strict budgets cannot emit a corrected answer, driving accuracy to zero. SandwichR, by front-loading $C_{\text{init}}$, achieves significant speedup while maintaining reasoning-level correction quality through consistency-aware training.
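The effect of segment ordering under a token budget can be simulated without a model. The sketch below represents an output as a list of labeled tokens and checks whether the answer segment is fully emitted within the budget; the segment labels, the 30-token reasoning length, and the `usable_answer` helper are illustrative assumptions, not measurements from the paper.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (segment label, token text); labels are illustrative


def usable_answer(stream: List[Token], budget: int, segment: str) -> Optional[str]:
    """Return the answer text if the whole segment fits within the budget."""
    wanted = [tok for seg, tok in stream if seg == segment]
    emitted = [tok for seg, tok in stream[:budget] if seg == segment]
    return " ".join(emitted) if wanted and emitted == wanted else None


answer = ["wireless", "headphones"]
reasoning = [f"step{i}" for i in range(30)]  # ASSUMED reasoning length

# Answer-Reasoning-Answer ordering versus reasoning-first ordering.
sandwich = (
    [("init", t) for t in answer]
    + [("reasoning", t) for t in reasoning]
    + [("final", t) for t in answer]
)
rea_ans = [("reasoning", t) for t in reasoning] + [("final", t) for t in answer]

sandwich_limited = usable_answer(sandwich, budget=20, segment="init")
rea_ans_limited = usable_answer(rea_ans, budget=20, segment="final")
rea_ans_full = usable_answer(rea_ans, budget=256, segment="final")
```

With a 20-token budget the reasoning-first stream is cut off before any answer token appears, while the front-loaded stream has already produced its initial correction.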

7. Context and Significance in Search Pipelines

SandwichR advances LLM-based search query correction under operational constraints where real-time accuracy is paramount. Its paradigm enables practical deployment by decoupling answer emission from exhaustive reasoning, then leveraging RL to ensure answer quality is not compromised. The approach is extensible to domains with structured outputs and similar latency demands. The introduction of a reasoning-annotated, error-injected benchmark sets a foundation for further research in domain-adaptive correction tasks and reinforces the utility of aligned multi-stage output in time-critical LLM applications (Zhang et al., 7 Jan 2026).

References (1)

Sandwich Reasoning: An Answer-Reasoning-Answer Approach for Low-Latency Query Correction (2026)
