SandwichR: Fast Query Correction Paradigm

Updated 14 January 2026
  • SandwichR is a query correction paradigm that decomposes outputs into initial correction, reasoning chain, and final correction to reduce latency.
  • It employs an Answer–Reasoning–Answer structure with a consistency-aware reinforcement learning objective to align rapid initial responses with in-depth reasoning.
  • The framework achieves state-of-the-art accuracy with a 40–70% reduction in inference latency, validated on multi-domain, error-injected query datasets.

Sandwich Reasoning (SandwichR) is a paradigm and training strategy for low-latency query correction in search pipelines. It addresses the latency bottleneck of Chain-of-Thought (CoT) reasoning, enabling LLMs to rapidly produce high-quality corrections while explicitly leveraging reasoning without incurring prohibitive computational costs. The framework is defined by its "Answer–Reasoning–Answer" output structure and a consistency-aware reinforcement learning objective, underpinned by targeted sampling and a newly constructed benchmark dataset. SandwichR achieves state-of-the-art accuracy with a 40–70% reduction in inference latency compared to standard reasoning-first approaches (Zhang et al., 7 Jan 2026).

1. "Answer–Reasoning–Answer" Inference Paradigm

SandwichR decomposes output generation for a noisy query $x$ into three sequential segments: initial correction ($C_\text{init}$), explicit reasoning ($R$), and final correction ($C_\text{final}$), so that $y = [C_\text{init}; R; C_\text{final}]$.

  • Initial Correction ($C_\text{init}$): The model generates tokens $(y_1, \dots, y_k)$ conditioned solely on $x$, yielding an immediate correction that is returned for downstream retrieval.
  • Reasoning Chain ($R$): Tokens $(r_1, \dots, r_m)$ encode the model's analytic process, such as error location, type, and justification, enclosed in output tags (e.g., "<reasoning>…</reasoning>") so they can be parsed separately.
  • Final Correction ($C_\text{final}$): Conditioned on both $x$ and the preceding $R$, the model emits $(y_{k+m+1}, \dots, y_{k+m+l})$. This step aligns the initial answer with reasoning-derived insights and supervises the initial answer's quality.

During real-time inference under tight token budgets, only $C_\text{init}$ need be generated, ensuring low latency. The model's training aligns the initial answer with downstream reasoning, so the emitted correction maintains reasoning-level accuracy even when later output segments are omitted.
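The three-segment layout can be illustrated with a small parser. This is a minimal sketch, assuming the output is laid out as the initial correction, then a `<reasoning>…</reasoning>` block, then the final correction; the helper names `SandwichOutput` and `parse_sandwich` are hypothetical and not taken from the paper.

```python
import re
from typing import NamedTuple, Optional


class SandwichOutput(NamedTuple):
    c_init: str
    reasoning: Optional[str]
    c_final: Optional[str]


# Assumed layout: "<C_init><reasoning>...</reasoning><C_final>".
_PATTERN = re.compile(r"^(.*?)<reasoning>(.*?)</reasoning>(.*)$", re.DOTALL)


def parse_sandwich(text: str) -> SandwichOutput:
    """Split a generation into (C_init, R, C_final)."""
    match = _PATTERN.match(text)
    if match is None:
        # Truncated generation (e.g. a tight token budget): keep only the
        # text before any opening tag as the initial correction.
        c_init = text.split("<reasoning>", 1)[0].strip()
        return SandwichOutput(c_init, None, None)
    c_init, reasoning, c_final = (part.strip() for part in match.groups())
    return SandwichOutput(c_init, reasoning, c_final)


full = parse_sandwich(
    "wireless headphones"
    "<reasoning>'hedphones' is misspelled.</reasoning>"
    "wireless headphones"
)
truncated = parse_sandwich("wireless headphones<reasoning>'hed")
```

In the truncated case only `c_init` is populated, which mirrors the low-latency path in which later segments are never generated.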

2. Consistency-Aware Reinforcement Learning (RL) Objective

SandwichR employs two-stage training. After supervised fine-tuning (SFT) teaches the model to format outputs as $[C_\text{init}; R; C_\text{final}]$, RL is used to enforce consistency across segments and incentivize high-fidelity corrections.

  • Reward Composition:
    • Accuracy Reward: $R_\text{acc}(C_\text{init}) = 0$ if $C_\text{init} = Q_\text{noise}$; otherwise $R_\text{acc}(C_\text{init}) = F_{0.5}(C_\text{init}, Q_\text{clean})$, where $F_{0.5}$ is the F0.5-score.
    • Format × Consistency Reward: $R_\text{fmt} \times R_\text{cons} = 1$ iff the output matches the SandwichR structure and $C_\text{init} = C_\text{final}$; 0 otherwise.
    • Total reward:

    $$R_\text{total}(y) = w_\text{acc} R_\text{acc} + w_\text{fmt} R_\text{fmt} + w_\text{cons} R_\text{cons}$$

  • Gradient Update: The objective $J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}[R_\text{total}(y)]$ is optimized via REINFORCE gradients:

    $$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\left[R_\text{total}(y)\, \nabla_\theta \log p_\theta(y \mid x)\right]$$
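The REINFORCE gradient follows from the standard log-derivative identity $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$. The steps are written out here as a reminder, assuming $R_\text{total}(y)$ does not itself depend on $\theta$ and that outputs $y$ range over a discrete set:

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \sum_{y} p_\theta(y \mid x)\, R_\text{total}(y) \\
  &= \sum_{y} R_\text{total}(y)\, \nabla_\theta p_\theta(y \mid x) \\
  &= \sum_{y} R_\text{total}(y)\, p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x) \\
  &= \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}
     \left[ R_\text{total}(y)\, \nabla_\theta \log p_\theta(y \mid x) \right]
\end{aligned}
```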

In practice, the policy gradient is stabilized using GRPO (Group Relative Policy Optimization, a PPO variant), rewarding high $F_{0.5}$, exact output format, and segment consistency. This alignment ensures that rapid initial corrections ($C_\text{init}$) reflect post-hoc reasoning ($R$, $C_\text{final}$), which is how the method addresses the latency-accuracy trade-off.
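The reward terms above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the weights default to 1.0 because the source does not give their values, the $F_{0.5}$ score is passed in precomputed because the source does not specify how precision and recall are extracted, and the group-normalized advantage is the usual GRPO formulation (population standard deviation is an arbitrary choice here). All function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def total_reward(c_init: str, c_final: str, q_noise: str, f05: float,
                 well_formed: bool, w_acc: float = 1.0,
                 w_fmt: float = 1.0, w_cons: float = 1.0) -> float:
    """Weighted sum of accuracy, format, and consistency rewards."""
    # Accuracy reward is zero when the model simply echoes the noisy query.
    r_acc = 0.0 if c_init == q_noise else f05
    r_fmt = 1.0 if well_formed else 0.0
    r_cons = 1.0 if c_init == c_final else 0.0
    return w_acc * r_acc + w_fmt * r_fmt + w_cons * r_cons


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

An output that copies the noisy query unchanged still earns the format and consistency terms, so in this sketch the relative weights decide how strongly echoing is discouraged.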

3. Margin-Based Rejection Sampling for RL Dataset

To concentrate RL signal on challenging ("borderline") samples, SandwichR adopts targeted instance filtering:

  • For each candidate input $x$ in the SFT pool, sample $N$ outputs from the initial model. Compute pass@$N$: the number of samples for which $F_{0.5}(C_\text{final}, Q_\text{clean}) > 0$.

  • Accept $x$ into the RL dataset iff pass@$N > 0$; otherwise, reject.

This strategy restricts RL to queries where the model exhibits partial correctness, thus maximizing gradient leverage on inputs amenable to reasoning-driven improvement.
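The filtering rule described above can be sketched as follows. The sampler and scorer are passed in as callables because the source does not specify them; `filter_rl_pool`, `toy_sampler`, and `toy_score` are hypothetical names, and the toy stand-ins exist only to make the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (noisy query, clean query)


def filter_rl_pool(pool: Iterable[Pair],
                   sample_final: Callable[[str], str],
                   f05: Callable[[str, str], float],
                   n: int) -> List[Pair]:
    """Keep a query iff at least one of n sampled C_final scores F0.5 > 0."""
    kept = []
    for q_noise, q_clean in pool:
        passes = sum(
            1 for _ in range(n) if f05(sample_final(q_noise), q_clean) > 0
        )
        if passes > 0:  # pass@N > 0
            kept.append((q_noise, q_clean))
    return kept


# Toy stand-ins (assumptions for illustration only).
def toy_sampler(q_noise: str) -> str:
    # A "model" that only knows how to fix one misspelling.
    return q_noise.replace("hedphones", "headphones")


def toy_score(prediction: str, q_clean: str) -> float:
    # Exact match used in place of a real F0.5 scorer.
    return 1.0 if prediction == q_clean else 0.0


pool = [
    ("wireless hedphones", "wireless headphones"),
    ("cheep flights", "cheap flights"),
]
rl_set = filter_rl_pool(pool, toy_sampler, toy_score, n=4)
```

Here the second query is rejected because none of its samples scores above zero, so it would contribute no useful reward signal during RL.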

4. High-Quality Multi-Domain Query Correction Dataset

Recognizing the absence of dedicated benchmarks for complex query correction in real-world pipelines, the authors established a multi-domain, error-injected dataset using noisy transformations of real user queries:

  • Domains: E-commerce, Video, Medical.

  • Error synthesis: Each clean query is corrupted, $Q_\text{clean} \rightarrow Q_\text{noise}$, via one of:

    • Wrong Words (confusable substitution by phonetics/graphics),
    • Missing Words (single word deletions),
    • Disorder Words (swap of adjacent terms).
  • Each query contains exactly one error, consistent with short, realistic queries.
  • Reasoning annotations: Collected via GPT-4o to provide explicit reasoning traces.
  • Statistics:

| Domain | #Train | #Dev | #Test | AvgLen |
| --- | --- | --- | --- | --- |
| E-commerce | 90,511 | 1,000 | 1,000 | ≈6.4 |
| Video | 88,736 | 1,000 | 1,000 | ≈7.1 |
| Medical | 94,176 | 1,000 | 1,000 | ≈16.1 |

This dataset enables systematic evaluation in realistic, multi-domain settings and supports the explicit reasoning trajectories integral to SandwichR.
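The three error types listed above can be sketched as single-error transformations over a tokenized query. This is an illustrative approximation, not the authors' pipeline: queries are assumed to be whitespace-tokenized, and the `confusions` dictionary is a hypothetical stand-in for whatever phonetic or graphical confusion sets were actually used.

```python
import random
from typing import Dict, List, Optional


def drop_word(tokens: List[str], rng: random.Random) -> Optional[List[str]]:
    """Missing Words: delete one token."""
    if len(tokens) < 2:
        return None  # deleting the only token would leave an empty query
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def swap_adjacent(tokens: List[str], rng: random.Random) -> Optional[List[str]]:
    """Disorder Words: swap one pair of adjacent tokens."""
    if len(tokens) < 2:
        return None
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def substitute_confusable(tokens: List[str], rng: random.Random,
                          confusions: Dict[str, List[str]]) -> Optional[List[str]]:
    """Wrong Words: replace one token with a confusable alternative."""
    candidates = [i for i, tok in enumerate(tokens) if tok in confusions]
    if not candidates:
        return None
    i = rng.choice(candidates)
    out = list(tokens)
    out[i] = rng.choice(confusions[tokens[i]])
    return out


rng = random.Random(0)
clean = ["red", "running", "shoes"]
dropped = drop_word(clean, rng)
swapped = swap_adjacent(clean, rng)
wrong = substitute_confusable(clean, rng, {"shoes": ["shows"]})
```

Each function applies exactly one error, matching the one-error-per-query constraint; a production pipeline would also need to reject no-op results, such as swapping two identical adjacent tokens.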

5. Comparative Evaluation and Latency Analysis

SandwichR was evaluated using Qwen2.5-1.5B-Instruct with LoRA SFT and GRPO RL, and compared against three groups of baselines: traditional seq2seq models (mT5, BART SFT), CoT prompting orders (Ans–Rea, Rea–Ans), and larger LLMs (GPT-4o, QwQ-32B, DeepSeek-R1, GrammarGPT-7B). Metrics included $F_{0.5}$, accuracy, and query latency under strict token budgets (20 tokens for E-commerce/Video, 40 for Medical).

| Model | E-com F0.5 | E-com Acc | Video F0.5 | Video Acc | Med F0.5 | Med Acc |
| --- | --- | --- | --- | --- | --- | --- |
| Ans–Rea SFT+RL | 0.211 | 0.200 | 0.316 | 0.292 | 0.392 | 0.363 |
| Rea–Ans SFT+RL | 0.216 | 0.207 | 0.318 | 0.301 | 0.387 | 0.364 |
| SandwichR SFT+RL | 0.221 | 0.213 | 0.325 | 0.307 | 0.396 | 0.375 |

Latency evaluation (under limited token budgets):

| Method | Budget | E-com Acc | E-com Time (s) | Video Acc | Video Time (s) | Med Acc | Med Time (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Rea–Ans | Full (256) | 0.207 | 1.959 | 0.301 | 1.143 | 0.364 | 1.550 |
| Rea–Ans | Limited | ≈0.000 | 0.457 | ≈0.000 | 0.484 | 0.009 | 0.900 |
| Ans–Rea | Limited | 0.200 | 0.464 | 0.292 | 0.446 | 0.359 | 0.893 |
| SandwichR | Limited | 0.213 | 0.467 | 0.307 | 0.474 | 0.374 | 0.924 |

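SandwichR achieves a 40–70% reduction in latency compared to standard CoT (Rea–Ans at the full 256-token budget) without losing accuracy: in the latency table above, its limited-budget accuracy is at or slightly above the full-budget Rea–Ans accuracy in all three domains. This addresses the core latency-accuracy trade-off in online query correction.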
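The per-domain reductions can be recomputed directly from the timings in the latency table. The snippet below is only arithmetic on those reported figures, comparing SandwichR under the limited budget against Rea–Ans at the full budget; the variable and function names are illustrative.

```python
# Query times in seconds, copied from the latency table.
full_cot = {"E-commerce": 1.959, "Video": 1.143, "Medical": 1.550}
sandwich = {"E-commerce": 0.467, "Video": 0.474, "Medical": 0.924}


def latency_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline time that is saved."""
    return 1.0 - new / baseline


reductions = {
    domain: latency_reduction(full_cot[domain], sandwich[domain])
    for domain in full_cot
}

for domain, value in reductions.items():
    print(f"{domain}: {value:.1%}")
# E-commerce: 76.2%
# Video: 58.5%
# Medical: 40.4%
```

On these figures the reduction ranges from about 40% (Medical) to about 76% (E-commerce), so the E-commerce saving lands slightly above the quoted 40–70% range.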

6. Training Procedure, Inference Pseudocode, and Complexity Profile

Training proceeds via SFT to learn the SandwichR format, followed by RL on a filtered dataset to maximize alignment of initial and reasoning-informed corrections. The computational bottleneck in autoregressive decoding is mitigated by early emission of $C_\text{init}$, enabling inference in $O(|C_\text{init}|)$ time under a token budget $K \gg |C_\text{init}|$.

SandwichR Training:

procedure TRAIN_SANDWICHR(D_train):
  # Stage 1: SFT teaches the [C_init; R; C_final] output format
  for (x, y_clean) in D_train:
    prompt ← format_SandwichR_prompt(x)
    # C_init and C_final are both set to the clean query during SFT
    target ← [y_clean; reasoning_by_GPT4o; y_clean]
    θ ← SFT_update(θ, prompt, target)
  end for

  # Stage 2: Consistency-Aware RL
  # Rejection sampling: keep x only if at least one of N samples has F0.5 > 0
  S_RL ← {}
  for (x, y_clean) in small_subset(D_train):
    samples ← {model_sample(x) for i = 1..N}
    if any(F0.5(sample.C_final, y_clean) > 0 for sample in samples):
      add (x, y_clean) to S_RL
  end for

  for epoch in 1..E:
    for (x, y_clean) in S_RL:
      y ~ p_θ(·|x)
      compute R_acc, R_fmt, R_cons, R_total from y, x, y_clean
      θ ← GRPO_update(θ, y, R_total)
    end for
  end for
end procedure

SandwichR Inference:

procedure INFER_SANDWICHR(x, budget):
  y_seq ← []
  # Generate C_init only, stopping at its end or when the budget is exhausted
  while not end_of_C_init(y_seq) and |y_seq| < budget:
    y_seq ← y_seq + [sample_next_token(x, y_seq)]
  end while
  # R and C_final may optionally be generated afterwards if tokens remain
  return y_seq  # use initial correction immediately
end procedure

Standard CoT (Rea–Ans) under strict budgets cannot emit a corrected answer, driving accuracy to zero. SandwichR, by front-loading $C_\text{init}$, achieves significant speedup while maintaining reasoning-level correction quality through consistency-aware training.
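The contrast between the two orderings under a budget can be made concrete with a toy token stream. This is a minimal sketch, assuming the end of $C_\text{init}$ is signalled by an opening `<reasoning>` token and that generation can be consumed as an iterator; no real model is called, and both function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, List

REASONING_OPEN = "<reasoning>"    # assumed end-of-C_init marker
REASONING_CLOSE = "</reasoning>"


def take_initial_correction(token_stream: Iterable[str],
                            budget: int) -> List[str]:
    """Answer-first ordering: stop at the reasoning tag or at the budget."""
    tokens = []
    for token in islice(token_stream, budget):
        if token == REASONING_OPEN:
            break
        tokens.append(token)
    return tokens


def answer_after_reasoning(token_stream: Iterable[str],
                           budget: int) -> List[str]:
    """Reasoning-first ordering: the answer only exists after the closing tag."""
    seen = list(islice(token_stream, budget))
    if REASONING_CLOSE not in seen:
        return []  # budget ran out before any answer token was produced
    return seen[seen.index(REASONING_CLOSE) + 1:]


reasoning = [REASONING_OPEN, "fix", "spelling", "of", "hedphones", REASONING_CLOSE]
answer = ["wireless", "headphones"]

sandwich_stream = answer + reasoning + answer
rea_ans_stream = reasoning + answer

early = take_initial_correction(iter(sandwich_stream), budget=4)
starved = answer_after_reasoning(iter(rea_ans_stream), budget=4)
complete = answer_after_reasoning(iter(rea_ans_stream), budget=8)
```

With a budget of four tokens the answer-first stream already yields the full correction, while the reasoning-first stream yields nothing until the budget is large enough to cover the whole reasoning chain.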

7. Context and Significance in Search Pipelines

SandwichR advances LLM-based search query correction under operational constraints where real-time accuracy is paramount. Its paradigm enables practical deployment by decoupling answer emission from exhaustive reasoning, then leveraging RL to ensure answer quality is not compromised. The approach is extensible to domains with structured outputs and similar latency demands. The introduction of a reasoning-annotated, error-injected benchmark sets a foundation for further research in domain-adaptive correction tasks and reinforces the utility of aligned multi-stage output in time-critical LLM applications (Zhang et al., 7 Jan 2026).
