Decoding-Time Realignment (DeRa)

Updated 10 November 2025
  • Decoding-Time Realignment (DeRa) is a family of techniques that dynamically re-aligns signals, representations, or outputs at the decoding stage to correct errors and adjust model preferences.
  • It employs methods such as autocorrelation, logit mixing, dynamic time warping, and subspace projection to synchronize outputs across various applications like communications, language modeling, and genomic compression.
  • These techniques improve system performance by reducing latency and error, enhancing throughput in packet transmissions, and enabling flexible, decoder-side adaptations for diverse inference tasks.

Decoding-time Realignment (DeRa) refers to a family of techniques for realigning signals, representations, or hypotheses at the decoding or inference stage, optimizing for error correction, model alignment, efficiency, or adaptive preference integration. Although it first appeared in channel coding for communications, DeRa has rapidly diversified, enabling dynamic control of alignment strength in LLMs, universal compatibility in speculative decoding, low-latency realignment in large-batch inference, decoder-side genomic compression, robust speech recognition via iterative hypothesis refinement, and more. Methods typically involve run-time estimation or adaptation of synchronization parameters (timing, phase, embedding subspaces, or alignments between output spaces) and perform decoding conditioned on these refined alignments.

1. Theoretical Foundations and General Principles

Decoding-time Realignment emerges wherever a system must decode output hypotheses or signals that are systematically misaligned, whether from replicated transmissions, preference shifts, output drift, heterogeneous batch synchronization, or noise. The defining principle is to push the major alignment, correction, or adaptation operations into the decoding/inference phase, often leveraging side information, raw signal properties, or model outputs that are inaccessible or expensive to integrate at the encoding or training phase.

The methodologies span:

  • Signal synchronization: Estimating and correcting offsets (timing/phase) between signals or packet replicas (Zidane et al., 2015).
  • Probabilistic realignment: Dynamically interpolating outputs/logits of multiple models, enabling fine-grained control of preference or alignment degree (Liu et al., 5 Feb 2024), or adapting for preference during generation (Zhang et al., 26 Feb 2025).
  • Latent or structural alignment: Mapping outputs across different tokenizations, vocabularies, or sequence schemas, e.g., via dynamic time warping for speculative decoding (Xiao et al., 17 Oct 2025).
  • Embedding space sanitization: Projecting model subspaces to suppress specific vulnerabilities or artifacts in model output (Kilictas et al., 22 Jun 2025).
  • Decoder-side side-information integration: Pushing alignment and error-correction for data sources into the decoding pipeline using distributed coding frameworks (Gershon et al., 2022).
  • Batch and parallel synchronization: Maintaining correctness in speculative decoding for batches with ragged prediction lengths by enforcing synchronization invariants via tensor realignment (Zhang et al., 26 Oct 2025).

A core unifying theme is the optimization of a realignment operator, derived from analytical modeling of the target domain (e.g., autocorrelation for signal alignment, minimum-distance mapping for sequence alignment, logit mixture for probabilistic control), and executed at low-latency during decoding or inference.

2. Signal Processing: Synchronization of Packet Replicas

The DeRa methodology for physical-layer communications, specifically within the MARSALA protocol (Zidane et al., 2015), addresses the combination of multiple asynchronously received packet replicas sent in a random-access communication scheme. Timing and phase offsets accrued across the different transmissions degrade coherent combining.

The system models received replica signals as

$r_1(t) = y(t + T_1)\, e^{j[\varphi_1 + 2\pi \Delta f_1 t]} + n_1(t) + s_1(t)$

$r_i(t) = y(t - N_i T_s + T_i)\, e^{j[\varphi_i + 2\pi \Delta f_i t]} + n_i(t) + s_i(t)$

for an AWGN channel with per-user timing, phase, and frequency offsets.

DeRa procedure:

  1. Autocorrelation-based localization: Use cross-correlation between reference slot and potential replicas to identify all candidate replica slots and measure coarse offsets.
  2. Fine synchronization: At each replica, estimate the timing offset $\Delta\tau_i$ from the position of the correlation peak, and the phase difference $\Delta\varphi_i$ from the phase of the peak.
  3. Digital compensation: Apply the inverse of the detected offset/phase to each sampled signal, realigning all signals for optimal coherent sum.
  4. Residual error modeling: Quantify performance loss from residual misalignment analytically, showing that for typical SNRs and oversampling rates, DeRa reduces SNIR degradation to less than 0.5 dB and yields throughput gains of 25–40% versus prior methods.
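The following NumPy sketch illustrates the estimate-and-compensate idea behind steps 1–3 under simplifying assumptions (integer-sample circular delays, no frequency offset, known reference slot); names such as `estimate_offsets` are illustrative and not from the MARSALA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Reference slot: a QPSK-like complex symbol sequence standing in for the first replica.
y = rng.choice([-1.0, 1.0], size=256) + 1j * rng.choice([-1.0, 1.0], size=256)

def make_replica(y, delay, phase, snr_db=10.0):
    """Emulate an asynchronously received replica: circular delay, phase rotation, AWGN."""
    r = np.roll(y, delay) * np.exp(1j * phase)
    noise_std = np.sqrt(10 ** (-snr_db / 10) / 2)
    return r + noise_std * (rng.standard_normal(y.size) + 1j * rng.standard_normal(y.size))

def estimate_offsets(reference, replica):
    """Coarse timing from the cross-correlation peak; phase from the peak's argument."""
    corr = np.fft.ifft(np.fft.fft(replica) * np.conj(np.fft.fft(reference)))
    delay_hat = int(np.argmax(np.abs(corr)))
    phase_hat = float(np.angle(corr[delay_hat]))
    return delay_hat, phase_hat

replicas = [make_replica(y, d, p) for d, p in [(7, 0.8), (23, -1.3)]]
aligned = []
for r in replicas:
    d_hat, p_hat = estimate_offsets(y, r)
    aligned.append(np.roll(r, -d_hat) * np.exp(-1j * p_hat))  # digital compensation
combined = (y + sum(aligned)) / (1 + len(aligned))  # coherent sum after realignment
```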

The practical impact is greatest in overloaded random-access systems, where correct alignment at the decoder unlocks nearly the full theoretical gain from sending multiple packet replicas. Throughput and PLR improvements are most pronounced for three or more replicas under moderate or high SNR.

3. Language Modeling: Logit Mixing for Flexible Alignment Control

Decoding-time Realignment for LLMs (Liu et al., 5 Feb 2024) provides a fast, post-training mechanism to control the effective degree of alignment to human preferences or reference models. The canonical KL-regularized RLHF setup,

$\pi^* = \arg\max_\pi \; \mathbb{E}[r(x,y)] - \beta\, \mathbb{E}[\mathrm{KL}(\pi \,\|\, p_0)]$

produces a one-parameter family of aligned models for different values of $\beta$, traditionally requiring full retraining for each regularization strength.

DeRa method:

  • Define a mixing coefficient $\lambda \ge 0$.
  • At each token $t$, combine reference logits $h^0_t$ and aligned logits $h^A_t$ as $(1-\lambda)\, h^0_t + \lambda\, h^A_t$.
  • The next-token output distribution is $\mathrm{softmax}\big((1-\lambda)\, h^0_t + \lambda\, h^A_t\big)$.
  • $\lambda = 0$ yields the unaligned model, $\lambda = 1$ the base aligned model, $\lambda > 1$ extrapolates to even weaker regularization, and $\lambda < 1$ enforces stronger regularization.
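The sketch below shows one way this mixing rule could be wired into a decode loop, assuming the reference and aligned models share a tokenizer; the checkpoint names are placeholders and greedy decoding is used purely for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/sft-model")                 # hypothetical checkpoints
ref = AutoModelForCausalLM.from_pretrained("org/sft-model")          # reference model p_0
aligned = AutoModelForCausalLM.from_pretrained("org/aligned-model")  # RLHF-aligned model

@torch.no_grad()
def dera_generate(prompt: str, lam: float = 1.0, max_new_tokens: int = 64) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        h_ref = ref(ids).logits[:, -1, :]          # reference logits h^0_t
        h_al = aligned(ids).logits[:, -1, :]       # aligned logits  h^A_t
        mixed = (1.0 - lam) * h_ref + lam * h_al   # lambda controls effective alignment strength
        next_id = mixed.argmax(dim=-1, keepdim=True)  # greedy; sample from softmax(mixed) in general
        ids = torch.cat([ids, next_id], dim=-1)
        if tok.eos_token_id is not None and next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

Sweeping `lam` at inference time then plays the role that retraining with different regularization strengths $\beta$ plays in the standard setup.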

Empirical studies confirm that DeRa's sweep over $\lambda$ reproduces the reward/divergence trade-off of costly multi-$\beta$ retraining for tasks such as summarization, hallucination control, and chat alignment. The approach is highly practical: it requires no change to the underlying model, only dual-model inference per decode step. The main computational cost is doubled inference; accuracy tracks full retraining to within measurement error.

4. Universal Speculative Decoding: Sequence Realignment across Tokenizations

Standard speculative decoding methods assume that the draft and target model share the same vocabulary, limiting model pair selection and interoperability. DeRa, via the TokenTiming method (Xiao et al., 17 Oct 2025), relaxes this constraint by introducing a decoding-time realignment between the draft and target token sequences using dynamic time warping (DTW).

TokenTiming/DeRa steps:

  1. The draft model generates $D = [d_1, \dots, d_m]$ under its vocabulary $\mathcal{V}_d$.
  2. The string corresponding to $D$ is re-tokenized under the target model's vocabulary, yielding $T = [t_1, \dots, t_n]$.
  3. DTW aligns $D$ and $T$ using Levenshtein edit distance; the optimal alignment $\pi^*$ defines a many-to-many mapping between draft and target tokens.
  4. Draft probabilities are mapped and aggregated onto target tokens for probabilistic verification; acceptance is performed token-wise as usual.
  5. This process is lossless and preserves the distribution correctly, enabling speculative decoding speedups without retraining or rematching vocabularies.
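To make the alignment step concrete, the toy sketch below computes a DTW warping path between a draft-vocabulary and a target-vocabulary tokenization of the same string, using a string-similarity score as a stand-in for the Levenshtein cost; the probability mapping and verification stages (steps 4–5) are omitted, and this is not the TokenTiming implementation.

```python
from difflib import SequenceMatcher

def token_cost(a: str, b: str) -> float:
    """Cheap stand-in for a normalized edit-distance cost between token strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def dtw_align(draft_tokens, target_tokens):
    """Optimal warping path as (draft_idx, target_idx) pairs; the path is many-to-many."""
    m, n = len(draft_tokens), len(target_tokens)
    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = token_cost(draft_tokens[i - 1], target_tokens[j - 1])
            prev_cost, prev = min(
                (cost[i - 1][j - 1], (i - 1, j - 1)),  # advance both sequences
                (cost[i - 1][j], (i - 1, j)),          # several draft tokens map to one target token
                (cost[i][j - 1], (i, j - 1)),          # one draft token maps to several target tokens
            )
            cost[i][j] = c + prev_cost
            back[(i, j)] = prev
    path, ij = [], (m, n)
    while ij != (0, 0):                                # trace the optimal path back to the origin
        path.append((ij[0] - 1, ij[1] - 1))
        ij = back[ij]
    return list(reversed(path))

# Example: two tokenizations of the same text under different vocabularies.
print(dtw_align(["dec", "oding", "-time", " re", "alignment"],
                ["decoding", "-", "time", " realignment"]))
```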

Empirical results demonstrate up to a 1.57× speedup (acceptance rate improved from 0.18 to 0.21) and parity with target-only decoding on output quality. When combined with batching and advanced scheduling, this approach removes a key bottleneck for generic cross-model acceleration in production.

5. Model Space Realignment: Subspace Purification and Embedding Surgery

Certain tokens induce undesirable effects on model generation—e.g., semantic drift via the em-dash in transformers—due to embedding entanglement and cumulative representation instability. DeRa, as generalized in (Kilictas et al., 22 Jun 2025), projects the embedding space to eliminate directions associated with problematic tokens, ensuring the model's behavior remains coherent upon such token suppression.

Pipeline:

  1. Clause purification: Remove all instances of the problematic token (e.g., em-dash) from the generation prefix at each step.
  2. Embedding realignment: Compute the normalized embedding direction $u = e/\|e\|_2$, where $e$ is the embedding of the problematic token; project the embedding matrix using $P = I - uu^{\top}$ so that all embeddings are orthogonal to $u$.
  3. Decoding with projected embeddings and purified history, ensuring neither accidental generation nor latent impact of the problematic token.
  4. Convergence: Projection reaches a fixed point in one step; further decoding remains in the invariant subspace.
  5. Empirical gains: Perplexity increase <0.2; em-dash generation rate drops from 12% to 0.1%; topic coherence and human preference increase by 32% and 74% respectively.
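A minimal PyTorch sketch of the embedding realignment (step 2) is shown below; the checkpoint name is a placeholder, whether the problematic token maps to a single vocabulary id depends on the tokenizer, and clause purification and the decoding loop are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("org/base-model")   # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained("org/base-model")

bad_id = tok.convert_tokens_to_ids("—")                 # id of the problematic token (tokenizer-dependent)
E = model.get_input_embeddings().weight                 # (vocab_size, d_model)

with torch.no_grad():
    u = E[bad_id] / E[bad_id].norm()                    # normalized direction u
    E -= (E @ u).unsqueeze(-1) * u                      # apply P = I - u u^T to every embedding row
    # After one pass E[bad_id] is ~0 and all rows are orthogonal to u, i.e. the fixed point above.
```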

This suggests that the DeRa formalism is both general and effective for surgical manipulation of LM behavior at decode time without retraining.

6. Decoder-Side Alignment in Genomic Compression

In distributed genomic compression (Gershon et al., 2022), DeRa enables reference-free, low-overhead read encoding by delegating all alignment and error correction operations to the decoder, where full side information (the reference genome) is available.

Framework:

  • Reads from the unknown genome are each compressed into a short identifier (bit-sample), an inner-code syndrome for substitution error correction, a validation syndrome, and a batchwise outer-code syndrome.
  • At decoding, the reference genome is used to identify candidate alignments for each read using a novel shift-compensating distance for substitution and single deletion errors; only successful candidates that pass inner code error correction and validation are retained.
  • An outer Reed–Solomon code corrects residual errors or erasures over the batch.
  • Performance matches or exceeds competing compressors, especially at low coverage, and CPU cost is shifted almost entirely to the decoder.
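As a rough illustration of the decoder-side candidate search, the sketch below scans the reference for windows that match a read up to a few substitutions, compensating at most one deletion by re-inserting the reference base; this toy distance is only in the spirit of the shift-compensating distance, and the inner/outer coding layers are omitted.

```python
def candidate_alignments(read: str, reference: str, max_subs: int = 2) -> list[int]:
    """Reference positions where `read` matches within `max_subs` substitutions,
    allowing at most one deletion in the read."""
    hits, L = [], len(read)
    for pos in range(len(reference) - L):
        window = reference[pos:pos + L + 1]
        # Substitution-only check against the length-L window.
        if sum(a != b for a, b in zip(read, window[:L])) <= max_subs:
            hits.append(pos)
            continue
        # Single-deletion check: re-insert the reference base at each split point.
        for k in range(L + 1):
            patched = read[:k] + window[k] + read[k:]
            if sum(a != b for a, b in zip(patched, window)) <= max_subs:
                hits.append(pos)
                break
    return hits

# A read with one base deleted relative to the reference still finds its position.
print(candidate_alignments("ACGTGGA", "TTACGTAGGATCCAGT", max_subs=1))  # -> [2]
```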

This paradigm exemplifies a reduction of encoder complexity at the cost of decoder realignment, enabled by effective, problem-tailored side-information utilization and error-correcting codes.

7. Correctness and Overhead in Batched/Parallel Decoding

In large batch serving of LLMs, speculative decoding creates "ragged tensors" when each sequence in the batch accepts a variable number of tokens per speculative step, corrupting positional, attention, and KV-cache invariants necessary for model correctness (Zhang et al., 26 Oct 2025). Decoding-time Realignment comprises tensor operations that:

  • Unpad and repad every sequence in the batch to restore rectangular shape and correct position IDs.
  • Update the KV-cache and attention mask to match true prefix lengths per sequence.
  • Preserve token-wise output equivalence to standard AR decoding.
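A toy version of the unpad/repad realignment is sketched below (right-padding and no KV-cache handling, for simplicity); real serving stacks typically left-pad and also compact the cache, so this is illustrative rather than the paper's implementation.

```python
import torch

def realign_batch(accepted: list[torch.Tensor], pad_id: int = 0):
    """Re-pad a ragged batch (one 1-D tensor of accepted tokens per sequence) into a
    rectangular input_ids tensor with a matching attention mask and position ids."""
    max_len = max(t.numel() for t in accepted)
    input_ids = torch.full((len(accepted), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros_like(input_ids)
    for i, t in enumerate(accepted):
        input_ids[i, : t.numel()] = t
        attention_mask[i, : t.numel()] = 1
    # 0-based positions per sequence; padded slots are clamped to 0 and masked out anyway.
    position_ids = (attention_mask.cumsum(dim=-1) - 1).clamp(min=0)
    return input_ids, attention_mask, position_ids

# Example: three sequences that accepted 3, 1, and 2 speculative tokens respectively.
ragged = [torch.tensor([11, 12, 13]), torch.tensor([21]), torch.tensor([31, 32])]
ids, mask, pos = realign_batch(ragged)
```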

Empirical measurements reveal that this DeRa process may comprise up to 40% of compute at batch size 8, dominating latency for large-scale inference. The EXSpec scheduler reduces DeRa calls by grouping same-length sequences dynamically, demonstrating that smarter scheduling can sidestep most synchronization overhead without performance loss.

Summary Table: Core Domains and DeRa Mechanisms

| Domain | Key DeRa Principle | Primary Methodology |
| --- | --- | --- |
| Satellite comms / signals | Physical signal synchronization | Autocorrelation, cross-correlation |
| LLM alignment | Logit mixing | Linear interpolation of logits |
| Speculative decoding (LMs) | Sequence alignment | Dynamic time warping |
| Embedding control (LMs) | Subspace projection | Linear embedding purification |
| Genomic compression | Decoder-side alignment | Shift-compensating distance |
| Batch inference (LLMs) | Synchronization | Tensor realignment, padding |

DeRa constitutes an increasingly central paradigm for runtime adaptation, robustness, and cross-distribution interoperability across diverse computational disciplines. Its hallmark is moving expensive or uncertain alignment, correction, or preference-integration procedures from costly pre-processing or retraining to efficient, problem-specific realignment at inference or decoding time.
