Rephrasing Adapter in Sequence Models
- A rephrasing adapter is an auxiliary module that modifies target representations or conditioning signals to reduce modality gaps in sequence modeling.
- It is implemented in two main paradigms: a reference rephraser for non-autoregressive translation and a prompt rephrasing adapter for text-to-video generation.
- Empirical results show significant BLEU gains in NAT and improved video quality metrics, highlighting its effectiveness in semantic realignment.
A rephrasing adapter is an auxiliary module employed in sequence modeling pipelines to dynamically adjust target representations or conditioning contexts, thereby narrowing the mismatch (“modality gap”) between model outputs and training or inference conditions. Recent studies have instantiated rephrasing adapters in two principal paradigms: (1) as a reference rephraser in non-autoregressive translation models to mitigate multi-modality in target sequences (Shao et al., 2022), and (2) as a prompt rephrasing adapter bridging LLMs and diffusion models for text-to-video (T2V) generation, enabling more semantically faithful video outputs from concise or under-specified prompts (Zhang et al., 6 Nov 2025).
1. Fundamental Motivations and Problem Setting
Rephrasing adapters address the challenge that arises when a model’s output must be supervised or conditioned relative to a “hard” reference that does not fully represent the set of valid outputs. This is prevalent in tasks exhibiting multiplicity of correct answers: non-autoregressive neural machine translation (NAT), where multiple surface realizations are possible for a given meaning, and generative cross-modal models, where user prompts are concise but the intended semantic payload is expansive or context-dependent. The rephrasing adapter explicitly re-aligns or expands the reference or conditioning signal to better fit the output mode or input prompt without altering task semantics.
2. Architectural Realizations
2.1 Reference Rephraser for NAT
The reference rephraser comprises a shallow (two-layer) non-autoregressive Transformer decoder. Its inputs are the original reference token embeddings (serving as the decoder’s “self” input) and the final hidden states of the NAT decoder (cross-attended). Unlike standard encoder-decoder setups, there is no dedicated encoder for the rephraser; the reference stream provides its own context. The module outputs token distributions over a sequence whose length matches that of the reference (Shao et al., 2022).
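A minimal PyTorch sketch of such a rephraser is given below, assuming a standard `nn.TransformerDecoder` stack with no causal mask; class and argument names (`ReferenceRephraser`, `d_model`, `n_heads`) are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class ReferenceRephraser(nn.Module):
    """Sketch: 2-layer non-autoregressive Transformer decoder.

    Self input  : embeddings of the original reference tokens.
    Cross input : final hidden states of the NAT decoder (as 'memory').
    Output      : per-position token distributions, same length as the reference.
    """

    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, reference_tokens, nat_hidden_states):
        # No causal mask: all positions are rephrased in parallel.
        ref_emb = self.embed(reference_tokens)               # (B, T_ref, d_model)
        h = self.decoder(tgt=ref_emb, memory=nat_hidden_states)
        return self.out_proj(h).log_softmax(dim=-1)          # (B, T_ref, vocab)
```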
2.2 Rephrasing Adapter for Text-to-Video
In RISE-T2V, the Rephrasing Adapter (RA) is a two-layer feed-forward module receiving the hidden states corresponding to the rephrased sub-sequence from a decoder-only LLM (e.g., LLaMA2-Chat) after it processes a concatenation of an instruction prompt and the original user prompt. The adapter applies two linear transformations with a GeLU activation in between, producing a conditioning vector c. This conditioning vector is injected into each cross-attention block of the video U-Net or DiT, such that at each diffusion step t, the model’s denoiser attends to c alongside video features (Zhang et al., 6 Nov 2025).
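The sketch below shows one plausible realization of the RA as a two-layer MLP with a GeLU in between, using the 4096 → 768 dimensions listed in Section 6; the intermediate width, module names, and the treatment of c as a sequence of conditioning tokens for cross-attention are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RephrasingAdapter(nn.Module):
    """Sketch: map LLM hidden states of the rephrased sub-sequence to
    conditioning tokens for the diffusion model's cross-attention."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.fc1 = nn.Linear(llm_dim, cond_dim)
        self.fc2 = nn.Linear(cond_dim, cond_dim)

    def forward(self, h_reph: torch.Tensor) -> torch.Tensor:
        # h_reph: (B, L_reph, llm_dim) hidden states of the rephrased segment
        return self.fc2(F.gelu(self.fc1(h_reph)))            # (B, L_reph, cond_dim)
```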
| Application | Module Form | Input(s) | Output |
|---|---|---|---|
| NAT (Shao et al., 2022) | 2-layer Transformer decoder | Reference embeddings, NAT decoder states | Rephrased sequence |
| T2V (RISE-T2V) (Zhang et al., 6 Nov 2025) | 2-layer MLP | LLM next-token hidden states | Conditioning vector |
3. Training and Optimization Strategies
3.1 Reference Rephraser for NAT
Training consists of several stages:
- Pre-training of NAT on conventional cross-entropy loss with the original reference.
- Pre-training of the rephraser by maximum likelihood to reconstruct both the original reference and the NAT output.
- Fine-tuning both models jointly: the NAT continues to minimize cross-entropy against the rephraser’s outputs, while the rephraser maximizes an interpolated reward via REINFORCE. The reward mixes token-level similarity to the reference (sentence-level BLEU) with ease of NAT prediction (the negative NAT loss); the mixing weight is annealed from 0.75 to 0.5 during fine-tuning.
REINFORCE gradients: A variance-reducing baseline is computed using two additional samples.
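A schematic sketch of the interpolated reward and the baselined REINFORCE update is shown below; the function names, the mixing-weight argument `lam`, and the shapes assumed in the comments are illustrative, not taken from the paper.

```python
import torch

def rephraser_reward(bleu_score, nat_loss, lam):
    """Interpolated per-sentence reward (sketch): similarity to the reference
    (sentence-level BLEU) blended with ease of NAT prediction (negative NAT
    loss). `lam` is annealed from 0.75 to 0.5 during joint fine-tuning."""
    return lam * bleu_score + (1.0 - lam) * (-nat_loss)

def reinforce_loss(log_probs, sampled_tokens, reward, baseline):
    """REINFORCE with a variance-reducing baseline, e.g. the mean reward of
    two additional samples drawn from the rephraser for the same input.
    Shapes: log_probs (B, T, V), sampled_tokens (B, T), reward/baseline (B,)."""
    token_logp = log_probs.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)              # (B,) log-prob of each sample
    advantage = (reward - baseline).detach()       # no gradient through the reward
    return -(advantage * seq_logp).mean()
```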
3.2 Rephrasing Adapter for T2V
Training follows a two-stage protocol:
- Stage 1: On large-scale image–text data, the LLM and diffusion backbone are frozen while only the RA and the LoRA adapters in the U-Net’s cross-attention are updated. The loss is the mean squared error between the true DDPM noise and the model’s prediction, conditioned on the rephrased context: $\mathcal{L}_{\text{stage1}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\big]$, where $x_t$ is the noised input and $c$ the RA conditioning.
- Stage 2: For video fine-tuning, the RA is frozen and only the LoRA modules are updated, using the same loss but computed over video frames (a parameter-freezing sketch follows the Stage 1 pseudocode below).
Generic pseudocode for Stage 1 (see Zhang et al., 6 Nov 2025):
```
for each batch of (y_inst, y_ori, x_img) in image–text data:
    # LLM forward; extract rephrased hidden states
    reph_text, h_all = LLM.generate_and_output_hidden_states(concatenate(y_inst, y_ori))
    h_reph = extract_reph_segment(h_all)
    c = W2 · GeLU(W1 · h_reph + b1) + b2

    # DDPM noise and denoising
    t = sample_uniform(1, T)
    epsilon = torch.randn_like(x_img)
    x_t = sqrt(alpha_bar_t) * x_img + sqrt(1 - alpha_bar_t) * epsilon
    epsilon_hat = UNet_LoRA(x_t, t, cond=c)
    loss = (epsilon - epsilon_hat).pow(2).mean()

    # Backpropagate on RA and LoRA params
```
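For Stage 2, a minimal sketch of the corresponding parameter freezing is shown below; identifying LoRA weights by a `"lora"` substring in the parameter name is an assumption about how the adapters are registered, not a detail from the paper.

```python
import torch.nn as nn

def configure_stage2(rephrasing_adapter: nn.Module, video_denoiser: nn.Module):
    """Stage 2 sketch: freeze the rephrasing adapter, train only the LoRA
    modules of the video denoiser."""
    for p in rephrasing_adapter.parameters():
        p.requires_grad_(False)
    trainable = []
    for name, p in video_denoiser.named_parameters():
        is_lora = "lora" in name.lower()          # illustrative naming assumption
        p.requires_grad_(is_lora)
        if is_lora:
            trainable.append(p)
    return trainable  # hand these to the optimizer (AdamW, weight decay 0.01; Section 6)
```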
4. Empirical Impact and Ablation
4.1 Non-Autoregressive Translation
On WMT14 En↔De and WMT16 En↔Ro:
- Vanilla NAT with a rephraser: +4.9 BLEU (En→De: 20.42→25.33), while decoding 15.6× faster than the autoregressive (AT) baseline.
- CMLM + rephraser: +8.4 BLEU (18.21→26.65).
- CTC + rephraser: +1.0 BLEU.
- CMLM with a rephraser outperforms all prior NAT on raw data by 10+ BLEU (Shao et al., 2022).
Token-level entropy and repetition rates dropped significantly, indicating more confident predictions and fewer degenerate repetitions. Architecture ablations showed that alternative rephraser designs or increased depth did not outperform the two-layer decoder form.
4.2 Text-to-Video Generation
On VBench (800 prompts, 8 categories):
- AnimateDiff (simple prompts): aesthetic 6.39, text 31.36, motion 0.983, average rank 4.33.
- AnimateDiff + “off-line” LLM rewrite: aesthetic 6.54, text 32.69, motion 0.981, rank 3.33.
- RISE-AnimateDiff (simple prompts + RA): aesthetic 6.61, text 31.95, motion 0.984, rank 2.00.
RA yields T2V quality competitive with state-of-the-art dense-prompt systems without prompt engineering. User studies (n=300) show RISE-AnimateDiff was preferred for aesthetic (49.0% vs 36.7%), temporal (54.0% vs 33.3%), and text (52.7% vs 50.0%). Training RA on true next-token hidden states (not just encoded prompts) was necessary for video sharpness and faithfulness (Zhang et al., 6 Nov 2025).
5. Broader Applications and Limitations
The core principle underpinning both classes of rephrasing adapters is general: whenever substantial output ambiguity, multi-modal supervision, or under-constrained conditioning exists, an adapter trained to bridge references or prompts to the model’s output/conditioning manifold can reduce the modality gap. Potential extensions encompass abstractive summarization (multiple valid compressions), paraphrase or style transfer, speech generation and recognition (variable surface forms), and code generation (diverse executions with semantic equivalence). In each case, the adapter is trained, commonly via reinforcement learning or differentiable surrogates, to yield rephrasings that are both semantically conformant and loss-reducing relative to the main network.
6. Implementation Details and Integration
Hyperparameters and Design
| Parameter | NAT Rephraser | T2V Rephrasing Adapter |
|---|---|---|
| Layers | 2 (Transformer decoder) | 2 (MLP) |
| Hidden size | same as NAT decoder | 4096 (LLM) → 768 (RA output) |
| Optimizer | — | AdamW, weight decay 0.01 |
| Learning rate | — | Stage 1: — |
| Pretraining steps | 300K (NAT) | 12M image–text pairs |
| Fine-tuning steps | 30K | 1M video–text pairs |
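As a rough illustration, the T2V-side optimizer settings from the table can be wired up as follows; the stand-in modules are hypothetical, and the learning rate is left at the PyTorch default because the Stage 1 value is not recoverable from the table above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the RA and a single LoRA weight of the denoiser.
rephrasing_adapter = nn.Sequential(nn.Linear(4096, 768), nn.GELU(), nn.Linear(768, 768))
lora_parameters = [nn.Parameter(torch.zeros(768, 16))]

# AdamW with weight decay 0.01, per the table; lr intentionally left at default.
trainable = list(rephrasing_adapter.parameters()) + lora_parameters
optimizer = torch.optim.AdamW(trainable, weight_decay=0.01)
```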
Integration Points
- NAT: The rephraser is only active during training; inference is unaffected.
- T2V: RA is injected as conditioning in all cross-attention blocks of the diffusion model; the module is inference-time efficient as it relies on fast LLM embedding passes and a shallow MLP.
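One plausible way to wire the RA output into a cross-attention block of the denoiser is sketched below, with video features as queries and the conditioning sequence c as keys and values; the class, dimensions, and residual layout are assumptions for illustration, and the LoRA adapters are omitted.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch of a diffusion cross-attention block conditioned on the RA output."""

    def __init__(self, feat_dim: int = 320, cond_dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, video_features: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # video_features: (B, N_tokens, feat_dim); c: (B, L_reph, cond_dim)
        attended, _ = self.attn(query=self.norm(video_features), key=c, value=c)
        return video_features + attended   # residual connection
```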
A plausible implication is that such adapters can facilitate plug-and-play augmentation of models in low-latency, high-throughput settings, and can be adopted across architectures and tasks (Zhang et al., 6 Nov 2025).
7. Assessment and Comparison to Related Methods
Rephrasing adapters differ from classical order-agnostic losses (e.g., AXE, OAXE), paraphraser post-processing, or simple prompt engineering. They offer dynamic, learned adaptation of the reference or input context tailored to the model’s current outputs or the characteristics of a user prompt, without requiring multiple “oracle” references or manual intervention. Empirically, they deliver robustness and measurable gains across modalities and data regimes, outperforming prior baselines and even matching autoregressive teacher models in some regimes (Shao et al., 2022; Zhang et al., 6 Nov 2025). Use of adaptive reward blending and architectural simplicity ensures stable optimization and transferability.
In summary, the rephrasing adapter constitutes a flexible and efficient mechanism for bridging semantic gaps in supervised or conditioned sequence generation, applicable to both text and multimodal generative pipelines.