
Reversible Duplex Conformer: Invertible S2ST

Updated 29 January 2026
  • Reversible Duplex Conformer is a fully invertible Conformer encoder–decoder that enables symmetric bidirectional speech-to-speech translation via a single shared stack.
  • It employs reversible residual connections and paired diffusion processes to ensure cycle consistency and efficient parameter sharing across language directions.
  • The architecture demonstrates improved performance and faster inference compared to traditional systems by reducing model duplication and leveraging cyclic mappings.

A Reversible Duplex Conformer (RDC) is a fully invertible Conformer encoder–decoder architecture designed as the core of the Duplex Diffusion Model (DDM), which achieves high-fidelity, bidirectional speech-to-speech translation (S2ST) via parameter sharing and cycle-consistent mappings. RDC enables symmetric translation between language pairs by "flipping" the input and output ends of the same model, as opposed to traditional systems that require directional or duplicated models. RDC is constructed from reversible Conformer layers and supports paired diffusion probabilistic processes for joint modeling of the two languages' acoustics, enabling explicit invertibility and consistent performance across both directions (Wu, 2023).

1. Conceptual Foundation and Bidirectional Modeling

S2ST inherently involves two natural directions (e.g., English→Spanish and Spanish→English). Conventional approaches often employ two separate models or a multitask architecture with asymmetric parameter sharing. By contrast, the Duplex Diffusion Model addresses these inefficiencies through a single RDC stack equipped with bidirectional diffusion modules, enabling joint learning of both directions. The RDC’s invertibility ensures that either direction can be modeled using the same stack, controlled solely by input/output swaps—facilitating truly reversible translation workflows (Wu, 2023).

2. Reversible Duplex Conformer Layer Construction

The RDC is composed of an even number $L$ of layers, each splitting its activations into two halves for the source ($X$) and target ($Y$) languages. Half the layers are "forward" blocks; the other half are "reverse" blocks, arranged symmetrically. Layer construction is as follows:

  • Forward Block: Input $x_{\ell-1} = [x_{\ell-1}^{(1)}; x_{\ell-1}^{(2)}]$, where each $x^{(i)} \in \mathbb{R}^h$. Submodules include:
    • FFN with residual connection (and 30% dropout)
    • MHSA with relative positional embedding
    • CNN block (pointwise conv → GLU → depthwise conv → BatchNorm → Swish → pointwise conv)
  • Reverse Block: Inversion achieved by subtracting residual components in mirror order.

Because the stack is palindromic with respect to forward and reverse modules (FFN↔MHSA↔CNN↔FFN), the mapping $f_\theta$ is involutive: $f_\theta^{-1} = f_\theta$, guaranteeing $f_\theta(f_\theta(x)) = x$ and cycle consistency.

The table below summarizes the core transformation logic:

| Block Type | Main Operations | Inversion Method |
|---|---|---|
| Forward | FFN, MHSA, CNN (with residuals) | N/A (primal direction) |
| Reverse | Residual subtraction in mirror order | Ensures exact recovery |
| Symmetric stack | Palindromic layers: $L/2$ forward + $L/2$ reverse | Allows $f_\theta^{-1} = f_\theta$ |

This design, combined with reversible residual connections, realizes a true bijection between language representations.
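
To make the reversible-residual mechanics concrete, here is a minimal PyTorch sketch of a two-stream coupling block. The small feed-forward nets standing in for the FFN/MHSA/CNN submodules are an assumption for brevity; invertibility comes from the coupling structure itself, not from what the submodules compute.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream reversible residual coupling (RevNet-style).

    F and G stand in for the Conformer submodules (FFN, MHSA, CNN block);
    exact inversion holds regardless of what F and G compute.
    """
    def __init__(self, h: int):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(h, h), nn.SiLU(), nn.Linear(h, h))
        self.G = nn.Sequential(nn.Linear(h, h), nn.SiLU(), nn.Linear(h, h))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)   # update first half from the second
        y2 = x2 + self.G(y1)   # update second half from the new first
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.G(y1)   # subtract residuals in mirror order...
        x1 = y1 - self.F(x2)   # ...to recover the input exactly
        return x1, x2

# Exact-reconstruction check on a toy 4-layer stack.
torch.manual_seed(0)
blocks = nn.ModuleList(ReversibleBlock(64) for _ in range(4))
x1, x2 = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
y1, y2 = x1, x2
for b in blocks:
    y1, y2 = b(y1, y2)
for b in reversed(blocks):
    y1, y2 = b.inverse(y1, y2)
print(torch.allclose(y1, x1, atol=1e-5), torch.allclose(y2, x2, atol=1e-5))
```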

3. Mathematical Reversibility and Cycle Consistency

RDC enforces a pair of cycle-consistent, fully invertible mappings between the latent spaces $X$ (source) and $Y$ (target). Denoting the RDC mapping as $f_\theta: X \leftrightarrow Y$, the requirements are:

  • Involution: $f_\theta = (f_\theta)^{-1}$
  • Cycle consistency:

$$\forall x \in X:\quad f_\theta(f_\theta(x)) = x, \qquad \forall y \in Y:\quad f_\theta(f_\theta(y)) = y$$

  • Layerwise implementation (reversible residuals over the two activation halves):
    • Forward: $z^{(1)} = x^{(1)} + R(x^{(2)})$, with each residual computed from the other half
    • Reverse: $x^{(1)} = z^{(1)} - R(x^{(2)})$, subtracting the same residual to recover the input exactly

All submodules (FFN, MHSA, CNN) are wrapped in reversible residuals, enabling exact inversion at every layer and maintaining the involutive property throughout the architecture (Wu, 2023).
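
The involution $f_\theta = f_\theta^{-1}$ can likewise be illustrated in code. One way to realize it, sketched below reusing `ReversibleBlock` and `blocks` from the previous example, is to conjugate an X/Y swap by the reversible stack; this construction is an assumption about how the palindromic arrangement yields an involution, not necessarily the paper's exact mechanism.

```python
import torch
# Reuses `ReversibleBlock` and `blocks` from the sketch above.

def f_theta(x1, x2):
    """Involution built as g^{-1} ∘ swap ∘ g: run the stack forward, swap
    the X/Y halves, then run the same stack in reverse. Since the swap is
    its own inverse, f_theta(f_theta(.)) is the identity by construction."""
    for b in blocks:
        x1, x2 = b(x1, x2)
    x1, x2 = x2, x1                    # exchange source/target streams
    for b in reversed(blocks):
        x1, x2 = b.inverse(x1, x2)
    return x1, x2

x1, x2 = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
y1, y2 = f_theta(x1, x2)               # map the X-half into the Y role
z1, z2 = f_theta(y1, y2)               # second application undoes the first
print(torch.allclose(z1, x1, atol=1e-4), torch.allclose(z2, x2, atol=1e-4))
```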

4. Duplex Diffusion Coupling and Training Procedure

DDM defines two discrete-time diffusion processes: one per language. For each, the forward (noising) process is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right)$$

with $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, and the marginal $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon$, where $\varepsilon \sim \mathcal{N}(0, I)$.
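
As a sketch, the noising process and its closed-form marginal take only a few lines; the linear $\beta_t$ schedule below is illustrative, not the paper's.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t = prod_i alpha_i

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) via the closed-form marginal."""
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps

x0 = torch.randn(2, 10, 64)                  # clean features for one side
xt = q_sample(x0, t=500, eps=torch.randn_like(x0))
```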

The parameterized denoising step is implemented by RDC with cross-attention to the opposing clean end:

$$p_\theta(x_{t-1} \mid x_t, y_0) = \mathcal{N}\left(\mu_\theta(x_t, t; y_0),\; \Sigma_\theta(t)\, I\right)$$

where $\mu_\theta$ is computed as a function of $x_t$, $t$, $y_0$ and the noise predicted by the RDC.
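
Under the standard DDPM parameterization (an assumption here), $\mu_\theta$ can be computed from the predicted noise as $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\varepsilon}\right)$; continuing the sketch above:

```python
def p_mean(xt, t, eps_hat):
    """Standard DDPM posterior mean from predicted noise (assumed choice
    of mu_theta); the conditioning on y_0 lives inside the noise predictor."""
    return (xt - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
```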

The joint training procedure consists of:

  1. Sampling waveform pairs $(\mathrm{wav}_X, \mathrm{wav}_Y)$ and extracting features via pre-trained wav2vec/HuBERT.
  2. Drawing random $t$ values and independent Gaussian noise for each side.
  3. Computing noised representations $x_t$, $y_t$.
  4. Predicting noise on each side with RDC.
  5. Minimizing a diffusion loss:

$$L_{\mathrm{diff}} = \mathbb{E}_t\left[\|\varepsilon_x - \hat{\varepsilon}_x\|^2 + \|\varepsilon_y - \hat{\varepsilon}_y\|^2\right]$$

Additional losses include length modeling (CTC or MSE), forward–backward agreement (FBA) at each layer, and cycle-consistency distance on the clean stack. These objectives are weighted to yield the final loss function for either unit- or spectrogram-based training.
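
Continuing the sketch, one joint training step might look as follows; the conditioning interface `model(x_t, t, cond=y_0)` is hypothetical, and the auxiliary length, FBA, and cycle-consistency losses are omitted for brevity.

```python
def train_step(model, opt, x0, y0):
    """One duplex diffusion step: noise both sides, predict each side's
    noise with the shared model conditioned on the opposite clean side,
    and apply the symmetric diffusion loss."""
    t = int(torch.randint(0, T, (1,)))       # one timestep per batch, for simplicity
    eps_x, eps_y = torch.randn_like(x0), torch.randn_like(y0)
    xt, yt = q_sample(x0, t, eps_x), q_sample(y0, t, eps_y)
    eps_x_hat = model(xt, t, cond=y0)        # hypothetical interface
    eps_y_hat = model(yt, t, cond=x0)
    loss = ((eps_x - eps_x_hat) ** 2).mean() + ((eps_y - eps_y_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```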

5. Inference and Symmetry of Generation

Inference leverages the inherent symmetry of RDC and DDM. Translation from X to Y proceeds as follows:

  1. Feature extraction: $x_0 \leftarrow E_x(\mathrm{wav}_X)$.
  2. Initialize $y_T \sim \mathcal{N}(0, I)$.
  3. For $t = T$ down to $1$:
    • Predict noise $\hat{\varepsilon}_y \leftarrow f_\theta(y_t; t \mid x_0)$.
    • Compute $y_{t-1}$ using the learned mean and variance.
  4. Synthesize waveform via a vocoder: $\mathrm{wav}_Y \leftarrow \mathrm{Vocoder}(y_0)$.

Reversing direction for Y→X is achieved by swapping encoders and vocoders, as all parameters are shared and the mapping is involutive. This yields a nearly identical computational graph and comparable efficiency in both directions (Wu, 2023).
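
Continuing the sketch, the X→Y loop can be written as straightforward ancestral sampling, with $\Sigma_\theta(t) = \beta_t I$ as a common (assumed) variance choice and the same hypothetical `model` interface as above; swapping the roles of $x$ and $y$ gives the reverse direction with no new parameters.

```python
@torch.no_grad()
def translate_x_to_y(model, x0, y_shape):
    """X→Y generation by DDPM ancestral sampling; y_0 then goes to the vocoder."""
    yt = torch.randn(y_shape)                     # step 2: y_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps_hat = model(yt, t, cond=x0)           # hypothetical interface
        mean = p_mean(yt, t, eps_hat)             # reuses p_mean from above
        noise = torch.randn_like(yt) if t > 0 else torch.zeros_like(yt)
        yt = mean + betas[t].sqrt() * noise       # Sigma_theta(t) = beta_t
    return yt
```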

6. Implementation Strategies and Training Techniques

RDC-based systems use several architectural and procedural optimizations:

  • Symmetric parameter sharing: All Conformer weights are fully shared between directions.
  • Distinct noise schedules: Allows adaptation to language-specific sequence lengths/dynamics.
  • Adaptive upsampling: Convolutional modules ensure cross-attention keys/values are length-matched.
  • Self-supervised pretraining: wav2vec/HuBERT initialization accelerates convergence and reduces data demands.
  • Curriculum-style training: Initial pretraining with only clean-data losses, followed by joint training with diffusion losses, and optional RDC fine-tuning for improved decoding.

Such strategies maintain architectural efficiency and maximize performance benefits of the reversible, cycle-consistent design (Wu, 2023).

7. Comparison to Conventional Conformer-Based S2ST

A standard Conformer-based S2ST system typically involves distinct encoder–decoder pairs per direction, restricted to unidirectional training (X→Y or Y→X), with cross-entropy or MSE losses and no support for invertibility, cycle consistency, or coupled diffusion processes.

In contrast, RDC+DDM:

  1. Shares all parameters across directions.
  2. Guarantees $f_\theta = f_\theta^{-1}$ and explicit cycle consistency.
  3. Models bidirectional fine-grained acoustics via joint diffusion.
  4. Halves total parameter count and training runs.
  5. Achieves 1–3 ASR-BLEU point improvements over baselines (Translatotron2, S2UT, UnitY) and up to 1.7× faster inference by enabling single-pass reversible decoding.

Empirical results on datasets (Fisher Es→En, CVSS-C, multi-domain En↔Es) confirm these advantages (Wu, 2023). A plausible implication is that such fully-shared, symmetry-driven architectures provide a robust foundation for efficient, high-fidelity, and direction-agnostic speech-to-speech translation in multilingual systems.
