Reversible Duplex Conformer: Invertible S2ST
- Reversible Duplex Conformer is a fully invertible Conformer encoder–decoder that enables symmetric bidirectional speech-to-speech translation via a single shared stack.
- It employs reversible residual connections and paired diffusion processes to ensure cycle consistency and efficient parameter sharing across language directions.
- The architecture demonstrates improved performance and faster inference compared to traditional systems by reducing model duplication and leveraging cyclic mappings.
A Reversible Duplex Conformer (RDC) is a fully invertible Conformer encoder–decoder architecture designed as the core of the Duplex Diffusion Model (DDM), which achieves high-fidelity, bidirectional speech-to-speech translation (S2ST) via parameter sharing and cycle-consistent mappings. RDC enables symmetric translation between language pairs by "flipping" the input and output ends of the same model, as opposed to traditional systems that require directional or duplicated models. RDC is constructed from reversible Conformer layers and supports paired diffusion probabilistic processes for joint modeling of two languages’ acoustics, enabling explicit invertibility and consistent performance across both directions (Wu, 2023).
1. Conceptual Foundation and Bidirectional Modeling
S2ST inherently involves two natural directions (e.g., English→Spanish and Spanish→English). Conventional approaches often employ two separate models or a multitask architecture with asymmetric parameter sharing. By contrast, the Duplex Diffusion Model addresses these inefficiencies through a single RDC stack equipped with bidirectional diffusion modules, enabling joint learning of both directions. The RDC’s invertibility ensures that either direction can be modeled using the same stack, controlled solely by input/output swaps—facilitating truly reversible translation workflows (Wu, 2023).
2. Reversible Duplex Conformer Layer Construction
The RDC is composed of even layers, each splitting its activations into two halves for source (X) and target (Y) languages. Half the layers are "forward" blocks; the other half are "reverse" blocks, arranged symmetrically. Layer construction is as follows:
- Forward Block: the input is split into two halves $(x_1, x_2)$. Each Conformer submodule (FFN, MHSA, CNN) is wrapped in a reversible residual coupling: $y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$.
- Reverse Block: inversion is achieved by subtracting the residual components in mirror order: $x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$.
Because the stack is palindromic with respect to forward and reverse modules (FFN↔MHSA↔CNN↔FFN), the mapping $\mathcal{F}$ is involutive: $\mathcal{F}(\mathcal{F}(z)) = z$, guaranteeing $\mathcal{F}^{-1} = \mathcal{F}$ and cycle consistency.
The table below summarizes the core transformation logic:
| Block Type | Main Operations | Inversion Method |
|---|---|---|
| Forward | FFN, MHSA, CNN (with residuals) | N/A (primal direction) |
| Reverse | Residual subtraction in mirror order | Ensures exact recovery |
| Symmetric Stack | Layers palindromic: forward + reverse | Allows $\mathcal{F}^{-1} = \mathcal{F}$ |
This design, combined with reversible residual connections, realizes a true bijection between language representations.
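The reversible residual coupling can be sketched numerically. In the toy block below, the submodule functions are hypothetical stand-ins for the Conformer's FFN/MHSA/CNN modules (simple tanh projections here); invertibility comes from the residual coupling itself, not from the submodules, so the mirror-order subtraction recovers the input exactly up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature dimension

# Hypothetical stand-ins for Conformer submodules (FFN/MHSA/CNN).
# Any deterministic functions work: the coupling is what is invertible.
W_f = rng.normal(size=(d, d))
W_g = rng.normal(size=(d, d))
F = lambda h: np.tanh(h @ W_f)
G = lambda h: np.tanh(h @ W_g)

def forward(x1, x2):
    """Reversible residual coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reverse(y1, y2):
    """Exact inversion by subtracting residuals in mirror order."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
y1, y2 = forward(x1, x2)
r1, r2 = reverse(y1, y2)
print(np.max(np.abs(r1 - x1)), np.max(np.abs(r2 - x2)))  # both ~0
```

Because no activations need to be stored for the backward pass (they can be recomputed by inversion), this construction also reduces training memory, which is the usual motivation for reversible residual networks.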
3. Mathematical Reversibility and Cycle Consistency
RDC enforces a pair of cycle-consistent, fully invertible mappings between the latent spaces $\mathcal{X}$ (source) and $\mathcal{Y}$ (target). Denoting the RDC mapping as $\mathcal{F}$, the requirements are:
- Involution: $\mathcal{F}(\mathcal{F}(z)) = z$, equivalently $\mathcal{F}^{-1} = \mathcal{F}$.
- Cycle consistency: $\mathcal{F}_{Y \to X}(\mathcal{F}_{X \to Y}(x)) = x$ and $\mathcal{F}_{X \to Y}(\mathcal{F}_{Y \to X}(y)) = y$.
- Layerwise implementation:
  - Forward: $y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$
  - Reverse: $x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$
All submodules (FFN, MHSA, CNN) are wrapped in reversible residuals, enabling exact inversion at every layer and maintaining the involutive property throughout the architecture (Wu, 2023).
4. Duplex Diffusion Coupling and Training Procedure
DDM defines two discrete-time diffusion processes: one per language. For each, the forward (noising) process is
$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right),$$
with $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and the marginal $q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t) I\right)$.
The parameterized denoising step is implemented by RDC with cross-attention to the opposing clean end:
$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \sigma_t^2 I\right),$$
where $\mu_\theta$ is computed as a function of $z_t$ and the predicted noise $\epsilon_\theta(z_t, t, c)$ by the RDC, and $c$ denotes the clean representation of the opposite language.
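The noise schedule and closed-form marginal can be illustrated with a short sketch. The linear $\beta$ schedule below is an assumption for illustration (the paper's actual per-language schedules are not specified in this section); the key property is that the signal coefficient $\bar{\alpha}_t$ shrinks monotonically, so $z_T$ approaches pure Gaussian noise:

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_{s<=t} alpha_s

rng = np.random.default_rng(0)
z0 = rng.normal(size=(16,))          # toy clean representation

def q_sample(z0, t, eps):
    """Closed-form marginal: z_t = sqrt(abar_t) z0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.normal(size=z0.shape)
z_T = q_sample(z0, T - 1, eps)

# alpha_bar decays from ~1 toward 0, so z_T is dominated by noise.
print(alpha_bar[0], alpha_bar[-1])
```

The closed-form marginal is what makes training efficient: any timestep can be reached in one step without simulating the chain.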
The joint training procedure consists of:
- Sampling waveform pairs and extracting features via pre-trained wav2vec/HuBERT.
- Drawing random timesteps $t \sim \mathrm{Uniform}\{1, \dots, T\}$ and independent Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ for each side.
- Computing noised representations $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$.
- Predicting the noise $\epsilon_\theta$ on each side with the RDC.
- Minimizing a diffusion loss: $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\left[\|\epsilon^X - \epsilon_\theta^X\|_2^2 + \|\epsilon^Y - \epsilon_\theta^Y\|_2^2\right]$.
Additional losses include length modeling (CTC or MSE), forward–backward agreement (FBA) at each layer, and cycle-consistency distance on the clean stack. These objectives are weighted to yield the final loss function for either unit- or spectrogram-based training.
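One joint training step can be sketched as follows. The linear maps standing in for the RDC noise predictor are hypothetical simplifications (the real model conditions on the timestep and cross-attends to the opposite clean end), but the duplex structure of the loss, noising and scoring both language sides in the same step, is as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 8
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

# Toy stand-ins for the RDC noise predictor, one per side.
W_x = rng.normal(size=(d, d)) * 0.1
W_y = rng.normal(size=(d, d)) * 0.1

def duplex_diffusion_loss(z0_x, z0_y):
    """One joint step: noise both sides independently, predict, sum MSEs."""
    t_x, t_y = rng.integers(T), rng.integers(T)   # independent timesteps
    eps_x = rng.normal(size=z0_x.shape)
    eps_y = rng.normal(size=z0_y.shape)
    zt_x = np.sqrt(alpha_bar[t_x]) * z0_x + np.sqrt(1 - alpha_bar[t_x]) * eps_x
    zt_y = np.sqrt(alpha_bar[t_y]) * z0_y + np.sqrt(1 - alpha_bar[t_y]) * eps_y
    pred_x = zt_x @ W_x                           # epsilon_theta^X
    pred_y = zt_y @ W_y                           # epsilon_theta^Y
    return np.mean((eps_x - pred_x) ** 2) + np.mean((eps_y - pred_y) ** 2)

loss = duplex_diffusion_loss(rng.normal(size=(4, d)), rng.normal(size=(4, d)))
print(loss)
```

In the full objective this diffusion term is summed with the length, forward–backward agreement, and cycle-consistency losses under their respective weights.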
5. Inference and Symmetry of Generation
Inference leverages the inherent symmetry of RDC and DDM. Translation from X to Y proceeds as:
- Feature extraction: $c_X = \mathrm{Encoder}(x)$.
- Initialize $z_T^Y \sim \mathcal{N}(0, I)$.
- For $t = T$ down to $1$:
- Predict noise $\epsilon_\theta(z_t^Y, t, c_X)$.
- Compute $z_{t-1}^Y$ using the learned mean and variance.
- Synthesize waveform via a vocoder: $\hat{y} = \mathrm{Vocoder}(z_0^Y)$.
Reversing direction for Y→X is achieved by swapping encoders and vocoders, as all parameters are shared and the mapping is involutive. This yields a nearly identical computational graph and comparable efficiency in both directions (Wu, 2023).
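The sampling loop above follows standard DDPM ancestral sampling. The sketch below assumes a toy noise predictor (a hypothetical linear stand-in for the RDC) and the common choice $\sigma_t^2 = \beta_t$; neither is claimed to match the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 8
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_theta(z_t, t, c):
    """Hypothetical stand-in for the RDC noise predictor."""
    return 0.1 * z_t + 0.01 * c

def ddpm_sample(c, shape):
    """DDPM ancestral sampling conditioned on clean source features c."""
    z = rng.normal(size=shape)                        # z_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_theta(z, t, c)
        # Posterior mean: (z_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
        mean = (z - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise          # sigma_t^2 = beta_t
    return z

c_x = rng.normal(size=(4, d))        # "encoder" features of the source utterance
z0_y = ddpm_sample(c_x, (4, d))      # clean target-side representation
print(z0_y.shape)
```

Swapping which side supplies `c` and which side is sampled reverses the translation direction with the same loop and the same weights.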
6. Implementation Strategies and Training Techniques
RDC-based systems use several architectural and procedural optimizations:
- Symmetric parameter sharing: All Conformer weights are fully shared between directions.
- Distinct noise schedules: Allows adaptation to language-specific sequence lengths/dynamics.
- Adaptive upsampling: Convolutional modules ensure cross-attention keys/values are length-matched.
- Self-supervised pretraining: wav2vec/HuBERT initialization accelerates convergence and reduces data demands.
- Curriculum-style training: Initial pretraining with only clean-data losses, followed by joint training with diffusion losses, and optional RDC fine-tuning for improved decoding.
Such strategies maintain architectural efficiency and maximize performance benefits of the reversible, cycle-consistent design (Wu, 2023).
7. Comparison to Conventional Conformer-Based S2ST
A standard Conformer-based S2ST typically involves distinct encoder–decoder pairs per direction, restricted to unidirectional training (X→Y or Y→X), with cross-entropy or MSE losses and no support for invertibility, cycle consistency, or coupled diffusion processes.
In contrast, RDC+DDM:
- Shares all parameters across directions.
- Guarantees exact invertibility ($\mathcal{F}(\mathcal{F}(z)) = z$) and explicit cycle consistency.
- Models bidirectional fine-grained acoustics via joint diffusion.
- Halves total parameter count and training runs.
- Achieves 1–3 ASR-BLEU point improvements over baselines (Translatotron2, S2UT, UnitY) and up to 1.7× faster inference by enabling single-pass reversible decoding.
Empirical results on datasets (Fisher Es→En, CVSS-C, multi-domain En↔Es) confirm these advantages (Wu, 2023). A plausible implication is that such fully-shared, symmetry-driven architectures provide a robust foundation for efficient, high-fidelity, and direction-agnostic speech-to-speech translation in multilingual systems.