Reversible Duplex Conformer: Invertible S2ST
- Reversible Duplex Conformer is a fully invertible Conformer encoder–decoder that enables symmetric bidirectional speech-to-speech translation via a single shared stack.
- It employs reversible residual connections and paired diffusion processes to ensure cycle consistency and efficient parameter sharing across language directions.
- The architecture demonstrates improved performance and faster inference compared to traditional systems by reducing model duplication and leveraging cyclic mappings.
A Reversible Duplex Conformer (RDC) is a fully invertible Conformer encoder–decoder architecture designed as the core of the Duplex Diffusion Model (DDM), which achieves high-fidelity, bidirectional speech-to-speech translation (S2ST) via parameter sharing and cycle-consistent mappings. RDC enables symmetric translation between language pairs by "flipping" the input and output ends of the same model, as opposed to traditional systems that require directional or duplicated models. RDC is constructed from reversible Conformer layers and supports paired diffusion probabilistic processes for joint modeling of two languages’ acoustics, enabling explicit invertibility and consistent performance across both directions (Wu, 2023).
1. Conceptual Foundation and Bidirectional Modeling
S2ST inherently involves two natural directions (e.g., English→Spanish and Spanish→English). Conventional approaches often employ two separate models or a multitask architecture with asymmetric parameter sharing. By contrast, the Duplex Diffusion Model addresses these inefficiencies through a single RDC stack equipped with bidirectional diffusion modules, enabling joint learning of both directions. The RDC’s invertibility ensures that either direction can be modeled using the same stack, controlled solely by input/output swaps—facilitating truly reversible translation workflows (Wu, 2023).
2. Reversible Duplex Conformer Layer Construction
The RDC is composed of even layers, each splitting its activations into two halves for source (X) and target (Y) languages. Half the layers are "forward" blocks; the other half are "reverse" blocks, arranged symmetrically. Layer construction is as follows:
- Forward Block: the input is split into two halves $(x_1, x_2)$. Each Conformer submodule (FFN, MHSA, CNN) is wrapped in a reversible residual coupling: $y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$.
- Reverse Block: inversion is achieved by subtracting the residual components in mirror order: $x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$.
Because the stack is palindromic with respect to forward and reverse modules (FFN↔MHSA↔CNN↔FFN), the mapping $\mathcal{F}$ is involutive: $\mathcal{F}(\mathcal{F}(z)) = z$, guaranteeing $\mathcal{F}^{-1} = \mathcal{F}$ and cycle consistency.
The table below summarizes the core transformation logic:
| Block Type | Main Operations | Inversion Method |
|---|---|---|
| Forward | FFN, MHSA, CNN (with residuals) | N/A (primal direction) |
| Reverse | Residual subtraction in mirror order | Ensures exact recovery |
| Symmetric Stack | Layers palindromic: forward + reverse | Allows $\mathcal{F}^{-1} = \mathcal{F}$ |
This design, combined with reversible residual connections, realizes a true bijection between language representations.
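The reversible residual coupling can be sketched numerically. In the toy block below, the submodule functions are hypothetical stand-ins for the Conformer's FFN/MHSA/CNN modules (simple tanh projections here); invertibility comes from the residual coupling itself, not from the submodules, so the mirror-order subtraction recovers the input exactly up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy feature dimension

# Hypothetical stand-ins for Conformer submodules (FFN/MHSA/CNN).
# Any deterministic functions work: the coupling is what is invertible.
W_f = rng.normal(size=(d, d))
W_g = rng.normal(size=(d, d))
F = lambda h: np.tanh(h @ W_f)
G = lambda h: np.tanh(h @ W_g)

def forward(x1, x2):
    """Reversible residual coupling: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reverse(y1, y2):
    """Exact inversion by subtracting residuals in mirror order."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
y1, y2 = forward(x1, x2)
r1, r2 = reverse(y1, y2)
print(np.max(np.abs(r1 - x1)), np.max(np.abs(r2 - x2)))  # both ~0
```

Because no activations need to be stored for the backward pass (they can be recomputed by inversion), this construction also reduces training memory, which is the usual motivation for reversible residual networks.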
3. Mathematical Reversibility and Cycle Consistency
RDC enforces a pair of cycle-consistent, fully invertible mappings between the latent spaces $\mathcal{X}$ (source) and $\mathcal{Y}$ (target). Denoting the RDC mapping as $\mathcal{F}$, the requirements are:
- Involution: $\mathcal{F}(\mathcal{F}(z)) = z$, equivalently $\mathcal{F}^{-1} = \mathcal{F}$.
- Cycle consistency: $\mathcal{F}_{Y \to X}(\mathcal{F}_{X \to Y}(x)) = x$ and $\mathcal{F}_{X \to Y}(\mathcal{F}_{Y \to X}(y)) = y$.
- Layerwise implementation:
  - Forward: $y_1 = x_1 + F(x_2)$, $y_2 = x_2 + G(y_1)$
  - Reverse: $x_2 = y_2 - G(y_1)$, $x_1 = y_1 - F(x_2)$
All submodules (FFN, MHSA, CNN) are wrapped in reversible residuals, enabling exact inversion at every layer and maintaining the involutive property throughout the architecture (Wu, 2023).
4. Duplex Diffusion Coupling and Training Procedure
DDM defines two discrete-time diffusion processes: one per language. For each, the forward (noising) process is
$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right),$$
with $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and the marginal $q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t) I\right)$.
The parameterized denoising step is implemented by RDC with cross-attention to the opposing clean end:
$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t, c),\ \sigma_t^2 I\right),$$
where $\mu_\theta$ is computed as a function of $z_t$ and the predicted noise $\epsilon_\theta(z_t, t, c)$ by the RDC, and $c$ denotes the clean representation of the opposite language.
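The noise schedule and closed-form marginal can be illustrated with a short sketch. The linear $\beta$ schedule below is an assumption for illustration (the paper's actual per-language schedules are not specified in this section); the key property is that the signal coefficient $\bar{\alpha}_t$ shrinks monotonically, so $z_T$ approaches pure Gaussian noise:

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_{s<=t} alpha_s

rng = np.random.default_rng(0)
z0 = rng.normal(size=(16,))          # toy clean representation

def q_sample(z0, t, eps):
    """Closed-form marginal: z_t = sqrt(abar_t) z0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

eps = rng.normal(size=z0.shape)
z_T = q_sample(z0, T - 1, eps)

# alpha_bar decays from ~1 toward 0, so z_T is dominated by noise.
print(alpha_bar[0], alpha_bar[-1])
```

The closed-form marginal is what makes training efficient: any timestep can be reached in one step without simulating the chain.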
The joint training procedure consists of:
- Sampling waveform pairs and extracting features via pre-trained wav2vec/HuBERT.
- Drawing random timesteps $t \sim \mathrm{Uniform}\{1, \dots, T\}$ and independent Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ for each side.
- Computing noised representations $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$.
- Predicting the noise $\epsilon_\theta$ on each side with the RDC.
- Minimizing a diffusion loss: $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}\left[\|\epsilon^X - \epsilon_\theta^X\|_2^2 + \|\epsilon^Y - \epsilon_\theta^Y\|_2^2\right]$.
Additional losses include length modeling (CTC or MSE), forward–backward agreement (FBA) at each layer, and cycle-consistency distance on the clean stack. These objectives are weighted to yield the final loss function for either unit- or spectrogram-based training.
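One joint training step can be sketched as follows. The linear maps standing in for the RDC noise predictor are hypothetical simplifications (the real model conditions on the timestep and cross-attends to the opposite clean end), but the duplex structure of the loss, noising and scoring both language sides in the same step, is as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 8
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

# Toy stand-ins for the RDC noise predictor, one per side.
W_x = rng.normal(size=(d, d)) * 0.1
W_y = rng.normal(size=(d, d)) * 0.1

def duplex_diffusion_loss(z0_x, z0_y):
    """One joint step: noise both sides independently, predict, sum MSEs."""
    t_x, t_y = rng.integers(T), rng.integers(T)   # independent timesteps
    eps_x = rng.normal(size=z0_x.shape)
    eps_y = rng.normal(size=z0_y.shape)
    zt_x = np.sqrt(alpha_bar[t_x]) * z0_x + np.sqrt(1 - alpha_bar[t_x]) * eps_x
    zt_y = np.sqrt(alpha_bar[t_y]) * z0_y + np.sqrt(1 - alpha_bar[t_y]) * eps_y
    pred_x = zt_x @ W_x                           # epsilon_theta^X
    pred_y = zt_y @ W_y                           # epsilon_theta^Y
    return np.mean((eps_x - pred_x) ** 2) + np.mean((eps_y - pred_y) ** 2)

loss = duplex_diffusion_loss(rng.normal(size=(4, d)), rng.normal(size=(4, d)))
print(loss)
```

In the full objective this diffusion term is summed with the length, forward–backward agreement, and cycle-consistency losses under their respective weights.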
5. Inference and Symmetry of Generation
Inference leverages the inherent symmetry of RDC and DDM. Translation from X to Y proceeds as:
- Feature extraction: $c_X = \mathrm{Encoder}(x)$.
- Initialize $z_T^Y \sim \mathcal{N}(0, I)$.
- For $t = T$ down to $1$:
- Predict noise $\epsilon_\theta(z_t^Y, t, c_X)$.
- Compute $z_{t-1}^Y$ using the learned mean and variance.
- Synthesize waveform via a vocoder: $\hat{y} = \mathrm{Vocoder}(z_0^Y)$.
Reversing direction for Y→X is achieved by swapping encoders and vocoders, as all parameters are shared and the mapping is involutive. This yields a nearly identical computational graph and comparable efficiency in both directions (Wu, 2023).
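The sampling loop above follows standard DDPM ancestral sampling. The sketch below assumes a toy noise predictor (a hypothetical linear stand-in for the RDC) and the common choice $\sigma_t^2 = \beta_t$; neither is claimed to match the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 8
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule (illustrative)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_theta(z_t, t, c):
    """Hypothetical stand-in for the RDC noise predictor."""
    return 0.1 * z_t + 0.01 * c

def ddpm_sample(c, shape):
    """DDPM ancestral sampling conditioned on clean source features c."""
    z = rng.normal(size=shape)                        # z_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_theta(z, t, c)
        # Posterior mean: (z_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
        mean = (z - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise          # sigma_t^2 = beta_t
    return z

c_x = rng.normal(size=(4, d))        # "encoder" features of the source utterance
z0_y = ddpm_sample(c_x, (4, d))      # clean target-side representation
print(z0_y.shape)
```

Swapping which side supplies `c` and which side is sampled reverses the translation direction with the same loop and the same weights.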
6. Implementation Strategies and Training Techniques
RDC-based systems use several architectural and procedural optimizations:
- Symmetric parameter sharing: All Conformer weights are fully shared between directions.
- Distinct noise schedules: Allows adaptation to language-specific sequence lengths/dynamics.
- Adaptive upsampling: Convolutional modules ensure cross-attention keys/values are length-matched.
- Self-supervised pretraining: wav2vec/HuBERT initialization accelerates convergence and reduces data demands.
- Curriculum-style training: Initial pretraining with only clean-data losses, followed by joint training with diffusion losses, and optional RDC fine-tuning for improved decoding.
Such strategies maintain architectural efficiency and maximize performance benefits of the reversible, cycle-consistent design (Wu, 2023).
7. Comparison to Conventional Conformer-Based S2ST
A standard Conformer-based S2ST typically involves distinct encoder–decoder pairs per direction, restricted to unidirectional training (X→Y or Y→X), with cross-entropy or MSE losses and no support for invertibility, cycle consistency, or coupled diffusion processes.
In contrast, RDC+DDM:
- Shares all parameters across directions.
- Guarantees exact invertibility ($\mathcal{F}(\mathcal{F}(z)) = z$) and explicit cycle consistency.
- Models bidirectional fine-grained acoustics via joint diffusion.
- Halves total parameter count and training runs.
- Achieves 1–3 ASR-BLEU point improvements over baselines (Translatotron2, S2UT, UnitY) and up to 1.7× faster inference by enabling single-pass reversible decoding.
Empirical results on datasets (Fisher Es→En, CVSS-C, multi-domain En↔Es) confirm these advantages (Wu, 2023). A plausible implication is that such fully-shared, symmetry-driven architectures provide a robust foundation for efficient, high-fidelity, and direction-agnostic speech-to-speech translation in multilingual systems.