RevFFN: Invertible FFN for S2ST
- RevFFN is a reversible feed-forward network module that enables exact invertibility in duplex Conformer architectures for bidirectional speech-to-speech translation.
- It employs an additive residual structure across FFN, multi-head self-attention, and CNN submodules to ensure precise cycle consistency and efficient parameter sharing.
- Integration with duplex diffusion objectives improves ASR-BLEU scores by 1-2 points and reduces decoding time by approximately 1.7× compared to separate unidirectional models.
RevFFN (Reversible Feed-Forward Network) is a core architectural component within the Reversible Duplex Conformer design, which underpins state-of-the-art bidirectional speech-to-speech translation (S2ST) systems. It enables exact invertibility within the backbone of a single, unified network, allowing forward and reverse mappings between speech domains with strict cycle consistency. This invertible construction both improves computational efficiency—by obviating the need for two separate unidirectional models—and enforces strong regularization via cycle-consistent training, when combined with duplex diffusion objectives (Wu, 2023).
1. Role of RevFFN in the Reversible Duplex Conformer
Within the reversible duplex Conformer, each layer is designed to be strictly invertible, with parameters fully shared between forward (source → target) and reverse (target → source) computations. At the heart of each such layer is a specialized set of submodules, including two Macaron-style feed-forward networks (Editor’s term: “RevFFN”), each applied with a 0.5 residual scaling. This substructure, along with parallel Multi-Head Self-Attention (MHSA) and convolutional operations, is responsible for both the expressivity and the bijective property of the entire layer stack.
2. Architectural Details and Bijective Mechanism
Each reversible Conformer layer operates on an input tensor $x$, partitioned along the feature axis into two halves $x_1$ and $x_2$. The forward “building block” $B$ computation is defined as follows (applied in sequence, always with Pre-LN):

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1),$$

where $F$ and $G$ denote the composed submodule paths (Macaron FFN, MHSA, and convolution). The corresponding reverse block $B^{-1}$, used for the target → source path, symmetrically subtracts each residual in exact reverse order:

$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2).$$
The invertibility derives from the additive residual structure: Feed-Forward, MHSA, and CNN submodules are applied only in such a way that their contributions can be exactly “undone.” As a result, no intermediate activations require storage, giving computational advantages for deep models, following principles introduced by invertible neural networks (cf. Gomez et al., 2017 as referenced in the primary source) (Wu, 2023).
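The additive coupling described above can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation: `f` and `g` are placeholder stand-ins for the real FFN/MHSA/CNN submodules, and shapes are arbitrary.

```python
import numpy as np

# Minimal sketch of an additive reversible block (RevNet-style coupling).
# f and g are hypothetical placeholders for the Conformer submodules.

def f(x):
    return np.tanh(x) * 0.5   # 0.5 residual scaling, as in Macaron FFN

def g(x):
    return np.tanh(2.0 * x) * 0.5

def forward(x1, x2):
    """y = B(x): each residual is *added*, so it can be undone exactly."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2):
    """x = B^{-1}(y): subtract the same residuals in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 4, 8))
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True: exact reconstruction
```

Because the inverse reuses `f` and `g` themselves (not learned inverses), no intermediate activations need to be cached: each block's input is recomputed from its output on the backward pass.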
3. Invertibility, Cycle Consistency, and Reversible Computation
Denoting the full forward mapping over an $L$-layer stack as $\mathcal{M}_{s \to t} = B_L \circ \cdots \circ B_1$ and the reverse as $\mathcal{M}_{t \to s} = B_1^{-1} \circ \cdots \circ B_L^{-1}$, the global mappings are mutually inverse. These satisfy $\mathcal{M}_{t \to s} = \mathcal{M}_{s \to t}^{-1}$ and vice versa, so that $\mathcal{M}_{t \to s}(\mathcal{M}_{s \to t}(x)) = x$ and $\mathcal{M}_{s \to t}(\mathcal{M}_{t \to s}(y)) = y$. This strict equivalence provides exact cycle consistency. The reversible construction ensures that all information preserved by the feed-forward, attention, and convolutional operations is theoretically recoverable in both translation directions.
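The stack-level cycle consistency can be checked numerically. The sketch below is illustrative only: submodules are random `tanh`-of-linear maps rather than real Conformer layers, but the composition/inversion order is the one the text describes.

```python
import numpy as np

# Sketch: stacking L reversible blocks gives a globally invertible mapping
# M_fwd = B_L ∘ … ∘ B_1 with inverse M_rev = B_1^{-1} ∘ … ∘ B_L^{-1}.

rng = np.random.default_rng(1)
L, d = 4, 6
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(2 * L)]  # shared params

def block_fwd(x1, x2, wf, wg):
    y1 = x1 + np.tanh(x2 @ wf)
    return y1, x2 + np.tanh(y1 @ wg)

def block_inv(y1, y2, wf, wg):
    x2 = y2 - np.tanh(y1 @ wg)
    return y1 - np.tanh(x2 @ wf), x2

def M_fwd(x1, x2):
    for i in range(L):                       # B_1 first, B_L last
        x1, x2 = block_fwd(x1, x2, weights[2 * i], weights[2 * i + 1])
    return x1, x2

def M_rev(y1, y2):
    for i in reversed(range(L)):             # undo B_L first, B_1 last
        y1, y2 = block_inv(y1, y2, weights[2 * i], weights[2 * i + 1])
    return y1, y2

x1, x2 = rng.standard_normal((2, 3, d))
r1, r2 = M_rev(*M_fwd(x1, x2))
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True: M_rev ∘ M_fwd = id
```

Note that both directions read the same `weights` list; only the traversal order and the add/subtract convention differ, which is exactly what permits full parameter sharing.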
4. Integration with Duplex Diffusion Objectives
The RevFFN-empowered reversible backbone is combined with independent diffusion processes on the source and target sides:
- A source-side diffusion process operates on the source speech embeddings.
- A target-side diffusion process operates on the target speech embeddings.
Training alternates or sums diffusion losses for both directions: the model predicts added noise conditioned on the opposite-side embeddings, with mean-squared-error (MSE) losses for both reconstruction objectives. This duplex approach both regularizes the translation mapping and produces high-fidelity speech output for both directions in S2ST (Wu, 2023).
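A minimal sketch of this duplex training step follows, assuming a standard DDPM-style forward noising process. `denoise` here is a hypothetical placeholder for the reversible Conformer running in one direction; the loss structure (two noise-prediction MSE terms, each conditioned on the opposite side) is the part being illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, alpha_bar_t, eps):
    # DDPM-style forward process: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoise(x_t, cond):
    # hypothetical noise predictor conditioned on the opposite side's embeddings
    return 0.5 * (x_t - cond.mean())

def duplex_loss(src, tgt, ab_src, ab_tgt):
    eps_s = rng.standard_normal(src.shape)
    eps_t = rng.standard_normal(tgt.shape)
    # each side's predictor sees the *other* side as conditioning
    loss_s = np.mean((denoise(add_noise(src, ab_src, eps_s), tgt) - eps_s) ** 2)
    loss_t = np.mean((denoise(add_noise(tgt, ab_tgt, eps_t), src) - eps_t) ** 2)
    return loss_s + loss_t  # summed (or alternated) over both directions

src = rng.standard_normal((10, 16))   # illustrative source embeddings
tgt = rng.standard_normal((12, 16))   # illustrative target embeddings
total = duplex_loss(src, tgt, ab_src=0.7, ab_tgt=0.5)
print(total > 0)  # True
```

The two `alpha_bar` values come from separate schedules, matching the independent source/target noise schedules described below.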
5. Training and Inference Workflow
The RevFFN structure enables unified training and inference. All Conformer parameters, including RevFFN weights, are shared; only the direction of computation differs (by port flipping). During inference, encoding is performed via pretrained self-supervised models (e.g., wav2vec 2.0 or HuBERT), and the network alternately runs source → target or target → source translation by passing data through the corresponding reversible path. Decoding uses k-means and a DiffWave vocoder. Upsampling by a small convolutional network is applied for sequence alignment where necessary. Two independent noise schedules are maintained for source and target sides to further stabilize the bidirectional learning process (Wu, 2023).
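The convolutional upsampling step can be sketched as follows. This is a generic illustration, not the paper's network: the repeat-then-smooth kernel, upsampling factor, and function name are all assumptions.

```python
import numpy as np

def conv1d_upsample(x, factor=2, kernel=None):
    """x: (T, d) embedding sequence -> (T*factor, d), via repeat + smoothing conv.

    Illustrative stand-in for a small learned CNN that aligns sequence
    lengths before cross-attention.
    """
    if kernel is None:
        kernel = np.array([0.25, 0.5, 0.25])      # simple fixed smoothing kernel
    up = np.repeat(x, factor, axis=0)             # nearest-neighbour upsample in time
    pad = len(kernel) // 2
    padded = np.pad(up, ((pad, pad), (0, 0)), mode="edge")
    # smooth each feature channel independently
    return np.stack([np.convolve(padded[:, c], kernel, mode="valid")
                     for c in range(x.shape[1])], axis=1)

seq = np.random.default_rng(0).standard_normal((7, 4))
out = conv1d_upsample(seq)
print(out.shape)  # (14, 4)
```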
6. Empirical Performance and Comparative Analysis
Compared to standard Conformer-based S2ST architectures—which require independent models or multitask training with no weight sharing—the reversible duplex Conformer (with RevFFN) achieves:
- A single model for both translation directions, with no increase in parameter count.
- Strict layer-wise invertibility and exact cycle consistency.
- Significant improvements in ASR-BLEU (approximately 1–2 BLEU over the UnitY baseline).
- Approximately 1.7× lower decoding time relative to two independent models.
A summary table situates the reversible duplex Conformer in comparison to the standard Conformer-based S2ST architecture (Wu, 2023):
| Model Type | Number of Models | Directionality | Cycle Consistency |
|---|---|---|---|
| Vanilla Conformer S2ST | 2 | Unidirectional | No |
| Reversible Duplex Conformer + Diffusion | 1 | Bidirectional | Yes (exact) |
7. Architectural Considerations and Implementation Practices
Several stabilization techniques are applied within the RevFFN paradigm:
- Pre-LN is used at each submodule for training stability.
- Inputs are upsampled by a small CNN to ensure cross-attention length compatibility.
- The architecture leverages pretrained self-supervised encoders for initialization.
- Reversible block symmetry eliminates the need to store intermediate activations, removing the associated memory cost.
- Independent noise schedules allow for flexibility in bidirectional diffusion optimization (Wu, 2023).
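Maintaining independent per-direction schedules can be as simple as instantiating two schedule objects with their own hyperparameters. The sketch below assumes linear DDPM-style beta schedules; the horizons and beta ranges shown are illustrative, not the paper's settings.

```python
import numpy as np

def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return per-step betas and cumulative alpha-bar values for T steps."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

# Source and target sides each get their own, independently tunable schedule.
betas_src, ab_src = linear_schedule(T=200)
betas_tgt, ab_tgt = linear_schedule(T=1000, beta_end=0.05)

print(len(ab_src), len(ab_tgt))  # 200 1000
```

Decoupling the two schedules lets each direction trade off noise-level coverage against the number of denoising steps independently, which is the flexibility the bullet above refers to.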
This integrated, invertible feed-forward design is critical to the efficiency and high performance demonstrated by duplex diffusion models in bidirectional speech-to-speech translation.