RevFFN: Invertible FFN for S2ST
- RevFFN is a reversible feed-forward network module that enables exact invertibility in duplex Conformer architectures for bidirectional speech-to-speech translation.
- It employs an additive residual structure across FFN, multi-head self-attention, and CNN submodules to ensure precise cycle consistency and efficient parameter sharing.
- Integration with duplex diffusion objectives improves ASR-BLEU scores by 1-2 points and reduces decoding time by approximately 1.7× compared to separate unidirectional models.
RevFFN (Reversible Feed-Forward Network) is a core architectural component within the Reversible Duplex Conformer design, which underpins state-of-the-art bidirectional speech-to-speech translation (S2ST) systems. It enables exact invertibility within the backbone of a single, unified network, allowing forward and reverse mappings between speech domains with strict cycle consistency. This invertible construction both improves computational efficiency—by obviating the need for two separate unidirectional models—and enforces strong regularization via cycle-consistent training, when combined with duplex diffusion objectives (Wu, 2023).
1. Role of RevFFN in the Reversible Duplex Conformer
Within the reversible duplex Conformer, each layer is designed to be strictly invertible, with parameters fully shared between forward (source → target) and reverse (target → source) computations. At the heart of each such layer is a specialized set of submodules, including two Macaron-style feed-forward networks (Editor’s term: “RevFFN”), each applied with a 0.5 residual scaling. This substructure, along with parallel Multi-Head Self-Attention (MHSA) and convolutional operations, is responsible for both the expressivity and the bijective property of the entire layer stack.
2. Architectural Details and Bijective Mechanism
Each reversible Conformer layer operates on an input tensor $x$, partitioned along the feature axis into two halves $x_1$ and $x_2$. The forward “building block” $B$ computation is defined as follows (applied in sequence, always with Pre-LN):

$$y_1 = x_1 + F(x_2), \qquad y_2 = x_2 + G(y_1),$$

where $F$ and $G$ denote the composed submodule paths (Macaron FFN, MHSA, and convolution). The corresponding reverse block $B^{-1}$, used for the target → source path, symmetrically subtracts each residual in exact reverse order:

$$x_2 = y_2 - G(y_1), \qquad x_1 = y_1 - F(x_2).$$
The invertibility derives from the additive residual structure: Feed-Forward, MHSA, and CNN submodules are applied only in such a way that their contributions can be exactly “undone.” As a result, no intermediate activations require storage, giving computational advantages for deep models, following principles introduced by invertible neural networks (cf. Gomez et al., 2017 as referenced in the primary source) (Wu, 2023).
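The additive coupling described above can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation: `f` and `g` are placeholder stand-ins for the real FFN/MHSA/CNN submodules, and shapes are arbitrary.

```python
import numpy as np

# Minimal sketch of an additive reversible block (RevNet-style coupling).
# f and g are hypothetical placeholders for the Conformer submodules.

def f(x):
    return np.tanh(x) * 0.5   # 0.5 residual scaling, as in Macaron FFN

def g(x):
    return np.tanh(2.0 * x) * 0.5

def forward(x1, x2):
    """y = B(x): each residual is *added*, so it can be undone exactly."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def inverse(y1, y2):
    """x = B^{-1}(y): subtract the same residuals in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((2, 4, 8))
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True: exact reconstruction
```

Because the inverse reuses `f` and `g` themselves (not learned inverses), no intermediate activations need to be cached: each block's input is recomputed from its output on the backward pass.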
3. Invertibility, Cycle Consistency, and Reversible Computation
Denoting the full forward mapping over an $L$-layer stack as $\mathcal{M}_{s \to t} = B_L \circ \cdots \circ B_1$ and the reverse as $\mathcal{M}_{t \to s} = B_1^{-1} \circ \cdots \circ B_L^{-1}$, the global mappings are mutually inverse. These satisfy $\mathcal{M}_{t \to s} = \mathcal{M}_{s \to t}^{-1}$ and vice versa, so that $\mathcal{M}_{t \to s}(\mathcal{M}_{s \to t}(x)) = x$ and $\mathcal{M}_{s \to t}(\mathcal{M}_{t \to s}(y)) = y$. This strict equivalence provides exact cycle consistency. The reversible construction ensures that all information preserved by the feed-forward, attention, and convolutional operations is theoretically recoverable in both translation directions.
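The stack-level cycle consistency can be checked numerically. The sketch below is illustrative only: submodules are random `tanh`-of-linear maps rather than real Conformer layers, but the composition/inversion order is the one the text describes.

```python
import numpy as np

# Sketch: stacking L reversible blocks gives a globally invertible mapping
# M_fwd = B_L ∘ … ∘ B_1 with inverse M_rev = B_1^{-1} ∘ … ∘ B_L^{-1}.

rng = np.random.default_rng(1)
L, d = 4, 6
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(2 * L)]  # shared params

def block_fwd(x1, x2, wf, wg):
    y1 = x1 + np.tanh(x2 @ wf)
    return y1, x2 + np.tanh(y1 @ wg)

def block_inv(y1, y2, wf, wg):
    x2 = y2 - np.tanh(y1 @ wg)
    return y1 - np.tanh(x2 @ wf), x2

def M_fwd(x1, x2):
    for i in range(L):                       # B_1 first, B_L last
        x1, x2 = block_fwd(x1, x2, weights[2 * i], weights[2 * i + 1])
    return x1, x2

def M_rev(y1, y2):
    for i in reversed(range(L)):             # undo B_L first, B_1 last
        y1, y2 = block_inv(y1, y2, weights[2 * i], weights[2 * i + 1])
    return y1, y2

x1, x2 = rng.standard_normal((2, 3, d))
r1, r2 = M_rev(*M_fwd(x1, x2))
print(np.allclose(r1, x1) and np.allclose(r2, x2))  # True: M_rev ∘ M_fwd = id
```

Note that both directions read the same `weights` list; only the traversal order and the add/subtract convention differ, which is exactly what permits full parameter sharing.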
4. Integration with Duplex Diffusion Objectives
The RevFFN-empowered reversible backbone is combined with independent diffusion processes on the source and target sides:
- A source-side diffusion process operates on the source speech embeddings.
- A target-side diffusion process operates on the target speech embeddings.
Training alternates or sums diffusion losses for both directions: the model predicts added noise conditioned on the opposite-side embeddings, with mean-squared-error (MSE) losses for both reconstruction objectives. This duplex approach both regularizes the translation mapping and produces high-fidelity speech output for both directions in S2ST (Wu, 2023).
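A minimal sketch of this duplex training step follows, assuming a standard DDPM-style forward noising process. `denoise` here is a hypothetical placeholder for the reversible Conformer running in one direction; the loss structure (two noise-prediction MSE terms, each conditioned on the opposite side) is the part being illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, alpha_bar_t, eps):
    # DDPM-style forward process: x_t = sqrt(ab)*x0 + sqrt(1-ab)*eps
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoise(x_t, cond):
    # hypothetical noise predictor conditioned on the opposite side's embeddings
    return 0.5 * (x_t - cond.mean())

def duplex_loss(src, tgt, ab_src, ab_tgt):
    eps_s = rng.standard_normal(src.shape)
    eps_t = rng.standard_normal(tgt.shape)
    # each side's predictor sees the *other* side as conditioning
    loss_s = np.mean((denoise(add_noise(src, ab_src, eps_s), tgt) - eps_s) ** 2)
    loss_t = np.mean((denoise(add_noise(tgt, ab_tgt, eps_t), src) - eps_t) ** 2)
    return loss_s + loss_t  # summed (or alternated) over both directions

src = rng.standard_normal((10, 16))   # illustrative source embeddings
tgt = rng.standard_normal((12, 16))   # illustrative target embeddings
total = duplex_loss(src, tgt, ab_src=0.7, ab_tgt=0.5)
print(total > 0)  # True
```

The two `alpha_bar` values come from separate schedules, matching the independent source/target noise schedules described below.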
5. Training and Inference Workflow
The RevFFN structure enables unified training and inference. All Conformer parameters, including RevFFN weights, are shared; only the direction of computation differs (by port flipping). During inference, encoding is performed via pretrained self-supervised models (e.g., wav2vec 2.0 or HuBERT), and the network alternately runs source → target or target → source translation by passing data through the corresponding reversible path. Decoding uses k-means and a DiffWave vocoder. Upsampling by a small convolutional network is applied for sequence alignment where necessary. Two independent noise schedules are maintained for source and target sides to further stabilize the bidirectional learning process (Wu, 2023).
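The convolutional upsampling step can be sketched as follows. This is a generic illustration, not the paper's network: the repeat-then-smooth kernel, upsampling factor, and function name are all assumptions.

```python
import numpy as np

def conv1d_upsample(x, factor=2, kernel=None):
    """x: (T, d) embedding sequence -> (T*factor, d), via repeat + smoothing conv.

    Illustrative stand-in for a small learned CNN that aligns sequence
    lengths before cross-attention.
    """
    if kernel is None:
        kernel = np.array([0.25, 0.5, 0.25])      # simple fixed smoothing kernel
    up = np.repeat(x, factor, axis=0)             # nearest-neighbour upsample in time
    pad = len(kernel) // 2
    padded = np.pad(up, ((pad, pad), (0, 0)), mode="edge")
    # smooth each feature channel independently
    return np.stack([np.convolve(padded[:, c], kernel, mode="valid")
                     for c in range(x.shape[1])], axis=1)

seq = np.random.default_rng(0).standard_normal((7, 4))
out = conv1d_upsample(seq)
print(out.shape)  # (14, 4)
```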
6. Empirical Performance and Comparative Analysis
Compared to standard Conformer-based S2ST architectures—which require independent models or multitask training with no weight sharing—the reversible duplex Conformer (with RevFFN) achieves:
- A single model for both translation directions, with no increase in parameter count.
- Strict layer-wise invertibility and exact cycle consistency.
- Significant improvements in ASR-BLEU (approximately 1–2 BLEU over the UnitY baseline).
- Approximately 1.7× lower decoding time relative to two independent models.
A summary table situates the reversible duplex Conformer in comparison to the standard Conformer-based S2ST architecture (Wu, 2023):
| Model Type | Number of Models | Directionality | Cycle Consistency |
|---|---|---|---|
| Vanilla Conformer S2ST | 2 | Unidirectional | No |
| Reversible Duplex Conformer + Diffusion | 1 | Bidirectional | Yes (exact) |
7. Architectural Considerations and Implementation Practices
Several stabilization techniques are applied within the RevFFN paradigm:
- Pre-LN is used at each submodule for training stability.
- Inputs are upsampled by a small CNN to ensure cross-attention length compatibility.
- The architecture leverages pretrained self-supervised encoders for initialization.
- Reversible block symmetry eliminates the need to store intermediate activations, removing the associated memory cost.
- Independent noise schedules allow for flexibility in bidirectional diffusion optimization (Wu, 2023).
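Maintaining independent per-direction schedules can be as simple as instantiating two schedule objects with their own hyperparameters. The sketch below assumes linear DDPM-style beta schedules; the horizons and beta ranges shown are illustrative, not the paper's settings.

```python
import numpy as np

def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return per-step betas and cumulative alpha-bar values for T steps."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

# Source and target sides each get their own, independently tunable schedule.
betas_src, ab_src = linear_schedule(T=200)
betas_tgt, ab_tgt = linear_schedule(T=1000, beta_end=0.05)

print(len(ab_src), len(ab_tgt))  # 200 1000
```

Decoupling the two schedules lets each direction trade off noise-level coverage against the number of denoising steps independently, which is the flexibility the bullet above refers to.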
This integrated, invertible feed-forward design is critical to the efficiency and high performance demonstrated by duplex diffusion models in bidirectional speech-to-speech translation.