
RevFFN: Invertible FFN for S2ST

Updated 29 January 2026
  • RevFFN is a reversible feed-forward network module that enables exact invertibility in duplex Conformer architectures for bidirectional speech-to-speech translation.
  • It employs an additive residual structure across FFN, multi-head self-attention, and CNN submodules to ensure precise cycle consistency and efficient parameter sharing.
  • Integration with duplex diffusion objectives improves ASR-BLEU by 1–2 points and cuts decoding time by roughly 1.7× compared to running two separate unidirectional models.

RevFFN (Reversible Feed-Forward Network) is a core architectural component within the Reversible Duplex Conformer design, which underpins state-of-the-art bidirectional speech-to-speech translation (S2ST) systems. It enables exact invertibility within the backbone of a single, unified network, allowing forward and reverse mappings between speech domains with strict cycle consistency. This invertible construction both improves computational efficiency—by obviating the need for two separate unidirectional models—and enforces strong regularization via cycle-consistent training, when combined with duplex diffusion objectives (Wu, 2023).

1. Role of RevFFN in the Reversible Duplex Conformer

Within the reversible duplex Conformer, each layer is designed to be strictly invertible, with parameters fully shared between forward (source → target) and reverse (target → source) computations. At the heart of each such layer is a specialized set of submodules, including two Macaron-style feed-forward networks (Editor’s term: “RevFFN”), each applied with a 0.5 residual scaling. This substructure, together with the Multi-Head Self-Attention (MHSA) and convolutional submodules, is responsible for both the expressivity and the bijective property of the entire layer stack.

2. Architectural Details and Bijective Mechanism

Each reversible Conformer layer operates on an input tensor H^{l-1}, partitioned along the feature axis into two halves H^{l-1}(1) and H^{l-1}(2). The forward (“building block”) computation F^l is defined as follows (applied in sequence, always with Pre-LN):

  1. y(1) = x(1) + 0.5\,\mathrm{FFN}_1(x(2))
  2. y(2) = x(2) + \mathrm{MHSA}(y(1))
  3. z(1) = y(1) + \mathrm{CNN}(y(2))
  4. z(2) = y(2) + 0.5\,\mathrm{FFN}_2(z(1))

The corresponding reverse block (F^l)^{-1}, used for the target → source path, symmetrically subtracts each residual in exact reverse order:

  1. y(2) = z(2) - 0.5\,\mathrm{FFN}_2(z(1))
  2. y(1) = z(1) - \mathrm{CNN}(y(2))
  3. x(2) = y(2) - \mathrm{MHSA}(y(1))
  4. x(1) = y(1) - 0.5\,\mathrm{FFN}_1(x(2))
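The forward/inverse pair above can be sketched concretely. The NumPy toy below (all names hypothetical) substitutes arbitrary deterministic functions for FFN_1, MHSA, CNN, and FFN_2 — additive coupling never needs to invert the submodules themselves, only to recompute and subtract their outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary deterministic stand-ins for the four submodules; the block's
# invertibility does not depend on what these functions compute.
W1, W2, W3, W4 = (rng.standard_normal((8, 8)) for _ in range(4))
ffn1 = lambda h: np.tanh(h @ W1)
mhsa = lambda h: np.tanh(h @ W2)  # placeholder for self-attention
cnn  = lambda h: np.tanh(h @ W3)  # placeholder for the conv module
ffn2 = lambda h: np.tanh(h @ W4)

def block_forward(x1, x2):
    y1 = x1 + 0.5 * ffn1(x2)   # step 1
    y2 = x2 + mhsa(y1)         # step 2
    z1 = y1 + cnn(y2)          # step 3
    z2 = y2 + 0.5 * ffn2(z1)   # step 4
    return z1, z2

def block_inverse(z1, z2):
    y2 = z2 - 0.5 * ffn2(z1)   # undo step 4
    y1 = z1 - cnn(y2)          # undo step 3
    x2 = y2 - mhsa(y1)         # undo step 2
    x1 = y1 - 0.5 * ffn1(x2)   # undo step 1
    return x1, x2

x1, x2 = rng.standard_normal((2, 4, 8))       # the two feature-axis halves
r1, r2 = block_inverse(*block_forward(x1, x2))
```

Up to floating-point rounding, (r1, r2) reproduces (x1, x2), which is what lets the reverse translation path reuse the same weights without storing intermediate activations.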

The invertibility derives from the additive residual structure: the feed-forward, MHSA, and CNN submodules are applied in such a way that their contributions can be exactly “undone.” As a result, no intermediate activations need to be stored for backpropagation, giving memory advantages for deep models, following principles introduced by reversible residual networks (cf. Gomez et al., 2017, as referenced in the primary source) (Wu, 2023).

3. Invertibility, Cycle Consistency, and Reversible Computation

Denoting the full forward mapping as f_e(x) and the reverse mapping as f_o(z), the global mappings are

  • f_e(x) = F^L \circ \ldots \circ F^1(x)
  • f_o(z) = (F^{1})^{-1} \circ \ldots \circ (F^{L})^{-1}(z)

These satisfy f_e = (f_o)^{-1} and vice versa, so that f_e(f_o(z)) = z and f_o(f_e(x)) = x. This strict equivalence provides exact cycle consistency. The reversible construction ensures that all information preserved by the feed-forward, attention, and convolutional operations is theoretically recoverable in both translation directions.
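As a sketch of the global composition (with simplified two-coupling layers standing in for the full Conformer blocks; all names illustrative), applying the layer inverses in reverse order recovers the input exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 3, 6

# Per-layer coupling functions with illustrative random weights.
F = [(lambda h, W=rng.standard_normal((d, d)): np.tanh(h @ W)) for _ in range(L)]
G = [(lambda h, W=rng.standard_normal((d, d)): np.tanh(h @ W)) for _ in range(L)]

def layer_fwd(l, x1, x2):          # simplified F^l
    y1 = x1 + F[l](x2)
    y2 = x2 + G[l](y1)
    return y1, y2

def layer_inv(l, y1, y2):          # simplified (F^l)^{-1}
    x2 = y2 - G[l](y1)
    x1 = y1 - F[l](x2)
    return x1, x2

def f_e(h1, h2):                   # F^L o ... o F^1
    for l in range(L):
        h1, h2 = layer_fwd(l, h1, h2)
    return h1, h2

def f_o(h1, h2):                   # (F^1)^{-1} o ... o (F^L)^{-1}
    for l in reversed(range(L)):
        h1, h2 = layer_inv(l, h1, h2)
    return h1, h2

a1, a2 = rng.standard_normal((2, 5, d))
b1, b2 = f_o(*f_e(a1, a2))         # cycle: f_o(f_e(x)) = x
```

The inner loops make the ordering explicit: the reverse mapping must peel off layer L first and layer 1 last, mirroring the composition above.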

4. Integration with Duplex Diffusion Objectives

The RevFFN-empowered reversible backbone is combined with independent diffusion processes on the source and target sides:

  • Source diffusion q_x(\cdot) operates on source embeddings x_0.
  • Target diffusion q_y(\cdot) operates on target embeddings y_0.

Training alternates or sums diffusion losses for both directions: the model predicts added noise conditioned on the opposite-side embeddings, with mean-squared-error (MSE) losses for both reconstruction objectives. This duplex approach both regularizes the translation mapping and produces high-fidelity speech output for both directions in S2ST (Wu, 2023).
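A minimal sketch of one duplex training step, assuming standard DDPM-style noising (the schedules, shapes, and the toy predictor are illustrative, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 100

# Two independent noise schedules (cumulative alpha-bar_t), one per side.
abar_x = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, T))
abar_y = np.cumprod(1.0 - np.linspace(1e-4, 1e-2, T))

def q_sample(h0, abar, t, eps):
    """Forward diffusion: h_t = sqrt(abar_t) * h0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(abar[t]) * h0 + np.sqrt(1.0 - abar[t]) * eps

def duplex_loss(x0, y0, predict_eps):
    """Sum of the two direction-wise noise-prediction MSE losses."""
    total = 0.0
    for h0, abar, cond in ((x0, abar_x, y0), (y0, abar_y, x0)):
        t = int(rng.integers(T))
        eps = rng.standard_normal(h0.shape)
        h_t = q_sample(h0, abar, t, eps)
        total += np.mean((predict_eps(h_t, cond, t) - eps) ** 2)
    return total

# Toy predictor that ignores its conditioning; the real model conditions on
# the opposite-side embeddings through the shared reversible backbone.
predict = lambda h_t, cond, t: np.zeros_like(h_t)
x0, y0 = rng.standard_normal((2, 8, d))
loss = duplex_loss(x0, y0, predict)
```

Note how each direction draws its own timestep and noise from its own schedule, matching the article's point that the two sides are diffused independently.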

5. Training and Inference Workflow

The RevFFN structure enables unified training and inference. All Conformer parameters, including RevFFN weights, are shared; only the direction of computation differs (by flipping the input/output ports). During inference, encoding is performed via pretrained self-supervised models (e.g., wav2vec 2.0 or HuBERT), and the network runs either the x → y or the y → x translation by passing data through the corresponding reversible path. Decoding uses k-means units and a DiffWave vocoder. A small convolutional network upsamples inputs for sequence-length alignment where necessary. Two independent noise schedules are maintained for the source and target sides to further stabilize the bidirectional learning process (Wu, 2023).

6. Empirical Performance and Comparative Analysis

Compared to standard Conformer-based S2ST architectures—which require independent models or multitask training with no weight sharing—the reversible duplex Conformer (with RevFFN) achieves:

  • A single model for both translation directions, with no increase in parameter count.
  • Strict layer-wise invertibility and exact cycle consistency.
  • Significant improvements in ASR-BLEU (+1–2 BLEU over the UnitY baseline).
  • Approximately 1.7× lower decoding time relative to two independent models.

A summary table situates the reversible duplex Conformer in comparison to the standard Conformer-based S2ST architecture (Wu, 2023):

| Model Type | Number of Models | Directionality | Cycle Consistency |
| --- | --- | --- | --- |
| Vanilla Conformer S2ST | 2 | Unidirectional | No |
| Reversible Duplex Conformer + Diffusion | 1 | Bidirectional | Yes (exact) |

7. Architectural Considerations and Implementation Practices

Several stabilization techniques are applied within the RevFFN paradigm:

  • Pre-LN is used at each submodule for training stability.
  • Inputs are upsampled by a small CNN to ensure cross-attention length compatibility.
  • The architecture leverages pretrained self-supervised encoders for initialization.
  • Reversible block symmetry eliminates extra memory requirements for activations.
  • Independent noise schedules allow for flexibility in bidirectional diffusion optimization (Wu, 2023).
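For the Pre-LN point above, a minimal sketch of the pattern (normalize the submodule's input, then add the scaled residual, as in the 0.5-scaled FFNs; names are illustrative):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize over the feature axis (no learned gain/bias, for brevity)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def pre_ln_residual(h, submodule, scale=1.0):
    """Pre-LN residual: h + scale * f(LN(h))."""
    return h + scale * submodule(layer_norm(h))

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
out = pre_ln_residual(h, lambda u: np.tanh(u), scale=0.5)
```

Because the normalization sits inside the residual branch, the identity path stays untouched, which is the usual reason Pre-LN stabilizes deep (and here, reversible) stacks.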

This integrated, invertible feed-forward design is critical to the efficiency and high performance demonstrated by duplex diffusion models in bidirectional speech-to-speech translation.
