Mamba-FeedForward-Attention Block (MFA)

Updated 30 July 2025
  • Mamba-FeedForward-Attention Block (MFA) is a neural architecture that fuses selective state space modeling, feedforward non-linearity, and attention mechanisms to enable scalable sequence modeling.
  • It employs input-dependent gating and hidden attention maps to balance efficiency and expressiveness, achieving near-linear complexity in contrast to the quadratic cost of traditional self-attention models.
  • By integrating domain-adapted modifications, MFA demonstrates versatility across NLP, vision, speech, and other applications while maintaining empirical performance and interpretability.

The Mamba-FeedForward-Attention Block (MFA) is a neural architecture motif that fuses selective state space modeling, feedforward non-linearity, and attention-inspired mechanisms into a unified block. This construction is designed for scalable, expressive sequence modeling across domains—such as natural language processing, computer vision, speech, time series, and symbolic music—offering linear or near-linear complexity while preserving or enhancing the representational richness of self-attention-based approaches (Ali et al., 3 Mar 2024, Gao et al., 5 May 2024, Zhang et al., 21 May 2024, Han et al., 26 May 2024, Ibrahim et al., 11 Feb 2025, Wang et al., 28 Feb 2025, Yuan et al., 27 Jul 2025). The MFA block is grounded in a reinterpretation of selective state space models (SSMs) as data-controlled linear operators with hidden but explainable attention dynamics, implemented with domain-adapted gating, bidirectionality, and integration with domain-specialized local operations.

1. Mathematical Foundations and Architecture

The mathematical core of the MFA block arises from the selective SSM mechanism, in which each layer processes an input sequence $x \in \mathbb{R}^{L \times D}$ as follows:

  • The input is processed with a linear projection, SiLU activation, and a 1D convolution to obtain $\hat{x} = \text{SiLU}(\text{Conv1D}(\text{Linear}(x)))$ and an auxiliary gate $\hat{z} = \text{SiLU}(\text{Linear}(x))$ (Ali et al., 3 Mar 2024).
  • The selective SSM maps $\hat{x}$ to an output via a stateful recurrence with time-varying, input-dependent matrices,

$$h_t = \sum_{j=1}^t \left(\prod_{k=j+1}^t \bar{A}_k\right) \bar{B}_j x_j, \qquad y_t = C_t h_t$$

or, for the full sequence,

$$y = \tilde{\alpha}\, x, \qquad \tilde{\alpha}_{i,j} = C_i \left(\prod_{k=j+1}^i \bar{A}_k\right) \bar{B}_j$$

where all parameters are generated through learnable projections of the input (Ali et al., 3 Mar 2024).

  • The output is gated: $\hat{y}' = \text{Linear}(\text{SelectiveSSM}(\hat{x}) \odot \hat{z})$, with $\odot$ denoting the elementwise product, and finally $\hat{y} = \text{LayerNorm}(\hat{y}' + x)$. A minimal end-to-end sketch of these steps follows this list.
  • MFA blocks may be stacked (interleaved with convolutional, attention, or further feedforward layers) and extended by domain-specific modifications—e.g., bidirectional SSM for video or speech (Gao et al., 5 May 2024, Zhang et al., 21 May 2024), explicit self-attention for local details (Yuan et al., 27 Jul 2025), and gating/normalization strategies (Han et al., 26 May 2024).
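
The following is a minimal, illustrative sketch of such a block in PyTorch. It simplifies the selective SSM to a scalar state per channel updated in a plain Python loop (real Mamba layers use an $N$-dimensional state, a ZOH-style discretization of $(A, B)$, and a hardware-aware parallel scan), and all module and parameter names (e.g., MFABlockSketch, a_proj) are illustrative rather than taken from the cited papers.

```python
# Minimal sketch of an MFA-style block, assuming a simplified scalar SSM state per
# channel and a sequential loop; not the cited implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MFABlockSketch(nn.Module):
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, d_model)
        # Depthwise causal convolution over the sequence dimension.
        self.conv = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1, groups=d_model)
        # Input-dependent SSM parameters (one scalar per token and channel).
        self.a_proj = nn.Linear(d_model, d_model)  # forget gate / decay
        self.b_proj = nn.Linear(d_model, d_model)  # input gate
        self.c_proj = nn.Linear(d_model, d_model)  # readout
        self.out_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def selective_ssm(self, x_hat: torch.Tensor) -> torch.Tensor:
        # x_hat: (B, L, D); recurrence h_t = A_t * h_{t-1} + B_t * x_t, y_t = C_t * h_t.
        a = torch.sigmoid(self.a_proj(x_hat))   # decay in (0, 1)
        b = F.softplus(self.b_proj(x_hat))      # non-negative input gate
        c = self.c_proj(x_hat)
        h = torch.zeros_like(x_hat[:, 0])
        outputs = []
        for t in range(x_hat.size(1)):
            h = a[:, t] * h + b[:, t] * x_hat[:, t]
            outputs.append(c[:, t] * h)
        return torch.stack(outputs, dim=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D)
        u = self.in_proj(x).transpose(1, 2)                             # (B, D, L)
        x_hat = F.silu(self.conv(u)[..., : x.size(1)]).transpose(1, 2)  # causal conv + SiLU
        z_hat = F.silu(self.gate_proj(x))                               # auxiliary gate
        y = self.selective_ssm(x_hat) * z_hat                           # gated SSM output
        return self.norm(self.out_proj(y) + x)                          # residual + LayerNorm


block = MFABlockSketch(d_model=64)
out = block(torch.randn(2, 128, 64))    # (batch, sequence length, channels)
```

The sigmoid and softplus parameterizations of the decay and input gates are one convenient choice for a sketch; they are not the discretization used in the cited implementations.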

The MFA design can be summarized schematically as:

| Stage | Operation | Description |
|---|---|---|
| Input | Proj + Act + Conv | $\hat{x} = \text{SiLU}(\text{Conv1D}(\text{Linear}(x)))$ |
| Gate | Projection + Act | $\hat{z} = \text{SiLU}(\text{Linear}(x))$ |
| Selective SSM | Input-dependent state-space scan | Learned, input-dependent $\bar{A}, \bar{B}, C$ |
| Gating | Hadamard product | $\text{SelectiveSSM}(\hat{x}) \odot \hat{z}$ |
| Output | Proj, Norm, Residual | Final linear, normalization, and skip connection: $\text{LayerNorm}(\cdot + x)$ |

Fundamentally, this results in a data-controlled, linear operator with "hidden attention maps"—making MFA blocks simultaneously efficient (due to their SSM ancestry) and explainable (analogous to, but structurally distinct from, transformer self-attention) (Ali et al., 3 Mar 2024, Han et al., 26 May 2024).
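
To make the hidden-attention view concrete, the sketch below materializes $\tilde{\alpha}$ for a single channel of the simplified scalar-state SSM sketched above. The $O(L^2)$ construction is only for inspection and attribution, never for efficient inference, and the function name is an assumption of this sketch.

```python
# Materialize the hidden attention kernel alpha~_{i,j} = C_i (prod_{k=j+1..i} A_k) B_j
# for one channel, given per-token decay a in (0, 1), input gate b, and readout c
# (e.g., one channel of the MFABlockSketch above). O(L^2) memory; inspection only.
import torch


def hidden_attention(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # a, b, c: shape (L,) for a single channel; a must be strictly positive.
    L = a.size(0)
    log_cum = torch.cumsum(torch.log(a), dim=0)         # log prod_{k<=i} a_k
    alpha = torch.zeros(L, L)
    for i in range(L):
        for j in range(i + 1):
            decay = torch.exp(log_cum[i] - log_cum[j])  # prod_{k=j+1}^{i} a_k (empty product = 1)
            alpha[i, j] = c[i] * decay * b[j]
    return alpha                                        # causal, lower-triangular kernel

# For that channel, alpha @ x_hat reproduces the output of the sequential recurrence.
```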

2. Comparison with Transformer Self-Attention and Linear Attention

The MFA block's distinguishing features relative to classical transformer self-attention and linear attention include:

  • Implicit Attention: Rather than softmax-normalized token pair weights, MFA's implicit attention is realized via selective, input-dependent SSM parameters, producing a hidden (non-softmax) kernel $\tilde{\alpha}$ that mediates long-range and local token dependence (Ali et al., 3 Mar 2024, Wang et al., 28 Feb 2025).
  • Input and Forget Gates: Input gating via softplus functions modulates token utility, while forget gates induce position- and context-dependent decay, providing a local bias and an implicit positional encoding; these mechanisms are absent from conventional linear attention (Han et al., 26 May 2024). A worked special case of this decay is given after this list.
  • Normalization and Block Design: MFA omits explicit attention normalization within the SSM, relying instead on block-level layer normalization, which—combined with gating and skip shortcuts—yields stable optimization and strong empirical performance. Omitting normalization altogether, or using a simplistic block design, degrades performance (Han et al., 26 May 2024).
  • Resolution and Explainability: Each channel of the SSM defines a separate inner attention map. For $N$ channels and $D$ features, this can yield $(D \cdot N)/H$ times more attention maps compared to $H$-head transformers, resulting in finer granularity for attribution and interpretability (Ali et al., 3 Mar 2024).
  • Parallel Scan / Recurrent Modes: MFA blocks compute via IO-aware parallel scan during training, and support efficient RNN-like autoregressive deployment, endowing them with architectural flexibility (Ali et al., 3 Mar 2024, Han et al., 26 May 2024).
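
As a concrete special case (an illustration derived from the kernel above, not a construction taken from the cited papers): if the forget gate is frozen to a constant value $\bar{A}_k = a$ with $0 < a < 1$, the hidden kernel collapses to

$$\tilde{\alpha}_{i,j} = C_i\, a^{\,i-j}\, \bar{B}_j,$$

an exponential decay in the relative position $i - j$. This is why the forget gate simultaneously supplies a local bias and an implicit positional encoding, and why its effect can be approximated by fixed, parallelizable positional decay schemes when input dependence is traded for speed.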

A key theoretical result is that a single channel of the selective SSM used in MFA can realize all functions expressible by a transformer's self-attention head, and even functions outside the capacity of any single transformer head (e.g., the "count in row" function) (Ali et al., 3 Mar 2024).

3. Domain Adaptations and Empirical Applications

MFA blocks have been successfully adapted across domains by tailoring the auxiliary modules and the scan structure:

  • Speech: Bidirectional Mamba variants (InnBiMamba, ExtBiMamba), which combine separate or shared input/output projections, demonstrate superior speech enhancement and recognition performance versus both unidirectional Mamba and transformer baselines, particularly when employed as a replacement for multi-head self-attention layers and supported by additional feedforward non-linearity (Zhang et al., 21 May 2024). A generic bidirectional-scan wrapper is sketched after this list.
  • Vision: MFA-based models rival or surpass Vision Transformers (ViT) in image classification and segmentation, with attention maps capturing both local patch and global image dependencies. Incorporating cross-scan modules, position embeddings, and hierarchical designs further boosts performance in high-resolution and dense prediction scenarios (Ibrahim et al., 11 Feb 2025, Ali et al., 3 Mar 2024, Han et al., 26 May 2024).
  • Video: The hybrid use of local spatial-temporal attention and global bidirectional Mamba within MFA blocks enables scalability in video generation, maintaining competitive Fréchet Video Distance (FVD) scores and showing positive correlation between model complexity and output quality (Gao et al., 5 May 2024).
  • Music and Long Sequence Modeling: In symbolic music generation, the MFA block unifies SSM-driven global context modeling, feedforward non-linearity, and a single self-attention layer to balance scalability with fine musical expressivity. Experimental results show reduced GPU usage and step time with improved generation quality over transformer and Mamba-only models (Yuan et al., 27 Jul 2025).
  • Time Series and Point Clouds: Extensions introduce fast-attention modules for cross-variate dependencies (FMamba), pointwise latent attention for local 3D geometry (PointLAMA), or adaptive pooling for global context in time series, each retaining the selective, linear-complexity SSM at the core (Ma et al., 20 Jul 2024, Lin et al., 23 Jul 2025, Xiong et al., 2 Apr 2025).
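
The sketch below shows a generic bidirectional-scan wrapper of the kind referenced above, under the assumption that any causal sequence layer (such as the MFABlockSketch earlier) can be run in each direction; the InnBiMamba and ExtBiMamba variants differ in how projections are shared between directions, which is not reproduced here.

```python
# Generic bidirectional-scan wrapper (a sketch only, not the cited variants).
import torch
import torch.nn as nn


class BidirectionalWrapper(nn.Module):
    def __init__(self, forward_layer: nn.Module, backward_layer: nn.Module, d_model: int):
        super().__init__()
        self.fwd = forward_layer
        self.bwd = backward_layer
        self.merge = nn.Linear(2 * d_model, d_model)   # fuse the two scan directions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_fwd = self.fwd(x)                            # left-to-right scan
        y_bwd = self.bwd(x.flip(1)).flip(1)            # right-to-left scan, realigned
        return self.merge(torch.cat([y_fwd, y_bwd], dim=-1))
```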

4. Interpretability and Attention Pattern Analysis

The hidden attention matrices $\tilde{\alpha}$ arising in MFA blocks are amenable to established attribution and explainability methods, analogous to attention rollout in transformers (Ali et al., 3 Mar 2024, Wang et al., 28 Feb 2025). In vision models, these matrices reveal structured dependency bands (corresponding to patch spatial or scan order), and respond to patch-ordering strategies by manifesting distinct patterns—e.g., diagonal, Morton, or spiral orders each induce characteristic attention clusters over the input image grid. Such sensitivity provides insight into spatial bias and the effect of sequence ordering.
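
As a concrete illustration of the rollout analogy, the sketch below applies the standard attention-rollout recipe to a stack of channel-averaged hidden attention matrices. Treating $|\tilde{\alpha}|$ as a non-negative relevance map and row-normalizing it are assumptions of this sketch, since the hidden kernel is not softmax-normalized.

```python
# Attention-rollout-style aggregation over per-layer hidden attention maps.
# hidden_maps: one channel-averaged (L, L) matrix per layer, ordered shallow to deep.
import torch


def rollout(hidden_maps: list[torch.Tensor]) -> torch.Tensor:
    L = hidden_maps[0].size(0)
    joint = torch.eye(L)
    for A in hidden_maps:
        A = 0.5 * A.abs() + 0.5 * torch.eye(L)   # absolute value + identity for the residual path
        A = A / A.sum(dim=-1, keepdim=True)      # row-normalize into a relevance distribution
        joint = A @ joint                        # propagate attribution through the layer
    return joint                                 # joint[i, j]: relevance of input token j to output token i
```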

Patch-wise or token-wise attention can thus be visualized and analyzed, confirming that early MFA layers focus on spatial proximity, while deeper layers attend to content and context (Wang et al., 28 Feb 2025). In point cloud models, the interaction between SSM-driven global state and locally-gated latent attention enables targeted representation capture of both global structure and fine geometry (Lin et al., 23 Jul 2025).

5. Computational Efficiency, Scalability, and Design Trade-Offs

MFA blocks directly address the quadratic scaling bottleneck of standard attention by relegating global context modeling to SSMs with linear complexity in sequence length.

A practical consideration is the balance between recurrence (offered by SSM gates and sequential scans, yielding more potent position- and locality-sensitive representations) and parallelism (critical for modern hardware utilization). The choice depends on the domain and task-specific latency or throughput requirements (Han et al., 26 May 2024). In some cases, the forget gate's effect can be replaced by parallelizable positional encodings to expedite inference with small performance trade-offs.
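
A rough back-of-the-envelope comparison of the two scaling regimes is sketched below; constants, exact Mamba kernel costs, and hardware effects are ignored, and the figures are illustrative rather than measurements from the cited papers.

```python
# Illustrative scaling comparison: quadratic attention vs. linear selective SSM.
def attention_flops(L: int, D: int) -> int:
    return 2 * L * L * D          # QK^T and attention-weighted V, each ~L^2 * D

def selective_ssm_flops(L: int, D: int, N: int = 16) -> int:
    return 2 * L * D * N          # per-token state update and readout, each ~D * N

for L in (1_000, 10_000, 100_000):
    a, s = attention_flops(L, 768), selective_ssm_flops(L, 768)
    print(f"L={L:>7}: attention ~{a:.1e} FLOPs, selective SSM ~{s:.1e} FLOPs, ratio ~{a / s:.0f}x")
```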

6. Theoretical Significance and Expressiveness

Theoretical analyses demonstrate that:

  • MFA's selective SSM mechanism subsumes standard dot product attention: an SSM channel can realize any function expressible by a transformer's attention head.
  • MFA blocks can compute functions (such as the "count in row" mapping) that are provably out of reach for transformer heads of equivalent size (Ali et al., 3 Mar 2024); a simple intuition for this separation is sketched after this list.
  • This universality is attributed to the flexible, input-dependent generation of SSM parameters, dynamic gating, and the absence of hard symmetry or normalization constraints typical in transformer attention (Han et al., 26 May 2024).
  • In speech and vision domains, ablation studies highlight design elements critical for practical expressiveness—most notably the inclusion of non-linear processing layers (feedforward, gating) and careful initialization protocols for SSM parameters (Zhang et al., 21 May 2024, Han et al., 26 May 2024).
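
One simple intuition for the expressiveness claims above (a sketch, not the formal construction in Ali et al., 3 Mar 2024): with the forget and input gates held at 1, a single SSM channel computes a running count of a token feature, which grows without bound, whereas a single softmax attention head outputs a convex combination of its value vectors and therefore stays inside their convex hull.

```python
# Sketch: with A_t = 1 and B_t = 1 a selective-SSM channel is a running sum,
# e.g., counting how many "1" tokens have appeared so far. This illustrative setup
# is not the formal separation argument from the cited paper.
import torch

x = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])  # binary per-token feature
h = torch.zeros(())
counts = []
for t in range(x.numel()):
    h = 1.0 * h + 1.0 * x[t]                      # h_t = A_t h_{t-1} + B_t x_t with A = B = 1
    counts.append(h.item())
print(counts)                                     # [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]
# A single softmax head's output is a convex combination of its value vectors,
# so with values bounded in [0, 1] its output can never exceed 1.
```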

In summary, the MFA block represents a synthesis of selective state space modeling, gating, and feedforward non-linearity, implemented within a flexible and interpretable architectural scaffold. Its theoretical expressiveness, resource efficiency, and adaptability across domains have been validated by empirical studies and motivate continued research into hybrid SSM-attention models and their domain-aligned specializations.