MambAdapter: PETL for Speech & Audio
- MambAdapter is a parameter-efficient transfer learning method for speech and audio that integrates a lightweight Mamba state-space module into low-rank bottleneck adapters with shared projections.
- It minimizes trainable parameters using a shared projection strategy, achieving competitive accuracies (up to 89.85% on AST benchmarks) and reduced computational costs.
- Empirical evaluations show strong performance in both audio classification and ASR tasks, with notable improvements in accuracy and WER compared to traditional fine-tuning methods.
MambAdapter denotes a parameter-efficient transfer learning method that integrates a lightweight Mamba state-space module into low-rank bottleneck adapters for speech and audio, while sharing projection matrices across inserted adapters in a frozen Transformer backbone (Ali et al., 14 Jun 2026). In its exact published form, the method targets Transformer-based foundation models such as AST and Whisper, and is evaluated on audio classification and automatic speech recognition. In a broader, analogy-based usage, the term also serves as a convenient label for adapter-like Mamba integrations that attach selective state-space computation to pretrained backbones, multimodal fusion stacks, or modality-bridging modules; however, many related works do not use the exact name and should be distinguished from the specific PETL method introduced for speech and audio.
1. Definition and conceptual scope
In the narrow sense, MambAdapter is a bottleneck-adapter variant for frozen Transformer models in which the usual latent transformation is replaced by a lightweight Mamba block, and the down- and up-projection matrices are shared across layers (Ali et al., 14 Jun 2026). The design combines three elements: frozen pretrained backbones, adapter-style parameter efficiency, and linear-time selective state-space modeling in the bottleneck space. The intended setting is transfer learning where full fine-tuning is costly in memory, computation, and task-specific storage.
A central distinction separates MambAdapter from several nearby architectures. It is not a full Mamba backbone, because the pretrained Transformer remains intact and frozen. It is not merely LoRA, because the trainable computation is not restricted to low-rank updates on existing projections. It is also not a generic hybrid Transformer–Mamba model trained from scratch. This distinction matters because a substantial portion of the later literature uses Mamba in adapter-like ways without matching the original PETL formulation.
A broader reading of the term is plausible because multiple papers instantiate closely related patterns: Mamba-based modules inserted into frozen or largely frozen systems, often to recover temporal structure, volumetric context, or multimodal interactions. This suggests that “MambAdapter” can function as an editor’s umbrella term for a family of Mamba-based adaptation mechanisms, but the exact named method remains the speech-and-audio PETL approach of “MambAdapter: Lightweight Mamba-Based Adapters for Parameter-Efficient Transfer Learning in Speech and Audio” (Ali et al., 14 Jun 2026).
2. Core architecture
MambAdapter starts from a standard bottleneck-adapter template. Let $\hat X$ denote the input to an adapted Transformer submodule, and let $F(\hat X)$ denote the underlying Transformer component output, where $F$ can be the FFN or Attention module. The adapter computes
$X_{\text{out}} = \alpha \cdot \mathrm{Mamba}(\hat X W_{\text{down}}) W_{\text{up}} + F(\hat X)$
where $\hat X$ has hidden dimension $d$, $W_{\text{down}} \in \mathbb{R}^{d \times r}$, $W_{\text{up}} \in \mathbb{R}^{r \times d}$, $r \ll d$ is the bottleneck rank, and $\alpha$ is a learnable scaling factor initialized to $0.1$ (Ali et al., 14 Jun 2026). The implied bottleneck decomposition is
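$Z_{\text{down}} = \hat X W_{\text{down}}$
$Z_{\text{mamba}} = \mathrm{Mamba}(Z_{\text{down}})$
$Z_{\text{up}} = Z_{\text{mamba}} W_{\text{up}}$
$X_{\text{out}} = \alpha \cdot Z_{\text{up}} + F(\hat X)$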
The Mamba block operates only in the low-rank bottleneck space. This is the defining architectural move: sequence modeling capacity is added without replacing the backbone and without moving Mamba to the full hidden dimension. The paper argues that this is especially suitable for speech and audio because these domains exhibit long temporal contexts, structured dynamics in continuous signals, and short-range patterns that benefit from Mamba’s local convolutional component (Ali et al., 14 Jun 2026).
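The composition described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_sequence_mixer` is a hypothetical stand-in (a fixed causal decay recurrence) for the real selective Mamba block, and the function names, the list-of-rows matrix layout, and the default `decay` value are choices made here for readability.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_sequence_mixer(z, decay=0.5):
    """Hypothetical stand-in for the Mamba block (NOT the real selective SSM).

    Applies the causal recurrence h_t = decay * h_{t-1} + z_t independently
    to each of the r bottleneck channels.
    """
    h = [0.0] * len(z[0])
    out = []
    for row in z:
        h = [decay * hi + zi for hi, zi in zip(h, row)]
        out.append(list(h))
    return out


def adapter_forward(x, f_out, w_down, w_up, alpha=0.1, mixer=toy_sequence_mixer):
    """Parallel adapter: X_out = alpha * Mixer(X W_down) W_up + F(X).

    x      : (T, d) input to the frozen submodule
    f_out  : (T, d) output F(x) of the frozen submodule
    w_down : (d, r) down-projection
    w_up   : (r, d) up-projection
    """
    z = matmul(x, w_down)   # project to the rank-r bottleneck, shape (T, r)
    z = mixer(z)            # sequence modeling happens only in the bottleneck
    up = matmul(z, w_up)    # project back to the hidden size, shape (T, d)
    return [
        [alpha * u + f for u, f in zip(u_row, f_row)]
        for u_row, f_row in zip(up, f_out)
    ]
```

With `alpha=0.0` the function returns `f_out` unchanged, so a small initial scaling factor keeps the adapted submodule close to the frozen one at the start of training; this is one plausible reading of the 0.1 initialization rather than a claim made by the paper.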
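The projection matrices are shared across all inserted adapters. If $L$ adapters are inserted, an ordinary non-shared bottleneck design requires $2Ldr$ projection parameters, whereas MambAdapter reduces this to $2dr$ by enforcing
$W_{\text{down}}^{(1)} = \dots = W_{\text{down}}^{(L)} = W_{\text{down}}, \quad W_{\text{up}}^{(1)} = \dots = W_{\text{up}}^{(L)} = W_{\text{up}}$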
This parameter sharing is one of the method’s main efficiency mechanisms. The trade-off, as the paper notes, is reduced per-layer projection flexibility, with the layer-specific Mamba blocks intended to recover expressivity (Ali et al., 14 Jun 2026).
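The effect of sharing on the projection budget can be checked with a short count. The sketch below assumes each adapter owns one $d \times r$ down-projection and one $r \times d$ up-projection with no biases; the function name and the values `d = 768`, `r = 8`, and `num_adapters = 12` are illustrative assumptions, not figures taken from the paper.

```python
def projection_params(d, r, num_adapters, shared):
    """Count down/up projection weights across all inserted adapters.

    Assumes one (d x r) and one (r x d) matrix per adapter, no bias terms.
    """
    per_pair = 2 * d * r
    return per_pair if shared else num_adapters * per_pair


# Illustrative values only (assumptions, not the paper's settings).
d, r, num_adapters = 768, 8, 12
non_shared = projection_params(d, r, num_adapters, shared=False)
shared = projection_params(d, r, num_adapters, shared=True)
print(non_shared, shared, non_shared // shared)
```

Under these assumptions the shared design stores the projections once, so the projection cost no longer grows with the number of adapters; only the per-layer Mamba blocks and scaling factors do.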
Insertion strategy is task- and backbone-specific. For AST, MambAdapter is tested in both the Pfeiffer configuration, with adapters inserted after the FFN only, and the Houlsby configuration, with adapters inserted after both attention and FFN. For Whisper ASR, the adapters are inserted in the encoder only, using the Pfeiffer configuration, while the decoder remains frozen. All experiments use parallel adapter insertion rather than sequential insertion (Ali et al., 14 Jun 2026).
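The two placement rules can be written down as a small enumeration. This is a hedged sketch: `adapter_sites`, the submodule labels `"attention"` and `"ffn"`, and the 12-layer example are hypothetical names and values chosen here, and the code only lists where parallel adapters would attach rather than modifying any real model.

```python
def adapter_sites(num_layers, config):
    """Return (layer_index, submodule) pairs that receive a parallel adapter.

    "pfeiffer": one adapter per layer, on the FFN only.
    "houlsby" : two adapters per layer, on attention and on the FFN.
    """
    if config == "pfeiffer":
        submodules = ("ffn",)
    elif config == "houlsby":
        submodules = ("attention", "ffn")
    else:
        raise ValueError(f"unknown adapter configuration: {config!r}")
    return [(layer, sub) for layer in range(num_layers) for sub in submodules]


# Hypothetical 12-layer encoder.
print(len(adapter_sites(12, "pfeiffer")), len(adapter_sites(12, "houlsby")))
```

Because the Houlsby placement doubles the number of sites, it also doubles the number of per-layer Mamba blocks, which is consistent in direction with the larger trainable-parameter count reported for that configuration.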
3. State-space formulation and implementation choices
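The paper introduces Mamba starting from a continuous-time linear state-space model,
$h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)$
with discrete form
$h_t = \bar A h_{t-1} + \bar B x_t, \quad y_t = C h_t$
where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, and $N$ is the state dimension, also denoted $d_{\text{state}}$ (Ali et al., 14 Jun 2026). The paper further notes that Mamba introduces input-dependent selectivity and a local convolution with a fixed kernel size, and summarizes its complexity as linear in the sequence length.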
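A discrete linear state-space recurrence of this kind can be illustrated with a diagonal state matrix and a scalar input channel. The sketch below makes several assumptions that are not stated in the text: it uses zero-order-hold discretization, a diagonal $A$ with nonzero entries, and fixed (input-independent) parameters, so it omits Mamba's selectivity and local convolution. The function names are hypothetical.

```python
import math


def discretize_zoh(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal A (assumed rule).

    a_bar_i = exp(delta * a_i)
    b_bar_i = (exp(delta * a_i) - 1) / a_i * b_i   (requires a_i != 0)
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar


def ssm_scan(a_bar, b_bar, c, xs):
    """Run h_t = a_bar * h_{t-1} + b_bar * x_t and y_t = c . h_t over xs."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# One-dimensional state with a = -1, b = c = 1, driven by a constant input.
a_bar, b_bar = discretize_zoh([-1.0], [1.0], delta=0.1)
print(ssm_scan(a_bar, b_bar, [1.0], [1.0, 1.0, 1.0]))
```

The scan visits each time step once with a fixed-size state, which is the source of the linear-in-length cost. For a constant unit input, this example follows the continuous step response $1 - e^{-t}$ sampled at multiples of `delta`.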
Within MambAdapter, the Mamba module is explicitly described as lightweight, and the paper gives an approximate parameter count for each inserted Mamba block (Ali et al., 14 Jun 2026). This count matters because the method's total trainable budget is the sum of the shared projections, the per-layer Mamba blocks, and the per-adapter scaling factors.
The published implementation details are selective rather than exhaustive. What is specified includes the bottleneck ranks, placement rules, and scaling-factor initialization. For AST classification, the paper fixes the rank used by MambAdapter and the ranks used by the Conformer and standard bottleneck adapters. For Whisper ASR, the ranks of LoRA, Bottleneck, Conformer, and MambAdapter are adjusted to roughly match parameter budgets (Ali et al., 14 Jun 2026). The scaling factor $\alpha$ is per adapter layer and initialized to $0.1$.
Several low-level details are not given in the paper text: exact activation inside the adapter path, whether layer normalization is inside the adapter, exact initialization of $W_{\text{down}}$ and $W_{\text{up}}$, exact per-layer Mamba implementation beyond hyperparameters, and the training hyperparameters beyond the high-level summary. This omission is material for exact reproduction, but it does not obscure the main architectural identity of the method (Ali et al., 14 Jun 2026).
4. Empirical performance in speech and audio
MambAdapter is evaluated in two settings: audio and speech classification with AST pretrained on AudioSet, and ASR with Whisper pretrained on about 680,000 hours of multilingual multitask web audio (Ali et al., 14 Jun 2026). The classification benchmarks are ESC-50 (ESC), UrbanSound8K (US8K), Speech Commands V2 (GSC), and Fluent Speech Commands (FSC); results are reported in accuracy. The ASR benchmarks are five Common Voice 13 languages (Abkhaz, Central Kurdish, Esperanto, Kabyle, and Kinyarwanda), with word error rate as the main metric (Ali et al., 14 Jun 2026).
The principal reported results are summarized below.
| Setting | Main result | Trainable parameters |
|---|---|---|
| AST, Pfeiffer | MambAdapter average accuracy 89.72% | 0.06M |
| AST, Houlsby | MambAdapter average accuracy 89.85% | 0.11M |
| Whisper ASR | MambAdapter average WER 49.9 | 1.1M |
In the Pfeiffer configuration on AST, MambAdapter attains average accuracy 89.72%, compared with 90.07% for Conformer, 82.56% for Bottleneck, and 84.49% for LoRA, while using 0.06M trainable parameters versus 0.27M for Conformer (Ali et al., 14 Jun 2026). On US8K it reaches 82.51%, slightly above Conformer’s 82.04%, and on ESC it ties Conformer at 87.28%. In the Houlsby configuration, MambAdapter achieves the best average performance at 89.85%, ahead of Conformer at 89.69% and Bottleneck at 83.92%, while using 0.11M trainable parameters versus 0.54M for Conformer. A notable result is that on ESC, MambAdapter reaches 87.55%, slightly exceeding full fine-tuning at 87.48% (Ali et al., 14 Jun 2026).
For Whisper ASR, the average WER across the five languages is 49.9 for MambAdapter, compared with 50.7 for Bottleneck, 55.7 for Conformer, 57.3 for LoRA, and 45.18 for full fine-tuning (Ali et al., 14 Jun 2026). The paper emphasizes that full fine-tuning remains strongest overall, but adapters narrow the gap to about 4% WER while training only 0.45% of Whisper’s parameters, consistent with 1.1M trainable parameters versus 241M in full fine-tuning.
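The gap and the trainable fraction can be recomputed directly from the averages quoted above. The snippet is only arithmetic on those rounded figures; the variable names are introduced here, and small differences from the paper's own percentages are expected because the inputs are rounded.

```python
# Average WER across the five languages, as quoted in the text.
avg_wer = {
    "MambAdapter": 49.9,
    "Bottleneck": 50.7,
    "Conformer": 55.7,
    "LoRA": 57.3,
    "Full fine-tuning": 45.18,
}

# Absolute WER gap of each method to full fine-tuning (in WER points).
reference = avg_wer["Full fine-tuning"]
gap_to_full = {name: round(wer - reference, 2) for name, wer in avg_wer.items()}

# Share of parameters trained, from the rounded 1.1M and 241M figures.
trainable_fraction_pct = 100 * 1.1e6 / 241e6

print(gap_to_full)
print(round(trainable_fraction_pct, 2))
```

On these figures MambAdapter sits 4.72 absolute WER points behind full fine-tuning, the smallest gap among the four adapter methods listed.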
The scaling experiments on GSC under the Pfeiffer configuration qualify these results. Under tight parameter budgets below 500k parameters, MambAdapter outperforms both Bottleneck and Conformer by 0.5–4%, whereas beyond roughly 600k parameters, Conformer surpasses MambAdapter and peaks around 95.5% accuracy (Ali et al., 14 Jun 2026). This indicates that MambAdapter is especially competitive in reduced-parameter regimes rather than uniformly dominant across all budgets.
Ablation studies attribute most of the gains to the Mamba block itself. On ESC, GSC, and FSC, the full MambAdapter attains 92.12 average with 0.06M parameters. Removing Mamba reduces the average to 80.03, with the largest drop on FSC from 94.98 to 65.38. Removing the scaling factor causes a smaller drop to 91.39. Disabling parameter sharing raises the average only slightly to 92.26, but increases the parameter count from 0.06M to 0.28M, which the paper presents as evidence that projection sharing is an effective trade-off (Ali et al., 14 Jun 2026).
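The relative size of each ablation effect is easier to see as differences against the full model. The snippet below only restates the numbers quoted in this paragraph; the dictionary keys and variable names are labels chosen here.

```python
# Average accuracy over ESC, GSC, and FSC, as quoted in the text.
avg_acc = {
    "full": 92.12,
    "no_mamba": 80.03,
    "no_scaling": 91.39,
    "no_sharing": 92.26,
}

# Change in average accuracy relative to the full MambAdapter (points).
delta_vs_full = {
    name: round(acc - avg_acc["full"], 2)
    for name, acc in avg_acc.items()
    if name != "full"
}

# FSC drop when the Mamba block is removed, and the parameter cost of
# disabling projection sharing (0.28M versus 0.06M).
fsc_drop_without_mamba = round(94.98 - 65.38, 2)
param_ratio_no_sharing = 0.28 / 0.06

print(delta_vs_full, fsc_drop_without_mamba, round(param_ratio_no_sharing, 2))
```

Read this way, removing Mamba costs about 12 points on average, while disabling sharing gains 0.14 points for roughly 4.7 times the parameters.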
Latency and memory measurements on an NVIDIA H100 show minimal memory overhead relative to other adapters, but the largest latency overhead in strict streaming. For Whisper + MambAdapter, latency is 17.66 ms in streaming, 31.62 ms in mid, and 110.21 ms in batch mode, compared with baseline Whisper at 13.03 ms, 28.58 ms, and 100.66 ms respectively. The paper therefore concludes that MambAdapter is better suited to offline and long-form processing than strict low-latency streaming (Ali et al., 14 Jun 2026).
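The relative overhead per mode follows from the latencies quoted above. The snippet is plain arithmetic on those figures; the mode labels and variable names are chosen here for illustration.

```python
# Latency in milliseconds, as quoted in the text: (baseline Whisper, with MambAdapter).
latency_ms = {
    "streaming": (13.03, 17.66),
    "mid": (28.58, 31.62),
    "batch": (100.66, 110.21),
}

# Relative latency overhead of the adapter in percent of the baseline.
overhead_pct = {
    mode: 100 * (adapted - base) / base
    for mode, (base, adapted) in latency_ms.items()
}

for mode, pct in overhead_pct.items():
    print(f"{mode}: {pct:.1f}% overhead")
```

The overhead is roughly 35% in streaming but only about 10% in the mid and batch modes, which matches the paper's conclusion that the method fits offline and long-form processing better than strict low-latency streaming.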
5. Broader Mamba-adapter design space
Although the exact name “MambAdapter” is specific to speech and audio, a broader adapter-like design space has emerged around Mamba. One major branch uses Mamba modules as trainable insertions into frozen or mostly frozen vision backbones. “Tri-Plane Mamba” adapts a pretrained SAM ViT-B image encoder for 3D CT segmentation by combining LoRA inside self-attention with TP-Mamba adapters attached after the MLP/norm pathway in each ViT block; with only three BTCV CT training samples, it reports a Dice score up to 12% higher than conventional 3D segmentation networks, reaching 65.8 average Dice in the 12% data setting (Wang et al., 2024). A later hybrid SAM study makes the adapter interpretation even more explicit: its “3D Adapter TP-Mamba-SAM” inserts TP-Mamba modules after each MSA and MLP block within a frozen SAM encoder, optionally with LoRA on query, key, and value projections, and reports 0.880 mean Dice for the TP_MFGC variant at 4.77 FPS on ACDC (Shahraki et al., 31 Jan 2026).
A second branch uses Mamba as an orthogonal or sidecar fusion adapter. PMA, the “Point Mamba Adapter,” freezes pretrained point-cloud backbones, extracts token features from all layers, dynamically orders them with a geometry-constrained gate prompt generator, and fuses them with Mamba. On PointGPT-L it uses 4.9M trainable parameters instead of 360.5M for full fine-tuning, a 99% reduction, while reaching 95.18 on ScanObjectNN PB-T50-RS (Zha et al., 27 May 2025). In visual recognition, “Mamba-Adaptor” inserts Adaptor-T into hidden-state computation and Adaptor-S into reshaped 2D outputs, and explores three roles: a full visual backbone, a booster for pretrained backbones, and a transfer-learning module; its ImageNet and COCO results position it as a general-purpose adaptor for visual Mamba models (Xie et al., 19 May 2025).
A third branch uses Mamba as a multimodal or task head on top of frozen encoders. M4Survive keeps medical foundation models fixed as embedding extractors, projects modality-specific embeddings with small MLP encoders into a shared latent space, and applies a Mamba-based adapter for multimodal fusion and Cox survival prediction, achieving 81.27 ± 0.56 c-index for the best multimodal configuration (Lee et al., 13 Mar 2025). TacMamba similarly uses Mamba as a plug-and-play tactile history compressor between a 100 Hz tactile stream and a slower visual-language-action policy, with 0.45 ms inference latency and 88.89% tactile classification accuracy (Wang et al., 2 Mar 2026). These examples suggest a recurring pattern: Mamba can serve not only as a backbone replacement, but also as a compact recurrent adaptor at interfaces where temporal compression, cross-modal fusion, or long-context summarization is required.
Not all Mamba-integrated architectures belong to this adapter family. AdaMamba, for example, is best understood as a full forecasting architecture in which Mamba is a contextual sublayer inside a larger normalization–decomposition–expert-routing pipeline, not a classic parameter-efficient adapter (Jeon, 7 Dec 2025). Likewise, TransMamba is a unified switching backbone with a Memory Converter at Transformer-to-Mamba transition points rather than a small inserted adapter (Li et al., 31 Mar 2025). These distinctions are useful because they prevent the term from collapsing into a synonym for any Transformer–Mamba hybrid.
6. Distinctions, misconceptions, and limitations
A frequent misconception is to equate any Mamba-augmented module with MambAdapter. The exact MambAdapter method is a PETL design for speech and audio: frozen Transformer backbone, low-rank bottleneck, lightweight Mamba in the bottleneck, and shared projections across layers (Ali et al., 14 Jun 2026). Related works may be “adapter-like in spirit,” but they differ structurally. PMA is an orthogonal cross-layer fusion branch rather than an inserted bottleneck adapter (Zha et al., 27 May 2025). TP-Mamba-SAM is a residual sequential bottleneck-style adapter for frozen SAM, specialized for pseudo-3D or tri-plane volumetric processing rather than general PETL (Shahraki et al., 31 Jan 2026). Mamba-Adaptor for vision modifies hidden-state access and 2D spatial aggregation inside visual SSM computation rather than replacing a bottleneck transform (Xie et al., 19 May 2025).
A second misconception is that Mamba-based adapters are inherently the most efficient option under all conditions. The exact MambAdapter results do not support that conclusion. Full fine-tuning remains strongest overall for Whisper ASR, Conformer can surpass MambAdapter when parameter budgets are larger than about 600k, and MambAdapter incurs the largest latency overhead in strict streaming despite minimal memory overhead (Ali et al., 14 Jun 2026). This suggests that its main operating point is reduced-parameter transfer learning rather than universal dominance.
Another limitation is reproducibility granularity. Several papers in this area provide strong architectural descriptions but incomplete low-level implementation detail. The original MambAdapter paper omits optimizer, scheduler, batch size, epoch count, and several internal activation and normalization choices from the main text (Ali et al., 14 Jun 2026). M4Survive gives only a high-level Mamba recurrence and leaves many architectural details unspecified (Lee et al., 13 Mar 2025). TacMamba provides the recurrent systems picture but not all latent dimensions or fusion details (Wang et al., 2 Mar 2026). This pattern suggests that the conceptual design space is clearer than the exact implementation conventions across domains.
A plausible implication is that “MambAdapter” now names both a specific method and a broader architectural idiom: the use of compact Mamba modules as trainable, often parameter-efficient adaptors attached to frozen or nearly frozen systems. The method’s original form remains the most precise reference point, but the surrounding literature shows that the adaptor role of Mamba has expanded into vision, 3D medical imaging, multimodal survival modeling, tactile–VLA bridging, and point-cloud understanding (Ali et al., 14 Jun 2026).