TransMamba: Hybrid SSM-Transformer Model
- TransMamba is a family of hybrid architectures that combine linear-complexity state-space (Mamba) layers with Transformer attention for dynamic, long-sequence modeling.
- It employs interleaved blocks and unified parameterizations, enabling tokens to switch between attention and SSM modes via scheduled TransPoints for optimized performance.
- Empirical results show that TransMamba achieves state-of-the-art accuracy and throughput across language, vision, and multimodal tasks while reducing training costs.
TransMamba refers to a family of hybrid architectures that tightly integrate linear-complexity State-Space Model (SSM) layers—specifically, the Mamba variant—with Transformer-style attention mechanisms. These models have been instantiated for sequence modeling in language, vision, multimodal, and tabular domains. The unifying goal is to combine the non-local, content-adaptive contextual modeling capabilities of self-attention with the memory- and compute-efficiency, selective long-sequence modeling, and dynamic memory control of state-space models. Recent research demonstrates that TransMamba designs excel in both accuracy and throughput for extremely long sequences and complex reasoning tasks, with major deployments including the 56B-parameter Hunyuan-TurboS LLM and SOTA results in vision and multimodal adaptation (Team et al., 21 May 2025, Chen et al., 21 Feb 2025, Li et al., 31 Mar 2025, Lou et al., 22 Jul 2025).
1. Architectural Foundations and Core Design Patterns
TransMamba hybridizes Transformer attention and Mamba SSM layers at various structural granularities: interleaved sublayer block patterns, dual-branch designs, and deeply fused parameterizations. For example, in large foundation models like Hunyuan-TurboS, the 128 total layers are composed of two atomic block types: AMF blocks (Attention → Mamba2 → MoE feed-forward) and MF blocks (Mamba2 → MoE feed-forward).
These blocks are interleaved (AMF, MF, AMF, MF, etc.), resulting in only 7 attention layers (5.5%), 57 Mamba2 layers (44.5%), and 64 MoE feed-forward layers (50%) out of 128 (Team et al., 21 May 2025). The rationale is to maximize expressivity where context matters most (via attention) while exploiting Mamba's linear scaling for the remainder.
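The block-to-sublayer expansion can be sketched as follows; the block names, sublayer labels, and the helper function are illustrative stand-ins, since the report specifies only the block types, their ordering, and the resulting layer counts.

```python
# Illustrative sketch: expanding a block-level pattern such as ["AMF", "MF", ...]
# into a flat sublayer stack. Block names and the helper are hypothetical.

def expand_blocks(pattern: list[str]) -> list[str]:
    block_defs = {
        "AMF": ["attention", "mamba2", "moe_ffn"],  # Attention -> Mamba2 -> MoE FFN
        "MF": ["mamba2", "moe_ffn"],                # Mamba2 -> MoE FFN
    }
    sublayers: list[str] = []
    for block in pattern:
        sublayers += block_defs[block]
    return sublayers

# A short interleaved prefix; the production model repeats such blocks until
# the reported 128-sublayer budget is reached.
print(expand_blocks(["AMF", "MF", "AMF", "MF"]))
```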
Other settings employ unified parameterization: a single set of matrices used for both Transformer QKV projections and SSM CBx blocks, enabling tokens to dynamically transition between "Transformer mode" and "Mamba mode" per layer and position (Li et al., 31 Mar 2025). The switching is typically handled at "TransPoints"—scheduled layer/token boundaries where memory is losslessly converted from attention to SSM state using an explicit Memory Converter.
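The control flow of such a TransPoint switch can be illustrated with a toy, decay-free linear-attention/SSM pair, in which the conversion is exact. The dimensions, shared random projections, and the outer-product converter below are simplifications chosen for illustration, not the closed-form mapping derived in the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 12, 8                      # sequence length, model width
trans_point = 6                   # position at which the mode switches

# Shared projections: the same weights serve as Q/K/V in attention mode
# and as C/B/x projections in SSM mode (unified parameterization).
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
x = rng.standard_normal((L, d))

# --- Attention mode for tokens before the TransPoint (unnormalized, causal) ---
Q, K, V = x @ W_q, x @ W_k, x @ W_v
y = np.zeros((L, d))
for t in range(trans_point):
    y[t] = Q[t] @ (K[: t + 1].T @ V[: t + 1])      # sum_{s<=t} (q_t . k_s) v_s

# --- Memory conversion: fold the cached K/V prefix into a recurrent state ---
S = K[:trans_point].T @ V[:trans_point]            # (d, d) summary state

# --- SSM mode for the remaining tokens (decay fixed to 1 for simplicity) ---
for t in range(trans_point, L):
    S = S + np.outer(K[t], V[t])                   # state update with B_t ~ k_t
    y[t] = Q[t] @ S                                # readout with C_t ~ q_t

print(y.shape)  # (12, 8)
```

In this simplified setting the recurrent suffix reproduces exactly what unnormalized linear attention over the full sequence would compute, which is the sense in which the conversion is lossless.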
For tabular models like FT-Mamba, all Transformer layers of the backbone are simply replaced by Mamba SSM layers, yielding architectures with linear runtime and memory in the number of tokenized features (Starnes et al., 2024).
2. Mathematical Formulation and Layer Mechanics
The core mechanics of TransMamba rely on established formulations for both Transformer attention and Mamba SSM computations:
- Self-Attention operates as
  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
  with $O(L^2)$ complexity for $L$ tokens.
- Mamba SSM layers use input-dependent discretization:
  $$\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t = \Delta_t B_t,$$
  $$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t.$$
  Here, $A$, $B$, $C$ are learnable matrices, with $\Delta_t$, $B_t$, $C_t$ content-dependent (computed from the input $x_t$). This yields $O(L)$ compute/memory. When unrolled, this is a dynamic causal convolution with history-dependent kernels (Li et al., 31 Mar 2025, Zou et al., 2024); a minimal sketch of this recurrence (together with top-$k$ expert routing) follows this list.
- Unified Parameterization in some variants binds QKV and CBx via shared weights. Switching from attention to SSM at a TransPoint involves converting the accumulated attention keys and values ($K$, $V$) to the equivalent SSM hidden state using a closed-form linear mapping involving the SSM's transition kernel.
- Mixture-of-Experts (MoE) FFN modules activate only a subset of specialized experts per token:
  $$y = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x),$$
  where $g(x)$ are the router's gate scores and $E_i$ the expert FFNs.
For Hunyuan-TurboS, 1 shared and 32 specialized experts per layer are used, with only 3 active per token, which keeps the activated parameter footprint small (Team et al., 21 May 2025).
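The following NumPy sketch makes the two formulations above concrete: a per-token selective scan with input-dependent $\Delta_t$, $B_t$, $C_t$, and a top-$k$ router that mixes a few expert outputs per token. Dimensions, the softplus discretization, the single input channel, and the expert definitions are illustrative assumptions rather than the configuration of any cited model.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_in, d_state = 16, 4, 8                      # tokens, input width, SSM state size

A = -np.diag(rng.uniform(0.5, 1.5, d_state))     # stable (negative-diagonal) transition
x = rng.standard_normal((L, d_in))

# Content-dependent parameters: Delta_t, B_t, C_t are simple projections of x_t.
W_delta = rng.standard_normal(d_in) * 0.1
W_B = rng.standard_normal((d_in, d_state)) * 0.1
W_C = rng.standard_normal((d_in, d_state)) * 0.1

def selective_ssm(x: np.ndarray) -> np.ndarray:
    """h_t = exp(Delta_t A) h_{t-1} + Delta_t B_t u_t,  y_t = C_t h_t  (single input channel u_t)."""
    h = np.zeros(d_state)
    ys = []
    for x_t in x:
        delta_t = np.log1p(np.exp(W_delta @ x_t))    # softplus keeps the step size positive
        B_t, C_t = x_t @ W_B, x_t @ W_C
        A_bar = np.exp(np.diag(A) * delta_t)         # diagonal A => elementwise discretization
        h = A_bar * h + delta_t * B_t * x_t.sum()    # x_t collapsed to one channel for brevity
        ys.append(C_t @ h)
    return np.array(ys)                              # O(L) time, constant-size state

def topk_moe(token: np.ndarray, experts: list, gate_W: np.ndarray, k: int = 3) -> np.ndarray:
    """Route one token to its top-k experts and mix their outputs by normalized gate weights."""
    scores = gate_W @ token
    top = np.argsort(scores)[-k:]
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return sum(w * experts[i](token) for w, i in zip(weights, top))

experts = [lambda t, M=rng.standard_normal((d_in, d_in)) * 0.1: M @ t for _ in range(8)]
gate_W = rng.standard_normal((8, d_in))
print(selective_ssm(x).shape, topk_moe(x[0], experts, gate_W).shape)   # (16,) (4,)
```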
3. Adaptive and Flexible Inference Strategies
TransMamba models often incorporate adaptive execution control—most notably the Adaptive Long-Short Chain-of-Thought (CoT) mechanism in Hunyuan-TurboS:
- A gating network computes a scalar score $p \in [0, 1]$ indicating task complexity.
- If $p$ falls below a threshold, the model executes a short CoT (single pass); otherwise, a long CoT (multi-step, deeper computation).
- Reinforcement learning penalizes unnecessary computation, encouraging minimal reasoning depth for easy tasks while allowing full depth when necessary.
This dynamic allocation of depth and effort enables significant computational savings without sacrificing accuracy in response generation (Team et al., 21 May 2025); a minimal sketch of the dispatch logic is given below.
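In the sketch, the sigmoid gate, the threshold value, and the short/long response stubs are hypothetical placeholders for the learned components described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
gate_w = rng.standard_normal(d) * 0.1            # placeholder for a learned gating network

def complexity_score(prompt_embedding: np.ndarray) -> float:
    """Scalar in (0, 1) estimating task complexity (sigmoid of a linear gate)."""
    return float(1.0 / (1.0 + np.exp(-gate_w @ prompt_embedding)))

def respond(prompt_embedding: np.ndarray, threshold: float = 0.5) -> str:
    p = complexity_score(prompt_embedding)
    if p < threshold:
        return f"short-CoT answer (p={p:.2f}, single pass)"
    return f"long-CoT answer (p={p:.2f}, multi-step reasoning)"

print(respond(rng.standard_normal(d)))
```

During training, a reward that subtracts a cost whenever the long branch fires would push the gate toward the cheaper path on easy inputs, consistent with the RL penalty described above.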
For switching between attention and SSM within a layer, TransPoint schedules may be static (fixed per layer) or dynamic (varying across layers, heuristically or learnably), trading off representational smoothness, computation, and sequence coverage (Li et al., 31 Mar 2025).
4. Transfer, Adaptation, and Efficient Knowledge Distillation
TransMamba’s framework enables not only architectural fusion but also efficient transfer from existing Transformer models using universal adaptation techniques (Chen et al., 21 Feb 2025):
- Feature Calibration: Student Mamba model’s features are projected/aligned into the Transformer teacher’s latent space using MLPs and zero-padding.
- Weight Subcloning and Adaptive Bidirectional Distillation (WSAB): Linear/MLP weights (excluding SSM kernels) are cloned from the teacher; bidirectional layer-wise cosine similarity with adaptive weighting distills knowledge separately for forward and backward SSM passes.
- Cross-Mamba Module: Enhances cross-modal (image–text) fusion by injecting language-aware representations into the SSM pipeline via cross-attention pooling.
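A schematic of the bidirectional distillation objective is sketched below: student features from the forward and backward SSM passes are calibrated into the teacher's latent space and compared via token-wise cosine similarity, with the two directional losses combined by adaptive weights. Shapes, the linear calibration, and the specific weighting rule are illustrative assumptions, not the exact recipe of the cited method.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_s, d_t = 32, 64, 96          # tokens, student width, teacher width

# Feature calibration: project student features into the teacher's latent space.
W_cal = rng.standard_normal((d_s, d_t)) * 0.05

def cosine_distill(student_feats: np.ndarray, teacher_feats: np.ndarray) -> float:
    """1 - mean token-wise cosine similarity between calibrated student and teacher features."""
    s = student_feats @ W_cal
    s /= np.linalg.norm(s, axis=-1, keepdims=True) + 1e-8
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + 1e-8)
    return float(1.0 - (s * t).sum(-1).mean())

# Separate losses for the forward and backward SSM passes, combined adaptively
# (here, the harder direction gets proportionally more weight; illustrative rule).
fwd_s, bwd_s = rng.standard_normal((L, d_s)), rng.standard_normal((L, d_s))
teacher = rng.standard_normal((L, d_t))
l_f, l_b = cosine_distill(fwd_s, teacher), cosine_distill(bwd_s, teacher)
w_f, w_b = l_f / (l_f + l_b), l_b / (l_f + l_b)
loss = w_f * l_f + w_b * l_b
print(round(loss, 4))
```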
This cross-architecture knowledge transfer cuts training data requirements by as much as 2× while matching or exceeding accuracy on image classification, VQA, and text–video retrieval (Chen et al., 21 Feb 2025). It facilitates rapid deployment of efficient Mamba-based models without full retraining.
5. Empirical Performance, Benchmarking, and Cost Analysis
TransMamba architectures consistently achieve or surpass state-of-the-art results across language, vision, and multimodal tasks:
| Task/Domain | Model / Setting | Key Metric(s) | Result(s) | Source |
|---|---|---|---|---|
| LLM Reasoning | Hunyuan-TurboS (56B) | Chatbot Arena, avg. top-7 | 1356 (beating Gemini-2.0-Flash-001) | (Team et al., 21 May 2025) |
| Token Efficiency | Hunyuan-TurboS (56B) | Avg. output length | ~1207.8 tokens/output (≈40% of rivals) | (Team et al., 21 May 2025) |
| ImageNet Transfer | TransMamba Adaptation | Top-1 Accuracy | PMamba-T +2.62%, ViM-T +5.12% | (Chen et al., 21 Feb 2025) |
| VQA | Trans-LLaVA Mamba | GQA/VQA/VizWiz/SQA avg. | 49.4% (matches or beats LLaVA-3.2B-1B) | (Chen et al., 21 Feb 2025) |
| Vision (classification) | A2Mamba-L (95M) | ImageNet-1K Top-1 | 86.1% | (Lou et al., 22 Jul 2025) |
| Vision (detection/segmentation) | A2Mamba-L (95M) | COCO APb/APm | 55.6 / 48.1 | (Lou et al., 22 Jul 2025) |
| Point Clouds | PoinTramba | ScanObjectNN PB-T50-RS | 89.1% (↑6.6% over PointMamba) | (Wang et al., 2024) |
| Super-Resolution | T-PMambaSR | Urban100 (×2) PSNR/SSIM | 33.17 dB / 0.9371 (low params/FLOPs) | (Guo et al., 5 Nov 2025) |
Ablation studies confirm that each architectural component—including interleaved block patterns, BIO ordering in point clouds, and MoE routing—contributes to either inference efficiency, representational power, or both.
6. Theoretical Insights and Comparative Discussion
TransMamba architectures leverage complementary mathematical properties of Transformers and SSMs:
- Transformer attention efficiently models arbitrary pairwise interactions but suffers $O(L^2)$ scaling in sequence length.
- Mamba SSM layers offer $O(L)$ per-layer cost and selective, indefinite memory via content-dependent recurrence gates, but may underperform on high-frequency or local-context tasks.
- Unified hybrid architectures benefit from theoretical kernel duality: attention can be viewed as a special SSM instance, embedding both in a shared RKHS (Zou et al., 2024). Combined models thus simultaneously access local nonlinearity and global recurrence.
Analyses show that such hybrids achieve 2–4× memory and runtime reductions on long sequences while preserving accuracy. In LLMs, a 3B-param TransMamba matches a 6B Transformer on 16k tokens while doubling tokens/sec throughput (Zou et al., 2024, Li et al., 31 Mar 2025).
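The attention-as-SSM duality noted above can be checked numerically in its simplest (unnormalized, decay-free) form, which also underlies the memory-conversion sketch in Section 1: causal linear attention computed as a masked quadratic matmul coincides with a recurrent scan over outer-product state updates. This is a toy identity, not the full kernel-space argument of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 6
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Quadratic form: causal (masked) linear attention, O(L^2) time.
mask = np.tril(np.ones((L, L)))
y_attn = (mask * (Q @ K.T)) @ V

# Recurrent form: the same map as a state-space scan, O(L) time with a (d, d) state.
S = np.zeros((d, d))
y_ssm = np.zeros((L, d))
for t in range(L):
    S = S + np.outer(K[t], V[t])     # identity transition, no decay
    y_ssm[t] = Q[t] @ S

print(np.allclose(y_attn, y_ssm))    # True: this attention variant is a special SSM instance
```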
7. Broader Impact, Limitations, and Research Directions
TransMamba establishes a new paradigm for large-scale, cost-effective sequence models capable of extreme context lengths (≥256k tokens), competitive reasoning, and efficient adaptation to new modalities and tasks (Team et al., 21 May 2025, Chen et al., 21 Feb 2025). Its efficacy is validated across LLMs, vision backbones, super-resolution, point cloud processing, recommendation, and multimodal fusion.
Current limitations include potential training instability in pure SSM regimes, open challenges for optimal mixing schedules at scale (e.g., learnable TransPoints), and the need for architectures and prompts specifically tailored to SSMs for low-shot transfer (Misra et al., 2024). Future research aims to develop generalized adapters for SSMs, explore theoretical underpinnings of hybrid kernel spaces, and expand to new modalities and scale laws.
TransMamba’s architectural and algorithmic advances have catalyzed the practical deployment of state-space models in industrial LLMs and beyond, marking a significant advance in scalable, flexible, and efficient deep sequence modeling.