Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Published 5 May 2026 in cs.CV | (2605.03438v1)

Abstract: Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, full fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that our Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.


Authors (4)

Summary

The paper introduces Mantis, a Mamba-native framework that uses state-aware adapters (SAA) to inject task-conditioned control signals for dynamic modulation of state transitions.
The paper presents Dual-Serialization Consistency Distillation (DSCD) to align features across varied serializations, stabilizing training and reducing sensitivity to input order.
The paper demonstrates that Mantis attains comparable or improved accuracy over full fine-tuning with only ~5% of parameters, validated on challenging 3D point cloud benchmarks.

Mamba-native Parameter-Efficient Fine-tuning for 3D Point Cloud Models: An Analysis of "Mantis" (2605.03438)

Motivation and Problem Setting

Large-scale 3D point cloud foundation models (PFMs) are proliferating rapidly and must be transferred to diverse downstream tasks, yet full fine-tuning for each task is computationally expensive and storage-intensive. Existing parameter-efficient fine-tuning (PEFT) paradigms, prominent in NLP and 2D vision, predominantly target Transformer-based backbones and operate at the token or feature level. Meanwhile, Mamba-based architectures, which use selective State Space Models (SSMs), have recently demonstrated strong efficiency advantages for point cloud processing owing to their linear time complexity and content-aware state transitions.

A fundamental granularity mismatch arises when Transformer-native PEFT strategies are applied directly to frozen Mamba backbones. Transformer PEFT modules, designed around token-level prompts or feature transformations, are not naturally aligned with the state-level sequence dynamics of Mamba. This mismatch causes substantial degradation in downstream accuracy and unstable training, especially in the challenging, order-sensitive scenarios typical of real-world point clouds.

Mantis: The Mamba-native PEFT Solution

"Mantis" introduces a two-pronged framework tailor-made for efficient, robust adaptation of Mamba-based PFMs in 3D vision:

State-Aware Adapter (SAA):

SAA operates directly in the state space, injecting task-conditioned, input-dependent control signals into the selective SSM operators ($A_t$, $B_t$, $C_t$), enabling dynamic modulation of state transitions without altering the frozen backbone weights. The control signals are constructed via a soft-thresholding proximal optimization scheme that keeps the perturbation of the state-space operators sparse and low-rank, which improves adaptation efficiency and stability over static weight-level approaches (e.g., LoRA) and token- or feature-level approaches.
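
The idea of state-level control can be illustrated with a toy diagonal selective scan. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the simplified discretization, the hand-set control signals (which a learned adapter would normally produce), and the threshold `lam` are all hypothetical. Only the sparsifying soft-threshold step is shown; the low-rank structure is omitted.

```python
import math

def soft_threshold(v, lam):
    """Proximal operator of lam * |v|: shrinks v toward zero, zeroing small values."""
    return math.copysign(max(abs(v) - lam, 0.0), v)

def selective_scan(xs, deltas, a, bs, cs, ctrl=None, lam=0.1):
    """Toy diagonal selective SSM over a scalar input sequence.

    a is the frozen diagonal state matrix; deltas, bs, cs are the frozen
    per-step (input-dependent) parameters. ctrl optionally holds raw per-step
    control signals (u_delta, u_b, u_c) from a hypothetical adapter.
    """
    n = len(a)
    h = [0.0] * n
    ys = []
    for t, x in enumerate(xs):
        delta, b, c = deltas[t], list(bs[t]), list(cs[t])
        if ctrl is not None:
            u_delta, u_b, u_c = ctrl[t]
            # Sparsify the raw control signals before they touch the operators.
            delta = max(delta + soft_threshold(u_delta, lam), 1e-6)
            b = [bi + soft_threshold(ui, lam) for bi, ui in zip(b, u_b)]
            c = [ci + soft_threshold(ui, lam) for ci, ui in zip(c, u_c)]
        # Simplified discretization: A_t = exp(delta * a), B_t = delta * b.
        h = [math.exp(delta * a[i]) * h[i] + delta * b[i] * x for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys

# Hypothetical frozen parameters for a 2-dimensional state and 3 time steps.
a = [-1.0, -0.5]
xs = [1.0, 0.5, -0.3]
deltas = [0.1, 0.2, 0.1]
bs = [[1.0, 0.5]] * 3
cs = [[0.3, 0.7]] * 3

frozen = selective_scan(xs, deltas, a, bs, cs)
weak = [(0.05, [0.05, -0.05], [0.0, 0.02])] * 3    # every entry is below lam
strong = [(0.0, [0.6, 0.0], [0.0, 0.0])] * 3       # one entry survives the threshold
same = selective_scan(xs, deltas, a, bs, cs, ctrl=weak)
adapted = selective_scan(xs, deltas, a, bs, cs, ctrl=strong)
```

In this sketch, control signals below the threshold leave the frozen dynamics untouched (`same` matches `frozen`), while a larger signal changes the state update itself rather than the input tokens or the stored weights.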

Dual-Serialization Consistency Distillation (DSCD):

To address serialization-induced instability in Mamba backbones, DSCD regularizes feature and prediction consistency across two valid serializations of each input point cloud (e.g., Hilbert and Trans-Hilbert curves). By enforcing strong cross-order alignment at the representation and decision levels, DSCD mitigates serialization sensitivity and stabilizes optimization, which matters most on cluttered real-world data.
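
A consistency term of this kind might look like the sketch below. The exact loss is not given in this summary, so the choices here are assumptions: mean squared error for the feature-level term, symmetric KL divergence for the prediction-level term, and the weights `w_feat` and `w_pred`. A distillation setup may also treat one branch as a fixed teacher, which this sketch does not model.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_serialization_consistency(feat_a, feat_b, logits_a, logits_b,
                                   w_feat=1.0, w_pred=1.0, temperature=1.0):
    """Penalize disagreement between two serializations of the same point cloud.

    feat_a / feat_b: pooled features from serialization A and B.
    logits_a / logits_b: class logits from serialization A and B.
    """
    feat_term = sum((x - y) ** 2 for x, y in zip(feat_a, feat_b)) / len(feat_a)
    p = softmax(logits_a, temperature)
    q = softmax(logits_b, temperature)
    pred_term = 0.5 * (kl(p, q) + kl(q, p))  # symmetric KL
    return w_feat * feat_term + w_pred * pred_term

# Hypothetical outputs of the same cloud under two serializations.
loss_same = dual_serialization_consistency([1.0, 2.0], [1.0, 2.0], [0.5, 1.5], [0.5, 1.5])
loss_ab = dual_serialization_consistency([1.0, 2.0], [1.2, 1.7], [0.5, 1.5], [1.0, 0.8])
loss_ba = dual_serialization_consistency([1.2, 1.7], [1.0, 2.0], [1.0, 0.8], [0.5, 1.5])
```

The term is zero when both orders yield identical features and predictions and grows as they diverge; it would be added to the task loss during adaptation.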

Empirical Results

Extensive experiments on challenging benchmarks (ScanObjectNN [36], ModelNet40 [45], ShapeNetPart [46]) using SSM-based backbones (PointMamba [23], Mamba3D [12], ZigzagPointMamba [7]) reveal:

Parameter Efficiency and Performance:

Mantis trains only ~5% of the parameters that full fine-tuning updates, yet matches or exceeds full fine-tuning baselines. On ScanObjectNN PB_T50_RS (the most challenging split), Mantis gains +0.65% (PointMamba), +1.43% (Mamba3D), and +1.46% (ZigzagPointMamba) over full fine-tuning, reaching 93.48% with only 0.8M trainable parameters (versus 16.9M for full fine-tuning).
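
As a quick check, the quoted parameter counts are consistent with the "~5%" figure. Both counts are rounded as reported, so the ratios below are approximate:

```latex
\frac{0.8\,\mathrm{M}}{16.9\,\mathrm{M}} \approx 0.047 = 4.7\%,
\qquad
\frac{16.9\,\mathrm{M}}{0.8\,\mathrm{M}} \approx 21
```

Assuming the frozen backbone is shared and only the trainable parameters are stored per task, each additional task would cost roughly one twenty-first of a full fine-tuned copy.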

Stability and Robustness:

Training curves and ablations demonstrate that Mantis avoids the unstable optimization and convergence issues observed when prompt-, adapter-, and LoRA-based methods are transferred from Transformers. Cross-serialization alignment directly reduces representation- and prediction-level discrepancies, with stronger effects on heavily perturbed or noisy datasets.

Task-Generality:

Mantis achieves state-of-the-art or near state-of-the-art results in few-shot classification and part segmentation, outperforming existing PEFT methods and maintaining segmentation fidelity on complex geometric details. Its low compute cost (linear in sequence length, with negligible parameter and memory overhead) carries over to large-scale scene semantic segmentation on S3DIS.

Theoretical and Algorithmic Contributions

State-level Adaptation:

Formal analysis shows SAA induces controlled low-rank perturbations to the selective transfer kernel, guaranteeing bounded deviation in hidden state evolution, thereby reconciling modifiability and backbone stability.
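
To see why bounded operator perturbations imply bounded state deviation, consider a generic linear recurrence. This is a standard perturbation argument under assumed norm bounds, not the paper's own theorem; the symbols $E_t$, $F_t$, $\rho$, $\varepsilon_A$, $\varepsilon_B$, $H$, and $X$ are introduced here for illustration. Assume $\|A_t + E_t\| \le \rho < 1$, $\|E_t\| \le \varepsilon_A$, $\|F_t\| \le \varepsilon_B$, $\|h_t\| \le H$, and $\|x_t\| \le X$ for all $t$:

```latex
\begin{aligned}
h_t &= A_t h_{t-1} + B_t x_t, \qquad
\tilde{h}_t = (A_t + E_t)\,\tilde{h}_{t-1} + (B_t + F_t)\,x_t, \qquad \tilde{h}_0 = h_0,\\
e_t &:= \tilde{h}_t - h_t = (A_t + E_t)\,e_{t-1} + E_t h_{t-1} + F_t x_t,\\
\|e_t\| &\le \rho\,\|e_{t-1}\| + \varepsilon_A H + \varepsilon_B X
\quad\Longrightarrow\quad
\|e_t\| \le \frac{\varepsilon_A H + \varepsilon_B X}{1-\rho}\quad\text{for all } t.
\end{aligned}
```

The final bound follows from $e_0 = 0$ and summing the geometric series. Under these assumptions the deviation scales linearly with the perturbation sizes and does not accumulate with sequence length.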

Serialization Sensitivity:

The order-dependence of SSMs leads to variable transfer kernels per valid serialization. Mantis’s DSCD regularizes this, ensuring the geometry-anchored priors of the frozen model are preserved under downstream adaptation.
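
Order dependence is easy to reproduce with a toy recurrence. The sketch below substitutes a Z-order (Morton) curve for the Hilbert curve to stay short, and treats an axis swap as the "transposed" variant; both choices, along with all function names and values, are illustrative assumptions rather than the paper's serialization code.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of integer grid coordinates into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def serialize(points, transposed=False, bits=10):
    """Return point indices sorted along a Z-order curve.

    Coordinates are assumed to lie in [0, 1]. transposed=True swaps the
    roles of x and y, giving a second valid ordering of the same points.
    """
    scale = (1 << bits) - 1
    def key(i):
        x, y, z = points[i]
        ix, iy, iz = int(x * scale), int(y * scale), int(z * scale)
        if transposed:
            ix, iy = iy, ix
        return morton_key(ix, iy, iz, bits)
    return sorted(range(len(points)), key=key)

def decayed_sum(values, decay=0.5):
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t; returns the final state."""
    h = 0.0
    for v in values:
        h = decay * h + v
    return h

# Hypothetical points and scalar per-point features.
points = [(0.1, 0.9, 0.0), (0.9, 0.1, 0.0), (0.5, 0.5, 0.0), (0.2, 0.2, 0.0)]
feats = [1.0, 2.0, 3.0, 4.0]

order_a = serialize(points)
order_b = serialize(points, transposed=True)
state_a = decayed_sum([feats[i] for i in order_a])
state_b = decayed_sum([feats[i] for i in order_b])
```

Both orderings visit the same points, yet the recurrence ends in different states because later elements are weighted more heavily. This is the kind of serialization-dependent discrepancy that a consistency regularizer targets.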

Efficiency:

Both parameter and compute overheads are provably marginal compared to full fine-tuning—the SAA requires only bottlenecked projections and low-rank modulation, and the Mamba backbone maintains linear scan speed.

Limitations and Future Prospects

Mantis depends on manually designed configurations (inserted layers, serialization schemes) and is currently limited to single-modal (geometry-only) inputs. Extension to multimodal 3D representation learning (point clouds + language or images), automatic search for optimal SAA configuration/topology, and investigation of explicit backbone adaptation in dense, open-world scene understanding are promising avenues.

Practically, Mantis enables deployment of powerful 3D PFMs in storage- and compute-constrained settings. Conceptually, it sets a design pattern for future Mamba-native (or, more generally, SSM-native) transfer algorithms by emphasizing state-wise rather than token-wise adaptation.

Conclusion

Mantis establishes the first parameter-efficient fine-tuning framework natively aligned with Mamba-based 3D point cloud foundation models. By bridging the granularity gap with state-aware adaptation and serialization consistency, it demonstrates that highly efficient, robust transfer learning is achievable without sacrificing accuracy or stability. This work both advances the practical state-of-the-art and informs the theoretical design of future PEFT for sequence models in 3D vision (2605.03438).
