An Empirical Study of Mamba-based Pedestrian Attribute Recognition
The paper presents a comprehensive investigation into leveraging the Mamba architecture for Pedestrian Attribute Recognition (PAR), aiming to balance accuracy and computational efficiency. Existing Transformer-based models, though highly effective, incur significant computational overhead because self-attention scales quadratically with input sequence length. Mamba and its derivatives promise linear complexity, presenting an opportunity to cut computational cost without compromising performance.
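For context on where that efficiency gap comes from (these are the standard formulations of attention and state space models, not results specific to this paper): self-attention materializes an N x N score matrix over the token sequence, while a Mamba-style recurrence updates a fixed-size hidden state token by token.

```latex
% Self-attention over N tokens of dimension d:
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V
% The N x N score matrix QK^T costs O(N^2 d) time and O(N^2) memory.

% A discretized state space (Mamba-style) recurrence instead keeps a fixed-size state:
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% giving O(N d) time -- linear in sequence length.
```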
Core Contributions and Methodologies
The authors explore adapting the Mamba structure into two primary PAR frameworks: a text-image fusion approach and a pure vision Mamba multi-label recognition framework (sketched below). The empirical results suggest that using attribute tags as additional textual input benefits the Vim-based framework, while the VMamba-based framework sees no comparable gain.
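The pure vision variant follows standard multi-label PAR practice: a visual backbone produces a global feature, and a linear head emits one sigmoid logit per attribute. The sketch below is illustrative only; it assumes a generic Vim/VMamba-style backbone passed in as a placeholder `nn.Module` and is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiLabelPARHead(nn.Module):
    """Multi-label PAR head: one sigmoid logit per pedestrian attribute."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_attributes: int):
        super().__init__()
        self.backbone = backbone               # placeholder for a Vim/VMamba encoder
        self.classifier = nn.Linear(feat_dim, num_attributes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)          # assumed to return (B, feat_dim) global features
        return self.classifier(feats)          # (B, num_attributes) raw attribute logits

# Multi-label training uses per-attribute binary cross-entropy rather than softmax:
# criterion = nn.BCEWithLogitsLoss()
# loss = criterion(model(images), attribute_targets.float())
```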
The research also introduces several hybrid models that combine Mamba with Transformer networks (one such pattern is sketched below). This hybridization does not universally enhance performance but shows promise under particular settings.
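A common hybridization pattern is to interleave a Mamba-style token mixer with a standard self-attention layer. The sketch below uses a placeholder `mamba_block` module and PyTorch's built-in `nn.TransformerEncoderLayer`; it illustrates the kind of combination studied, under those assumptions, rather than the authors' specific hybrid designs.

```python
import torch.nn as nn

class HybridStage(nn.Module):
    """Illustrative hybrid block: a Mamba-style mixer followed by self-attention."""
    def __init__(self, mamba_block: nn.Module, dim: int, num_heads: int = 8):
        super().__init__()
        self.mamba = mamba_block  # placeholder for a Vim/VMamba block
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, tokens):           # tokens: (B, N, dim)
        tokens = self.mamba(tokens)      # linear-complexity global mixing
        return self.attn(tokens)         # quadratic-complexity refinement
```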
Experimental Outcomes
Across multiple datasets, including PA100K, PETA, RAP-V1, RAP-V2, and WIDER, the paper reports several notable findings:
- Pure Mamba-based vs. Hybrid Architectures: The VMamba-based frameworks perform comparably to, and in some configurations outperform, Transformer-based models such as ViT. In particular, VMamba outperforms Vim on datasets with diverse attributes.
- Integration Complexity: Incorporating Transformers into Mamba-based models does not uniformly improve outcomes; simply increasing architectural complexity does not guarantee better pedestrian attribute recognition. Some hybrid models even degraded performance, likely because of mismatches between the network architecture and dataset characteristics.
Implications and Future Directions
The research highlights the potential of Mamba architectures for applications that require efficient processing of visual data. Mamba's ability to maintain competitive performance with reduced resource requirements makes it attractive for deployment in resource-constrained environments.
Looking ahead, multi-modal fusion remains a challenge for Mamba. Although the architecture handles attribute recognition effectively, further work is needed to optimize it for multi-modal fusion and to achieve consistent improvements across modalities.
Conclusion
The paper paves the way for employing state space models such as Mamba in the PAR domain, showing that, with careful integration, they can match or surpass existing state-of-the-art models under favorable settings. It invites continued exploration of hybrid models and of how the linear complexity of state space models can be exploited to build more efficient solutions for vision tasks.