Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition (2505.06002v2)

Published 9 May 2025 in cs.CV

Abstract: Large-scale pre-trained models have achieved remarkable success in language and image tasks, leading an increasing number of studies to explore the application of pre-trained image models, such as CLIP, in the domain of few-shot action recognition (FSAR). However, current methods generally suffer from several problems: 1) Direct fine-tuning often undermines the generalization capability of the pre-trained model; 2) The exploration of task-specific information is insufficient in the visual tasks; 3) The semantic order information is typically overlooked during text modeling; 4) Existing cross-modal alignment techniques ignore the temporal coupling of multimodal information. To address these, we propose Task-Adapter++, a parameter-efficient dual adaptation method for both image and text encoders. Specifically, to make full use of the variations across different few-shot learning tasks, we design a task-specific adaptation for the image encoder so that the most discriminative information can be well noticed during feature extraction. Furthermore, we leverage LLMs to generate detailed sequential sub-action descriptions for each action class, and introduce semantic order adapters into the text encoder to effectively model the sequential relationships between these sub-actions. Finally, we develop an innovative fine-grained cross-modal alignment strategy that actively maps visual features to reside in the same temporal stage as semantic descriptions. Extensive experiments fully demonstrate the effectiveness and superiority of the proposed method, which achieves state-of-the-art performance on 5 benchmarks consistently. The code is open-sourced at https://github.com/Jaulin-Bage/Task-Adapter-pp.

Summary

Overview of Task-Adapter++: Efficient Few-shot Learning for Action Recognition

The paper "Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition" introduces an approach to few-shot action recognition (FSAR) that leverages large-scale pre-trained models such as CLIP. FSAR has gained traction because sufficient labeled video data is inherently hard to obtain. Applying pre-trained models to FSAR, while promising, has faced several obstacles: degraded generalization after direct fine-tuning, insufficient use of task-specific information, neglect of semantic order in text modeling, and poor temporal coupling of multimodal features.

Task-Adapter++ addresses these issues with a dual adaptation method that equips the image and text encoders with task-specific and semantic order adapters, respectively. This parameter-efficient framework strengthens FSAR by improving feature extraction during fine-tuning, enriching semantic modeling, and refining cross-modal alignment.

Key Methodologies

  1. Task-Specific Adaptation: The paper designs a task-specific adaptation for the image encoder that emphasizes the most discriminative information during feature extraction. By inserting adapters that reuse the frozen self-attention blocks to perform cross-video, task-specific attention, the method balances retention of pre-trained knowledge against task-specific discriminative patterns (see the first sketch after this list).
  2. Semantic Order Adaptation: Task-Adapter++ uses LLMs to generate detailed, sequential sub-action descriptions for each action class, capturing the temporal structure of an action. Semantic order adapters in the text encoder then model the sequential dependencies between these sub-actions, improving the contextual alignment between visual and textual representations (see the second sketch below).
  3. Fine-grained Cross-modal Alignment: Unlike previous methods that treat an action as a holistic unit, Task-Adapter++ maps visual features to the same temporal stage as the corresponding semantic descriptions, so similarity is computed stage by stage rather than once per video (see the third sketch below).

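To make the first mechanism concrete, below is a minimal PyTorch sketch of an adapter that routes the tokens of every video in a few-shot task through a frozen self-attention block, so the adapted features carry cross-video, task-level context. The class name, shapes, and bottleneck design are illustrative assumptions for exposition, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class TaskSpecificAdapter(nn.Module):
    """Residual bottleneck adapter that routes the tokens of *all* videos
    in a few-shot task through a frozen attention block, so the adapted
    features reflect task-level context (illustrative sketch)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # trainable down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # trainable up-projection
        nn.init.zeros_(self.up.weight)          # adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor,
                frozen_attn: nn.MultiheadAttention) -> torch.Tensor:
        # x: (N videos, T tokens, D). frozen_attn is a pre-trained block
        # created with batch_first=True whose parameters are assumed frozen
        # (requires_grad=False), per the parameter-efficient setup.
        n, t, d = x.shape
        task_tokens = x.reshape(1, n * t, d)    # one sequence spanning the task
        ctx, _ = frozen_attn(task_tokens, task_tokens, task_tokens,
                             need_weights=False)
        ctx = ctx.reshape(n, t, d)
        # Residual bottleneck leaves pre-trained features intact at init.
        return x + self.up(self.act(self.down(ctx)))
```

Only the adapter's two projections receive gradients during meta-training, which is what keeps this style of tuning parameter-efficient.
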
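For the second mechanism, one plausible way to make a text-side adapter order-aware is a causal attention mask, so each sub-action embedding attends only to itself and earlier sub-actions. The causal-mask choice here is an assumption made for illustration; the paper defines its own semantic order adapter.

```python
import torch
import torch.nn as nn

class SemanticOrderAdapter(nn.Module):
    """Residual adapter over the ordered sub-action embeddings of an
    action class; a causal mask injects order awareness (illustrative)."""

    def __init__(self, dim: int, bottleneck: int = 64, n_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.attn = nn.MultiheadAttention(bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, sub_actions: torch.Tensor) -> torch.Tensor:
        # sub_actions: (B classes, S sub-actions, D), in temporal order
        s = sub_actions.shape[1]
        # Boolean mask: True above the diagonal = positions a query may
        # NOT attend to, i.e. each sub-action sees only itself and earlier ones.
        causal = torch.triu(
            torch.ones(s, s, dtype=torch.bool, device=sub_actions.device),
            diagonal=1,
        )
        h = self.down(sub_actions)
        h, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return sub_actions + self.up(h)  # residual keeps pre-trained semantics
```
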
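Finally, the stage-wise matching idea behind the third mechanism can be sketched by pooling frames into as many contiguous stages as there are sub-action descriptions and scoring each stage against its own description. Uniform chunking is a simplifying assumption here; the paper's alignment strategy maps visual features to temporal stages in its own learned way.

```python
import torch
import torch.nn.functional as F

def stage_wise_similarity(frame_feats: torch.Tensor,
                          sub_action_feats: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity between temporal stage i of a video and
    sub-action description i (illustrative uniform-chunk version).

    frame_feats:      (T, D) per-frame visual features, with T >= S
    sub_action_feats: (S, D) ordered sub-action text features
    """
    s = sub_action_feats.shape[0]
    # Split frames into exactly S contiguous chunks so stage i of the
    # video is compared only with sub-action i, not every description.
    stages = torch.stack([c.mean(dim=0)
                          for c in frame_feats.tensor_split(s, dim=0)])
    stages = F.normalize(stages, dim=-1)
    texts = F.normalize(sub_action_feats, dim=-1)
    return (stages * texts).sum(dim=-1).mean()  # scalar video-to-class score
```

Scoring a query video against each candidate class with such a function, then taking the argmax, is the standard few-shot classification step this kind of similarity would plug into.
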
Experimental Validation

The paper demonstrates the effectiveness of Task-Adapter++ across five benchmarks, on which the model consistently achieves state-of-the-art results, improving on previous methods by up to 3.4%. These results underscore the method's ability to handle the complexities of FSAR tasks, and extensive ablation studies confirm the contribution of each component of the framework.

Implications and Future Work

Task-Adapter++ advances few-shot action recognition by efficiently harnessing the latent knowledge of pre-trained models through parameter-efficient tuning. The method has implications both for practical applications, such as real-time action classification, and for theoretical work on model adaptation strategies.

Future directions might include deeper integration with other multimodal frameworks, extending task-specific adaptation to other domains, and refining the alignment algorithm to further improve the robustness and applicability of FSAR. Investigating adaptable architectures combined with domain-specific pre-training could also prove fruitful.
