
Hulk: A Universal Knowledge Translator for Human-Centric Tasks (2312.01697v4)

Published 4 Dec 2023 in cs.CV and cs.AI

Abstract: Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric scenarios and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance on 11 benchmarks. The code is available at https://github.com/OpenGVLab/Hulk.

Citations (9)

Summary

  • The paper presents Hulk, a universal model that unifies various human-centric tasks through modality translation, eliminating the need for task-specific fine-tuning.
  • It leverages a shared encoder-decoder architecture to translate between text, image, sparse label, and dense label modalities, streamlining input-output complexity (see the illustrative mapping after this list).
  • Hulk outperforms state-of-the-art methods on 11 out of 12 benchmarks, achieving notable metrics such as a 71.3% mIoU on human parsing.
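
As a rough illustration of the modality-translation framing, the sketch below maps a few of the tasks mentioned in this summary to plausible input and output modalities. The specific assignments are our reading of the abstract, not the paper's exact configuration.

```python
# Illustrative only: a plausible task -> (input modality, output modality) mapping
# under Hulk's four modalities (text, image, sparse label, dense label).
# The exact assignments used in the paper may differ.
TASK_MODALITIES = {
    "pedestrian_detection":        ("image", "sparse_label"),   # bounding boxes
    "2d_pose_estimation":          ("image", "sparse_label"),   # keypoint coordinates
    "human_parsing":               ("image", "dense_label"),    # per-pixel part labels
    "skeleton_action_recognition": ("sparse_label", "text"),    # skeleton in, action name out
}

for task, (src, dst) in TASK_MODALITIES.items():
    print(f"{task}: {src} -> {dst}")
```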

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

The paper presents "Hulk," a generalist model designed to address a variety of human-centric perception tasks across 2D vision, 3D vision, skeleton-based, and vision-language modalities without requiring task-specific fine-tuning. It seeks to overcome a limitation of previous human-centric models, which demand separate fine-tuning for each task and therefore incur significant overhead in training and deployment across diverse tasks.

The development of Hulk is driven by the increasing need for efficient models capable of handling multiple human-centric tasks, such as pedestrian detection, pose estimation, and skeleton-based action recognition. These tasks span industries from augmented reality to sports analysis, requiring models that can efficiently translate and understand modalities such as images, texts, sparse labels, and dense labels.

Methodology

Hulk's architecture unifies diverse human-centric tasks into a modality translation framework by condensing the many task-specific heads of prior models into two general heads: one for discrete representations (e.g., language tokens) and one for continuous representations (e.g., spatial coordinates). This design reduces the input-output complexity of the different learning tasks to four general modalities: text, image, sparse label, and dense label. By translating between these modalities with a shared encoder-decoder framework influenced by natural language processing techniques, Hulk treats each task as a form of modality translation.
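
To make the two-head design concrete, the following is a minimal sketch, not the official Hulk implementation: a shared encoder-decoder whose decoder features are routed either to a discrete head (token logits) or a continuous head (real-valued outputs such as coordinates). All module choices, dimensions, and names below are illustrative assumptions.

```python
# Minimal sketch of a two-head modality translator (illustrative, not Hulk's code).
import torch
import torch.nn as nn


class TwoHeadTranslator(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000, coord_dim=2,
                 n_heads=8, n_layers=2):
        super().__init__()
        # Shared encoder-decoder over tokenized inputs from any modality.
        self.backbone = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        # General head for discrete outputs (e.g., words or class tokens).
        self.discrete_head = nn.Linear(d_model, vocab_size)
        # General head for continuous outputs (e.g., keypoint coordinates).
        self.continuous_head = nn.Linear(d_model, coord_dim)

    def forward(self, src_tokens, tgt_queries, output_type="discrete"):
        # src_tokens:  (B, S, d_model) embedded input modality (image/text/...).
        # tgt_queries: (B, T, d_model) queries for the desired output modality.
        feats = self.backbone(src_tokens, tgt_queries)
        if output_type == "discrete":
            return self.discrete_head(feats)   # (B, T, vocab_size) token logits
        return self.continuous_head(feats)     # (B, T, coord_dim) continuous values


# Example: frame 2D pose estimation as image -> sparse-label translation,
# predicting 17 keypoint coordinates from patch embeddings.
model = TwoHeadTranslator()
image_tokens = torch.randn(1, 196, 256)  # e.g., 14x14 patch embeddings
pose_queries = torch.randn(1, 17, 256)   # one query per keypoint
coords = model(image_tokens, pose_queries, output_type="continuous")
print(coords.shape)  # torch.Size([1, 17, 2])
```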

The model is benchmarked on 12 distinct datasets covering eight tasks, surpassing the state of the art on 11 of them. Notably, Hulk achieves an mIoU of 71.3% on the CIHP dataset for human parsing and records improvements in 2D pose estimation and 3D human mesh recovery, among other tasks. These results show that Hulk does not merely consolidate tasks under a single model but improves collective task performance without bespoke per-task adjustments.

Implications and Future Directions

Hulk's results carry two broader implications for the AI community. First, they mark a step towards generalist models that can handle multiple tasks efficiently without significant performance trade-offs. Second, they highlight the value of knowledge sharing and of architectural designs that span varied inputs and outputs, leveraging shared learned representations instead of traditional task-specific modeling.

Theoretically, the Hulk framework encourages a re-evaluation of task-specific models in favor of unified models that learn from shared data representations, reducing resource consumption and increasing model applicability. Practically, this could lead to significant advancements in industries reliant on computer vision and perception, enabling quicker adaptation to varied application scenarios.

Looking forward, while Hulk provides a novel framework for task integration, there remains scope to extend it to more challenging perception tasks and to strengthen its generalization across modalities. Advances in efficient computing and data representation may further bolster Hulk’s scalability, keeping it competitive in an expanding field of multi-modal AI research.