HOI-R1: Language-Centric HOI Detection
- HOI-R1 is a language-based framework that replaces traditional object detection pipelines by leveraging multimodal large language models.
- The model employs chain-of-thought reasoning and reinforcement learning to generate structured outputs, nearly doubling baseline accuracy on the HICO-DET benchmark.
- HOI-R1 simplifies human-object interaction detection with rapid convergence, improved generalization, and open-source implementation for wider adoption.
HOI-R1 is a multimodal LLM (MLLM)–centric human–object interaction detection (HOID) framework that eschews classic object detection pipelines in favor of pure language-based reasoning. The approach leverages MLLMs' ability to extract, synthesize, and reason about structured relationships in images and text, producing the full HOI output (bounding boxes and HOI labels) solely from the reasoning guidance embedded in the input prompt. HOI-R1 roughly doubles accuracy over the baseline MLLM on the HICO-DET benchmark, with strong generalization, rapid convergence, and open-source resources for implementation.
1. Conceptual Overview and Motivation
HOI-R1 was developed in response to the complexity of conventional HOID pipelines, which generally require specialized object detectors, transformer decoders, and intricate model architectures to connect vision-language prior knowledge with structured HOI instance representations. This framework discards the object detection module entirely and instead prompts the MLLM to solve object localization and verb labeling in pure text, a paradigm enabled by recent advances in chain-of-thought (CoT) reasoning and reinforcement learning (RL) in MLLMs.
The primary objective is to exploit the strong holistic scene understanding, compositionality, and cognitive reasoning ability natural to MLLMs, thereby simplifying the HOID pipeline, fostering ease of extensibility and adaptation, and leveraging language-based abstraction for both geometric and semantic HOI outputs.
2. Language-Centric Model Architecture
HOI-R1’s architecture consists of three key prompt components passed to the MLLM along with the image:
- Task Instruction: Specifies the detection role, full vocabulary of objects and actions, and instructs the model on the desired output structure.
- Reasoning Guidance: Decomposes HOID into sequential cognitive steps—detecting humans, identifying actions, matching objects—to mimic human logic in stepwise format.
- Format Example: Demonstrates the expected output via a JSON-like structure, including <think> tags for the reasoning trace and <answer> tags for the final predictions (bounding box coordinates and HOI labels).

The teacher model (e.g., GPT-4o-mini) first generates structured reasoning traces and predictions, which are then distilled into a student MLLM (such as Qwen2.5-VL-3B) via supervised fine-tuning (SFT). During inference, the MLLM outputs structured text encompassing spatial localization and category labeling. The SFT loss is an autoregressive negative log-likelihood:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t \in \mathcal{T} \cup \mathcal{A}} \log p_\theta\left(y_t \mid y_{<t}, x\right),$$

where $\mathcal{T}$ indexes the reasoning (<think>) sequence and $\mathcal{A}$ the predicted answer (<answer>).
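As a concrete illustration, the following minimal PyTorch-style sketch computes this masked autoregressive NLL, assuming a Hugging Face-style causal LM and a precomputed mask over the <think> and <answer> token spans (all names here are illustrative, not taken from the HOI-R1 repository):

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, loss_mask):
    """Autoregressive NLL over the <think> and <answer> token spans.

    input_ids: (B, T) prompt + reasoning + answer token ids.
    loss_mask: (B, T) 1 where the token belongs to a <think> or
               <answer> span, 0 for prompt tokens (no loss there).
    """
    logits = model(input_ids).logits                # (B, T, V)
    # Shift so position t predicts token t + 1 (standard causal LM setup).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = loss_mask[:, 1:].float()
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )
    # Average the negative log-likelihood over supervised tokens only.
    return (nll * shift_mask.reshape(-1)).sum() / shift_mask.sum()
```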
3. Training Process: Thinking Distillation and RL Alignment
The two-stage training protocol first instills stepwise reasoning via SFT (using the teacher’s traces + ground truth), then applies Group Relative Policy Optimization (GRPO) in an RL phase using HOI-specific reward functions:
- Supervised Fine-Tuning (SFT): Transfers the cognitive reasoning format and domain knowledge from teacher to student MLLM, ensuring that both reasoning and answer generation conform to HOI task expectations.
- RL Alignment (GRPO): Optimizes model outputs for both structure and content via multi-component HOID reward functions. The RL phase uses only ~40 additional steps after one SFT epoch to nearly double accuracy, indicating rapid convergence.
GRPO compares groups of candidate output samples directly, thus avoiding explicit critic networks. The advantage for the $i$-th sample in a group of size $G$ is

$$A_i = \frac{r_i - \mu}{\sigma},$$

with $\mu = \frac{1}{G}\sum_{j=1}^{G} r_j$ and $\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu)^2}$.
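In code, the group-relative advantage reduces to standardizing rewards within each sampled group; a minimal NumPy sketch (the reward values would come from the HOID reward functions of Section 5):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize each candidate's reward against its own group's
    mean and standard deviation, so no learned critic is required.

    rewards: (G,) scalar rewards for G sampled outputs of one prompt.
    """
    mu = rewards.mean()
    sigma = rewards.std()   # population std, matching the formula above
    return (rewards - mu) / (sigma + eps)

# Example: 4 candidate outputs sampled for one image prompt.
advantages = grpo_advantages(np.array([0.2, 0.8, 0.5, 0.5]))
```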
4. HOI Reasoning Process
The chain-of-thought paradigm is central to HOI-R1. The model is prompted explicitly to “think” through human localization, action identification, and object association. The template includes:
- <think>: Reasoning trace of the cognitive process.
- <answer>: Structured response with bounding box coordinates and assigned HOI labels.
This approach guides the MLLM to deliver not only the required output but also its self-explanatory logic, resulting in coherent, interpretable, and verifiable solutions. All outputs are checked for format compliance via the reward scheme described in Section 5.
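For concreteness, a minimal sketch of such a format check, extracting the tagged spans and parsing the answer (the exact answer schema is an assumption, not the repository's format):

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_output(text: str):
    """Return (reasoning_trace, predictions) or None on format violation."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    if think is None or answer is None:
        return None                      # missing <think> or <answer> tag
    try:
        # Assumed schema: a list of HOI instances, each with human and
        # object boxes ([x1, y1, x2, y2]) and an HOI label string.
        predictions = json.loads(answer.group(1))
    except json.JSONDecodeError:
        return None                      # answer is not valid JSON
    return think.group(1).strip(), predictions
```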
5. HOID Reward Functions in RL
Reward functions direct the RL alignment to reinforce both output structure and content:
- Key Format Reward: Checks that the output contains the required <think> and <answer> tags and the expected answer keys.
- Box Format Reward: Validates format and non-duplication via IoU-based uniqueness constraints.
- Object/Verb Label Rewards: Compare predicted labels with the predefined object and action list; verb reward utilizes the ratio of unique and correct predictions.
- HOI IoU Reward: Matches predicted and ground-truth human-object box pairs using the Hungarian algorithm, with a pairwise cost of the form

$$C_{ij} = 1 - \tfrac{1}{2}\left(\mathrm{IoU}(b_i^{h}, \hat{b}_j^{h}) + \mathrm{IoU}(b_i^{o}, \hat{b}_j^{o})\right),$$

where $b^{h}$, $b^{o}$ are the human and object boxes of a predicted pair and $\hat{b}^{h}$, $\hat{b}^{o}$ those of a ground-truth pair.

The final reward is a weighted sum of these components, driving the model toward semantically and geometrically correct outputs.
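A minimal sketch of the Hungarian matching behind the HOI IoU reward, using SciPy's linear_sum_assignment with the cost reconstructed above (a plausible reading of the reward, not the repository's exact implementation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hoi_iou_reward(pred_pairs, gt_pairs):
    """Hungarian-match predicted to ground-truth (human, object) box pairs
    and return the mean matched-pair IoU, normalized by the number of
    ground-truth pairs so that missed detections are penalized."""
    if not pred_pairs or not gt_pairs:
        return 0.0
    cost = np.array([
        [1.0 - 0.5 * (box_iou(ph, gh) + box_iou(po, go))
         for gh, go in gt_pairs]
        for ph, po in pred_pairs
    ])
    rows, cols = linear_sum_assignment(cost)
    matched_iou = sum(1.0 - cost[r, c] for r, c in zip(rows, cols))
    return matched_iou / len(gt_pairs)
```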
6. Empirical Results and Evaluation
Evaluated on HICO-DET (600 HOI categories, more than 38k training images):
- Baseline Qwen2.5-VL-3B (no additional training): 8.39 mAP (Full), already outperforming HO-RCNN.
- SFT only (1 epoch): mAP rises from 8.39 to 16.77 (Full), with similar improvements on Rare and Non-Rare.
- Full pipeline (SFT + RL): 18.33 mAP on Rare, 19.02 mAP on Non-Rare; exceeds Qwen2.5-VL-32B-AWQ and Qwen2.5-VL-72B-AWQ baselines for generalization.
- Training efficiency: Rapid convergence—~40 RL steps after SFT suffice for notable gains.
- Generalization: Outperforms larger contemporary MLLMs on the benchmark subsets, confirming the effectiveness of language-based reasoning and RL-guided output alignment.
7. Implementation and Practical Impact
HOI-R1 source code and pipeline (including prompt templates, reasoning trace formats, reward functions, and training configuration) are open-access at https://github.com/cjw2021/HOI-R1. The simplified architecture facilitates broader adoption, extension, and experimentation in structured vision-language tasks.
The HOI-R1 paradigm reframes HOI detection as language-centric reasoning, replacing conventional multi-stage detection architectures. This shift highlights the utility of MLLMs’ compositional reasoning for structured vision tasks, yielding robust performance and facilitating future innovation in text-based multimodal frameworks.