HOI-R1: Language-Centric HOI Detection

Updated 9 October 2025
  • HOI-R1 is a language-based framework that replaces traditional object detection pipelines by leveraging multimodal large language models.
  • The model employs chain-of-thought reasoning and reinforcement learning to generate structured outputs, achieving nearly double accuracy on the HICO-DET benchmark.
  • HOI-R1 simplifies human-object interaction detection with rapid convergence, improved generalization, and open-source implementation for wider adoption.

HOI-R1 is a multimodal LLM (MLLM)-centric human-object interaction detection (HOID) framework that replaces classic object detection pipelines with pure language-based reasoning. The approach leverages MLLMs' ability to extract, synthesize, and reason about structured relationships in image data and text, producing HOI outputs (bounding boxes and HOI labels) solely from the reasoning guidance embedded in the input prompt. HOI-R1 roughly doubles the accuracy of its baseline MLLM on the HICO-DET benchmark, with strong generalization, rapid convergence, and open-source resources for implementation.

1. Conceptual Overview and Motivation

HOI-R1 was developed in response to the complexity of conventional HOID pipelines, which typically require specialized object detectors, transformer decoders, and intricate model architectures to connect vision-language prior knowledge with structured HOI instance representations. This framework discards the object detection module entirely and instead prompts the MLLM to solve object localization and verb labeling in pure text, a paradigm enabled by recent advances in chain-of-thought (CoT) reasoning and reinforcement learning (RL) in MLLMs.

The primary objective is to exploit the strong holistic scene understanding, compositionality, and cognitive reasoning ability natural to MLLMs, thereby simplifying the HOID pipeline, fostering ease of extensibility and adaptation, and leveraging language-based abstraction for both geometric and semantic HOI outputs.

2. Language-Centric Model Architecture

HOI-R1’s architecture consists of three key prompt components passed to the MLLM along with the image:

  • Task Instruction: Specifies the detection role, full vocabulary of objects and actions, and instructs the model on the desired output structure.
  • Reasoning Guidance: Decomposes HOID into sequential cognitive steps—detecting humans, identifying actions, matching objects—to mimic human logic in stepwise format.
  • Format Example: Demonstrates the expected output via a JSON-like structure with <think> tags for reasoning traces and <answer> tags for the final predictions (bounding box coordinates and HOI labels); a sketch of an assembled prompt follows this list.
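
A minimal sketch of how these three components might be assembled into a single text prompt; the vocabulary snippets, wording, and JSON field names below are illustrative assumptions, not the released HOI-R1 template:

```python
# Illustrative three-part HOI-R1-style prompt. The vocabularies, wording, and
# field names are assumptions for demonstration, not the authors' exact template.
OBJECTS = ["bicycle", "car", "dog"]   # stand-in for the full HICO-DET object list
VERBS = ["ride", "hold", "walk"]      # stand-in for the full HICO-DET action list

TASK_INSTRUCTION = (
    "You are a human-object interaction detector. "
    f"Valid objects: {', '.join(OBJECTS)}. Valid actions: {', '.join(VERBS)}. "
    "Report every interacting (human, action, object) triplet with bounding boxes."
)

REASONING_GUIDANCE = (
    "Think step by step: (1) locate every person in the image, "
    "(2) decide which actions each person performs, "
    "(3) match each action to the object it involves."
)

FORMAT_EXAMPLE = (
    "<think>The person on the left straddles the bicycle, so the action is ride.</think>\n"
    '<answer>[{"human_box": [x1, y1, x2, y2], "object_box": [x1, y1, x2, y2], '
    '"verb": "ride", "object": "bicycle"}]</answer>'
)

def build_prompt() -> str:
    """Concatenate the three prompt components passed to the MLLM with the image."""
    return "\n\n".join([TASK_INSTRUCTION, REASONING_GUIDANCE, FORMAT_EXAMPLE])
```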

The teacher model (e.g., GPT-4o-mini) first generates structured reasoning traces and predictions, which are then distilled into a student MLLM (such as Qwen2.5-VL-3B) via supervised fine-tuning (SFT). During inference, the MLLM outputs structured text encompassing spatial localization and category labeling. The SFT loss is an autoregressive negative log-likelihood:

$$\mathcal{L}_{SFT} = -\mathbb{E}_{(x,q,r,a) \sim \mathcal{D}} \left[ \sum_{t=1}^{T_R} \log \pi_{\theta}(r_t \mid x, q, r_{<t}) + \sum_{t=1}^{T_A} \log \pi_{\theta}(a_t \mid x, q, r, a_{<t}) \right]$$

where $r_t$ indexes the reasoning (<think>) sequence and $a_t$ the predicted answer (<answer>).
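
A schematic PyTorch rendering of this objective, assuming the prompt, reasoning, and answer tokens are concatenated into one sequence and a mask selects the supervised (reasoning plus answer) positions; tensor names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor,
             target_ids: torch.Tensor,
             loss_mask: torch.Tensor) -> torch.Tensor:
    """Autoregressive NLL over the <think> and <answer> tokens.

    logits:     (B, T, V) model outputs for the concatenated sequence
    target_ids: (B, T)    next-token targets
    loss_mask:  (B, T)    1 at reasoning/answer positions, 0 at prompt positions
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    # Negative log-likelihood averaged over the supervised positions only.
    return -(token_ll * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```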

3. Training Process: Thinking Distillation and RL Alignment

The two-stage training protocol first instills stepwise reasoning via SFT (using the teacher's traces plus ground truth), then applies Group Relative Policy Optimization (GRPO) in an RL phase using HOI-specific reward functions:

• Supervised Fine-Tuning (SFT): Transfers the cognitive reasoning format and domain knowledge from teacher to student MLLM, ensuring both reasoning and answer generation conform to HOI task expectations.

• RL Alignment (GRPO): Optimizes model outputs for both structure and content via multi-component HOID reward functions. The RL phase needs only ~40 additional steps after one SFT epoch to nearly double accuracy, indicating rapid convergence.

GRPO compares groups of candidate output samples directly, avoiding an explicit critic network. For a group of $G$ sampled outputs, the clipped group-relative objective is

$$\frac{1}{G} \sum_{i=1}^{G} \min \left( \rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i \right)$$

with importance ratio $\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}$ and group-normalized advantage $A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$.
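
A compact sketch of this update for a single prompt, assuming sequence-level log-probabilities and scalar HOID rewards for the G sampled outputs (any KL regularization term used in practice is omitted here):

```python
import torch

def grpo_surrogate(logp_new: torch.Tensor,   # (G,) log pi_theta(o_i | q)
                   logp_old: torch.Tensor,   # (G,) log pi_theta_old(o_i | q)
                   rewards: torch.Tensor,    # (G,) scalar HOID rewards r_i
                   eps: float = 0.2) -> torch.Tensor:
    """Group-relative clipped surrogate; maximize this (negate to use as a loss)."""
    # Group-normalized advantage: A_i = (r_i - mean) / std
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio: rho_i = pi_theta(o_i|q) / pi_theta_old(o_i|q)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * adv, clipped * adv).mean()
```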

4. HOI Reasoning Process

The chain-of-thought paradigm is central to HOI-R1. The model is prompted explicitly to "think" through human localization, action identification, and object association. The template includes:

• <think>: Reasoning trace of the cognitive process.
• <answer>: Structured response with bounding box coordinates and assigned HOI labels.

This approach guides the MLLM to deliver not only the required output but also its own explanatory logic, resulting in coherent, interpretable, and verifiable solutions. All outputs are checked for format compliance through the reward scheme.
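
As an illustration, a hypothetical model response and a minimal parser for the tagged format are sketched below; the JSON schema inside <answer> is an assumption, not the exact HOI-R1 output specification:

```python
import json
import re

# Hypothetical structured output in the <think>/<answer> format described above.
raw_output = (
    "<think>The person on the left straddles the bicycle and grips the handlebars, "
    "so the interaction is (person, ride, bicycle).</think>"
    '<answer>[{"human_box": [34, 60, 210, 400], "object_box": [40, 220, 260, 430], '
    '"verb": "ride", "object": "bicycle"}]</answer>'
)

def parse_hoi_output(text: str):
    """Return (reasoning, predictions), or None if the format criterion is violated."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    if think is None or answer is None:
        return None  # malformed output; would earn no format reward
    try:
        predictions = json.loads(answer.group(1))
    except json.JSONDecodeError:
        return None
    return think.group(1).strip(), predictions

print(parse_hoi_output(raw_output))
```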

5. HOID Reward Functions in RL

Reward functions direct the RL alignment to reinforce both output structure and content:

• Key Format Reward: $r_{tag} = \mathbb{I}\{\text{"<think>"} \in \hat{y}\}$
• Box Format Reward: Validates format and non-duplication via IoU-based uniqueness constraints.
• Object/Verb Label Rewards: Compare predicted labels with the predefined object and action lists; the verb reward uses the ratio of unique and correct predictions.
• HOI IoU Reward: Matches predicted and ground-truth boxes using the Hungarian algorithm with cost:

$$C_{ij} = 1 - s_{ij}, \quad s_{ij} = \frac{1}{2} \left[ \mathrm{IoU}(\hat{h}_i, h_j) + \mathrm{IoU}(\hat{o}_i, o_j) \right]$$

The final reward $r$ is a weighted sum of these components, driving the model toward semantically and geometrically correct outputs.
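
A sketch of the HOI IoU component under the assumption that predictions and ground truth are lists of (human_box, object_box) pairs in [x1, y1, x2, y2] format; the weighting of the final reward is likewise illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def hoi_iou_reward(pred_pairs, gt_pairs):
    """Hungarian matching on C_ij = 1 - s_ij; returns the mean matched similarity s_ij."""
    if not pred_pairs or not gt_pairs:
        return 0.0
    sim = np.array([[0.5 * (iou(ph, gh) + iou(po, go)) for gh, go in gt_pairs]
                    for ph, po in pred_pairs])
    rows, cols = linear_sum_assignment(1.0 - sim)  # minimize cost = maximize similarity
    return float(sim[rows, cols].mean())

# The overall RL reward would then combine the components, e.g. (weights illustrative):
# r = w_tag * r_tag + w_box * r_box + w_label * r_label + w_iou * r_iou
```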

6. Empirical Results and Evaluation

Evaluated on HICO-DET (600 HOI categories, >38k images):

• Baseline Qwen2.5-VL-3B (no additional training): outperforms HO-RCNN.
• SFT only (1 epoch): mAP rises from 8.39 to 16.77 (Full), with similar improvements on Rare and Non-Rare.
• Full pipeline (SFT + RL): 18.33 mAP on Rare, 19.02 mAP on Non-Rare; exceeds the Qwen2.5-VL-32B-AWQ and Qwen2.5-VL-72B-AWQ baselines in generalization.
• Training efficiency: Rapid convergence; roughly 40 RL steps after SFT suffice for notable gains.
• Generalization: Outperforms larger contemporary MLLMs on these subsets, confirming the efficiency of language-based reasoning and RL-guided output alignment.

7. Implementation and Practical Impact

HOI-R1 source code and the full pipeline (including prompt templates, reasoning trace formats, reward functions, and training configuration) are open-access at https://github.com/cjw2021/HOI-R1. The simplified architecture facilitates broader adoption, extension, and experimentation in structured vision-language tasks.

The HOI-R1 paradigm reframes HOI detection as language-centric reasoning, replacing conventional multi-stage detection architectures. This shift highlights the utility of MLLMs' compositional reasoning for structured vision tasks, yielding robust performance and enabling future innovation in text-based multimodal frameworks.
