AOR-Net: Action-Object-Relation Network

Updated 24 January 2026
  • The paper introduces AOR-Net, a multi-level framework that models actions, objects, and their relations using a staged visual reasoning pipeline and adaptive fusion.
  • It employs Chain-of-Action prompting and a Mixture-of-Thoughts module to dynamically integrate global motion, object-specific, and relational features for improved accuracy.
  • Empirical results on the DAOS dataset demonstrate state-of-the-art improvements in fine-grained action recognition with significant gains over prior models.

The Action-Object-Relation Network (AOR-Net) is a multi-level reasoning framework designed to address fine-grained action recognition in video, with a particular focus on domains where subtle human-object interactions are critical, such as in-cabin driver behavior monitoring. Unlike previous models that naively fuse global and object-centric features, AOR-Net introduces explicit modeling of actions, objects, and their relations using a staged visual reasoning pipeline and an adaptive fusion mechanism to emphasize the most salient interaction cues. Developed for the DAOS (Driver Action with Object Synergy) dataset, AOR-Net demonstrates state-of-the-art performance on this and related benchmarks, advancing object-relation-aware action recognition (Li et al., 17 Jan 2026).

1. Motivation and Problem Setting

In domains such as driver monitoring, distinguishing actions like “holding phone” versus “grasping wheel” requires disambiguating visually similar upper-body movements by referencing both object presence and the nature of human-object interactions. Legacy methods either ignore spatial human-object relations or indiscriminately incorporate object features, often introducing irrelevant cues. AOR-Net is constructed to overcome these deficiencies by:

  • Enforcing a chain-of-action prompting structure that sequentially reasons over global, object, and relational cues.
  • Introducing a Mixture-of-Thoughts (MoT) module that adaptively fuses these cues, dynamically selecting which information stream should dominate given the action context.
  • Integrating textual prototype banks for actions, objects, and relations to facilitate cross-modal matching and prompt-guided reasoning.

This explicit modeling of task-relevant objects and their interactions closes empirical gaps identified in prior work, where most approaches failed to effectively link objects to their associated actions in complex, object-rich spaces (Li et al., 17 Jan 2026).

2. Architectural Framework

AOR-Net is structured hierarchically, reflecting its multi-stage prompting and fusion design. The main components are as follows:

  • Open-VCLIP Backbone: A CLIP-style ViT encoder tokenizes $T$ frames into a grid of patch embeddings $\mathbf{V}_A \in \mathbb{R}^{(THW+1)\times d}$.
  • Chain-of-Action Prompting (CoA): A staged process that produces dedicated action, object, and relation streams:
    • Action Level: The backbone output $\mathbf{V}_A$ serves as a global motion token representation.
    • Object Level: For up to $O$ detected objects per frame (boxes $\mathfrak{B}$), RoIAlign extracts object-specific patch features, which are passed through MLPs and temporally max-pooled to $\mathbf{V}_O \in \mathbb{R}^{O\times d}$. A multi-head self-attention block ($\mathrm{MHSA}$) allows exchange between the action and object streams.
    • Relation Level: Human tokens ($\mathbf{V}_H$) and object tokens ($\mathbf{V}_{Obj}$) are paired and encoded via a 5-layer relation MLP to yield relational features, further processed by multi-head cross-attention (MHCA) over object tokens to obtain relation-enhanced embeddings $\mathbf{V}'_R$.

After the CoA path, three streams $\mathbf{V}'_A$, $\mathbf{V}'_O$, and $\mathbf{V}'_R$ represent action, object, and relation features, respectively.
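A minimal PyTorch sketch of the object-level stream and the action-object exchange described above. Module names, shapes, and the random stand-ins for the backbone and RoIAlign outputs are hypothetical, intended only to illustrate the MLP, temporal max-pool, and joint self-attention steps:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 512          # embedding dimension (paper uses d = 512)
T, O = 8, 6      # frames per clip, max objects per frame

# Hypothetical stand-ins for the paper's learned components.
object_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
mhsa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

# Per-frame object features, e.g. from RoIAlign over patch embeddings: (T, O, d)
roi_feats = torch.randn(T, O, d)

# MLP, then temporal max-pool over frames -> V_O in R^{O x d}
v_obj = object_mlp(roi_feats).max(dim=0).values        # (O, d)

# Global action token from the backbone (CLS-style): (1, d)
v_act = torch.randn(1, d)

# Joint self-attention lets the action and object streams exchange information.
tokens = torch.cat([v_act, v_obj], dim=0).unsqueeze(0)  # (1, 1+O, d)
out, _ = mhsa(tokens, tokens, tokens)
v_act_ref, v_obj_ref = out[0, :1], out[0, 1:]           # refined streams
```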

  • Textual Prototype Bank: Using GPT-4o, natural language prompts are generated for all action, object, and action-object-relation descriptions and encoded with the CLIP text encoder into $\mathbf{T}_A$, $\mathbf{T}_O$, and $\mathbf{T}_R$.
  • Mixture-of-Thoughts (MoT) Module: Each visual stream is aligned to its text prototype bank via similarity matrices and the Gumbel-Softmax one-hot trick:

$$\mathbf{M}_l = \mathbf{V}'_l\,\mathbf{T}_l^\top, \qquad \hat{\mathbf{M}}_l = \mathrm{one\text{-}hot}(\arg\max \mathbf{M}_l) + \mathbf{M}_l - \mathrm{detach}(\mathbf{M}_l)$$
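This one-hot-plus-residual construction is a straight-through estimator: the forward pass uses the hard assignment, while gradients flow through the soft similarity matrix. A sketch in PyTorch, with illustrative shapes and the Gumbel noise sampling omitted for brevity:

```python
import torch

def straight_through_align(v, t):
    """Hard one-hot alignment of visual tokens to text prototypes,
    with gradients routed through the soft similarity matrix."""
    m = v @ t.T                                   # similarity M_l: (N, K)
    hard = torch.nn.functional.one_hot(
        m.argmax(dim=-1), num_classes=m.shape[-1]).float()
    # one-hot(argmax M) + M - detach(M): forward is hard, backward is soft.
    return hard + m - m.detach()

torch.manual_seed(0)
v = torch.randn(6, 512, requires_grad=True)   # visual stream V'_l
t = torch.randn(36, 512)                      # text prototype bank T_l
m_hat = straight_through_align(v, t)
m_hat.sum().backward()                        # gradients reach v via the soft term
```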

Aligned text features are fused with visual tokens, and adaptive weights are computed such that the final fused representation $\mathbf{A}_{\mathrm{final}}$ is:

$$\mathbf{A}_{\mathrm{final}} = W_A\,\mathbf{F}_A + \sum_{i=1}^{O} W_{O,i}\,\mathbf{F}_{O,i} + \sum_{j=1}^{R} W_{R,j}\,\mathbf{F}_{R,j}$$

  • Classifier and Loss: A linear layer predicts action labels from $\mathbf{A}_{\mathrm{final}}$, optimized with cross-entropy loss over the one-hot action labels.
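The weighted fusion and classification head can be sketched as follows. The sigmoid gate producing the adaptive weights is a hypothetical stand-in (the paper does not specify the gating function here), and all features are random placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, O, R, C = 512, 6, 6, 36   # embedding dim, objects, relations, action classes

# Text-aligned per-stream features F -- random placeholders.
f_act = torch.randn(d)
f_obj = torch.randn(O, d)
f_rel = torch.randn(R, d)

# Hypothetical scalar gate producing an adaptive weight per feature.
gate = nn.Linear(d, 1)
w_act = torch.sigmoid(gate(f_act))               # (1,)
w_obj = torch.sigmoid(gate(f_obj)).squeeze(-1)   # (O,)
w_rel = torch.sigmoid(gate(f_rel)).squeeze(-1)   # (R,)

# A_final = W_A F_A + sum_i W_{O,i} F_{O,i} + sum_j W_{R,j} F_{R,j}
a_final = (w_act * f_act
           + (w_obj[:, None] * f_obj).sum(dim=0)
           + (w_rel[:, None] * f_rel).sum(dim=0))

# Linear classifier trained with cross-entropy over action labels.
classifier = nn.Linear(d, C)
logits = classifier(a_final)
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([3]))
```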

3. Implementation Specifics

AOR-Net is implemented with several notable engineering decisions:

  • Data Modalities: Multi-view, multi-modal input from four Azure Kinect cameras covering front, face, left, right; modalities include synchronized RGB, IR, and depth at 15 fps.
  • Backbone Pretraining: Open-VCLIP is pre-trained on Kinetics-400; 8 frames per 3 s clip are sampled and resized to $224\times224$.
  • Object Handling: A maximum of $O=6$ objects per frame (best empirically), with RoIAlign for object feature extraction.
  • Training: AdamW optimizer, 4×A5000 GPUs, 30 epochs, learning rates swept over $\{10^{-4},10^{-5},10^{-6}\}$, batch sizes of 16 (single modality) and 4 (multi-modal settings), label-smoothing cross-entropy.

Pseudocode for the forward pass is provided in the source, detailing sequential object parsing, prompting, alignment, and fusion steps (Li et al., 17 Jan 2026).
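The optimizer and loss configuration above can be sketched in PyTorch as follows. The model is a placeholder, the weight-decay and label-smoothing values are assumptions (the paper states label smoothing is used but not its magnitude), and the learning rate is one value from the reported sweep:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(512, 36)   # placeholder standing in for AOR-Net

# AdamW with one of the learning rates swept in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# Label-smoothing cross-entropy; the smoothing value 0.1 is assumed.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One illustrative step with batch size 16 (single-modality setting).
logits = model(torch.randn(16, 512))
loss = criterion(logits, torch.randint(0, 36, (16,)))
loss.backward()
optimizer.step()
```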

4. Experimental Results and Benchmarking

Comprehensive evaluation is conducted on DAOS (9,787 clips, 36 actions, 15 objects) and a dense-object-labeled DriveAct subset:

| Method | Fine-grained Top-1 | Fine-grained Mean-1 | Coarse-grained Top-1 | Coarse-grained Mean-1 |
|---|---|---|---|---|
| Open-VCLIP (RGB only) | 53.8 | 43.0 | 53.3 | 42.4 |
| AOR-Net (RGB only) | 55.5 | 45.4 | 55.1 | 45.0 |
| Open-VCLIP† (multimodal) | 62.0 | 51.3 | 57.1 | 44.0 |
| AOR-Net† (multimodal) | 63.8 | 55.0 | 61.4 | 46.7 |

† denotes fusion of RGB, IR, and Depth modalities. AOR-Net achieves peak Top-1 of 63.8% and Mean-1 of 55.0% in fine-grained settings with all available modalities. On the DriveAct transfer subset, AOR-Net obtains 77.67% Top-1 and 66.90% Mean-1.

Performance ablation shows improvements from each architectural innovation: CoA alone raises Top-1 by 0.93% over the Open-VCLIP baseline, with the MoT module providing an additional 0.82% gain. The best-performing configuration sets the maximum object number to $O=6$, uses a relation encoder depth of $L=5$, and an embedding dimension of $d=512$ (Li et al., 17 Jan 2026).

5. Significance of Chain-of-Action Prompting and Mixture-of-Thoughts

The CoA mechanism enforces a sequential, interpretable reasoning process, mirroring human approaches to resolving fine-grained action ambiguities by grounding first in global motion, then in object salience, and finally in the specific nature of human-object relations. The Mixture-of-Thoughts module further increases robustness, especially under object-scarce conditions or high intra-class variance, by adaptively selecting the most discriminative stream (action, object, or relation) for each input. Visualization of MoT attention maps validates this, showing selective focus on relevant human-object pairs while suppressing distractors.

A plausible implication is that such modular, prompt-driven frameworks generalize well to other object-centric action recognition tasks in resource-rich and resource-poor environments (Li et al., 17 Jan 2026).

6. Comparison with Prior Object-Relation Models

Compared to previous works such as THORN (“Temporal Human-Object Relation Network”) (Guermal et al., 2022), which employs 3D CNN backbones and graph-based relation reasoning for generic compositional action recognition, AOR-Net introduces:

  • Direct prompting at three semantic abstraction levels (action, object, relation) rather than solely leveraging learnable adjacency in a single graph structure.
  • Explicit cross-modal alignment with stacked textual prototype banks, where multiple language prompts for actions/relations are validated by experts and encoded via the CLIP text encoder.
  • Dynamic fusion via MoT rather than static summation or concatenation, ensuring contextually appropriate weighting of evidence streams.

Despite these differences, both frameworks establish that modeling human-object and object-object interactions is crucial for high-fidelity action recognition, especially where actions are compositionally defined or visually ambiguous.

7. Ablation Analyses and Practical Observations

Empirical investigation into design hyperparameters demonstrates:

  • Increasing the maximum allowed objects per frame yields diminishing returns beyond $O=6$.
  • MoT improves both mean and Top-1 accuracy, confirming the benefit of dynamic fusion over static feature aggregation.
  • The Gumbel-Softmax temperature for cross-modal alignment peaks at $\tau=5.0$, indicating that the trade-off between hard and soft assignment is crucial for stable optimization.
  • Multi-modal fusion (RGB+IR+Depth) significantly boosts performance, with Top-1 accuracy increasing from 55.51% (RGB only) to 63.75%.
  • Attention visualization indicates successful suppression of irrelevant objects and selection of critical human-object pairs in complex scenes.
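The hard/soft assignment trade-off governed by the temperature can be illustrated with PyTorch's built-in `gumbel_softmax`; the logits here are random placeholders, and $\tau=5.0$ is the value reported best in the ablation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 36)   # placeholder similarity scores for 36 prototypes

# Higher temperature -> softer assignment; lower -> closer to one-hot.
soft = F.gumbel_softmax(logits, tau=5.0, hard=False)
hard = F.gumbel_softmax(logits, tau=5.0, hard=True)   # straight-through one-hot
```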

These findings suggest that, in challenging domains marked by class imbalance, occlusion, or object clutter, adaptivity at both the reasoning and fusion stages is essential for robust performance (Li et al., 17 Jan 2026).


AOR-Net, by unifying adaptive multi-level reasoning, prompt-guided cross-modal alignment, and dynamic evidence fusion, establishes a benchmark for explicit object-relation-aware action recognition in richly annotated, complex environments. Its systematic improvements through chain-of-action prompting and mixture-of-thoughts fusion delineate methodological advances over prior models while providing architectural transparency and interpretability crucial for future extensions and cross-domain applications.
