Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer (2112.01838v2)

Published 3 Dec 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialise, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.

Citations (92)

Summary

Review of "Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer"

The paper "Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer" by Frederic Z. Zhang, Dylan Campbell, and Stephen Gould introduces a two-stage approach to human-object interaction (HOI) detection. The authors argue that although one-stage models have come to dominate recent HOI literature, two-stage counterparts equipped with the same transformer can be more accurate and more memory-efficient, while taking a fraction of the time to train.

Key Contributions and Methodology

The authors propose the Unary–Pairwise Transformer, a two-stage detector whose interaction head is a modified transformer operating on two kinds of representation: unary encodings for individual human and object instances, and pairwise encodings for candidate human-object pairs. Unlike one-stage approaches, which rely on learnable queries in place of region proposals, the model processes individual entities and their interactions separately, and the two representations complement each other in improving detection performance.

The primary contributions of this work are as follows:

Two-Stage Architecture: The model couples a backbone object detector (DETR) with an advanced interaction head composed of transformer layers specifically tailored for HOI classification.
Unary and Pairwise Encodings: These encodings form the core of the interaction head. Unary encodings represent individual human and object instances, while pairwise encodings represent human-object pairs and incorporate spatial positional encodings. The authors observe that the two parts specialise: the unary part preferentially increases the scores of positive examples, and the pairwise part decreases the scores of negative examples, suppressing non-viable pairs.
Efficiency Gains: Empirically, the two-stage model exhibits a considerable reduction in training time and memory usage compared to one-stage models, enabling usage of high-capacity backbone networks without exorbitant resource demands.
Superior Performance: The model achieves state-of-the-art performance on prominent HOI datasets like HICO-DET and V-COCO, demonstrating significant improvement in mean average precision (mAP), particularly under known object settings.
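
To make the two-stage idea concrete, the following minimal sketch shows how the output of an object detector might be turned into candidate human-object pairs with simple spatial features. It is an illustration under stated assumptions, not the authors' implementation: the detection format, the `HUMAN_LABEL` constant, the `enumerate_pairs` helper, and the particular features (IoU and normalised centre offsets) are all hypothetical choices, and the paper's actual pairwise positional encodings may differ.

```python
from itertools import product

HUMAN_LABEL = "person"  # assumed name of the human class in the detector output


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def enumerate_pairs(detections):
    """Return (human_idx, object_idx, features) for every candidate pair.

    Each detection is a dict with a "label" and a "box". The first element
    of a pair must be a human; the second can be any other detection,
    including another human.
    """
    pairs = []
    for i, j in product(range(len(detections)), repeat=2):
        if i == j or detections[i]["label"] != HUMAN_LABEL:
            continue
        h, o = detections[i]["box"], detections[j]["box"]
        # Normalise centre offsets by the human box size (hypothetical feature).
        h_w = max(h[2] - h[0], 1e-6)
        h_h = max(h[3] - h[1], 1e-6)
        dx = ((o[0] + o[2]) - (h[0] + h[2])) / (2 * h_w)
        dy = ((o[1] + o[3]) - (h[1] + h[3])) / (2 * h_h)
        pairs.append((i, j, {"iou": box_iou(h, o), "dx": dx, "dy": dy}))
    return pairs
```

In a full model, features of this kind would be fed to the interaction head alongside the per-instance (unary) representations; here they only show where the pairwise information comes from once detections are fixed.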

Experimental Results

The paper reports comprehensive experiments on the two major HOI benchmarks. On HICO-DET, the model achieves an mAP of up to 32.62 and improves on prior state-of-the-art methods across both rare and non-rare interaction categories. On V-COCO, performance gains are also evident across both evaluation scenarios. The result is a model that surpasses existing benchmarks while also being more computationally efficient.
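
For readers unfamiliar with the metric, the sketch below shows one simple way average precision (AP) and its mean over classes (mAP) can be computed. It is a simplified, un-interpolated variant written for illustration; the official HICO-DET and V-COCO evaluation protocols differ in details. The function names are hypothetical, and the code assumes that each detection has already been marked as a true or false positive upstream (in HOI evaluation this typically requires both the human and object boxes to overlap a ground-truth pair sufficiently and the interaction class to match).

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Un-interpolated AP for one interaction class.

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth pair
        (assumed to be computed beforehand).
    num_ground_truth: number of ground-truth instances of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, k in enumerate(order, start=1):
        if is_true_positive[k]:
            true_positives += 1
            # Precision at the rank where recall increases.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class):
    """Mean of per-class APs; per_class maps a class name to
    (scores, is_true_positive, num_ground_truth)."""
    aps = [average_precision(*args) for args in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

A class whose detections are all false positives contributes an AP of zero, which is why rare interaction categories can pull the overall mAP down noticeably.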

Theoretical and Practical Implications

The paper contributes insight into the relationship between model architecture and efficiency for transformers applied to HOI tasks. Separating the unary and pairwise representations allows for more nuanced interaction modeling, which is pertinent for complex scene understanding. Practically, the approach reduces memory use and training time without sacrificing detection accuracy, and with a ResNet50 backbone the model approaches real-time inference on a single GPU, which suits real-world applications.

Future Directions

Future research directions could expand upon the notion of interaction specificity and examine the transferability of the unary–pairwise mechanism to other domains of scene understanding such as action detection in videos or more complex dynamic environments. Moreover, further analysis could be conducted to evaluate the robustness of this approach under varying data conditions, possibly involving transfer learning scenarios across disparate datasets.

Conclusion

In conclusion, the proposed Unary–Pairwise Transformer constitutes a substantial advance in efficient HOI detection. The research shows how transformer architectures can be adapted to a specialized detection task while balancing computational cost against state-of-the-art accuracy. The unary–pairwise encoding strategy paves the way for future investigations into efficient scene understanding, and could serve as an architectural template for related tasks beyond static image analysis.
