- The paper introduces VLM4HOI, a specialized vision-language model that integrates visual and textual data to tackle hand-object interaction referral in egocentric vision.
- The paper leverages the HOI-QA dataset containing approximately 3.9 million question-answer pairs to fine-tune the model for precise hand/object and interaction referrals.
- The paper demonstrates that VLM4HOI outperforms existing models, significantly enhancing interaction understanding for applications in robotics and augmented reality.
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Introduction to HOI-Ref Task
The task of Hand-Object Interaction Referral (HOI-Ref) in egocentric vision involves recognizing and referring to hands and objects within images, as well as understanding the interactions between them. The task is motivated by the need for robust interaction understanding in applications such as robotics and augmented reality, where imagery is captured from a first-person perspective.
Vision LLMs for HOI-Ref
The central approach to the HOI-Ref task is Vision Language Models (VLMs). Building on advances in large language models, the paper introduces VLM4HOI, a model trained specifically for HOI-Ref. It uses a vision encoder and a projection layer to map visual features into the embedding space of the LLM, and is fine-tuned on a purpose-built dataset named HOI-QA, as sketched below.
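The sketch below illustrates this recipe in minimal form: a vision encoder produces patch features, a linear projection maps them into the LLM's token-embedding space, and the LLM consumes the concatenation of projected visual tokens and embedded question tokens. This is an illustration under assumptions, not the paper's implementation; the module names, dimensions, and the toy transformer backbone are placeholders for the pretrained components a real system would use.

```python
# Minimal sketch of a VLM4HOI-style pipeline (assumed names and dimensions).
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps vision-encoder features into the LLM's token-embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):           # (B, num_patches, vision_dim)
        return self.proj(patch_features)          # (B, num_patches, llm_dim)

class ToyVLM(nn.Module):
    def __init__(self, vocab_size=32000, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.projector = VisionToLLMProjector(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        # Stand-in for the LLM backbone; a real system would use a pretrained
        # decoder-only language model here, with a pretrained vision encoder upstream.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=1,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, patch_features, input_ids):
        vis_tokens = self.projector(patch_features)        # projected visual tokens
        txt_tokens = self.text_embed(input_ids)            # embedded question tokens
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)   # visual + textual sequence
        return self.lm_head(self.llm(seq))                 # next-token logits

# Usage with tiny dimensions: 16 visual patches plus an 8-token referral question.
model = ToyVLM(vocab_size=1000, vision_dim=64, llm_dim=128)
logits = model(torch.randn(1, 16, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 1000])
```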
The HOI-QA Dataset
The HOI-QA dataset consists of approximately 3.9 million question-answer pairs designed specifically for training and assessing VLMs on the HOI-Ref task. Derived from existing egocentric video datasets, HOI-QA is constructed by converting existing annotations, such as those describing actions or hand-object interactions, into question-answer pairs that require referring both to spatial regions and to the interactions between them.
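As a rough illustration of this conversion, the sketch below turns a single frame annotation (a verb/noun label plus hand and active-object boxes) into referral-style question-answer pairs. The field names, question templates, and serialized box format are hypothetical; HOI-QA's exact schema and templates are defined in the paper.

```python
# Hedged sketch: converting an existing annotation into HOI-QA-style QA pairs.

def box_to_text(box, width, height):
    """Serialize a pixel-space box as normalized integer coordinates in [0, 100)."""
    x1, y1, x2, y2 = box
    return "{{<{}><{}><{}><{}>}}".format(
        int(100 * x1 / width), int(100 * y1 / height),
        int(100 * x2 / width), int(100 * y2 / height),
    )

def build_qa_pairs(ann):
    """Produce (question, answer) pairs for hand/object and interaction referral."""
    obj_box = box_to_text(ann["object_box"], ann["width"], ann["height"])
    hand_box = box_to_text(ann["hand_box"], ann["width"], ann["height"])
    return [
        # HO-Ref style: locate a named object / name a located region.
        (f"Where is the {ann['noun']} in the image?", obj_box),
        (f"What is the object at {obj_box}?", ann["noun"]),
        # I-Ref style: tie the hand to the object it is interacting with.
        (f"Which object is the {ann['hand_side']} hand at {hand_box} {ann['verb']}?",
         f"{ann['noun']} {obj_box}"),
    ]

example = {
    "verb": "holding", "noun": "knife", "hand_side": "right",
    "hand_box": (620, 410, 810, 560), "object_box": (540, 380, 760, 470),
    "width": 1280, "height": 720,
}
for question, answer in build_qa_pairs(example):
    print(question, "->", answer)
```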
Model Training and Implementation
VLM4HOI is trained on the HOI-QA dataset with an extensive set of instructions covering both hand/object identification and interaction referral. The vision encoder is paired with an LLM, and answers are generated by processing the concatenation of visual and textual embeddings. The language component is instruction-tuned so that the model can interpret input queries and produce the correct referrals.
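A minimal sketch of such an instruction-tuning objective is shown below: the loss is standard next-token cross-entropy over the concatenated sequence, with the visual tokens and the instruction masked out so that only the answer tokens are supervised. The masking convention, token counts, and `IGNORE_INDEX` value are common practice assumed here for illustration, not the paper's exact training code.

```python
# Hedged sketch of the fine-tuning loss: supervise only the answer span.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions excluded from the loss (visual + prompt tokens)

def referral_loss(logits, labels):
    """Next-token prediction loss computed over the answer tokens only."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )

# Example: 256 visual tokens and 24 instruction tokens are masked; only the 8
# answer tokens (e.g. a serialized bounding box) contribute to the loss.
batch, vocab = 2, 32000
num_visual, num_prompt, num_answer = 256, 24, 8
seq_len = num_visual + num_prompt + num_answer

logits = torch.randn(batch, seq_len, vocab)
labels = torch.full((batch, seq_len), IGNORE_INDEX, dtype=torch.long)
labels[:, -num_answer:] = torch.randint(0, vocab, (batch, num_answer))

print(referral_loss(logits, labels).item())
```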
Evaluation and Results
The model's performance is evaluated on both conventional referral of hands and objects (HO-Ref) and referral grounded in their interactions (I-Ref). VLM4HOI significantly outperforms existing models such as MiniGPT-v2, with the largest gains in interaction understanding, which is critical for interactive applications.
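To make the referral evaluation concrete, the sketch below parses a predicted bounding box out of the generated answer and counts it correct when its overlap (IoU) with the ground-truth box exceeds a threshold. The box output format and the 0.5 threshold are assumptions used to illustrate the idea, not necessarily the paper's exact protocol.

```python
# Hedged sketch of an IoU-based referral accuracy metric.
import re

def parse_box(text):
    """Extract the first <x1><y1><x2><y2> pattern from generated text."""
    m = re.search(r"<(\d+)><(\d+)><(\d+)><(\d+)>", text)
    return tuple(int(v) for v in m.groups()) if m else None

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def referral_accuracy(predictions, ground_truths, thresh=0.5):
    """Fraction of generated answers whose parsed box matches the ground truth."""
    hits = 0
    for pred_text, gt_box in zip(predictions, ground_truths):
        box = parse_box(pred_text)
        hits += box is not None and iou(box, gt_box) >= thresh
    return hits / len(ground_truths)

print(referral_accuracy(["the knife is at {<42><52><59><65>}"], [(40, 50, 60, 66)]))
```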
Conclusions and Future Directions
The creation and training of VLM4HOI, along with the development of the HOI-QA dataset, mark substantial progress in the field of egocentric vision-based interaction understanding. The results demonstrate the importance of specialized datasets and tailored model training for tasks involving complex interactions. Future research might focus on extending these methods to more diverse interaction scenarios and improving the model's robustness against varying environmental contexts in egocentric vision.
Acknowledgements
The paper credits support from several research grants and acknowledges contributions from related prior works that facilitated model comparisons and evaluations. Future work could integrate more dynamic interaction scenarios and explore real-time processing challenges in interactive applications.