- The paper introduces LEAP, which uses LLMs and multimodal text inputs to generate hierarchical, explainable action programs from egocentric videos.
- It integrates audio, SLAM, narration, object detection, and hand-object contact data to detail sub-actions, pre- and post-conditions, and control flow.
- LEAP significantly improves action recognition and anticipation, achieving top leaderboard performance on the EPIC Kitchens dataset.
Introduction to LEAP
In computer vision and artificial intelligence, researchers have long sought better ways to understand and model human actions from video. A recent approach in this direction is LEAP (LLM-Generation of Egocentric Action Programs), which leverages LLMs to parse egocentric videos and produce action programs that encapsulate the steps and conditions required for specific actions.
Components of LEAP
LEAP represents actions as hierarchical and compositional structures, similar to how human language is structured. These action programs are designed to reflect the motor, perceptual, and visual aspects of actions, and consist of several elements (a minimal sketch follows this list):
- Sub-actions: The fine-grained steps involved in completing an action.
- Pre-conditions: The specific conditions that must be met before a sub-action can be performed.
- Post-conditions: The outcomes that result from performing a sub-action.
- Control Flow: This defines the order and repetition of sub-actions, encapsulating loops and conditional statements.
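To make the structure concrete, here is a minimal, hypothetical sketch of how such an action program might be represented in Python. The class names, fields, and the example action are illustrative assumptions, not the paper's exact formalism.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubAction:
    """One fine-grained step of an action program (names are illustrative)."""
    name: str                   # e.g. "slice(carrot)"
    pre_conditions: List[str]   # conditions that must hold before the step
    post_conditions: List[str]  # outcomes produced by the step

@dataclass
class ActionProgram:
    """A hierarchical program for a high-level action such as 'cut carrot'."""
    action: str
    sub_actions: List[SubAction] = field(default_factory=list)

# Toy program: repetition of a sub-action is expressed as ordinary control flow.
program = ActionProgram(
    action="cut carrot",
    sub_actions=[
        SubAction("grasp(knife)",
                  ["knife is reachable", "hand is free"],
                  ["hand holds knife"]),
        SubAction("slice(carrot)",
                  ["hand holds knife", "carrot on cutting board"],
                  ["carrot has one more slice"]),
    ],
)

def execute(prog: ActionProgram, repetitions: int = 3) -> None:
    """Illustrative control flow: run the steps, repeating the last one."""
    for step in prog.sub_actions:
        print("pre:", step.pre_conditions, "->", step.name, "->", step.post_conditions)
    for _ in range(repetitions - 1):
        print("repeat:", prog.sub_actions[-1].name)

execute(program)
```

The point of the sketch is the shape of the representation: each sub-action carries explicit pre- and post-conditions, and loops or conditionals over sub-actions capture the control flow described above.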
Generation of Action Programs
The method uses video from the EPIC Kitchens dataset, a collection of first-person recordings of a variety of kitchen tasks. Since LLMs such as GPT-4 do not natively process video, LEAP converts several multimodal signals into text before passing them to the LLM. These inputs include:
- Audio Extractor: Identifies sounds associated with actions.
- SLAM (Simultaneous Localization and Mapping): Estimates the camera wearer's movement through the scene.
- Narrations: Provide contextual information from what the actor says before, during, and after actions.
- Object Detection (Faster R-CNN): Recognizes the objects interacted with during the action.
- Hand-Object Contact Detector: Identifies when, and with which objects, the hands are in contact.
Using these components, LEAP synthesizes complete action programs from the video segments.
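The exact prompt format is not reproduced here; the following is a hypothetical sketch of how these per-segment textual signals might be assembled into a single LLM query. The function and field names are assumptions for illustration only.

```python
def build_prompt(audio_tags, camera_motion, narration, objects, contacts):
    """Assemble textual descriptions of one video segment into an LLM prompt.

    All fields are plain strings/lists produced by the upstream extractors;
    the wording of the template is illustrative, not the paper's format.
    """
    lines = [
        "You are given textual descriptions of an egocentric video segment.",
        f"Audio events: {', '.join(audio_tags)}",
        f"Camera-wearer motion (SLAM): {camera_motion}",
        f"Narration: {narration}",
        f"Detected objects: {', '.join(objects)}",
        f"Hand-object contacts: {', '.join(contacts)}",
        "Write a hierarchical action program with sub-actions, pre-conditions,",
        "post-conditions, and control flow.",
    ]
    return "\n".join(lines)

prompt = build_prompt(
    audio_tags=["water running", "metal clink"],
    camera_motion="small forward translation toward the sink",
    narration="wash the cup",
    objects=["cup", "sponge", "tap"],
    contacts=["left hand - cup", "right hand - tap"],
)
print(prompt)
# The resulting string would then be sent to an LLM (e.g., a GPT-4 chat request)
# and the returned text parsed into an action program.
```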
LEAP's action programs are also applied to two practical computer vision tasks: action recognition and action anticipation. When the generated programs are included in the training process for these tasks, performance improves significantly; notably, the resulting model reaches first place on the EPIC Kitchens Action Recognition leaderboard among methods using only RGB (visual) input. This highlights the usefulness of the action programs for action understanding. To support further research and applications, the dataset of action programs generated from the EPIC Kitchens train set is publicly released.
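One plausible way to include the programs in training, sketched below under the assumption that the program text is first encoded into a fixed-size embedding and then fused with visual features, is a simple late-fusion classifier. The architecture, dimensions, and class count here are illustrative and not the paper's actual design.

```python
import torch
import torch.nn as nn

class ProgramConditionedClassifier(nn.Module):
    """Hypothetical fusion of video features with an embedding of the program text."""

    def __init__(self, vid_dim=2048, txt_dim=768, num_classes=97):
        super().__init__()
        # Concatenate video and program-text embeddings, then classify.
        self.fuse = nn.Sequential(
            nn.Linear(vid_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, video_feat, program_emb):
        return self.fuse(torch.cat([video_feat, program_emb], dim=-1))

# Toy forward pass with random tensors standing in for real feature extractors.
model = ProgramConditionedClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 97])
```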
Conclusion and Future Work
LEAP's innovative framework contributes a robust, explainable, and efficient way of interpreting complex actions through the novel use of LLMs, conditioned on diverse multimodal inputs converted to textual descriptions. Looking ahead, LEAP opens up possibilities for future works to explore new architectures and tasks that deepen the understanding of human actions and potentially enable learning from demonstration across different robotic platforms.