LEAP: LLM-Generation of Egocentric Action Programs (2312.00055v1)

Published 29 Nov 2023 in cs.CV, cs.LG, and cs.RO

Abstract: We introduce LEAP (illustrated in Figure 1), a novel method for generating video-grounded action programs through use of a LLM. These action programs represent the motoric, perceptual, and structural aspects of action, and consist of sub-actions, pre- and post-conditions, and control flows. LEAP's action programs are centered on egocentric video and employ recent developments in LLMs both as a source for program knowledge and as an aggregator and assessor of multimodal video information. We apply LEAP over a majority (87%) of the training set of the EPIC Kitchens dataset, and release the resulting action programs as a publicly available dataset here (https://drive.google.com/drive/folders/1Cpkw_TI1IIxXdzor0pOXG3rWJWuKU5Ex?usp=drive_link). We employ LEAP as a secondary source of supervision, using its action programs in a loss term applied to action recognition and anticipation networks. We demonstrate sizable improvements in performance in both tasks due to training with the LEAP dataset. Our method achieves 1st place on the EPIC Kitchens Action Recognition leaderboard as of November 17 among the networks restricted to RGB-input (see Supplementary Materials).

Summary

  • The paper introduces LEAP, which uses LLMs and multimodal text inputs to generate hierarchical, explainable action programs from egocentric videos.
  • It integrates audio, SLAM, narration, object detection, and hand-object contact data to detail sub-actions, pre- and post-conditions, and control flow.
  • LEAP significantly improves action recognition and anticipation, achieving top leaderboard performance on the EPIC Kitchens dataset.

Introduction to LEAP

Understanding and representing human actions from video is a long-standing goal in computer vision and artificial intelligence. LEAP (LLM-Generation of Egocentric Action Programs) approaches this problem by leveraging an LLM to parse egocentric videos and produce action programs that capture the steps and conditions required to carry out specific actions.

Components of LEAP

LEAP represents actions as hierarchical, compositional structures, analogous to the way human language is structured. These action programs are designed to capture the motoric, perceptual, and structural aspects of actions, and they consist of several elements (a toy data-structure sketch follows the list):

  • Sub-actions: The fine-grained steps involved in completing an action.
  • Pre-conditions: The specific conditions that must be met before a sub-action can be performed.
  • Post-conditions: The outcomes that result from performing a sub-action.
  • Control Flow: This defines the order and repetition of sub-actions, encapsulating loops and conditional statements.
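The paper does not prescribe a particular on-disk format for these programs, so the following is a minimal Python sketch of how such a structure could be represented. All class names, field names, and the example content are illustrative assumptions, not LEAP's actual schema.

```python
# Hypothetical sketch of an action-program structure; names and fields
# are illustrative assumptions, not LEAP's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubAction:
    name: str                  # e.g. "grasp(knife)"
    preconditions: List[str]   # states that must hold before execution
    postconditions: List[str]  # states that result from execution


@dataclass
class ActionProgram:
    goal: str                  # high-level action, e.g. "cut tomato"
    sub_actions: List[SubAction] = field(default_factory=list)
    control_flow: List[str] = field(default_factory=list)  # loops / conditionals over sub-actions


# Toy example of a program for "cut tomato"
program = ActionProgram(
    goal="cut tomato",
    sub_actions=[
        SubAction("grasp(knife)",
                  preconditions=["hand is empty", "knife is reachable"],
                  postconditions=["knife is in hand"]),
        SubAction("slice(tomato)",
                  preconditions=["knife is in hand", "tomato is on board"],
                  postconditions=["tomato is sliced"]),
    ],
    control_flow=["repeat slice(tomato) until tomato is fully sliced"],
)
```

Representing the program as explicit structured data rather than free text makes it straightforward to turn into supervision targets later.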

Generation of Action Programs

LEAP operates on video from the EPIC Kitchens dataset, a collection of first-person recordings covering a variety of kitchen-related tasks. Because LLMs such as GPT-4 do not natively process video, LEAP converts several modality-specific signals into text before passing them to the LLM. These inputs include:

  • Audio Extractor: Identifies sounds associated with actions.
  • SLAM (Simultaneous Localization and Mapping): Captures the camera and body motion of the person performing the action.
  • Narrations: Provide contextual information from what the actor says before, during, and after actions.
  • Object Detection (Faster R-CNN): Recognizes the objects interacted with during the action.
  • Hand-Object Contact Detector: Notes when hands make contact with objects.

Using these components, LEAP synthesizes complete action programs from the video segments.
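As a rough illustration of that text-conversion step, the sketch below assembles per-modality outputs into a single textual prompt for the LLM. The function name, prompt wording, and field layout are assumptions made for illustration; the paper's actual prompt format is not reproduced here.

```python
# Illustrative prompt assembly from per-modality text; the wording and
# structure are assumptions, not LEAP's actual prompt.
from typing import List


def build_prompt(narration: str,
                 detected_objects: List[str],
                 hand_contacts: List[str],
                 audio_events: List[str],
                 camera_motion: str) -> str:
    """Concatenate text renderings of each modality into one LLM prompt."""
    lines = [
        "Generate an action program (sub-actions, pre-/post-conditions, "
        "control flow) for the following egocentric video segment.",
        f"Narration: {narration}",
        f"Objects detected (Faster R-CNN): {', '.join(detected_objects)}",
        f"Hand-object contacts: {', '.join(hand_contacts)}",
        f"Audio events: {', '.join(audio_events)}",
        f"Camera/body motion (SLAM): {camera_motion}",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    narration="I pick up the knife and cut the tomato.",
    detected_objects=["knife", "tomato", "cutting board"],
    hand_contacts=["right hand touches knife"],
    audio_events=["chopping sound"],
    camera_motion="head turns toward the counter, then stays fixed",
)
# `prompt` would then be sent to the LLM (e.g. GPT-4) to generate the program.
```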

Application and Performance

LEAP is applied to two practical computer-vision tasks: action recognition and action anticipation. The generated action programs serve as a secondary source of supervision, contributing an additional loss term during training, and models trained this way improve significantly on both tasks. In particular, the resulting model achieves first place on the EPIC Kitchens Action Recognition leaderboard among methods restricted to RGB (visual) input, highlighting the usefulness of the action programs for action understanding. To support further research and applications, the dataset of action programs generated from the EPIC Kitchens training set is publicly released.
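The abstract indicates that the action programs enter training through an additional loss term applied to recognition and anticipation networks. The sketch below shows one generic way such auxiliary supervision could be wired into a PyTorch model; the extra program-prediction head, the cross-entropy form of the auxiliary loss, and its weighting are assumptions rather than the paper's exact formulation.

```python
# Generic auxiliary-supervision sketch; the program head, loss form, and
# weighting are assumptions, not the paper's exact training objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecognitionWithProgramHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_classes: int, program_vocab: int):
        super().__init__()
        self.backbone = backbone                           # any video encoder producing (B, feat_dim)
        self.cls_head = nn.Linear(feat_dim, num_classes)   # standard action classifier
        self.program_head = nn.Linear(feat_dim, program_vocab)  # predicts program-derived targets

    def forward(self, video: torch.Tensor):
        feats = self.backbone(video)
        return self.cls_head(feats), self.program_head(feats)


def total_loss(cls_logits, prog_logits, action_labels, program_targets,
               aux_weight: float = 0.5):
    # Standard recognition loss plus a weighted auxiliary term that pushes
    # the features to also predict LEAP-program information.
    return (F.cross_entropy(cls_logits, action_labels)
            + aux_weight * F.cross_entropy(prog_logits, program_targets))
```

In a setup like this, the program head is only an auxiliary training signal and can be dropped at inference time.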

Conclusion and Future Work

LEAP contributes a robust, explainable, and efficient way of interpreting complex actions by conditioning an LLM on diverse multimodal inputs converted to textual descriptions. Looking ahead, it opens possibilities for future work to explore new architectures and tasks that deepen the understanding of human actions, and potentially to enable learning from demonstration across different robotic platforms.
