Overview of BABEL: Bodies, Action and Behavior with English Labels
The paper introduces BABEL, a dataset designed to address the limitations of existing resources for understanding the semantics of human movement in computer vision. It fills a gap in the field by combining the quantitative precision of motion capture (mocap) data with the semantic richness of natural language labels.
Key Contributions
- Dataset Composition: BABEL annotates over 43 hours of mocap data from the AMASS dataset with language labels that describe human actions. It includes both high-level sequence labels summarizing the overall action and detailed frame labels specifying the action at each time step in the sequence. This yields over 28,000 sequence labels and 63,000 frame labels spanning more than 250 unique action categories.
- Resolution of Existing Gaps: BABEL addresses a significant gap in existing datasets, which provide either large quantities of action labels or precise 3D human motion, but not both. Video datasets are rich in action labels but lack precise 3D human motion; mocap datasets capture exact body motions but cover a limited range of actions. BABEL bridges the two by attaching semantic annotations directly to precise 3D mocap motion data.
- Applications and Challenges: The dataset is intended to serve as a benchmark for tasks such as action recognition, temporal action localization, and motion synthesis. Notably, the frame-level labels pose learning challenges characteristic of real-world scenarios: actions can overlap in time, requiring models to handle concurrent actions and transitions between them.
- Benchmarking and Implications: The paper evaluates model performance on 3D action recognition tasks using BABEL, emphasizing its potential as a robust benchmark for progress in this domain. The dataset's complexity and richness can enhance model training and lead to advances in AI's ability to understand human movement.
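Because frame labels can overlap in time, a natural training target is a per-frame multi-hot vector rather than a single class per frame. The sketch below illustrates this idea; the segment schema and field names are hypothetical and do not reflect BABEL's actual annotation format.

```python
# Hypothetical frame-level annotations: each segment names an action with
# start/end times in seconds. Overlaps (e.g. waving while walking) are allowed.
segments = [
    {"action": "walk",  "start": 0.0, "end": 4.0},
    {"action": "wave",  "start": 2.0, "end": 3.0},  # overlaps "walk"
    {"action": "stand", "start": 4.0, "end": 6.0},
]

actions = sorted({s["action"] for s in segments})  # fixed label vocabulary
fps = 30
num_frames = int(max(s["end"] for s in segments) * fps)

# Multi-hot matrix of shape (num_frames, num_actions): overlapping segments
# simply set multiple columns in the same rows.
labels = [[0] * len(actions) for _ in range(num_frames)]
for seg in segments:
    col = actions.index(seg["action"])
    for f in range(int(seg["start"] * fps), int(seg["end"] * fps)):
        labels[f][col] = 1
```

A frame at t = 2.5 s ends up labeled with both "walk" and "wave", which is why models trained on such data must treat per-frame prediction as multi-label classification rather than forcing a single action per time step.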
Implications for AI and Future Directions
The BABEL dataset promises significant practical and theoretical impact. Practically, it provides a more comprehensive resource for training AI systems on real-world human movement, improving systems for human-computer interaction, robotics, and surveillance. Theoretically, BABEL encourages the development of models capable of more nuanced interpretation of human behavior, fostering progress in areas such as hierarchical action understanding, pose prediction, and motion synthesis.
As research progresses, BABEL lays the groundwork for future investigations into improving AI's grasp of human movement by integrating semantic and quantitative data. Enhancements could involve expanding the dataset to include more diverse actors and settings or exploring unsupervised learning techniques to further harness the dataset's wealth of information. Additionally, BABEL could serve as a foundational resource for interdisciplinary studies into the cognitive and social dimensions of motion and behavior.
In conclusion, BABEL sets a new standard for human movement datasets by marrying detailed quantitative motion data with rich semantic annotations. It marks a noteworthy stride in equipping AI systems with the tools needed for a deeper and more comprehensive understanding of human actions.