- The paper introduces ProsePose, a zero-shot framework that uses large multimodal models to enforce physical contact constraints in 3D human pose estimation.
- It converts language model–derived contact constraints into loss functions, reducing reliance on costly motion capture and manual annotations.
- Experimental results show significant reductions in joint error and gains in contact-point accuracy across datasets including Hi4D, FlickrCI3D, and CHI3D.
Pose Priors from LLMs
Overview
The paper "Pose Priors from LLMs" introduces ProsePose, a zero-shot pose-optimization framework that leverages large multimodal models (LMMs) to enforce physical contact constraints in 3D human pose estimation. The key insight is that language models, pretrained on extensive textual data, encode a semantic prior over how human bodies touch and interact. Tapping this prior circumvents the need for expensive training data involving motion capture or manually annotated contact points, which state-of-the-art methods typically require.
Methodology
ProsePose operates in three stages:
- Pose Initialization: An initial estimate of the 3D pose is obtained using a regression-based model.
- Constraint Generation with LMM: An LMM generates contact constraints by analyzing the input image and outputting plausible physical contact points between different body parts. These constraints are then converted into loss functions.
- Constrained Pose Optimization: The generated loss functions, along with additional predefined losses, are used to refine the initial pose estimates to accurately reflect physical contact constraints.
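The constraint-to-loss conversion in the second stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical `REGION_VERTICES` lookup mapping named body regions to vertex indices on a SMPL-style mesh, and scores each LMM-predicted contact pair by the minimum distance between the two regions' vertices.

```python
import numpy as np

# Hypothetical region-to-vertex lookup for a SMPL-style mesh; in practice
# such index sets come from a predefined body-part segmentation.
REGION_VERTICES = {
    "left_hand": np.array([0, 1, 2]),
    "right_shoulder": np.array([3, 4, 5]),
}

def contact_loss(verts_a, verts_b, constraints):
    """Sum, over each (region_a, region_b) contact constraint, the minimum
    pairwise distance between the two regions' mesh vertices. Driving this
    loss toward zero pulls the constrained regions into contact."""
    total = 0.0
    for region_a, region_b in constraints:
        pa = verts_a[REGION_VERTICES[region_a]]  # (Na, 3) vertices, person A
        pb = verts_b[REGION_VERTICES[region_b]]  # (Nb, 3) vertices, person B
        # All pairwise Euclidean distances between the two vertex sets.
        dists = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
        total += dists.min()
    return total
```

During optimization, this term would be minimized jointly with the additional predefined losses (e.g., a 2D reprojection term and a pose prior) with respect to the body-model parameters.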
Experimental Results
The authors validated ProsePose on several datasets, including Hi4D, FlickrCI3D, and CHI3D for two-person interactions, and MOYO for single-person complex yoga poses. The results demonstrate that ProsePose significantly improves over existing zero-shot baselines, reducing Procrustes-aligned joint error (PA-MPJPE) and increasing the percentage of correct contact points (PCC).
- On Hi4D, ProsePose reduced joint PA-MPJPE from the heuristic baseline's 116mm to 93mm.
- On the FlickrCI3D dataset, ProsePose achieved a joint PA-MPJPE of 58mm and an average PCC of 79.9%, outperforming the heuristic baseline's 67mm and 77.8%, respectively.
- On the CHI3D dataset, ProsePose achieved an average PCC of 75.8%, showing improvement over the heuristic baseline's 74.1%.
- On MOYO, ProsePose achieved PA-MPJPE comparable to the HMR2+opt baseline while significantly improving PCC, indicating better recognition of self-contact points.
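For reference, the PA-MPJPE metric reported above is standard and can be computed as below: the predicted joints are aligned to the ground truth with an optimal similarity transform (orthogonal Procrustes) before averaging per-joint distances. This is a generic sketch of the metric, not code from the paper.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    Finds the rotation, uniform scale, and translation that best map the
    predicted (J, 3) joints onto the ground truth, then returns the mean
    joint-wise Euclidean distance after alignment."""
    # Center both joint sets.
    p = pred - pred.mean(axis=0)
    g = gt - gt.mean(axis=0)
    # Optimal rotation via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    # Guard against an improper rotation (reflection).
    if np.linalg.det(R) < 0:
        Vt[-1] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    # Optimal uniform scale for this rotation.
    s = S.sum() / (p ** 2).sum()
    aligned = s * p @ R.T + gt.mean(axis=0)
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes global rotation, scale, and translation, PA-MPJPE isolates errors in articulated pose, which is why it is paired with PCC: the latter checks whether predicted contacts actually occur, something joint error alone does not capture.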
Implications
ProsePose demonstrates that LMMs can be effectively used to guide 3D human pose optimization, leveraging the semantic understanding embedded within these models. This approach can be applied without additional training, making it a practical solution for scenarios with limited access to annotated data.
Theoretically, this work highlights the potential of language models to understand and reason about physical interactions learned from textual data. Practically, it provides a flexible framework for improving pose estimation in diverse applications, including human-computer interaction, animation, and robotics, where precise capture of human poses and contacts is crucial.
Future Directions
While ProsePose has shown promising results, the reliance on LMMs introduces potential issues such as hallucination and bias toward commonly represented poses in the training data. Future developments could explore:
- Fine-tuning LMMs specifically for pose estimation tasks.
- Integrating additional priors or constraints to mitigate hallucination effects.
- Extending the method to more complex interactions involving more than two individuals.
Overall, this approach opens new avenues for enhancing pose estimation frameworks by incorporating rich semantic priors from language models, suggesting a broader utility of LMMs in computer vision and pose estimation tasks.