PoseFix: Correcting 3D Human Poses with Natural Language (2309.08480v2)
Abstract: Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help with assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback describing how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, which aims to generate corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses.
Authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez