Aligning Actions and Walking to LLM-Generated Textual Descriptions (2404.12192v1)
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at https://github.com/Radu1999/WalkAndText
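To make the idea of aligning motion with text concrete, here is a minimal sketch of how recognition or retrieval could work once a motion sequence and a set of textual descriptions have been embedded in a shared space. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the encoders or the similarity measure, so the cosine similarity, the function names, the action labels, and the hand-made 3-dimensional vectors below are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate keys sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)


# Hypothetical embeddings of LLM-generated descriptions, one per action label.
# In practice a text encoder would produce these; the values here are made up.
text_embeddings = {
    "walk": [0.9, 0.1, 0.0],
    "jump": [0.0, 1.0, 0.2],
    "sit": [0.1, 0.0, 1.0],
}

# Hypothetical embedding of one motion sequence from a motion encoder.
motion_embedding = [0.8, 0.3, 0.1]

# Recognition: pick the label whose description is closest to the motion.
ranking = rank_by_similarity(motion_embedding, text_embeddings)
predicted_action = ranking[0]
```

The same ranking function could serve the retrieval direction by swapping roles: embed a textual query about appearance attributes and rank a dictionary of walking-sequence embeddings against it.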
Authors:
- Radu Chivereanu
- Adrian Cosma
- Razvan Rughinis
- Andy Catruna
- Emilian Radoi