
Cross-Modal Retrieval for Motion and Text via DropTriple Loss (2305.04195v3)

Published 7 May 2023 in cs.CV and cs.CL

Abstract: Cross-modal retrieval of image-text and video-text is a prominent research area in computer vision and natural language processing. However, cross-modal retrieval between human motion and text has received comparatively little attention, despite its wide-ranging applicability. To address this gap, we employ a concise yet effective dual-unimodal transformer encoder for this task. Recognizing that overlapping atomic actions in different human motion sequences can cause semantic conflicts between samples, we propose a novel triplet loss function called DropTriple Loss. This loss discards false negative samples from the negative sample set and mines the remaining genuinely hard negatives for triplet training, thereby reducing the margin violations that false negatives would otherwise cause. We evaluate our model and approach on the HumanML3D and KIT Motion-Language datasets. On the latest HumanML3D dataset, we achieve a recall of 62.9% for motion retrieval and 71.5% for text retrieval (both R@10). The source code for our approach is publicly available at https://github.com/eanson023/rehamot.
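The core idea the abstract describes — discard suspected false negatives from the negative set before mining the hardest remaining negative for the triplet hinge — can be illustrated with a minimal sketch. This is a hedged illustration, not the paper's exact formulation: the dropping criterion here (a negative is dropped when its similarity exceeds `drop_thresh` times the positive's similarity) and the names `droptriple_style_loss` and `drop_thresh` are illustrative assumptions.

```python
def droptriple_style_loss(sim, margin=0.2, drop_thresh=0.9):
    """Sketch of a DropTriple-style hinge loss over a similarity matrix.

    sim: N x N list of lists; sim[i][j] is the similarity between
    motion i and text j, so diagonal entries are the matched positives.
    Negatives scoring above drop_thresh * positive are treated as
    likely false negatives and discarded (illustrative rule only);
    the hardest surviving negative drives the triplet margin.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]  # similarity of the matched pair
        # Exclude the positive itself and drop suspected false negatives.
        kept = [s for j, s in enumerate(sim[i])
                if j != i and s < drop_thresh * pos]
        if kept:
            # Standard hard-negative triplet hinge on the survivors.
            total += max(0.0, margin + max(kept) - pos)
    return total / n
```

With a near-duplicate negative (e.g. two motions sharing the same atomic actions, scoring 0.88 against a positive of 0.9), the duplicate is dropped and contributes no loss, whereas a plain hardest-negative triplet loss would penalize it.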

Authors (6)
  1. Sheng Yan
  2. Yang Liu
  3. Haoqiang Wang
  4. Xin Du
  5. Mengyuan Liu
  6. Hong Liu
Citations (7)