
G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model (2208.09141v3)

Published 19 Aug 2022 in cs.CV

Abstract: The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task by converting the continuous pose space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation for continuous pose sequences. Additionally, we propose the G2P-DDM model, a discrete denoising diffusion architecture for length-varied discrete sequence data, to model the latent prior. To further enhance the quality of pose sequence generation in the discrete space, we present the CodeUnet model to leverage spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict variable lengths of pose sequences for corresponding gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm
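
The core idea of turning continuous poses into a discrete sequence can be illustrated with plain nearest-neighbour vector quantization. The sketch below is not the paper's Pose-VQVAE: the function names `quantize` and `dequantize`, the 2-D toy "poses", and the three-entry codebook are all hypothetical, and a real model would learn the codebook jointly with an encoder and decoder.

```python
import math

def quantize(frames, codebook):
    """Map each continuous vector to the index of its nearest codebook entry."""
    indices = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook vector.
        distances = [math.dist(frame, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def dequantize(indices, codebook):
    """Recover an approximate continuous sequence from discrete tokens."""
    return [codebook[i] for i in indices]

# Toy data (assumed for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.8, 0.1), (1.1, 0.9)]

tokens = quantize(frames, codebook)
reconstruction = dequantize(tokens, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Once poses are represented as token indices like these, a discrete sequence model (in the paper, a discrete diffusion model) can be trained over the tokens instead of over raw coordinates.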
