
Neural Sign Actors: A diffusion model for 3D sign language production from text (2312.02702v2)

Published 5 Dec 2023 in cs.CV

Abstract: Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However, Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data, which hinders their realism. In this work, a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. This work makes an important step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities.

Significance and Challenges of Sign Language Production (SLP)

Sign language is the primary mode of communication for the Deaf and Hard of Hearing communities. Despite advancements in recognition and translation, producing realistic sign language through computer vision poses significant challenges. Many existing methods depend on 2D data, limiting their ability to capture the full complexity of sign language, which features a combination of manual gestures and non-manual elements like facial expressions and body movements.

Innovative Approach to 3D Sign Language Production

The paper introduces a model that generates three-dimensional sign language sequences from text input using a diffusion process. The denoising network is a graph neural network defined on the SMPL-X body skeleton, so its structure follows the anatomy of the body and hands, which lets it produce dynamic and anatomically plausible signing avatars.
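The two ingredients named above, a diffusion process and a graph defined on a body skeleton, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard DDPM forward (noising) process with a linear variance schedule, and a hypothetical five-joint toy tree (`TOY_EDGES`) standing in for the much larger SMPL-X skeleton. The function names and the parameter-free neighbor averaging are illustrative assumptions; a real graph network would use learned weights.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, linearly spaced (a common DDPM choice)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


# Hypothetical toy kinematic tree (assumption); SMPL-X has many more joints.
TOY_EDGES = [
    ("spine", "neck"),
    ("spine", "left_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
]


def neighbor_average(features, edges):
    """One parameter-free message-passing step: mean over a joint and its neighbors."""
    neighbors = {joint: [joint] for joint in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {j: sum(features[k] for k in ns) / len(ns) for j, ns in neighbors.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    # One scalar "pose parameter" per toy joint, noised at an intermediate step.
    joints = ["spine", "neck", "left_shoulder", "left_elbow", "left_wrist"]
    clean = [0.1, -0.2, 0.4, 0.9, 0.3]
    noisy, _ = add_noise(clean, 500, alpha_bars, rng)
    smoothed = neighbor_average(dict(zip(joints, noisy)), TOY_EDGES)
    for joint in joints:
        print(joint, round(smoothed[joint], 3))
```

In a trained model, a network of this graph-structured kind would be asked to predict the added noise from the noisy pose, the step index, and the text conditioning; the sketch only shows how the noisy inputs are formed and how information can flow along skeleton edges.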

Creation of a Comprehensive 3D Dataset

To support the training of the model, researchers have developed the first large-scale dataset of 3D sign language, annotated with detailed SMPL-X parameters. The dataset is derived from the existing How2Sign dataset and includes high-fidelity reconstructions of signing avatars paired with their text transcripts. The reconstruction pipeline surpasses previous methods in accuracy by applying a novel pose optimization constrained by realistic human pose priors.
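The idea of pose optimization constrained by a pose prior can be shown on a deliberately tiny example. The sketch below is an assumption-laden toy, not the paper's pipeline: it fits a single joint angle to one observed 2D keypoint on a unit-length bone, and it uses a quadratic joint-limit penalty as a stand-in for a learned human pose prior. All names (`fit_angle`, `joint_limit_penalty_grad`) and the weights are hypothetical.

```python
import math


def joint_limit_penalty_grad(theta, lo, hi):
    """Gradient of a quadratic penalty that is zero inside [lo, hi] (stand-in prior)."""
    if theta < lo:
        return 2.0 * (theta - lo)
    if theta > hi:
        return 2.0 * (theta - hi)
    return 0.0


def fit_angle(obs_xy, lo, hi, prior_weight=10.0, lr=0.05, steps=2000, theta0=0.0):
    """Gradient descent on keypoint error plus weighted prior penalty."""
    ox, oy = obs_xy
    theta = theta0
    for _ in range(steps):
        px, py = math.cos(theta), math.sin(theta)  # predicted bone endpoint
        # d/dtheta of (px - ox)^2 + (py - oy)^2
        grad_data = 2.0 * (px - ox) * (-math.sin(theta)) + 2.0 * (py - oy) * math.cos(theta)
        grad = grad_data + prior_weight * joint_limit_penalty_grad(theta, lo, hi)
        theta -= lr * grad
    return theta


if __name__ == "__main__":
    # Observation consistent with the limits: the data term decides the answer.
    print(fit_angle((math.cos(0.8), math.sin(0.8)), lo=0.0, hi=2.0))
    # Observation that implies an implausible angle: the prior pulls the fit
    # back toward the upper joint limit.
    print(fit_angle((math.cos(2.5), math.sin(2.5)), lo=0.0, hi=1.5))
```

The design point is the trade-off: when the image evidence is plausible the prior contributes nothing, and when it is not (for example under occlusion or detector noise) the prior keeps the reconstruction within realistic poses.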

Evaluation and Impact

The model is evaluated against several benchmarks and outperforms current state-of-the-art approaches in generating sign language from text. The gains include more accurate hand articulations and body movements, as well as better alignment with the meaning of the text. A user study involving individuals fluent in American Sign Language further validates the model's efficacy, with generated signs achieving high accuracy in reflecting the intended message.

In summary, the paper presents an advancement in bridging the communication gap for the Deaf and Hard of Hearing, with a text-to-sign generation model that produces more realistic signing avatars. This progress highlights the potential of diffusion models and graph neural networks in improving accessibility through technology.

Authors (6)
  1. Vasileios Baltatzis (12 papers)
  2. Rolandos Alexandros Potamias (19 papers)
  3. Evangelos Ververas (11 papers)
  4. Guanxiong Sun (6 papers)
  5. Jiankang Deng (96 papers)
  6. Stefanos Zafeiriou (137 papers)
Citations (8)