A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars
This paper presents a method for translating spoken languages into sign languages, referred to as Spoken2Sign translation, and uses a 3D avatar to display the results. The approach builds on existing frameworks, which have traditionally focused on the opposite direction: recognizing sign language and translating it into spoken language (Sign2Spoken). This reverse direction matters because it lets translation technology support communication toward deaf users as well, helping bridge the gap between the deaf and hearing communities.
The authors propose a three-step methodology to enable this translation system:
- Gloss-Video Dictionary Construction: The method first builds a gloss-video dictionary as an intermediate representation for translation. Continuous sign videos from established CSLR datasets are segmented into isolated signs using a state-of-the-art Continuous Sign Language Recognition (CSLR) model, TwoStream-SLR. Although existing Isolated Sign Language Recognition (ISLR) datasets could themselves serve as sign language dictionaries, they lack the parallel text and gloss-sequence annotations needed to train the translation components, which CSLR datasets provide (a sketch of this segmentation step follows the list).
- 3D Sign Estimation: Each sign video in the gloss-video dictionary is then lifted to 3D to build a gloss-3D sign dictionary. The authors extend the widely used SMPLify-X fitting method with modifications tailored to signing, notably a temporal-consistency term and a 3D human pose prior focused on signing postures. The resulting method, dubbed SMPLSign-X, yields smoother and more accurate 3D sign estimates, which is essential when signs are later stitched together with co-articulations (see the fitting sketch after this list).
- Spoken2Sign Translation: The core translation procedure converts input text into a gloss sequence with a Text-to-Gloss (Text2Gloss) translator based on mBART, a model known for its strong sequence-to-sequence capabilities. The corresponding 3D signs are then retrieved from the gloss-3D sign dictionary, and a Sign Connector predicts the duration of the co-articulation between adjacent 3D signs, improving visual flow and making the rendered signing look more natural (a stitching sketch also follows the list).
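To make the dictionary-construction step concrete, below is a minimal Python sketch of cutting continuous sign videos into isolated signs from frame-level gloss alignments. It assumes the CSLR model (e.g. TwoStream-SLR) exposes per-frame gloss labels and uses a hypothetical `"<blank>"` symbol for non-sign frames; the paper's actual segmentation procedure may differ in detail.

```python
from collections import defaultdict
import itertools

def build_gloss_video_dictionary(videos, frame_alignments):
    """Group consecutive frames that share a gloss label into isolated sign clips.

    videos:           list of video identifiers
    frame_alignments: dict mapping video id -> list of per-frame gloss labels,
                      e.g. produced by a CSLR model such as TwoStream-SLR
                      (the exact alignment interface is an assumption here).
    """
    dictionary = defaultdict(list)  # gloss -> list of (video_id, start_frame, end_frame)
    for vid in videos:
        labels = frame_alignments[vid]
        frame = 0
        for gloss, run in itertools.groupby(labels):
            length = len(list(run))
            if gloss != "<blank>":            # skip blank / transition frames (assumed label)
                dictionary[gloss].append((vid, frame, frame + length - 1))
            frame += length
    return dictionary

# Usage: keep one exemplar clip per gloss, e.g. the longest segment.
# exemplars = {g: max(clips, key=lambda c: c[2] - c[1]) for g, clips in dictionary.items()}
```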
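The 3D sign estimation step can be illustrated with a small PyTorch sketch of a temporal-consistency term added to an SMPLify-X-style fitting objective. The function names, weights, and the data/prior terms below are assumptions for illustration, not the authors' exact SMPLSign-X formulation.

```python
import torch

def temporal_smoothness_loss(poses: torch.Tensor) -> torch.Tensor:
    """Penalize frame-to-frame jitter in per-frame pose parameters.

    poses: (T, D) tensor of pose parameters (e.g. body and hand rotations)
           fitted per frame by an SMPLify-X-style optimizer.
    """
    # Squared difference between consecutive frames encourages smooth motion.
    return ((poses[1:] - poses[:-1]) ** 2).sum(dim=-1).mean()

def fitting_objective(poses, data_term, pose_prior,
                      w_data=1.0, w_prior=0.1, w_smooth=10.0):
    """Illustrative total loss: a keypoint re-projection data term,
    a signing-pose prior, and the temporal term above (weights are placeholders)."""
    return (w_data * data_term(poses)
            + w_prior * pose_prior(poses)
            + w_smooth * temporal_smoothness_loss(poses))
```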
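For the translation step itself, Text2Gloss is a standard fine-tuned seq2seq model (mBART), so the sketch below focuses on the more distinctive part: a Sign Connector that predicts the co-articulation duration between adjacent 3D signs, and a stitching routine that inserts interpolated transition frames. The connector's inputs and architecture, and the linear interpolation, are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SignConnector(nn.Module):
    """Predicts how many transition frames to insert between two adjacent 3D signs.

    Sketch only: here it sees the last pose of the preceding sign and the first
    pose of the following sign; the real model's inputs may differ.
    """
    def __init__(self, pose_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),   # non-negative duration
        )

    def forward(self, end_pose, start_pose):
        return self.net(torch.cat([end_pose, start_pose], dim=-1))

def stitch_signs(sign_sequences, connector):
    """Concatenate retrieved 3D signs, inserting linearly interpolated
    co-articulation frames whose count is predicted by the connector.

    sign_sequences: list of (T_i, D) pose tensors retrieved from the dictionary.
    """
    frames = [sign_sequences[0]]
    for prev, nxt in zip(sign_sequences[:-1], sign_sequences[1:]):
        n = int(connector(prev[-1], nxt[0]).round().item())
        if n > 0:
            alphas = torch.linspace(0, 1, n + 2)[1:-1].unsqueeze(-1)
            frames.append((1 - alphas) * prev[-1] + alphas * nxt[0])
        frames.append(nxt)
    return torch.cat(frames, dim=0)
```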
The proposed translation system was evaluated on the Phoenix-2014T and CSL-Daily datasets using a back-translation evaluator, with reported BLEU-4 and ROUGE scores surpassing state-of-the-art methods. In addition, a qualitative user study with deaf participants rated the translation results highly for naturalness and accuracy, illustrating the practical applicability of the method. A sketch of the back-translation protocol is given below.
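A minimal sketch of back-translation evaluation: translate the produced signs back to text with a Sign2Spoken model and score against the original references with standard BLEU-4 and ROUGE-L implementations. The `sign2text` interface is a placeholder assumption, not the evaluator used in the paper.

```python
import sacrebleu
from rouge_score import rouge_scorer

def back_translation_scores(rendered_signs, reference_texts, sign2text):
    """Score Spoken2Sign output by translating the produced signs back to text
    and comparing against the original spoken-language references.

    sign2text: a Sign2Spoken (back-translation) model; its interface is assumed here.
    """
    hypotheses = [sign2text(s) for s in rendered_signs]
    bleu4 = sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(ref, hyp)["rougeL"].fmeasure
        for ref, hyp in zip(reference_texts, hypotheses)
    ) / len(reference_texts)
    return {"BLEU-4": bleu4, "ROUGE-L": rouge_l}
```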
Furthermore, the authors highlight two by-products of the system: 3D keypoint augmentation and multi-view understanding. Because the dictionary provides 3D sign data, 3D keypoints can be perturbed and re-projected as an augmentation strategy that improves keypoint-based models on other sign language understanding tasks such as Sign Language Recognition (SLR) and Sign Language Translation (SLT); an illustrative sketch follows.
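Below is an illustrative sketch of the 3D keypoint augmentation idea: apply a random rotation to the estimated 3D keypoints and re-project them to 2D before feeding a keypoint-based SLR/SLT model. The choice of a yaw rotation, its range, and the orthographic projection are assumptions, not the paper's exact recipe.

```python
import numpy as np

def rotate_and_project(keypoints_3d: np.ndarray, max_yaw_deg: float = 30.0) -> np.ndarray:
    """Augment a (T, K, 3) sequence of 3D keypoints with a random rotation
    about the vertical (y) axis, then orthographically project back to 2D.
    """
    yaw = np.deg2rad(np.random.uniform(-max_yaw_deg, max_yaw_deg))
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation about the y-axis (assumed to be the signer's vertical axis).
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    rotated = keypoints_3d @ rot.T          # (T, K, 3)
    return rotated[..., :2]                 # orthographic projection to 2D
```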
The paper provides a structured, effective baseline for developing robust Spoken2Sign systems and points to future research directions in sign language processing, particularly in improving the realism and utility of avatar-based signing. The implications are not only technological but also societal, since such systems could broaden access to effective communication tools within the deaf community. Future work may address data scarcity and improve the accuracy of 3D joint estimation, both of which would strengthen the overall translation pipeline.