A Study on Multi-Modality Transfer Learning for Improved Sign Language Translation
The paper presents a meticulously designed approach for enhancing sign language translation (SLT) through a novel multi-modality transfer learning framework, aimed at overcoming the data scarcity that impairs current systems. The authors propose a transfer learning methodology that effectively leverages vast external resources from both visual and linguistic domains, demonstrating its superiority over existing methods, including those employing semi-supervised learning or sophisticated data augmentation techniques.
Overview of the Methodology
The framework divides SLT into two sub-tasks: (i) Sign2Gloss, which transforms sign language videos into gloss sequences, and (ii) Gloss2Text, which translates glosses into spoken-language sentences. This division allows each component to be pretrained on large-scale datasets from related domains before being fine-tuned on sign-language-specific data. The visual encoder is pretrained progressively, first on Kinetics-400 to capture generic human-action features and then on WLASL to capture fine-grained, gloss-level detail. The translation model builds on mBART, a pretrained multilingual sequence-to-sequence model, which is further fine-tuned on gloss-to-text pairs to align it with the target domain, as sketched below.
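To make the Gloss2Text stage concrete, here is a minimal sketch of fine-tuning an mBART checkpoint on gloss-to-sentence pairs with the Hugging Face transformers library. The checkpoint name, language codes, learning rate, and the toy gloss/sentence pair are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal Gloss2Text fine-tuning sketch (assumed settings, single training step).
import torch
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained(
    "facebook/mbart-large-cc25", src_lang="de_DE", tgt_lang="de_DE"
)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Hypothetical German Sign Language glosses and the corresponding German sentence.
glosses = ["MORGEN REGEN WAHRSCHEINLICH"]
sentences = ["morgen regnet es wahrscheinlich"]

# Tokenize source glosses and target sentences; `labels` are produced automatically.
batch = tokenizer(glosses, text_target=sentences, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
loss = model(**batch).loss  # standard sequence-to-sequence cross-entropy
loss.backward()
optimizer.step()
```

In practice this step would loop over the full gloss-annotated corpus; the sketch only shows how gloss sequences are fed to mBART as ordinary source text.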
The integration of a Visual-Language Mapper (V-L Mapper) is particularly noteworthy, as it bridges the traditionally isolated visual and linguistic models. The mapper projects features from the visual encoder into the input space of the translation network, enabling joint end-to-end training that outperforms the cascaded Sign2Gloss followed by Gloss2Text (Sign2Gloss2Text) pipeline.
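The following is a minimal sketch of the V-L Mapper idea: a small MLP that maps per-frame visual features into the translation model's embedding space so that both networks can be optimized jointly. The dimensions and the two-layer design are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of a visual-to-language feature mapper (assumed dimensions).
import torch
import torch.nn as nn

class VisualLanguageMapper(nn.Module):
    def __init__(self, visual_dim: int = 832, language_dim: int = 1024,
                 hidden_dim: int = 1024):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, language_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, time, visual_dim) from the sign video encoder
        # returns:         (batch, time, language_dim) for the translation network
        return self.project(visual_features)

# Usage sketch: feed mapped features to the translation model as input embeddings,
# so gradients flow from the translation loss back into the visual encoder, e.g.
#   mapper = VisualLanguageMapper()
#   inputs_embeds = mapper(video_features)                        # (B, T, 1024)
#   loss = mbart(inputs_embeds=inputs_embeds, labels=target_ids).loss
```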
Numerical Results and Claims
The paper reports substantial improvements in translation metrics such as BLEU and ROUGE on two standard benchmarks, PHOENIX-2014T and CSL-Daily. In particular, the proposed framework achieves a BLEU-4 score of 28.39 on the PHOENIX-2014T test set, notably exceeding other state-of-the-art methods. These strong results are attributed to the progressive pretraining scheme combined with joint optimization through the V-L Mapper, which carries fine-grained representations of continuous sign language into the translation model.
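For readers reproducing such numbers, corpus-level BLEU-4 (the metric behind the reported 28.39) can be computed with sacrebleu as sketched below; the hypothesis and reference strings are placeholders, not outputs of the paper's model.

```python
# Sketch of corpus-level BLEU-4 scoring with sacrebleu (placeholder sentences).
import sacrebleu

hypotheses = ["morgen regnet es wahrscheinlich im norden"]
references = [["morgen regnet es voraussichtlich im norden"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # uses 4-gram BLEU by default
print(f"BLEU-4: {bleu.score:.2f}")
```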
Implications and Future Directions
The significance of the research lies in its simplicity and efficiency, which establish a strong baseline for further work in SLT. By reusing existing models and datasets from overlapping domains, the paper shows that systems for bridging communication gaps between deaf and hearing populations can be both cost-effective and high-performing. The model's extensibility to various sign languages, through multilingual components such as mBART, positions it as a versatile candidate for future applications in diverse linguistic contexts.
The research opens avenues for further advancements in SLT through the exploration of richer contextual models that can process both visual subtleties and nuanced linguistic structures. There is also potential for eliminating the reliance on gloss annotations, which could simplify the translation process and enhance scalability.
In conclusion, the paper introduces a potent approach to SLT with significant improvements over existing methods, underpinned by a clear and direct application of transfer learning techniques. The strategic harnessing of multi-modal data underscores the promising trajectory for advancements in language translation technologies.