
CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning (2303.12793v1)

Published 22 Mar 2023 in cs.CV

Abstract: This work focuses on sign language retrieval-a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos, not only contain visual signals but also carry abundant semantic meanings by themselves due to the fact that sign languages are also natural languages. Considering this character, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed as cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue-sign language datasets are orders of magnitude smaller in scale than that of speech recognition. We alleviate this issue by adopting a domain-agnostic sign encoder pre-trained on large-scale sign videos into the target domain via pseudo-labeling. Our framework, termed as domain-aware sign language retrieval via Cross-lingual Contrastive learning or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on PHOENIX-2014T dataset. Code and models are available at: https://github.com/FangyunWei/SLRT.


Summary

  • The paper presents CiCo, a novel cross-lingual contrastive learning framework that aligns sign videos and text for improved retrieval accuracy.
  • It leverages a domain-agnostic pre-trained sign encoder refined with pseudo-labels to overcome data scarcity in sign language datasets.
  • Empirical results demonstrate substantial gains, including 22.4- and 28.0-point R@1 improvements for T2V and V2T retrieval on How2Sign, respectively.

Cross-Lingual Contrastive Learning for Sign Language Retrieval

The paper "CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning" presents a novel approach to the task of sign language retrieval, which is a critical component of sign language understanding. This task is bifurcated into two retrieval challenges: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. The uniqueness of sign language retrieval stems from its dual nature: it is both a video-text retrieval task and a cross-lingual retrieval task, given that sign languages are natural languages with distinct linguistic structures.

Methodological Innovations

The proposed framework, CiCo, introduces a cross-lingual contrastive learning (CLCL) algorithm, designed to handle the linguistic nuances of sign languages while addressing the data scarcity typical of sign language datasets. CiCo capitalizes on the semantic granularity inherent to sign languages by identifying fine-grained sign-to-word mappings between sign videos and corresponding textual descriptions. This is achieved by embedding both sign video features and text into a shared space using contrastive learning techniques.
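
The CLCL objective builds on a standard symmetric video-text contrastive loss while additionally weighting fine-grained sign-to-word similarities. The sketch below shows only the coarse CLIP-style InfoNCE term that such frameworks typically start from; the tensor names and temperature value are assumptions for exposition, not details from the paper:

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired sign clips and sentences.

    video_emb, text_emb: (B, D) outputs of the sign and text encoders.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                # (B, B) scaled similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, labels)    # each video matches its text
    loss_t2v = F.cross_entropy(logits.T, labels)  # each text matches its video
    return (loss_v2t + loss_t2v) / 2
```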

To counteract the challenge of limited data, CiCo leverages a domain-agnostic sign encoder pre-trained on large-scale sign videos and refines it on the target domain via pseudo-labeling, yielding a domain-aware encoder. The resulting model combines the robust feature extraction of large-scale pre-training with domain-specific adaptation, producing high-quality representations and efficient training even on small datasets.
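
As an illustration of this adaptation step, the sketch below pseudo-labels unlabeled target-domain clips with the pre-trained encoder and fine-tunes on the confident ones. The encoder interface, the confidence threshold, and the classification-style objective are all assumptions for exposition, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(encoder, clips: torch.Tensor, threshold: float = 0.8):
    """Assign pseudo-labels to target-domain clips with a pre-trained encoder.

    Assumes `encoder` returns per-clip logits over its pre-training sign
    vocabulary; only confident predictions are kept.
    """
    probs = encoder(clips).softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold                  # drop low-confidence clips
    return clips[keep], labels[keep]

def adapt_step(encoder, optimizer, clips, pseudo_labels):
    """One fine-tuning step on pseudo-labeled target-domain data."""
    loss = F.cross_entropy(encoder(clips), pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```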

Empirical Results

The empirical results are noteworthy, demonstrating CiCo's superiority over the pioneering method, SPOT-ALIGN, in both T2V and V2T retrieval across multiple datasets. On How2Sign, CiCo improves R@1 over the baseline by 22.4 points for T2V retrieval and 28.0 points for V2T retrieval; on PHOENIX-2014T, the corresponding gains are 13.7 and 17.1 points. These results underscore CiCo's efficacy in capturing the intricate semantic relationships intrinsic to sign language communication.

Theoretical and Practical Implications

The theoretical implications of this work lie in its formulation of sign language retrieval as a combined video-text and cross-lingual retrieval problem. This perspective encourages the exploration of retrieving semantically rich sign videos via textual queries, leveraging linguistic structures rather than relying solely on visual content. The practical implications are equally significant. By facilitating accurate and efficient sign language retrieval, this work has the potential to enhance communication accessibility for the deaf and hard-of-hearing communities.

Future Directions

Looking forward, this research opens up avenues for further innovations in the domain of sign language understanding and retrieval. Improved modeling techniques and larger, more diverse datasets could further augment the performance of models like CiCo. There is also a vast unexplored potential in integrating contextual and multimodal information to enhance the comprehension of sign languages in dynamic and interactive environments.

In conclusion, the CiCo framework represents a meaningful advancement in the domain of sign language computation, addressing both theoretical gaps and practical needs with its cross-lingual contrastive approach. Its success paves the way for richer interactions and understanding between sign language users and non-signers, thereby contributing significantly to the broader field of natural language processing and retrieval.
