Question-based Sign Language Translation
- Question-based Sign Language Translation is a paradigm that integrates text-based dialogue context with continuous sign language videos to improve translation accuracy and disambiguation.
- It employs contrastive multimodal learning, cross-modal fusion via Sigmoid Self-Attention Weighting (SSAW), and self-supervised auxiliary tasks to align and enhance video and text representations.
- QB-SLT reduces dependence on costly gloss annotations while achieving strong translation quality, e.g., 35.42 BLEU-4 and 60.16 ROUGE on PHOENIX-2014T-QA.
Question-based Sign Language Translation (QB-SLT) is a paradigm in sign language translation (SLT) that explicitly incorporates text-based dialogue context—typically in the form of naturally occurring question sentences—during translation of continuous sign language videos. Unlike standard SLT, which may rely on gloss (manual sign transcription) annotations or perform translation in isolation, QB-SLT leverages spoken or written language questions as auxiliary input to enhance semantic grounding, disambiguation, and contextual fidelity in translation. This approach addresses both the linguistic challenges of interpreting nuanced query patterns and the practical challenges of expensive gloss annotation, and enables the development of dialogue-aware sign language translation systems powered by joint multimodal and self-supervised learning.
1. Formal Definition and Motivations
QB-SLT is defined as the task of mapping a continuous sign language video $V$ and a corresponding text-based question $Q$ to a target text-based spoken language translation $S$, i.e., determining

$$S^{*} = \arg\max_{S} \; p(S \mid V, Q),$$

where $Q$ is a natural language question providing explicit dialogue context.
Key motivations for QB-SLT include:
- Reducing dependence on costly and expertise-demanding gloss annotations, which are not always available for low-resource sign languages.
- More closely reflecting real-world communication scenarios, where SLT occurs in the context of dialogue, questions, or interactive exchanges rather than in isolation.
- Capturing subtle pragmatic and semantic phenomena—such as reference resolution, question focus, and context-dependent disambiguation—by using questions as naturally occurring, easy-to-annotate auxiliary information (Liu et al., 17 Sep 2025).
2. Methodological Foundations and Core Architectures
Recent QB-SLT systems depart from conventional SLT pipelines by replacing or augmenting gloss-based supervision with question-based context integration. Key architectural and methodological elements include:
- Contrastive Multimodal Learning: The encoder builds a shared feature space aligning sign video representations and text representations from both question and target spoken language. This is optimized through contrastive objectives, e.g., maximizing cosine similarity between (V, S) and (Q, S) pairs (Liu et al., 17 Sep 2025).
- Cross-Modal Fusion Mechanisms: Fusing video features and question features is central. Sigmoid Self-Attention Weighting (SSAW) produces adaptive feature fusion by applying a learnable sigmoid-gated self-attention mask that isolates informative elements from the question and video:

$$\tilde{F} = \sigma\big(\mathrm{FFN}(\mathrm{SelfAttn}(F))\big) \odot F,$$

where $F$ is the concatenated feature vector, $\mathrm{FFN}$ is a learned feed-forward projection, $\sigma$ is the sigmoid function, and $\odot$ denotes elementwise multiplication (Liu et al., 17 Sep 2025). A rough code sketch of such a module appears after this list.
- Self-supervised Auxiliary Tasks: Autoencoding the question as a parallel task to translation regularizes representations and preserves semantic interpretability, with cross-entropy loss over masked question sequence reconstruction (Liu et al., 17 Sep 2025).
- End-to-End Joint Optimization: Modern QB-SLT systems typically train all modules in a unified fashion, including video encoders (e.g., Video Transformers/I3D), question/text encoders (e.g., BERT/Transformer), and translation decoders, sometimes with self-distillation (Liu et al., 17 Sep 2025).
- Context-aware Decoding: The decoder is conditioned not only on sign video features but also on the embedded question, which constrains the output space (e.g., a "what" question is answered with a noun phrase).
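As an illustration of the SSAW-style fusion described above, the following PyTorch sketch applies layer-normalized self-attention, a feed-forward projection, and a sigmoid gate over concatenated question and video features. It is a minimal re-implementation under our own assumptions (module name, dimensions, and layer layout are hypothetical), not the authors' released code.

```python
import torch
import torch.nn as nn

class SSAWFusion(nn.Module):
    """Sketch of sigmoid self-attention weighting (SSAW) fusion.

    Hypothetical layout: self-attention over the concatenated question/video
    sequence, a feed-forward projection, and a sigmoid gate applied
    elementwise to the original features.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, question_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Concatenate question tokens and video frames along the sequence axis.
        fused = torch.cat([question_feats, video_feats], dim=1)   # (B, Lq+Lv, D)
        x = self.norm(fused)
        attn_out, _ = self.attn(x, x, x)                          # self-attention
        gate = torch.sigmoid(self.ffn(attn_out))                  # values in (0, 1)
        return gate * fused                                       # elementwise re-weighting


# Usage sketch: 16 question tokens and 64 video frames with 512-dim features.
q = torch.randn(2, 16, 512)
v = torch.randn(2, 64, 512)
fused = SSAWFusion()(q, v)  # (2, 80, 512), ready for the translation decoder
```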
3. Data Resources and Annotation Paradigms
The development and evaluation of QB-SLT systems are enabled by specialized datasets that include question-answer pairs aligned with sign language videos.
- PHOENIX-2014T-QA and CSL-Daily-QA are specifically constructed for QB-SLT. Both provide continuous sign language videos, corresponding spoken language translations, and manually annotated or curated natural language questions serving as dialogue context (Liu et al., 17 Sep 2025).
- Annotation in QB-SLT emphasizes collecting text-based questions, which is notably less expensive and more scalable than detailed gloss annotation, thus broadening applicability across new domains and sign languages (Liu et al., 17 Sep 2025).
- These datasets support translations of naturalistic, domain-diverse queries (e.g., weather, travel, daily activities) and test the model's ability to integrate both explicit (question-focused) and implicit (visual) context.
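For concreteness, a QB-SLT training example can be thought of as a triple of sign video, dialogue question, and target translation. The sketch below shows one hypothetical record layout; field names and the example content are invented for illustration and are not taken from the published datasets.

```python
from dataclasses import dataclass

@dataclass
class QBSLTExample:
    """Hypothetical record layout for a question-assisted SLT example."""
    video_path: str   # continuous sign language video clip
    question: str     # text-based dialogue question (auxiliary context)
    translation: str  # target spoken-language sentence

# Illustrative instance (invented content, not from PHOENIX-2014T-QA).
sample = QBSLTExample(
    video_path="clips/weather_0001.mp4",
    question="What will the weather be like tomorrow?",
    translation="Tomorrow it will be mostly sunny with light winds.",
)
```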
4. Comparative Evaluation: Dialogue (Question) vs. Gloss Assistance
A major empirical result of QB-SLT research is that natural language question context can be as effective as, or even more effective than, conventional gloss supervision for translation quality:
- On PHOENIX-2014T-QA, SSL-SSAW achieves a BLEU-4 of 35.42 and a ROUGE of 60.16, significantly surpassing state-of-the-art gloss-based models as well as question–gloss hybrid models (Liu et al., 17 Sep 2025):
| Method | BLEU-4 | ROUGE | Annotation supervision |
|---|---|---|---|
| Gloss-based (TS-SLT) | ~29 | ~49 | Gloss |
| Question-based (GBT) | ~27 | ~50 | Question + Gloss |
| SSL-SSAW (QB-SLT) | 35.42 | 60.16 | Question |
- The capacity of QB-SLT approaches to rival or exceed gloss-assisted models is attributed to the provision of dialogue context, which enables models to resolve ambiguities, focus attention, and predict context-appropriate forms.
- Visualizations of SSAW attention maps reveal that question tokens with high relevance receive amplified weights, directly constraining and grounding the translation (Liu et al., 17 Sep 2025).
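One simple way to inspect this behaviour, assuming access to the per-token sigmoid gate values (here simulated with random numbers rather than a trained model), is to plot the gate weights over the question tokens as a heatmap. The snippet below is purely an illustrative sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated SSAW gate values averaged over feature dimensions:
# one weight per question token (stand-in for sigmoid outputs in (0, 1)).
tokens = ["what", "will", "the", "weather", "be", "like", "tomorrow", "?"]
gate_weights = np.random.rand(len(tokens))

plt.figure(figsize=(6, 1.5))
plt.imshow(gate_weights[None, :], cmap="viridis", aspect="auto")
plt.xticks(range(len(tokens)), tokens, rotation=45)
plt.yticks([])
plt.colorbar(label="gate weight")
plt.title("SSAW gate weights over question tokens (simulated)")
plt.tight_layout()
plt.show()
```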
5. Technical Innovations: Fusion, Alignment, and Self-supervision
Specific technical contributions underline QB-SLT's methodological distinctiveness:
- Contrastive pretraining: Models are initialized with a contrastive loss that aligns the sign video and text/question embedding spaces (e.g., by maximizing cosine similarity between matched $(V, S)$ and $(Q, S)$ pairs), aiding multimodal semantic alignment (Liu et al., 17 Sep 2025); a minimal code sketch of one such objective follows this list.
- Adaptive self-attention fusion: The SSAW module (layer-normalized self-attention + feed-forward + sigmoid weighting) dynamically modulates the influence of question cues during fusion.
- Self-supervised losses: By reconstructing masked question tokens and incorporating self-distillation at the decoder, the model regularizes its behavior under partial or noisy question input and generalizes better to new question–answer pairs.
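The contrastive pretraining objective referenced above can be sketched as a symmetric InfoNCE-style loss over cosine similarities. This is a minimal sketch under that assumption, not the authors' exact formulation; the function name and temperature value are our own choices.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_video: torch.Tensor,
                               z_text: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling matched (video, text) embeddings together.

    z_video, z_text: (B, D) embeddings of paired sign videos and spoken-language
    sentences; the same function can be applied to (question, sentence) pairs.
    """
    z_video = F.normalize(z_video, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_video @ z_text.t() / temperature          # cosine similarity matrix
    targets = torch.arange(z_video.size(0), device=z_video.device)
    # Symmetric cross-entropy: match video -> text and text -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage sketch with random 256-dim embeddings for a batch of 8 pairs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```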
6. Practical Implications and Future Directions
QB-SLT systems lower the annotation barrier for new sign languages and domains by shifting supervision from expensive glosses to abundant question context. This reframing aligns with progressive trends in SLT, where dialogue context is regarded as essential for faithful and naturalistic translation (Liu et al., 17 Sep 2025).
Potential future research directions include:
- Scaling QB-SLT datasets with authentic, diverse dialogues reflecting real-world conversational settings.
- Extending multimodal fusion to handle additional cues (e.g., facial expression, prosody), which are critical for accurate translation of questions and responses in sign language.
- Refining dynamic fusion and self-supervised approaches to accommodate varying levels of context noise or incompleteness.
- Advancing cross-lingual/cross-modal transfer to support multilingual, multicultural QB-SLT scenarios.
7. Summary and Broader Impact
Question-based Sign Language Translation marks a shift in SLT research toward context-driven, dialogue-aware machine translation for sign languages. By unifying video and question modalities via contrastive pretraining, attention-based fusion (e.g., SSAW), and self-supervised auxiliary objectives, QB-SLT methods achieve or surpass gloss-based performance with much lower annotation cost, thereby accelerating the development of scalable and linguistically sophisticated sign language translation systems suited for interactive, real-world use (Liu et al., 17 Sep 2025).