
Towards Online Continuous Sign Language Recognition and Translation (2401.05336v2)

Published 10 Jan 2024 in cs.CV

Abstract: Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings. Code and models are available at https://github.com/FangyunWei/SLRT.


Summary

  • The paper introduces a novel framework that transitions continuous sign language recognition from offline to real-time processing using a sliding window approach.
  • The methodology segments continuous videos into isolated signs and trains an ISLR model with both classification and saliency losses to improve accuracy.
  • The framework achieves state-of-the-art results on benchmarks like Phoenix-2014, demonstrating potential for enhanced accessibility and future research in real-time applications.

Towards Online Continuous Sign Language Recognition and Translation

The paper "Towards Online Sign Language Recognition and Translation" addresses a significant gap in the field of sign language recognition by proposing a novel framework for online continuous sign language recognition (CSLR). Unlike traditional CSLR methods that rely on offline models trained with connectionist temporal classification (CTC) loss and operate on entire sign videos, this research offers a pragmatic approach to real-time sign language processing via a robust online framework.

Overview of Methodology

The proposed framework is divided into three phases:

  1. Sign Language Dictionary Construction: The framework begins by building a sign language dictionary from the target CSLR dataset. A pre-trained CSLR model equipped with CTC loss segments continuous sign videos into isolated signs, which serve as pseudo ground truth entries in the dictionary; the dictionary is further enriched with augmented clips sampled around each isolated sign.
  2. ISLR Model Training: With this dictionary, an isolated sign language recognition (ISLR) model is optimized using standard classification losses along with a novel saliency loss. While the classification loss ensures correct gloss prediction, the saliency loss encourages the model to focus on the foreground signs and adapt to variations in sign duration.
  3. Online Recognition via a Sliding Window Approach: Online recognition is achieved by sliding a window over the input sign sequence and feeding each clip to the optimized ISLR model for prediction. A post-processing step removes duplicate and background predictions, improving recognition accuracy; a minimal sketch of this loop follows the list.
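To make the sliding-window phase concrete, here is a minimal sketch of the online loop. It assumes a trained ISLR model exposed as a callable that returns the top-1 gloss for a clip, and the `window_size`, `stride`, and background-label settings are hypothetical; the paper's actual windowing and post-processing may differ in detail.

```python
# Minimal sketch of phase 3 (online recognition). `islr_model` stands in
# for the trained ISLR classifier; BACKGROUND is an assumed label for
# non-sign / transition clips.
from typing import Callable, List, Sequence

BACKGROUND = "<blank>"

def online_recognize(
    frames: Sequence,                       # buffered or incoming video frames
    islr_model: Callable[[Sequence], str],  # returns top-1 gloss for a clip
    window_size: int = 16,
    stride: int = 8,
) -> List[str]:
    """Slide a window over the sequence, classify each clip, then drop
    background predictions and collapse consecutive duplicates."""
    raw: List[str] = []
    for start in range(0, max(1, len(frames) - window_size + 1), stride):
        clip = frames[start:start + window_size]
        raw.append(islr_model(clip))

    # Post-processing: remove background, merge repeated glosses.
    glosses: List[str] = []
    for gloss in raw:
        if gloss == BACKGROUND:
            continue
        if not glosses or glosses[-1] != gloss:
            glosses.append(gloss)
    return glosses
```

In a true streaming deployment the loop would consume frames as they arrive rather than from a buffered sequence, and the emitted gloss stream could be fed incrementally to a gloss-to-text network for online translation, as the paper's extension describes.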

Performance Assessment

Fusing this online recognition framework with the previously leading offline model, TwoStream-SLR, yields new state-of-the-art results on three benchmarks: Phoenix-2014, Phoenix-2014T, and CSL-Daily. The results show notable reductions in word error rate compared with existing offline models adapted to the online setting.
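For context, the word error rate (WER) reported on these benchmarks is the standard edit-distance metric: the minimum number of substitutions, deletions, and insertions needed to turn the predicted gloss sequence into the reference, normalized by the reference length.

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Here S, D, and I count substituted, deleted, and inserted glosses, and N is the number of glosses in the reference; lower is better.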

Speculative Implications and Future Developments

The successful implementation of this framework suggests substantial implications for real-time sign language recognition and translation systems. Practically, it could lead to more accessible communication aids for the deaf community, enhancing interactive applications where latency is critical. Theoretically, this opens avenues for further research into lightweight architecture adaptations for resource-constrained environments, optimizing real-time processing without sacrificing performance.

For future developments, refining the segmentation accuracy of sign boundaries and improving robustness against varying video qualities will be essential. Additionally, extending this approach to support multiple sign languages and contextual understanding through enriched datasets could make these systems more comprehensive.

This work effectively transitions CSLR from a predominantly offline task to an online one, equipping systems with the ability to process and translate sign language in dynamic settings. Bridging isolated and continuous sign recognition in a cohesive framework fosters advancements that align with both practical needs and theoretical explorations in sign language processing.
