Towards Online Continuous Sign Language Recognition and Translation (2401.05336v2)
Abstract: Research on continuous sign language recognition (CSLR) is essential to bridge the communication gap between deaf and hearing individuals. Numerous previous studies have trained their models using the connectionist temporal classification (CTC) loss. During inference, these CTC-based models generally require the entire sign video as input to make predictions, a process known as offline recognition, which suffers from high latency and substantial memory usage. In this work, we take the first step towards online CSLR. Our approach consists of three phases: 1) developing a sign dictionary; 2) training an isolated sign language recognition model on the dictionary; and 3) employing a sliding window approach on the input sign sequence, feeding each sign clip to the optimized model for online recognition. Additionally, our online recognition model can be extended to support online translation by integrating a gloss-to-text network and can enhance the performance of any offline model. With these extensions, our online approach achieves new state-of-the-art performance on three popular benchmarks across various task settings. Code and models are available at https://github.com/FangyunWei/SLRT.
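The abstract's third phase, sliding a fixed-size window over the incoming frame stream and classifying each clip with the isolated sign recognition model, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the window/stride values, the `islr_model` interface, and the duplicate-collapsing step are hypothetical and do not reproduce the authors' released implementation.

```python
# Minimal sketch of a sliding-window online recognition loop, assuming an
# isolated sign recognition model that maps a clip (B, T, C, H, W) to gloss
# logits (B, num_glosses). Window/stride sizes and all names are illustrative.
import torch

WINDOW = 16   # assumed clip length in frames
STRIDE = 8    # assumed hop between consecutive windows


def online_recognition(frame_stream, islr_model, gloss_vocab):
    """Yield gloss predictions as frames arrive, without seeing the full video.

    frame_stream : iterable yielding per-frame tensors of shape (C, H, W)
    islr_model   : hypothetical isolated sign recognition model (see above)
    gloss_vocab  : list mapping class index -> gloss string
    """
    buffer, glosses = [], []
    for frame in frame_stream:
        buffer.append(frame)
        if len(buffer) < WINDOW:
            continue                              # wait for a full clip
        clip = torch.stack(buffer[-WINDOW:])      # (WINDOW, C, H, W)
        with torch.no_grad():
            logits = islr_model(clip.unsqueeze(0))[0]   # (num_glosses,)
        pred = gloss_vocab[int(logits.argmax())]
        # Collapse consecutive duplicates so overlapping windows over the
        # same sign emit a single gloss.
        if not glosses or glosses[-1] != pred:
            glosses.append(pred)
            yield pred                            # emit immediately (online)
        buffer = buffer[STRIDE:]                  # hop forward by the stride
```

In the paper's extension to online translation, the emitted gloss sequence would additionally be fed to a gloss-to-text network; the released code at https://github.com/FangyunWei/SLRT contains the authors' actual windowing, aggregation, and translation components.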