TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions (2403.11818v1)
Abstract: A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. For each query token, self-attention is then learned along its trajectory, so the network can also focus on fine-grained spatio-temporal patterns, such as finger movements, of a specific region in motion. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions and assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce computational cost and memory usage. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state of the art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively.
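The abstract reports its results as word error rate (WER), the standard CSLR metric. As a minimal sketch of how WER is conventionally computed, the edit distance between the reference and predicted gloss sequences is divided by the reference length. The function name `word_error_rate` and the example glosses are illustrative assumptions, not taken from the paper or its evaluation scripts.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if not reference:
        raise ValueError("reference must contain at least one gloss")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; start with the empty reference prefix.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # deleting i reference glosses to match an empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference gloss
                curr[j - 1] + 1,     # insertion of a hypothesis gloss
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Made-up gloss sequences: one substitution and one deletion over 4 glosses.
reference = ["MORGEN", "REGEN", "NORD", "WIND"]
hypothesis = ["MORGEN", "SCHNEE", "NORD"]
print(word_error_rate(reference, hypothesis))  # 0.5
```

Under this definition, lower is better, and WER can exceed 1.0 when the hypothesis contains many insertions.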