TCNet: Continuous Sign Language Recognition from Trajectories and Correlated Regions (2403.11818v1)

Published 18 Mar 2024 in cs.CV

Abstract: A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens. For each query token, self-attention is learned along its trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movements, of a specific region in motion. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce computational cost and memory usage. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state-of-the-art by 1.5% and 1.0% in word error rate on PHOENIX14 and PHOENIX14-T, respectively.
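
The abstract's description of the trajectory module is concrete enough that a small sketch may help. Below is a minimal PyTorch illustration of "self-attention learned along a trajectory", assuming a (T, N, C) token layout, a precomputed per-frame trajectory index, and a single attention head without learned projections. These assumptions, including the function name `trajectory_self_attention`, are ours for illustration and are not taken from the paper.

```python
import torch

def trajectory_self_attention(feats, traj_idx):
    """
    feats:    (T, N, C) float tensor; N visual tokens per frame for T frames.
    traj_idx: (T, N) long tensor; traj_idx[t, n] is the spatial position in
              frame t occupied by trajectory n (assumed to be a per-frame
              permutation produced by some alignment/tracking step).
    Returns:  (T, N, C) tokens updated by attending along each trajectory.
    """
    T, N, C = feats.shape
    idx = traj_idx.unsqueeze(-1).expand(T, N, C)

    # Gather the T tokens lying on each of the N trajectories: (N, T, C).
    traj_tokens = feats.gather(1, idx).permute(1, 0, 2)

    # Plain scaled dot-product self-attention along the temporal axis of
    # every trajectory (Q/K/V projections and multi-head split omitted).
    attn = torch.softmax(
        traj_tokens @ traj_tokens.transpose(-2, -1) / C ** 0.5, dim=-1
    )
    out = (attn @ traj_tokens).permute(1, 0, 2)  # back to (T, N, C)

    # Scatter each updated token back to its spatial slot in its frame.
    return feats.clone().scatter_(1, idx, out)


# Toy usage: 8 frames, 49 tokens (e.g. a 7x7 feature map), 64 channels,
# with identity trajectories standing in for a real alignment step.
T, N, C = 8, 49, 64
feats = torch.randn(T, N, C)
traj_idx = torch.arange(N).expand(T, N)
print(trajectory_self_attention(feats, traj_idx).shape)  # torch.Size([8, 49, 64])
```

In TCNet the aligned trajectories are produced by the trajectory module itself and the attention is learned with projections; the sketch only illustrates the gather, attend-over-time, scatter pattern and does not cover the correlation module's dynamic key-value selection.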
