Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal (2212.13023v2)

Published 26 Dec 2022 in cs.CV

Abstract: Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient-training problem, from the perspective of consistency. Specifically, since the information in sign languages is mainly carried by signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to force the visual module to focus on these informative regions, i.e., spatial attention consistency. Second, noticing that the output features of the visual and sequential modules both represent the same sentence, we impose a sentence embedding consistency constraint between the two modules to better exploit the backbone's power and enhance the representation ability of both features. We refer to the CSLR model trained with these auxiliary tasks as consistency-enhanced CSLR, which performs well on signer-dependent datasets, in which all signers appear during both training and testing. To make it more robust in the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks: PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
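The sentence embedding consistency constraint described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the mean-pooling choice, the cosine-distance loss, and all function names here are assumptions made for clarity.

```python
import numpy as np

def sentence_embedding(features):
    """Mean-pool frame-level features (T, D) into a unit-norm sentence embedding (D,)."""
    emb = features.mean(axis=0)
    return emb / np.linalg.norm(emb)

def consistency_loss(visual_feats, sequential_feats):
    """Cosine-distance consistency between visual- and sequential-module outputs.

    Both inputs are (T, D) feature sequences representing the same sentence;
    the loss is 1 - cos(v, s), which is zero when the pooled embeddings align.
    """
    v = sentence_embedding(visual_feats)
    s = sentence_embedding(sequential_feats)
    return 1.0 - float(np.dot(v, s))

# Identical feature sequences yield (near-)zero loss; mismatched ones a positive loss.
rng = np.random.default_rng(0)
feats = rng.standard_normal((40, 128))
print(round(consistency_loss(feats, feats), 6))  # → 0.0
```

In the paper's setting, such a loss would be added to the CTC objective as an auxiliary term, encouraging the visual and sequential modules to produce sentence-level representations that agree.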

Citations (12)
