Exploring Attention Mechanisms in Integration of Multi-Modal Information for Sign Language Recognition and Translation (2309.01860v4)
Abstract: Understanding the intricate, fast-paced movements of body parts is essential for sign language recognition and translation. Incorporating additional information that identifies and locates the moving body parts has recently attracted research interest. However, previous work on multi-modal information raises concerns such as sub-optimal methods for merging multi-modal features and models that are too computationally heavy. In this work, we address these issues with a plugin module based on cross-attention, which lets each modality attend to the other. Moreover, we use 2-stage training to remove the dependency on separate feature extractors for the additional modalities in an end-to-end approach, which reduces the concern about computational complexity. The cross-attention plugin module is also very lightweight and adds no significant computational overhead to the original baseline. We evaluate our approaches on the RWTH-PHOENIX-2014 dataset for sign language recognition and the RWTH-PHOENIX-2014T dataset for sign language translation. Our approach reduced the WER by 0.9 on the recognition task and increased the BLEU-4 score by 0.8 on the translation task.
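To make the idea of one modality attending to another concrete, the following is a minimal sketch of scaled dot-product cross-attention between two feature sequences. It is not the paper's module: the learned query/key/value projections are omitted, the residual fusion step is an assumption, and the names `cross_attention`, `fuse`, `rgb`, and `pose` are hypothetical, chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention without learned projections.

    queries:     T_q vectors of dimension d (features of one modality)
    keys_values: T_k vectors of dimension d (features of the other modality)
    Returns (outputs, weights): T_q vectors of dimension d, and a
    T_q x T_k matrix of attention weights whose rows sum to 1.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query step to every step of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        w = softmax(scores)
        # Weighted average of the other modality's features.
        out = [sum(w[j] * keys_values[j][c] for j in range(len(keys_values)))
               for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def fuse(primary, auxiliary):
    """Assumed residual fusion: primary features plus what they attend to."""
    attended, _ = cross_attention(primary, auxiliary)
    return [[p + a for p, a in zip(pv, av)] for pv, av in zip(primary, attended)]


# Toy inputs: three steps of one modality, two steps of another (made-up values).
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pose = [[0.5, 0.5], [2.0, 0.0]]

fused_rgb = fuse(rgb, pose)    # first modality enriched by the second
fused_pose = fuse(pose, rgb)   # second modality enriched by the first
```

Each fused sequence keeps the length of its own modality, so the two streams need not be sampled at the same rate. Because the operation is a pair of small matrix products per direction, a module of this kind adds little cost relative to a visual backbone, which is consistent with the lightweight claim in the abstract.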