Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment (2312.15645v1)
Abstract: Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences. As a typical multi-modal task, SLT suffers from an inherent modality gap between sign language video and spoken language text, which makes cross-modal alignment between the visual and textual modalities crucial. However, previous studies tend to rely on an intermediate sign gloss representation to help alleviate the cross-modal problem, thereby neglecting the alignment across modalities, which may lead to compromised results. To address this issue, we propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT) that facilitates direct and sufficient cross-modal alignment between sign language videos and spoken language text. Specifically, CV-SLT consists of two paths with two Kullback-Leibler (KL) divergences that regularize the outputs of the encoder and decoder, respectively. In the prior path, the model relies solely on visual information to predict the target text; in the posterior path, it encodes visual information and textual knowledge jointly to reconstruct the target text. The first KL divergence optimizes the conditional variational autoencoder and regularizes the encoder outputs, while the second performs self-distillation from the posterior path to the prior path, enforcing consistency between the decoder outputs. We further strengthen the integration of textual information into the posterior path with a shared Attention Residual Gaussian Distribution (ARGD), which treats the textual information in the posterior path as a residual component relative to the prior path. Extensive experiments on two public datasets (PHOENIX14T and CSL-Daily) demonstrate the effectiveness of our framework: it achieves new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy.
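To make the two-path objective concrete, below is a minimal PyTorch-style sketch of the training signal the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the pooled linear encoders, the module names (`TwoPathCVAE`, `prior_head`, `resid_head`), the dimensions, and the choice to feed the prior path its latent mean are all simplifications introduced here; the residual parameterization of the posterior is a stripped-down stand-in for the paper's ARGD.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TwoPathCVAE(nn.Module):
    """Toy two-path CVAE: a prior path (video only) and a posterior path
    (video + text), tied together by two KL divergences."""

    def __init__(self, d_model=512, vocab_size=8000, latent_dim=64):
        super().__init__()
        self.visual_enc = nn.Linear(d_model, d_model)       # stand-in for the video encoder
        self.text_enc = nn.Embedding(vocab_size, d_model)   # stand-in for the text encoder
        # Prior p(z | video): Gaussian parameters from visual features alone.
        self.prior_head = nn.Linear(d_model, 2 * latent_dim)
        # Posterior q(z | video, text): predicts a *residual* shift of the
        # prior's parameters (a simplified stand-in for the ARGD idea).
        self.resid_head = nn.Linear(2 * d_model, 2 * latent_dim)
        self.decoder = nn.Linear(d_model + latent_dim, vocab_size)

    @staticmethod
    def split(params):
        mu, logvar = params.chunk(2, dim=-1)
        return mu, logvar

    def forward(self, video_feats, text_ids):
        v = self.visual_enc(video_feats).mean(dim=1)   # (B, d): pooled video
        t = self.text_enc(text_ids).mean(dim=1)        # (B, d): pooled text

        # Prior path: visual information only.
        mu_p, logvar_p = self.split(self.prior_head(v))
        # Posterior path: text enters as a residual on the prior parameters.
        d_mu, d_logvar = self.split(self.resid_head(torch.cat([v, t], dim=-1)))
        mu_q, logvar_q = mu_p + d_mu, logvar_p + d_logvar

        # Reparameterized sample from the posterior.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()

        # Decoder outputs for both paths (the prior path uses its mean).
        logits_post = self.decoder(torch.cat([v, z], dim=-1))
        logits_prior = self.decoder(torch.cat([v, mu_p], dim=-1))

        # KL 1 (encoder side): closed-form KL(q(z|v,t) || p(z|v)) between
        # diagonal Gaussians, summed over latent dims, averaged over batch.
        kl_latent = 0.5 * (
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0
        ).sum(dim=-1).mean()

        # KL 2 (decoder side): self-distillation from the posterior path
        # (teacher, detached) to the prior path (student).
        kl_decoder = F.kl_div(
            F.log_softmax(logits_prior, dim=-1),
            F.softmax(logits_post, dim=-1).detach(),
            reduction="batchmean",
        )
        return logits_post, kl_latent, kl_decoder
```

In training, one would add a cross-entropy reconstruction term on `logits_post` and weight the two KL terms; at inference only the prior path runs, since the target text is unavailable. The residual parameterization reflects the spirit of the ARGD: when the predicted residual is zero, posterior and prior coincide, so the text-informed posterior can smoothly pull the visual-only prior toward it rather than fight it from an arbitrary initialization.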