TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages (2402.16021v1)

Published 25 Feb 2024 in cs.CL, cs.AI, cs.CV, and eess.AS

Abstract: The capability to jointly process multi-modal information is becoming essential. However, the limited amount of paired multi-modal data and the large computational requirements of multi-modal learning hinder development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint in which we interpret different modalities as different languages and treat multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, which provide a unified interface across modalities and significantly decrease the computational cost. In the proposed TMT, a multi-modal encoder-decoder performs the core translation, whereas modality-specific processing is confined to the tokenization and detokenization stages. We evaluate the proposed TMT on all six modality translation tasks. TMT consistently outperforms its single-model counterparts, demonstrating that unifying tasks is beneficial not only for practicality but also for performance.
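The abstract's central idea, treating each modality as a "language" of discrete tokens so that one encoder-decoder can translate between any pair, can be made concrete with a short sketch. The following PyTorch code is a minimal illustration of that idea under stated assumptions, not the authors' implementation: the class name TriModalTranslator, the shared vocabulary size, and the <speech>/<image>/<text> modality-tag tokens are all hypothetical, and the upstream discrete tokenizers (e.g., speech-unit and image-quantization models) are assumed to exist outside this snippet.

```python
import torch
import torch.nn as nn

class TriModalTranslator(nn.Module):
    """A single encoder-decoder over one shared vocabulary of text tokens,
    speech units, and image tokens. The source/target modality is signaled
    by a tag token, the way a language tag signals the target language in
    multilingual machine translation."""

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens: torch.Tensor, tgt_tokens: torch.Tensor):
        # src_tokens / tgt_tokens: (batch, seq) integer IDs from the shared
        # vocabulary; position 0 holds a modality tag such as <speech>,
        # <image>, or <text> (hypothetical special tokens).
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)
        )
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.lm_head(hidden)  # logits over the shared vocabulary

# Example: "translate" speech-unit tokens into text tokens.
model = TriModalTranslator(vocab_size=12_000)
speech_units = torch.randint(0, 12_000, (2, 100))  # from a speech tokenizer
text_prefix = torch.randint(0, 12_000, (2, 20))    # shifted target tokens
logits = model(speech_units, text_prefix)          # shape (2, 20, 12_000)
```

Consistent with the abstract, all modality-specific work would sit outside this shared model: tokenizers map raw speech or images into the discrete vocabulary beforehand, and detokenizers (e.g., a unit vocoder for speech or an image-token decoder) reconstruct the output modality afterward.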

Authors (8)
  1. Minsu Kim (115 papers)
  2. Jee-weon Jung (69 papers)
  3. Hyeongseop Rha (6 papers)
  4. Soumi Maiti (26 papers)
  5. Siddhant Arora (50 papers)
  6. Xuankai Chang (61 papers)
  7. Shinji Watanabe (416 papers)
  8. Yong Man Ro (91 papers)