
Non-autoregressive Sequence-to-Sequence Vision-Language Models

Published 4 Mar 2024 in cs.CV and cs.AI (arXiv:2403.02249v2)

Abstract: Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency, a consequence of generating predictions autoregressively. We propose a parallel-decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of output tokens, rather than restricting to the conditional distributions of an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing inference from the linear time of sequential token generation to constant-time joint decoding.
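At inference time, a CTC-trained parallel decoder emits one token per query position in a single forward pass, and the final sequence is recovered by collapsing the alignment: consecutive repeats are merged, then blanks are dropped. A minimal sketch of that collapse step (the `BLANK` symbol and token strings are illustrative assumptions, not taken from the paper's code):

```python
BLANK = "<pad>"  # hypothetical blank symbol used by the CTC alignment

def ctc_collapse(alignment):
    """Collapse a CTC alignment into an output sequence:
    merge consecutive repeated tokens, then drop blanks."""
    out = []
    prev = None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

# All query positions are decoded in one parallel step; the per-position
# argmax yields an alignment, which collapses to the final sequence.
alignment = ["a", "a", BLANK, "cat", "cat", BLANK, "sits"]
print(ctc_collapse(alignment))  # ['a', 'cat', 'sits']
```

Because the collapse is a single linear scan over a fixed number of query positions, decoding cost no longer grows with the number of sequential generation steps, which is the source of the constant-time claim above.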
