Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (2312.17172v1)

Published 28 Dec 2023 in cs.CV, cs.AI, and cs.CL

Abstract: We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

Unified-IO 2 (Lu et al., 2023) introduces the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action. The core idea is to unify these modalities by tokenizing all inputs and outputs into a shared semantic space and processing them with a single encoder-decoder transformer. The approach aims at general-purpose agents that interact with the world multimodally, inspired by how biological systems exploit redundancy across senses for more robust learning.

Building such a comprehensive multimodal model from scratch presents significant challenges, including sourcing and processing massive, diverse datasets, designing effective architectures, ensuring training stability, and instruction tuning for versatile capabilities. The paper addresses these challenges with several technical contributions:

  1. Unified Task Representation: All inputs and outputs across modalities—text, images, audio, action, bounding boxes, keypoints, 3D cuboids, camera poses, and per-pixel labels (depth, surface normals, segmentation masks)—are converted into token sequences in a shared discrete space. Text uses byte-pair encoding; sparse structures such as boxes and keypoints are discretized into special location tokens (a toy discretization sketch follows this list); images are encoded with a pre-trained Vision Transformer (ViT) and decoded with a VQ-GAN; audio spectrograms are encoded with a pre-trained Audio Spectrogram Transformer (AST) and decoded with a ViT-VQGAN plus a HiFi-GAN vocoder. Image and audio history are compressed into a fixed number of tokens with a perceiver resampler and referenced in text via special tokens.
  2. Architectural Improvements for Stability: The diverse modalities lead to training instability in a standard transformer architecture. Unified-IO 2 therefore incorporates the following (a simplified attention sketch follows this list):
    • 2D Rotary Embedding (RoPE): Extends standard RoPE to 2D positions for non-text modalities, applied at each transformer layer.
    • QK Normalization: Applies LayerNorm to queries and keys before dot-product attention to mitigate large attention logits, particularly with image and audio data.
    • Scaled Cosine Attention: Used in the perceiver resampler for stricter normalization, further stabilizing training.
    • In addition, attention logits are computed in float32, and the pre-trained ViT/AST encoders are frozen during pre-training to avoid instabilities from updating them jointly.
  3. Multimodal Mixture of Denoisers Objective: Adapting the UL2 framework, originally designed for text, the paper proposes a generalized objective for multimodal pre-training that mixes masked denoising (reconstructing corrupted inputs) and causal generation across text, image, and audio targets (a toy span-corruption sketch follows this list). A novel technique, "Autoregressive with Dynamic Masking," is introduced for image/audio denoising to prevent information leakage from the decoder during autoregressive prediction while maintaining causal generation capabilities.
  4. Efficient Implementation (Dynamic Packing): To handle the highly variable sequence lengths of multimodal data, dynamic packing is used: tokens from multiple examples are packed into a single sequence, and attention between examples is masked (a minimal segment-mask sketch follows this list). Unlike typical pre-processing-time packing, dynamic packing happens right before and after the transformer stage to accommodate the modality-specific encoders/decoders, and is implemented efficiently with matrix multiplication. A heuristic algorithm pairs examples on the fly during streaming to optimize packing, yielding an almost 4x increase in training throughput.
  5. Large-Scale Multimodal Data: The model is trained from scratch on a massive corpus of over 600 terabytes.
    • Pre-training Data: An 8.5 billion example mixture from diverse sources (NLP, Image-Text, Video-Audio, Interleaved Image-Text, 3D-Embodiment, Synthetic), sampled to balance modalities and corpus sizes. Self-supervised signals are generated by randomly selecting a target modality and masking/corrupting inputs.
    • Instruction Tuning Data: Fine-tuned on an ensemble of over 120 datasets covering 220+ tasks across vision, language, audio, and action. This mixture combines supervised data with prompts, carries over pre-training data (30%) to prevent catastrophic forgetting, includes task augmentation (6%) for less common skills, and adds free-form text (4%) for chat capabilities.
  6. Evaluation: The model was evaluated on over 35 datasets without task-specific finetuning. Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark, surpassing its predecessor. It shows strong results across diverse tasks including image generation (competitive on TIFA), audio generation, various vision-language tasks (VQA, referring expression, captioning), video and audio understanding (classification, captioning, VQA), sparse/dense image labeling (detection, segmentation, keypoint, normal estimation), 3D tasks (detection, view synthesis), and embodied AI (action prediction, state prediction). It matches or outperforms many vision-language generalist models despite its significantly broader scope.
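
To make the discretization in item 1 concrete, here is a minimal, hypothetical sketch of how a bounding box can be quantized into special location tokens that share a vocabulary with text. The bin count (1000) and the `<loc_*>` token naming are illustrative assumptions, not the released tokenizer.

```python
# Hypothetical sketch (not the released tokenizer): quantizing a bounding box
# into discrete location tokens so it can share a vocabulary with text.
NUM_BINS = 1000  # assumed number of coordinate bins

def box_to_tokens(box, image_w, image_h, num_bins=NUM_BINS):
    """Quantize (x1, y1, x2, y2) pixel coordinates into <loc_*> tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<loc_{b}>" for b in bins]

def tokens_to_box(tokens, image_w, image_h, num_bins=NUM_BINS):
    """Invert the mapping, recovering approximate pixel coordinates."""
    bins = [int(t.strip("<>").split("_")[1]) for t in tokens]
    x1, y1, x2, y2 = [(b + 0.5) / num_bins for b in bins]
    return (x1 * image_w, y1 * image_h, x2 * image_w, y2 * image_h)

# A box in a 640x480 image becomes four tokens the decoder can emit like text.
tokens = box_to_tokens((100, 50, 300, 350), 640, 480)
print(tokens)                           # ['<loc_156>', '<loc_104>', ...]
print(tokens_to_box(tokens, 640, 480))  # approximately the original box
```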
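
The stabilizers in item 2 can be pictured with a small numpy sketch, assuming a simplified single-head setting: queries and keys are LayerNorm-ed before the dot product (QK normalization), and a 2D rotary embedding rotates half of the head channels by a token's row index and the other half by its column index. Shapes, the base frequency, and the omission of learned scale/shift in the norm are simplifying assumptions; the released model applies these tricks inside every transformer layer and uses scaled cosine attention only in the perceiver resampler.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/shift (simplification)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x by angles proportional to pos (standard RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = pos[:, None] * freqs[None, :]            # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D RoPE: rotate one half of the channels by row index, the other by column."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], rows),
                           rope_1d(x[..., half:], cols)], axis=-1)

def stabilized_attention(q, k, v, rows, cols):
    q, k = layer_norm(q), layer_norm(k)            # QK normalization
    q, k = rope_2d(q, rows, cols), rope_2d(k, rows, cols)
    logits = (q @ k.T) / np.sqrt(q.shape[-1])      # the paper also keeps logits in float32
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

# Toy usage: 16 image-patch tokens on a 4x4 grid, head dimension 8.
T, D = 16, 8
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, T, D))
rows, cols = np.divmod(np.arange(T), 4)
print(stabilized_attention(q, k, v, rows, cols).shape)  # (16, 8)
```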
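
For item 3, the masked-denoising side of the objective can be pictured as UL2-style span corruption generalized to any token stream (text, image, or audio tokens): spans are replaced with sentinel tokens in the encoder input and reconstructed as the decoder target. The corruption probability, span length, and sentinel naming below are illustrative choices, not the paper's schedule, and the "Autoregressive with Dynamic Masking" trick for image/audio decoding is not reproduced.

```python
# Illustrative sketch of span corruption over a generic token list.
import random

def span_corrupt(tokens, corrupt_prob=0.3, span_len=3, seed=0):
    """Replace random spans with sentinel tokens (encoder input) and collect
    the removed spans, each preceded by its sentinel (decoder target)."""
    rng = random.Random(seed)
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_prob:
            span = tokens[i:i + span_len]
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(span)
            i += len(span)
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

enc_in, dec_tgt = span_corrupt("a b c d e f g h".split())
print(enc_in)   # corrupted input with sentinel placeholders
print(dec_tgt)  # sentinels followed by the tokens they replaced
```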
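
Finally, the attention masking behind dynamic packing (item 4) amounts to allowing attention only between tokens that carry the same example id, i.e. a block-diagonal mask over the packed sequence. The segment-id representation below is a common way to express this; the paper's matrix-multiplication-based implementation and its streaming pairing heuristic are not reproduced.

```python
import numpy as np

def packing_mask(segment_ids):
    """segment_ids[i] = index of the example token i came from.
    Returns a boolean (T, T) mask that is True where attention is allowed."""
    seg = np.asarray(segment_ids)
    return seg[:, None] == seg[None, :]

# Two examples of lengths 3 and 2 packed into one length-5 sequence:
print(packing_mask([0, 0, 0, 1, 1]).astype(int))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [0 0 0 1 1]
#  [0 0 0 1 1]]
```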

Implementation Considerations & Limitations:

  • The model uses base-sized pre-trained encoders (ViT, AST) due to memory constraints; larger encoders could improve performance.
  • Image and audio generation quality, while competitive for a generalist model, does not fully match specialist diffusion models, and audio generation is limited to ~4 seconds per segment.
  • Performance on less common modalities/tasks (depth, video, 3D detection) is less reliable, likely due to limited task variety in the training data for these areas.
  • Training required careful hyperparameter tuning and techniques to handle instability and efficiency.

In conclusion, Unified-IO 2 demonstrates the feasibility and effectiveness of training a single autoregressive model from scratch to handle a wide range of multimodal tasks spanning vision, language, audio, and action. The proposed architectural changes, training objective, and data curation strategy are crucial for enabling this breadth of capabilities and scaling autoregressive multimodal models. Future work includes exploring decoder-only architectures, scaling model size further, improving data quality, and refining the overall design.

Authors (8)
  1. Jiasen Lu
  2. Christopher Clark
  3. Sangho Lee
  4. Zichen Zhang
  5. Savya Khosla
  6. Ryan Marten
  7. Derek Hoiem
  8. Aniruddha Kembhavi
Citations (94)