Unified-IO 2 (Lu et al., 2023) introduces the first autoregressive multimodal model capable of understanding and generating image, text, audio, and action. The core idea is to unify these modalities by tokenizing them into a shared semantic space and processing them with a single encoder-decoder transformer architecture. This approach aims to build general-purpose AI agents that can interact with the world multimodally, inspired by how biological systems use redundancy across senses to improve learning.
Building such a comprehensive multimodal model from scratch presents significant challenges, including sourcing and processing massive, diverse datasets, designing effective architectures, ensuring training stability, and instruction tuning for versatile capabilities. The paper addresses these challenges with several technical contributions:
- Unified Task Representation: All inputs and outputs across modalities (text, images, audio, action, bounding boxes, keypoints, 3D cuboids, camera poses, and per-pixel labels such as depth, surface normals, and segmentation masks) are converted into sequences of tokens in a shared discrete space. Text uses byte-pair encoding; sparse structures such as boxes and keypoints are discretized into special location tokens (a sketch of this discretization follows the list); images are encoded with a pre-trained Vision Transformer (ViT) and decoded with a VQ-GAN; audio is encoded with a pre-trained Audio Spectrogram Transformer (AST) over spectrograms and decoded with a ViT-VQGAN followed by a HiFi-GAN vocoder. Image and audio history are compressed into a fixed number of tokens by a perceiver resampler and referenced in text via special tokens.
- Architectural Improvements for Stability: The diverse modalities lead to training instability in standard transformer architectures. Unified-IO 2 incorporates:
  - 2D Rotary Embedding (RoPE): Extends standard RoPE to 2D positions for non-text modalities, applied at each transformer layer (sketched in code after this list).
  - QK Normalization: Applies LayerNorm to queries and keys before the dot-product attention to mitigate large attention logits, which arise particularly with image and audio data.
  - Scaled Cosine Attention: Used in the perceiver resampler for stricter normalization, further stabilizing training (both attention variants are sketched after this list).
  - Additional measures: Attention logits are computed in float32, and the pre-trained ViT and AST encoders are frozen during pretraining to avoid instabilities from jointly updating them.
- Multimodal Mixture of Denoisers Objective: Adapting the text-only UL2 framework, the paper proposes a generalized objective for multimodal pre-training that combines masked denoising (reconstructing corrupted inputs) and causal generation across text, image, and audio targets (a toy sampler is sketched after this list). A novel technique, "Autoregressive with Dynamic Masking," is introduced for image/audio denoising to prevent information leakage from the decoder during autoregressive prediction while maintaining causal generation capabilities.
- Efficient Implementation (Dynamic Packing): To handle the highly variable sequence lengths of multimodal data, tokens from multiple examples are packed into a single sequence, with attention masked between examples (see the packing-mask sketch after this list). Unlike typical pre-processing packing, dynamic packing occurs right before and after the transformer stage to accommodate the modality-specific encoders and decoders, and is implemented efficiently with matrix multiplication. A heuristic algorithm pairs examples on the fly during streaming to optimize packing, yielding an almost 4x increase in training throughput.
- Large-Scale Multimodal Data: The model is trained from scratch on a massive corpus of over 600 terabytes.
  - Pre-training Data: An 8.5 billion example mixture from diverse sources (NLP, Image-Text, Video-Audio, Interleaved Image-Text, 3D-Embodiment, Synthetic), sampled to balance modalities and corpus sizes. Self-supervised signals are generated by randomly selecting a target modality and masking/corrupting the inputs.
  - Instruction Tuning Data: Fine-tuned on an ensemble of over 120 datasets covering 220+ tasks across vision, language, audio, and action. The mixture combines prompted supervised data with carried-over pre-training data (30%) to prevent catastrophic forgetting, task augmentation (6%) for more diverse skills, and free-form text (4%) for chat capabilities (a weighted-sampling sketch follows the list).
- Evaluation: Evaluated on over 35 datasets without task-specific fine-tuning, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark, surpassing its predecessor. It shows strong results across diverse tasks, including image generation (competitive on TIFA), audio generation, vision-language tasks (VQA, referring expressions, captioning), video and audio understanding (classification, captioning, VQA), sparse and dense image labeling (detection, segmentation, keypoint estimation, surface normal estimation), 3D tasks (detection, view synthesis), and embodied AI (action and state prediction). It matches or outperforms many vision-language generalist models despite its significantly broader scope.
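The short Python sketches below illustrate several of the components described above; they are minimal illustrations under stated assumptions, not the paper's implementation. First, the unified task representation for sparse structures: a bounding box can be quantized into discrete location tokens that live in the same vocabulary as text (the bin count and token naming here are placeholders).

```python
import numpy as np

NUM_BINS = 1000  # assumed size of the discrete coordinate vocabulary

def box_to_tokens(box_xyxy, image_w, image_h, num_bins=NUM_BINS):
    """Quantize a bounding box into special location tokens shared with the text vocabulary."""
    x0, y0, x1, y1 = box_xyxy
    norm = np.array([x0 / image_w, y0 / image_h, x1 / image_w, y1 / image_h])
    bins = np.clip(np.round(norm * (num_bins - 1)).astype(int), 0, num_bins - 1)
    return [f"<loc_{b}>" for b in bins]  # hypothetical token names

# Example: a box on a 640x480 image becomes four discrete tokens.
print(box_to_tokens((32, 48, 320, 240), 640, 480))
```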
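The 2D rotary embedding can be sketched by rotating one half of each head dimension by a token's row index and the other half by its column index; the split convention is an assumption, and text tokens keep standard 1D RoPE.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotary embedding over the last (even-sized) dimension of x, given 1D positions."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos[..., None] * freqs             # (..., half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D extension: rotate one half of the head dimension by the row position
    and the other half by the column position."""
    d = x.shape[-1]
    return np.concatenate(
        [rope_1d(x[..., : d // 2], rows), rope_1d(x[..., d // 2:], cols)], axis=-1)

# Queries for a 16x16 grid of image patches, head dimension 64.
q = np.random.randn(16 * 16, 64)
rows, cols = np.divmod(np.arange(16 * 16), 16)
q_rotated = rope_2d(q, rows, cols)
```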
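QK normalization and scaled cosine attention are small variations on dot-product attention; the sketch below omits LayerNorm's learned affine parameters and uses a fixed temperature in place of the learned one.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/shift, applied over the feature dimension."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(logits):
    z = np.exp(logits - logits.max(-1, keepdims=True))
    return z / z.sum(-1, keepdims=True)

def qk_norm_attention(q, k, v):
    """Normalizing queries and keys bounds the attention logits before the softmax."""
    q, k = layer_norm(q), layer_norm(k)
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def scaled_cosine_attention(q, k, v, tau=0.1):
    """Cosine-similarity logits divided by a temperature (learned in practice),
    giving an even stricter bound on logit magnitude."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return softmax(qn @ kn.T / tau) @ v
```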
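The mixture-of-denoisers objective can be illustrated with a toy sampler that, for a token sequence of the chosen target modality, either corrupts tokens for masked denoising or splits the sequence for causal (prefix) generation; the 50/50 split and corruption rate are illustrative, and the paper's dynamic-masking mechanism for image/audio targets is not shown.

```python
import random

def sample_denoiser(tokens, mask_token="<mask>", corrupt_rate=0.15):
    """Toy sketch: pick a denoising paradigm for one example of the target modality."""
    if random.random() < 0.5:
        # Masked denoising: corrupt tokens and ask the model to reconstruct the original.
        corrupted = [mask_token if random.random() < corrupt_rate else t for t in tokens]
        return {"inputs": corrupted, "targets": tokens, "paradigm": "masked denoising"}
    # Causal generation: condition on a prefix and predict the remainder.
    split = random.randint(1, len(tokens) - 1)
    return {"inputs": tokens[:split], "targets": tokens[split:], "paradigm": "causal generation"}

print(sample_denoiser("a photo of a dog on a beach".split()))
```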
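For dynamic packing, the core mechanism is a segment-level attention mask that keeps tokens from different packed examples from attending to each other; building it as a one-hot matrix product is one way to realize the matrix-multiplication formulation mentioned above (the streaming pairing heuristic is not shown).

```python
import numpy as np

def packing_mask(segment_ids, num_segments):
    """mask[i, j] = 1 iff tokens i and j come from the same packed example."""
    one_hot = np.eye(num_segments)[np.asarray(segment_ids)]  # (seq_len, num_segments)
    return one_hot @ one_hot.T                                # (seq_len, seq_len)

# Two examples of lengths 3 and 2 packed into one length-5 sequence:
# the result is a block-diagonal mask of ones.
print(packing_mask([0, 0, 0, 1, 1], num_segments=2))
```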
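Finally, the instruction-tuning mixture can be expressed as sampling weights over the four data pools; the 60% share for prompted supervised data is inferred here as the remainder of the stated percentages.

```python
import random

# Sampling weights for the instruction-tuning mixture (supervised share inferred).
MIXTURE = {
    "prompted_supervised": 0.60,
    "pretraining_carryover": 0.30,  # guards against catastrophic forgetting
    "task_augmentation": 0.06,
    "free_form_text": 0.04,
}

def sample_source(rng=random):
    """Pick which pool the next training example is drawn from."""
    pools, weights = zip(*MIXTURE.items())
    return rng.choices(pools, weights=weights, k=1)[0]

print(sample_source())
```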
Implementation Considerations & Limitations:
- The model uses base-sized pre-trained encoders (ViT, AST) due to memory constraints; larger encoders could improve performance.
- Image and audio generation quality, while competitive for a generalist model, does not fully match specialist diffusion models, and audio generation is limited to ~4 seconds per segment.
- Performance on less common modalities/tasks (depth, video, 3D detection) is less reliable, likely due to limited task variety in the training data for these areas.
- Training required careful hyperparameter tuning, along with the stability and efficiency techniques described above.
In conclusion, Unified-IO 2 demonstrates the feasibility and effectiveness of training a single autoregressive model from scratch to handle a wide range of multimodal tasks spanning vision, language, audio, and action. The proposed architectural changes, training objective, and data curation strategy are crucial for enabling this breadth of capabilities and scaling autoregressive multimodal models. Future work includes exploring decoder-only architectures, scaling model size further, improving data quality, and refining the overall design.