- The paper presents novel scaling laws that quantify mixed-modal interactions in models trained across seven data types.
- It develops an additive framework that extends uni-modal scaling laws by incorporating a term to capture both synergy and competition.
- Extensive experiments on models ranging from 8M to 30B parameters yield empirical guidelines for optimizing training in multi-modal contexts.
Insights into Scaling Laws for Generative Mixed-Modal LLMs
The paper "Scaling Laws for Generative Mixed-Modal Language Models" by Aghajanyan et al. presents a comprehensive empirical study of scaling laws for generative mixed-modal LLMs, i.e., models designed to handle multiple data modalities, such as text, images, and speech, within a single generative framework. The study spans more than 250 experiments across seven modalities, with model sizes ranging from 8 million to 30 billion parameters trained on datasets of 5 to 100 billion tokens.
Key Findings and Methodological Contributions
The authors introduce scaling laws that account for the contributions of individual modalities as well as their interactions, whether synergistic or competitive, within mixed-modal models. They propose an additive framework that extends traditional uni-modal scaling laws with an explicit interaction term, capturing both beneficial cooperation between modalities and the competition between them, with the balance depending on model size and the amount of training data.
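To make the additive form concrete, the following sketch implements a generic Chinchilla-style uni-modal law combined with an interaction term. The data-weighted blend, the coefficient values, and the sign convention for the interaction term are illustrative assumptions rather than the paper's fitted law.

```python
import numpy as np

def unimodal_loss(N, D, E, A, alpha, B, beta):
    """Chinchilla-style uni-modal law: irreducible loss + a parameter-limited
    term + a data-limited term."""
    return E + A / N**alpha + B / D**beta

def mixed_modal_loss(N, D_i, D_j, params_i, params_j, C_ij):
    """Hypothetical additive mixed-modal law: a data-weighted blend of the two
    uni-modal laws plus an interaction term C_ij.
    C_ij < 0 models synergy (joint training helps), C_ij > 0 models competition."""
    D = D_i + D_j
    L_i = unimodal_loss(N, D, *params_i)
    L_j = unimodal_loss(N, D, *params_j)
    return (D_i / D) * L_i + (D_j / D) * L_j + C_ij

# Illustrative (made-up) coefficients for two modalities: E, A, alpha, B, beta.
text_params = (1.8, 400.0, 0.34, 1100.0, 0.28)
speech_params = (1.2, 300.0, 0.30, 900.0, 0.25)

print(mixed_modal_loss(N=1e9, D_i=5e10, D_j=5e10,
                       params_i=text_params, params_j=speech_params,
                       C_ij=-0.05))
```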
Experimental Design and Scaling Law Derivation
The paper systematically investigates the scaling behavior of mixed-modal models by training seven model sizes on individual modalities and their pairwise combinations, covering seven distinct modalities: text, image, image-text, speech, speech-text, code, and molecules. Each modality is represented as a sequence of discrete tokens, for example via VQ-VAE codes for images and HuBERT units for speech.
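As a toy illustration of the "everything is a token sequence" setup, the sketch below maps each modality's tokens into disjoint ranges of a shared vocabulary before concatenating them into one stream. The vocabulary sizes, offsets, and modality-marker tokens are hypothetical, not the paper's actual tokenization scheme.

```python
# Hypothetical shared-vocabulary layout for mixed-modal token sequences.
TEXT_VOCAB = 50_000       # e.g. BPE text tokens
IMAGE_VOCAB = 8_192       # e.g. VQ-VAE codebook entries
SPEECH_VOCAB = 1_000      # e.g. HuBERT cluster ids

IMAGE_OFFSET = TEXT_VOCAB
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_VOCAB
BOS_IMAGE = SPEECH_OFFSET + SPEECH_VOCAB   # made-up modality markers
BOS_SPEECH = BOS_IMAGE + 1

def to_shared_ids(text_ids, image_codes, speech_units):
    """Flatten one mixed-modal example into a single token stream."""
    return (
        list(text_ids)
        + [BOS_IMAGE] + [IMAGE_OFFSET + c for c in image_codes]
        + [BOS_SPEECH] + [SPEECH_OFFSET + u for u in speech_units]
    )

print(to_shared_ids(text_ids=[12, 873, 5], image_codes=[7, 4091], speech_units=[311]))
```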
For each modality and each pairing, the paper fits scaling laws using a Chinchilla-style parameterization, adapted to accommodate mixed-modal interactions. The fitted laws reveal significant differences in scaling efficiency across modalities, with some, such as code and molecules, benefiting more from scale than others, such as image data.
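A minimal sketch of how a Chinchilla-style law L(N, D) = E + A/N^α + B/D^β could be fit to per-modality measurements with non-linear least squares is shown below; the synthetic data points and initial guesses are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, alpha, B, beta):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Hypothetical measurements: (parameters, tokens) -> validation loss for one modality.
N = np.array([8e6, 3e7, 1.25e8, 3.5e8, 7.6e8, 2.7e9, 6.7e9])
D = np.array([5e9, 1e10, 2e10, 4e10, 6e10, 8e10, 1e11])
loss = np.array([3.9, 3.4, 3.0, 2.75, 2.6, 2.4, 2.3])

popt, _ = curve_fit(
    chinchilla_loss, (N, D), loss,
    p0=(2.0, 400.0, 0.3, 1000.0, 0.3),   # rough initial guesses
    maxfev=20000,
)
E, A, alpha, B, beta = popt
print(f"E={E:.3f}, A={A:.1f}, alpha={alpha:.3f}, B={B:.1f}, beta={beta:.3f}")
```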
Emergent Phenomena in Mixed-Modal Training
The authors identify several emergent training dynamics that the proposed scaling laws help explain. Notably, they observe intermittent coordinate-ascent-like training, in which the model prioritizes optimizing some modalities while progress on the others stalls. This phenomenon diminishes with increased model scale and is strongly correlated with specific fitted scaling-law parameters.
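As a rough illustration of how such coordinate-ascent-like behavior might be detected from logged validation losses, the sketch below flags training windows in which one modality improves while another stagnates; the window size, thresholds, and synthetic curves are arbitrary assumptions rather than the paper's analysis.

```python
import numpy as np

def coordinate_ascent_windows(loss_a, loss_b, window=100, stall_tol=1e-3):
    """Flag windows where one modality's loss drops noticeably while the
    other's barely moves -- a crude proxy for coordinate-ascent-like training."""
    flags = []
    for start in range(0, min(len(loss_a), len(loss_b)) - window, window):
        drop_a = loss_a[start] - loss_a[start + window]
        drop_b = loss_b[start] - loss_b[start + window]
        a_only = drop_a > stall_tol and abs(drop_b) < stall_tol
        b_only = drop_b > stall_tol and abs(drop_a) < stall_tol
        if a_only or b_only:
            flags.append((start, start + window, "A" if a_only else "B"))
    return flags

# Hypothetical loss curves that alternate which modality is improving.
loss_text = np.concatenate([np.linspace(4.0, 3.5, 500), np.full(500, 3.5)])
loss_image = np.concatenate([np.full(500, 6.0), np.linspace(6.0, 5.6, 500)])
print(coordinate_ascent_windows(loss_text, loss_image))
```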
Based on these scaling laws, the paper further distills empirical guidelines for setting hyperparameters in multi-modal training, giving practitioners a foundation for informed decisions about model size and training configuration.
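One way such guidelines could be operationalized is to use a fitted law to choose the loss-minimizing model size under a fixed compute budget; the coefficients below are placeholders, and the C ≈ 6ND compute approximation is a common rule of thumb rather than a result taken from the paper.

```python
import numpy as np

def predicted_loss(N, D, E=1.8, A=400.0, alpha=0.34, B=1100.0, beta=0.28):
    """Chinchilla-style law with placeholder coefficients."""
    return E + A / N**alpha + B / D**beta

def best_model_size(compute_flops, candidate_N):
    """For a fixed compute budget (C ~= 6 * N * D), pick the candidate model
    size whose implied token count minimizes the predicted loss."""
    losses = []
    for N in candidate_N:
        D = compute_flops / (6.0 * N)   # tokens affordable at this model size
        losses.append(predicted_loss(N, D))
    return candidate_N[int(np.argmin(losses))]

candidates = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
print(best_model_size(compute_flops=1e21, candidate_N=candidates))
```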
Implications and Future Directions
This paper not only advances the understanding of scaling behavior in mixed-modal generative models but also directly addresses the potential and challenges of integrating multiple data types within a single modeling framework. The findings have implications for deploying efficient large-scale models that leverage data across modalities, opening possibilities for applications where diverse data sources converge, such as multimedia content generation and autonomous multitasking systems.
The research suggests future inquiries into optimizing model architectures and training strategies to better harness cross-modal interactions, minimize detrimental competition, and enhance functional synergy. Continuing to refine scaling laws with broader datasets and modalities could further illuminate the dynamics of multimodal learning, opening paths toward more comprehensive, adaptive, and efficient AI systems.