
Pretrained Hybrids with MAD Skills (2406.00894v1)

Published 2 Jun 2024 in cs.LG, cs.AI, and cs.CL

Abstract: While Transformers underpin modern LLMs (LMs), there is a growing list of alternative architectures with new capabilities, promises, and tradeoffs. This makes choosing the right LM architecture challenging. Recently-proposed $\textit{hybrid architectures}$ seek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We propose $\textbf{Manticore}$, a framework that addresses these challenges. Manticore $\textit{automates the design of hybrid architectures}$ while reusing pretrained models to create $\textit{pretrained}$ hybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by incorporating simple projectors that translate features between pretrained blocks from different architectures. We then fine-tune hybrids that combine pretrained models from different architecture families -- such as the GPT series and Mamba -- end-to-end. With Manticore, we enable LM selection without training multiple models, the construction of pretrained hybrids from existing pretrained models, and the ability to $\textit{program}$ pretrained hybrids to have certain capabilities. Manticore hybrids outperform existing manually-designed hybrids, achieve strong performance on Long Range Arena (LRA) tasks, and can improve on pretrained transformers and state space models.

Summary

  • The paper introduces the Manticore framework that integrates diverse pretrained models via projectors and convex mixture weights to build efficient hybrid architectures.
  • The methodology leverages neural architecture search ideas to combine pretrained components without retraining from scratch, reducing computational costs.
  • Experimental results demonstrate enhanced performance on both synthetic and natural language tasks, validating the framework's practical scalability.

Pretrained Hybrids with MAD Skills: A Summary

In the domain of LLMs (LMs), the Transformer architecture has long been dominant. However, the growing list of alternative architectures makes it harder for practitioners to choose the right model architecture for their tasks. This paper introduces Manticore, a framework that automates the design of hybrid architectures by leveraging pretrained models from different LM families. Manticore addresses two primary challenges: the reliance on manual, expert-driven search for hybrid designs and the need to train new hybrids from scratch.

Core Proposition

The Manticore framework augments ideas from neural architecture search (NAS) with simple projectors to translate feature representations between different pretrained blocks, allowing the integration of multiple architectures. By combining these pretrained blocks, Manticore enables the construction of hybrid models that harness the capabilities of different architectures without the need for exhaustive training from scratch.

Methodological Innovations

Manticore’s approach involves three main components:

  1. Component Models: The individual pretrained models, which can be any modern LM (e.g., a transformer or a state-space model). Each component's forward pass runs through a sequence of blocks before reaching its final output layer.
  2. Projectors: To resolve incompatibilities caused by differing feature dimensions across models, Manticore inserts projectors that align features between architectures using linear transformations with gated residuals, ensuring compatibility between the blocks of the component models (a minimal sketch follows this list).
  3. Mixture Weights: Manticore learns convex-combination mixture weights that govern how much each component model's blocks contribute to the hybrid. This adaptive weighting lets the framework balance the contributions of the different components effectively.
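
The paper specifies the projectors only at a high level (linear transformations with gated residuals); the PyTorch sketch below shows one plausible reading, in which a learned scalar gate blends a linear projection with a linear skip path. The class and parameter names (`GatedLinearProjector`, `gate`) are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedLinearProjector(nn.Module):
    """Illustrative projector: maps features from one component model's
    dimension to another's, blending a linear projection with a linear
    skip path via a learned scalar gate. A sketch, not the paper's code."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)        # main linear transformation
        self.skip = nn.Linear(d_in, d_out)        # residual path (linear, since d_in may differ from d_out)
        self.gate = nn.Parameter(torch.zeros(1))  # learned gate; sigmoid(0) = 0.5 at initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate)              # gate value in (0, 1)
        return g * self.proj(x) + (1.0 - g) * self.skip(x)
```

A projector of this form would sit at the input and output of each group of blocks, so that features produced by, say, a GPT block can be consumed by a Mamba block and vice versa.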

The resultant hybrid model, termed a Manticore hybrid, is formed by partitioning the blocks of each component model evenly and integrating them using the learned mixture weights. This design obviates the need for discrete architecture selection typical in traditional NAS methods.
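
To make the mixture concrete, here is a minimal PyTorch sketch of such a hybrid block, under the assumption that each component model's blocks have already been partitioned into groups and wrapped in projectors so that all groups share a common feature dimension. Names such as `HybridBlock` and `alpha` are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One Manticore-style hybrid block (sketch): runs the corresponding
    projector-wrapped block group from each component model in parallel
    and mixes their outputs with convex (softmax-normalized) weights."""

    def __init__(self, block_groups: nn.ModuleList):
        super().__init__()
        self.block_groups = block_groups                            # one group per component model
        self.alpha = nn.Parameter(torch.zeros(len(block_groups)))   # mixture logits, learned end-to-end

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.alpha, dim=0)                  # convex combination weights
        outputs = [group(x) for group in self.block_groups]         # run each component's blocks in parallel
        return sum(w * out for w, out in zip(weights, outputs))
```

A full hybrid would stack one such block per partition between shared embedding and output layers, and fine-tune everything, including the mixture logits, end-to-end.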

Experimental Validation

The experimental results presented in the paper substantiate the efficacy of Manticore hybrids across several benchmarks:

  1. Fine-Tuning Pretrained Hybrids: Experiments showed that Manticore hybrids can outperform their constituent models on synthetic and natural language tasks (e.g., Penn Treebank, Alpaca, ELI5), with the largest gains when the component models bring complementary skills to the task.
  2. Training from Scratch: When trained from scratch, Manticore hybrids were shown to be competitive with existing hybrid architectures and non-hybrid models. On the MAD tasks, Manticore hybrids performed comparably to established hybrids like MambaFormer and surpassed them in some configurations.
  3. Programming Hybrids: By leveraging external sources such as task-specific metadata or proxy tasks like the MAD tasks, Manticore could effectively program mixture weights without extensive reconfiguration. This ability to program hybrids using external data highlights Manticore’s adaptability and robustness in diverse application scenarios.

Implications and Future Directions

The theoretical and practical implications of the Manticore framework are substantial. The ability to automate the construction of pretrained hybrids not only reduces computational costs associated with training but also democratizes access to high-performing hybrid models. This can significantly accelerate research and development in LM applications across various domains.

Several key areas for future exploration include:

  • Enhanced Search Algorithms: The paper uses DARTS as the search algorithm for the mixture weights. Future work could investigate advanced gradient-based NAS methods tailored to Manticore's search space to further optimize hybrid performance (a first-order sketch of the DARTS-style alternation appears after this list).
  • Scalability: Evaluating Manticore's performance on larger-scale models and tasks, particularly those requiring extreme sequence lengths, would be a valuable extension.
  • Standardization: As highlighted in the paper’s appendices, there is a need for standardized, block-structured LM implementations to facilitate cleaner integration and experimentation.
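
For concreteness, a first-order, DARTS-style alternation over Manticore's mixture weights could look like the sketch below. The parameter-name filter (`alpha`), optimizer choices, and data splits are assumptions rather than the paper's exact training recipe.

```python
import torch

def darts_style_search(hybrid, train_loader, val_loader, loss_fn, epochs: int = 1):
    """First-order DARTS-style alternation (sketch): mixture logits are updated
    on validation batches, all remaining parameters on training batches."""
    mix_params = [p for n, p in hybrid.named_parameters() if n.endswith("alpha")]
    model_params = [p for n, p in hybrid.named_parameters() if not n.endswith("alpha")]
    mix_opt = torch.optim.Adam(mix_params, lr=3e-4)       # architecture (mixture-weight) optimizer
    model_opt = torch.optim.AdamW(model_params, lr=1e-4)  # ordinary model-weight optimizer

    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # Architecture step: adjust mixture weights to reduce validation loss.
            mix_opt.zero_grad()
            loss_fn(hybrid(x_val), y_val).backward()
            mix_opt.step()

            # Weight step: train the hybrid's remaining parameters on training data.
            model_opt.zero_grad()
            loss_fn(hybrid(x_tr), y_tr).backward()
            model_opt.step()
```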

In conclusion, Manticore presents a robust framework for integrating pretrained model components into high-performing hybrid architectures, potentially revolutionizing how practitioners design and deploy LLMs. The framework's ability to automate hybrid design and leverage pretrained models underscores its practicality and innovation in the rapidly evolving field of AI.