
STAR: Synthesis of Tailored Architectures (2411.17800v1)

Published 26 Nov 2024 in cs.LG, cs.AI, and cs.NE

Abstract: Iterative improvement of model architectures is fundamental to deep learning: Transformers first enabled scaling, and recent advances in model hybridization have pushed the quality-efficiency frontier. However, optimizing architectures remains challenging and expensive. Current automated or manual approaches fall short, largely due to limited progress in the design of search spaces and due to the simplicity of resulting patterns and heuristics. In this work, we propose a new approach for the synthesis of tailored architectures (STAR). Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.

Summary

  • The paper introduces STAR, a framework that automatically synthesizes optimized model architectures using a hierarchical search space and evolutionary algorithms.
  • It details a multi-level genome representation that encodes featurization, operator, and backbone configurations for tailored deep learning models.
  • Experimental results show improved downstream quality, parameter reductions of up to 13%, and substantially smaller inference caches, outperforming baseline models.

STAR: Synthesis of Tailored Architectures

Introduction

The paper introduces STAR, a new method for optimizing model architectures specifically tailored to improve both quality and efficiency in deep learning applications. By leveraging a distinct hierarchical search space and evolutionary algorithms, the STAR framework facilitates the automated synthesis of architecture genomes, incorporating multiple optimization metrics.

Hierarchical Search Spaces

STAR uses a novel design space grounded in the theory of Linear Input-Varying Systems (LIVs), which generalize computational units such as attention variants, linear recurrences, and convolutions (a sketch of this formalism follows Figure 1). The design space is characterized at three hierarchical levels: featurization, operator structure, and backbone composition.

  • Featurization: how input-dependent features modulate the linear computation, with elementwise non-linearities applied after the linear operations.
  • Operator structure: how each operator mixes information across tokens and channels, covering diverse structured-matrix configurations for efficient computation.
  • Backbone composition: how LIV units are interconnected, including shared featurizers and feature-group mappings (Figure 1).

    Figure 1: Population of architectures undergoing iterative STAR evolution to minimize the number of parameters and maximize quality.
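
To make the LIV abstraction concrete, the following is a minimal sketch of the formalism; the notation is ours and simplified relative to the paper. A linear input-varying system applies a linear operator whose entries themselves depend on the input:

$$ y = T(x)\,x, \qquad y_i = \sum_j \big[T(x)\big]_{ij}\, x_j. $$

Attention fits this template (up to projections) with $[T(x)]_{ij} \propto \exp(q_i^\top k_j)$; a convolution is the input-invariant special case $[T]_{ij} = h_{i-j}$; and gated linear recurrences correspond to structured lower-triangular $T(x)$. Because these units share one functional form, they can be encoded, mutated, and recombined within a single search space.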

STAR Genome Representation

Genomes within STAR encapsulate numeric representations of model architectures, facilitating manipulation and optimization. The genome layers include:

  • Backbone Genome: Describes the ordering of LIV units and their interconnections via featurizer-sharing and group-sharing indices.
  • Operator Genome: Encodes each unit's LIV class and its structural parameters.
  • Featurizer Genome: Details the token and channel mixing of each LIV's featurizer, including expansion and repeat factors (Figure 2).

    Figure 2: Hierarchical structure of the STAR genome and its representation as discrete variables across multiple levels.
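
As an illustration of how such a hierarchy can be serialized into the flat integer vectors that evolutionary operators act on, here is a minimal Python sketch; all field names and value ranges are our own assumptions, not the paper's exact encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OperatorGene:
    liv_class: int           # index into a catalog of LIV unit types (e.g. attention, recurrence, convolution)
    featurizer_sharing: int  # index of the unit whose featurizer this unit shares (self-index = no sharing)
    group_mapping: int       # feature-group mapping used to interconnect with neighboring units

@dataclass
class FeaturizerGene:
    token_mixing: int        # discrete choice of token-mixing structure
    channel_mixing: int      # discrete choice of channel-mixing structure
    expansion: int           # expansion factor
    repeat: int              # repeat factor

@dataclass
class StarGenome:
    """Hierarchical genome: backbone-level interconnections plus per-unit operator and featurizer genes."""
    operators: List[OperatorGene]
    featurizers: List[FeaturizerGene]

    def flatten(self) -> List[int]:
        """Serialize to a flat integer vector so crossover and mutation can act uniformly."""
        out: List[int] = []
        for op, feat in zip(self.operators, self.featurizers):
            out += [op.liv_class, op.featurizer_sharing, op.group_mapping,
                    feat.token_mixing, feat.channel_mixing, feat.expansion, feat.repeat]
        return out
```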

Optimization with Evolutionary Algorithms

The STAR framework employs evolutionary algorithms to optimize architecture genomes. Assessment, pairing, recombination, and mutation are applied iteratively (a minimal sketch of the loop follows Figure 3):

  • Assessment: Evaluates candidate architectures against the chosen quality and efficiency objectives.
  • Pairing and Recombination: Selects parents via tournament selection and recombines them with k-point crossover to produce diverse offspring.
  • Mutation: Maintains population diversity by perturbing genome values, with constraints that keep the resulting architectures stable to train (Figure 3).

    Figure 3: Fundamental operations of STAR evolution, analogous to other evolutionary optimization algorithms.
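
Here is a minimal sketch of such a gradient-free loop in Python, operating on the flattened genomes sketched above. It uses a single scalar score for simplicity, whereas STAR optimizes multiple quality and efficiency metrics; the tournament size, number of crossover points, and mutation rate are illustrative assumptions, not the paper's reported settings.

```python
import random
from typing import Callable, List

Genome = List[int]

def tournament_select(pop: List[Genome], scores: List[float], k: int = 3) -> Genome:
    """Pick the best of k randomly drawn candidates (higher score is better)."""
    idx = random.sample(range(len(pop)), k)
    return pop[max(idx, key=lambda i: scores[i])]

def k_point_crossover(a: Genome, b: Genome, k: int = 2) -> Genome:
    """Swap segments between two parents at k random cut points."""
    points = sorted(random.sample(range(1, len(a)), k))
    child, parents, start = [], [a, b], 0
    for i, p in enumerate(points + [len(a)]):
        child += parents[i % 2][start:p]
        start = p
    return child

def mutate(g: Genome, rate: float, n_choices: int) -> Genome:
    """Resample each gene with probability `rate` (simplified: one shared value range)."""
    return [random.randrange(n_choices) if random.random() < rate else v for v in g]

def evolve(init: List[Genome], score_fn: Callable[[Genome], float],
           generations: int = 20, mut_rate: float = 0.05, n_choices: int = 8) -> Genome:
    pop = init
    for _ in range(generations):
        scores = [score_fn(g) for g in pop]            # assessment
        pop = [mutate(k_point_crossover(               # recombination + mutation
                   tournament_select(pop, scores),     # pairing
                   tournament_select(pop, scores)),
                   mut_rate, n_choices)
               for _ in range(len(pop))]
    return max(pop, key=score_fn)
```

In a real run, score_fn would involve briefly training and benchmarking each candidate, each gene position would carry its own valid range plus architecture-level constraints (the role of the paper's constrained mutation), and a multi-objective Pareto front would be tracked rather than a single scalar.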

Experimental Results

Experiments in autoregressive language modeling demonstrate STAR's efficacy in synthesizing architectures that balance quality, size, and cache efficiency. Metrics reveal substantial improvements over Transformer and hybrid models:

  • Optimizing for quality alone outperforms baselines on downstream benchmark averages.
  • Jointly optimizing quality and size reduces parameter counts by up to 13% while achieving superior performance.
  • Jointly optimizing quality and cache size reduces the inference cache by up to 90% compared to the baseline models.

Conclusions and Future Work

STAR constitutes a significant advance in automated architecture optimization, presenting a powerful tool for AI system design across domains. Future extensions may focus on enabling variable depth and width optimizations, refining multi-stage approaches, and integrating with existing scaling protocols.

STAR's hierarchical design and synthesis approach points toward more versatile and efficient construction of complex AI models tailored to diverse application requirements (Figure 4).

Figure 4: Genome scores during STAR evolution for quality optimization, demonstrating progressive improvement across generations.
