- The paper introduces a novel architecture that dynamically configures transformer layers and attention heads for individual data samples.
- It employs Uniform Confidence Bound and Thompson Sampling to optimize model architecture, achieving up to 74% computational cost reduction.
- Experimental results demonstrate that the Dynamic Hierarchical Transformer maintains robust accuracy while substantially lowering resource demands.
Introduction
Transformer models have cemented their place as a cornerstone in deep learning, particularly in tasks pertaining to language understanding. Despite their success, a recurring challenge is the one-size-fits-all architecture that demands a fixed, often large number of layers and attention heads. This structure leads to high computational costs during training and inference and poses a hurdle for deployment in resource-constrained environments. The paper presents a novel architecture called the Dynamic Hierarchical Transformer (DHT), which tackles these issues by introducing a method to dynamically adjust the transformer's complexity on a per-sample basis during both training and inference.
Dynamic Configuration Approach
At the core of DHT is the ability to tailor the number of layers and attention heads to the needs of individual data samples. The paper draws on two methods from the contextual bandit literature: Uniform Confidence Bound (UCB; this family of algorithms is more commonly known as Upper Confidence Bound), which determines how many layers and heads to use, and combinatorial Thompson Sampling Semi-Bandits (TSB), which choose the specific combination of heads within each layer. This flexibility allows DHT to adopt a more resource-efficient model without a significant impact on accuracy.
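To make the two bandit roles concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a plain, non-contextual UCB1 rule for picking a count (for example, a number of layers) and a Beta-Bernoulli Thompson Sampling semi-bandit for picking which k heads to keep. The class names `UCB1` and `HeadSemiBandit`, the exploration constant `c`, and the 0/1 per-head reward are all illustrative assumptions; the paper's bandits are contextual, meaning they condition on the input sample, which this sketch omits.

```python
import math
import random


class UCB1:
    """Non-contextual UCB1 over a discrete set of choices (e.g. layer counts).

    Assumption: a simplified stand-in for the contextual bandit in the paper.
    """

    def __init__(self, arms, c=2.0):
        self.arms = list(arms)                # candidate values, e.g. [2, 4, 6]
        self.c = c                            # exploration constant (assumed)
        self.counts = [0] * len(self.arms)    # times each arm was played
        self.means = [0.0] * len(self.arms)   # running mean reward per arm

    def select(self):
        # Play every arm once before trusting the confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(self.c * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean for arm i.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


class HeadSemiBandit:
    """Beta-Bernoulli Thompson Sampling that picks k of n heads.

    Assumption: each chosen head returns its own 0/1 reward (semi-bandit feedback).
    """

    def __init__(self, n_heads, rng=None):
        self.alpha = [1.0] * n_heads          # Beta(1, 1) prior per head
        self.beta = [1.0] * n_heads
        self.rng = rng or random.Random(0)

    def select(self, k):
        # Sample a plausible usefulness for every head, then keep the top k.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
        return sorted(ranked[:k])

    def update(self, chosen, rewards):
        # Update only the heads that were actually used.
        for h, r in zip(chosen, rewards):
            self.alpha[h] += r
            self.beta[h] += 1 - r
```

In use, `UCB1.select()` returns an index into `arms`, so `bandit.arms[bandit.select()]` gives the chosen layer or head count, and `HeadSemiBandit.select(k)` returns the indices of the heads to activate. How the reward trades accuracy against compute is left open here, since the summary does not specify it.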
Hierarchical Architecture Search
DHT features a hierarchical search approach that lets the network adapt its architecture during training. The process offers two search modes: tree search, which emphasizes comprehensiveness, and integrated search, which focuses on efficiency. In tree search, UCBs select the number of layers and heads, while TSBs decide head combinations. Integrated search uses complex UCBs to determine the number of layers and heads jointly, and still relies on TSBs for head combinations. The result is an architecture that evolves with the data it encounters.
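The difference between the two modes is mainly in how the decisions are staged, which the sketch below tries to illustrate under several assumptions. The `pick(key, options)` hook is a hypothetical stand-in for whichever bandit policy makes a decision (a UCB for counts, a TSB for head subsets); here it is filled with a random choice so the example runs on its own. Choosing a head count per layer in tree search, and sharing one head count across layers in integrated search, are assumptions made for illustration and are not confirmed details of the paper.

```python
import itertools
import random


def tree_search(layer_options, head_options, total_heads, pick):
    """Staged decisions: layer count first, then a head count and subset per layer."""
    n_layers = pick("layers", layer_options)
    config = []
    for layer in range(n_layers):
        n_heads = pick(("heads", layer), head_options)
        # Enumerating subsets is only feasible for small head counts; a
        # semi-bandit would instead score heads individually.
        combos = list(itertools.combinations(range(total_heads), n_heads))
        config.append(pick(("combo", layer, n_heads), combos))
    return config


def integrated_search(layer_options, head_options, total_heads, pick):
    """One joint decision over (layer count, head count), then head subsets."""
    joint_arms = list(itertools.product(layer_options, head_options))
    n_layers, n_heads = pick("joint", joint_arms)
    combos = list(itertools.combinations(range(total_heads), n_heads))
    return [pick(("combo", layer, n_heads), combos) for layer in range(n_layers)]


def random_pick(rng):
    """Hypothetical placeholder policy: choose uniformly at random."""
    def pick(key, options):
        return rng.choice(list(options))
    return pick


if __name__ == "__main__":
    pick = random_pick(random.Random(0))
    # Each entry is the tuple of active head indices for one layer.
    print(tree_search([2, 4], [1, 2, 3], total_heads=4, pick=pick))
    print(integrated_search([2, 4], [1, 2, 3], total_heads=4, pick=pick))
```

With two layer options and three head options, tree search makes one small decision per stage, while integrated search faces six joint arms in a single step. That is one way to read the trade-off between comprehensiveness and efficiency described above.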
Experimental Results
Experiments on several text classification datasets show that DHT can reduce computational costs by up to 74% with minimal impact on accuracy. The flexibility of DHT is evidenced by its performance relative to various baselines, including traditional network compression methods and other transformer models with fixed architectures. Furthermore, DHT's dynamic approach remains stable during training, as shown by its smooth convergence trends.
Conclusion
The DHT model showcases a significant stride in efficient transformer architectures. By integrating contextual bandits into the training process, DHT performs sample-specific optimizations that reduce computational demand and model size. This research opens the door to more adaptive and efficient use of transformers, with the potential to extend these concepts to other deep learning frameworks. The DHT thereby contributes to both the fields of AutoML, which seeks automation in machine learning processes, and network compression, addressing the critical need for lightweight models in practical applications.