Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit

Published 5 Dec 2023 in cs.LG, cs.AI, and cs.NE | (2312.03038v3)

Abstract: Transformer requires a fixed number of layers and heads which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured with single data samples via solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound while we deploy combinatorial Thompson Sampling in order to select specific head combinations given their number. Different from previous work that focuses on compressing trained networks for inference only, DHT is not only advantageous for adaptively optimizing the underlying network architecture during training but also has a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer without any additional auxiliary neural networks that implement the dynamic system. According to the experiment results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.

Summary

The paper introduces a novel architecture that dynamically configures transformer layers and attention heads for individual data samples.
It employs Uniform Confidence Bound and Thompson Sampling to optimize model architecture, achieving up to 74% computational cost reduction.
Experimental results demonstrate that the Dynamic Hierarchical Transformer maintains robust accuracy while substantially lowering resource demands.

Introduction

Transformer models have cemented their place as a cornerstone in deep learning, particularly in tasks pertaining to language understanding. Despite their success, a recurring challenge is the one-size-fits-all architecture that demands a fixed, often large number of layers and attention heads. This structure leads to high computational costs during training and inference and poses a hurdle for deployment in resource-constrained environments. The paper presents a novel architecture called the Dynamic Hierarchical Transformer (DHT), which tackles these issues by introducing a method to dynamically adjust the transformer's complexity on a per-sample basis during both training and inference.

Dynamic Configuration Approach

At the core of DHT is the ability to tailor the number of layers and attention heads to the needs of individual data samples. The paper uses two methods from the contextual bandit literature: Uniform Confidence Bound (UCB), which determines the number of layers and heads, and combinatorial Thompson Sampling semi-bandits (TSB), which choose the specific combination of heads within each layer. This flexibility allows DHT to adopt a more resource-efficient model without a significant impact on accuracy.
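
The two decisions described above can be illustrated with a minimal sketch. It is not the paper's implementation: it drops the per-sample context for brevity, uses the standard UCB1 rule for the count decision, and uses Beta-Bernoulli Thompson Sampling with semi-bandit feedback for the head subset. The class names `UCB1` and `HeadThompson`, the candidate head counts, and the reward values are all hypothetical.

```python
import math
import random


class UCB1:
    """Standard UCB1 over a discrete set of arms (e.g. candidate head counts)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}
        self.t = 0  # total number of updates so far

    def select(self):
        # Play every arm once before the confidence bound is defined.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # Incremental running mean of the observed rewards.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


class HeadThompson:
    """Beta-Bernoulli Thompson Sampling that picks k heads per round."""

    def __init__(self, n_heads, seed=0):
        self.alpha = [1.0] * n_heads  # uniform Beta(1, 1) priors
        self.beta = [1.0] * n_heads
        self.rng = random.Random(seed)

    def select(self, k):
        # Sample one plausible usefulness score per head, keep the top k.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(samples)),
                        key=lambda i: samples[i], reverse=True)
        return sorted(ranked[:k])

    def update(self, chosen, rewards):
        # Semi-bandit feedback: one reward in [0, 1] per chosen head.
        for h, r in zip(chosen, rewards):
            self.alpha[h] += r
            self.beta[h] += 1 - r


# Assumed usage: pick how many heads to keep, then which ones.
count_bandit = UCB1(arms=[1, 2, 4, 8])
head_bandit = HeadThompson(n_heads=8)
k = count_bandit.select()
heads = head_bandit.select(k)
```

In a real training loop the reward would come from the model, for example a score that trades accuracy against compute; that reward definition is an assumption here and is not taken from the paper.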

Hierarchical Architecture Search

DHT uses a hierarchical search that lets the network adapt its architecture during training. The search has two modes: tree search, which emphasizes comprehensiveness, and integrated search, which focuses on efficiency. In tree search, UCBs select the number of layers and heads, while Thompson Sampling semi-bandits (TSBs) decide head combinations. Integrated search uses more complex UCBs to determine the numbers of layers and heads jointly, and still relies on TSBs for head combinations. The result is an architecture that evolves with the data it encounters.
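
One possible reading of the structural difference between the two modes is sketched below, again without per-sample context and with the standard UCB1 score standing in for the paper's bandits. In the tree-style version the layer count is chosen first and the head count is chosen conditioned on it; in the integrated-style version a single bandit chooses a (layers, heads) pair. The helper names, the candidate sets `LAYERS` and `HEADS`, and the reward values are hypothetical, and the head-combination step (TSB) is omitted.

```python
import itertools
import math

LAYERS = [1, 2, 3]   # hypothetical candidate layer counts
HEADS = [2, 4]       # hypothetical candidate head counts


def ucb_pick(stats, arms, t):
    """Return the arm with the highest UCB1 score; unplayed arms go first.

    stats maps arm -> (play count, mean reward); t is the number of
    completed rounds and must be >= 1 once every arm has been played.
    """
    for a in arms:
        if stats.get(a, (0, 0.0))[0] == 0:
            return a
    return max(
        arms,
        key=lambda a: stats[a][1] + math.sqrt(2 * math.log(t) / stats[a][0]),
    )


def ucb_update(stats, arm, reward):
    """Update the play count and running mean reward of one arm."""
    n, mean = stats.get(arm, (0, 0.0))
    n += 1
    stats[arm] = (n, mean + (reward - mean) / n)


def tree_pick(layer_stats, head_stats, t):
    """Tree-style: pick the layer count, then a head count given that choice."""
    n_layers = ucb_pick(layer_stats, LAYERS, t)
    head_arms = [(n_layers, h) for h in HEADS]
    _, n_heads = ucb_pick(head_stats, head_arms, t)
    return n_layers, n_heads


def integrated_pick(joint_stats, t):
    """Integrated-style: one bandit over every (layers, heads) pair."""
    return ucb_pick(joint_stats, list(itertools.product(LAYERS, HEADS)), t)
```

After each round, the tree-style variant would call `ucb_update` once for the layer arm and once for the (layers, heads) arm, while the integrated variant would call it once for the joint arm.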

Experimental Results

Experiments on several text classification datasets show that DHT reduces computational cost by up to 74% with minimal loss of accuracy. DHT is compared against various baselines, including traditional network compression methods and transformer models with fixed architectures. Training also remains stable under the dynamic approach, as indicated by smooth convergence trends.

Conclusion

The DHT model showcases a significant stride in efficient transformer architectures. By integrating contextual bandits into the training process, DHT performs sample-specific optimizations that reduce computational demand and model size. This research opens the door to more adaptive and efficient use of transformers, with the potential to extend these concepts to other deep learning frameworks. The DHT thereby contributes to both the fields of AutoML, which seeks automation in machine learning processes, and network compression, addressing the critical need for lightweight models in practical applications.