- The paper presents a Hardware-Aware Training methodology that adapts pre-trained LLMs to operate robustly on analog in-memory computing hardware despite noise and quantization limits.
- It leverages synthetic training data and knowledge distillation to preserve instruction following and safety features even under hardware-induced analog noise.
- Evaluations across diverse benchmarks demonstrate that analog foundation models achieve competitive accuracy and efficiency compared to state-of-the-art digital quantization methods.
This paper presents a general and scalable method for adapting pre-trained LLMs to run robustly on Analog In-Memory Computing (AIMC) hardware. AIMC offers significant potential for improving the speed and power efficiency of neural network inference by performing computations directly within memory, but it introduces challenges like analog noise, device variability, and strict input/output quantization constraints. The paper shows that off-the-shelf LLMs suffer substantial accuracy drops when deployed on AIMC hardware due to these non-idealities.
The core contribution is a hardware-aware (HWA) training methodology designed specifically for LLMs, which can be applied without access to the original pre-training data. The proposed pipeline, inspired by LLM-QAT (2305.17888), first generates synthetic training data using the pre-trained LLM itself and then trains an "analog foundation model" via knowledge distillation on that data. The training incorporates specific HWA techniques to enhance robustness to analog hardware characteristics.
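A minimal sketch of the synthetic-data step, assuming a Hugging Face-style API; the model name, seeding strategy, and sampling parameters below are illustrative, not the paper's exact recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices; the paper's exact seeds and sampling settings may differ.
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # enable batched padding
teacher = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def sample_synthetic_batch(seed_texts, max_new_tokens=1024):
    """Let the pre-trained model write its own training data: sample
    continuations from short seeds and reuse the token streams as the
    distillation corpus for the analog student."""
    inputs = tokenizer(seed_texts, return_tensors="pt", padding=True).to(teacher.device)
    return teacher.generate(
        **inputs,
        do_sample=True,          # stochastic sampling keeps the corpus diverse
        top_p=0.95,
        temperature=1.0,
        max_new_tokens=max_new_tokens,
    )
```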
The method assumes a heterogeneous accelerator architecture in which matrix-vector multiplications are offloaded to analog cores, while other operations such as activation functions and attention are handled digitally in FP16. Key hardware constraints addressed include the following (illustrated together in the sketch after this list):
- Static Input Quantization: Inputs to analog cores are quantized to 8 bits using ranges that are learned during training and frozen at inference. The ranges are initialized generously and tightened as training progresses.
- Globally Static Output Quantization: Outputs of analog cores are quantized to 8 bits using fixed ranges that are identical across layers, reflecting limitations of Analog-to-Digital Converters (ADCs). Straight-through estimation is used for gradients.
- Weight Noise: Additive Gaussian noise is injected into the weights during the forward pass to simulate analog variability. The noise magnitude is scaled by the per-channel maximum absolute weight, with a small relative magnitude (0.02-0.03) found to be optimal. Backpropagation uses the noise-free weights.
- Weight Clipping: Weights are iteratively clipped to ±α⋅std (with α between 2.0 and 3.5) after each optimizer step. This removes outliers and tightens the weight distributions, and was, somewhat surprisingly, found to matter more for robustness than noise injection alone.
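Taken together, these constraints can be mimicked in a standard PyTorch layer. The sketch below is a minimal illustration assuming a per-layer learnable input range, a shared static output range, and straight-through gradients; it is not AIHWKIT-Lightning's actual implementation, and the default values are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnalogSimLinear(nn.Module):
    """Hypothetical module combining the four constraints above: 8-bit static
    input quantization with a learnable range, additive per-channel Gaussian
    weight noise, and 8-bit globally static output quantization, both
    quantizers using a straight-through estimator (STE)."""

    def __init__(self, in_features, out_features, noise_scale=0.02, out_range=10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        # Input range: initialized generously, learned during training,
        # then frozen at inference (initial value is an assumption).
        self.in_range = nn.Parameter(torch.tensor(8.0))
        # Output range: static and identical across layers (illustrative value).
        self.register_buffer("out_range", torch.tensor(out_range))
        self.noise_scale = noise_scale  # 0.02-0.03 reported as a good setting

    @staticmethod
    def _quant_ste(x, rng, n_bits=8):
        # Symmetric uniform quantizer; the rounding is bypassed in the
        # backward pass (straight-through), so gradients reach x and rng.
        qmax = 2 ** (n_bits - 1) - 1
        step = rng / qmax
        z = torch.clamp(x / step, -qmax - 1, qmax)
        z = z + (torch.round(z) - z).detach()
        return z * step

    def forward(self, x):
        x = self._quant_ste(x, self.in_range)
        y = F.linear(x, self.weight)
        if self.training:
            # The noise contribution is computed without grad tracking, so the
            # backward pass effectively sees only the noise-free weights.
            with torch.no_grad():
                sigma = self.noise_scale * self.weight.abs().amax(dim=1, keepdim=True)
                y_noise = F.linear(x, torch.randn_like(self.weight) * sigma)
            y = y + y_noise
        return self._quant_ste(y, self.out_range)


@torch.no_grad()
def clip_weights(model, alpha=3.0):
    # Iterative clipping to +/- alpha * std, applied after every optimizer step.
    for m in model.modules():
        if isinstance(m, AnalogSimLinear):
            bound = (alpha * m.weight.std()).item()
            m.weight.clamp_(-bound, bound)
```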
The training is performed on a large synthetic corpus (up to 20 billion tokens) generated by sampling from the teacher model. Distillation is crucial for retaining capabilities learned during the original pre-training and instruction tuning. Training is computationally intensive but is a one-time cost. The authors used AIHWKIT-Lightning [QNdxOgGmhR], an open-source toolkit for scalable HWA training implemented in PyTorch (1912.01703).
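A sketch of the distillation objective and a single training step, assuming a token-level KL loss against the teacher's logits (the paper's exact loss and optimizer settings are not reproduced here); `clip_weights` refers to the sketch above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Token-level KL divergence between teacher and student next-token
    # distributions; the temperature and exact formulation are assumptions.
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, -2)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, -2)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def train_step(student, teacher, input_ids, optimizer, alpha=3.0):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    loss = distillation_loss(student(input_ids).logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    clip_weights(student, alpha)  # re-clip after every update (see sketch above)
    return loss.item()
```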
The paper evaluates the method on Phi-3-mini-4k-instruct (2404.14219) and Llama-3.2-1B-Instruct (2407.21783) across 12 diverse benchmarks covering reasoning, knowledge, instruction following, and safety. Hardware-realistic noise profiles extracted from a phase-change-memory (PCM) based AIMC chip [10.1038/s41928-023-01010-1] are used during evaluation.
Key results demonstrate:
- Robustness: Analog foundation models significantly outperform both off-the-shelf models and state-of-the-art digitally quantized models (LLM-QAT (2305.17888), SpinQuant [ogO6DGE6FZ]) when subjected to hardware-realistic analog noise. Despite the analog noise and strict quantization constraints, they reach average accuracy comparable to models trained with 4-bit weight and 8-bit static input quantization.
- Safety and Instruction Following: The HWA training methodology largely preserves the instruction following and safety capabilities of the original models, and these capabilities remain robust under analog noise.
- Digital Deployment: A byproduct of the training is that analog foundation models can also be quantized post-training with simple round-to-nearest (RTN) for inference on low-precision digital hardware; a minimal RTN sketch follows this list. The iterative clipping during training yields well-behaved weight distributions that allow competitive 4-bit digital performance without requiring dynamic activation quantization.
- Test-time Compute Scaling: Analog foundation models show better performance scaling with increased test-time compute (evaluated on MATH-500 (2305.20050) using multiple generations and reward-based selection) compared to models trained with 4-bit weight and 8-bit static input quantization.
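For the digital-deployment result, a minimal sketch of RTN weight quantization; per-output-channel symmetric scaling is an assumption here, not necessarily the exact scheme evaluated in the paper:

```python
import torch

def rtn_quantize(weight, n_bits=4):
    # Per-output-channel symmetric round-to-nearest: because clipped training
    # keeps the weight distributions tight, plain RTN loses little accuracy.
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale  # int4 codes stored in int8, plus FP scales
```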
The authors acknowledge limitations, including the resource intensity of training (even with reduced data), the remaining accuracy gap compared to FP16 baselines (especially for reasoning tasks), and the persistent risk of LLMs generating harmful content. Despite these limitations, the work successfully demonstrates the feasibility of creating LLMs robust enough for deployment on AIMC hardware, paving the way for more energy-efficient LLM inference.