
Compute-Optimal Quantization-Aware Training (2509.22935v1)

Published 26 Sep 2025 in cs.LG and cs.AI

Abstract: Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

Summary

  • The paper introduces a compute-optimal QAT framework that derives the optimal quantization phase fraction using a log-linear function of the tokens-per-parameter-byte statistic.
  • The unified loss scaling law models loss as a function of model size, token counts, and bit-width, enabling precise prediction of performance and compute waste.
  • The proposed cooldown & QAT fusion technique streamlines training by jointly decaying the learning rate with quantization, yielding significant compute savings.

Compute-Optimal Quantization-Aware Training: Theory, Empirics, and Practical Implications

Introduction

This paper presents a comprehensive analysis of compute allocation strategies for quantization-aware training (QAT) in LLMs, challenging prevailing assumptions about the optimal division of training between full-precision (FP) and quantized phases. The authors provide a rigorous empirical and theoretical framework for determining the compute-optimal QAT fraction, introduce a unified loss scaling law for QAT, and propose a novel cooldown & QAT fusion technique for improved training efficiency. The work is grounded in extensive experiments across a wide range of model sizes (86M–2.2B parameters), compute budgets, and quantization bit-widths (1–6 bits), with strong generalization to different datasets and hyperparameters.

Problem Formulation and Motivation

QAT is the dominant approach for adapting neural networks to low-precision inference, outperforming post-training quantization (PTQ) in accuracy, especially at low bit-widths. The standard practice is to pretrain a model in FP, then fine-tune with QAT. However, the optimal allocation of compute between FP and QAT phases has been poorly understood, with prior work suggesting a fixed QAT fraction (e.g., 10%) regardless of model size or compute budget. This paper demonstrates that such fixed ratios are suboptimal, especially as compute budgets increase, and that the optimal QAT fraction is a function of the tokens-per-parameter-byte statistic, which encapsulates model size, training duration, and quantization width.

Empirical Findings: Scaling Laws and Optimal QAT Fraction

The authors conduct a systematic grid of experiments, varying FP/QAT token allocation, model size, and bit-width. Key findings include:

  • The optimal QAT fraction increases with total compute budget (tokens-per-parameter-byte), contradicting previous claims of a fixed optimal ratio.
  • Low-bit QAT (1–2 bits) is highly sensitive to the QAT fraction, with substantial compute waste if suboptimal allocation is used. For 1-bit QAT, up to 50% compute savings are possible by using the optimal fraction.
  • Larger models tolerate lower bit-widths for QAT and can match FP accuracy at lower precision, given sufficient compute.
  • The optimal QAT fraction decreases with increasing model size for fixed compute, but increases with decreasing bit-width.

The optimal QAT fraction $f^*$ is accurately predicted by a log-linear function of the tokens-per-parameter-byte statistic:

f^*(D_{total}, N, B) = \exp\left(a \cdot \log\left(\frac{D_{total}}{N \cdot B}\right)\right)

where $a$ is a fitted parameter.
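The following minimal Python sketch shows how a prediction of this form can be applied in practice. The coefficient values (and the intercept `b`, which the summary above does not show) are placeholders for the paper's fitted parameters, and the bits-to-bytes factor only rescales those constants.

```python
import math


def tokens_per_parameter_byte(d_total: float, n_params: float, bits: float) -> float:
    """Training tokens divided by the quantized model's weight footprint in
    bytes (N parameters stored at `bits` bits each)."""
    return d_total / (n_params * bits / 8.0)


def optimal_qat_fraction(d_total: float, n_params: float, bits: float,
                         a: float = 0.25, b: float = -2.5) -> float:
    """Loss-optimal QAT fraction following the log-linear form above:
    log f* = a * log(tokens-per-parameter-byte) + b.
    `a` and `b` are illustrative placeholders, not the paper's fit."""
    stat = tokens_per_parameter_byte(d_total, n_params, bits)
    f_star = math.exp(a * math.log(stat) + b)
    return min(max(f_star, 0.0), 1.0)  # a token fraction must lie in [0, 1]


# Illustrative usage: a 500M-parameter model, 100B training tokens, 4-bit QAT.
print(optimal_qat_fraction(d_total=100e9, n_params=500e6, bits=4))
```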

Unified Loss Scaling Law

A central contribution is the derivation of a unified loss scaling law for QAT, modeling final expected loss as a function of model size ($N$), FP/QAT token counts ($D_{fp}$, $D_{qat}$), and bit-width ($B$):

L(N, D_{qat}, D_{fp}, B) = \text{Chinchilla-like loss} + \text{QAT fraction-aware penalty}

The penalty term incorporates irreducible QAT error, pure QAT penalty, and FP/QAT interaction, all parameterized by tokens-per-parameter-byte. The law achieves high fit quality ($R^2 = 0.982$–$0.991$ across bit-widths) and enables:

  • Prediction of optimal QAT fraction for arbitrary compute budgets and model sizes
  • Quantification of compute waste for suboptimal QAT allocation
  • Determination of the minimal bit-width required to match FP accuracy for a given model and compute budget
  • Guidance for parameter-precision trade-offs under memory constraints
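As a purely schematic illustration of this structure (the exact parameterization is given in the paper; every functional form and coefficient below is an assumption):

```python
def chinchilla_loss(n, d, E=1.7, A=400.0, alpha=0.34, C=2000.0, beta=0.28):
    """Chinchilla-like term L = E + A*N^(-alpha) + C*D^(-beta).
    Coefficient values are placeholders, not the paper's fit."""
    return E + A * n ** (-alpha) + C * d ** (-beta)


def qat_penalty(n, d_qat, d_fp, bits, e_qat=0.05, a_qat=0.5, p_qat=0.3, a_int=0.1):
    """QAT fraction-aware penalty built from the three ingredients named above:
    an irreducible QAT error, a pure-QAT term, and an FP/QAT interaction term,
    each driven by tokens-per-parameter-byte. Forms and values are illustrative."""
    param_bytes = n * bits / 8.0
    tpb_qat = d_qat / param_bytes          # QAT tokens per parameter byte
    tpb_fp = d_fp / param_bytes            # FP tokens per parameter byte
    irreducible = e_qat / bits             # lower bit width -> larger floor (assumed)
    pure_qat = a_qat * (tpb_qat + 1e-9) ** (-p_qat)
    interaction = a_int * tpb_fp / (tpb_fp + tpb_qat + 1e-9)
    return irreducible + pure_qat + interaction


def qat_loss(n, d_qat, d_fp, bits):
    """Unified form: Chinchilla-like loss on all tokens plus the QAT penalty."""
    return chinchilla_loss(n, d_fp + d_qat) + qat_penalty(n, d_qat, d_fp, bits)
```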

Cooldown & QAT Fusion: Training Efficiency

The paper introduces a cooldown & QAT fusion technique, where learning rate decay is performed jointly with QAT rather than sequentially after FP training. This approach eliminates redundant FP updates that are effectively discarded during QAT initialization, yielding:

  • Consistent improvements in perplexity for 4- and 6-bit QAT across all model sizes and token counts
  • Compute savings quantified in "wasted tokens" units, with up to 13% savings for large models
  • Marginal improvements for 1- and 2-bit QAT, attributed to the large optimal QAT fraction in these regimes
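A minimal sketch of the fused schedule described above, assuming a linear cooldown and step-based accounting (the paper uses WSD-style scheduling and ParetoQ for the QAT stage; the decay shape and step counts here are illustrative):

```python
def fused_lr_schedule(step, total_steps, qat_fraction, peak_lr, warmup_steps=1000):
    """Cooldown & QAT fusion sketch: warmup, then a constant full-precision
    stage, then the entire learning-rate cooldown performed inside the QAT
    phase, so no full-precision cooldown updates are later discarded."""
    qat_start = int(total_steps * (1.0 - qat_fraction))
    if step < warmup_steps:                            # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < qat_start:                               # stable FP stage, no FP cooldown
        return peak_lr
    progress = (step - qat_start) / max(total_steps - qat_start, 1)
    return peak_lr * (1.0 - progress)                  # cooldown fused with QAT


def quantization_enabled(step, total_steps, qat_fraction):
    """Fake quantization (e.g., ParetoQ-style) is switched on exactly when the
    cooldown begins, i.e., for the final qat_fraction of training."""
    return step >= int(total_steps * (1.0 - qat_fraction))
```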

Implementation Details

Model and Training Setup

  • Architecture: Decoder-only transformer (Llama 2-like), SwiGLU, RoPE, RMSNorm, alternating attention/FFN, tied embeddings/LM head.
  • Optimizer: Adam with decoupled weight decay, bfloat16 mixed precision (a minimal sketch follows this list).
  • Dataset: DCLM, tokenized with Llama 2 tokenizer, concat-and-chunk batching.
  • QAT Algorithms: ParetoQ (state-of-the-art), with Elastic Binarization (1-bit), SEQ (2-bit), LSQ (≥3-bit), per-output-feature quantization scales.
  • Learning Rate Scheduling: Warmup-stable-decay (WSD) for FP, cosine decay for QAT, with fusion as described.
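A minimal PyTorch sketch of the optimizer and mixed-precision settings above; the peak learning rate and the model/batch objects are placeholders, and only the Adam hyperparameters and weight decay come from the reported setup:

```python
import torch


def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Adam with decoupled weight decay (AdamW), as in the reported setup."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,              # peak LR is an assumption; the paper uses a WSD schedule
        betas=(0.9, 0.99),    # beta_1, beta_2 from the reported setup
        eps=1e-8,
        weight_decay=0.01,    # decoupled weight decay of 0.01
    )


def training_step(model, batch, optimizer):
    """One step under bfloat16 automatic mixed precision (sketch)."""
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss   # assumes an HF-style causal LM interface
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```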

Practical Guidelines

  • Compute Allocation: Use the scaling law to determine $f^*$ for given $N$, $B$, and $D_{total}$ (see the planning sketch after this list).
  • Bit-Width Selection: For a fixed memory budget, select the lowest bit-width that matches FP accuracy, as higher bit-widths do not improve accuracy but increase memory usage.
  • Training Efficiency: Employ cooldown & QAT fusion to maximize compute utilization, especially for mid/high bit-width QAT.
  • Generalization: The scaling law and optimal fraction prediction generalize to larger models (2.2B) and different datasets (SlimPajama), with minimal sensitivity to hyperparameters.
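A hedged planning sketch combining the first two guidelines: for a fixed weight-memory budget, compare candidate bit widths using a fitted loss predictor and a QAT-fraction rule, both passed in as callables. The names and signatures are illustrative, not the paper's tooling.

```python
def plan_under_memory_budget(memory_bytes, d_total, predicted_loss, qat_fraction_fn,
                             candidate_bits=(1, 2, 4, 6)):
    """For each candidate bit width, size the model to fill the weight-memory
    budget (N = memory_bytes * 8 / B), split the token budget with the predicted
    optimal QAT fraction, and keep the bit width with the lowest predicted loss.
    `predicted_loss(n, d_fp, d_qat, bits)` and `qat_fraction_fn(d_total, n, bits)`
    stand in for the fitted scaling law."""
    best = None
    for bits in candidate_bits:
        n_params = memory_bytes * 8 / bits          # weights assumed to dominate memory
        f_star = qat_fraction_fn(d_total, n_params, bits)
        d_qat = f_star * d_total
        loss = predicted_loss(n_params, d_total - d_qat, d_qat, bits)
        if best is None or loss < best[1]:
            best = (bits, loss, n_params, f_star)
    return best  # (bit width, predicted loss, parameter count, QAT fraction)
```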

Theoretical and Practical Implications

Theoretical

  • Scaling laws for QAT are not static; they depend on compute, model size, and bit-width.
  • Unified loss scaling law enables principled resource allocation and model selection for quantized LLMs.
  • Parameter-precision trade-off analysis provides a framework for memory-constrained deployment, relevant for on-device and edge inference.

Practical

  • Suboptimal QAT allocation can result in substantial compute waste, especially for low-bit QAT.
  • Cooldown & QAT fusion should be adopted as standard practice for efficient QAT pipelines.
  • Practitioners can use the provided scaling law to plan training runs, select bit-widths, and optimize for deployment constraints.

Future Directions

  • Interaction of pretraining precision (e.g., FP8, FP4) with QAT scaling laws
  • Extension to multi-stage training pipelines (SFT, RL, multimodal) and their impact on QAT allocation
  • Generalization to other architectures and modalities

Conclusion

This work provides a rigorous framework for compute-optimal QAT in LLMs, overturning previous assumptions about fixed QAT allocation ratios and introducing a unified loss scaling law for principled resource planning. The empirical and theoretical results have direct implications for efficient training and deployment of quantized models, especially in memory- and compute-constrained environments. The proposed cooldown & QAT fusion technique further enhances training efficiency. The findings are robust across model sizes, datasets, and hyperparameters, and the methodology is extensible to future advances in low-precision training and multi-stage pipelines.


Explain it Like I'm 14

Overview

This paper is about how to train big LLMs so they still work well after being made smaller and faster. The technique the paper studies is called quantization-aware training (QAT), which means the model learns while using lower-precision numbers so it gets used to them. The main goal is to figure out how to split training time between normal high-precision training and QAT to get the best accuracy for the same amount of compute.

Key terms in simple words

  • Quantization: Storing numbers with fewer bits. Think of it like saving photos at a lower resolution to use less space.
  • Bit width: How many bits are used per number (e.g., 1, 2, 4, or 6 bits). Fewer bits = smaller, faster, but harder to stay accurate.
  • Full precision (FP): The “high-resolution” way to train a model, usually with 16 or 32 bits.
  • Tokens: Pieces of text the model reads during training (like words or word-parts).
  • Parameters: The model’s internal “knobs” it tunes to learn.
  • Loss/perplexity: Measures of how well the model predicts text. Lower is better.
  • Tokens-per-parameter-byte: A simple ratio that roughly says “how much training data the model sees per unit of its memory size.” It helps predict how much QAT you should do.
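Here is a tiny, made-up example of that ratio (the model size and token count are invented for illustration, not taken from the paper):

```python
n_params = 500e6    # a 500-million-parameter model (made-up numbers)
bits = 4            # weights stored at 4 bits each
tokens = 100e9      # trained on 100 billion tokens

model_bytes = n_params * bits / 8                 # 4-bit weights = 0.5 bytes per parameter
tokens_per_parameter_byte = tokens / model_bytes
print(tokens_per_parameter_byte)                  # 400.0 tokens per byte of model
```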

Objectives

The paper asks three practical questions:

  1. Given a fixed training budget, what fraction of training should be QAT versus full-precision to get the best accuracy?
  2. Can we predict that optimal QAT fraction for different model sizes and bit widths using a simple rule?
  3. Can we create a formula (a “scaling law”) that predicts the best training split and the final accuracy, and also helps choose the best bit width under memory limits?

Methods

The authors:

  • Trained Llama-like transformer models ranging from tens of millions to billions of parameters on large text datasets.
  • Used different total training budgets (how many tokens the models read), different QAT bit widths (1, 2, 4, 6 bits), and varied how much time they spent in FP versus QAT phases.
  • Tracked accuracy using validation perplexity (lower is better).
  • Tested a modern QAT method (ParetoQ) and standard training setups to keep comparisons fair.
  • Fit a “loss scaling law,” which is a math formula that predicts model loss based on:
    • Model size (parameters, N)
    • FP tokens (D_fp)
    • QAT tokens (D_qat)
    • Bit width (B)
  • Introduced a simple, intuitive metric—tokens-per-parameter-byte—to compare setups across different model sizes and bit widths. Imagine it as “how many text tokens per unit of model storage,” which turns out to predict how much QAT you should do.
  • Proposed a new training schedule called “cooldown & QAT fusion,” where they decay the learning rate while doing QAT, instead of finishing cooldown in full precision and only then starting QAT.

Main Findings

  • The optimal QAT fraction increases with compute:
    • Earlier advice said “use about 10% of training steps for QAT.” This paper shows that the best QAT fraction isn’t fixed—it grows with total training, especially when viewed through tokens-per-parameter-byte.
    • In simple terms: the more you train overall, the more of that training should be QAT to keep a low-precision model accurate.
  • A single formula predicts accuracy and the best split:
    • Their loss scaling law predicts both the final accuracy and the optimal QAT/FP ratio across model sizes and bit widths.
    • It matches experimental results closely, which means you can plan training ahead of time instead of guessing.
  • Choosing the right bit width under memory limits:
    • For a given memory budget (like on a phone), the formula helps pick the best bit width and model size combination.
    • As training compute increases, using lower bit widths can become optimal for the same memory budget.
  • Big savings from picking the right QAT fraction:
    • Using a “one-size-fits-all” QAT fraction (like 10%) can waste a lot of compute.
    • In extreme low-bit cases (like 1-bit), picking the optimal fraction can achieve the same accuracy with about half the compute.
  • QAT can match full-precision accuracy:
    • With enough training and the right QAT fraction, low-bit models can catch up to full-precision models.
    • Bigger models tend to tolerate lower bit widths better, especially with larger training budgets.
  • “Cooldown & QAT fusion” improves efficiency:
    • Doing learning rate cooldown inside the QAT phase avoids wasted updates and improves accuracy for the same budget.
    • This leads to noticeable compute savings across several model sizes and bit widths.

Why It Matters

  • Better on-device AI: Quantized models are smaller and faster, which is great for phones and laptops. This paper shows how to train them smarter, not just harder.
  • Less guesswork: Instead of using rules-of-thumb (like “10% QAT”), you can use the tokens-per-parameter-byte metric and the scaling law to plan training precisely.
  • Save time and money: Picking the right QAT fraction and bit width can reduce training costs and still reach full-precision quality.
  • Smarter design choices: If you have a fixed memory limit, the scaling law helps you choose the right combination of model size and bit width to get the best performance.

Simple takeaway

If you’re training a model that will run in low precision, don’t leave QAT as a tiny afterthought. As your total training grows, set aside a larger portion for QAT. Use the tokens-per-parameter-byte idea and the provided scaling law to pick the best split and bit width. And consider fusing cooldown with QAT to avoid wasting updates and improve results.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains uncertain or unexplored, to guide follow-on research:

  • Scale generalization: Experiments center on models ≤759M parameters (with a single 2.2B check in an appendix); it remains unknown whether the scaling law and optimal QAT fraction trends hold for 7B–70B+ models and mixture-of-experts architectures.
  • Architectural breadth: Results are specific to Llama 2–style, decoder-only transformers. Effects on encoder–decoder models, MoE, long-context variants, and multimodal architectures are untested.
  • Sequence length effects: All training uses sequence length 1024; the dependence of the optimal QAT fraction and the scaling law on context length (and the O(L²) attention cost) is not studied.
  • Dataset/domain generalization: Aside from DCLM and a SlimPajama replication, the robustness of the findings across domain mixtures (e.g., code, math, multilingual), data quality, and curriculum strategies is unquantified.
  • Tokenization and vocabulary: The reliance on a Llama 2 (32k) tokenizer leaves open how tokenization choices affect tokens-per-parameter-byte and the inferred optimal QAT fraction.
  • What is quantized: The paper does not clearly isolate weight-only vs weight+activation QAT; the scaling law’s validity for activation quantization (and different activation precisions) is unknown.
  • Quantization granularity (grouping): Group size G and per-channel/per-tensor schemes are not varied; the unified scaling law omits G, so the impact of granularity on optimal QAT fraction and loss is uncharacterized.
  • Mixed precision across layers: Only uniform bit-widths {1, 2, 4, 6} are explored. Optimal layerwise/matrixwise mixed-precision schedules and their effect on the scaling law are open.
  • Intermediate bit-widths and 8-bit: The continuum (e.g., 3, 5, 8 bits) and non-integer or learned precisions are not evaluated; how the law interpolates across B remains uncertain.
  • QAT algorithm dependence: All results use ParetoQ. Whether the scaling law and optimal fraction predictions transfer to other QAT methods (e.g., LSQ, QDrop, AdaRound-like STE variants) is untested.
  • Optimizer and training hyperparameters: Only Adam with specific β, weight decay, and a particular scheduler is used. Sensitivity to optimizer choice (Adafactor, Lion), gradient clipping, dropout, and weight decay is not mapped.
  • Theoretical justification: The tokens-per-parameter-byte statistic and the specific functional form in the loss model lack a principled derivation from quantization noise or SGD dynamics; only empirical fits are provided.
  • Fit stability and uncertainty: The loss fit is “initialization-sensitive” and selected from multiple random starts, but parameter confidence intervals, robustness diagnostics, and predictive uncertainty are not reported.
  • “Wasted tokens” validation: The wasted-token metric is derived from the fitted law; direct validation by training using retuned token budgets to confirm predicted efficiency gains is limited.
  • Mechanism of QAT surpassing FP: Cases where QAT outperforms FP (figure 5) are observed but not explained; understanding whether quantization acts as a regularizer, and under which regimes, remains open.
  • Compute vs token accounting: The compute duality is handled by a scalar overhead factor r without empirical hardware validation; wall-clock, FLOP, and energy measurements across GPUs/NPUs/kernels are missing.
  • Memory model fidelity: The memory–bit-width trade-off ignores metadata overheads (scales, zero-points, per-group statistics) and activation/KV cache memory; the optimal bit-width under realistic memory accounting may differ.
  • Attention to inference implications: Interactions with KV-cache quantization, activation quantization at inference, and latency/bandwidth constraints are not connected to the training-time scaling law.
  • Online scheduling: The work proposes a priori optimal fractions; online/adaptive control (adjusting QAT fraction based on validation curves or quantization error signals) is not explored.
  • Cooldown & QAT fusion scope: Fusion is shown mainly for 4–6 bits; for 1–2 bits the gains are inconsistent. Sensitivity to re-warmup length, cooldown percentage, scheduler shapes (cosine, linear, WSD variants), and ablations are missing.
  • Start-point and schedule alternatives: Only a single staging (FP then QAT) is studied; interleaved FP/QAT, gradual layerwise quantization, progressive bit-width reduction, and curriculum-based QAT timing remain unexplored.
  • Pretraining precision: Effects of training in FP8/FP4 (optimizer and activations) on the optimal QAT fraction and scaling law are not measured (noted by the authors as future work).
  • Multi-stage pipelines: The dependence of optimal QAT allocation on subsequent SFT, RLHF/DPO, safety finetuning, or multimodal finetunes is untested (also noted as future work).
  • Robustness to distribution shift: Whether the learned schedule and scaling predictions hold under domain shift between pretraining and deployment data is unknown.
  • Downstream task metrics: Evaluation focuses on perplexity; correlations with downstream task accuracy (MMLU, code benchmarks, safety/harms metrics) and whether optimal QAT fractions transfer are unassessed.
  • Reproducibility assets: While hyperparameters are described, full code, checkpoints, and fitted parameter releases (with CIs) are not indicated; this limits independent verification of the scaling law.
  • Hardware–software co-design: Kernel-level overheads of QAT ops, fused kernels, and compiler stacks (Triton, CUDA graphs) that can change r and the optimal schedule are not systematically studied.
  • Layer/component coverage: Quantization of embeddings, LayerNorm/RMSNorm, attention projections vs MLP weights may differ; how component-wise exclusion/inclusion affects the law is not analyzed.
  • Safety and alignment: Potential impacts of QAT fraction and bit-width on harmful outputs, calibration, and uncertainty are not evaluated.
  • Data scaling confounds: The law’s sensitivity to data deduplication, contamination controls, and quality filtering—known to affect scaling laws—is not disentangled here.
  • Calibration of memory–compute plot: The training FLOP model C ≈ 6ND omits attention terms and sequence-length dependence; optimal bit-width predictions in figure 6 may shift under realistic FLOP models.

Practical Applications

Immediate Applications

The following applications can be deployed now by practitioners to improve training efficiency, model quality, and deployment economics based on the paper’s findings and methods.

  • Compute-optimal QAT fraction planning for training pipelines
    • Description: Use the tokens-per-parameter-byte statistic to choose the optimal split between full-precision (FP) pretraining and QAT steps (Dfp vs. Dqat), avoiding standard fixed ratios (e.g., 10%) that waste compute.
    • Sectors: Software, cloud, AI platform providers, MLOps.
    • Tools/Workflows: A “QAT Planner” module (CLI/SDK) that takes model size N, total token budget Dtotal, and bit width B (optionally QAT overhead factor r) and returns f* and recommended learning rate schedule.
    • Assumptions/Dependencies: Availability of QAT (e.g., ParetoQ), LLM-like architectures, access to token-level accounting, reasonable data quality; fit generalizes but calibrating to local data/architectures can improve accuracy.
  • Memory-aware precision selection for deployment
    • Description: Select QAT bit width B based on memory constraints and training budget using the paper’s scaling law (figure 6); for many regimes, lower precision is optimal and matches FP accuracy for larger models.
    • Sectors: Mobile/edge (smartphones, wearables), robotics, embedded systems, healthcare devices, education apps.
    • Tools/Workflows: A “Precision Selector” tool that maps RAM budget to recommended parameter count and bit width for target quality; integrated into on-device ML deployment toolchains.
    • Assumptions/Dependencies: Accurate memory budget estimates, inference bandwidth as bottleneck, scaling law holds across similar transformer architectures; may need validation on non-LLM architectures.
  • Cooldown & QAT fusion learning rate scheduler
    • Description: Replace the classic “FP cooldown → QAT re-warmup” with a fused schedule that continues cooldown within QAT, eliminating redundant FP updates and reducing training cost while improving accuracy at 4–6 bits.
    • Sectors: Software, cloud training, enterprise AI, research labs.
    • Tools/Workflows: A “QAT-Fused Scheduler” plugin for PyTorch/Hugging Face/DeepSpeed; training scripts that resume from constant LR stage directly into QAT and decay LR jointly.
    • Assumptions/Dependencies: Works best when FP stage is non-trivial; benefits vary in 1–2 bit regimes; requires QAT resume capability from FP checkpoints and stable QAT implementation.
  • FinOps and sustainability: wasted token audit
    • Description: Quantify “wasted tokens” from sub-optimal QAT fractions and translate into compute cost and carbon savings; prioritize runs that follow optimal f* to cut training TCO.
    • Sectors: Finance (FinOps), cloud operations, sustainability reporting.
    • Tools/Workflows: Dashboards that track Dfp/Dqat vs. optimal fraction, compute cost per loss improvement, and energy per token saved; policy for internal training budget allocation.
    • Assumptions/Dependencies: Access to loss curves or fitted scaling law; consistent token accounting; carbon intensity data from cloud regions.
  • Cluster/job scheduling with iso-FLOP adjustment for QAT overhead
    • Description: Convert token-based optimal fractions to compute-aware fractions by adjusting QAT tokens (Dqat/r) when QAT overhead is non-negligible; schedule jobs to iso-FLOP budgets.
    • Sectors: Cloud/HPC, enterprise MLOps.
    • Tools/Workflows: Scheduler integration that uses r to rebalance FP/QAT steps on shared clusters; SLAs for compute-optimal training.
    • Assumptions/Dependencies: Measured QAT overhead factor r for local stack; large batch/sequence often makes QAT overhead negligible.
  • Edge product quality boosts without extra training compute
    • Description: Train quantized models that meet or beat FP quality at the same training budget by allocating more tokens to QAT as compute scales; notably impactful for 4–6-bit deployments.
    • Sectors: Mobile AI, robotics, IoT, healthcare devices, education apps.
    • Tools/Workflows: Standard training pipelines with optimized f*; deployment with activation-aware quantization and ParetoQ-style methods.
    • Assumptions/Dependencies: Model size and token budget in the regime where QAT matches FP loss; hardware supports targeted bit widths.
  • Academic experiment design and baselining
    • Description: Use the loss scaling law to plan FP/QAT token splits and precision for baselines; compare architectures and datasets with a principled, compute-aware design.
    • Sectors: Academia, industrial research.
    • Tools/Workflows: Experiment planners that instantiate Dfp/Dqat sweeps; unified reporting of optimal f* and wasted token metrics across studies.
    • Assumptions/Dependencies: LLM-like architectures; need to refit scaling law if changing model families or data quality substantially.
  • Education and training courses/labs
    • Description: Include compute-optimal QAT planning and fused schedulers in ML curricula to teach practical trade-offs between memory, precision, and compute.
    • Sectors: Education.
    • Tools/Workflows: Lab assignments where students fit a small scaling law and compare fixed vs. optimal QAT fractions.
    • Assumptions/Dependencies: Access to small-scale compute and open QAT toolchains; toy datasets.

Long-Term Applications

These applications require further research, standards setting, broader validation, or engineering investment for robust deployment.

  • AutoML-style dynamic precision and QAT fraction schedulers
    • Description: Online controllers that adjust bit width and FP/QAT fractions during training based on validation loss and memory/computation constraints.
    • Sectors: Software platforms, AutoML, cloud training services.
    • Tools/Workflows: “Adaptive QAT Orchestrator” that tunes B, Dfp/Dqat, and LR schedules mid-run.
    • Assumptions/Dependencies: Reliable online loss estimators; robust QAT across precision switches; control-theory or RL-based scheduling.
  • Multi-stage pipelines: optimal QAT across SFT/RL/multimodal training
    • Description: Extend scaling laws to supervised fine-tuning, RLHF, and multimodal phases to allocate FP/QAT compute optimally across stages.
    • Sectors: Enterprise LLMs, conversational AI, multimodal assistants.
    • Tools/Workflows: End-to-end “Stage-Aware QAT Planner” that models cumulative loss across stages with precision-aware allocations.
    • Assumptions/Dependencies: New scaling laws per stage; datasets and objectives differ; needs empirical validation.
  • Cross-architecture and modality generalization
    • Description: Validate and adapt the loss scaling law for CNNs, diffusion models, speech models, and vision-LLMs.
    • Sectors: Vision, audio, multimodal robotics, healthcare imaging.
    • Tools/Workflows: Architecture-agnostic scaling law fitters; per-domain QAT libraries.
    • Assumptions/Dependencies: Non-transformer training dynamics may alter penalties; requires domain-specific calibration.
  • Hardware–software co-design for memory/bandwidth-aware QAT
    • Description: Co-optimize accelerators, kernels, and QAT methods to target the observed memory–precision trade-offs; expose hardware features for fine-grained quantization.
    • Sectors: Semiconductors, systems software, energy-efficient AI.
    • Tools/Workflows: New instructions/kernels for fused QAT operations, activation-aware quantization, and low-overhead validation.
    • Assumptions/Dependencies: Vendor ecosystem support; firmware/driver updates; standardized quantization primitives.
  • Managed cloud services for compute-optimal QAT
    • Description: Offer “QAT-optimized training” SKUs where users specify memory, quality targets, and budgets; service picks precision and FP/QAT splits automatically.
    • Sectors: Cloud providers, enterprise IT.
    • Tools/Workflows: Service APIs that take N, Dtotal, desired perplexity, and RAM budget; automated f* and B selection with fused schedulers.
    • Assumptions/Dependencies: Provider-specific telemetry to fit scaling laws; pricing models aligned with energy savings.
  • Policy and standards for “compute waste” reporting
    • Description: Establish reporting norms that include “wasted tokens” and energy/carbon per unit loss improvement; guide procurement and sustainability goals.
    • Sectors: Public policy, ESG, corporate governance.
    • Tools/Workflows: Auditing frameworks and benchmarks; third-party certification of training efficiency.
    • Assumptions/Dependencies: Agreement on metrics and normalization across datasets/models; industry adoption.
  • Energy-optimal training schedulers
    • Description: Optimize not just FLOPs but energy per unit improvement by selecting precision and QAT fractions that minimize power draw, leveraging bandwidth considerations.
    • Sectors: Energy, sustainability, green AI initiatives.
    • Tools/Workflows: Schedulers that co-optimize DVFS settings, batch sizes, sequence lengths, and QAT overhead; energy-aware scaling law parameters.
    • Assumptions/Dependencies: Accurate energy telemetry; integration with data center power management.
  • OS-level precision management for on-device AI
    • Description: Mobile/embedded OS services that auto-select model bit width based on app memory budget and expected quality, possibly adapting over time as devices learn.
    • Sectors: Mobile, IoT, consumer electronics.
    • Tools/Workflows: OS APIs where apps specify targets; system selects model variant and precision; background QAT updates.
    • Assumptions/Dependencies: On-device training or fine-tuning capabilities; privacy constraints; energy limits.
  • Precision-aware pretraining research (FP8/FP4, mixed-precision)
    • Description: Study how pretraining in low precision interacts with optimal QAT fraction and final loss; design new scaling laws integrating training precision.
    • Sectors: Academia, advanced R&D.
    • Tools/Workflows: Experimental frameworks to compare FP16/BF16 vs. FP8/FP4 training plus QAT; model families beyond Llama-like transformers.
    • Assumptions/Dependencies: Stable low-precision optimizers; mixed-precision kernels; dataset quality sensitivity.
  • Turnkey “QAT scaling law estimator” products
    • Description: Vendor tools that run a small number of calibration trainings to fit local scaling law parameters and then recommend compute-optimal QAT plans for customer models.
    • Sectors: AI tooling vendors, enterprise ML platforms.
    • Tools/Workflows: Auto-calibration pipelines; parameterized loss models; integration with experiment tracking.
    • Assumptions/Dependencies: Representative calibration runs; reproducible training; willingness to adopt model-based planning.

Notes on Global Assumptions and Dependencies

  • Results are strongest for LLM-style transformer architectures and may require calibration for other model families or tasks.
  • Data quality and domain shift affect scaling laws; refitting to local datasets improves predictions.
  • QAT method choice matters (e.g., ParetoQ); implementations must support resume-from-FP and stable low-bit operation.
  • Token count is a proxy for compute; when QAT overhead is notable, use iso-FLOP adjustments (Dqat/r).
  • Benefits increase with larger compute budgets and are most pronounced for lower bit widths (1–4), though fused scheduling showed consistent gains for 4–6 bits.
  • Memory/bandwidth bottlenecks dominate real-world inference; precision selection should consider deployment hardware characteristics.

Glossary

  • Adam optimizer: An adaptive stochastic gradient method that uses first and second moment estimates to update parameters. "We use the Adam optimizer (β₁ = 0.9, β₂ = 0.99, ε = 10⁻⁸) with decoupled weight decay of 0.01"
  • automatic mixed precision: A training technique that performs parts of computation in lower precision to improve speed and memory efficiency while maintaining accuracy. "All experiments are trained with bfloat16 automatic mixed precision (Liu et al., 2021)."
  • bfloat16: A 16-bit floating-point format with an 8-bit exponent, commonly used to accelerate training while preserving numerical range. "All experiments are trained with bfloat16 automatic mixed precision (Liu et al., 2021)."
  • Chinchilla loss model: A scaling law that predicts LLM loss as a function of parameter count and data tokens. "The Chinchilla (Hoffmann et al., 2022b) loss model is one of the most commonly used: L(N, D) = E + A N^{-α} + C D^{-β},"
  • Chinchilla-like loss: A component of the proposed loss function that adopts the Chinchilla-style dependency on parameters and tokens. "Chinchilla-like loss"
  • coefficient of determination (R2): A metric indicating how well a model fits data, representing the proportion of variance explained. "R2 is the coefficient of determination."
  • cooldown & QAT fusion: A training scheme that applies learning rate decay jointly with quantization-aware training to remove redundant FP updates and save compute. "We propose a novel approach: cooldown & QAT fusion-a scheme where learning rate decay is performed jointly with quantization-aware training, eliminating redundant full-precision updates and achieving better accuracy for the same token count."
  • decoder-only transformer: A Transformer architecture that generates outputs autoregressively using only the decoder stack. "We use a decoder-only transformer (Zhang et al., 2021) identical to Llama 2 (Touvron et al., 2023)."
  • DCLM dataset: A large-scale language-modeling dataset from DataComp-LM used for training and evaluation. "Training is conducted on the DCLM dataset (Li et al., 2024)"
  • iso-flop levels: Configurations with equal total compute cost (FLOPs), used to compare training strategies fairly. "D_fp + D'_qat = const represents not iso-token levels, but rather iso-flop levels."
  • learning rate cooldown: A phase where the learning rate decays to refine the model near convergence. "Currently, a classic way of training models is to perform full FP training with learning rate cooldown, and then start QAT with learning rate re-warmup."
  • LLMs (large language models): High-parameter neural networks trained on vast text corpora for language tasks. "As LLMs grow in size and on-device applications gain traction (Wahab & Adda, 2025), significant attention has been devoted to reducing inference costs via model compression"
  • loss scaling law: An empirical formula that predicts final model loss from parameters, token counts, and precision settings. "From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths."
  • non-embedding FLOPs: The number of floating-point operations excluding the embedding layer, used in scaling analyses. "fitting accuracy as a function of used non-embedding FLOPs (FLOP estimation of model inference without embedding layer calculations), showing that such an approach works better across different model sizes."
  • ParetoQ: A quantization-aware training method that combines techniques to achieve high accuracy at extremely low bit widths. "For QAT algorithms, we rely on ParetoQ (Liu et al., 2025) for our setups, as this method achieves state-of-the-art accuracy across different bit widths by combining different approaches."
  • perplexity: A standard metric for evaluating LLMs that measures uncertainty in predicting tokens. "The dataset is split into training and validation sets, and validation perplexity is used for evaluation."
  • post-training quantization (PTQ): Applying quantization after training without further updates, typically with minimal extra compute. "It has been shown that QAT outperforms post-training quantization (PTQ) (Xiao et al., 2023; Banner et al., 2019), where quantization is applied after training is completed."
  • quantization-aware training (QAT): Training that incorporates quantization operations to let the model adapt to reduced precision. "Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks."
  • quantization granularity: The size of groups quantized together (e.g., per-channel or per-group), affecting quantization error. "where G is the quantization granularity (number of elements in each quantization group)"
  • RMSNorm: Root Mean Square Layer Normalization—a normalization method using RMS statistics instead of mean/variance. "The architecture incorporates SwiGLU activations (Shazeer, 2020), RoPE (Su et al., 2024), RMSNorm (Zhang & Sennrich, 2019), alternating attention and feed-forward layers, and tied embedding and language-modeling head weights."
  • RoPE: Rotary Positional Embeddings—a technique to encode positional information via rotations in attention. "The architecture incorporates SwiGLU activations (Shazeer, 2020), RoPE (Su et al., 2024), RMSNorm (Zhang & Sennrich, 2019), alternating attention and feed-forward layers, and tied embedding and language-modeling head weights."
  • straight-through estimator: A gradient approximation that treats non-differentiable operations as identity during backpropagation. "As quantization operations are non-differentiable, training relies on gradient approximations such as the straight-through estimator (Bengio et al., 2013)."
  • SwiGLU: Swish-Gated Linear Unit—an activation function variant that improves Transformer performance. "The architecture incorporates SwiGLU activations (Shazeer, 2020), RoPE (Su et al., 2024), RMSNorm (Zhang & Sennrich, 2019), alternating attention and feed-forward layers, and tied embedding and language-modeling head weights."
  • tied embedding: Sharing parameters between input embeddings and the output language-modeling head to reduce memory and improve alignment. "tied embedding and language-modeling head weights."
  • tokens-per-parameter-byte: A statistic defined as total training tokens divided by model parameter bytes, used to predict optimal QAT allocation. "the tokens-per-parameter-byte statistic."
  • wasted token count: The effective fraction of tokens unnecessarily spent due to sub-optimal QAT/FP allocation. "Wasted token count is the number of tokens that are effectively wasted by a sub-optimal QAT fraction."
  • WSD: Warmup-Stable-Decay learning rate schedule, featuring an initial warmup, a stable phase, and a decay phase. "Wen et al. (2024) show that re-initializing WSD from a post-cooldown checkpoint rather than from a constant stage yields better results."