
Learnable Multipliers in ML & Optimization

Updated 18 February 2026
  • Learnable multipliers (LRMs) are trainable scaling factors integrated into linear transformations to dynamically adjust weights, activations, and dual variables for improved model performance.
  • They decouple scaling from fixed initialization and regularization, enabling adaptive learning, enhanced expressivity, and efficient parameter utilization in both optimization and deep learning tasks.
  • LRMs have been successfully applied in mixed-integer programming dualization, language model matrix adaptation, and low-rank quantization, demonstrating reduced hyperparameter sensitivity and improved empirical benchmarks.

Learnable multipliers (LRMs) are trainable scaling factors (scalars, vectors, or low-rank matrices) introduced multiplicatively into linear transformations within optimization and deep learning pipelines. By decoupling the scaling of weights, activations, or dual variables from parameter initialization and regularization equilibria, LRMs enable adaptation to the statistical properties of data, compressive quantization, or dual bound tightening. Approaches based on LRMs have been proposed in optimization (e.g., mixed-integer programming dualization), LLM matrix-scale adaptation, and low-rank quantization for neural networks. These methodologies demonstrate diverse but coherent principles: amortization of expensive iterative routines, parameter efficiency, enhanced expressivity, and reduced hyperparameter sensitivity.

1. Theoretical Foundations and Motivation

In deep learning, weight matrices trained with explicit weight decay (WD) and standard optimizers such as AdamW or Muon settle, due to the equilibrium between gradient noise and WD, at norms determined by the learning rate $\eta$ and decay coefficient $\lambda$, as $\|W\|\propto\sqrt{\eta/\lambda}$. This stationarity is agnostic to data, potentially impeding optimal feature learning. In optimization, specifically Lagrangian relaxation for MILPs, dual variable selection involves complex, non-smooth optimization, with computational cost scaling in the number of relaxed constraints. In post-training quantization of LLMs, direct per-parameter scale learning for quantized weights provides maximal flexibility, but at a daunting parameter count and with a risk of overfitting.
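
The stationary-norm claim can be checked with a toy simulation. The sketch below (arbitrary $\eta$, $\lambda$, and unit-variance gradient noise, not any paper's training setup) runs SGD with decoupled weight decay on pure-noise gradients; the resulting RMS weight matches the predicted $\sqrt{\eta/(2\lambda)}$ equilibrium, independent of any data signal.

```python
import numpy as np

# Simulate SGD with decoupled weight decay on pure gradient noise:
#   w <- (1 - eta*lam) * w - eta * g,   g ~ N(0, 1)
# Stationary variance: v = (1 - eta*lam)^2 * v + eta^2  =>  v ~ eta / (2*lam),
# so the RMS weight scales as sqrt(eta/lam), regardless of the data.
rng = np.random.default_rng(0)
eta, lam = 1e-2, 1e-1            # learning rate and weight-decay coefficient
w = np.zeros(20_000)             # many independent coordinates for a tight estimate

for _ in range(5_000):           # ~10x the mixing time 1/(2*eta*lam) = 500 steps
    g = rng.standard_normal(w.shape)
    w = (1.0 - eta * lam) * w - eta * g

rms = np.sqrt(np.mean(w ** 2))
predicted = np.sqrt(eta / (2.0 * lam))
print(f"empirical RMS = {rms:.4f}, predicted = {predicted:.4f}")
```

Under AdamW's normalized updates the constant differs, but the $\sqrt{\eta/\lambda}$ dependence is the same; LRMs exist precisely to free the learned scale from this data-independent fixed point.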

Introducing LRMs in these contexts enables the data-driven learning of scale, bypasses the inflexibility imposed by fixed equilibria or combinatorial search, and facilitates more parameter-efficient representations that maintain or improve empirical performance (Demelas et al., 2023, Lee et al., 2024, Velikanov et al., 8 Jan 2026).

2. Implementation in Optimization: Predicting Lagrangian Multipliers

For mixed-integer linear programs (MILPs), tight Lagrangian bounds are typically sought by iteratively optimizing the dual variables (multipliers) $\pi$ associated with relaxed constraints. This process is slow and expensive as problem size scales. "Predicting Accurate Lagrangian Multipliers for Mixed Integer Linear Programs" (Demelas et al., 2023) proposes a learnable model that predicts $\pi$ directly:

  • Architecture: MILP instances are encoded as bipartite graphs (variable and constraint nodes) with initial features (e.g., objective coefficients, solution statistics, dual values). A stacked graph convolutional network (GCN) produces embeddings for each dualized constraint.
  • Latent Decoding: Each constraint embedding is projected to Gaussian parameters $(\mu_c, \sigma_c)$, from which a latent $z_c$ is sampled per constraint and decoded via an MLP to a predicted multiplier, forming $\hat\pi = \bar\lambda + \delta$ (where $\bar\lambda$ is the continuous-relaxation dual).
  • Objective: The network is trained end-to-end to maximize the bound $d(\hat\pi)$, i.e., directly optimizing the tightness of the predicted Lagrangian relaxation and bypassing iterative dual optimization.
  • Empirical Results: The approach closes up to 85% of the optimal Lagrangian gap over the continuous relaxation, delivers dual multipliers that are effective warm starts (reducing bundle solver iteration count and wall-clock time by up to 2×), and is robust to instance size variation (see Section 4 below).
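
For intuition on the training objective, note that once $\pi$ is fixed, the bound $d(\pi)$ is cheap to evaluate: with the coupling constraints dualized, the inner minimization decomposes per variable. The sketch below (a toy binary instance with made-up data, not the paper's GCN pipeline) evaluates $d(\pi)$ for a small problem $\min c^\top x$ s.t. $Ax \ge b$, $x \in \{0,1\}^n$.

```python
import numpy as np

def lagrangian_bound(c, A, b, pi):
    """Evaluate d(pi) = min_{x in {0,1}^n} c@x + pi@(b - A@x), for pi >= 0.

    With A@x >= b dualized, the inner problem decomposes: each x_j
    independently minimizes its reduced cost (c - A.T@pi)_j, so
    x_j = 1 exactly when that reduced cost is negative.
    """
    reduced = c - A.T @ pi                       # per-variable reduced costs
    return pi @ b + np.minimum(reduced, 0.0).sum()

c = np.array([1.0, 2.0, 3.0])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

# Any pi >= 0 yields a valid lower bound on the MILP optimum; a learned
# model predicts pi so as to make this bound as tight as possible.
d = lagrangian_bound(c, A, b, np.array([0.5, 0.5]))
print(d)  # 1.0
```

Maximizing this piecewise-linear, concave $d(\pi)$ is what bundle or subgradient solvers iterate on; the learned model amortizes that loop into a single forward pass.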

3. Scale Adaptation in Neural Network Pretraining

The "Learnable Multipliers: Freeing the Scale of LLM Matrix Layers" framework (Velikanov et al., 8 Jan 2026) extends the principle to LLM training, addressing the suboptimality of fixed scaling arising from noise–WD equilibrium:

  • Scalar LRMs: Each weight matrix $W$ is replaced in the forward pass by $\overline{W} = sW$, with $s\in\mathbb{R}$ a learnable scalar.
  • Vector LRMs: Finer adaptation is enabled via per-row ($r\in\mathbb{R}^{d_\mathrm{out}}$) and per-column ($c\in\mathbb{R}^{d_\mathrm{in}}$) multipliers: $\overline{W}_{ij} = r_i W_{ij} c_j$. This reparametrization frees not only the global norm but also row and column feature norms.
  • Gradient Properties: LRMs are updated via gradients that average over entire rows/columns, leading to lower stochasticity and higher signal-to-noise for the scale parameters.
  • Comparison to $\mu$P: Whereas maximal-update parametrization ($\mu$P) prescribes fixed, width-dependent multipliers determined by theory and empirical tuning, LRMs learn these scales from data, simplifying or eliminating expensive multiplier hyperparameter sweeps.
  • Empirical Manifestation: LRMs automatically recover sensible scaling with width/depth, increase feature norm diversity, and deliver statistically significant improvements (1–1.2 percentage points on LLM benchmarks) across both AdamW and Muon optimizers.
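
The scalar and vector reparametrizations above are one-liners in practice. The sketch below (numpy, hypothetical small shapes) shows the forward pass $\overline{W}_{ij} = r_i W_{ij} c_j$ and its equivalent diagonal-matrix form; in a real implementation the multipliers would be trainable parameters alongside a frozen-norm $W$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 4, 3
W = rng.standard_normal((d_out, d_in))   # weight matrix at its noise-WD equilibrium norm
r = rng.uniform(0.5, 2.0, size=d_out)    # learnable per-row multipliers
c = rng.uniform(0.5, 2.0, size=d_in)     # learnable per-column multipliers
s = 1.5                                  # learnable scalar multiplier

W_scalar = s * W                          # scalar LRM: W_bar = s * W
W_vector = r[:, None] * W * c[None, :]    # vector LRM: W_bar_ij = r_i * W_ij * c_j

# The vector form equals diag(r) @ W @ diag(c); in the forward pass it can be
# fused into the activations rather than materializing W_bar:
x = rng.standard_normal(d_in)
y = W_vector @ x                          # identical to r * (W @ (c * x))
assert np.allclose(y, r * (W @ (c * x)))
```

Because the gradient of the loss with respect to $r_i$ (or $c_j$) aggregates contributions from an entire row (or column), the multiplier gradients average out per-weight noise, consistent with the signal-to-noise point above.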

4. Low-Rank Learnable Multipliers in Quantization

In post-training quantization (PTQ) for LLMs, "LRQ: Optimizing Post-Training Quantization for LLMs by Learning Low-Rank Weight-Scaling Matrices" (Lee et al., 2024) introduces low-rank learnable scaling matrices for elementwise rescaling of quantized weights:

  • Formulation: The full $n\times m$ scaling matrix $S$ for a block is parameterized as $S = \exp(UV^\top + u\mathbf{1}^\top + \mathbf{1}v^\top)$ (exponential taken elementwise), with $U\in\mathbb{R}^{n\times r}$, $V\in\mathbb{R}^{m\times r}$, $u\in\mathbb{R}^n$, $v\in\mathbb{R}^m$.
  • Pipeline: For each pre-trained block, calibration data is used to minimize the reconstruction error between full-precision outputs and those produced by quantized weights rescaled by SS. The quantizer is differentiable via a straight-through gradient estimator.
  • Parameter Efficiency: The low-rank structure ($r\ll n,m$) provides a favorable tradeoff between expressivity and overfitting, with typical $r\approx 1024$ for $4096\times4096$ blocks. Full-matrix scale variants achieve similar reconstruction with 30–50% more learnable parameters and significant additional memory cost.
  • Results: LRQ with LRMs achieves near-full-precision accuracy under both 8-bit and aggressive 4-bit quantization, outperforming or matching prior blockwise reconstruction methods (e.g., FlexRound, SmoothQuant) while reducing peak GPU memory footprint during quantization.
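
The parameterization can be sketched directly: the elementwise exponential of a rank-$(r{+}2)$ matrix guarantees strictly positive scales from only $(n+m)(r+1)$ parameters. The toy code below uses small hypothetical shapes and a plain round-to-grid quantizer as a stand-in for LRQ's actual quantization scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 16, 12, 2                       # toy shapes; LRQ uses r ~ 1024 at 4096x4096
U = 0.01 * rng.standard_normal((n, r))
V = 0.01 * rng.standard_normal((m, r))
u = 0.01 * rng.standard_normal(n)
v = 0.01 * rng.standard_normal(m)

# S = exp(U V^T + u 1^T + 1 v^T), elementwise: always positive, and log(S)
# has rank at most r + 2, so far fewer than n*m scale parameters are learned.
S = np.exp(U @ V.T + u[:, None] + v[None, :])

W = rng.standard_normal((n, m))
step = 0.1                                # quantization grid (stand-in quantizer)
W_q = S * np.round((W / S) / step) * step # rescale, quantize, undo the rescale

err = np.abs(W - W_q).max()
print(f"max elementwise reconstruction error: {err:.4f}")
```

In LRQ itself the rounding is made trainable with a straight-through estimator and $U, V, u, v$ are fit by minimizing blockwise output reconstruction error on calibration data; the sketch only illustrates the scale structure.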

5. Empirical Validation and Comparative Benchmarks

Extensive validation in all cited domains demonstrates the utility of LRMs:

| Domain | Baseline | LRM-based Approach | Relative Performance (as reported) |
|---|---|---|---|
| MILP dual gap (GAP-CR) | Subgradient/bundle | LRM GCN | Closes up to 85% of gap; halves solve time (Demelas et al., 2023) |
| LLM pretraining | AdamW / $\mu$P | AdamW+LRM / Muon+LRM | +1.21 / +1.10 points avg. on suite (Velikanov et al., 8 Jan 2026) |
| PTQ (LLaMA, CSR, MMLU) | FlexRound/SQ | LRQ | Closes gap to <1.5%; memory savings (Lee et al., 2024) |

Salient observations include:

  • Removing or subsampling CR-duals in MILP LRM GCN inputs degrades bound closure (e.g., from ~78% to ~15%).
  • Per-row/per-column LRMs in LLMs improve amplitude diversity across depth and width, correcting for the inflexibility of fixed noise–WD equilibrium.
  • In quantization, low-rank learnable scales ameliorate overfitting, outperforming full-scale matrices with lower calibration sample requirements, and are effective under both weight-only and weight+activation regimes.

6. Practical and Methodological Considerations

  • Initialization and Regularization: In LLMs, multipliers may require low regularization (e.g., $\lambda_\mathrm{LRM}=2\times10^{-3}$) to prevent numeric drift due to architectural scaling symmetries. For quantization, rank and calibration set size are hyperparameters that control the balance between learning capacity and generalization.
  • Integration: LRMs can be deployed via reparameterization (forward pass over $\overline{W}$), with careful gradient clipping to avoid adverse scaling between $\ell_2$-clipped weights and multipliers.
  • Symmetry and Instability: Multiplicative scaling symmetries (e.g., between Q/K pairs in Transformer attention or LayerNorm) can cause unbounded parameter drift if unconstrained. Monitoring or constraining these is necessary for stable deployment.
  • Resource Efficiency: For LRQ quantization, the overhead versus full-matrix scaling is moderate (a <2% increase, with lower peak memory). In LLMs, tuning hyperparameters for $\mu$P coefficients is largely obviated by directly learnable multipliers.
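
The symmetry point is easy to see concretely: in attention, multiplying the query projection by $\alpha$ and the key projection by $1/\alpha$ leaves every logit unchanged, so multipliers on a Q/K pair can drift along this direction without affecting the loss. A minimal numpy check (hypothetical shapes, softmax omitted since it acts only on the logits):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_head, seq = 8, 4, 5
X = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_head, d_model))
Wk = rng.standard_normal((d_head, d_model))

def attn_logits(Wq, Wk, X):
    # logits = (X Wq^T)(X Wk^T)^T, the pre-softmax attention scores
    return (X @ Wq.T) @ (X @ Wk.T).T

alpha = 7.3                                # arbitrary drift along the symmetry
base = attn_logits(Wq, Wk, X)
drifted = attn_logits(alpha * Wq, Wk / alpha, X)

# Identical logits despite rescaled parameters: the loss cannot see this
# direction, so mild decay or monitoring of the multipliers is needed.
assert np.allclose(base, drifted)
```

The same flat direction exists between a LayerNorm gain and the multiplier of the layer that consumes it, which is why unconstrained LRMs can drift without bound.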

7. Limitations, Extensions, and Future Directions

  • Limitations: The efficacy of LRM-based models is dependent on the representativeness of training samples, choice of relaxed constraints (in MILP), or architectural substructure. Very large or structure-specific problems may require further architectural innovations.
  • Extensions: In MILP dualization, potential directions include extending learnable multipliers to Dantzig–Wolfe relaxation and integrating multipliers into branch-and-bound node selection or cutting-plane mechanisms (Demelas et al., 2023). For LLMs, embedding richer attention mechanisms or joint constraint relaxation selection with LRMs could enhance robustness. In quantization, further sparsification or hierarchical low-rank decompositions might reconcile resource use and accuracy under more extreme regimes.

Learnable multipliers represent a unified approach to scale-adaptation through differentiable, data-dependent parametrizations. By amortizing traditionally expensive optimization procedures, reducing manual hyperparameter search, and maintaining or improving empirical benchmarks in their respective settings, LRMs demonstrate broad utility across optimization, neural network training, and model compression. Recent work has validated these benefits for MILPs (Demelas et al., 2023), LLM pretraining (Velikanov et al., 8 Jan 2026), and quantization (Lee et al., 2024), highlighting LRMs as a critical methodological development in data-driven large-scale learning and inference.
