
MoR: Mixture of Representations

Updated 4 January 2026
  • MoR is a dynamic, property-aware quantization framework that selects optimal precision formats per tensor to minimize quantization error.
  • It employs group amax mantissa (GAM) scaling and error metrics to adaptively assign low-precision formats, achieving over 98% FP8 utilization.
  • Empirical results on large language models show MoR maintains model quality while improving efficiency, with minimal accuracy loss compared to full-precision baselines.

The Mixture-of-Representations (MoR) framework is a dynamic, property-aware quantization strategy for deep neural network training that adaptively selects among multiple floating-point representations (notably FP8 and BF16) for different tensor regions at runtime. MoR addresses critical limitations of conventional mixed-precision schemes: it monitors the actual distributional and numerical properties of each tensor (or sub-tensor) and chooses the minimal-precision format sufficient to bound the empirical quantization error below a set threshold, thereby achieving aggressive low-precision utilization (over 98% of tensors in FP8) without loss of model quality. This enables both maximal efficiency and robustness for large-scale models, and presents an alternative to brittle, hand-engineered scaling or block-partitioning recipes (Su et al., 28 Dec 2025).

1. Conceptual Foundations: Dynamic Quantization and Mixed Precision

Traditional mixed-precision computational pipelines often statically assign high- and low-precision formats (e.g., FP8, BF16) based on tensor type or block partition, possibly fine-tuned per layer. However, in such schemes, tensor-level heterogeneity and intra-tensor variance are not exploited; each block is treated uniformly regardless of its numerical characteristics. MoR replaces this with an adaptive, runtime decision process:

  • At each training step, each tensor (or block) is examined, and simple statistics are computed: the absolute max (amax), relative quantization error histograms or moments under hypothetical quantization, and the dynamic range $R = \max|x| / \min_{x \ne 0}|x|$ (a minimal sketch of these statistics follows this list).
  • Based on these statistics, MoR determines, per block or entire tensor, whether quantization to a highly efficient format (e.g., FP8 E4M3) suffices without exceeding an empirical error threshold, or whether it must fall back to a less aggressive format (e.g., BF16 or the intermediate E5M2) to maintain fidelity.
  • This per-(sub)tensor process ensures that the vast majority of arithmetic and memory accesses can exploit the highest efficiency offered by hardware (FP8), while only a negligible minority incur higher-cost fallback (Su et al., 28 Dec 2025).
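The following is a minimal sketch of such a statistics pass, assuming a simplified round-to-nearest model of E4M3 (3 mantissa bits, saturation at ±448) rather than the exact hardware conversion; the function names and the amax-based scaling step are illustrative, not the paper's implementation.

```python
import numpy as np

E4M3_MAX = 448.0        # largest finite E4M3 magnitude
E4M3_MANT_BITS = 3      # mantissa bits of E4M3

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Round to a simplified E4M3 grid: 3-bit mantissa, saturate at +/-448.
    Subnormal behaviour is approximated, which suffices for an error estimate."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    out = np.zeros_like(x)
    nz = np.abs(x) > 0
    step = np.exp2(np.floor(np.log2(np.abs(x[nz]))) - E4M3_MANT_BITS)
    out[nz] = np.sign(x[nz]) * np.round(np.abs(x[nz]) / step) * step
    return out

def block_stats(x: np.ndarray) -> dict:
    """Per-block statistics of the kind MoR inspects before picking a format."""
    amax = float(np.max(np.abs(x)))
    nonzero = np.abs(x[x != 0])
    dyn_range = float(amax / nonzero.min()) if nonzero.size else 1.0
    # mean relative error under hypothetical, amax-scaled E4M3 quantization
    scale = E4M3_MAX / amax
    xq = fake_quant_e4m3(x * scale) / scale
    rel_err = np.abs(xq[x != 0] - x[x != 0]) / nonzero
    return {"amax": amax, "dynamic_range": dyn_range,
            "mean_rel_err_e4m3": float(rel_err.mean())}

if __name__ == "__main__":
    block = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
    print(block_stats(block))
```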

2. MoR Algorithmic Framework and Decision Metrics

The MoR workflow is concretized through specific algorithms and invariants:

  • Group Amax Mantissa (GAM) Scaling: Before quantization, MoR employs GAM to align scaling factors across groups and blocks. For each group $g$ of blocks $B_g$, a group amax $g_{\mathrm{amax}} = \max_{x \in g}|x|$ is computed, and a high-precision group mantissa $m_g$ is extracted. Each block computes its own amax and ideal scale, compares mantissas, and selects the exponent conservatively to prevent overflow or clipping. The group mantissa is shared and per-block exponents are stored (a hedged sketch of this decomposition appears after this list); the overall scaling is formalized as

$$\text{scale} = \frac{q_{\mathrm{amax}}}{\max|x|}\,, \qquad q_{\mathrm{amax}} = 2^{e-1}-1$$

where $e$ is the exponent bit count for the FP format.

  • Quantization Selection: Once scales are set, quantization proceeds by scanning a prioritized list of representations (typically [E4M3, E5M2, BF16]) for each block $b$; sketches of these rules also follow this list:
    • For a candidate format $T_i$, a metric $M_i(b, \text{metadata})$ (e.g., mean relative quantization error) is evaluated.
    • If $M_i$ is below the threshold, that format is chosen; otherwise, the process proceeds to the next, less aggressive format.
    • For two-way E4M3/BF16 MoR, the selection rule is: quantize to E4M3 if the mean relative error is $< 4.5\%$; otherwise, use BF16.
    • Three-way MoR (E4M3/E5M2/BF16) checks whether the E4M3 error is lower than the E5M2 error; if not, whether the dynamic range fits within E5M2; and otherwise falls back to BF16 (Su et al., 28 Dec 2025).
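The following is a hedged sketch of the GAM scaling step, assuming the scale is split into a mantissa/exponent pair with np.frexp and that a block's exponent is lowered by one whenever reusing the shared group mantissa would exceed that block's ideal scale; the decomposition details and function names are assumptions of this sketch, not the paper's exact recipe.

```python
import numpy as np

E4M3_AMAX = 448.0   # target amax of the destination format

def gam_scales(blocks):
    """Return (shared group mantissa, per-block exponents).

    The group keeps one high-precision mantissa; each block stores only an
    exponent, chosen conservatively so that amax * scale never exceeds the
    format's representable maximum (no overflow/clipping)."""
    group_amax = max(np.max(np.abs(b)) for b in blocks)
    group_mantissa, _ = np.frexp(E4M3_AMAX / group_amax)   # mantissa in [0.5, 1)
    exponents = []
    for b in blocks:
        ideal = E4M3_AMAX / np.max(np.abs(b))   # block's own ideal scale
        m, e = np.frexp(ideal)
        # reusing a larger shared mantissa would overshoot the ideal scale,
        # so step the exponent down by one in that case
        if group_mantissa > m:
            e -= 1
        exponents.append(int(e))
    return float(group_mantissa), exponents

def block_scale(group_mantissa, exponent):
    """Reconstruct a block's scale from the shared mantissa and its exponent."""
    return group_mantissa * 2.0 ** exponent

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = [rng.standard_normal(256) * s for s in (0.1, 1.0, 8.0)]
    gm, exps = gam_scales(blocks)
    for b, e in zip(blocks, exps):
        # no clipping past q_amax (up to float rounding)
        assert np.max(np.abs(b)) * block_scale(gm, e) <= E4M3_AMAX * (1 + 1e-9)
```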
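And here is a minimal sketch of the two-way and three-way selection rules themselves, assuming the per-block metrics (mean relative error per candidate format and dynamic range) have already been measured, e.g. with the statistics sketch in Section 1. The 4.5% threshold is the E4M3 bound quoted above; the E5M2 range limit uses the ±[2^{-16}, 57344] coverage listed in Section 3; the function names are illustrative.

```python
E4M3_ERR_THRESHOLD = 0.045                 # 4.5% mean relative error bound
E5M2_DYNAMIC_RANGE = 57344.0 / 2.0**-16    # max / min positive E5M2 magnitude

def select_two_way(e4m3_mean_rel_err: float) -> str:
    """Two-way MoR: keep the block in E4M3 while its error stays bounded."""
    return "E4M3" if e4m3_mean_rel_err < E4M3_ERR_THRESHOLD else "BF16"

def select_three_way(e4m3_err: float, e5m2_err: float, dynamic_range: float) -> str:
    """Three-way MoR: prefer E4M3 when it beats E5M2 on error; otherwise use
    E5M2 only if the block's dynamic range fits its coverage; else BF16."""
    if e4m3_err <= e5m2_err:
        return "E4M3"
    if dynamic_range <= E5M2_DYNAMIC_RANGE:
        return "E5M2"
    return "BF16"

# e.g. a block measured at 3% E4M3 error stays in FP8; 6% falls back to BF16
assert select_two_way(0.03) == "E4M3"
assert select_two_way(0.06) == "BF16"
```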

3. Precision Formats and Error-Range Trade-Offs

MoR’s utility is predicated upon the efficiency and limits of FP8 and BF16 representations:

| Format | Sign bits | Exponent bits | Mantissa bits | Typical range | $\varepsilon$ (machine) |
|--------|-----------|---------------|---------------|---------------|-------------------------|
| E4M3   | 1 | 4 | 3 | $\pm[2^{-9}, 448]$ | $2^{-3} = 0.125$ (12.5%) |
| E5M2   | 1 | 5 | 2 | $\pm[2^{-16}, 57344]$ | $2^{-2} = 0.25$ (25%) |
| BF16   | 1 | 8 | 7 | $\pm[2^{-126}, 2^{127}]$ | $2^{-7} \approx 0.0078$ (0.78%) |
  • FP8 offers a dramatic speedup (throughput and memory bandwidth) as compared to BF16, but at much greater quantization-induced error risk and narrower coverage of rare, extreme-magnitude activations.
  • MoR exploits empirical (rather than pessimistic) error criteria: for uniformly distributed rounding, the expected quantization error is $\approx \varepsilon/2$. Mean or histogram metrics, not worst-case analysis, drive blockwise assignment (Su et al., 28 Dec 2025). A quick numeric check of these figures follows this list.
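The snippet below is only illustrative arithmetic for the table's machine-epsilon column and the ε/2 expectation quoted above; the "expected" figure assumes uniformly distributed rounding residuals, as stated.

```python
# machine epsilon follows directly from the mantissa width of each format
MANTISSA_BITS = {"E4M3": 3, "E5M2": 2, "BF16": 7}

for fmt, m in MANTISSA_BITS.items():
    eps = 2.0 ** -m        # spacing of the mantissa grid at unit scale
    print(f"{fmt}: eps = {eps:.4f} ({eps:.2%}), expected error ~ eps/2 = {eps / 2:.2%}")
# E4M3: eps = 0.1250 (12.50%), expected error ~ eps/2 = 6.25%
# E5M2: eps = 0.2500 (25.00%), expected error ~ eps/2 = 12.50%
# BF16: eps = 0.0078 (0.78%), expected error ~ eps/2 = 0.39%
```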

4. Empirical Results: Utilization and Model Quality

Experiments with MoR on LLMs, specifically Nemotron-3 8B, demonstrate the framework’s capacity to maximize hardware efficiency:

  • Using per-channel partitioning with two-way E4M3/BF16 MoR, 98.38% of all linear-layer tensors were quantized to E4M3, with only 1.62% requiring BF16.
  • Training and validation loss, parameter norm trajectories, and downstream benchmarks (e.g., MMLU, WinoGrande, PIQA) tracked within 0.5–1% of full-precision BF16 baselines.
  • The MMLU 5-shot score improved from 62.56% (BF16) to 63.30% (MoR), confirming that aggressive FP8 assignment via MoR is compatible with quality parity or improvement.
  • Blockwise error histograms confirmed that nearly all blocks respect the 4.5% E4M3 mean relative error bound, with only a thin failure tail requiring fallback (Su et al., 28 Dec 2025).

5. Granularity, Partitioning, and Robustness

Unlike traditional mixed-precision/fine-grained scaling recipes, which require highly subdivided blocks (e.g., micro-blocks, channel or 128×128 partitions) to guarantee fidelity, MoR demonstrates that:

  • Coarse partitioning suffices when combined with actual error measurement.
  • By monitoring and bounding error directly, MoR dynamically chooses when to coarsen or refine block granularity only as strictly necessary, often minimizing fragmentation, complexity, and overhead (a sketch of such error-driven refinement follows this list).
  • This design automates trade-offs formerly requiring extensive engineering and manual profiling (Su et al., 28 Dec 2025).
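As a hedged illustration of this error-driven choice of granularity, the sketch below starts from a coarse block and splits it only where the measured error exceeds the bound; the recursive halving strategy, minimum block size, and function names are assumptions of this sketch rather than the paper's partitioning scheme.

```python
import numpy as np

def partition_by_error(block, err_fn, threshold=0.045, min_size=32):
    """Return sub-blocks, refined only where err_fn exceeds the threshold."""
    if block.size <= min_size or err_fn(block) < threshold:
        return [block]                    # coarse granularity already suffices
    mid = block.size // 2                 # otherwise split and retest each half
    return (partition_by_error(block[:mid], err_fn, threshold, min_size) +
            partition_by_error(block[mid:], err_fn, threshold, min_size))

if __name__ == "__main__":
    block = np.random.default_rng(0).standard_normal(2048)
    # placeholder metric; in MoR this would be the measured mean relative
    # quantization error of the block under the candidate format
    toy_err = lambda b: 0.06 if b.size > 512 else 0.03
    print(len(partition_by_error(block, toy_err)))   # -> 4 blocks of 512
```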

6. Extensibility and Lower-Precision Integrations

MoR generalizes to formats below FP8, such as NVFP4 (NVIDIA 4-bit float), through the same error-based schema:

  • New formats (e.g., NVFP4) are added at the head of the prioritized list, with the decision metric $M_i$ constructed to reflect when dynamic range and rounding error are tolerable (a sketch of such a prioritized scan follows this list).
  • Simulation checks (computing empirical relative error on sampled blocks) allow NVFP4 exploitation with a safety net to revert.
  • This architecture is extensible to further reduced-precision innovations, provided empirical invariants on error are tightly enforced (Su et al., 28 Dec 2025).
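The following sketches how such a prioritized list might be extended with a lower-precision entry like NVFP4, assuming the per-format error metrics have already been measured on sampled blocks; the registry layout, the 10% NVFP4 threshold, and the function name are assumptions of this sketch, while the 4.5% E4M3 bound is the one quoted in Section 2.

```python
def scan_formats(metric_values: dict, priority: list, fallback: str = "BF16") -> str:
    """Pick the first (most aggressive) format whose measured error metric
    stays under its threshold; otherwise fall back to the safe format."""
    for fmt, threshold in priority:
        if metric_values.get(fmt, float("inf")) < threshold:
            return fmt
    return fallback

# a new format is simply prepended to the priority list with its own bound
PRIORITY = [
    ("NVFP4", 0.10),   # hypothetical NVFP4 error bound (assumption)
    ("E4M3", 0.045),   # 4.5% bound from the two-way rule
]

# a block whose sampled NVFP4 error is too high but whose E4M3 error is fine
print(scan_formats({"NVFP4": 0.22, "E4M3": 0.03}, PRIORITY))   # -> E4M3
```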

MoR offers a unified, hardware- and data-driven paradigm distinct from:

  • Static assignment of precision per layer/tensor without runtime adaptation.
  • Blocksize or scaling heuristics that ignore actual quantization error incurred.
  • Fine-grained approaches emphasizing universally minimized partition size, regardless of context.

By dynamically selecting numeric format at runtime per block or tensor and leveraging targeted metrics, MoR maximizes low-precision bandwidth and computational efficiency while preserving convergence and accuracy. This reduces the need for architecture-specific recipes and extensive fine-tuning, and robustly leverages advances in hardware support for FP8 and beyond (Su et al., 28 Dec 2025).
