
AXLearn: Scalable Deep Learning System

Updated 9 July 2025
  • AXLearn is a production deep learning system offering scalable, high-performance training, with key strengths in modularity and heterogeneous hardware support.
  • It minimizes code complexity by strictly encapsulating internal interfaces and using hierarchical configuration objects, ensuring constant LoC-complexity during feature integration.
  • AXLearn achieves competitive training throughput and efficient inference through advanced distributed strategies and tailored hardware-specific optimizations.

AXLearn is a production deep learning system designed to support scalable high-performance training of large models, with a unique emphasis on modularity and native operation across heterogeneous hardware infrastructure. Developed with strict encapsulation of internal interfaces, AXLearn facilitates rapid model development and experimentation, minimizing code complexity for integrating new model variants or deploying across different devices. The system introduces a formal metric—Lines-of-Code (LoC)-complexity—to quantify modularity and demonstrates constant integration complexity as the number of modules scales. AXLearn achieves parity or better performance relative to other state-of-the-art frameworks while providing robust operational and developmental tooling (2507.05411).

1. Modularity and Code Complexity

AXLearn’s core architectural principle is strict encapsulation. Instead of inheritance-based subtyping, which propagates changes throughout a codebase, AXLearn relies on composition with hierarchically defined configuration objects. Each model component exposes a Python-based configuration, and top-level trainers assemble models by recursively composing these configs.
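
As a rough illustration of this pattern (the class and field names below are hypothetical, not AXLearn’s actual API), a trainer config can be assembled from nested component configs, and an experiment variant overrides only the leaves it cares about:

from dataclasses import dataclass, field

# Hypothetical nested configs (illustrative names, not AXLearn's actual API).
@dataclass
class AttentionConfig:
    num_heads: int = 16
    head_dim: int = 64

@dataclass
class LayerConfig:
    attention: AttentionConfig = field(default_factory=AttentionConfig)
    hidden_dim: int = 4096

@dataclass
class ModelConfig:
    num_layers: int = 32
    layer: LayerConfig = field(default_factory=LayerConfig)

@dataclass
class TrainerConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    learning_rate: float = 1e-4

# The trainer composes the full model recursively; a variant only overrides
# the leaf values it needs, leaving parent interfaces untouched.
trainer_cfg = TrainerConfig()
trainer_cfg.model.layer.attention.num_heads = 32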

This yields significant practical advantages:

  • Constant LoC-complexity: The number of lines of code that must be modified to integrate a new feature or component, referred to as LoC-complexity, remains asymptotically constant (O(1)) even as the number of module variants increases. Typical frameworks incur O(N) or O(NM) LoC changes for N modules and M variants when integrating features such as Rotary Position Embedding (RoPE) or Mixture-of-Experts (MoE).
  • Example: RoPE or MoE functionality can be integrated across hundreds of modules using a ~10-line helper snippet. This injection can be reused across over a thousand experiments, all without interface changes in the parent modules.

This modularity is made possible by composition and encapsulation rather than inheritance chains, with a configuration “traversal” method to replace specified configs in the component tree:

# Replace every FeedForwardLayer config in the trainer's component tree with an MoE variant.
replace_config(trainer_cfg, target=FeedForwardLayer, new_cfg=MoELayer.default_config().set(...))
Such approaches are central to enabling rapid experimentation and large-scale model evolution in a codebase serving both research and production.

2. Heterogeneous Infrastructure Support

AXLearn is built natively atop JAX, leveraging XLA for compilation and GSPMD for distributed parallelism. This foundation enables seamless compatibility across a range of accelerators, including GPU (e.g., H100), TPU (e.g., v5e, v5p), and AWS Trainium.

Key infrastructure abstractions include:

  • Mesh rules: Configuration-based platform targeting, where device-specific defaults (such as INT8 on certain TPUs or FP8 on NVIDIA H100s) are automatically set according to the current hardware mesh. For example, a mesh rule can enforce rematerialization and quantization on a TPU-v5e mesh and 8-way parallelism on H100s, without code changes in the model logic (a simplified sketch follows this list).
  • Custom kernel and memory policies: The framework supports hand-tuned kernel integration, such as using FlashAttention, to maximize resource utilization on each device class.
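
A minimal sketch of how such mesh rules might be expressed, assuming a simple pattern-to-overrides mapping (the rule syntax and instance names here are illustrative, not AXLearn’s actual configuration format):

import re

# Illustrative mesh rules: a hardware pattern selects platform-specific defaults,
# leaving model code untouched (names and syntax are hypothetical).
mesh_rules = (
    ("tpu-v5litepod-.*", dict(remat_policy="full", quantization="int8")),  # TPU-v5e defaults
    ("gpu-h100-.*", dict(quantization="fp8", model_axis_size=8)),          # H100 defaults
)

def select_mesh_config(instance_type: str) -> dict:
    # Return the overrides of the first rule whose pattern matches the target hardware.
    for pattern, overrides in mesh_rules:
        if re.fullmatch(pattern, instance_type):
            return overrides
    return {}

print(select_mesh_config("gpu-h100-8"))  # {'quantization': 'fp8', 'model_axis_size': 8}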

AXLearn thus enables practitioners to select target infrastructure based on cost, performance, or operational reliability, supporting mixed cloud and on-premises workloads without architectural refactoring.

3. System Performance and Benchmarking

Despite the focus on modularity and extensibility, AXLearn demonstrates no measurable compromise in training or inference efficiency compared to prior leading systems. Empirical results show:

  • Training throughput: On 32 × H100-8 GPU nodes, AXLearn achieves iteration times of 1.4 seconds, closely matching Megatron-LM and MaxText and outperforming PyTorch FSDP. On a TPU-v5p-512 configuration, AXLearn reports an iteration time of 2.5 seconds and a Model FLOPS Utilization (MFU) of 66.2%, outperforming MaxText on this hardware, while PyTorch XLA FSDP fails with out-of-memory errors.
  • Inference efficiency: For inference on TPU, AXLearn delivers a time-to-first-token (TTFT) up to 500× faster than vLLM and a lower time-per-output-token (TPOT).

Benchmarks confirm that AXLearn’s modular architecture does not entail a cost in computational efficiency, making it suitable for both research prototyping and production-scale training.

4. Feature Integration and Model Variant Injection

Integration of new functionality, such as RoPE or MoE layers, is implemented through configuration tree modification rather than direct codebase modification. Because every module maintains strict encapsulation and exposes publicly modifiable configs, the injection of an experimental variant or architectural change is lightweight—typically confined to a small helper function or a config modifier.

  • Injection example: To enable MoE in place of standard feed-forward layers, a single traversal replaces all relevant configs with their MoE-augmented variants across the model tree. This requires an order of magnitude fewer code changes than comparable systems and supports the maintenance of “golden configurations” for experiment reproducibility (a simplified sketch follows this list).
  • Line-of-code savings: Real-world estimates indicate that features such as RoPE or MoE, which would require hundreds or thousands of lines of code in some frameworks, can be integrated with roughly 10 lines in AXLearn, even at production scale.
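
The config-tree replacement described above could be implemented along these lines; this is a simplified sketch assuming dataclass-based configs (the helper and config class names are illustrative, not AXLearn’s actual mechanism):

import dataclasses

def replace_config(cfg, target_type, new_cfg):
    # Recursively walk a nested dataclass config tree, swapping every sub-config
    # of target_type for new_cfg (sketch only; ignores lists/dicts of configs).
    if isinstance(cfg, target_type):
        return new_cfg
    if dataclasses.is_dataclass(cfg) and not isinstance(cfg, type):
        updates = {
            f.name: replace_config(getattr(cfg, f.name), target_type, new_cfg)
            for f in dataclasses.fields(cfg)
        }
        return dataclasses.replace(cfg, **updates)
    return cfg

# Usage (hypothetical config classes): swap feed-forward configs for MoE everywhere.
# trainer_cfg = replace_config(trainer_cfg, FeedForwardConfig, MoEConfig(num_experts=8))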

This design ensures that experimental advances or customizations can be incorporated by research and engineering teams efficiently and with low risk of interface drift.

5. Parallelization Strategies and Practical Scaling

AXLearn employs advanced distributed strategies to efficiently scale models across large, heterogeneous clusters:

  • 3D Sharding: The framework adapts the three-dimensional sharding strategy used in MoE LLM training (2405.15052), partitioning computation along Data, Expert, and Model axes and allowing efficient scaling to high parameter counts with balanced throughput and communication overhead (a mesh sketch follows this list).
  • Platform-specific kernel selection: The system selects and tunes kernels automatically based on the execution environment, further improving training efficiency and device utilization.
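
For concreteness, a three-axis mesh of this shape can be declared with JAX’s sharding API; the axis sizes below are arbitrary placeholders and assume eight local devices:

import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Sketch of a data/expert/model device mesh (2x2x2 here; real sizes come from mesh rules).
devices = np.array(jax.devices()).reshape(2, 2, 2)  # assumes 8 devices are available
mesh = Mesh(devices, axis_names=("data", "expert", "model"))

# Batches shard along the data axis; expert weights shard along expert and model axes.
batch_sharding = NamedSharding(mesh, PartitionSpec("data", None))
expert_weight_sharding = NamedSharding(mesh, PartitionSpec("expert", None, "model"))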

This approach enables practical deployment of large-scale models, including sparse-activated (MoE) architectures and dense architectures, on both single-platform and multi-platform clusters.

6. Operational Experience and System Design Philosophy

AXLearn’s design and evolution are informed by real-world experience, including:

  • Functional programming paradigm: The transition from an imperative PyTorch foundation to a JAX/XLA functional ecosystem required innovations around state passing. The InvocationContext abstraction manages parameter passing, PRNG key splitting, and output aggregation, hiding state management complexity from the model author.
  • Debugging and AOT compilation: Ahead-of-time compilation is leveraged to surface compilation and memory issues early in the experiment design lifecycle, improving development speed and debugging productivity (a minimal example follows this list).
  • Production operations: AXLearn supports over 10,000 concurrent experiments, executing on heterogeneous clusters, serving models to more than a billion end users. Stability is ensured through “golden config” files and robust checkpointing/failure recovery, reducing configuration drift and improving traceability.
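
As an example of the kind of workflow this enables, JAX’s ahead-of-time lowering API can compile a step function and report its device-memory footprint before any training run starts (train_step below is a trivial stand-in, not AXLearn’s trainer):

import jax
import jax.numpy as jnp

def train_step(params, batch):
    # Trivial stand-in for a real training step.
    loss_fn = lambda p: jnp.mean((p * batch) ** 2)
    return params - 1e-3 * jax.grad(loss_fn)(params)

params = jnp.ones((1024, 1024))
batch = jnp.ones((1024, 1024))

lowered = jax.jit(train_step).lower(params, batch)  # trace and lower without executing
compiled = lowered.compile()                         # XLA compilation happens here
print(compiled.memory_analysis())                    # estimated device memory use (backend-dependent)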

The operational framework supports rapid iteration and large-scale experimentation, providing robustness required in production training environments while retaining flexibility for research innovation.

7. Impact, Scope, and Extensibility

AXLearn’s combination of modularity, scalable parallelism, and heterogeneous deployment makes it suitable for a broad range of production and research workloads. Its design enables both rapid feature experimentation and high-throughput model training with the following key properties:

  • Extensibility at scale: Minimal code changes for new modules or experiment variants, even as the ecosystem of supported models grows.
  • Hardware independence: Seamless migration and scaling across disparate hardware with consistent user APIs and configurations.
  • Maintained efficiency: Matching or exceeding the speed and efficiency of specialized, less modular frameworks.

The system has enabled rapid adoption and sustained operation across large-scale deployments, confirming the practical benefits of its design philosophy (2507.05411).


AXLearn represents an advanced approach to large-scale model training infrastructure, combining formally quantified modularity (LoC-complexity), cross-hardware operability, and production performance. Its architecture sets a reference standard for extensibility and maintainability in deep learning system design.
