Astra-Sim Backend: Distributed Training Simulation
- Astra-Sim Backend is a simulation framework that models distributed training systems using a tri-layer architecture combining workload, system, and network/memory layers.
- It employs a graph-based execution engine and analytical network models to simulate diverse parallelization strategies and multi-dimensional topologies at scale.
- The backend enables rapid performance analysis for hardware-software co-design, memory disaggregation, and in-network collective operations in deep learning research.
The Astra-Sim backend constitutes the central infrastructure for simulating large-scale distributed training systems with arbitrary model-parallel and data-parallel strategies, heterogeneous network hierarchies, and advanced memory subsystem models. Organized into a tri-layer architecture—Workload, System, and Network/Memory—it couples a graph-based training execution engine with analytical network models and extensible memory interfaces, enabling rapid, first-order design-space exploration at 1,000+ node scale. The backend supports simulation of state-of-the-art parallelization methods, multi-dimensional network topologies, disaggregated memory pools, and in-network collective operations, enabling system-level analysis and optimization without the cost of packet-level simulation (Won et al., 2023).
1. Simulation Core Architecture
The backend’s architecture is composed of modular components that abstract the diverse elements of distributed training stacks. The Workload Layer employs a graph-based execution engine, ingesting "execution traces" (ETs) in JSON format, where each ET is a DAG whose nodes represent compute, memory, or communication operations. Each simulated NPU receives a distinct ET, reflecting potentially unique parallelization schedules.
A lightweight, event-driven scheduler at each NPU discovers ready nodes (with all dependencies satisfied), assigns them to the appropriate sub-model (compute, memory, or network), and tracks completion via callback-based notifications. This per-node, event-driven approach enables efficient simulation of complex training execution patterns across arbitrarily partitioned models or pipelined micro-batches.
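To make this dispatch loop concrete, the following minimal sketch (illustrative names throughout—NpuScheduler, issue, and the stub sub-model are not the backend's actual classes) tracks unmet dependency counts per node and issues ready nodes to the matching sub-model, which reports completion via a callback:

```python
from collections import deque

class InstantSubModel:
    """Trivial stand-in for a compute/memory/network sub-model: completes immediately."""
    def issue(self, node, callback):
        callback(node)

class NpuScheduler:
    """Sketch of a per-NPU event-driven dispatcher over an ET DAG (names are illustrative)."""

    def __init__(self, et_nodes, sub_models):
        self.nodes = {n["id"]: n for n in et_nodes}
        self.pending = {n["id"]: len(n["parents"]) for n in et_nodes}  # unmet dependencies
        self.sub_models = sub_models
        self.ready = deque(i for i, c in self.pending.items() if c == 0)

    def step(self):
        """Issue every currently ready node to the sub-model matching its type."""
        while self.ready:
            node = self.nodes[self.ready.popleft()]
            self.sub_models[node["type"]].issue(node, callback=self.on_complete)

    def on_complete(self, node):
        """Completion callback: children whose parents have all finished become ready."""
        for child in node["children"]:
            self.pending[child] -= 1
            if self.pending[child] == 0:
                self.ready.append(child)
        self.step()

# Tiny demo: a three-node chain, each operation completing instantly.
et = [
    {"id": 0, "type": "compute", "parents": [], "children": [1]},
    {"id": 1, "type": "comm",    "parents": [0], "children": [2]},
    {"id": 2, "type": "memory",  "parents": [1], "children": []},
]
stub = InstantSubModel()
NpuScheduler(et, {"compute": stub, "memory": stub, "comm": stub}).step()
```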
The System Layer integrates a collective operation scheduler and runtime support. It decomposes collectives (e.g., All-Reduce, Reduce-Scatter, All-Gather, All-to-All) into topology-aware sub-algorithms (ring, halving-doubling, direct) per network dimension. The "Themis" greedy scheduler dynamically reorders communication phases to balance per-link utilization. This layer exposes APIs for scheduling, sending, and receiving across the analytical network back-end, as well as memory access primitives for both local and disaggregated memory.
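A condensed sketch of this decomposition is shown below (function names are illustrative, not the System Layer API): an All-Reduce expands into per-dimension Reduce-Scatter and All-Gather phases, and a greedy pass in the spirit of the Themis scheduler orders runnable phases by the current load of the dimension they use.

```python
def decompose_all_reduce(dims, algos):
    """Expand one All-Reduce into per-dimension phases: reduce-scatter ascending
    the dimensions, then all-gather descending (one phase per dimension)."""
    up = [("reduce_scatter", i, algos[i]) for i in range(len(dims))]
    down = [("all_gather", i, algos[i]) for i in reversed(range(len(dims)))]
    return up + down

def themis_like_order(runnable_phases, dim_load):
    """Greedy sketch in the spirit of Themis: among phases whose dependencies are
    already satisfied, prefer the phase using the least-loaded dimension."""
    return sorted(runnable_phases, key=lambda phase: dim_load[phase[1]])

# Example: 4-D topology with one sub-algorithm chosen per dimension.
phases = decompose_all_reduce(
    dims=[4, 2, 8, 2],
    algos=["ring", "direct", "halving_doubling", "ring"],
)
```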
The Network and Memory backend is fully analytical: it employs closed-form equations rather than packet-accurate simulation, and it generates multi-dimensional topologies from specified primitives with parameterized per-link bandwidth and latency. The memory model flexibly encompasses local HBM, remote memory pools, and in-network operations.
2. Graph-Based Training Loop Mechanism
The core simulation mechanism relies on execution traces that encode the control and data dependencies of distributed training. Each ET node contains fields describing the operation type (one of "compute", "memory", or "comm"), resource requirements, communication peers, and dependency relationships. An example ET node (here a communication node; compute and memory nodes use the same schema with the corresponding cost fields) is:

```json
{
  "id": 42,
  "type": "comm",
  "flops": 1.2e9,
  "bytes": 268435456,
  "comm_kind": "AllReduce",
  "peers": [0, 1, 2, 3],
  "parents": [40, 41],
  "children": [43, 44]
}
```
During simulation, each NPU maintains a ready queue, issuing nodes to sub-models adhering to the annotated operation type. The resolution of callbacks (e.g., sim_schedule, mem_access) triggers readiness updates for dependent nodes. As a result, arbitrary hybrid parallelisms—including strategies such as FlexFlow, ZeRO-3D, or complex pipeline parallelism—can be simulated as ET graph transformations without altering core engine logic. Micro-batches, pipeline bubbles, and staggered inter-device collectives are directly mapped onto the ET structure via node scheduling and "peers" annotations.
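For instance, a two-stage pipeline dependency between devices can be encoded purely through parents/children edges and "peers" annotations (hypothetical node contents; the "SendRecv" kind and field values below are illustrative, not taken verbatim from the backend schema):

```python
# Hypothetical ET fragment: a forward pass on NPU 0 feeds a point-to-point send
# to NPU 1, whose own forward pass depends on the matching receive.  The pipeline
# "bubble" emerges from these dependency edges, not from special engine logic.
et_npu0 = [
    {"id": 0, "type": "compute", "flops": 5.0e9, "parents": [], "children": [1]},
    {"id": 1, "type": "comm", "comm_kind": "SendRecv", "peers": [1],   # to NPU 1
     "bytes": 64 << 20, "parents": [0], "children": []},
]
et_npu1 = [
    {"id": 0, "type": "comm", "comm_kind": "SendRecv", "peers": [0],   # from NPU 0
     "bytes": 64 << 20, "parents": [], "children": [1]},
    {"id": 1, "type": "compute", "flops": 5.0e9, "parents": [0], "children": []},
]
```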
3. Multi-Dimensional Topology Generation and Analytical Network Modeling
A key feature is the dimension-agnostic topology generator, which defines device interconnect structures using "shape strings." For example:
```
Ring(4)_Switch(2)_FC(8)_Ring(2)
```
- Dim 1: Groups of 4 NPUs, ring-connected, bandwidth BW₁, latency L₁.
- Dim 2: Between each group, connect via a 2-port switch, BW₂, L₂.
- Dim 3: Fully-connected groups of size 8, BW₃, L₃.
- Dim 4: Within each final block, subgroups of 2 connected in a ring, BW₄, L₄.
Device placement and link generation proceed by grouping NPUs along each dimension, then connecting within groups based on the primitive (ring, fully-connected, switch), as formalized in the backend’s build_topology algorithm.
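A minimal sketch of this two-step procedure is shown below. The helper names (parse_shape, ring_edges, fc_edges) are assumptions for illustration; only build_topology is named by the backend itself, and the Switch primitive's star wiring is omitted.

```python
import re

def parse_shape(shape):
    """Parse a shape string such as 'Ring(4)_Switch(2)_FC(8)_Ring(2)' into
    (primitive, group size) pairs, one per dimension."""
    return [(m.group(1), int(m.group(2)))
            for m in re.finditer(r"([A-Za-z]+)\((\d+)\)", shape)]

def ring_edges(group):
    """Ring primitive: each NPU links to its successor within the group."""
    return [(group[i], group[(i + 1) % len(group)]) for i in range(len(group))]

def fc_edges(group):
    """Fully-connected primitive: every NPU pair within the group is linked."""
    return [(a, b) for i, a in enumerate(group) for b in group[i + 1:]]

def build_topology(shape):
    """Dimension-wise link generation: NPUs differing only in one dimension's index
    form a group and are wired according to that dimension's primitive."""
    dims = parse_shape(shape)
    total = 1
    for _, size in dims:
        total *= size
    edges, stride = {}, 1
    for dim, (prim, size) in enumerate(dims):
        groups = [[base + offset + k * stride for k in range(size)]
                  for base in range(0, total, stride * size)
                  for offset in range(stride)]
        gen = {"Ring": ring_edges, "FC": fc_edges}.get(prim)
        # Switch groups attach to a per-group switch node instead; omitted in this sketch.
        edges[f"Dim{dim + 1}"] = [e for g in groups for e in gen(g)] if gen else []
        stride *= size
    return edges

links = build_topology("Ring(4)_Switch(2)_FC(8)_Ring(2)")
```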
Closed-form latency and bandwidth equations model network dynamics. For a data chunk of $S$ bytes traversing $h$ hops, each with link bandwidth $BW$ and per-hop latency $L$:

$$T_{\text{chunk}} = h \cdot L + \frac{S}{BW}$$

For a hierarchical All-Reduce over $d$ dimensions with total data $D$:

$$T_{\text{AllReduce}} = \sum_{i=1}^{d} \left( 2\,\frac{n_i - 1}{n_i} \cdot \frac{D}{BW_i \prod_{j<i} n_j} + (n_i - 1)\,L_i \right)$$

where $n_i$ is the fan-out at dimension $i$, and $BW_i$, $L_i$ are that dimension's link bandwidth and latency.
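These closed-form expressions translate directly into a few lines of Python; the function names below are illustrative, not the backend's API, and the link parameters are hypothetical per-dimension values.

```python
def chunk_time(size_bytes, hops, bw, lat):
    """Closed-form chunk delay: per-hop latency plus serialization over the link."""
    return hops * lat + size_bytes / bw

def hierarchical_all_reduce_time(total_bytes, fanouts, bws, lats):
    """Sum of per-dimension ring All-Reduce costs; the data handled at dimension i
    shrinks by the product of fan-outs already reduce-scattered below it."""
    t, remaining = 0.0, float(total_bytes)
    for n, bw, lat in zip(fanouts, bws, lats):
        t += 2 * (n - 1) / n * remaining / bw + (n - 1) * lat
        remaining /= n
    return t

# Example: 256 MiB All-Reduce over a 4-D topology with illustrative link parameters.
t = hierarchical_all_reduce_time(
    256 * 2**20,
    fanouts=[4, 2, 8, 2],
    bws=[200e9, 100e9, 400e9, 50e9],
    lats=[80e-9, 120e-9, 50e-9, 150e-9],
)
```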
4. Memory System Modeling and Hierarchical Disaggregation
The memory subsystem supports both high-bandwidth local memory and remote/disaggregated pools, with in-switch collective operations. The local (HBM) access model for an access of $S$ bytes is:

$$T_{\text{local}} = L_{\text{HBM}} + \frac{S}{BW_{\text{HBM}}}$$
The remote memory path is hierarchically staged. Defining $G$ (remote memory groups), $W$ (out-node switches), $g$ (GPUs per node), and data chunk size $C$:

- Number of stages: $N_{\text{stages}}$, set by the hierarchy levels traversed between the GPU and the remote pool (GPU → out-node switch → remote-group switch → memory pool).
- Transfer per stage: $T_i = L_i + \dfrac{C}{BW_i / k_i}$, where $k_i$ is the number of endpoints contending for stage $i$'s links ($g$ GPUs per out-node switch, $W$ switches per group, $G$ groups per pool).
- Total remote access time: $T_{\text{remote}} = \sum_{i=1}^{N_{\text{stages}}} T_i$.
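Under the staged formulation above, a minimal sketch (assuming each stage behaves like a network hop whose bandwidth is shared among the endpoints feeding it; all names and parameter values are illustrative) is:

```python
def local_access_time(size_bytes, hbm_bw=3.2e12, hbm_lat=120e-9):
    """Local HBM access: fixed latency plus serialization over HBM bandwidth."""
    return hbm_lat + size_bytes / hbm_bw

def remote_access_time(chunk_bytes, stages):
    """Remote/disaggregated access as a sum of hierarchy stages.

    stages: (bandwidth, latency, sharers) per level, e.g. GPU -> out-node switch,
    switch -> remote group, group -> memory pool; each stage's bandwidth is
    divided among the endpoints contending for it.
    """
    return sum(lat + chunk_bytes / (bw / sharers) for bw, lat, sharers in stages)

# Example: 64 MiB chunk over a 3-stage remote path (illustrative parameters).
t_remote = remote_access_time(
    64 * 2**20,
    stages=[(400e9, 100e-9, 8),   # g = 8 GPUs share the out-node switch link
            (200e9, 300e-9, 2),   # W = 2 out-node switches share the group link
            (800e9, 500e-9, 4)],  # G = 4 remote groups share the pool
)
```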
In-switch collective communication reuses pipelined-stage equations with adjustments to per-link load, reflecting the detailed in-network reduction points.
5. Configuration, API, and Extensibility
The backend exposes a YAML-centric configuration interface for network, memory, and workload description. Topology, bandwidth, and latency are specified per dimension. Example configuration:
```yaml
network:
  shape: "Ring(4)_Switch(2)_FC(8)_Ring(2)"
  link_params:
    Dim1: {bw: 200e9, lat: 80e-9}
    Dim2: {bw: 100e9, lat: 120e-9}
    Dim3: {bw: 400e9, lat: 50e-9}
    Dim4: {bw: 50e9, lat: 150e-9}
```
Memory configurations encompass both local and remote parameters, with options for chunk size, bandwidth between hierarchy levels, switch collectives, and related factors. Parallelization parameters (data and model parallelism factors, micro-batching) and ET file locations are also YAML-configurable.
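An illustrative memory and workload section might look like the following; the key names here are assumptions for illustration, not the backend's exact schema:

```yaml
memory:
  local:  {bw: 3.2e12, lat: 120e-9}            # per-NPU HBM
  remote: {chunk: 64e6, bw: 400e9, lat: 500e-9}
  in_switch_collectives: true
workload:
  et_dir: "traces/example_run/"
  data_parallel: 16
  model_parallel: 8
  micro_batches: 4
```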
API usage is streamlined for both batch and programmatic execution. Example:
```python
from astra.simulation import Simulator, load_config

cfg = load_config("my_config.yaml")
sim = Simulator(cfg)
sim.run(steps=1_000)
stats = sim.get_stats()
```
The codebase is organized into /frontend (graph engine), /network (analytical model), and /memory subfolders.
6. Customization and Extension Mechanisms
Astra-Sim supports extension to new hardware topologies and memory models. Adding a new network primitive such as Mesh(X,Y) involves extending the topology parser, implementing mesh_edges generation, and registering associated collective algorithms. Network congestion or oversubscription scenarios can be modeled by overriding chunk-delay equations with utilization-aware formulations. Support for novel memory technologies (e.g., optical interconnects) requires subclassing the MemoryModel and providing an access time equation, then wiring the new model into the YAML parsing layer.
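For example, a Mesh(X,Y) edge generator and a custom memory model could be sketched as follows; the base-class interface and registration hooks are assumptions about the extension points, not the exact class signatures.

```python
def mesh_edges(group, x, y):
    """Mesh(X,Y) primitive sketch: NPUs in the group are laid out on an x-by-y
    grid and linked to their right and bottom neighbors."""
    assert len(group) == x * y
    edges = []
    for r in range(y):
        for c in range(x):
            idx = r * x + c
            if c + 1 < x:
                edges.append((group[idx], group[idx + 1]))   # horizontal link
            if r + 1 < y:
                edges.append((group[idx], group[idx + x]))   # vertical link
    return edges

class OpticalMemoryModel:
    """Sketch of a custom memory model for an optical interconnect; in the real
    backend this would subclass MemoryModel and be registered in the YAML parser."""

    def __init__(self, bw, switching_lat):
        self.bw, self.switching_lat = bw, switching_lat

    def access_time(self, size_bytes):
        # Assumed cost model: one-time optical circuit-switching latency plus
        # serialization at the optical link bandwidth.
        return self.switching_lat + size_bytes / self.bw
```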
The PyTorch ExecutionGraphObserver front end captures per-GPU ETs using hooks into torch.autograd, with conversion scripts normalizing to the backend schema. This design decouples front-end workload capture from backend simulation, allowing experimentation with a broad spectrum of scheduling and hardware scenarios.
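A minimal capture sketch, assuming the current torch.profiler.ExecutionTraceObserver API (the successor to ExecutionGraphObserver), wraps one training step and writes a raw per-GPU trace for later conversion to the backend's ET schema:

```python
import torch
from torch.profiler import ExecutionTraceObserver

# Tiny stand-in workload so the capture sketch is self-contained.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch = torch.randn(32, 1024)

observer = ExecutionTraceObserver()
observer.register_callback("pytorch_et_rank0.json")  # raw per-GPU trace
observer.start()

loss = model(batch).sum()
loss.backward()
optimizer.step()

observer.stop()
observer.unregister_callback()
# A separate conversion script then normalizes this raw trace into the backend's ET schema.
```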
7. Significance for Distributed Deep Learning Research
By combining graph-based execution, analytical multi-dimensional topologies, and flexible memory modeling, the Astra-Sim backend enables scalable and rapid exploration of system design permutations for large-model distributed training. System designers obtain actionable, first-order performance estimates—balancing NPUs, memory subsystems, and interconnects—without full system deployment. This facilitates both hardware-software co-design and methodical evaluation of emerging collective algorithms, memory disaggregation techniques, and novel network fabrics. The backend’s extensibility permits adoption for new research directions, including custom topologies, in-network computation, and alternative parallelization strategies (Won et al., 2023).