
Multi-DC Optical Networks

Updated 30 December 2025
  • Multi-Datacenter Optical Networks are high-capacity fiber systems connecting geographically dispersed datacenters to support distributed machine learning with pipeline-parallel training.
  • The CBA framework dynamically adjusts frequency slot allocation and employs MILP-based scheduling, achieving roughly a 31% reduction in iteration time over first-fit baselines.
  • Experimental evaluations on the NSFNET topology confirm that real-time resource adaptation and contiguity-aware path selection significantly reduce bubble ratios and blocking probabilities.

Multi-Datacenter Optical Networks constitute the physical and algorithmic foundation for distributed machine learning training that spans geographically separated datacenters interconnected via high-capacity optical fiber networks. These systems are increasingly critical for scaling large language model (LLM) and deep neural network (DNN) training where hardware resources in a single facility are insufficient. Multi-DC optical networks introduce novel challenges in resource assignment, communication scheduling, and system optimization, necessitating frameworks that co-design pipeline-parallel training algorithms with real-time network state awareness, latency estimation, and traffic engineering. Below, key principles, frameworks, and results from recent advances such as CBA ("Communication-Bound-Aware Cross-Domain Resource Assignment for Pipeline-Parallel Distributed LLM Training in Dynamic Multi-DC Optical Networks" (Fu et al., 23 Dec 2025)) are summarized in detail alongside representative approaches.

1. Distributed Training over Multi-DC Optical Network Topologies

Multi-DC optical networks are typically abstracted as a graph $G = (V, E)$, where $V$ represents individual datacenters (DCs) and $E$ the fiber links, each supporting $W$ frequency slots (e.g., $W = 80$ slots of $12.5$ GHz each on the NSFNET topology (Fu et al., 23 Dec 2025)). Each link $e$ maintains a binary frequency-slot occupancy vector $s_e[1..W]$ at time $t$. In pipeline-parallel (PP) distributed LLM training, $L$ layers are partitioned into $P$ stages $P_0, \dots, P_{P-1}$, with each stage mapped to a GPU and the stages often spread across multiple DCs. The $M$ micro-batches per iteration together trigger $(P-1) \cdot M$ inter-DC transmission requests as dynamic optical network traffic, where link occupancy may overlap across requests due to temporal demand.
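The following minimal Python sketch illustrates this abstraction; the `Link` class, the three-DC toy topology, and the `num_requests` helper are illustrative assumptions rather than code from the paper.

```python
from dataclasses import dataclass, field

W = 80  # frequency slots per fiber, 12.5 GHz each (NSFNET setting cited above)

@dataclass
class Link:
    """One fiber link e with its binary slot-occupancy vector s_e[1..W]."""
    u: str
    v: str
    slots: list = field(default_factory=lambda: [0] * W)  # 0 = free, 1 = occupied

# Toy topology: three DCs in a line (the cited evaluation uses the 14-node NSFNET graph).
links = {
    ("DC0", "DC1"): Link("DC0", "DC1"),
    ("DC1", "DC2"): Link("DC1", "DC2"),
}

def num_requests(P: int, M: int) -> int:
    """Inter-DC transmission requests generated per training iteration."""
    return (P - 1) * M

print(num_requests(P=8, M=128))  # 896 stage-to-stage transfers per iteration
```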

Key metrics are:

  • Per-iteration runtime $T_{iter}$: wall-clock time from the start of the first forward micro-batch to the end of the last backward micro-batch.
  • Bubble ratio $R_{bubble}$: proportion of iteration time spent idling owing to communication delays.
  • Blocking probability $p_{block}$: fraction of transmission requests that cannot be assigned a feasible path and contiguous frequency-slot block, inducing delay or cancellation.
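For concreteness, the sketch below computes these metrics from the kind of per-iteration bookkeeping a scheduler might keep; the function names and the numeric inputs are assumptions chosen only to roughly match the CBA figures reported in Section 4.

```python
def bubble_ratio(idle_time_s: float, iter_time_s: float) -> float:
    """R_bubble: fraction of the iteration spent idle waiting on communication."""
    return idle_time_s / iter_time_s

def blocking_probability(blocked_requests: int, total_requests: int) -> float:
    """p_block: fraction of requests with no feasible path and slot block."""
    return blocked_requests / total_requests

# Illustrative inputs (roughly consistent with the CBA column in Section 4):
print(bubble_ratio(idle_time_s=25.8, iter_time_s=68.0))                 # ~0.379
print(blocking_probability(blocked_requests=124, total_requests=896))   # ~0.138
```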

2. Communication-Aware Resource Assignment and Scheduling

Recent frameworks such as CBA (Fu et al., 23 Dec 2025) model PP training as a mixed-integer linear program (MILP) seeking to minimize $T_{iter}$ under multi-DC optical network constraints. Decision variables $x_{r,p,i,f}$ indicate the assignment of micro-batch transmission $r$ (corresponding to a stage-to-stage data movement) to optical path $p$ and contiguous frequency-slot block $f$ on every link of $p$.
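A heavily simplified sketch of such a formulation, written with the PuLP modeling library, is shown below. The toy request set, candidate paths, slot blocks, latency estimates, and the lower-bound constraint on $T_{iter}$ are illustrative assumptions; the paper's full MILP additionally encodes link non-overlap, slot contiguity, and pipeline precedence.

```python
import pulp

# Toy instance: two transmissions, candidate paths, and candidate slot-block starts.
requests = ["r0", "r1"]
paths = {"r0": ["pA", "pB"], "r1": ["pA"]}
blocks = [0, 4, 8]
t_comm = {("r0", "pA"): 0.4, ("r0", "pB"): 0.6, ("r1", "pA"): 0.5}  # alpha-beta estimates (s)

prob = pulp.LpProblem("pp_schedule", pulp.LpMinimize)
x = pulp.LpVariable.dicts(
    "x", [(r, p, f) for r in requests for p in paths[r] for f in blocks], cat="Binary")
T_iter = pulp.LpVariable("T_iter", lowBound=0)

prob += T_iter  # objective: minimize per-iteration time
for r in requests:
    # Each transmission gets exactly one (path, slot-block) assignment.
    prob += pulp.lpSum(x[(r, p, f)] for p in paths[r] for f in blocks) == 1
    # Crude lower bound on T_iter from the chosen path's latency estimate.
    prob += T_iter >= pulp.lpSum(
        t_comm[(r, p)] * x[(r, p, f)] for p in paths[r] for f in blocks)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(T_iter))
```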

The communication latency for a request $r$ with payload $c$ traversing path $p$ and slot block $f$ is captured by the $\alpha$-$\beta$ model: $T_{comm}(r) = \alpha_p + \beta_p \cdot c + \varepsilon_p(c)$, where $\alpha_p$ and $\beta_p$ are path-specific offset/bandwidth parameters updated per iteration, and $\varepsilon_p(c)$ accounts for queuing delays.
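A minimal sketch of this estimate is given below; the 40 ms offset and the 25 Gb/s effective rate are assumed example numbers, not parameters from the paper.

```python
def t_comm(c_bytes: float, alpha_p: float, beta_p: float, eps_p=lambda c: 0.0) -> float:
    """T_comm(r) = alpha_p + beta_p * c + eps_p(c) for payload c on path p."""
    return alpha_p + beta_p * c_bytes + eps_p(c_bytes)

# Assumed example: 40 ms path offset, 25 Gb/s effective rate (8 / 25e9 seconds per byte).
print(t_comm(c_bytes=2e9, alpha_p=0.040, beta_p=8 / 25e9))  # ~0.68 s for a 2 GB transfer
```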

Scheduling constraints rigorously ensure that no frequency slot on any link is double-booked and that frequency-slot block assignment remains contiguous on all links of a path.
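The sketch below checks these two constraints for one candidate assignment: a contiguous block of `demand` slots starting at index `f` must be free on every link of the path. The first-fit search and function names are illustrative (first-fit is what the KSP-FF/SD-FF baselines in Section 4 use; CBA instead ranks candidates by the fitness score in Section 3).

```python
def block_feasible(path_slots, f: int, demand: int) -> bool:
    """True iff slots f .. f+demand-1 are free (0) on every link along the path."""
    return all(v[s] == 0 for v in path_slots for s in range(f, f + demand))

def first_fit(path_slots, demand: int):
    """Return the lowest feasible contiguous slot-block start index, or None."""
    W = len(path_slots[0])
    for f in range(W - demand + 1):
        if block_feasible(path_slots, f, demand):
            return f
    return None

# Two-link path with 8 slots each for brevity; slots 0-2 are busy on the second link.
path = [[0] * 8, [1, 1, 1, 0, 0, 0, 0, 0]]
print(first_fit(path, demand=3))  # -> 3
```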

3. Communication-Bound-Aware (CBA) Dynamic Resource Adaptation

The crux of CBA (Fu et al., 23 Dec 2025) is adaptive, cross-domain orchestration:

  • Detection of communication-bound tasks: the orchestrator inspects the previous schedule $S_{j-1}$ and labels a micro-batch computation as communication-bound if its start is delayed beyond its predecessor's completion plus the inter-DC connection latency ($cur.start\_time > prev.completion\_time + Latency\_DC\_connect$).
  • Dynamic frequency slot demand adjustment: if a transmission was blocked last iteration, decrease its slot demand by one; if labeled communication-bound, increment by one (bounded system-wide) to secure wider spectrum and improve latency.
  • K-shortest-path search with contiguity-aware path selection: for each transmission, the framework evaluates $K$ candidate paths and slot blocks, calculating a fitness score

$$I(p_k, f) = \frac{C_{avail}(p_k, f)}{L(p_k)} \cdot \bigl(1 - \rho(p_k)\bigr)$$

where $C_{avail}$ is the contiguity index, $L(p_k)$ is the path hop count, and $\rho(p_k)$ is the current frequency-slot usage fraction of the path.
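A sketch of this candidate scoring is shown below. Since the precise definition of the contiguity index $C_{avail}$ is not spelled out above, the code uses one plausible proxy (the length of the free contiguous run starting at $f$ across all links of the path); all names and the maximum-score selection rule are illustrative assumptions.

```python
def rho(path_slots) -> float:
    """Fraction of occupied slots over all links of the path."""
    total = sum(len(v) for v in path_slots)
    return sum(sum(v) for v in path_slots) / total

def contiguity_index(path_slots, f: int, demand: int) -> int:
    """Assumed proxy for C_avail: free contiguous run starting at f on every link."""
    W = len(path_slots[0])
    run = 0
    for s in range(f, W):
        if all(v[s] == 0 for v in path_slots):
            run += 1
        else:
            break
    return run if run >= demand else 0

def fitness(path_slots, hop_count: int, f: int, demand: int) -> float:
    """I(p_k, f) = C_avail(p_k, f) / L(p_k) * (1 - rho(p_k))."""
    return (contiguity_index(path_slots, f, demand) / hop_count) * (1.0 - rho(path_slots))

def best_candidate(candidates, demand: int):
    """candidates: (path_slots, hop_count, f) triples drawn from the K shortest paths."""
    scored = [(fitness(ps, h, f, demand), ps, h, f) for ps, h, f in candidates]
    feasible = [c for c in scored if c[0] > 0]
    return max(feasible, key=lambda c: c[0]) if feasible else None
```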

This heuristic approach enables real-time resource adaptation as network state and model demands evolve during training. No formal worst-case guarantee on solution approximation is provided.

4. Performance Characterization and Benchmarks

Experimental evaluation (Fu et al., 23 Dec 2025) uses the NSFNET topology (14 nodes, 21 links; 80 frequency slots per link at 12.5 GHz; 64-QAM modulation), placing GPUs randomly across six DCs for Llama 3 models (8B and 70B, 8 PP stages).

  • Baseline algorithms: KSP-FF (K-shortest paths, first-fit assignment) and SD-FF (shortest-distance path, first-fit).
  • Key results (Llama 3 70B, GPipe, $M=128$ micro-batches):

    Metric                KSP-FF   SD-FF   CBA (Ours)
    Iteration time (s)    102.4    98.7    68.0
    Bubble ratio (%)      48.1     45.5    37.9
    Blocking prob. (%)    17.3     15.9    13.8
  • Improvements over the best baseline:
    • 31.25% reduction in iteration time
    • 11.96% decrease in bubble ratio
    • 13.20% fewer blocked requests

CBA ablation studies show that disabling communication-bound task labeling or dynamic $\alpha$-$\beta$ latency updates leads to inferior bubble ratios and blocking probabilities.

5. Theoretical and Algorithmic Complexity

The per-iteration complexity of CBA (Fu et al., 23 Dec 2025) is $O((P-1)\cdot M \cdot K \cdot (E\log V + W))$:

  • $K$-shortest-path search: $O(K(E \log V))$ per request,
  • Contiguity and fitness computation: $O(W)$ per path.

Given practical values (e.g., $P=8$, $M=128$, $K=4$, $W=80$), CBA remains computationally tractable even in large network topologies.
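As a back-of-the-envelope illustration of this bound under the quoted parameter values (treating each term as one elementary operation and using $\log_2$, both assumptions made only for illustration):

```python
import math

P, M, K, W = 8, 128, 4, 80
E, V = 21, 14  # NSFNET link and node counts

requests = (P - 1) * M                       # 896 transmission requests per iteration
ops = requests * K * (E * math.log2(V) + W)  # ~5.7e5 elementary operations
print(requests, round(ops))
```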

Multi-DC optical networking for distributed machine learning entails tight co-design between application-level pipeline-parallel training and optical network resource management. CrossPipe (Chen et al., 30 Jun 2025) generalizes multi-DC pipeline scheduling as a constraint optimization model, providing both CP-solver and greedy near-optimal schedules, explicitly accounting for bandwidth and latency (via the $\alpha$-$\beta$ model) and achieving up to a 33.6% reduction in training time compared to static schedules.

Alternate frameworks such as SPP (Luo et al., 2022), HelixPipe (Zhang et al., 1 Jul 2025), TawPipe (Wu et al., 12 Nov 2025), and BaPipe (Zhao et al., 2020) focus on device-level communication patterns, weight-passing schemes, and load-balanced stage partitioning, providing the necessary abstractions for scaling within or across DC boundaries.

CBA represents the state of the art in integrating pipeline-parallel task scheduling with real-time optical network state, adapting spectrum assignment dynamically, and maximizing utilization under stringent multi-DC constraints. Such communication-bound-aware resource assignment mechanisms are fundamental to the sustainable scaling of distributed LLM and DNN training workloads across geographically distributed datacenters.
