
Multi-DC Optical Networks

Updated 30 December 2025
  • Multi-Datacenter Optical Networks are high-capacity fiber systems connecting geographically dispersed datacenters to support distributed machine learning with pipeline-parallel training.
  • The CBA framework dynamically adjusts frequency slot allocation and employs MILP-based scheduling, achieving roughly a 31% reduction in iteration time over first-fit baselines.
  • Experimental evaluations on the NSFNET topology confirm that real-time resource adaptation and contiguity-aware path selection significantly reduce bubble ratios and blocking probabilities.

Multi-Datacenter Optical Networks constitute the physical and algorithmic foundation for distributed machine learning training that spans geographically separated datacenters interconnected via high-capacity optical fiber networks. These systems are increasingly critical for scaling large language model (LLM) and deep neural network (DNN) training where hardware resources in a single facility are insufficient. Multi-DC optical networks introduce novel challenges in resource assignment, communication scheduling, and system optimization, necessitating frameworks that co-design pipeline-parallel training algorithms with real-time network state awareness, latency estimation, and traffic engineering. Below, key principles, frameworks, and results from recent advances such as CBA ("Communication-Bound-Aware Cross-Domain Resource Assignment for Pipeline-Parallel Distributed LLM Training in Dynamic Multi-DC Optical Networks" (Fu et al., 23 Dec 2025)) are summarized in detail alongside representative approaches.

1. Distributed Training over Multi-DC Optical Network Topologies

Multi-DC optical networks are typically abstracted as a graph $G = (V, E)$, where $V$ represents individual datacenters (DCs) and $E$ the fiber links, each supporting $W$ frequency slots (e.g., $W = 80$ slots of $12.5$ GHz each on the NSFNET topology (Fu et al., 23 Dec 2025)). Each link $e$ maintains a binary frequency-slot occupancy vector $s_e[1..W]$ at time $t$. In pipeline-parallel (PP) distributed LLM training, $L$ layers are partitioned into $P$ stages $P_0, \dots, P_{P-1}$, with each stage mapped to a GPU and the stages often spread across multiple DCs. The $M$ micro-batches per iteration together trigger $(P-1) \cdot M$ inter-DC transmission requests as dynamic optical network traffic, where link occupancy may overlap across requests due to temporal demand.
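The following minimal Python sketch illustrates this abstraction; the `Link` class, the three-DC toy topology, and the `num_requests` helper are illustrative assumptions rather than code from the paper.

```python
from dataclasses import dataclass, field

W = 80  # frequency slots per fiber, 12.5 GHz each (NSFNET setting cited above)

@dataclass
class Link:
    """One fiber link e with its binary slot-occupancy vector s_e[1..W]."""
    u: str
    v: str
    slots: list = field(default_factory=lambda: [0] * W)  # 0 = free, 1 = occupied

# Toy topology: three DCs in a line (the cited evaluation uses the 14-node NSFNET graph).
links = {
    ("DC0", "DC1"): Link("DC0", "DC1"),
    ("DC1", "DC2"): Link("DC1", "DC2"),
}

def num_requests(P: int, M: int) -> int:
    """Inter-DC transmission requests generated per training iteration."""
    return (P - 1) * M

print(num_requests(P=8, M=128))  # 896 stage-to-stage transfers per iteration
```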

Key metrics are:

  • Per-iteration runtime $T_{iter}$: wall-clock time from the start of the first forward micro-batch to the end of the last backward micro-batch.
  • Bubble ratio $R_{bubble}$: proportion of iteration time spent idling owing to communication delays.
  • Blocking probability $p_{block}$: fraction of transmission requests that cannot be assigned a feasible path and contiguous frequency-slot block, inducing delay or cancellation.
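For concreteness, the sketch below computes these metrics from the kind of per-iteration bookkeeping a scheduler might keep; the function names and the numeric inputs are assumptions chosen only to roughly match the CBA figures reported in Section 4.

```python
def bubble_ratio(idle_time_s: float, iter_time_s: float) -> float:
    """R_bubble: fraction of the iteration spent idle waiting on communication."""
    return idle_time_s / iter_time_s

def blocking_probability(blocked_requests: int, total_requests: int) -> float:
    """p_block: fraction of requests with no feasible path and slot block."""
    return blocked_requests / total_requests

# Illustrative inputs (roughly consistent with the CBA column in Section 4):
print(bubble_ratio(idle_time_s=25.8, iter_time_s=68.0))                 # ~0.379
print(blocking_probability(blocked_requests=124, total_requests=896))   # ~0.138
```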

2. Communication-Aware Resource Assignment and Scheduling

Recent frameworks such as CBA (Fu et al., 23 Dec 2025) model PP training as a mixed-integer linear program (MILP) seeking to minimize $T_{iter}$ under multi-DC optical network constraints. Decision variables $x_{r,p,i,f}$ indicate the assignment of micro-batch transmission $r$ (corresponding to a stage-to-stage data movement) to optical path $p$ and contiguous frequency-slot block $f$ on every link of $p$.
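A heavily simplified sketch of such a formulation, written with the PuLP modeling library, is shown below. The toy request set, candidate paths, slot blocks, latency estimates, and the lower-bound constraint on $T_{iter}$ are illustrative assumptions; the paper's full MILP additionally encodes link non-overlap, slot contiguity, and pipeline precedence.

```python
import pulp

# Toy instance: two transmissions, candidate paths, and candidate slot-block starts.
requests = ["r0", "r1"]
paths = {"r0": ["pA", "pB"], "r1": ["pA"]}
blocks = [0, 4, 8]
t_comm = {("r0", "pA"): 0.4, ("r0", "pB"): 0.6, ("r1", "pA"): 0.5}  # alpha-beta estimates (s)

prob = pulp.LpProblem("pp_schedule", pulp.LpMinimize)
x = pulp.LpVariable.dicts(
    "x", [(r, p, f) for r in requests for p in paths[r] for f in blocks], cat="Binary")
T_iter = pulp.LpVariable("T_iter", lowBound=0)

prob += T_iter  # objective: minimize per-iteration time
for r in requests:
    # Each transmission gets exactly one (path, slot-block) assignment.
    prob += pulp.lpSum(x[(r, p, f)] for p in paths[r] for f in blocks) == 1
    # Crude lower bound on T_iter from the chosen path's latency estimate.
    prob += T_iter >= pulp.lpSum(
        t_comm[(r, p)] * x[(r, p, f)] for p in paths[r] for f in blocks)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(T_iter))
```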

The communication latency for a request $r$ with payload $c$ traversing path $p$ and slot block $f$ is captured by the $\alpha$-$\beta$ model: $T_{comm}(r) = \alpha_p + \beta_p \cdot c + \varepsilon_p(c)$, where $\alpha_p$ and $\beta_p$ are path-specific offset/bandwidth parameters updated per iteration, and $\varepsilon_p(c)$ accounts for queuing delays.
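A minimal sketch of this estimate is given below; the 40 ms offset and the 25 Gb/s effective rate are assumed example numbers, not parameters from the paper.

```python
def t_comm(c_bytes: float, alpha_p: float, beta_p: float, eps_p=lambda c: 0.0) -> float:
    """T_comm(r) = alpha_p + beta_p * c + eps_p(c) for payload c on path p."""
    return alpha_p + beta_p * c_bytes + eps_p(c_bytes)

# Assumed example: 40 ms path offset, 25 Gb/s effective rate (8 / 25e9 seconds per byte).
print(t_comm(c_bytes=2e9, alpha_p=0.040, beta_p=8 / 25e9))  # ~0.68 s for a 2 GB transfer
```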

Scheduling constraints rigorously ensure that no frequency slot on any link is double-booked and that frequency-slot block assignment remains contiguous on all links of a path.
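The sketch below checks these two constraints for one candidate assignment: a contiguous block of `demand` slots starting at index `f` must be free on every link of the path. The first-fit search and function names are illustrative (first-fit is what the KSP-FF/SD-FF baselines in Section 4 use; CBA instead ranks candidates by the fitness score in Section 3).

```python
def block_feasible(path_slots, f: int, demand: int) -> bool:
    """True iff slots f .. f+demand-1 are free (0) on every link along the path."""
    return all(v[s] == 0 for v in path_slots for s in range(f, f + demand))

def first_fit(path_slots, demand: int):
    """Return the lowest feasible contiguous slot-block start index, or None."""
    W = len(path_slots[0])
    for f in range(W - demand + 1):
        if block_feasible(path_slots, f, demand):
            return f
    return None

# Two-link path with 8 slots each for brevity; slots 0-2 are busy on the second link.
path = [[0] * 8, [1, 1, 1, 0, 0, 0, 0, 0]]
print(first_fit(path, demand=3))  # -> 3
```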

3. Communication-Bound-Aware (CBA) Dynamic Resource Adaptation

The crux of CBA (Fu et al., 23 Dec 2025) is adaptive, cross-domain orchestration:

  • Detection of communication-bound tasks: the orchestrator inspects the previous schedule $S_{j-1}$ and labels a micro-batch computation as communication-bound if its start is delayed beyond its predecessor's completion plus the inter-DC connection latency ($cur.start\_time > prev.completion\_time + Latency\_DC\_connect$).
  • Dynamic frequency slot demand adjustment: if a transmission was blocked last iteration, decrease its slot demand by one; if labeled communication-bound, increment by one (bounded system-wide) to secure wider spectrum and improve latency.
  • K-shortest-path search with contiguity-aware path selection: for each transmission, the framework evaluates $K$ candidate paths and slot blocks, calculating a fitness score

$$I(p_k, f) = \frac{C_{avail}(p_k, f)}{L(p_k)} \cdot \bigl(1 - \rho(p_k)\bigr)$$

where $C_{avail}$ is the contiguity index, $L(p_k)$ is the path hop count, and $\rho(p_k)$ is the current frequency-slot usage fraction of the path.
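A sketch of this candidate scoring is shown below. Since the precise definition of the contiguity index $C_{avail}$ is not spelled out above, the code uses one plausible proxy (the length of the free contiguous run starting at $f$ across all links of the path); all names and the maximum-score selection rule are illustrative assumptions.

```python
def rho(path_slots) -> float:
    """Fraction of occupied slots over all links of the path."""
    total = sum(len(v) for v in path_slots)
    return sum(sum(v) for v in path_slots) / total

def contiguity_index(path_slots, f: int, demand: int) -> int:
    """Assumed proxy for C_avail: free contiguous run starting at f on every link."""
    W = len(path_slots[0])
    run = 0
    for s in range(f, W):
        if all(v[s] == 0 for v in path_slots):
            run += 1
        else:
            break
    return run if run >= demand else 0

def fitness(path_slots, hop_count: int, f: int, demand: int) -> float:
    """I(p_k, f) = C_avail(p_k, f) / L(p_k) * (1 - rho(p_k))."""
    return (contiguity_index(path_slots, f, demand) / hop_count) * (1.0 - rho(path_slots))

def best_candidate(candidates, demand: int):
    """candidates: (path_slots, hop_count, f) triples drawn from the K shortest paths."""
    scored = [(fitness(ps, h, f, demand), ps, h, f) for ps, h, f in candidates]
    feasible = [c for c in scored if c[0] > 0]
    return max(feasible, key=lambda c: c[0]) if feasible else None
```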

This heuristic approach enables real-time resource adaptation as network state and model demands evolve during training. No formal worst-case guarantee on solution approximation is provided.

4. Performance Characterization and Benchmarks

Experimental evaluation (Fu et al., 23 Dec 2025) uses the NSFNET topology (14 nodes, 21 links; 80 frequency slots per link at 12.5 GHz; 64-QAM modulation), placing GPUs randomly across six DCs for Llama 3 models (8B and 70B, 8 PP stages).

  • Baseline algorithms: KSP-FF (K-shortest paths, first-fit assignment) and SD-FF (shortest-distance path, first-fit).
  • Key results (Llama 3 70B, GPipe, $M=128$ micro-batches):

    Metric                KSP-FF   SD-FF   CBA (Ours)
    Iteration time (s)    102.4    98.7    68.0
    Bubble ratio (%)      48.1     45.5    37.9
    Blocking prob. (%)    17.3     15.9    13.8
  • Improvements over the best baseline:
    • 31.25% reduction in iteration time
    • 11.96% decrease in bubble ratio
    • 13.20% fewer blocked requests

CBA ablation studies show that disabling communication-bound task labeling or dynamic $\alpha$-$\beta$ latency updates leads to inferior bubble ratios and blocking probabilities.

5. Theoretical and Algorithmic Complexity

The per-iteration complexity of CBA (Fu et al., 23 Dec 2025) is $O((P-1)\cdot M \cdot K \cdot (E\log V + W))$:

  • $K$-shortest-path search: $O(K(E \log V))$ per request,
  • Contiguity and fitness computation: $O(W)$ per path.

Given practical values (e.g., $P=8$, $M=128$, $K=4$, $W=80$), CBA remains computationally tractable even in large network topologies.
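As a back-of-the-envelope illustration of this bound under the quoted parameter values (treating each term as one elementary operation and using $\log_2$, both assumptions made only for illustration):

```python
import math

P, M, K, W = 8, 128, 4, 80
E, V = 21, 14  # NSFNET link and node counts

requests = (P - 1) * M                       # 896 transmission requests per iteration
ops = requests * K * (E * math.log2(V) + W)  # ~5.7e5 elementary operations
print(requests, round(ops))
```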

Multi-DC optical networking for distributed machine learning entails tight co-design between application-level pipeline-parallel training and optical network resource management. CrossPipe (Chen et al., 30 Jun 2025) generalizes multi-DC pipeline scheduling as a constraint optimization model, providing both CP-solver and greedy near-optimal schedules, explicitly accounting for bandwidth and latency (via the $\alpha$-$\beta$ model) and achieving up to a 33.6% reduction in training time compared to static schedules.

Alternate frameworks such as SPP (Luo et al., 2022), HelixPipe (Zhang et al., 1 Jul 2025), TawPipe (Wu et al., 12 Nov 2025), and BaPipe (Zhao et al., 2020) focus on device-level communication patterns, weight-passing schemes, and load-balanced stage partitioning, providing the necessary abstractions for scaling within or across DC boundaries.

CBA represents the state of the art in integrating pipeline-parallel task scheduling with real-time optical network state, adapting spectrum assignment dynamically, and maximizing utilization under stringent multi-DC constraints. Such communication-bound-aware resource assignment mechanisms are fundamental to the sustainable scaling of distributed LLM and DNN training workloads across geographically distributed datacenters.
