ByteRobust: Scalable GPU Training Management

Updated 3 July 2026

ByteRobust is a robust GPU infrastructure management system designed to maintain an effective training time ratio above 95% despite frequent hardware and software failures.
It employs a dual-plane architecture with a Kubernetes-based control plane and a Python agent for proactive fault detection and real-time recovery.
The system integrates in-place hot updates, warm standby nodes, and over-eviction-aware checkpointing to minimize downtime and accelerate failure scheduling.

ByteRobust is a production-scale GPU infrastructure management system developed by ByteDance to deliver robust, high-availability LLM training over tens to hundreds of thousands of GPUs. Designed for environments where clusters exceed 10,000 GPUs and single-job runs extend for multiple months, ByteRobust targets the central infrastructure challenge posed by high-frequency, heterogeneous failures. These are both explicit (clear error output) and implicit (silent hangs, NaN emergence, or drops in model FLOPs utilization [MFU]), and they fundamentally limit productive training time if unmanaged. The principal design objective is to maximize the Effective Training Time Ratio (ETTR), defined as

$\text{ETTR} = \frac{\text{productive GPU training time}}{\text{wall-clock time elapsed}} \times 100\%$

ByteRobust aims to sustain ETTR above 95% even for month-long, multi-thousand-GPU LLM jobs (Wan et al., 19 Sep 2025).
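As a worked illustration of the ETTR definition, the following minimal sketch computes the ratio from a job timeline. The `Interval` record and the sample numbers are assumptions made for this example, not data from ByteRobust.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One contiguous stretch of a job's wall-clock timeline (hypothetical record)."""
    seconds: float
    productive: bool  # True if the GPUs were doing useful training work


def ettr(intervals: list[Interval]) -> float:
    """Return productive time as a percentage of total elapsed wall-clock time."""
    total = sum(i.seconds for i in intervals)
    if total <= 0:
        raise ValueError("timeline must cover a positive duration")
    productive = sum(i.seconds for i in intervals if i.productive)
    return 100.0 * productive / total


# Illustrative numbers only: a 24 h day with two incidents totalling 45 minutes.
timeline = [
    Interval(10 * 3600, True),
    Interval(15 * 60, False),
    Interval(8 * 3600, True),
    Interval(30 * 60, False),
    Interval(5 * 3600 + 15 * 60, True),
]
day_ettr = ettr(timeline)  # 23.25 h productive out of 24 h
```

In this made-up day, 45 minutes of downtime already costs more than three percentage points, which is why a 95% target leaves little room for slow detection or slow restarts.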

1. System Motivation and Failure Characterization

As LLM training scales to hundreds of billions of parameters and tens of thousands of GPUs, individually rare hardware or software failures (CUDA XIDs, kernel panics, switch reboots, silent data corruptions) become statistically likely within any given hour. At these scales, implicit failures (job hangs, MFU drops, silent data corruption [SDC]) present with little or no surface-level evidence. Prevalent solutions such as log parsing and static timeouts are inadequate: naive recovery, which waits for a timeout (NCCL’s default is ≈600 seconds) and then restarts the job and reloads a checkpoint, induces unacceptable overhead. Such approaches push cluster-wide ETTR down to 70–80% or worse, especially as multi-stage, multi-month LLM experiments demand frequent user code and configuration changes.

ByteRobust is predicated on three operational realities: (i) hardware/software failures are frequent and inevitable at scale; (ii) implicit and explicit failures require fundamentally different detection paradigms; (iii) modern LLM training is a continuously evolving, multi-stage process requiring in-place code, configuration, and data distribution updates.

2. Architecture: Control and Data Planes

ByteRobust employs a dual-plane architecture:

Control Plane: The “Robust Controller” is implemented as a Kubernetes controller (~20K lines of Go) with a custom metadata store and scheduler. It manages custom resource definitions (CRDs), enforces fault-handling policies, triggers pod restarts, code rollbacks, and activates GPU standbys. Internal scheduling and job management are tightly integrated with ByteDance's in-house clusters (>200,000 GPUs).
Data Plane: Embedded as a Python “Robust Agent” (~5K lines), the data plane operates within each training pod, receiving gRPC heartbeats and control signals. It has four submodules:
- Monitor: Inspects host, GPU, and network state every 2–30 s via DCGM, dmesg, and InfiniBand diagnostics, and integrates with wandb for MFU and loss tracking.
- Diagnoser: On suspension, triggers NVIDIA EUD-style stress tests, intra/inter-node NCCL health checks, and bitwise model alignment tests.
- On-Demand Tracer: Captures Python (py-spy) and C++ call stacks for diagnosing hanging processes.
- Checkpoint Manager: Implements ZeRO-style, frequent in-memory checkpointing with asynchronous device-to-host transfers and cross-parallel-group backups.

Job annotation with ByteRobust policies allows transparent orchestration; pods dynamically respond to code changes, pod standbys, and over-eviction events.
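The cross-parallel-group backups kept by the Checkpoint Manager can be illustrated with a small placement check. The source does not give the placement rule, so the ring over the data-parallel index used here is an assumption of this sketch, as are the names `backup_peer`, `pp_group`, and `survives_any_pp_group_eviction`. The idea shown is only that each rank's shard gets a second copy held outside its own pipeline-parallel (PP) group, so evicting one whole PP group cannot remove every replica.

```python
from itertools import product

Rank = tuple[int, int, int]  # (dp, pp, tp) coordinates of a rank in the 3D grid


def backup_peer(rank: Rank, dp_size: int) -> Rank:
    """Assumed rule: send the backup to the same (pp, tp) slot of the next DP replica."""
    dp, pp, tp = rank
    return ((dp + 1) % dp_size, pp, tp)


def pp_group(rank: Rank) -> tuple[int, int]:
    """Ranks that share (dp, tp) form one pipeline-parallel group."""
    dp, _, tp = rank
    return (dp, tp)


def survives_any_pp_group_eviction(dp_size: int, pp_size: int, tp_size: int) -> bool:
    """Check that every shard keeps a replica after evicting any single PP group."""
    ranks = list(product(range(dp_size), range(pp_size), range(tp_size)))
    # Each shard lives on its owner and on the owner's backup peer.
    holders = {r: {r, backup_peer(r, dp_size)} for r in ranks}
    for evicted in {pp_group(r) for r in ranks}:
        for copies in holders.values():
            if all(pp_group(h) == evicted for h in copies):
                return False  # both copies sat inside the evicted group
    return True
```

With a single data-parallel replica the assumed rule maps every rank onto itself, so the check fails. That is a reminder that this kind of placement needs at least two replicas to be meaningful.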

3. Failure Detection and Diagnosis

ByteRobust’s detection and diagnosis pipeline is structured into proactive polling and hierarchical, stop-time analysis:

Proactive Checks: The agent polls hardware and software at 2–30 s intervals for CUDA XIDs, temperature anomalies, OOM events, kernel panics, and network (IB, UFM) drops. RDMA bandwidth and TensorCore MFU are surveyed every 30 s. When a check fires, the agent suspends the process and begins automated eviction. Approximate detection latencies are:

$T_{\text{detect}} \approx \begin{cases} 2\text{–}10\,\text{s} & \text{GPU/host faults} \\ 30\,\text{s} & \text{network faults} \\ 10\text{–}30\,\text{s} & \text{MFU/RDMA drops} \end{cases}$

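The relationship between polling interval and detection latency can be sketched with a tiny scheduler driven by a simulated clock. The `Check` and `Monitor` classes and the probe names are hypothetical stand-ins, not the Robust Agent's actual interfaces. A real agent would query DCGM, dmesg, or InfiniBand tooling where this sketch reads a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Check:
    name: str
    interval_s: float
    probe: Callable[[], bool]  # returns True when a fault is observed
    next_due: float = 0.0


@dataclass
class Monitor:
    checks: list[Check] = field(default_factory=list)

    def tick(self, now: float) -> list[str]:
        """Run every check that is due at time `now`; return the names that fired."""
        fired = []
        for check in self.checks:
            if now >= check.next_due:
                check.next_due = now + check.interval_s
                if check.probe():
                    fired.append(check.name)
        return fired


# Simulated fault state; both faults are injected at t = 7 s.
state = {"xid": False, "ib_down": False}
monitor = Monitor([
    Check("gpu_xid", 2.0, lambda: state["xid"]),
    Check("ib_link", 30.0, lambda: state["ib_down"]),
])

detected_at: dict[str, int] = {}
for now in range(0, 61):
    if now == 7:
        state["xid"] = True
        state["ib_down"] = True
    for name in monitor.tick(float(now)):
        detected_at.setdefault(name, now)  # record first detection time only
```

In this simulation the GPU fault is seen at t = 8 s and the link fault at t = 30 s. The worst-case latency of each check is bounded by its polling interval, which is the property the latency table above reflects.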
Stop-Time Diagnostics: If fast detection is inconclusive, a hierarchical diagnosis suite initiates:
1. Trace exit codes/logs for user-space errors.
2. Run EUD and single-node NCCL all-to-all health checks.
3. Bitwise output alignment within controlled model/DP×TP×PP configurations using synthetic data.
4. For persistent, complex, or SDC-induced hangs, the dual-phase replay algorithm isolates the failing machine by intersecting two groupings of the machine indices $x_i$, one by $\lfloor x_i/m \rfloor$ and one by $x_i \bmod n$, efficiently localizing faults.

In practice, most SDCs are single-node, isolatable via background replay with <2 hours of offline group testing, a considerable reduction compared to manual approaches.
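The dual-phase grouping in step 4 can be sketched as follows. This is a minimal illustration, not ByteRobust's implementation: `replay_fails` is a hypothetical callback standing in for a reduced-scale replay on a group of machines, and the requirement `m <= n` is an assumption added here so that one block and one residue class share at most one machine.

```python
from typing import Callable


def locate_faulty_machines(
    machines: list[int],
    m: int,
    n: int,
    replay_fails: Callable[[list[int]], bool],
) -> list[int]:
    """Intersect the failing floor(x / m) blocks with the failing x mod n classes."""
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    if m > n:
        raise ValueError("need m <= n so the two groupings pin down one machine")

    # Phase 1: blocks of m consecutive machine indices.
    blocks: dict[int, list[int]] = {}
    for x in machines:
        blocks.setdefault(x // m, []).append(x)
    bad_blocks = {key for key, group in blocks.items() if replay_fails(group)}

    # Phase 2: residue classes modulo n.
    classes: dict[int, list[int]] = {}
    for x in machines:
        classes.setdefault(x % n, []).append(x)
    bad_classes = {key for key, group in classes.items() if replay_fails(group)}

    # Suspects are the machines that sit in a failing group in both phases.
    return [x for x in machines if x // m in bad_blocks and x % n in bad_classes]


# Example: 24 machines, machine 13 silently corrupts data.
suspects = locate_faulty_machines(
    list(range(24)), m=4, n=6, replay_fails=lambda group: 13 in group
)
```

With a single bad machine, 6 + 6 group replays replace up to 24 per-machine tests. If several machines are faulty at once, the intersection returns a superset of the culprits and further narrowing is still needed.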

4. Fault Tolerance and Recovery Strategies

ByteRobust stacks three fault tolerance mechanisms to minimize downtime:

In-Place Hot-Update: User code tweaks (e.g., kernel changes, hyperparameter patches) are lazily but rapidly injected into containers, leveraging the already-warmed environment to sidestep pod re-launch lag (often tens of minutes).
Warm Standbys: The standby pool is sized to the P99 of daily node failures (derived from a Binomial(n, p) model over historical logs). Pre-provisioned, self-checked GPUs (hardware-equivalent, container images/data preloaded under fast sleep barriers) can be activated instantly for common failures. If the number of failed nodes $F$ exceeds the standby pool size $K$, the $K$ standbys are woken immediately and $F-K$ new nodes are provisioned asynchronously, blocking only if necessary. This scheme operates within 5.2% of the ideal "oracle" scenario and achieves up to a 10× scheduling speedup over full job requeue.
Over-Eviction-Aware Checkpointing: During 3D ZeRO-style parallel training, intra-group nonblocking shard exchanges and local SSD/CPU RAM backups occur during idle communication slots. P2P optimizer-shard backup ensures that evicting any single PP group or small node set never eliminates all checkpoint replicas. Device-to-host, serialization, and P2P send are multiplexed over separate CUDA/IPC channels, resulting in checkpoint stalls of <0.05 s per step, more than 99% below Megatron-LM’s blocking paradigm.
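The warm-standby sizing rule can be made concrete with a short calculation. This sketch assumes node failures are independent with a per-node daily failure probability `p`, which is what a Binomial(n, p) model implies. The function names and the example values of `n` and `p` are illustrative and are not figures from the source.

```python
def binomial_quantile(n: int, p: float, q: float = 0.99) -> int:
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p)."""
    if n < 0 or not 0.0 <= p < 1.0 or not 0.0 < q < 1.0:
        raise ValueError("need n >= 0, 0 <= p < 1 and 0 < q < 1")
    pmf = (1.0 - p) ** n  # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < q and k < n:
        # Recurrence: P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * p / (1 - p)
        pmf *= (n - k) / (k + 1) * p / (1.0 - p)
        k += 1
        cdf += pmf
    return k


def plan_replacement(failed: int, standby_pool: int) -> tuple[int, int]:
    """Return (standbys woken immediately, nodes to provision asynchronously)."""
    woken = min(failed, standby_pool)
    return woken, failed - woken


# Illustrative only: 1,200 nodes, each with a 0.1% chance of failing on a given day.
pool_size = binomial_quantile(n=1200, p=0.001)
```

For these made-up inputs the P99 pool is four nodes even though the mean is about 1.2 failures per day. Sizing to the tail rather than the mean is what lets most incidents be absorbed without waiting on fresh provisioning.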

5. Implementation and Production Deployment

The ByteRobust codebase comprises:

| Component | Language / Size | Core Function |
| --- | --- | --- |
| Control Plane | Go, ~20K LoC | CRDs, scheduler, policy engine |
| Data Plane/Agent | Python, ~5K LoC | Monitoring, diagnosis, checkpointing |
| Runtime Analyzer | Go, ~12K LoC | Log/metric aggregation, analytics |
| Checkpoint Manager | Python, ~3K LoC | ZeRO/P2P checkpoint management |

ByteRobust has operated in ByteDance’s production environment (200,000+ GPUs) for over a year with no infrastructure-attributable downtime. Over a recent 90-day period, the system processed 44,000 explicit and 6,000 implicit failures. Deployment is transparent: LLM jobs are annotated and orchestrated through the internal scheduler, with ByteRobust-managed pods coexisting seamlessly with standard Kubernetes jobs.

6. Evaluation and Empirical Results

Evaluation spans two ByteDance pretraining runs: a 70B dense LLM (3 months, 9,600 GPUs) and a 200B MoE LLM (1 month, 9,600 GPUs). ByteRobust sustained end-to-end ETTR above 97%, with the worst-case contiguous unproductive block peaking at 50 minutes over three months. Sliding-window ETTR analysis (1-hour) remained above 95% even amidst major feature rollouts. Specifically:

- Proactive evictions resolved 56–73% of incidents immediately.
- Hot updates handled all 9,000 manual restarts.
- Stack aggregation and over-eviction remedied 8–10% of implicit hangs or MFU drops.
- Rollback mechanisms surfaced 7–11% of complex software bugs.
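Stack aggregation followed by over-eviction can be sketched in a few lines. The source does not spell out the decision rule, so the rule here (the most common stack is treated as healthy, and every rank sharing a parallel group with an outlier is evicted) is an assumption of this sketch, as are the function names and the sample stacks.

```python
from collections import Counter


def outlier_ranks(stacks: dict[int, str]) -> list[int]:
    """Return ranks whose captured stack differs from the most common one."""
    if not stacks:
        return []
    dominant, _ = Counter(stacks.values()).most_common(1)[0]
    return sorted(rank for rank, stack in stacks.items() if stack != dominant)


def ranks_to_evict(stacks: dict[int, str], group_of: dict[int, int]) -> list[int]:
    """Over-evict: take every rank that shares a parallel group with an outlier."""
    bad_groups = {group_of[rank] for rank in outlier_ranks(stacks)}
    return sorted(rank for rank, group in group_of.items() if group in bad_groups)


# Hypothetical hang: seven ranks wait in a collective, rank 6 is stuck elsewhere.
stacks = {rank: "nccl_all_reduce" for rank in range(8)}
stacks[6] = "dataloader_next"
group_of = {rank: rank // 2 for rank in range(8)}  # assumed groups of two ranks

evicted = ranks_to_evict(stacks, group_of)
```

Here ranks 6 and 7 are both evicted although only rank 6 looked abnormal. That trades one extra healthy rank for not having to prove which member of the group is at fault.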

Microbenchmarks on 16,384 GPUs report:

- Warm standby plus hot update expedites failure scheduling by 10.87× over full requeue and 5.36× over selective reschedule.
- Checkpointing incurs <0.03 s of blocking per step (vs. 6–13 s for Megatron-LM and 0.2–1.8 s for in-memory competitors), with <1% impact on MFU.
- Detection latency is sub-30 s for GPU, network, and host faults (far superior to 600 s timeout baselines).

7. Design Principles and Implications

Key operational lessons and practices include:

- Proximity in fault detection: Integrate second-level hardware and MFU querying into the application path to minimize the detection interval.
- Coarse isolation instead of brittle exactness: When precise root-cause identification is infeasible, over-evict entire parallel groups to preserve forward training progress.
- Support for dynamic code evolution: Treat ongoing code updates and rollbacks as intrinsic, leveraging hot updates and automatic rollbacks to maximize productivity.
- Resource buffering tied to empirical failure probability: Size warm standby pools to the empirical P99 failure count, efficiently allocating overhead.
- SDC detection remains complex: Persist with background dual-phase group replays and intra-machine validation (e.g., MiniGPT suite) to isolate rare, propagating bit-flip errors.

ByteRobust demonstrates that integrating real-time, lightweight monitoring; systematic, data-driven over-eviction; hierarchical stop-time diagnosis; and rapid, multi-modal recovery can deliver near-optimal ETTR on exascale LLM training jobs, establishing a foundation for future large-scale, continuous training infrastructures (Wan et al., 19 Sep 2025).


References (1)

Wan et al. Robust LLM Training Infrastructure at ByteDance (19 Sep 2025)
