Distributed Scion (Disco): Scalable LLM Training
- Distributed Scion (Disco) is a distributed optimization framework for large deep learning models that leverages operator norm invariance and adaptive scaling laws.
- It employs empirically derived scaling rules for learning rate and batch size to consistently achieve optimal training performance across varied architectures.
- It integrates per-layer-group learning rate adaptation and efficient communication strategies to minimize overhead in diverse distributed parallelism setups.
Distributed Scion (Disco) refers to a scalable, distributed optimization and training framework for large deep learning models, built on the Scion optimizer and its associated norm invariance principles. It leverages operator norm–guided scaling rules, per-layer learning rate adaptation, and robust communication strategies to provide practical methodologies for efficient large-scale LLM training and hyperparameter transfer.
1. Operator Norm Invariance and Norm Transfer
A central discovery in the Scion framework is the invariance of the output layer's operator norm under optimal hyperparameters. For both model and dataset scaling, empirical evidence demonstrates that the output layer's operator norm remains nearly constant at the optimal configuration, regardless of model width, depth, or dataset size. The operator norm is the induced matrix norm
$$\|W_{\text{out}}\|_{\mathrm{op}} = \sup_{x \neq 0} \frac{\|W_{\text{out}}\,x\|}{\|x\|},$$
where $W_{\text{out}}$ is the output projection matrix and the vector norms are those assigned to the layer's input and output spaces.
As model size and dataset horizon vary across extensive experimental runs (up to 1.3B parameters and 138B tokens), the optimal output-layer norm consistently settles near the same constant value for both dataset scaling and model scaling (Filatov et al., 4 Oct 2025). This invariance provides a necessary condition for optimal training: the best-performing (minimum-loss) run for any configuration always lies on the constant-norm manifold.
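As an illustration of how this quantity can be tracked in practice, the sketch below computes the output-layer operator norm and its deviation from a target value. It assumes the spectral norm as the operator norm and uses illustrative names (`model.lm_head.weight`, `target_norm`) that are not taken from the Disco implementation.

```python
# Minimal sketch: tracking the output-layer operator norm during training.
# Assumptions: the operator norm is measured as the spectral norm (largest
# singular value), and `model.lm_head.weight` is an illustrative name for
# the output projection matrix.
import torch

@torch.no_grad()
def output_operator_norm(model) -> float:
    """Spectral norm of the output projection (illustrative norm choice)."""
    W_out = model.lm_head.weight                       # assumed output projection
    return torch.linalg.matrix_norm(W_out, ord=2).item()

def norm_gap(model, target_norm: float) -> float:
    """Relative deviation from the empirically constant optimal norm."""
    return abs(output_operator_norm(model) - target_norm) / target_norm
```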
2. Hyperparameter Scaling Laws and Sufficient Conditions
While many learning-rate and batch-size pairs $(\eta, B)$ achieve the necessary operator norm, only a unique pair consistently reaches the lowest loss. Sufficient conditions for optimality are provided by empirical hyperparameter scaling laws that prescribe how $\eta$ and $B$ should change with model and dataset size.
In the region around the optimum, $\eta$ and $B$ trade off against each other along the constant-norm manifold (Filatov et al., 4 Oct 2025). These empirical scaling rules align with the established behavior of Adam, indicating high transferability of optimal hyperparameter configurations across architectures and datasets.
This norm transfer principle can be stated precisely: a constant output norm is necessary but not sufficient for optimality; only the specific scaling rules for $\eta$ and $B$ guarantee the lowest achievable loss.
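As a purely illustrative sketch of how such scaling rules could be used to transfer hyperparameters from a small reference run, the snippet below applies power-law adjustments with placeholder exponents (`lr_exp`, `bs_exp`); these exponents and the helper `scale_hyperparameters` are hypothetical and do not reproduce the values reported by Filatov et al.

```python
# Hypothetical power-law transfer of (learning rate, batch size) from a
# reference configuration to a larger one. The exponents are placeholders,
# not the values reported in the paper.

def scale_hyperparameters(base_lr: float, base_bs: int,
                          base_width: int, base_tokens: int,
                          width: int, tokens: int,
                          lr_exp: float = -1.0, bs_exp: float = 0.5) -> tuple[float, int]:
    """Transfer hyperparameters tuned at a small proxy scale to a new scale."""
    lr = base_lr * (width / base_width) ** lr_exp         # hypothetical width dependence
    bs = int(base_bs * (tokens / base_tokens) ** bs_exp)  # hypothetical data-horizon dependence
    return lr, bs

# Example: proxy run at width 1024 / 1B tokens, target at width 4096 / 100B tokens.
lr, bs = scale_hyperparameters(base_lr=3e-3, base_bs=256,
                               base_width=1024, base_tokens=10**9,
                               width=4096, tokens=10**11)
```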
3. Layer-Group Learning Rate Adaptation
Distributed Scion (Disco) further improves upon the Scion optimizer by introducing per-layer-group learning rate schedules. Layers are partitioned into input, hidden, and output groups, and the optimal ratios of group learning rates are determined empirically. A uniform learning rate serves as a strong baseline, but differentiated rates reduce final loss by up to 6% (Filatov et al., 4 Oct 2025). The output layer is particularly sensitive and benefits from a more conservative learning rate, while hidden layers are more tolerant of smaller step sizes.
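A minimal sketch of how such group-wise rates could be wired into a standard PyTorch-style optimizer is shown below; the substring-based grouping (`embed`, `lm_head`) and the ratio arguments are illustrative assumptions rather than the grouping used in the Disco codebase.

```python
# Minimal sketch: per-layer-group learning rates via optimizer parameter groups.
# The name-matching heuristics and ratio defaults are illustrative only.

def build_param_groups(model, base_lr: float,
                       input_ratio: float = 1.0,
                       hidden_ratio: float = 1.0,
                       output_ratio: float = 1.0):
    groups = {"input": [], "hidden": [], "output": []}
    for name, param in model.named_parameters():
        if "embed" in name:            # input group: token embeddings
            groups["input"].append(param)
        elif "lm_head" in name:        # output group: output projection
            groups["output"].append(param)
        else:                          # hidden group: transformer blocks
            groups["hidden"].append(param)
    return [
        {"params": groups["input"],  "lr": base_lr * input_ratio},
        {"params": groups["hidden"], "lr": base_lr * hidden_ratio},
        {"params": groups["output"], "lr": base_lr * output_ratio},
    ]

# The returned list can be passed to any torch.optim-style optimizer, e.g.
# optimizer = SomeOptimizer(build_param_groups(model, base_lr=3e-3)).
```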
4. Distributed Implementation and Communication Strategy
The Disco implementation is engineered for compatibility with modern distributed parallelization paradigms: DDP (Distributed Data Parallel), FSDP (Fully Sharded Data Parallel), TP (Tensor Parallelism), PP (Pipeline Parallelism), EP (Expert Parallelism), and CP (Context Parallelism) (Filatov et al., 4 Oct 2025). The update and synchronization mechanism is bucketized:
- Each device computes local momentum-based LMO (Linear Minimization Oracle) updates.
- AllGather routines aggregate updates across devices into contiguous buckets.
- Local transformations occur prior to synchronization, minimizing communication overhead.
Disco implements efficient communication patterns that have been validated across more than two thousand training runs, confirming robust reproduction of the norm invariance and scaling behaviors.
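The snippet below sketches this bucketized pattern under simplifying assumptions: a sign-based LMO stands in for the norm-constrained LMOs actually used per layer group, and the flat-bucket layout is illustrative rather than the layout used in the released implementation.

```python
# Simplified sketch of the bucketized update/synchronization pattern:
# each rank computes an LMO-style update for its local parameter shard,
# packs it into a flat bucket, and a single AllGather reassembles the
# full update on every rank. The sign-based LMO is an illustrative stand-in.
import torch
import torch.distributed as dist

def local_lmo_update(momentum: torch.Tensor, radius: float) -> torch.Tensor:
    # LMO over an infinity-norm-style ball: scaled sign of the momentum
    # (a simplification of the norm-constrained LMOs used per layer group).
    return -radius * momentum.sign()

def bucketized_all_gather(local_update: torch.Tensor) -> torch.Tensor:
    world_size = dist.get_world_size()
    flat = local_update.reshape(-1).contiguous()   # pack the local shard into a flat bucket
    gathered = torch.empty(world_size * flat.numel(),
                           dtype=flat.dtype, device=flat.device)
    dist.all_gather_into_tensor(gathered, flat)    # one collective per bucket
    return gathered                                # full update; caller splits it back per shard
```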
5. Practical Guidance for Training and Hyperparameter Search
The norm-guided approach requires monitoring the output-layer operator norm during training. To achieve optimal model performance:
- Grid search over $(\eta, B)$ is recommended to locate the regime where the output norm matches its target value (a minimal sketch of such a search follows this list).
- Apply the scaling rules for $\eta$ and $B$, with hyperparameter tuning validated in large-batch regimes.
- Use per-layer-group learning rate schedules to further enhance performance.
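The following sketch illustrates the norm-guided search described above; `train_and_eval` is a hypothetical helper that trains one configuration and returns its final loss and output-layer operator norm.

```python
# Illustrative grid search over (learning rate, batch size): keep only
# configurations whose final output-layer operator norm lands near the
# constant target (necessary condition), then pick the lowest-loss run
# among them. `train_and_eval` is a hypothetical helper.

def norm_guided_search(lrs, batch_sizes, target_norm, train_and_eval, tol=0.1):
    candidates = []
    for lr in lrs:
        for bs in batch_sizes:
            loss, out_norm = train_and_eval(lr=lr, batch_size=bs)
            # Necessary condition: output norm close to the constant target.
            if abs(out_norm - target_norm) / target_norm <= tol:
                candidates.append((loss, lr, bs))
    # In practice, the lowest-loss run on the constant-norm manifold is selected.
    return min(candidates) if candidates else None
```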
Table 1 summarizes the optimal scaling relations:

| Parameter | Scaling Law | Typical Value (at optimum) |
|---|---|---|
| Operator norm | approximately constant across model and dataset scales | |
| Learning rate $\eta$ | empirical scaling rule (Section 2) | |
| Batch size $B$ | empirical scaling rule (Section 2) | up to $8192$ |
6. Empirical Validation and Dataset/Model Scaling
The Distributed Scion (Disco) framework has been subjected to an exhaustive empirical study with up to 1.3B-parameter LLMs on datasets as large as 138B tokens. Logs and reproducible training metrics are released to the research community (Filatov et al., 4 Oct 2025). Experimental validation confirms:
- Operator norm tracking accurately guides training toward optimal-loss configurations.
- Scaling rules for Adam are effectively inherited, enabling efficient transfer across architectures.
- Per-layer-group learning rate tuning is critical for further reduction of loss, especially as models are scaled.
7. Significance, Limitations, and Future Directions
Distributed Scion (Disco) achieves high efficiency and transferability in large-scale distributed model training by exploiting operator norm invariance and adaptive scaling laws. The necessary–sufficient separation in hyperparameter tuning is explicitly established: achieving the constant output norm is indispensable, yet tuning $\eta$ and $B$ according to the empirical scaling rules is what guarantees optimal learning.
A plausible implication is that norm-guided scaling principles could enable zero-shot hyperparameter transfer across unseen model architectures or dataset regimes, provided the operator norm manifold is known for the optimizer in question.
Remaining challenges concern the theoretical characterization of hidden-layer norm dynamics, extension to non-standard architectures, and scaling to even larger distributed systems. Future research is likely to focus on extending norm-transfer methodology, automating learning rate selection at finer granularity, and integrating Disco with evolving parallelization strategies in large-scale AI systems.