Scion Optimizer: Networks & Deep Learning
- Scion Optimizer is a framework unifying next-generation multipath network routing and geometry-aware deep learning optimization for improved scalability and efficiency.
- In networking, it implements packet-carried forwarding state and disjoint path selection to enhance bandwidth aggregation, reduce router state, and bolster security.
- In deep learning, it employs norm-aware linear minimization oracles to adaptively scale learning rates and batch sizes, yielding robust convergence and predictive power.
The term “Scion Optimizer” describes two distinct, leading-edge contributions unified by the SCION paradigm: (1) its foundational impact on next-generation network architectures, enabling scalable, multipath, endhost-controlled routing, and (2) a family of deep learning optimizers for large-scale neural networks based on non-Euclidean, norm-aware linear minimization oracles (LMOs), most prominently embodied in the Scion optimizer for LLM training. Both usages rely on the transfer of structure (from network state embedded in packet headers to optimization geometry encoded in parameter update rules), resulting in improved efficiency, scalability, and predictive power.
1. SCION in Network Optimization: Architecture and Control
SCION (Scalability, Control, and Isolation on Next-generation Networks) is an inter-domain network architecture devised to overcome limitations of the traditional Internet, particularly regarding scalability, availability, and security (Barrera et al., 2015). SCION separates control and data planes, shifts forwarding state from routers into “packet-carried forwarding state” (PCFS) within packet headers, and leverages explicit, endhost-selected path information.
The key operational mechanism is segmented path construction: end hosts assemble an end-to-end path by concatenating up-segments (to the ISD core), optional core-segments (across ISDs), and down-segments (to the destination), forming:

$$\text{path}_{\text{e2e}} = \text{up-segment} \,\|\, \text{core-segment} \,\|\, \text{down-segment}.$$
This explicit path carries ingress/egress interface identifiers for each traversed AS as 8-byte “opaque fields,” reducing router state and enabling efficient, hardware-accelerated symmetric-key operations for cryptographic validation. Packet header overhead grows linearly with path length (≈8 bytes per AS), translating to roughly 40–50 bytes for typical Internet paths of 4–5 AS hops.
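As a rough illustration of packet-carried forwarding state, the sketch below models hop fields as fixed 8-byte records and estimates header overhead from path length; the field layout and names are simplified assumptions for illustration, not the actual SCION wire format.

```python
from dataclasses import dataclass

HOP_FIELD_BYTES = 8  # each traversed AS contributes one 8-byte "opaque field"

@dataclass(frozen=True)
class HopField:
    """Simplified stand-in for a SCION hop field (not the real wire format)."""
    ingress_if: int  # ingress interface identifier within the AS
    egress_if: int   # egress interface identifier within the AS

def assemble_path(up, core, down):
    """Concatenate up-, core-, and down-segments into an end-to-end path."""
    return list(up) + list(core) + list(down)

def header_overhead(path):
    """Estimate PCFS overhead: linear in the number of traversed ASes."""
    return HOP_FIELD_BYTES * len(path)

# A typical 4-5 AS-hop path costs roughly 32-40 bytes of hop fields alone.
path = assemble_path(
    up=[HopField(0, 2), HopField(1, 3)],
    core=[HopField(4, 7)],
    down=[HopField(2, 5), HopField(6, 0)],
)
print(header_overhead(path))  # -> 40
```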
SCION achieves isolation through “Isolation Domains” (ISDs), which scope trust roots and policy boundaries, limiting the propagation of compromised configurations or keys. Trust Root Configurations (TRCs) localize certificate verification and prevent single-point-of-failure in global PKI trust. Through cross-signing, verifiability is preserved even when domains are isolated.
Key research highlights include ARPKI/PoliCert for advanced PKI policies, SIBRA extension for guaranteed minimal inter-AS bandwidth against DDoS attacks, and support for anonymous communication (LAP/HORNET).
2. Scion Optimizer for Deep Learning: The Norm-Invariant LMO Approach
In deep learning optimization, the Scion optimizer adopts the LMO (linear minimization oracle) framework, operating over non-Euclidean norm balls and capturing layerwise geometry (Riabinin et al., 19 May 2025, Filatov et al., 4 Oct 2025). Scion, together with Muon and Gluon, departs from traditional optimizers (e.g., Adam) by updating parameters using geometry-aware, norm-adapted directions.
The update for Scion is globally applied across all network layers and takes the form:

$$x_{t+1} = x_t + \gamma_t\, \mathrm{lmo}_{\mathcal{B}}(m_t), \qquad \mathrm{lmo}_{\mathcal{B}}(m) = \arg\min_{\|z\| \le \rho} \langle m, z \rangle = -\rho\, z^{\star}(m),$$

where $z^{\star}(m) = \arg\max_{\|z\| \le 1} \langle m, z \rangle$ denotes the maximizer defining the dual norm (e.g., $z^{\star} = UV^{\top}$ for the spectral norm of matrices), $m_t$ is the gradient momentum, and the stepsize $\gamma_t$ is adaptive.
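A minimal NumPy sketch of one such update step is shown below, assuming a single matrix parameter, a spectral-norm ball of radius rho, and simple exponential momentum; the function names and hyperparameters are illustrative, not the reference implementation.

```python
import numpy as np

def spectral_lmo(m, rho):
    """LMO over the spectral-norm ball: argmin_{||Z||_2 <= rho} <m, Z> = -rho * U V^T."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return -rho * (u @ vt)

def scion_step(x, grad, momentum, lr, rho=1.0, beta=0.9):
    """One geometry-aware update: momentum average, then a step toward the LMO point."""
    momentum = beta * momentum + (1.0 - beta) * grad
    x = x + lr * spectral_lmo(momentum, rho)
    return x, momentum

# Toy usage on a random least-squares objective.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 8)); m = np.zeros_like(w)
a, b = rng.normal(size=(32, 16)), rng.normal(size=(32, 8))
for t in range(100):
    grad = a.T @ (a @ w - b) / len(a)
    w, m = scion_step(w, grad, m, lr=0.05)
```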
The most significant theoretical discovery is the “operator norm invariance” for output layers: the optimal learning rate/batch size pair is determined by maintaining a fixed operator norm (RMS-to-$\ell_\infty$) of the output layer across model and dataset scales:

$$\|W_{\text{out}}\|_{\mathrm{RMS} \to \infty} \approx \text{const}.$$
This invariance is empirically observed for models up to 1.3B parameters and datasets up to 138B tokens. Hyperparameter settings that satisfy this norm condition yield optimal scaling.
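For a linear output layer $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ acting on inputs measured in the RMS norm, the RMS-to-$\ell_\infty$ operator norm reduces to the largest row $\ell_2$ norm scaled by $\sqrt{d_{\text{in}}}$; the short check below illustrates this identity (a sketch under that definition, not code from the cited papers).

```python
import numpy as np

def rms_to_linf_norm(w):
    """Operator norm of x -> Wx from the RMS norm (||x||_2 / sqrt(d_in)) to l-infinity.

    max_x ||Wx||_inf / ||x||_RMS = sqrt(d_in) * max_i ||w_i||_2 over rows w_i.
    """
    d_in = w.shape[1]
    return np.sqrt(d_in) * np.linalg.norm(w, axis=1).max()

# Sanity check against brute-force maximization over random unit-norm inputs.
rng = np.random.default_rng(1)
w = rng.normal(size=(4, 64))
xs = rng.normal(size=(64, 20000))
xs /= np.linalg.norm(xs, axis=0)          # unit l2 norm, so ||x||_RMS = 1/sqrt(64)
empirical = np.abs(w @ xs).max() * np.sqrt(64)
assert empirical <= rms_to_linf_norm(w) + 1e-9
```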
3. Scaling Laws, Norm Transfer, and Layerwise Learning Rates
Joint scaling of learning rate and batch size is governed by the operator norm invariant. In a log–log regression, the optimal learning rate follows a power law in batch size $B$ and dataset size $D$:

$$\eta^{\star}(B, D) \propto B^{\alpha} D^{\beta},$$

with exponents $\alpha$, $\beta$ fitted empirically; the fixed-data-horizon and dataset-scaling regimes yield distinct fitted values (reported in Filatov et al., 4 Oct 2025). These scaling relationships are consistent with the Adam optimizer’s empirically derived scaling rules, despite fundamentally different update mechanisms.
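A log–log fit of this power law can be done with ordinary least squares; below is a minimal sketch using placeholder measurements of tuned optimal learning rates at different $(B, D)$ points (the numbers are synthetic stand-ins, not results from the cited work).

```python
import numpy as np

# Synthetic placeholder measurements: (batch size, tokens, tuned optimal lr).
measurements = [
    (256, 1e9, 3.2e-3), (512, 1e9, 4.4e-3), (1024, 1e9, 6.1e-3),
    (256, 4e9, 2.1e-3), (512, 4e9, 2.9e-3), (1024, 4e9, 4.0e-3),
]
b, d, lr = map(np.array, zip(*measurements))

# Solve log lr = alpha*log B + beta*log D + c by least squares.
design = np.column_stack([np.log(b), np.log(d), np.ones_like(lr)])
(alpha, beta, c), *_ = np.linalg.lstsq(design, np.log(lr), rcond=None)
print(f"eta* ~ B^{alpha:.2f} * D^{beta:.2f}")
```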
“Norm transfer” refers to maintaining the same output-layer operator norm across models and datasets to achieve optimal scaling. Multiple (learning rate, batch size) pairs can reach this norm, but only one minimizes loss; matching the norm is therefore a necessary condition for optimality, not a sufficient one.
Grid search over per-layer-group learning rates reveals improved performance with non-uniform assignments; the best non-uniform configuration yields up to a 6% improvement over uniform rates, confirming that output layers are more sensitive.
4. Algorithmic Design and Efficiency: Layerwise Norms and LMOs
Scion's “global” update distinguishes it from Muon (which applies spectral-norm LMOs to hidden layers only) and Gluon (which solves LMOs layerwise). Under Scion, one may choose induced operator norms per layer, such as the spectral norm for weight matrices. Efficient computation employs the SVD to extract update directions:

$$\mathrm{lmo}(M) = -\rho\, U V^{\top}, \qquad M = U \Sigma V^{\top},$$

where $U$, $V$ are obtained from the SVD of the momentum/gradient matrix $M$.
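To make the per-layer norm choice concrete, here is a hedged sketch dispatching between two common LMOs: the spectral-norm ball for weight matrices (via SVD, as above) and the $\ell_\infty$ ball, whose LMO is a sign map. Which norm is assigned to which layer group is a design choice; the assignment below is illustrative only.

```python
import numpy as np

def lmo_spectral(m, rho):
    """argmin over the spectral-norm ball of radius rho: -rho * U V^T."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return -rho * (u @ vt)

def lmo_linf(m, rho):
    """argmin over the l-infinity ball of radius rho: elementwise -rho * sign(m)."""
    return -rho * np.sign(m)

# Illustrative per-layer-group assignment (not prescribed by the cited papers).
LMO_BY_GROUP = {"hidden": lmo_spectral, "output": lmo_linf}

def layer_update(group, momentum, rho):
    return LMO_BY_GROUP[group](momentum, rho)
```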
The theoretical framework relies on generalized smoothness conditions; for layer $i$, with layer norm $\|\cdot\|_{(i)}$ and dual norm $\|\cdot\|_{(i)^*}$:

$$\|\nabla_i f(x) - \nabla_i f(y)\|_{(i)^*} \le \left(L_i^0 + L_i^1 \|\nabla_i f(x)\|_{(i)^*}\right) \|x_i - y_i\|_{(i)}.$$

The adaptive stepsize is:

$$\gamma_i^t = \frac{\|\nabla_i f(x^t)\|_{(i)^*}}{L_i^0 + L_i^1 \|\nabla_i f(x^t)\|_{(i)^*}}.$$
Empirical studies find that the fitted $(L_i^0, L_i^1)$ constants differ substantially across layers, and that layerwise stepsizes proportional to the resulting ratio match practice.
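The following one-liner computes that adaptive stepsize from the dual gradient norm and the two fitted smoothness constants (a direct transcription of the formula above; the constants themselves must be estimated per layer).

```python
def adaptive_stepsize(grad_dual_norm: float, l0: float, l1: float) -> float:
    """Stepsize under (L0, L1) generalized smoothness: ||g||_* / (L0 + L1 * ||g||_*)."""
    return grad_dual_norm / (l0 + l1 * grad_dual_norm)

# As the gradient norm grows, the stepsize saturates at 1/L1; for small
# gradients it behaves like ||g||_* / L0.
print(adaptive_stepsize(10.0, l0=1.0, l1=0.5))  # ~1.67, approaching 1/L1 = 2
```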
5. Distributed Optimization and Communication Efficiency
The Scion optimizer family extends naturally to distributed environments. EF21-Muon introduces the first communication-efficient, non-Euclidean LMO-based distributed optimizer, employing bidirectional error feedback and contractive compression (Gruntkowska et al., 1 Oct 2025). For both worker-to-server and server-to-worker communication, model/gradient messages are compressed and “corrected” by error feedback, enabling rigorous convergence guarantees under non-Euclidean smoothness.
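A minimal single-worker sketch of the error-feedback pattern is given below, with top-k sparsification as the contractive compressor; the structure (compress the difference to a maintained estimate, then correct) follows the EF21 template, while the names and details are illustrative rather than the EF21-Muon implementation.

```python
import numpy as np

def top_k(x, k):
    """Contractive compressor: keep only the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x).ravel(), -k)[-k:]
    out.ravel()[idx] = x.ravel()[idx]
    return out

class EF21State:
    """Worker-side error feedback: transmit compressed *differences* g - g_hat."""
    def __init__(self, shape, k):
        self.g_hat = np.zeros(shape)  # worker and server maintain the same estimate
        self.k = k

    def compress_gradient(self, grad):
        message = top_k(grad - self.g_hat, self.k)  # cheap to transmit
        self.g_hat += message                        # both sides apply the correction
        return message
```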
Experimental results demonstrate up to 7× communication savings with no loss in accuracy for NanoGPT (124M parameters, FineWeb10B dataset), showing the practical benefits of norm-aware updates and adaptive compression.
6. Applications to Multipath Routing and Path Selection
In the context of SCION networks, the “Scion Optimizer” refers to methodologies for endhost-driven, multipath and intelligent path selection. BitTorrent over SCION exemplifies this approach, where path-level peers are defined, each corresponding to address/path tuples, and connections are established over independent SCION paths (Gartner et al., 2023). A disjoint path selection algorithm, which ranks available paths by conflicts (shared interfaces) and hop count, ensures bandwidth aggregation and congestion avoidance:
```
Algorithm: Disjoint path selection
Input:  peers, maxOutgoingConns
Output: list of pathLevelPeers

for each peer in peers:
    look up all SCION paths
aggregate these into allPaths
for each pair (path1, path2) in allPaths:
    compute numConflicts(path1, path2)
sort allPaths by conflicts and hop count
select the top maxOutgoingConns paths
return the corresponding path-level peers
```
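Read as Python, the selection might look like the sketch below; the helpers lookup_paths, shares_interface, and hop_count are hypothetical stand-ins for the corresponding SCION library calls, passed in as parameters.

```python
from itertools import combinations

def select_path_level_peers(peers, max_outgoing_conns, lookup_paths,
                            shares_interface, hop_count):
    # One candidate per (peer address, path) tuple: a "path-level peer".
    candidates = [(peer, path) for peer in peers for path in lookup_paths(peer)]

    # A conflict is a pair of candidate paths sharing an AS-level interface.
    conflicts = {id(path): 0 for _, path in candidates}
    for (_, p1), (_, p2) in combinations(candidates, 2):
        if shares_interface(p1, p2):
            conflicts[id(p1)] += 1
            conflicts[id(p2)] += 1

    # Prefer disjoint (few-conflict) paths, breaking ties by hop count.
    candidates.sort(key=lambda c: (conflicts[id(c[1])], hop_count(c[1])))
    return candidates[:max_outgoing_conns]
```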
Performance benchmarks report a 48% goodput improvement over BGP in small-scale BitTorrent experiments, with trade-offs in CPU overhead.
Longitudinal network studies on the SCIONLab testbed reveal significant path diversity, control-plane churn, and asymmetric performance (path discrepancy), necessitating predictive models and anomaly detection within optimizer frameworks (Rossi et al., 8 Sep 2025). Weighted multi-objective ML models and per-hop metrics (for bottleneck localization) enable adaptive throughput/latency trade-offs and reliability enhancement in multipath scenarios.
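As an illustration of a weighted multi-objective trade-off over per-path metrics, the scoring sketch below combines normalized throughput and latency estimates; the weights, bounds, and metric names are assumptions for illustration, not the models from Rossi et al.

```python
def score_path(predicted_throughput_mbps, predicted_rtt_ms,
               w_throughput=0.7, w_latency=0.3,
               max_throughput_mbps=1000.0, max_rtt_ms=500.0):
    """Higher is better: reward throughput, penalize latency (both normalized)."""
    throughput_term = predicted_throughput_mbps / max_throughput_mbps
    latency_term = 1.0 - min(predicted_rtt_ms / max_rtt_ms, 1.0)
    return w_throughput * throughput_term + w_latency * latency_term

# Re-weighting shifts path selection toward latency-sensitive traffic.
print(score_path(400, 80))                                  # throughput-weighted
print(score_path(400, 80, w_throughput=0.2, w_latency=0.8)) # latency-weighted
```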
7. Summary and Future Directions
The Scion optimizers, in both the network and deep learning domains, operationalize layerwise and pathwise control, norm invariance, and geometry awareness, leading to improved efficiency, scaling, and predictable optimization regimes. The norm-invariance principle in LLM optimization provides a unified objective for hyperparameter selection, while multipath selection and control in networking adaptively optimize real-world traffic.
In practical terms, monitoring operator norms, layer-specific loss landscapes, and path metrics guides both distributed training and network traffic engineering. The Disco implementation and related toolkits (BitTorrent over SCION, ScionPathML) provide robust experimental platforms for further empirical research.
Further directions include (i) refined norm-based rules for lower-level layers, (ii) extended distributed optimization frameworks with dynamic compression, (iii) richer path diversity-exploitation in network protocols, and (iv) convergence between norm-based and adaptive moment parameterization in both domains.
| Domain | Scion Optimizer Principle | Key Metrics/Invariant |
|---|---|---|
| Networking | Path-aware, multipath selection | Path segment diversity, goodput, RTT |
| Deep Learning | Layerwise LMO, norm invariance | Operator norm (RMS-to-$\ell_\infty$), scaling laws |
The Scion optimizer unifies norm-guided, structure-aware optimization across both inter-domain routing and large-scale neural network training, establishing robust, scalable foundations for next-generation infrastructure, both in networking and AI.