
Atlas Supercomputer Overview

Updated 25 August 2025
  • Atlas Supercomputer is a high-performance computing platform integrating HTC and HPC paradigms to support large-scale scientific workloads.
  • It employs advanced scheduling algorithms, containerization, and dynamic resource allocation techniques to optimize workload execution and maximize resource utilization.
  • The system supports malleable deep neural network training and real-time monitoring through innovative network design and resource management strategies.

The Atlas Supercomputer is a high-performance computing platform designed to support large-scale scientific workloads by integrating traditional high-throughput and high-performance architectures with advanced workload management, scheduling, and resource optimization methodologies. Its deployment and operation draw heavily on lessons learned from large scientific collaborations such as the ATLAS experiment at CERN, with key architectural, operational, and optimization strategies developed to maximize computational throughput, flexibility, and utilization.

1. Architectural Principles and Workload Integration

The Atlas Supercomputer employs a hybrid model that merges high-throughput computing (HTC) with high-performance computing (HPC), a strategy necessitated by the data and simulation demands of modern scientific experiments. Such workloads are typified by large numbers of independent, computationally intensive tasks—most notably Monte Carlo detector simulations, which often comprise 60% of grid jobs and are implemented using frameworks such as AthenaMP (GEANT4 toolkit, ≤2 GB per process) (Oleynik et al., 2017).

To capitalize on unused HPC resources, Atlas leverages opportunistic backfill scheduling policies, allowing short, low-priority tasks to utilize otherwise idle compute cores. Rather than relying exclusively on traditional pilot-based job pull systems common in grid computing, Atlas adapts the PanDA workload management system. Here, a specialized “PanDA Broker” is deployed on Data Transfer Nodes (DTNs)—nodes with network connectivity necessary for job aggregation and dispatch—not on the compute nodes, which typically lack outbound connectivity and local storage (Oleynik et al., 2017).

This adaptation addresses supercomputer-specific constraints and sets the stage for large-scale, sustained exploitation of available compute resources. For instance, Atlas has demonstrated production scales approaching 52M core-hours per year on facilities such as Titan, corresponding to approximately 5% of the entire Worldwide LHC Computing Grid capacity obtained solely from opportunistic backfill (Oleynik et al., 2017).
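To make the backfill mechanism concrete, the sketch below shows a minimal slot-matching loop under assumed data structures (BackfillSlot, Job, and the first-fit policy are illustrative, not the published PanDA Broker logic): each queued short job is placed into the first idle slot whose free cores and remaining walltime can accommodate it.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BackfillSlot:
    """An idle allocation gap: free cores and minutes until the next reservation."""
    cores: int
    minutes: int
    assigned: List[str] = field(default_factory=list)

@dataclass
class Job:
    """A short, low-priority task (e.g., a Monte Carlo simulation chunk)."""
    name: str
    cores: int
    minutes: int

def pack_backfill(slots: List[BackfillSlot], queue: List[Job]) -> List[Job]:
    """Greedily place queued jobs into the first slot that fits them.

    Returns the jobs that could not be placed and remain queued.
    """
    leftover = []
    for job in queue:
        for slot in slots:
            if job.cores <= slot.cores and job.minutes <= slot.minutes:
                slot.cores -= job.cores          # consume cores for the slot's duration
                slot.assigned.append(job.name)
                break
        else:
            leftover.append(job)                 # no slot could host this job
    return leftover

if __name__ == "__main__":
    slots = [BackfillSlot(cores=64, minutes=30), BackfillSlot(cores=16, minutes=90)]
    queue = [Job("sim-a", 16, 25), Job("sim-b", 32, 20), Job("sim-c", 16, 60)]
    remaining = pack_backfill(slots, queue)
    for s in slots:
        print(s.assigned, "free cores:", s.cores)
    print("still queued:", [j.name for j in remaining])
```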

2. Scheduling Algorithms and Resource Assignment

Efficient resource allocation within Atlas is underpinned by sophisticated scheduling and job mapping algorithms. The system is equipped to handle the assignment of jobs to supercomputer nodes in multi-core, multiprocessor environments where communication and data locality are essential performance factors. Three classes of algorithms are relevant: parallel simulated annealing, genetic, and composite algorithms (Baranov et al., 2022).

  • Simulated Annealing: Offers rapid convergence to acceptable node/process mappings within strict runtime constraints typical of job scheduling windows (5–15 min). It is recommended for regular mappings, particularly when process count closely matches available cores.
  • Genetic Algorithms: Deliver high-quality, near-optimal assignments by evolving solution populations; beneficial for highly interconnected, parallel tasks, but incur higher compute overhead.
  • Composite Algorithms: Combine both methods—using simulated annealing for initial population seeding followed by genetic refinement—achieving improved solution quality with moderate runtime increase.

Atlas predominantly employs simulated annealing for its speed, using composite algorithms selectively for very large, communication-intensive jobs, where mapping accuracy directly impacts execution efficiency (Baranov et al., 2022).
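As an illustration of the simulated-annealing approach, the following sketch maps processes to nodes by minimizing an assumed communication cost, defined here as the traffic between process pairs weighted by the distance between their assigned nodes (not the exact published formulation); swap moves and a geometric cooling schedule are standard choices for the short scheduling windows noted above.

```python
import math
import random

def mapping_cost(assign, traffic, dist):
    """Total communication cost: pairwise traffic weighted by node distance."""
    n = len(assign)
    return sum(traffic[i][j] * dist[assign[i]][assign[j]]
               for i in range(n) for j in range(n) if i != j)

def anneal_mapping(traffic, dist, iters=20000, t0=10.0, cooling=0.9995, seed=0):
    """Simulated annealing over process->node permutations (one process per node)."""
    rng = random.Random(seed)
    n = len(traffic)
    assign = list(range(n))                               # process i starts on node i
    rng.shuffle(assign)
    best = assign[:]
    cost = best_cost = mapping_cost(assign, traffic, dist)
    temp = t0
    for _ in range(iters):
        i, j = rng.sample(range(n), 2)
        assign[i], assign[j] = assign[j], assign[i]       # propose a swap
        new_cost = mapping_cost(assign, traffic, dist)
        if new_cost < cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                               # accept the move
            if cost < best_cost:
                best, best_cost = assign[:], cost
        else:
            assign[i], assign[j] = assign[j], assign[i]   # revert
        temp *= cooling
    return best, best_cost

if __name__ == "__main__":
    # Toy example: 4 processes with ring-like traffic, 4 nodes on a line.
    traffic = [[0, 5, 0, 1], [5, 0, 5, 0], [0, 5, 0, 5], [1, 0, 5, 0]]
    dist = [[abs(a - b) for b in range(4)] for a in range(4)]
    print(anneal_mapping(traffic, dist, iters=5000))
```

A composite variant in the sense described above would seed a genetic algorithm's initial population with several independently annealed mappings.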

3. High-Throughput Execution and Advanced Broker Architectures

The integration of next-generation executor (NGE) architectures marks a significant advancement. The NGE is a pilot-based runtime system that provides dynamic, multi-generation scheduling and decouples resource acquisition from execution. Unlike static broker models, which submit fixed job batches that cannot be modified after scheduling, the NGE allows jobs of varied sizes and heterogeneity to be dynamically packed into available backfill slots (Oleynik et al., 2017).

The architecture includes:

  • PilotManager/UnitManager: Oversee resource and workload descriptions.
  • Pilot Agent/Executors: Submitted via SAGA to batch systems (e.g., PBS), with OpenMPI/ORTE supporting distributed task orchestration.

This pilot-based model ensures linear scalability and predictable overhead when scaling from hundreds to thousands of nodes, as demonstrated in weak and strong scaling experiments. It further supports execution modes adapting in real time to fluctuations in resource availability and can be extended to other large experimental systems, including SKA and LSST (Oleynik et al., 2017).
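A minimal sketch of the pilot pattern described above, with hypothetical names and a thread-based stand-in for real batch submission: each pilot occupies an acquired slot for a fixed walltime and repeatedly pulls work units from a shared queue, so that resource acquisition (pilot submission) remains decoupled from task execution (the units).

```python
import queue
import threading
import time

def pilot_agent(pilot_id: str, walltime_s: float, units: "queue.Queue[str]") -> None:
    """Runs inside an acquired backfill slot: pull and execute units until walltime ends."""
    deadline = time.monotonic() + walltime_s
    while time.monotonic() < deadline:
        try:
            unit = units.get(timeout=0.1)      # pull the next work unit, if any
        except queue.Empty:
            continue
        time.sleep(0.05)                       # placeholder for real task execution
        print(f"{pilot_id} executed {unit}")
        units.task_done()

if __name__ == "__main__":
    work = queue.Queue()
    for i in range(20):
        work.put(f"unit-{i:02d}")              # UnitManager role: describe and enqueue workload
    pilots = [threading.Thread(target=pilot_agent, args=(f"pilot-{k}", 1.0, work))
              for k in range(3)]               # PilotManager role: acquire three slots
    for p in pilots:
        p.start()
    for p in pilots:
        p.join()
    print("units left unexecuted:", work.qsize())
```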

4. Containerization and Utilization of Idle Resources

To maximize resource utilization, Atlas introduces mechanisms to run additional workloads in a low-priority, containerized fashion that does not interfere with primary scheduled jobs. This is achieved by maintaining a secondary queue of containerized, non-parallel tasks submitted to backfill idle nodes. A master program coordinates job submission and checkpointing, while node-local managers handle job execution and synchronized checkpoint/release on expiration of allocation (Dubenskaya et al., 2019).

This results in effective utilization metrics (u = l − l₍ₐᵤₓ₎, with l as average load and l₍ₐᵤₓ₎ as container management overhead), approaching 99.5–99.6% in simulations. The trade-off factor F quantifies the balance between increased utilization and potential interference with main jobs. Experiments on similar systems demonstrate substantial reductions in idle node counts—by factors of 3–4.9—highlighting the utility of container-based scavenger workloads for Atlas (Dubenskaya et al., 2019).
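The utilization metric translates into a small calculation; the load samples and overhead figures below are invented for illustration only.

```python
def effective_utilization(load_samples, aux_overhead):
    """u = l - l_aux: mean measured load minus container-management overhead."""
    l = sum(load_samples) / len(load_samples)
    return l - aux_overhead

def idle_reduction_factor(idle_without, idle_with):
    """How many times fewer nodes sit idle once scavenger containers are enabled."""
    return idle_without / idle_with

if __name__ == "__main__":
    # Illustrative per-hour fractional machine load with scavenger jobs running.
    samples = [0.998, 0.997, 0.999, 0.996, 0.998]
    u = effective_utilization(samples, aux_overhead=0.002)
    print(f"effective utilization u = {u:.3f}")
    print("idle reduction factor:", idle_reduction_factor(idle_without=50, idle_with=12))
```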

5. Malleable Deep Neural Network Training on Idle Nodes

Atlas supports advanced training workflows for deep neural networks (DNNs) on transient idle resources through malleable scheduling frameworks such as BFTrainer and MalleTrain (Liu et al., 2021, Ma et al., 24 Apr 2024). These exploit the flexibility of DNN training to scale tasks over fluctuating node availabilities, maximizing resource recovery without impacting scheduled workloads.

The resource assignment problem is formalized as a mixed-integer linear programming (MILP) optimization (Liu et al., 2021), with allocation variables x₍ᵢⱼ₎ determining job-node assignment constrained by node exclusivity and job-specific resource bounds. The system can optimize different administrator or user metrics (e.g., throughput, scaling efficiency) via piecewise linear approximations and SOS2 constraints.
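A hedged sketch of such an allocation MILP, written with the open-source PuLP modeler for illustration: binary variables x[i, j] assign idle node j to job i, each node hosts at most one job, and a job that runs receives a node count within its bounds. The job names, bounds, and linear-throughput objective are assumptions; the published formulation additionally handles arbitrary administrator or user metrics through piecewise linear approximations and SOS2 constraints, which are omitted here.

```python
# pip install pulp
import pulp

jobs = {                                     # hypothetical malleable DNN training jobs
    "jobA": {"min": 2, "max": 6, "tput_per_node": 1.0},
    "jobB": {"min": 1, "max": 4, "tput_per_node": 0.7},
    "jobC": {"min": 4, "max": 8, "tput_per_node": 1.3},
}
nodes = [f"n{k}" for k in range(8)]          # currently idle nodes

prob = pulp.LpProblem("malleable_allocation", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", [(i, j) for i in jobs for j in nodes], cat="Binary")
y = pulp.LpVariable.dicts("run", list(jobs), cat="Binary")   # does job i run at all?

# Objective: assumed linear scaling, i.e. throughput proportional to assigned nodes.
prob += pulp.lpSum(jobs[i]["tput_per_node"] * x[(i, j)] for i in jobs for j in nodes)

# Node exclusivity: each idle node hosts at most one job.
for j in nodes:
    prob += pulp.lpSum(x[(i, j)] for i in jobs) <= 1

# Job-specific bounds: a running job gets between its min and max node count.
for i in jobs:
    assigned = pulp.lpSum(x[(i, j)] for j in nodes)
    prob += assigned >= jobs[i]["min"] * y[i]
    prob += assigned <= jobs[i]["max"] * y[i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for i in jobs:
    chosen = [j for j in nodes if x[(i, j)].value() >= 0.5]   # treat as selected
    print(i, "->", chosen)
```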

MalleTrain further generalizes this by introducing an online Job Profiling Advisor (JPA) that autonomously profiles job scalability at runtime, feeding empirical performance data into the MILP allocator for dynamic, real-time optimization—eliminating the need for user-supplied models and improving training throughput by up to 22.3% over previous approaches (Ma et al., 24 Apr 2024). This approach is effective for neural architecture search and hyperparameter optimization tasks where scaling behavior is not known a priori.
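The profiling step behind such an advisor can be sketched as follows, with hypothetical names: timing a short burst of training iterations at each candidate node count yields an empirical throughput curve that can replace a user-supplied scaling model as input to the allocator.

```python
import time
from typing import Callable, Dict

def profile_scaling(run_iters: Callable[[int, int], None],
                    node_counts, iters: int = 10) -> Dict[int, float]:
    """Measure empirical throughput (iterations/second) at each candidate node count."""
    curve = {}
    for n in node_counts:
        start = time.perf_counter()
        run_iters(n, iters)                    # run a short training burst on n nodes
        elapsed = time.perf_counter() - start
        curve[n] = iters / elapsed
    return curve

if __name__ == "__main__":
    # Stand-in for a real distributed training step: per-iteration time shrinks
    # sub-linearly with node count, mimicking imperfect scaling efficiency.
    def fake_training(nodes: int, iters: int) -> None:
        time.sleep(iters * 0.01 / (nodes ** 0.8))

    curve = profile_scaling(fake_training, node_counts=[1, 2, 4, 8])
    for n, tput in curve.items():
        print(f"{n} nodes -> {tput:.1f} it/s")
```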

6. Real-Time Monitoring and Visualization

Efficient operation of Atlas requires visibility into system resource usage, failure states, and performance bottlenecks. 3D real-time monitoring frameworks integrate large-scale data acquisition (on the order of tens of millions of data points per day) and visualization through gaming engines such as Unity 3D (Bergeron et al., 2021).

Key technical elements include time-series aggregation, analytics platforms (Apache Accumulo, D4M), and mathematical models for alert generation (non-negative derivatives, normalized metrics):

$$\Delta f(t) = \max\left(0,\ \frac{f(t+\Delta t) - f(t)}{\Delta t}\right)$$

$$\widehat{m} = \frac{m - m_{\text{min}}}{m_{\text{max}} - m_{\text{min}}}$$
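These expressions translate directly into code; the temperature trace and alert threshold below are assumed examples, not values from the cited work.

```python
def nonneg_derivative(f_t: float, f_next: float, dt: float) -> float:
    """Delta f(t) = max(0, (f(t+dt) - f(t)) / dt): react only to increases."""
    return max(0.0, (f_next - f_t) / dt)

def normalize(m: float, m_min: float, m_max: float) -> float:
    """Min-max normalization of a metric onto [0, 1]."""
    return (m - m_min) / (m_max - m_min)

if __name__ == "__main__":
    # Example: node temperature samples at a 60-second interval; alert when the
    # normalized positive rate of change exceeds an assumed threshold of 0.8.
    temps = [55.0, 56.0, 61.0, 60.0, 72.0]
    rates = [nonneg_derivative(temps[i], temps[i + 1], 60.0)
             for i in range(len(temps) - 1)]
    scores = [normalize(r, min(rates), max(rates)) for r in rates]
    alerts = [i for i, s in enumerate(scores) if s > 0.8]
    print("rates:", [f"{r:.3f}" for r in rates])
    print("alert intervals:", alerts)
```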

Visualization overlays critical metrics and event-driven alerts directly onto deconstructed hardware representations, facilitating rapid diagnosis, informed scheduling, and operator training.

7. Network Topology Optimization

The scalability and resilience of Atlas’s node interconnect are informed by mathematical graph construction grounded in symplectic Lie algebra (Ramos et al., 2017). Network nodes are represented as weight vectors (lattice positions), with adjacencies determined by root vectors (±2eᵢ and ±eᵢ ± eⱼ), yielding topologies with lower diameters and mean path lengths than traditional mesh and hypercubic architectures:

  • Diameter: For the symplectic topology, Lₛ = M·n, compressing distance for large n.
  • Density: ρ = ν/Lⁿ, indicating higher connectedness relative to standard topologies.

For Atlas, increased node connectivity (maximum 2n² for symplectic graphs) and reduced average path length translate into lower communication latency, higher bandwidth, improved fault tolerance, and more efficient scaling.
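A toy construction in this spirit, under stated assumptions: nodes are placed on an n-dimensional integer lattice of side L, and two nodes are linked when their coordinate difference, taken modulo L for simplicity, equals a root vector ±2eᵢ or ±eᵢ ± eⱼ; breadth-first search then yields the diameter and mean path length.

```python
from collections import deque
from itertools import product

def root_vectors(n):
    """Symplectic-style roots: ±2e_i and ±e_i ± e_j for i < j."""
    roots = []
    for i in range(n):
        for s in (2, -2):
            roots.append(tuple(s if k == i else 0 for k in range(n)))
        for j in range(i + 1, n):
            for si in (1, -1):
                for sj in (1, -1):
                    roots.append(tuple(si if k == i else sj if k == j else 0
                                       for k in range(n)))
    return roots

def build_graph(n, L):
    """Nodes on an n-dimensional lattice of side L, edges along root vectors (mod L)."""
    nodes = list(product(range(L), repeat=n))
    roots = root_vectors(n)
    adj = {v: [tuple((v[k] + r[k]) % L for k in range(n)) for r in roots] for v in nodes}
    return nodes, adj

def bfs_distances(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    dq = deque([src])
    while dq:
        u = dq.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                dq.append(w)
    return dist

if __name__ == "__main__":
    nodes, adj = build_graph(n=2, L=5)         # 25-node toy example
    all_d = [d for v in nodes for u, d in bfs_distances(adj, v).items() if u != v]
    print("diameter:", max(all_d))
    print("mean path length:", sum(all_d) / len(all_d))
```

In this toy example with n = 2, each node ends up with 2n² = 8 distinct neighbors, matching the maximum connectivity noted above.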

8. Future Directions and Resource Portfolio

Ongoing evolution of Atlas involves integration with heterogeneous compute architectures (GPUs, accelerators), cloud elasticity, and advanced benchmarking schemes. Integration models include edge services (Harvester) for workload management, containerization (CVMFS, Kubernetes), and hybrid cloud deployments with elastic scaling and on-demand resource acquisition (Megino et al., 2023).

Challenges involve adapting grid workflows to HPC batch models, mitigating file system contention, authenticating multi-institutional jobs, and minimizing cloud egress costs. Resource optimization strategies include event-level multithreading, dynamic node packing, and utilization of spot pricing models.

Comprehensive cost analyses and operational recommendations emphasize the importance of R&D into GPU/accelerator adaptation, standardized APIs, improved authentication, and advanced accounting mechanisms (Megino et al., 2023). Early engagement with facility designers and continuous benchmarking facilitate inclusion of experimental requirements in future supercomputing deployments.


The Atlas Supercomputer thus brings together high-throughput and high-performance paradigms, operational flexibility via advanced scheduling and containerization, malleable DNN training, and mathematically optimized network design, each grounded in published methodologies and empirical evidence from flagship scientific collaborations and computational infrastructure research.
