Cronos Cluster: ARM-based HPC Platform
- Cronos Cluster is a low-cost, ARM-based computing platform built with Raspberry Pi devices for educational and experimental HPC benchmarking.
- The system leverages Slurm, Open MPI, and HPL benchmarks to assess performance and energy efficiency through both inter-node and intra-node parallelism.
- Research indicates that homogeneous deployments using Raspberry Pi 4 deliver superior stability and efficiency compared to mixed configurations with Raspberry Pi 3B.
The Cronos cluster is a low-cost computing platform constructed from Raspberry Pi 4 and Raspberry Pi 3B microcomputers, specifically architected for educational and experimental research environments. Its principal focus is on the evaluation of computational performance and energy efficiency using standardized high-performance computing (HPC) benchmarks. The system typifies a class of ARM-based clusters, demonstrating feasible approaches for distributed computation, resource management, and quantitative assessment of power-performance characteristics in compact, economically accessible settings (Semken et al., 8 Dec 2025).
1. Architecture and Hardware Composition
Cronos incorporates eight single-board computers organized within a custom chassis and powered by a unified 5 V switching PSU (8 A total capacity). The node breakdown is as follows:
- Raspberry Pi 4 (6 nodes):
- CPU: Broadcom BCM2711, ARM Cortex-A72 (quad-core, 1.5 GHz)
- Memory: 4 GB LPDDR4-3200
- Network: Gigabit Ethernet (dedicated controller)
- Wireless: 802.11ac Wi-Fi, Bluetooth 5.0
- Storage: microSD (class 10), OS image
- Raspberry Pi 3B (2 nodes):
- CPU: Broadcom BCM2837, ARM Cortex-A53 (quad-core, 1.2 GHz)
- Memory: 1 GB LPDDR2
- Network: 10/100 Ethernet
- Wireless: 802.11n Wi-Fi, Bluetooth 4.1
- Storage: microSD (class 10)
The cluster uses a central unmanaged Gigabit switch in a star network topology. Node home directories and HPL benchmark data are exported by a dedicated NFS server. Power consumption is monitored with a clamp meter at the 5 V bus; current samples are acquired every 5 min during typical 35 min runs, and energy totals are integrated using the trapezoidal rule.
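The trapezoidal integration of the sampled power readings can be sketched as follows (a minimal illustration, assuming the constant 5 V bus and 5-minute sampling interval described above; the current readings themselves are hypothetical, not measured values from the paper):

```python
# Integrate sampled power readings into total energy (Wh) via the
# trapezoidal rule, mirroring the measurement procedure described above.
# Assumptions: constant 5 V bus voltage, one sample every 5 minutes.

V_BUS = 5.0        # bus voltage in volts
SAMPLE_MIN = 5.0   # minutes between clamp-meter readings

def energy_wh(current_samples_a):
    """Trapezoidal integration of P(t) = V * I(t) over the sample series."""
    powers = [V_BUS * i for i in current_samples_a]   # instantaneous watts
    dt_h = SAMPLE_MIN / 60.0                          # hours per interval
    return sum((powers[k] + powers[k + 1]) / 2.0 * dt_h
               for k in range(len(powers) - 1))

# Hypothetical 35-minute run: eight readings, 5 minutes apart.
currents = [2.0, 2.4, 2.5, 2.5, 2.6, 2.5, 2.4, 2.1]
print(energy_wh(currents))
```

For a constant draw the trapezoidal rule is exact, so a flat 2 A trace over 35 min yields 10 W × (35/60) h ≈ 5.83 Wh, which is a quick sanity check on the integration.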
2. Software Stack and Resource Management
The Cronos platform runs Raspberry Pi OS Lite (Debian Buster-derived) with Linux kernel 5.x. Cluster management and job scheduling are provided by Slurm (v20.x–21.x), utilizing Munge for intra-cluster authentication. Slurm is configured with the following parameters:
- `ProctrackType=proctrack/pgid`
- `SelectType=select/cons_res`
- `DefMemPerCPU=1000M`
- Distinct node definitions for `rpi4[1-6]` and `rpi3[7-8]`
Parallel job execution leverages Open MPI (v4.1.x, compiled with GCC 9.x and using the tcp BTL for transport). Computational kernels use OpenBLAS (v0.3.x) for optimized BLAS routines, with High Performance Linpack (HPL 2.3) as the principal benchmark workload. HPL input files (HPL.dat) are tuned for the ARM nodes. System state (CPU load and thermal readings) is tracked via Ganglia; power metrics are logged with custom scripts.
3. Benchmarking Procedures
Benchmarking is performed using the HPL benchmark in double precision arithmetic. Key parameters include:
- Matrix dimension (N): 2,800–3,000
- Block size (NB): 224
- Process grid (P×Q): 3×2 (homogeneous), 8×1 (heterogeneous case)
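The parameters above map onto the HPL input file as follows (an illustrative fragment following the standard HPL 2.3 `HPL.dat` layout; header lines and the remaining tuning fields are omitted):

```
1            # of problems sizes (N)
2872         Ns
1            # of NBs
224          NBs
1            # of process grids (P x Q)
3            Ps
2            Qs
```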
Two canonical Slurm job scripts are defined:
- Case A (inter-node parallelism): `--ntasks-per-node=1`, giving 6 MPI ranks across 6 nodes
- Case B (intra-node parallelism): `--ntasks-per-node=4`, yielding 24 ranks on 6 nodes
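A minimal batch script for Case A might look as follows (a sketch, not the authors' script; the job name and binary path are hypothetical, and `srun` assumes Slurm's MPI integration is configured — otherwise `mpirun` would be used inside the allocation):

```
#!/bin/bash
#SBATCH --job-name=hpl_case_a
#SBATCH --nodes=6
#SBATCH --ntasks-per-node=1    # Case A: one MPI rank per node (6 ranks total)

# Case B differs only in --ntasks-per-node=4 (24 ranks, one per core).
srun ./xhpl
```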
Each configuration is executed three times, and results are summarized using the mean (x̄) and sample standard deviation (s).
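These summary statistics can be reproduced with Python's `statistics` module (shown here with the three homogeneous-run GFLOPS values from the tables in the next section; small differences from the rounded aggregates in the source are expected):

```python
# Mean and sample standard deviation over the three benchmark repetitions,
# as used to summarize each HPL configuration.
from statistics import mean, stdev  # stdev = sample standard deviation

gflops_runs = [6.2667, 6.2299, 6.1096]  # three homogeneous 6-rank runs
print(f"{mean(gflops_runs):.2f} \u00b1 {stdev(gflops_runs):.2f}")
```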
4. Performance Analysis
Homogeneous Configuration (6×RPi 4):
| Configuration | N | NB | P×Q | Time [s] | GFLOPS |
|---|---|---|---|---|---|
| 1 task/node (6 MPI ranks), run 1 | 2872 | 224 | 3×2 | 2872.6 | 6.2667 |
| run 2 | 2889 | 224 | 3×2 | 2889.5 | 6.2299 |
| run 3 | 2946 | 224 | 3×2 | 2946.4 | 6.1096 |
| Mean ± SD | — | — | — | 2910 ± 30.3 | 6.19 ± 0.07 |
| Configuration | N | NB | P×Q | Time [s] | GFLOPS |
|---|---|---|---|---|---|
| 4 tasks/node (24 ranks) | — | 224 | 3×2 | 1243 ± 22.5 | 14.48 ± 0.26 |
Heterogeneous Configuration (6 RPi 4 + 2 RPi 3B):
| Configuration | N | NB | P×Q | Time [s] | GFLOPS | Comments |
|---|---|---|---|---|---|---|
| 3×2 grid | 2933 | 224 | 3×2 | 2932.8 | 6.1379 | ≈0.4% gain vs. 6 nodes |
| 8×1 grid | 6873 | 224 | 8×1 | 6873.3 | 2.619 | Poor stability |
Scaling trend is approximately linear up to 6 nodes. Addition of RPi 3B nodes yields negligible performance improvement (~0.4%) and degrades system stability.
5. Power Consumption and Efficiency Metrics
Power is computed as P(t) = V · I(t) with V = 5 V, and total energy is integrated over the runtime T using the trapezoidal rule:

E = ∫₀ᵀ P(t) dt ≈ Σₖ ((Pₖ + Pₖ₊₁)/2) · Δt

Performance in GFLOPS is calculated from the standard HPL operation count:

GFLOPS = ((2/3)·N³ + 2·N²) / (T · 10⁹)

Efficiency (η) is given by the ratio of sustained performance to consumed energy:

η = GFLOPS / E  (GFLOPS per Wh)
Example from the best 6-node run (4 tasks/node): runtime T ≈ 1243 s at ≈ 14.48 GFLOPS. The paper reports an efficiency metric based on the direct ratio of GFLOPS to Wh (Semken et al., 8 Dec 2025).
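The two metrics above can be sketched in a few lines (a minimal illustration of the formulas; the energy value passed to `efficiency` below is a placeholder, not the paper's measurement):

```python
# HPL performance and energy-efficiency metrics as defined above.

def hpl_gflops(n, time_s):
    """Standard HPL operation count, (2/3)*N^3 + 2*N^2, over runtime."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / (time_s * 1e9)

def efficiency(gflops, energy_wh):
    """Efficiency metric: sustained GFLOPS per watt-hour consumed."""
    return gflops / energy_wh

print(hpl_gflops(1000, 1.0))        # small sanity-check case
print(efficiency(14.48, 10.0))      # placeholder 10 Wh energy figure
```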
6. Scalability, Stability, and Bottleneck Analysis
Homogeneous configurations show nearly ideal scaling and minimal communication overhead up to 6 RPi 4 nodes. The inclusion of heterogeneous nodes (RPi 3B):
- Induces NTP desynchronization, leading to MPI/Slurm job errors ("Zero Bytes…").
- Causes instability; RPi 3B nodes occasionally hang, requiring manual intervention (`scontrol` restart).
- Results in <0.5% performance gain and compromised overall stability.
Principal bottlenecks are underpowered CPUs in RPi 3B, network contention from the single-switch topology, and scheduling complexity. Hybrid computations combining MPI and OpenMP (for instance, distributed PI-calculation workloads) yield improved efficiency when intra-node cores are fully engaged.
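The PI-calculation workload mentioned above distributes a numerical integration across ranks and cores. A serial Python sketch of the underlying kernel (the hybrid MPI/OpenMP variants simply split the loop range across ranks and threads):

```python
# Midpoint-rule estimate of pi = integral of 4/(1+x^2) on [0, 1] -- the
# classic kernel used in MPI/OpenMP teaching workloads. In the hybrid
# version, each MPI rank takes a slice of the loop and intra-node threads
# split it further; here it runs serially for clarity.

def estimate_pi(n_steps=1_000_000):
    h = 1.0 / n_steps
    total = sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(n_steps))
    return h * total

print(estimate_pi())
```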
7. Design Recommendations and Future Prospects
Empirical evidence favors a homogeneous deployment of Raspberry Pi 4 nodes to enhance stability and streamline resource management. Intra-node parallelism (`--ntasks-per-node=4`) should be exploited for ≥2× performance gains. Optimal HPL execution occurs with a 3×2 process grid and block size NB=224 on 6 nodes. Maintaining robust NTP synchronization and power reliability is essential.
Suggested improvements include:
- Upgrading remaining nodes to RPi 4 with uniform memory provisioning (4 GB RAM)
- Adding a secondary NIC or L2-capable switch to mitigate network jitter
- Automating power logging with finer granularity
- Conducting comparative studies with small-scale PC clusters for normalized cost, energy, and performance metrics
These recommendations reflect systematic findings on the performance and energy optimization of ARM-based educational clusters (Semken et al., 8 Dec 2025). Plausibly, broader adoption of such configurations may support pedagogical and applied research goals where scalability, economy, and efficiency are prioritized.