Apache SystemML: Scalable ML & DL

Updated 9 May 2026

SystemML is a unified machine learning system featuring a declarative DML front-end, cost-based optimizer, and operator fusion to scale ML and DL algorithms.
It transforms high-level DML scripts into optimized execution plans using static and dynamic rewrites, efficiently leveraging hardware-specific libraries and distributed frameworks.
SystemML integrates with Hadoop and Spark, and the cited studies report near-linear scalability and low compilation overhead across single-node and distributed environments.

Apache SystemML is a unified, declarative machine learning system designed to support efficient and scalable automatic compilation and execution of ML and deep learning (DL) algorithms. SystemML targets diverse environments ranging from single-node CPUs and GPUs to large shared Hadoop and Spark clusters, bridging the gap between big-data processing frameworks and specialized ML/DL stacks. Its architecture is characterized by a high-level language front-end, a cost-based optimizing compiler, advanced operator fusion and algebraic rewrite mechanisms—including SPORES for relational equality saturation—and a flexible runtime supporting both single-node and distributed computation (Pansare et al., 2018, Boehm et al., 2018, Boehm, 2015, Wang et al., 2020, Chowdhury et al., 2021).

1. Architecture and Language Front-End

SystemML is structured into three primary layers: the DML language front-end, the optimizing compiler stack, and the runtime execution layer. The Declarative Machine Learning language (DML) is a declarative, R-like language for expressing ML and DL algorithms in matrix- and tensor-centric terms. DML scripts let users define control flow (for, while, parfor), use a library of built-in mathematical and neural network operators (such as conv2d, pooling, softmax), and either express forward/backward passes for custom layers or call pre-implemented routines for over 20 supported neural network layers and 6 optimizers (SGD, Adam, RMSProp, etc.).

The front-end also supports interoperability with Python-based workflows: wrappers like Keras2DML and Caffe2DML ingest Keras or Caffe models and produce equivalent DML scripts, enabling integration with Spark ML Pipelines and transfer learning scenarios (Pansare et al., 2018).

2. Compilation Workflow and Optimization

SystemML's compiler parses input DML scripts into Abstract Syntax Trees, which are then converted to High-level Operator DAGs ("HOPs"). The compilation pipeline comprises static rewrites for algebraic simplification, operator fusion, and basic cost estimation, followed by dynamic rewrites driven by matrix shape, sparsity, available memory, and cluster configuration. The logical plan is then lowered to Low-level Operator Plans ("LOPs") that are mapped to either optimized in-JVM kernels (with BLAS/MKL or custom CUDA code) or distributed operators using Spark RDDs or classic MapReduce (Pansare et al., 2018, Boehm, 2015).

Automatic static rewrites include standard algebraic identities and operator fusion (e.g., combining chains of element-wise operations), as well as code generation candidates for specialized patterns like im2col in convolution. Dynamic rewrites evaluate the viability of operator fusions, select dense/sparse physical kernels, and determine whether execution should remain local or be distributed.
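To make the idea of a static algebraic rewrite concrete, the following is a minimal Python sketch over a toy expression tree. The tuple encoding, the `simplify` function, and the four rules are assumptions chosen for illustration; they are not SystemML's actual HOP representation or rewrite catalog.

```python
# Toy expression tree: tuples of (op, *args); leaves are variable names or numbers.
# Hypothetical encoding for illustration only, not SystemML's HOP DAG.

def simplify(expr):
    """Apply a few algebraic identities in a single bottom-up pass."""
    if not isinstance(expr, tuple):
        return expr  # leaf: variable name or constant
    op, *args = expr
    args = [simplify(a) for a in args]  # simplify children first

    # t(t(X)) -> X
    if op == "t" and isinstance(args[0], tuple) and args[0][0] == "t":
        return args[0][1]
    # X * 1 -> X and 1 * X -> X
    if op == "*":
        if args[1] == 1:
            return args[0]
        if args[0] == 1:
            return args[1]
    # X + 0 -> X and 0 + X -> X
    if op == "+":
        if args[1] == 0:
            return args[0]
        if args[0] == 0:
            return args[1]
    # sum(t(X)) -> sum(X): transposition does not change the sum of all cells
    if op == "sum" and isinstance(args[0], tuple) and args[0][0] == "t":
        return ("sum", args[0][1])
    return (op, *args)


expr = ("sum", ("*", ("t", ("t", "X")), 1))
simplified = simplify(expr)  # ("sum", "X")
```

Because children are simplified before their parent, a rule that fires at an inner node can expose a further rewrite at the enclosing node within the same pass.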

The runtime selection and plan generation are guided by a cost model, analytically estimating I/O, computation, and control flow costs (including loops and branches). The model is used iteratively across optimization phases to select among alternative plans, preferring the one with minimal expected end-to-end execution time (Boehm, 2015).
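As a rough illustration of cost-based plan selection, the sketch below estimates the cost of a matrix-vector multiplication under a local and a distributed plan and picks the cheaper feasible one. Every name (`MatrixStats`, `estimate_bytes`, `choose_plan`) and every constant is a hypothetical placeholder; the real cost model described by Boehm (2015) is considerably more detailed.

```python
from dataclasses import dataclass

# All constants are made-up placeholders for illustration only.
LOCAL_READ_BW = 1e9      # bytes/s read on a single node
CLUSTER_READ_BW = 1e10   # aggregate bytes/s across the cluster
LOCAL_FLOPS = 2e9        # floating-point ops/s on one node
CLUSTER_FLOPS = 4e10     # aggregate floating-point ops/s
JOB_LATENCY_S = 5.0      # fixed overhead of launching a distributed job


@dataclass
class MatrixStats:
    rows: int
    cols: int
    nnz: int  # number of nonzero cells


def estimate_bytes(m):
    """Estimated in-memory size: the smaller of a dense and a CSR-like layout."""
    dense = 8 * m.rows * m.cols
    sparse = 12 * m.nnz + 4 * (m.rows + 1)  # 8B value + 4B column index per nonzero
    return min(dense, sparse)


def choose_plan(m, mem_budget_bytes):
    """Return (best_plan, costs) for X %*% v, with costs in estimated seconds."""
    size = estimate_bytes(m)
    flops = 2.0 * m.nnz  # one multiply and one add per nonzero
    costs = {}
    if size <= mem_budget_bytes:  # the local plan is only valid if X fits in memory
        costs["local"] = size / LOCAL_READ_BW + flops / LOCAL_FLOPS
    costs["distributed"] = (
        JOB_LATENCY_S + size / CLUSTER_READ_BW + flops / CLUSTER_FLOPS
    )
    best = min(costs, key=costs.get)
    return best, costs


small = MatrixStats(rows=1_000, cols=1_000, nnz=1_000_000)
big = MatrixStats(rows=10_000_000, cols=1_000, nnz=10_000_000_000)
small_plan, _ = choose_plan(small, mem_budget_bytes=1e9)   # local
big_plan, _ = choose_plan(big, mem_budget_bytes=1e10)      # distributed
```

The sketch separates feasibility (does the input fit in the memory budget?) from ranking (which feasible plan has the lowest estimated time?), which mirrors the role the text assigns to the cost model.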

3. Operator Fusion and Execution Planning

SystemML employs a dedicated operator-fusion framework to maximize performance and minimize overhead from intermediate materializations. The fusion optimizer operates in three phases (Boehm et al., 2018):

OFMC Candidate Exploration: Systematically enumerates valid "partial fusion plans" per operator node in the HOP DAG using the Open-Fuse-Merge-Close abstraction. This drives a controlled search through the combinatorial space of fusion templates—Row, Cell, Multi-Agg, Outer—each encoding different access and aggregation patterns for dense, sparse, or compressed data.
Cost-Based Plan Selection: The optimizer uses a detailed cost model for evaluating fusion plans, incorporating device-level bandwidths, computational throughput, and sparsity-aware scaling via nonzero statistics.
Code Generation: Selected plans are lowered to backend-neutral CPlans. Runtime code generation is performed via in-JVM dynamic compilation (Janino), resulting in specialized fused operator classes with minimal overhead (typically <1 ms per operator).
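The benefit of fusion can be seen in a small Python sketch that computes sum(X * Y * Z), with * as element-wise multiplication, in two ways. The function names are hypothetical and the code only mimics what a generated cell-wise fused operator might do; it is not the code SystemML emits.

```python
def unfused_sum_product(X, Y, Z):
    """Operator-at-a-time evaluation: two full intermediates are materialized."""
    tmp1 = [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    tmp2 = [[t * z for t, z in zip(rt, rz)] for rt, rz in zip(tmp1, Z)]
    return sum(sum(row) for row in tmp2)


def fused_sum_product(X, Y, Z):
    """Fused evaluation: one pass over the inputs and a single scalar accumulator."""
    acc = 0.0
    for rx, ry, rz in zip(X, Y, Z):
        for x, y, z in zip(rx, ry, rz):
            acc += x * y * z
    return acc


X = [[1, 2], [3, 4]]
Y = [[1, 0], [2, 1]]
Z = [[2, 2], [1, 1]]
assert unfused_sum_product(X, Y, Z) == fused_sum_product(X, Y, Z)
```

Both functions return the same value, but the fused version never allocates the two intermediate matrices and reads each input exactly once.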

Key design goals are the end-to-end elimination of redundant intermediates, fewer scans over large inputs, and sparsity-aware, hardware-efficient computation patterns. In the reported experiments, optimized fusion plans achieve end-to-end speedups of up to 21x relative to unfused or hand-coded fused baselines, in both single-node and distributed environments (Boehm et al., 2018).

4. Algebraic Rewriting via Relational Equality Saturation

SystemML integrates the SPORES optimization framework, introducing relational algebra (RA)–based equality saturation to achieve complete and cost-guided expression rewrites for linear algebra workloads (Wang et al., 2020). The workflow consists of:

LA→RA Translation: Sub-DAGs of the logical plan representing linear algebra expressions are systematically translated into RA expressions using precise rewrite rules.
E-Graph Construction and Saturation: An E-Graph records all equivalent RA expressions by applying a complete set of RA equivalence identities (e.g., distributivity, associativity, commutativity of sums and products) to maximize the exploration of equivalent plans, exploiting common subexpressions natively.
Cost-Guided Extraction: A cost estimator tracks estimated nonzero counts and computational complexity, supporting greedy or ILP-based extraction of the lowest-cost variant.
RA→LA Re-translation: The selected optimized RA plan is converted back to an LA expression (potentially with fused operators).
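The following Python sketch illustrates the kind of sum-product identity that such a rewrite can exploit. It is not the SPORES implementation: it hard-codes a single identity, $\sum_{ij}(X_{ij} - u_i v_j)^2 = \sum_{ij} X_{ij}^2 - 2\,u^\top X v + (u^\top u)(v^\top v)$, and the function names and the dictionary-based sparse format are assumptions made for this example.

```python
def loss_naive(X, u, v):
    """Evaluate sum((X - u v^T)^2) cell by cell, touching all rows x cols cells."""
    total = 0.0
    for i in range(len(u)):
        for j in range(len(v)):
            diff = X.get((i, j), 0.0) - u[i] * v[j]
            total += diff * diff
    return total


def loss_rewritten(X, u, v):
    """Equivalent form that only iterates over the nonzeros of X."""
    sum_x2 = sum(x * x for x in X.values())                  # sum(X^2)
    cross = sum(x * u[i] * v[j] for (i, j), x in X.items())  # u^T X v
    uu = sum(a * a for a in u)                               # u^T u
    vv = sum(b * b for b in v)                               # v^T v
    return sum_x2 - 2.0 * cross + uu * vv


# X is a 3 x 2 sparse matrix stored as {(row, col): value} with two nonzeros.
X = {(0, 1): 2.0, (2, 0): -1.0}
u = [1.0, 2.0, 3.0]
v = [0.5, -1.0]
assert abs(loss_naive(X, u, v) - loss_rewritten(X, u, v)) < 1e-9
```

The naive form costs time proportional to rows x cols, while the rewritten form costs time proportional to nnz(X) + rows + cols. A cost estimator that tracks nonzero counts can use this difference to prefer the rewritten variant when X is sparse.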

Benchmarks show that SPORES can match or improve upon all hand-coded optimizations for generalized linear models, support vector machines, multinomial logistic regression, Poisson NMF, and alternating least squares, with speedups ranging from 1.2x to 5x. Operator fusion and sparsity exploitation are unified under this framework, replacing over 84 ad-hoc hand-tuned rewrites (Wang et al., 2020).

5. Deep Learning and Optimizer Integration

SystemML offers a built-in neural network DSL within DML, supporting both MLPs and CNNs, with explicit support for custom layer definition and differentiable operators. Training support includes minibatch SGD and advanced optimizers such as Adam, RMSProp, and Adagrad (Pansare et al., 2018).

A notable demonstration of extensibility is the implementation of LARS (Layer-wise Adaptive Rate Scaling) within SystemML, written purely at the DML script layer. LARS computes a per-layer local learning rate $\eta_l = \eta \cdot \|W_l\| / (\|\nabla L(W_l)\| + \epsilon)$, which stabilizes distributed training with very large batches. Empirical results on CNNs for MNIST show that, beyond batch sizes of ~8,000, LARS markedly outperforms plain SGD in both train and test accuracy, with generalization error rising much more slowly under LARS. This implementation required no modification to SystemML’s optimizer or runtime engine and used only the existing tensor operators and control constructs of DML (Chowdhury et al., 2021).
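The local learning rate formula can be sketched in a few lines of Python. This is an illustrative re-implementation of the formula as stated above, not the DML script from Chowdhury et al.; the function names are hypothetical, and momentum and weight decay are deliberately left out.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a list-of-lists matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def lars_local_lr(weights, grads, global_lr, eps=1e-8):
    """eta_l = eta * ||W_l|| / (||grad L(W_l)|| + eps) for one layer."""
    return global_lr * frobenius_norm(weights) / (frobenius_norm(grads) + eps)


def lars_sgd_step(layers, layer_grads, global_lr, eps=1e-8):
    """One plain SGD update in which every layer uses its own LARS learning rate."""
    updated = []
    for W, G in zip(layers, layer_grads):
        lr = lars_local_lr(W, G, global_lr, eps)
        updated.append(
            [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
        )
    return updated


W = [[3.0, 4.0]]   # ||W|| = 5
G = [[0.6, 0.8]]   # ||G|| = 1
lr = lars_local_lr(W, G, global_lr=0.1)       # approximately 0.5
new_W = lars_sgd_step([W], [G], global_lr=0.1)[0]
```

Because the step size scales with the ratio of weight norm to gradient norm, a layer with small gradients relative to its weights takes a proportionally larger step, and a layer with large gradients takes a smaller one.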

6. Hardware and Distributed Execution Strategies

SystemML transparently adapts execution to the underlying hardware and cluster conditions. For single-node execution, the system leverages optimized dense BLAS (OpenBLAS, MKL), various sparse formats (COO, CSR, mCSR), blocking strategies for out-of-core data, and GPU acceleration (cuBLAS, cuDNN) with LRU memory management. For distributed execution, data is partitioned as row- or column-slice “blocks,” mapped to Spark RDDs, and evaluated via high-level transformations utilizing Spark/YARN for fault tolerance and scheduling.
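Of the sparse formats mentioned, CSR is the easiest to show in a few lines. The sketch below builds a CSR representation from a dense list-of-lists matrix and multiplies it by a vector; it is a generic textbook illustration with hypothetical function names, not SystemML's internal block implementation.

```python
def to_csr(dense):
    """Convert a dense list-of-lists matrix into (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))  # marks where the next row starts
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, v):
    """Compute A @ v, touching only the stored nonzeros of A."""
    out = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * v[col_idx[k]]
        out.append(acc)
    return out


dense = [[0, 2, 0],
         [1, 0, 3],
         [0, 0, 0]]
values, col_idx, row_ptr = to_csr(dense)
# values = [2, 1, 3], col_idx = [1, 0, 2], row_ptr = [0, 1, 3, 3]
result = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
# result = [4.0, 10.0, 0.0]
```

The work in `csr_matvec` is proportional to the number of nonzeros rather than to rows x cols, which is why the choice between dense and sparse kernels depends on sparsity statistics.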

Plan adaptation is automatic, with extensive use of cost estimation to decide when to switch between in-JVM, GPU, or distributed execution; to select among dense/sparse/blocked algorithms; and to optimize parfor constructs for parallel execution patterns matching the data, batch size, and cluster topology (Pansare et al., 2018).

7. Integration with Big Data Ecosystems and Impact

SystemML is built for seamless integration with Hadoop and Apache Spark, using Spark DataFrames or HDFS for data ingestion, and relying on YARN or Mesos for resource management, fault tolerance, and multi-tenancy. Python APIs expose scikit-learn–style interfaces, enabling incorporation into Spark MLlib pipelines for fully integrated ETL, hyperparameter tuning, model training, and scoring.

SystemML's approach unifies end-to-end ML and DL workflow specification and execution, eliminating the need to rewrite algorithms for each target environment (single-node, multi-core, GPU, or distributed cluster). The cited studies report near-linear scalability for distributed ML workloads and substantial performance gains from automatic fusion, algebraic optimization, and hardware-aware plan generation (Pansare et al., 2018, Boehm et al., 2018, Wang et al., 2020).

References:

(Pansare et al., 2018) Deep Learning with Apache SystemML
(Boehm et al., 2018) On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML
(Boehm, 2015) Costing Generated Runtime Execution Plans for Large-Scale Machine Learning Programs
(Wang et al., 2020) SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra
(Chowdhury et al., 2021) Evaluating Deep Learning in SystemML using Layer-wise Adaptive Rate Scaling (LARS) Optimizer