
Architecture & Data-Structure Aware LA (ADSALA)

Updated 21 January 2026
  • ADSALA is a paradigm that automates linear algebra routine generation by leveraging both hardware architecture features and operand structure properties.
  • It employs multi-layered DSLs and machine learning techniques to optimize code by tailoring SIMD, cache, and threading parameters for modern architectures.
  • Empirical evaluations demonstrate that ADSALA delivers significant speedups over traditional libraries, especially for small and irregular linear algebra subproblems.

Architecture and Data-Structure Aware Linear Algebra (ADSALA) defines a paradigm for the automated generation and optimization of linear algebra routines so that both the characteristics of target hardware architectures and the structural properties of operands influence all aspects of code production, transformation, and execution. The ADSALA principle pervades several contemporary research efforts, encompassing the synthesis of entire applications from high-level, structure-aware representations, the derivation of block- and register-level data flows informed by architectural parameters, and dynamic, machine-learning-based selection of runtime scheduling parameters for modern multi-core and multi-socket systems. Recent advances integrate domain-specific languages (DSLs), autotuning, and lightweight machine learning to deliver practical speedup over generic library baselines, even when treating the underlying kernels as black boxes.

1. Foundational Concepts and Historical Context

The ADSALA philosophy arises from deficiencies observed in traditional libraries such as BLAS and LAPACK: they deliver portable but suboptimal performance for modern, intricate hardware and for non-canonical problem scales or operand structures. Early efforts (e.g., SLinGen) recognized that relying on black-box, generic building blocks and standard loop-based formulations leads to performance stagnation, particularly for applications with small, fixed-size, structured linear algebra subproblems (Spampinato et al., 2018). Subsequent approaches generalized the insight, layering domain-specific abstraction and architectural introspection at each code generation, lowering, and scheduling stage (Spampinato et al., 2019).

Contemporary ADSALA systems incorporate explicit models of both (i) operand structure—such as symmetry, triangularity, and fixed or batched sizes—and (ii) architectural parameters—such as SIMD (Single Instruction Multiple Data) width, register file size, cache hierarchies, NUMA domains, and threading granularities. This awareness allows code generation tools and run-time libraries to match data access, blocking, vectorization, and parallelism decisions to both algorithmic and hardware characteristics.

2. Program Synthesis via Multi-Layer DSLs and Structure-Awareness

Modern ADSALA frameworks emphasize layered DSL hierarchies. A typical pipeline comprises:

  • Mathematical DSL (LA): Expresses intent at the level of whole-matrix operations with annotations for operand properties (e.g., $C := AB$, $Y := L^{-1}X$, symmetry, triangularity) without explicit loops or storage layout.
  • Partitioning DSL (p-LA): Introduces recursive block-decomposition (PME) to expose algorithmic parallelism while maintaining awareness of structural invariants (e.g., block-diagonal for triangular solves).
  • Loop-Based DSL (lp-LA): Converts partitionings into concrete loop nests over blocks, with explicit invariants and block traversal orders tailored to exploit both data structure and hardware constraints.
  • Implementation DSL (LL/C-IR): Encodes micro-tiling, register allocation, vector/memory intrinsics mappings, and low-level optimizations with parameters derived from hardware features (e.g., vector width $\nu$, block size $b$).

Each DSL layer is equipped with formal transformation rules and cost models. Partitioning heuristics avoid breaking matrix structure (e.g., maintaining square diagonal blocks for triangular or symmetric operands). Tiling sizes $b$ and micro-tile dimensions $\nu$ are chosen according to hardware cache sizes, vector length, and register capacity: $b \approx \sqrt{\mathrm{L1\_size}/(3 \cdot \mathrm{element\_size})}$, rounded down to a multiple of $\nu$ (Spampinato et al., 2019). The code generator emits either row- or column-major layouts and maps $\nu$-sized micro-tiles to intrinsics such as _mm256_load_pd and _mm256_fmadd_pd for double-precision AVX.
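As a concrete sketch, the cache-driven tile-size heuristic above can be computed directly. `block_size` is a hypothetical helper (not from the cited generators), and the 32 KiB L1 and AVX parameters in the example are illustrative assumptions:

```python
import math

def block_size(l1_bytes: int, element_size: int, nu: int) -> int:
    """Heuristic tile size: b ~ sqrt(L1_size / (3 * element_size)),
    rounded down to a multiple of the vector width nu."""
    b = math.isqrt(l1_bytes // (3 * element_size))
    return max(nu, (b // nu) * nu)

# Example: 32 KiB L1 cache, double precision (8 bytes), AVX holds nu = 4 doubles.
b = block_size(32 * 1024, 8, 4)  # -> 36
```

The factor of 3 reserves room for one tile each of the two operands and the result in L1.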

For example, in SLinGen (Spampinato et al., 2018), the input is a high-level LA script with structured types; higher-level operators (e.g., Cholesky, Lyapunov, Sylvester, explicit inverse) are recursively decomposed into loop-based representations, then further into vectorized, register-minimal codelets with explicit handling for matrix structure, memory layout, and architectural vector width.

3. Dynamic Runtime Optimization: Machine Learning-Driven Parameter Tuning

A recent extension to ADSALA introduces machine-learning-based runtime optimization to address thread-count selection for multi-threaded BLAS Level-3 (L3) kernels on modern multi-core architectures (Xia et al., 2024, Xia et al., 14 Jan 2026). The central challenge is that the optimal thread count $p^*$ for a BLAS kernel (e.g., GEMM, SYMM, SYRK, TRMM, TRSM) depends nonlinearly on input matrix dimensions $(m, n, k)$, machine topology (NUMA, memory bandwidth), and kernel implementation (MKL, BLIS).

Workflow

  • Installation Phase: ADSALA samples the input space quasi-randomly (e.g., scrambled Halton sequence) within a fixed memory footprint and benchmarks the kernel across candidate thread counts $1 \leq p \leq p_{\rm max}$. For each $(m, n, k, p)$, it records wall-time, preprocesses features (e.g., $m \cdot k$, $k \cdot n$, $m \cdot n$, $m \cdot k \cdot n / p$), applies data transformations (Yeo–Johnson, standardization), performs outlier removal (Local Outlier Factor), and prunes highly correlated features.
  • Model Training: Regression models (linear, ElasticNet, Bayesian ridge, decision tree, various tree ensembles, SVM, k-NN) are tuned and validated via cross-validation, with XGBoost typically yielding the best trade-off between accuracy and sub-millisecond inference cost. Final model selection maximizes $s = t_{\text{raw}} / (t_{\text{ML}} + t_{\text{eval}})$, where $t_{\text{raw}}$ is the measured baseline time and $t_{\text{ML}}, t_{\text{eval}}$ are the ML and prediction overheads.
  • Runtime Phase: For each linear algebra call, ADSALA constructs a feature vector and predicts runtime across candidate thread counts, selecting the $p^*$ that minimizes predicted execution time. Caching avoids repeated inference on repeated inputs. Typical prediction overhead is 0.1–0.3 ms (negligible for $mkn \gg 10^6$). The library then invokes the underlying BLAS with $p^*$ threads.
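The installation-phase preprocessing can be sketched with scikit-learn's standard transformers. The feature set and the sampling below are illustrative assumptions mirroring the features named in the text, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.neighbors import LocalOutlierFactor

def make_features(m, n, k, p):
    # Derived features mirroring those named in the text; the exact set is an assumption.
    return [m * k, k * n, m * n, m * k * n / p]

# Stand-in for quasi-random sampling of the (m, n, k) input space.
rng = np.random.default_rng(0)
dims = rng.integers(16, 512, size=(200, 3))
X = np.array([make_features(m, n, k, p) for (m, n, k) in dims for p in (1, 2, 4)])

# Yeo-Johnson transform with standardization, then Local Outlier Factor pruning.
Xt = PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(X)
keep = LocalOutlierFactor(n_neighbors=20).fit_predict(Xt) == 1
X_clean = Xt[keep]
```

The cleaned feature matrix `X_clean` would then feed the regression-model tuning step.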

Representative Pseudocode (Thread Selection)

function optimal_threads(m,n,k):
    if (m,n,k) == last_dims:
        return last_p
    best_p = 1
    best_T = ∞
    for p in 1..P_max:
        x = construct_features(m,n,k,p)
        T_hat = model.predict(x)
        if T_hat < best_T:
            best_T = T_hat
            best_p = p
    last_dims = (m,n,k)
    last_p = best_p
    return best_p

p_star = optimal_threads(m,n,k)
omp_set_num_threads(p_star)
call underlying_BLAS(m,n,k,...)
(Xia et al., 2024, Xia et al., 14 Jan 2026)
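A runnable version of the selection loop above, substituting a stand-in analytic cost model (work split over $p$ threads plus a per-thread synchronization term) for the trained regressor; the model form and its constants are assumptions for illustration only:

```python
P_MAX = 96       # candidate thread counts 1..P_MAX
_cache = {}      # maps (m, n, k) -> previously selected thread count

def predict_time(m, n, k, p):
    """Hypothetical cost model standing in for model.predict():
    flop work divided across p threads plus a linear synchronization overhead."""
    return (m * k * n) / p + 5e4 * p

def optimal_threads(m, n, k):
    key = (m, n, k)
    if key in _cache:                    # skip inference on repeated dimensions
        return _cache[key]
    best_p = min(range(1, P_MAX + 1), key=lambda p: predict_time(m, n, k, p))
    _cache[key] = best_p
    return best_p
```

Even this toy model reproduces the qualitative behavior reported in the text: small problems favor far fewer threads than the hardware maximum.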

4. Performance Evaluation and Empirical Impact

The integration of ADSALA’s ML-based thread selection has delivered substantial speedups relative to vendor-supplied BLAS, especially for small or irregularly shaped matrices. Experimental validation spans large-scale platforms:

  • “Setonix” (AMD EPYC, 64-core Milan, 256 threads): Median speedups over the default (maximum-thread) configuration include DGEMM: $1.54\times \pm 0.66$, DSYMM: $2.89\times \pm 1.80$, with ranges extending up to $9.05\times$ for SGEMM and $8.46\times$ for DSYMM (Xia et al., 2024).
  • “Gadi” (Intel Cascade Lake, 24-core, 96 threads): DGEMM: $1.27\times \pm 0.55$, DSYMM: $2.28\times \pm 1.89$, with maximum speedups exceeding $12\times$ in edge cases with poor thread scaling.
  • Small-matrix regimes: For memory footprints under 100 MB, average performance gain is 25–40% for GEMM (Xia et al., 14 Jan 2026).
  • Case studies: For highly non-square matrices, the incorrect thread count chosen by naïve heuristics leads to enormous copying and synchronization overhead (e.g., $81.6\times$ speedup for $m \times k \times n = 64 \times 2048 \times 64$ when reducing threads from 96 to 14).
  • Overhead analysis: ML inference cost is effectively amortized, with typical per-call prediction time much smaller than kernel execution ($\approx 0.2$ ms vs. $t_{\rm GEMM} \gg 1$ ms for practical matrix sizes).

The principal sources of speedup are architectural: reduction in thread synchronization barriers (which can scale superlinearly with $p$), lower memory panel-copy cost, topology-aware NUMA scheduling, and improved matching of compute-vs-memory cost for different regimes. ADSALA automatically adapts to these factors (Xia et al., 2024).

5. Architectural and Data-Structure Awareness in Code Generation

In the compile-time domain, ADSALA systems such as SLinGen and program generators with layered DSLs (Spampinato et al., 2018, Spampinato et al., 2019) achieve high performance by synthesizing specialized code paths for data-structured operands and hardware with different SIMD, cache, and threading characteristics:

  • Operand structure: Symmetry, triangularity, positive-definiteness, and storage layout are exposed at the LA DSL and maintained through all lowering stages. This enables full elimination of zero-loads/stores, code unrolling for fixed matrix sizes, and structure-driven tiling.
  • Register/block tiling: Two-level blocking aligns with hardware registers and SIMD width (e.g., $4 \times 4$ codelets for AVX double precision), with explicit $\nu$-block aggregation and register-pressure-aware fusion of operations. Algebraic rewriting further increases vectorization opportunities.
  • Architecture mapping: All parameters (cache block size $b$, micro-tile size $\nu$, alignment) are explicit in the code generator, making retargeting to different microarchitectures straightforward. Hardware parameters such as cache size and vector length serve as inputs to the cost/heuristic models used for autotuning.
  • Example (Cholesky HLAC): The code generator blocks the matrix into $\nu \times \nu$ panels, recurses to smaller units, and emits vectorized, straight-line C code with AVX/FMA intrinsics. Memory layout analysis ensures aligned and contiguous access whenever possible, with explicit tagging of load/store patterns.

In empirical evaluations, such generated code consistently outperforms both generic and library-based implementations for small to moderate operand sizes, often by $2$–$5\times$ for core factorizations and $1.4$–$4\times$ for full application kernels (e.g., Kalman filters, Gaussian process regression on $n = 4 \dots 64$) (Spampinato et al., 2018).

6. Limitations, Generality, and Future Directions

Current ADSALA instantiations exhibit several benefits and boundaries:

  • Generality: Both the ML-driven runtime scheduling method and multi-layer DSL generators extend to all BLAS Level-3 routines, and in principle to BLAS I/II and LAPACK procedures with discrete tunable parameters (Xia et al., 2024, Xia et al., 14 Jan 2026).
  • Limitations: Installation and data collection phases remain expensive (up to $\sim$100 node-hours per subroutine for ML thread selection), and require discrete, finite search spaces for tunable parameters. ML inference cost must remain low relative to kernel execution. Code generators for small-scale applications currently assume dense storage and fixed sizes; extension to sparse representations, batched variable sizes, and more aggressive register allocation is ongoing (Spampinato et al., 2018).
  • Future Work:
    • Automatic hybrid CPU-GPU BLAS backend selection;
    • Incorporation of richer hardware features (cache size, memory bandwidth) and data properties (sparsity) as input features for ML models;
    • Joint optimization of multiple tunable parameters (e.g., blocking factors, CPU/GPU partitioning);
    • Automated pipeline extension for heterogeneous and distributed environments with on-the-fly retraining and incremental adaptation (Xia et al., 2024, Xia et al., 14 Jan 2026).
    • Generalization of DSL-based approaches to support variable/batched sizes, non-dense storage formats, and multicore parallelism (Spampinato et al., 2018, Spampinato et al., 2019).

7. Illustrative Example: End-to-End ADSALA Process

The illustrative pipeline for a matrix multiply ($C := AB$) under the multi-layer DSL approach is as follows (Spampinato et al., 2019):

  1. LA DSL: $C := AB$ with structure annotations (dimensions, properties).
  2. Partitioning (p-LA): Recursively block $A$ and $B$ along $k$ (e.g., partition $A \rightarrow [A_{11}\,A_{12}]$).
  3. Loop-based (lp-LA): Emit loops over block indices with invariant $C[0{:}i, 0{:}j] = A[0{:}i, 0{:}k] \ast B[0{:}k, 0{:}j]$; block size $b$ driven by cache size.
  4. Implementation (LL): Micro-tile each block using vector width $\nu$, unrolling micro-loops into explicit register operations.
  5. C-IR: Emit _mm256 load/store/fmadd intrinsics for the vectorized kernel.
  6. Autotuning/refinement: Search over block size $b$ and tile size $\nu$ for each target machine.
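The blocking in stages 2–4 can be sketched as a two-level blocked multiply. This is an illustrative sketch only: a real generator would emit unrolled vector intrinsics for the innermost $\nu \times \nu$ update rather than the library call used here, and the block sizes are arbitrary:

```python
import numpy as np

def blocked_matmul(A, B, b=8, nu=4):
    """Two-level blocked C = A @ B: cache-level blocks of size b (lp-LA loops)
    and nu-sized micro-tiles (LL layer)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i0 in range(0, m, b):                 # cache-level blocking
        for j0 in range(0, n, b):
            for k0 in range(0, k, b):
                kk = min(k0 + b, k)
                for i in range(i0, min(i0 + b, m), nu):   # micro-tiling
                    for j in range(j0, min(j0 + b, n), nu):
                        ii, jj = min(i + nu, m), min(j + nu, n)
                        # A generator would unroll this nu x nu update into
                        # register-level FMA operations.
                        C[i:ii, j:jj] += A[i:ii, k0:kk] @ B[k0:kk, j:jj]
    return C
```

The loop nest makes explicit how the cache block size $b$ and micro-tile size $\nu$ parameterize the generated code independently.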

These stages formalize the separation between algorithmic intent, structure-aware partitioning, loop derivation, hardware-aware tiling, and low-level mapping, defining the essence of ADSALA systems.


