Systolic Array Architecture in MIMO Detection

Updated 7 September 2025

Systolic array architecture is a spatially regular, massively parallel system that performs pipelined, local arithmetic operations for tasks like matrix multiplication and QR decomposition.
In MIMO detection, this architecture maps intensive tasks such as QR decomposition, lattice reduction, and interference cancellation onto simple processing elements for real-time performance.
Modified LLL algorithms (FSR-LLL and ASLR) use relaxed conditions to maximize parallelism, achieving significant throughput improvements with minimal BER degradation.

A systolic array is a spatially regular, massively parallel architecture composed of simple processing elements (PEs) arranged in a grid and connected by local, pipelined data paths. The essential property is the “systolic” (pulsed, rhythmic) movement of data: operands enter the array’s periphery and flow across a fixed interconnection pattern, enabling each PE to perform local arithmetic (e.g., multiply-accumulate) at each clock cycle while passing data to its neighbors. Systolic arrays have been applied to matrix multiplication, QR decomposition, signal processing, deep learning, coding theory, and notably to lattice-reduction-aided detection (LRAD) in MIMO wireless receivers. Their hardware efficiency stems from parallelism, high PE utilization, localized interconnects, and regular control, making them suitable for real-time, high-throughput applications.
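The rhythmic dataflow described above can be illustrated with a small cycle-by-cycle simulation. This is a minimal sketch, assuming an output-stationary array for matrix multiplication in which PE(i, j) accumulates one output entry; the function name `systolic_matmul` and the register layout are illustrative choices, not taken from the source.

```python
def systolic_matmul(A, B):
    """Simulate an output-stationary systolic array computing C = A * B.

    PE(i, j) keeps C[i][j] in a local accumulator. Rows of A enter from the
    left edge and columns of B from the top edge, each skewed by one cycle
    per row/column so that matching operands meet in the right PE.
    """
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]      # local accumulator of each PE
    a_reg = [[0] * p for _ in range(n)]    # operand travelling left -> right
    b_reg = [[0] * p for _ in range(n)]    # operand travelling top -> bottom

    for t in range(n + m + p - 2):         # cycles until the last operands meet
        # Each PE passes its A operand one hop to the right.
        for i in range(n):
            for j in range(p - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i                      # row i is delayed by i cycles
            a_reg[i][0] = A[i][k] if 0 <= k < m else 0
        # Each PE passes its B operand one hop downward.
        for j in range(p):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j                      # column j is delayed by j cycles
            b_reg[0][j] = B[k][j] if 0 <= k < m else 0
        # Every PE performs one local multiply-accumulate per cycle.
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Only neighbor-to-neighbor transfers appear in the loop body, which is the property that keeps the interconnect local in hardware.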

1. Systolic Arrays in MIMO Detection

In MIMO detection, the core computational bottleneck arises from the inversion or factorization of large, possibly ill-conditioned, channel matrices. The systolic array architecture described in the context of LRAD provides a densely packed network of PEs to execute arithmetic-intensive matrix operations—specifically QR decomposition, lattice reduction, and linear or successive interference cancellation (SIC) detection steps.

The dataflow is organized such that each PE participates in the QR decomposition (via Givens rotations), the iterative lattice reduction (including column swapping and size reduction steps), and the linear or SIC detection. The resultant array architecture achieves parallelism at the granularity of columns and rows, keeping computation local and communication fixed-pattern. This is essential for low-latency, real-time wireless receivers.
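As a point of reference for the QR step, the sketch below performs QR decomposition with Givens rotations in plain sequential Python. It assumes a real-valued matrix for brevity and does not model the pipelined schedule of the array; the function name `givens_qr` is illustrative.

```python
import math


def givens_qr(H):
    """Return (Q, R) with H = Q * R and R upper triangular (real-valued H)."""
    m, n = len(H), len(H[0])
    R = [row[:] for row in H]
    # Qt accumulates the product of all rotations, i.e. Q transposed.
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j below the diagonal, working from the bottom row up.
        for i in range(m - 1, j, -1):
            a, b = R[i - 1][j], R[i][j]
            r = math.hypot(a, b)
            if r == 0.0:
                continue                   # both entries already zero
            c, s = a / r, b / r
            # Apply the rotation [[c, s], [-s, c]] to rows i-1 and i.
            for M in (R, Qt):
                for k in range(len(M[0])):
                    u, v = M[i - 1][k], M[i][k]
                    M[i - 1][k] = c * u + s * v
                    M[i][k] = -s * u + c * v
    Q = [[Qt[j][i] for j in range(m)] for i in range(m)]
    return Q, R


Q, R = givens_qr([[3.0, 1.0], [4.0, 2.0]])
print(R)
```

Each rotation touches only two adjacent rows, which is why this step can be assigned to neighboring PEs with fixed-pattern communication.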

Key benefits include:

Parallel execution of independent operations (e.g., simultaneous column reductions),
Pipelined data movement that overlaps with computation,
Hardware efficiency through minimized routing and the absence of global interconnects.

2. Lattice Reduction-Aided Detection (LRAD) Principles

Lattice reduction improves MIMO detection by transforming the channel matrix $H$ into a reduced basis $\hat{H}$ with improved orthogonality, via a unimodular transformation $T$ ( $\hat{H} = H \cdot T$ , equivalently $H = \hat{H} \cdot T^{-1}$ ). Because $Hx = \hat{H}(T^{-1}x)$ , the receiver first estimates $z = T^{-1}x$ by linear detection (e.g., ZF or MMSE) or SIC on the reduced lattice, quantizes the result, and then recovers the transmitted symbol vector $x$ through the inverse mapping $\hat{x} = T\hat{z}$ .
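A small numeric example makes the benefit concrete. The sketch below assumes the convention $\hat{H} = H T$ (so that $Hx = \hat{H}\,T^{-1}x$), a real 2×2 channel, integer symbols with plain rounding as the quantizer, and a hand-picked unimodular $T$ rather than one produced by a reduction algorithm. The orthogonality defect is taken as the product of squared column norms divided by $\det(H)^2$, which is one common definition; all function names are illustrative.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def orthogonality_defect(M):
    """Product of squared column norms over det(M)^2 (2x2 case); 1 means orthogonal."""
    (a, b), (c, d) = M
    return (a * a + c * c) * (b * b + d * d) / (a * d - b * c) ** 2


def zf_detect(H, y):
    """Plain zero forcing: invert H, then round to the nearest integers."""
    return [round(v) for v in matvec(inv2(H), y)]


def lr_zf_detect(H, T, y):
    """LR-aided zero forcing: detect z = T^-1 x in the reduced basis, then map back."""
    H_red = matmul(H, T)
    z_hat = [round(v) for v in matvec(inv2(H_red), y)]
    return matvec(T, z_hat)


H = [[2.0, 3.0], [1.0, 2.0]]          # nearly parallel columns
T = [[-3, -2], [2, 1]]                # integer entries, determinant 1 (unimodular)
x = [1, -1]                           # transmitted integer symbols
noise = [0.2, -0.2]
y = [h + n for h, n in zip(matvec(H, x), noise)]

print(orthogonality_defect(H), orthogonality_defect(matmul(H, T)))
print(zf_detect(H, y), lr_zf_detect(H, T, y))
```

With this noise sample, plain zero forcing amplifies the noise enough to round to the wrong symbols, while detection in the reduced basis recovers `x`. Practical constellations additionally require shifting and scaling before rounding, which is omitted here.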

Within the systolic array:

QR decomposition is executed first, using pipelined Givens rotations.
Iterative lattice reduction then operates on the upper-triangular factor $R$ of the QR factorization, interleaving column swaps and size reductions in a form tailored for parallel execution.
Detection phase (linear or SIC) is mapped onto the same array for hardware reuse and low-latency deployment.

This arrangement allows the LRAD process to be “hardware friendly,” since lattice reduction only needs to be performed upon channel change, and the same hardware can then execute the simplified detection.

3. Modified LLL Algorithms for Systolic Arrays

Conventional LLL algorithms are inherently serial and thus suboptimal for parallel hardware such as systolic arrays. To resolve this, two tailored variants are proposed:

LLL with Full Size Reduction (FSR-LLL):
- All columns of the $R$ matrix are size-reduced simultaneously in each iteration.
- Procedure:
  1. Full size reduction across columns.
  2. For columns violating the lattice-reduction condition, perform rotation to zero off-diagonal elements, then swap columns (rotation occurs before swap to enable parallel operation).
- This organization enables decoupling of full size reduction from column swaps, maximizing parallelism.
All-Swap Lattice Reduction (ASLR):
- All eligible pairs of adjacent columns are tested and swapped simultaneously per iteration.
- Swapping decisions are made in parallel for all even-indexed or all odd-indexed adjacent pairs, so that pairs handled in the same step never share a column.
- Especially advantageous for larger antenna counts ( $m \geq 8$ ), reducing the average number of swaps per reduction.
Relaxed Reduction Condition (Siegel):
- The Lovász condition (used in classic LLL) is $\delta \, |r_{i-1,i-1}|^2 \leq |r_{i,i}|^2 + |r_{i-1,i}|^2$ , so each check needs the off-diagonal entry $r_{i-1,i}$ in addition to two diagonal entries.
- The Siegel (relaxed) condition simplifies this to $|r_{i,i}|^2 \geq \delta \cdot |r_{i-1,i-1}|^2$ for $i = 2, \ldots, m$ with $\delta \in (1/2,1)$ (chosen as 0.99).
- Verification becomes a single comparison between neighboring diagonal entries of $R$ , which simplifies the interconnect between PEs.

This relaxation enables reduced inter-PE communication and computational cost, with negligible BER loss for LR-aided linear detections.
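The building blocks listed above can be sketched in a few lines. The code below is an illustration only: it assumes a real-valued upper-triangular $R$ with nonzero diagonal, uses 0-based column indices, applies the Siegel-type test exactly in the form quoted above, and stops at identifying swap candidates without performing the rotations and swaps. The function names and the parity convention are hypothetical. In the complex case, size reduction would round real and imaginary parts separately.

```python
def full_size_reduce(R):
    """Size-reduce every column of upper-triangular R in place (real case)."""
    m = len(R)
    for k in range(1, m):
        for j in range(k - 1, -1, -1):
            mu = round(R[j][k] / R[j][j])
            if mu:
                # Subtract mu times column j; only rows 0..j are nonzero there.
                for row in range(j + 1):
                    R[row][k] -= mu * R[row][j]
    return R


def siegel_ok(R, i, delta=0.99):
    """Siegel-type test for columns (i-1, i): two diagonal entries only."""
    return abs(R[i][i]) ** 2 >= delta * abs(R[i - 1][i - 1]) ** 2


def lovasz_ok(R, i, delta=0.99):
    """Lovasz test for columns (i-1, i): also reads the off-diagonal entry."""
    return (delta * abs(R[i - 1][i - 1]) ** 2
            <= abs(R[i][i]) ** 2 + abs(R[i - 1][i]) ** 2)


def all_swap_candidates(R, parity, delta=0.99):
    """Adjacent pairs (i-1, i), with i of the given parity, failing the Siegel test.

    Pairs of equal parity are disjoint, so they could be swapped simultaneously.
    """
    return [(i - 1, i) for i in range(1, len(R))
            if i % 2 == parity and not siegel_ok(R, i, delta)]


R = [[2.0, 3.1, -4.2, 0.7],
     [0.0, 1.0, 2.6, -1.9],
     [0.0, 0.0, 1.5, 0.8],
     [0.0, 0.0, 0.0, 0.5]]
full_size_reduce(R)
odd_pairs = all_swap_candidates(R, parity=1)
even_pairs = all_swap_candidates(R, parity=0)
print(odd_pairs, even_pairs)
print(lovasz_ok(R, 1), siegel_ok(R, 1))
```

Note that `siegel_ok` reads only `R[i][i]` and `R[i-1][i-1]`, which is what allows the check to be a comparator between neighboring diagonal PEs, whereas `lovasz_ok` needs a third operand.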

4. Comparative Analysis: FSR-LLL vs ASLR

Simulation and FPGA implementation metrics highlight the following:

Algorithm	BER degradation (linear detection)	Average swaps ( $m \geq 8$ )	FPGA throughput (vs. serial)	FPGA slice overhead
FSR-LLL	Negligible	Reference	Baseline	36–38% more
ASLR	Negligible	<65% of FSR-LLL	1.6 $\times$ faster	36–38% more

ASLR scales well to large MIMO systems (many antennas): the execution time per lattice reduction drops further as the problem dimension grows.
FSR-LLL is comparable for small dimensions, but requires more column swaps, which it performs serially.
Both approaches maintain near-ML detection performance when used with the relaxed Siegel condition.
FPGA implementation shows that despite ~36–38% higher slice usage compared to conventional architectures, the benefit in throughput and dataflow determinism is pronounced.

5. Hardware Efficiency and Simulation Results

Orthogonality defect: For $\delta=0.99$ , the modified LLL algorithms yield lattice bases whose orthogonality defect is very close to that of conventional LLL, as confirmed by the cumulative distribution functions.
Bit Error Rate (BER): For linear detection (ZF/MMSE), BER is essentially identical to that of classical LLL.
FPGA timing: The systolic ASLR design with square-root-free QR decomposition (SGR) processes one channel matrix in 321–500 ns, with achievable clock frequencies of up to 249 MHz.
Parallelism: Givens rotations, size reductions, and swaps are overlapped where possible.
Throughput: The systolic array enables significant performance gains in high-throughput MIMO-OFDM wireless systems, with minor area penalties offset by faster processing.
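To put the 321–500 ns figure in perspective, the per-matrix latency can be converted into a processing rate. This is a back-of-the-envelope sketch that assumes channel matrices are processed back to back, one at a time, with no overlap between consecutive matrices; the helper name is illustrative.

```python
def matrices_per_second(latency_ns):
    """Processing rate if each channel matrix occupies the array for latency_ns."""
    return 1e9 / latency_ns


for latency_ns in (321, 500):
    rate = matrices_per_second(latency_ns)
    print(f"{latency_ns} ns per matrix -> {rate / 1e6:.2f} million matrices/s")
```

Under that assumption the array handles roughly two to three million channel matrices per second.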

6. Conclusion and Implementation Outlook

The integration of parallel-friendly lattice reduction algorithms (FSR-LLL, ASLR) utilizing the relaxed Siegel condition within a systolic array enables hardware-efficient, high-throughput MIMO detection:

LRAD in a single array: Both reduction and detection steps can be performed on the same hardware for optimal reuse and real-time operation.
ASLR superiority: For large systems, concurrent swap operations and parallel full-matrix processing make ASLR preferable, minimizing the iterative reduction time.
Hardware optimization: The architecture achieves both high detection performance and hardware efficiency, backed by empirical simulation and FPGA synthesis results.

Future research aims to:

Further reduce hardware complexity via refined scheduling and implementation,
Extend the architecture to larger MIMO dimensions and higher-order modulations,
Explore fixed-point vs floating-point trade-offs, and
Pursue adoption on advanced process nodes and new hardware fabrics.

The systolic array paradigm, as established in LRAD for MIMO, exemplifies the power of spatially regular, pipelined hardware to realize parallel algorithms—provided algorithmic and dataflow modifications (such as ASLR and the Siegel condition) are judiciously designed for compatibility with parallel processing requirements.
