Homomorphic Matrix-Vector Multiplication
- Homomorphic matrix-vector multiplication is a cryptographic primitive that computes $y = Mv$ on encrypted data, preserving privacy in sensitive computations.
- Advanced techniques such as SIMD packing, baby-step giant-step rotations, and hardware acceleration in BFV, CKKS, and AHE frameworks significantly reduce computational overhead.
- These methods enable practical applications like privacy-preserving data analysis, secure outsourced computation, and encrypted AI inference, with demonstrated speedups over conventional approaches.
Homomorphic matrix-vector multiplication refers to the evaluation of the product $y = Mv$ where at least one operand ($M$, $v$, or both) is encrypted under a homomorphic encryption (HE) scheme. This operation is foundational for privacy-preserving data analysis, secure outsourced computation, encrypted search, and a wide range of cryptographic protocols in both fully and additively homomorphic settings. Typically, the algorithmic focus is on maximizing throughput by leveraging SIMD packing and compression methods, minimizing costly homomorphic operations such as rotations or scalar multiplications, and exploiting hardware-level parallelism. The large-scale and computationally expensive nature of this primitive has driven recent advances in both algorithmic and architectural optimization, as reflected in BFV-based, CKKS-based, and AHE-based frameworks.
1. Data Encoding and Slot Representation
Matrix-vector multiplication in HE environments critically depends on packing plaintext data into polynomial coefficients that encode multiple entries per ciphertext. In BFV, messages are encoded in the ring $R_t = \mathbb{Z}_t[X]/(X^N + 1)$, where ciphertext slots correspond to the coefficients of a plaintext polynomial. Given a vector $v \in \mathbb{Z}_t^n$ with $n \le N$, encoding consists of populating the first $n$ slots with $v_0, \ldots, v_{n-1}$ and zero-filling the remaining $N - n$ slots, resulting in a plaintext $\mathrm{pt}(v)$ and ciphertext $\mathrm{ct}(v)$.
Packing for matrices is accomplished via diagonal extraction: a matrix $M \in \mathbb{Z}_t^{n \times n}$ is represented by $n$ diagonal vectors $d_0, \ldots, d_{n-1}$, where $d_i[j] = M[j, (j+i) \bmod n]$ for $0 \le j < n$. These are mapped to plaintext polynomials $\mathrm{pt}(d_0), \ldots, \mathrm{pt}(d_{n-1})$. Thus $n$ plaintext polynomials fully represent the matrix, and one ciphertext holds the vector $v$ (Bosworth et al., 12 Dec 2025).
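The diagonal extraction above is a pure index rearrangement. A minimal Python sketch (no HE library involved; `diagonals` is a hypothetical helper name):

```python
# Extract the generalized diagonals d_i[j] = M[j][(j + i) % n] used by the
# diagonal (Halevi-Shoup-style) matrix encoding.
def diagonals(M):
    n = len(M)
    return [[M[j][(j + i) % n] for j in range(n)] for i in range(n)]

M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
d = diagonals(M)
# d[0] is the main diagonal; d[i] wraps i steps to the right per row.
```

Each `d[i]` becomes one plaintext polynomial, so the whole matrix occupies $n$ plaintexts while the vector stays in a single ciphertext.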
In CKKS, vectors are encoded via coefficient packing with a prescribed scale $\Delta$, forming $\mathrm{pt}(v) = \lfloor \Delta \cdot v \rceil$. Matrices are packed column-wise, encrypting each column as a separate RLWE ciphertext (Bae et al., 20 Mar 2025).
Unpacked additively homomorphic encryption (AHE), e.g., EC-ElGamal, uses no SIMD; each entry $v_i$ is separately encrypted without packing, and no polynomial ring structure is exploited (Ramapragada et al., 20 Apr 2025).
2. Homomorphic Matrix-Vector Multiplication Algorithms
BFV-based implementations (SophOMR) rely on repeated slot rotations and pointwise plaintext-ciphertext multiplications to compute $y = \sum_{i=0}^{n-1} \mathrm{pt}(d_i) \odot \mathrm{rot}_i(\mathrm{ct}(v))$. To reduce the rotation overhead, a baby-step giant-step (BSGS) optimization is used: factor $n = n_1 n_2$, precompute $n_1$ rotated copies of $\mathrm{ct}(v)$, then for each giant step compute partial sums via pointwise multiplication and accumulate using rotations of block-sum ciphertexts. This method decreases the number of expensive rotations and improves throughput (Bosworth et al., 12 Dec 2025).
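The BSGS restructuring rests on the identity $d_i \odot \mathrm{rot}_i(v) = \mathrm{rot}_{n_1 g}\!\big(\mathrm{rot}_{-n_1 g}(d_i) \odot \mathrm{rot}_b(v)\big)$ for $i = n_1 g + b$, which can be checked end-to-end in cleartext. A sketch with rotations modeled on plain lists (`matvec_bsgs` is a hypothetical helper, not SophOMR code):

```python
def rot(v, k):
    """Cyclic left rotation by k slots: rot(v, k)[j] = v[(j + k) % n]."""
    n = len(v)
    return [v[(j + k) % n] for j in range(n)]

def matvec_bsgs(M, v, n1, n2):
    n = len(v)
    assert n1 * n2 == n
    # Diagonal encoding: d[i][j] = M[j][(j + i) % n].
    d = [[M[j][(j + i) % n] for j in range(n)] for i in range(n)]
    baby = [rot(v, b) for b in range(n1)]       # n1 ciphertext rotations
    y = [0] * n
    for g in range(n2):                         # giant steps
        partial = [0] * n
        for b in range(n1):
            i = n1 * g + b
            dg = rot(d[i], -n1 * g)             # free: plaintext rotation
            partial = [p + dg[j] * baby[b][j] for j, p in enumerate(partial)]
        y = [a + c for a, c in zip(y, rot(partial, n1 * g))]  # 1 rot/giant step
    return y
```

In the encrypted setting only the `baby` rotations and the one rotation per giant step are expensive; the plaintext-side rotations of the diagonals are precomputed for free.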
CKKS-based matrix-vector multiplication can be efficiently reduced to two cleartext mat-vec operations over integer matrices modulo $q$, together with only a constant number of homomorphic maintenance operations (rescale, re-keying). Because decryption is a linear map, applying $M$ to the unpacked coefficient arrays $\mathbf{a}$ and $\mathbf{b}$ of the RLWE ciphertext, i.e., computing $M\mathbf{a} \bmod q$ and $M\mathbf{b} \bmod q$, yields a ciphertext of $Mv$. This approach obviates the need for rotations, Hadamard products, and key-switches, and instead employs two calls to high-throughput BLAS dgemv plus a homomorphic rescale (Bae et al., 20 Mar 2025).
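The underlying principle, that a public linear map commutes with the linear decryption map, can be illustrated with a toy noise-free matrix-LWE analogue. This is NOT the CKKS construction of Bae et al.; the modulus, dimensions, and encryption shape below are illustrative assumptions only:

```python
import random

random.seed(0)
q = 2**16 + 1  # toy modulus (assumption)

def matvec(M, x):
    return [sum(r[k] * x[k] for k in range(len(x))) % q for r in M]

def matmul(M, A):
    cols = len(A[0])
    return [[sum(M[i][k] * A[k][j] for k in range(len(A))) % q
             for j in range(cols)] for i in range(len(M))]

n, kdim = 4, 3
s = [random.randrange(q) for _ in range(kdim)]                 # secret key
m = [7, 11, 13, 17]                                            # plaintext
A = [[random.randrange(q) for _ in range(kdim)] for _ in range(n)]
b = [(As_i + m_i) % q for As_i, m_i in zip(matvec(A, s), m)]   # b = A*s + m

M = [[2, 0, 1, 0], [0, 3, 0, 1], [1, 1, 0, 0], [0, 0, 2, 2]]   # public matrix
MA, Mb = matmul(M, A), matvec(M, b)   # two cleartext mat-vec/mat-mat calls

def decrypt(A, b, s):
    # Linear decryption: m = b - A*s (mod q)
    return [(b_i - As_i) % q for b_i, As_i in zip(b, matvec(A, s))]
```

Since $M\mathbf{b} - (M A)\mathbf{s} = M(\mathbf{b} - A\mathbf{s}) = M\mathbf{m} \pmod q$, `decrypt(MA, Mb, s)` recovers $Mm$ without any homomorphic rotations.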
For unpacked EC-ElGamal, matrix-vector multiplication leverages Cussen's compression-reconstruction algorithm. Each row vector is compressed to a short support, plaintext-scalar homomorphic multiplications are performed only once per unique value, and reconstruction via cumulative homomorphic sums and outer-product accumulation yields the encrypted outputs. For $b$-bit element width this replaces the bulk of the $O(n^2)$ expensive EC scalar multiplications with cheap EC point additions (Ramapragada et al., 20 Apr 2025).
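The core cost-shifting idea, one expensive scalar multiplication per unique nonzero coefficient and cheap additions for everything else, can be sketched as follows. This is a deliberate simplification of Cussen's algorithm (which additionally compresses via sorting and differencing), with integers mod $p$ standing in for EC-ElGamal ciphertexts:

```python
from collections import defaultdict

p = 2**31 - 1                       # stand-in group modulus (assumption)
add = lambda x, y: (x + y) % p      # cheap "point addition"
smul = lambda k, x: (k * x) % p     # expensive "scalar multiplication"

def row_times_ciphertexts(row, cts):
    """Compute sum_j row[j] * cts[j] with one smul per unique nonzero
    coefficient: group equal coefficients, add their ciphertexts first."""
    groups = defaultdict(list)
    for coeff, ct in zip(row, cts):
        if coeff:
            groups[coeff].append(ct)
    total = 0
    for coeff, members in groups.items():
        partial = members[0]
        for ct in members[1:]:
            partial = add(partial, ct)          # cheap adds
        total = add(total, smul(coeff, partial))  # one expensive smul
    return total
```

For a row with $u$ unique nonzero values, this costs $u$ scalar multiplications instead of one per entry, which is exactly the trade that makes small-bit-width matrices cheap.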
3. Computational Complexity and Optimization Strategies
The core cost in BFV-native approaches is the number of rotations and plaintext-ciphertext multiplications. Without optimization, naïve computation requires $n$ rotations and $n$ multiplications. BSGS reduces this to roughly $n_1 + n_2$ rotations (for $n = n_1 n_2$, i.e., $O(\sqrt{n})$ when $n_1 \approx n_2$) at the same multiplication count (Bosworth et al., 12 Dec 2025).
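The rotation bookkeeping is simple arithmetic; for instance, at the benchmark dimension from Section 7 and an assumed balanced factorization:

```python
# Rotation counts for the diagonal method (standard BSGS bookkeeping;
# the balanced split n1 = n2 is an illustrative assumption).
n = 4096
naive_rotations = n            # one rotation per diagonal
n1 = n2 = 64                   # n = n1 * n2
bsgs_rotations = n1 + n2       # ~ 2 * sqrt(n): 128 instead of 4096
```

A 32x reduction in the dominant homomorphic operation, before any hardware acceleration is applied.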
CKKS BLAS-style reduction incurs $O(n^2)$ plaintext multiplications (performed in high-throughput BLAS), with small additional overhead for rescale and key-switch. The typical slowdown relative to unencrypted linear algebra is $4$–$12\times$ for practical matrix sizes, compared to orders-of-magnitude slowdowns for classical homomorphic approaches relying on slot-wise rotations (Bae et al., 20 Mar 2025).
Compression in AHE (EC-ElGamal) adaptively reduces the number of high-cost scalar multiplications. For modest element widths $b$, compression leaves only a few unique values per row, trading most scalar multiplies for EC point adds. Benchmark speedups of $6$–$10\times$ are reported for large $n$ (Ramapragada et al., 20 Apr 2025).
4. Hardware Acceleration and Parallelism
SophOMR demonstrates that advanced hardware architectures significantly boost practical throughput. The FPGA-accelerated architecture employs three primary cores:
- PCmul: Plaintext–ciphertext multiplication (using NTT for large integers).
- CCadd: Ciphertext–ciphertext addition.
- Rot: Rotation/key-switch, implemented via ApplyGalois and KeySwitch.
By tuning coefficient-level (PC), instance-level (PI), and NTT-butterfly-level (PB) parallelism (e.g., PC=16, PI=2, PB=64 in the Alveo U55C @ 200 MHz setup), and using URAM buffers for the matrix and rotation keys with double-buffering and pipelined operations, this configuration yielded a $13.86\times$ speedup over a CPU baseline while utilizing 77% of DSPs, 76% of BRAM, and 69% of URAM at $n = 4096$ (Bosworth et al., 12 Dec 2025).
Further scaling is possible: rotation bottlenecks can be addressed via multiple Rot cores or deeper pipelining of the NTT. A plausible implication is that hardware-aware algorithm selection drives real-world feasibility for metadata-private applications such as oblivious message retrieval (OMR) (Bosworth et al., 12 Dec 2025).
5. Security, Error Analysis, and Parameter Selection
Security in all settings follows the respective primitive's hardness assumption: RLWE-based schemes (BFV, CKKS) offer conjectured post-quantum security at appropriately chosen ring dimensions, modulus sizes, and scales. CKKS keeps relative errors well below single precision for suitable scale choices; schoolbook AHE with EC-ElGamal employs standard curves (P-256, etc.), establishing 128-bit classical security based on DDH (Bae et al., 20 Mar 2025; Ramapragada et al., 20 Apr 2025).
Efficient HE multiplication requires selecting packing sizes matched to hardware parallelism, bounding error and noise growth (the CKKS rescale keeps the ciphertext at its target scale), and ensuring feasible decryption (especially for AHE, where plaintexts must be restricted to a discretized range so that baby-step/giant-step discrete-log inversion remains tractable).
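Exponential-ElGamal-style AHE decrypts to $g^m$ (or $m \cdot G$ on a curve), so recovering $m$ requires a bounded discrete-log search. A baby-step/giant-step sketch, using the multiplicative group mod a small prime as a stand-in for the elliptic-curve group (all parameters illustrative):

```python
import math

def bsgs_dlog(g, h, p, bound):
    """Return m in [0, bound) with pow(g, m, p) == h, or None.
    Runs in O(sqrt(bound)) group operations, which is why AHE plaintexts
    must stay in a small discretized range."""
    t = math.isqrt(bound) + 1
    baby = {pow(g, b, p): b for b in range(t)}   # baby steps: g^b
    step = pow(g, -t, p)                         # g^(-t) (Python >= 3.8)
    cur = h % p
    for giant in range(t):
        if cur in baby:                          # h * g^(-giant*t) == g^b
            return giant * t + baby[cur]
        cur = (cur * step) % p
    return None
```

The `bound` parameter is exactly the "discretized range" restriction: doubling the plaintext range only multiplies decryption cost by $\sqrt{2}$, but an unbounded range makes decryption infeasible.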
6. Application Scenarios and Architectural Trade-offs
Homomorphic matrix-vector multiplication is central in privacy-preserving search, AI model inference, and metadata-private communication protocols. In OMR, the server computes encrypted detection of relevant messages with throughput constraints imposed by MatMul cost (Bosworth et al., 12 Dec 2025).
The CKKS BLAS paradigm is tailored for data analytics and AI on encrypted data in cloud environments, leveraging highly optimized floating-point mat-vec routines. Its efficiency suggests broad applicability where scaling, vectorization, and noise control are paramount (Bae et al., 20 Mar 2025).
EC-ElGamal AHE-based methods enable lightweight matrix operations in edge and resource-constrained devices, due to their reduced memory and compute requirements and compatibility with efficient compression (Ramapragada et al., 20 Apr 2025).
Architectural choices reflect security, performance, and resource trade-offs. Deep hardware pipelining, parallel-limb processing, and task-specific packing are essential for high-dimensional workloads.
7. Comparative Performance Metrics
| Scheme / Optimization | Platform | Speedup vs. Baseline | Typical Latency |
|---|---|---|---|
| BFV-BSGS + FPGA | Alveo U55C @ 200 MHz | 13.86× (vs. CPU) | 2.150 ms (n=4096, k=50) |
| CKKS-BLAS | Xeon E5 / OpenBLAS | 4–12× (vs. BLAS) | ~8 ms (n=4096) |
| EC-ElGamal AHE + compression | Raspberry Pi 5 | 6–10× (vs. schoolbook) | 2.5–5.0 s (n=1000) |
All latency and speedup metrics are verbatim from reported measurements. These figures demonstrate that algorithm/architecture co-design yields dramatic speedups relative to prior art, with hardware acceleration being critical at scale and compression optimizations being essential in edge settings (Bosworth et al., 12 Dec 2025, Bae et al., 20 Mar 2025, Ramapragada et al., 20 Apr 2025).