
Homomorphic Matrix-Vector Multiplication

Updated 19 December 2025
  • Homomorphic matrix-vector multiplication is a cryptographic primitive that computes y = Mv on encrypted data, ensuring privacy in sensitive computations.
  • Advanced techniques such as SIMD packing, baby-step giant-step rotations, and hardware acceleration in BFV, CKKS, and AHE frameworks significantly reduce computational overhead.
  • These methods enable practical applications like privacy-preserving data analysis, secure outsourced computation, and encrypted AI inference, with demonstrated speedups over conventional approaches.

Homomorphic matrix-vector multiplication refers to the evaluation of the product $y = Mv$ where at least one operand (either $M$, $v$, or both) is encrypted under a homomorphic encryption (HE) scheme. This operation is foundational for privacy-preserving data analysis, secure outsourced computation, encrypted search, and a wide range of cryptographic protocols in both fully and additively homomorphic settings. Typically, the algorithmic focus is on maximizing throughput by leveraging SIMD packing and compression methods, minimizing costly homomorphic operations such as rotations or scalar multiplications, and exploiting hardware-level parallelism. The large-scale and computationally expensive nature of this primitive has driven recent advances in both algorithmic and architectural optimization, as reflected in BFV-based, CKKS-based, and AHE-based frameworks.

1. Data Encoding and Slot Representation

Matrix-vector multiplication in HE environments critically depends on packing plaintext data into polynomial coefficients that encode multiple entries per ciphertext. In BFV, messages are encoded in the ring $R_t=\mathbb{Z}_t[X]/(X^n+1)$, where ciphertext slots correspond to the $N$ coefficients of a plaintext polynomial. Given a vector $v\in\mathbb{Z}_t^k$, encoding consists of populating the first $k$ slots with $(v_0,\ldots,v_{k-1})$ and zero-filling the remaining slots, resulting in $pt_v(X)=\sum_{j=0}^{k-1} v_j X^j$ and ciphertext $ct_v=\text{Encrypt}(pt_v)$.
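The slot-packing step above can be sketched in a few lines of plain Python; `pack_vector` is an illustrative name, not an API from any HE library:

```python
# Sketch of BFV-style slot packing: the first k coefficients of an
# n-coefficient plaintext polynomial hold the vector entries, and the
# remaining n - k coefficients are zero-filled.

def pack_vector(v, n):
    """Return the coefficient list of pt_v(X) for a ring of dimension n."""
    assert len(v) <= n, "vector must fit in the available slots"
    return list(v) + [0] * (n - len(v))
```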

Packing for matrices is accomplished via diagonal extraction: a matrix $M\in\mathbb{Z}_t^{N \times k}$ is represented by $k$ diagonal vectors $\text{diag}_i(M)$, where $[\text{diag}_i(M)]_j = M[j][(i+j)\bmod k]$ for $j=0,\ldots,N-1$. These are mapped to plaintext polynomials $m_i(X)=\sum_{j=0}^{N-1} \text{diag}_i(M)_j X^j$. Thus, $k$ plaintext polynomials fully represent the matrix $M$, and one ciphertext holds $v$ (Bosworth et al., 12 Dec 2025).
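The diagonal extraction and the resulting mat-vec identity can be checked in the clear; this is a plaintext simulation (no encryption, square $k \times k$ case), with all function names chosen here for illustration:

```python
# Plaintext sketch of diagonal packing: a k x k matrix M is stored as k
# "diagonal" vectors, and M @ v is recovered as the elementwise sum
# sum_i diag_i(M) * Rot^i(v), mirroring the HE computation.

def diagonals(M, k):
    """diag_i(M)[j] = M[j][(i + j) % k], matching the text's indexing."""
    return [[M[j][(i + j) % k] for j in range(k)] for i in range(k)]

def rot(v, i):
    """Left-rotate a packed vector by i slots (the HE Rot^i operation)."""
    return v[i:] + v[:i]

def matvec_diagonal(M, v):
    k = len(v)
    diags = diagonals(M, k)
    out = [0] * k
    for i in range(k):
        rv = rot(v, i)
        for j in range(k):
            out[j] += diags[i][j] * rv[j]   # pointwise plaintext-ciphertext mult
    return out
```

Unrolling the indices shows why it works: slot $j$ accumulates $\sum_i M[j][(i+j)\bmod k]\, v[(i+j)\bmod k] = \sum_l M[j][l]\, v[l]$.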

In CKKS, vectors are encoded via coefficient packing with a prescribed scale $\Delta$, forming $\mathsf{Ecd}_{\text{coeff}}(v) = \sum_{j=0}^{n-1}\lfloor \Delta v_j \rceil X^j$. Matrices are packed column-wise, encrypting each column as a separate RLWE ciphertext (Bae et al., 20 Mar 2025).

Unpacked additively homomorphic encryption, e.g., EC-ElGamal, uses no SIMD; each $x_k$ is encrypted separately without packing, and no polynomial ring structure is exploited (Ramapragada et al., 20 Apr 2025).

2. Homomorphic Matrix-Vector Multiplication Algorithms

BFV-based implementations (SophOMR) rely on repeated slot rotations and pointwise plaintext-ciphertext multiplications to compute $Mv=\sum_{i=0}^{k-1} \text{diag}_i(M) \odot \text{Rot}^i(v)$. To reduce the rotation overhead, a baby-step giant-step (BSGS) optimization is used: factor $k=\tilde{g}\cdot\tilde{b}$, precompute $\tilde{b}$ rotated copies of $ct_v$, then for each giant step $g$ compute partial sums via pointwise multiplication and accumulate using rotations of block-sum ciphertexts. This method decreases the number of expensive rotations and improves throughput (Bosworth et al., 12 Dec 2025).
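A plaintext simulation of the BSGS schedule (ignoring encryption and noise) makes the bookkeeping concrete. The key fact is the rotation identity $\text{Rot}^s(a \odot b) = \text{Rot}^s(a) \odot \text{Rot}^s(b)$, which lets the diagonals be pre-rotated in the clear, so only baby rotations of the ciphertext and one rotation per giant step are needed. Names and structure here are illustrative, not SophOMR's code:

```python
# Plaintext BSGS sketch for k = g * b: precompute b baby rotations of v,
# then for each giant step i, combine pre-rotated diagonals with the
# baby rotations and apply a single block rotation by i*b.

def rot(v, i):
    return v[i:] + v[:i]            # negative i gives a right rotation

def matvec_bsgs(M, v, g, b):
    k = len(v)
    assert g * b == k
    diags = [[M[j][(i + j) % k] for j in range(k)] for i in range(k)]
    baby = [rot(v, j) for j in range(b)]          # b - 1 "expensive" rotations
    out = [0] * k
    for i in range(g):
        block = [0] * k
        for j in range(b):
            d = rot(diags[i * b + j], -i * b)     # plaintext rotation (free)
            for s in range(k):
                block[s] += d[s] * baby[j][s]     # pointwise multiplication
        out = [o + x for o, x in zip(out, rot(block, i * b))]  # g - 1 rotations
    return out
```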

CKKS-based matrix-vector multiplication can be efficiently reduced to two cleartext mat-vec operations over integer matrices modulo $Q_\ell$, together with only $O(1)$ homomorphic operations (rescale, rekeying). Specifically, homomorphic mat-vec equals $[A^{(1)} v_p, B v_p]$ modulo $Q_\ell$, where $A^{(1)}$ and $B$ are unpacked coefficient arrays from the RLWE structure. This approach obviates the need for $O(n)$ rotations, Hadamard products, and key-switches, and instead employs two calls to high-throughput BLAS dgemv plus a homomorphic rescale (Bae et al., 20 Mar 2025).
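The core insight, that decryption is linear so a public matrix can be applied directly to ciphertext components as cleartext modular mat-vec products, can be illustrated with plain (noiseless, toy) LWE rather than CKKS/RLWE, a deliberate simplification; parameters and names here are illustrative only:

```python
# Toy illustration of reducing homomorphic mat-vec to cleartext linear
# algebra: for LWE ciphertexts (a_j, b_j) with b_j = <a_j, s> + m_j mod q,
# applying a public matrix M is just two cleartext mat-vec products mod q,
# with no rotations or key switches. (Noiseless toy scheme, not secure.)

import random

q, n = 2**16, 8                      # toy modulus and LWE dimension
s = [random.randrange(q) for _ in range(n)]

def enc(m):
    a = [random.randrange(q) for _ in range(n)]
    b = (sum(x * y for x, y in zip(a, s)) + m) % q
    return a, b

def dec(a, b):
    return (b - sum(x * y for x, y in zip(a, s))) % q

def matvec_on_ciphertexts(M, cts):
    """Apply M to encrypted v: cleartext integer arithmetic mod q only."""
    out = []
    for row in M:
        a_new = [sum(r * ct[0][i] for r, ct in zip(row, cts)) % q
                 for i in range(n)]
        b_new = sum(r * ct[1] for r, ct in zip(row, cts)) % q
        out.append((a_new, b_new))
    return out
```

The two bulk products (over the `a` components and over the `b` components) are exactly the kind of dense integer mat-vec that maps onto high-throughput BLAS routines.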

For unpacked EC-ElGamal, matrix-vector multiplication leverages Cussen's compression-reconstruction algorithm. Each row vector is compressed to a short support, plaintext-scalar homomorphic multiplications are performed for each unique value, and reconstruction via cumulative homomorphic sums and outer-product accumulation yields the encrypted outputs. This reduces the total expensive EC scalar multiplications from $O(mn)$ to $O(mb)$ for $b$-bit element width, while incurring $O(mn)$ cheap EC point additions (Ramapragada et al., 20 Apr 2025).
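The essence of the trade, one expensive scalar multiplication per distinct row value and cheap additions everywhere else, can be sketched without elliptic curves by letting plain integers stand in for ciphertexts (addition models EC point addition, integer scaling models scalar multiplication). This captures only the value-grouping step, not Cussen's full sort-and-difference recursion:

```python
# Sketch of compression for additively homomorphic mat-vec: group the
# encrypted entries of v by their (cleartext) row coefficient, sum each
# group with cheap "point additions", then do ONE expensive "scalar
# multiplication" per distinct nonzero coefficient.

def row_dot_compressed(row, enc_v):
    """Return (encrypted dot product, #scalar mults, #cheap additions)."""
    buckets, adds = {}, 0
    for a, c in zip(row, enc_v):
        if a in buckets:
            adds += 1                 # cheap homomorphic addition
        buckets[a] = buckets.get(a, 0) + c
    scalar_mults = sum(1 for a in buckets if a != 0)
    result = sum(a * csum for a, csum in buckets.items() if a != 0)
    return result, scalar_mults, adds
```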

3. Computational Complexity and Optimization Strategies

The core cost in BFV-native approaches is the number of rotations and plaintext-ciphertext multiplications. Without optimization, naïve computation requires $k-1$ rotations and $k$ multiplications. BSGS reduces this to $(\tilde{b}-1)+(\tilde{g}-1)$ rotations and $k$ multiplications (for $k=\tilde{g}\cdot\tilde{b}$) (Bosworth et al., 12 Dec 2025).
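These counts are easy to verify directly: the BSGS rotation cost $(\tilde{b}-1)+(\tilde{g}-1)$ is minimized near $\tilde{g}=\tilde{b}=\sqrt{k}$, e.g., 14 rotations instead of 63 for $k=64$. A small helper (illustrative, searching only integer factorizations of $k$) finds the best split:

```python
# Enumerate factorizations k = g * b and return the rotation-minimal
# BSGS split as (rotations, g, b); the naive method needs k - 1 rotations.

def bsgs_rotations(k):
    best = (k - 1, 1, k)                  # naive baseline: b = k, g = 1
    for b in range(1, k + 1):
        if k % b == 0:
            g = k // b
            cost = (b - 1) + (g - 1)
            if cost < best[0]:
                best = (cost, g, b)
    return best
```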

CKKS BLAS-style reduction incurs $2\cdot O(mn)$ plaintext multiplications (performed in high-throughput BLAS), with $O(1)$ overhead for rescale and key-switch. The typical slowdown relative to unencrypted linear algebra is $4$–$12\times$ for matrix sizes of $1\text{K}$–$16\text{K}$, compared to $10^4\times$ for classical homomorphic approaches relying on slot-wise rotations (Bae et al., 20 Mar 2025).

Compression in AHE (EC-ElGamal) adaptively reduces the high-cost scalar multiplications. For modest $b$ ($b\leq 32$), compression achieves $t \ll n$ unique values per row, trading most scalar multiplications for EC point additions. Benchmark speedup is $6\times$ for large $n$ and $b\leq 32$ (Ramapragada et al., 20 Apr 2025).

4. Hardware Acceleration and Parallelism

SophOMR demonstrates that advanced hardware architectures significantly boost practical throughput. The FPGA-accelerated architecture employs three primary cores:

  • PCmul: Plaintext–ciphertext multiplication (using NTT for large integers).
  • CCadd: Ciphertext–ciphertext addition.
  • Rot: Rotation/key-switch, implemented via ApplyGalois and KeySwitch.

The design tunes coefficient-level (PC), instance-level (PI), and NTT-butterfly-level (PB) parallelism (e.g., PC=16, PI=2, PB=64 in the Alveo U55C @ 200 MHz setup) and uses URAM buffers for matrix and rotation keys, with double buffering and pipelined operations. This configuration yielded a $13.9\times$ speedup over a CPU and utilized 77% of DSPs, 76% of BRAM, and 69% of URAM for $N=2^{16}$, $k=50$ (Bosworth et al., 12 Dec 2025).

Further scaling is possible—rotation bottlenecks can be addressed via multiple Rot cores or deeper pipelining of the NTT. A plausible implication is that hardware-aware algorithm selection drives real-world feasibility for metadata-private applications such as OMR (Bosworth et al., 12 Dec 2025).

5. Security, Error Analysis, and Parameter Selection

Security in all settings follows from the respective primitive's hardness assumption: RLWE-based schemes (BFV, CKKS) offer quantum resistance at ring sizes $n\geq 2^{12}$, modulus sizes $Q_0\approx 2^{54}$, and appropriate scales ($\Delta\approx 2^{20}$). CKKS guarantees relative errors $<2^{-15}$ for $m\approx 10^3$, well below single precision; schoolbook AHE EC-ElGamal employs standard curves (P-256, etc.), establishing 128-bit classical security based on DDH (Bae et al., 20 Mar 2025, Ramapragada et al., 20 Apr 2025).

Efficient HE multiplication requires selecting packing sizes $k$ matched to hardware parallelism, bounding error and noise growth (CKKS rescale retains $\Delta$-level scaling), and ensuring feasible decryption (especially for AHE: restricting $m$ to a discretized range so that the exponent-encoded message can be recovered by baby-step/giant-step inversion).
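The baby-step/giant-step inversion mentioned above works because exponent-encoded AHE decryption yields a group element $g^m$ with $m$ known to be small. A sketch over $\mathbb{Z}_p^*$, standing in for an elliptic-curve group, with toy (insecure) parameters:

```python
# Baby-step/giant-step discrete-log recovery: given h = g^m (mod p) with
# 0 <= m < bound, find m in O(sqrt(bound)) group operations. This is how
# the small plaintext is recovered after exponent-ElGamal decryption.

import math

def bsgs_dlog(g, h, p, bound):
    s = math.isqrt(bound) + 1
    baby = {pow(g, j, p): j for j in range(s)}      # baby steps: g^j -> j
    giant = pow(g, -s, p)                           # modular inverse of g^s
    gamma = h
    for i in range(s):                              # giant steps: h * g^(-i*s)
        if gamma in baby:
            return i * s + baby[gamma]
        gamma = gamma * giant % p
    return None                                     # m not in [0, bound)
```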

6. Application Scenarios and Architectural Trade-offs

Homomorphic matrix-vector multiplication is central in privacy-preserving search, AI model inference, and metadata-private communication protocols. In OMR, the server computes encrypted detection of relevant messages with throughput constraints imposed by MatMul cost (Bosworth et al., 12 Dec 2025).

The CKKS BLAS paradigm is tailored for data analytics and AI on encrypted data in cloud environments, leveraging highly optimized floating-point mat-vec routines. Its efficiency suggests broad applicability where scaling, vectorization, and noise control are paramount (Bae et al., 20 Mar 2025).

EC-ElGamal AHE-based methods enable lightweight matrix operations in edge and resource-constrained devices, due to their reduced memory and compute requirements and compatibility with efficient compression (Ramapragada et al., 20 Apr 2025).

Architectural choices reflect security, performance, and resource trade-offs. Deep hardware pipelining, parallel-limb processing, and task-specific packing are essential for high-dimensional workloads.

7. Comparative Performance Metrics

| Scheme / Optimization | Platform | Speedup vs. Baseline | Typical Latency (n=4096, k=50) |
| --- | --- | --- | --- |
| BFV-BSGS + FPGA | Alveo U55C @ 200 MHz | 13.86× (vs. CPU) | 2.150 ms |
| CKKS-BLAS | Xeon E5 / OpenBLAS | 4–12× (vs. BLAS) | ~8 ms |
| EC-ElGamal AHE + compression | Raspberry Pi 5 | 6–10× (vs. schoolbook) | 2.5–5.0 s (n=1000) |

All latency and speedup metrics are verbatim from reported measurements. These figures demonstrate that algorithm/architecture co-design yields dramatic speedups relative to prior art, with hardware acceleration being critical at scale and compression optimizations being essential in edge settings (Bosworth et al., 12 Dec 2025, Bae et al., 20 Mar 2025, Ramapragada et al., 20 Apr 2025).
