Quantum Kernel-based LSTM
- QK-LSTM is a hybrid model that replaces traditional LSTM gate operations with quantum kernel sums to enable expressive nonlinear sequence modeling.
- It employs parameterized quantum circuits to map input histories into high-dimensional quantum feature spaces, enhancing data representation.
- Empirical evaluations across NLP, forecasting, and molecular prediction tasks demonstrate significant parameter reduction and competitive accuracy.
The Quantum Kernel-based Long Short-Term Memory (QK-LSTM) network is a hybrid quantum-classical sequential model that fuses quantum kernel techniques with classical LSTM cell architectures to achieve expressive nonlinear sequence modeling while greatly reducing trainable parameter count. QK-LSTM accomplishes this by embedding input histories into high-dimensional quantum feature spaces via parameterized quantum circuits, replacing the standard LSTM gate affine transforms with weighted quantum kernel sums. This paradigm enables efficient compression, robust performance parity with classical models, and computational practicality for deployment on NISQ-era quantum hardware and edge environments (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024, Beaudoin et al., 29 Apr 2025, Hsu et al., 8 Aug 2025, Lin et al., 4 Dec 2025).
1. Model Architecture and Gate Equations
The QK-LSTM cell preserves the canonical LSTM structure, including the forget ($f_t$), input ($i_t$), cell candidate ($\tilde{C}_t$), and output ($o_t$) gates, together with the cell-state update. However, each gate's standard linear operation is replaced by a kernel-based expansion

$$g_t = \phi\!\left(\sum_{j=1}^{M} \alpha_j^{(g)}\, k\!\left(v_t, x_j^{(g)}\right) + b^{(g)}\right), \qquad v_t = [h_{t-1}, x_t],$$

where the $x_j^{(g)}$ are reference vectors, the $\alpha_j^{(g)}$ and $b^{(g)}$ are trainable scalars, $k(\cdot,\cdot)$ is a quantum kernel, and $\phi$ is either the sigmoid (for $f_t$, $i_t$, $o_t$) or $\tanh$ (for $\tilde{C}_t$). The cell-state update and hidden-state readout retain their classical forms:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \qquad h_t = o_t \odot \tanh(C_t).$$

These modifications unify the sequential memory of LSTM with quantum-encoded nonlinear modeling (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024).
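To make the gate replacement concrete, the following is a minimal NumPy sketch of a single QK-LSTM step, assuming per-hidden-unit coefficient matrices for each gate and treating the quantum kernel as an opaque callable; names and shapes are illustrative rather than drawn from the cited implementations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def qk_lstm_step(x_t, h_prev, c_prev, refs, alphas, biases, kernel):
    """One QK-LSTM cell update (sketch).

    refs[g]   : (M, d) reference vectors x_j for gate g in {"f", "i", "g", "o"}
    alphas[g] : (H, M) trainable kernel coefficients (assumed layout)
    biases[g] : (H,)   trainable biases
    kernel    : callable k(v, x_j) -> float, e.g. a quantum kernel estimate
    """
    v_t = np.concatenate([h_prev, x_t])  # joint input [h_{t-1}, x_t]

    def gate(name, activation):
        # weighted quantum-kernel sum replaces the affine transform W v_t + b
        k_vals = np.array([kernel(v_t, x_j) for x_j in refs[name]])  # (M,)
        return activation(alphas[name] @ k_vals + biases[name])      # (H,)

    f_t = gate("f", sigmoid)   # forget gate
    i_t = gate("i", sigmoid)   # input gate
    g_t = gate("g", np.tanh)   # cell candidate
    o_t = gate("o", sigmoid)   # output gate

    c_t = f_t * c_prev + i_t * g_t   # classical cell-state update
    h_t = o_t * np.tanh(c_t)         # classical hidden-state readout
    return h_t, c_t
```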
2. Quantum Feature Map and Kernel Computation
QK-LSTM employs an explicit quantum feature map $x \mapsto |\phi(x)\rangle = U(x)\,|0\rangle^{\otimes n}$, realized via an $n$-qubit parameterized circuit $U(x)$ acting on the initial state $|0\rangle^{\otimes n}$:
- Hadamard layer: $H^{\otimes n}$ applied to all qubits to create a uniform superposition.
- Data-encoding rotations: single-qubit rotations (e.g., $R_y$) whose angles encode the components of the input vector $x$.
- Entangling layer: CNOT gates coupling neighboring qubits.
The quantum kernel between two vectors is defined by the squared inner product of their quantum feature states:

$$k(x, x') = \left|\langle \phi(x) \mid \phi(x') \rangle\right|^2 = \left|\langle 0|^{\otimes n}\, U^\dagger(x)\, U(x')\, |0\rangle^{\otimes n}\right|^2.$$

The reference set $\{x_j\}$ can be drawn from training data, learned during optimization, or constructed as block-encoded anchors (Hsu et al., 12 Dec 2024, Beaudoin et al., 29 Apr 2025).
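A minimal sketch of this feature map and fidelity kernel, assuming a PennyLane-style simulation with $R_y$ data-encoding rotations and a ring of CNOTs (the exact encoding in the cited works may differ), is:

```python
import numpy as np
import pennylane as qml

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

def feature_map(x):
    """Hadamard layer, angle-encoding rotations, ring of CNOTs (len(x) == n_qubits)."""
    for w in range(n_qubits):
        qml.Hadamard(wires=w)
        qml.RY(x[w], wires=w)
    for w in range(n_qubits):
        qml.CNOT(wires=[w, (w + 1) % n_qubits])

@qml.qnode(dev)
def kernel_circuit(x1, x2):
    # compute-uncompute trick: apply U(x1), then U^dagger(x2), and measure
    feature_map(x1)
    qml.adjoint(feature_map)(x2)
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    # probability of the all-zeros outcome equals |<phi(x1)|phi(x2)>|^2
    return kernel_circuit(x1, x2)[0]

print(quantum_kernel(np.array([0.1, 0.5, 0.9, 0.3]),
                     np.array([0.2, 0.4, 0.8, 0.1])))
```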
3. Parameter Efficiency and Training Methodology
QK-LSTM achieves significant parameter compression by replacing the standard gate weight matrices of a classical LSTM, on the order of $4\left(n_h(n_x + n_h) + n_h\right)$ parameters for input dimension $n_x$ and hidden dimension $n_h$, with a small set of per-gate kernel coefficients $\{\alpha_j^{(g)}, b^{(g)}\}$. Empirical cases report compression from 1,873 to 209 parameters (AQI task), or a roughly order-of-magnitude reduction from hundreds of thousands of parameters (molecular property prediction), while retaining or exceeding classical model accuracy (Hsu et al., 12 Dec 2024, Beaudoin et al., 29 Apr 2025). Gradients with respect to kernel weights and quantum circuit parameters are computed via backpropagation through time (BPTT) and the quantum parameter-shift rule,

$$\frac{\partial k}{\partial \theta} = \frac{1}{2}\left[k\!\left(\theta + \tfrac{\pi}{2}\right) - k\!\left(\theta - \tfrac{\pi}{2}\right)\right].$$

Optimizers such as SGD or Adam are employed, and quantum circuits are kept shallow for NISQ viability (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024).
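The parameter bookkeeping and the parameter-shift estimate can be sketched as follows; the coefficient layout assumed for the QK-LSTM count (one bias and $M$ coefficients per hidden unit per gate) is illustrative, and the exact accounting varies across the cited papers.

```python
import numpy as np

def lstm_param_count(n_x, n_h):
    # four gates, each a weight matrix over [h_{t-1}, x_t] plus a bias vector
    return 4 * (n_h * (n_x + n_h) + n_h)

def qk_lstm_param_count(n_h, M):
    # four gates, each with M kernel coefficients and a bias per hidden unit
    return 4 * n_h * (M + 1)

print(lstm_param_count(n_x=8, n_h=16))    # 1600 for the classical LSTM
print(qk_lstm_param_count(n_h=16, M=4))   # 320 for the kernel-based cell

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Parameter-shift estimate of df/dtheta for a circuit expectation f(theta)."""
    return 0.5 * (f(theta + shift) - f(theta - shift))
```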
4. Empirical Performance and Applications
QK-LSTM demonstrates robust performance across diverse tasks:
- NLP/sequence modeling: Retains accuracy and convergence speed for part-of-speech tagging with 60% fewer parameters than a classical LSTM (Hsu et al., 20 Nov 2024).
- Climate time-series forecasting: Outperforms LSTM for AQI prediction with 42.8% lower RMSE, 31.9% lower MAPE, and higher $R^2$, using just 209 parameters (Hsu et al., 12 Dec 2024).
- Molecular property prediction: Maintains ROC-AUC parity with classical models on side-effect classification, allowing a 10× parameter reduction and leveraging SELFIES augmentation for statistically significant gains (Beaudoin et al., 29 Apr 2025).
- Human activity recognition (federated setting): Achieves higher accuracy (0.95 vs. 0.90) and uses 32% fewer parameters compared to federated classical LSTM across edge clients (Hsu et al., 8 Aug 2025).
- Meta-optimization for QAOA: As a "learning-to-learn" optimizer, QK-LSTM outpaces classical and other quantum sequence models in convergence rate and approximation ratio on Max-Cut, with superior transferability using only 43 trainable parameters (Lin et al., 4 Dec 2025).
Application domains include NLP, time-series/signal forecasting, quantum optimization, molecular machine learning, federated learning, and edge/NISQ computing (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024, Beaudoin et al., 29 Apr 2025, Hsu et al., 8 Aug 2025, Lin et al., 4 Dec 2025).
5. Hybrid and Federated QK-LSTM Variants
Hybrid architectures augment QK-LSTM cells with convolutional feature extractors (DeepConv-QK-LSTM) in time-series contexts such as human activity recognition. Each client node in federated training computes local convolutional embeddings, applies QK-LSTM with local quantum kernel blocks, and synchronizes global parameters via federated averaging. Only classical coefficients are shared; no quantum data leaves the client. Block-product circuits enable shallow depth and support partitioned large-scale deployment (Hsu et al., 8 Aug 2025).
A typical federated loop is:
```
// Server
initialize θ^0 = {conv weights W, kernel coefficients β}
for t = 1 to T:
    select a client subset S_t
    for each client k in S_t:
        send θ^t to k;  θ_k^t ← ClientUpdate(k, θ^t)
    aggregate {θ_k^t} into θ^{t+1}
return θ^T

// ClientUpdate(k, θ)
for local epoch e:
    for each batch B:
        forward: DeepConv features, QK-LSTM gates via quantum kernels
        compute loss, backward pass, update θ ← θ - η ∇_θ L
return updated θ
```
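A minimal server-side aggregation step consistent with this loop is a FedAvg-style weighted average of the exchanged classical parameters; the function and variable names below are illustrative, not taken from the cited implementation.

```python
import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client parameter dicts (name -> np.ndarray)."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = sum((n / total) * params[name]
                               for params, n in zip(client_params, client_sizes))
    return aggregated
```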
6. Implementation Aspects and NISQ/Edge Deployment
QK-LSTM circuits are tailored for present-day NISQ devices:
- Quantum resources: Small qubit counts (typically 4–8), shallow circuit depth, and block-product encodings for dimensional scalability.
- Noise robustness: Low-depth circuits mitigate decoherence; error-mitigation techniques such as zero-noise extrapolation are suggested to preserve fidelity.
- Gradient estimation: Parameter-shift rule and stochastic estimation reduce quantum resource demands.
- Hardware support: Classical simulation feasible via tensor networks or GPU contraction; real hardware deployment aligns with edge/NISQ constraints (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024, Hsu et al., 8 Aug 2025).
- Distributed/HPC strategies: Reference sets can be partitioned across quantum coprocessors or simulated nodes (Hsu et al., 12 Dec 2024).
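As a schematic of the partitioning idea in the last point (not the papers' implementation), the kernel expansion for a gate can be split over shards of the reference set, with each coprocessor or simulated node returning a partial sum:

```python
def partial_kernel_sum(v_t, ref_shard, alpha_shard, kernel):
    # a worker evaluates quantum kernels only against its own shard of references
    return sum(a * kernel(v_t, x_j) for a, x_j in zip(alpha_shard, ref_shard))

def gate_preactivation(v_t, ref_shards, alpha_shards, bias, kernel):
    # the coordinator sums the partial contributions and adds the bias
    return sum(partial_kernel_sum(v_t, shard, alphas, kernel)
               for shard, alphas in zip(ref_shards, alpha_shards)) + bias
```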
7. Limitations and Open Research Directions
Key challenges and open questions:
- Reference set selection: The optimal choice and size of the reference set $\{x_j\}$ remain open problems, with a trade-off between expressivity and computational cost.
- Kernel design: Deeper ansätze or richer entangling structures may extract more nuanced features but increase circuit depth and sensitivity to noise.
- Shot noise and measurement variance: Estimating each kernel value from repeated measurements introduces statistical variance (see the note after this list); adaptive measurement strategies are required for precision.
- Classical-quantum boundary: No statistically significant performance gap between QK-LSTM and classical LSTM on certain tasks, suggesting parity rather than quantum advantage at small scale (Beaudoin et al., 29 Apr 2025).
- Hybridization: Integration with advanced preprocessing (SELFIES/SMILES augmentation, convolutional encoders) is effective but requires further domain-specific tuning.
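As a back-of-the-envelope illustration of the shot-noise point above (standard binomial reasoning, not a result from the cited papers): if each kernel value is estimated as the frequency of the all-zeros outcome over $N$ shots, then

$$\hat{k} = \frac{1}{N}\sum_{s=1}^{N} z_s, \qquad \operatorname{Var}[\hat{k}] = \frac{k(1-k)}{N},$$

so the number of shots must grow roughly as $1/\epsilon^2$ to resolve kernel values to precision $\epsilon$.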
A plausible implication is that QK-LSTM architectures offer practical, parameter-efficient hybrid models for sequential data under hardware constraints, even as expressive capacity is bounded by accessible quantum features. Their primary impact, as demonstrated, is model compression with performance parity and the enablement of hybrid quantum-classical computation in resource-limited or privacy-preserving settings (Hsu et al., 20 Nov 2024, Hsu et al., 12 Dec 2024, Beaudoin et al., 29 Apr 2025, Hsu et al., 8 Aug 2025, Lin et al., 4 Dec 2025).