One-Class SVM: Smooth CVaR Optimization
- The paper introduces a novel OC-SVM framework that integrates smooth CVaR surrogates with signature-based embeddings to achieve tractable risk calibration.
- The methodology uses shuffle-product identities to derive closed-form polynomial surrogates, leading to explicit error bounds and enhanced hypothesis testing.
- Empirical evaluations in anomalous diffusion and RNA modification detection demonstrate improved type I/II error control and increased detection power over traditional methods.
One-class Support Vector Machine (OC-SVM) algorithms optimising smooth Conditional Value-at-Risk (CVaR) objectives constitute a significant advance in novelty detection within path spaces, connecting sequential data analysis, statistical learning, and probability in function spaces. This class of algorithms exploits signature-based feature embeddings and the shuffle-product structure, enabling closed-form polynomial surrogates for risk-sensitive test statistics and new theoretical guarantees for error control and statistical power in hypothesis testing settings (Gasteratos et al., 2 Dec 2025).
1. Signature-based Features and Smooth CVaR Surrogates
Let $X$ denote a path whose signature $S(X)$ is taken up to truncation level $N$. To approximate the positive part $x \mapsto \max(x, 0)$ on a compact interval, a polynomial $Q_n(x) = \sum_{i=0}^{n} a_i x^i$ is introduced. The smooth CVaR surrogate is then defined by
$f_n(w, \rho) = \rho + \frac{1}{1-\alpha}\, E_\mu\left[ Q_n\big(\langle w, S(X) \rangle - \rho\big) \right].$
Employing the shuffle-product identity, $(\langle w, S \rangle)^i = \langle w^{\shuffle i}, S \rangle$, Theorem 3.1 shows the surrogate may be rewritten as
$E_\mu\left[ Q_n(\langle w, S(X) \rangle - \rho) \right] = \langle Q_n^{\shuffle}(w - \rho 1), E_\mu[S(X)] \rangle$
where $Q_n^{\shuffle}(\ell) = \sum_{i=0}^n a_i \ell^{\shuffle i} \in (T^{nN}(\mathbb{R}^d))^*$. The result is an explicit polynomial whose coefficients depend only on the $a_i$ and shuffle-powers of $w - \rho 1$. This surrogate admits closed-form computation, substantially improving tractability for high-dimensional path data.
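The shuffle-power construction above is directly implementable. The following is a minimal Python sketch, not taken from the paper: linear functionals are represented as dictionaries over words (tuples of letters), $\ell^{\shuffle i}$ is built by repeated shuffling, and the pairing $\langle Q_n^{\shuffle}(w - \rho 1), E_\mu[S(X)] \rangle$ is evaluated against a toy expected-signature dictionary; all numeric coefficients are illustrative.

```python
from collections import defaultdict

def shuffle_words(u, v):
    """Shuffle product of two words (tuples of letters) -> {word: multiplicity}."""
    if not u:
        return {v: 1}
    if not v:
        return {u: 1}
    out = defaultdict(int)
    for w, c in shuffle_words(u[:-1], v).items():   # last letter drawn from u
        out[w + (u[-1],)] += c
    for w, c in shuffle_words(u, v[:-1]).items():   # last letter drawn from v
        out[w + (v[-1],)] += c
    return dict(out)

def shuffle_functionals(f, g):
    """Bilinear extension of the word shuffle to functionals {word: coefficient}."""
    out = defaultdict(float)
    for u, a in f.items():
        for v, b in g.items():
            for w, c in shuffle_words(u, v).items():
                out[w] += a * b * c
    return dict(out)

def q_shuffle(coeffs, ell):
    """Q_n^shuffle(ell) = sum_i a_i * ell^{shuffle i}, with coeffs = [a_0, ..., a_n]."""
    power, out = {(): 1.0}, defaultdict(float)      # ell^{shuffle 0} = empty word
    for a in coeffs:
        for w, c in power.items():
            out[w] += a * c
        power = shuffle_functionals(power, ell)     # advance to next shuffle power
    return dict(out)

# Pairing <Q_n^shuffle(w - rho*1), E[S(X)]> for a toy functional on letters {1, 2}:
w, rho = {(1,): 0.7, (2,): -0.3}, 0.5
ell = {(): -rho, **w}                               # the functional w - rho*1
surrogate = q_shuffle([0.0, 0.5, 0.25], ell)        # illustrative a_i
E_sig = {(): 1.0, (1,): 0.1, (2,): 0.0, (1, 1): 0.05, (1, 2): 0.02,
         (2, 1): 0.02, (2, 2): 0.04}                # toy expected signature
value = sum(c * E_sig.get(word, 0.0) for word, c in surrogate.items())
```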
2. OC-SVM Formulations and Optimisation Problems
The OC-SVM framework is cast as minimising a regularised CVaR of negative scoring functionals. Given a feature map $\Phi$, typically the truncated signature $S$ or its infinite-level version, the population-level objective reads
$\min_{w, \rho}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{1-\alpha}\, E_\mu\big[ (\langle w, \Phi(X) \rangle - \rho)_+ \big].$
Replacing the positive part $(\cdot)_+$ by the smooth surrogate $Q_n$ yields the smooth-CVaR OC-SVM problem
$\min_{w, \rho}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{1-\alpha}\, E_\mu\big[ Q_n(\langle w, S(X) \rangle - \rho) \big].$
For empirical OC-SVM with finite samples $X_1, \dots, X_m$, the unconstrained primal is
$\min_{w, \rho}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{(1-\alpha)m} \sum_{i=1}^m \big(\langle w, \Phi(X_i) \rangle - \rho\big)_+.$
A constrained quadratic program variant introduces slack variables $\xi_i \ge 0$:
$\min_{w, \rho, \xi}\ \frac{1}{2}\|w\|^2 + \rho + \frac{1}{(1-\alpha)m} \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad \xi_i \ge \langle w, \Phi(X_i) \rangle - \rho,\ \ \xi_i \ge 0.$
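As a concrete illustration of the unconstrained primal above, the following numpy sketch minimises the hinge-type objective by full-batch subgradient descent over precomputed feature vectors $\Phi(X_i)$; the step size and iteration budget are arbitrary choices, not the paper's.

```python
import numpy as np

def ocsvm_primal_subgradient(Phi, alpha=0.9, lr=0.01, n_iter=2000):
    """Subgradient descent on
    (1/2)||w||^2 + rho + 1/((1-alpha) m) * sum_i max(<w, Phi_i> - rho, 0).
    Phi: (m, p) array of feature vectors, e.g. truncated signatures."""
    m, p = Phi.shape
    w, rho = np.zeros(p), 0.0
    c = 1.0 / ((1.0 - alpha) * m)
    for _ in range(n_iter):
        margins = Phi @ w - rho
        active = (margins > 0).astype(float)   # subgradient of the hinge term
        grad_w = w + c * (Phi.T @ active)      # d/dw of quadratic + hinge sum
        grad_rho = 1.0 - c * active.sum()      # d/drho
        w -= lr * grad_w
        rho -= lr * grad_rho
    return w, rho
```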
3. Dual Formulation and Signature Kernels
The dual form of the constrained OC-SVM is
$\min_{\lambda \in \mathbb{R}^m}\ \frac{1}{2}\, \lambda^\top K \lambda \quad \text{s.t.} \quad 0 \le \lambda_i \le \frac{1}{(1-\alpha)m},\ \ \sum_{i=1}^m \lambda_i = 1,$
for kernel matrix $K_{ij} = k(X_i, X_j) = \langle \Phi(X_i), \Phi(X_j) \rangle$. With $\Phi = S$, the signature kernel is
$k(X, Y) = \langle S(X), S(Y) \rangle.$
Given solution $\lambda^*$, the primal vector is $w^* = \sum_{i=1}^m \lambda_i^* \Phi(X_i)$ and the test-score for a new path $X$ is $T(X) = \langle w^*, \Phi(X) \rangle = \sum_{i=1}^m \lambda_i^* k(X_i, X)$. In the smooth-CVaR population version, the expected signature $E_\mu[S(X)]$ enters via the closed-form surrogate, replacing empirical averages.
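In practice the dual can be handed to an off-the-shelf solver. A minimal sketch, assuming the iisignature package for truncated signatures and scikit-learn's OneClassSVM with a precomputed kernel; the identification $\nu = 1 - \alpha$ of scikit-learn's nu parameter with the CVaR level is our reading of the box constraint above.

```python
import numpy as np
import iisignature
from sklearn.svm import OneClassSVM

def sig_features(paths, level):
    """Truncated signatures (constant term included) for a list of (T, d) paths."""
    feats = [np.concatenate([[1.0], iisignature.sig(p, level)]) for p in paths]
    return np.asarray(feats)

def fit_dual_ocsvm(train_paths, level=3, alpha=0.9):
    Phi = sig_features(train_paths, level)
    K = Phi @ Phi.T                                  # signature kernel <S(X), S(Y)>
    model = OneClassSVM(kernel="precomputed", nu=1.0 - alpha).fit(K)
    return model, Phi

def score(model, Phi_train, test_paths, level=3):
    Phi_test = sig_features(test_paths, level)
    # cross-kernel between test and training paths, as the precomputed API expects
    return model.decision_function(Phi_test @ Phi_train.T)
```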
4. Theoretical Error Bounds: Type I and Power
Denoting the test statistic by $T(X) = \langle w, S(X) \rangle$, the following bounds are established:
- Type I Error (Theorem 3.4): If $\mu$ obeys a transportation-cost inequality, a condition satisfied by Gaussian measures and laws of RDE solutions, then there exist constants $c_1, c_2 > 0$ giving a Weibull-type tail bound of the form
$P_\mu\big(T(X) \ge t\big) \le c_1 \exp(-c_2 t^{\theta}),$
where the exponent $\theta$ reflects the signature truncation and the constants depend on the deviation scale. Inverting the bound yields a quantile bound and super-uniform p-values (a one-line conversion is sketched after this list).
- Type II Error (Power, Theorem 3.3): For alternatives $\nu$ with finite first moment, the power of the test is bounded below by an explicit function of the relative entropy $H(\nu \mid \mu)$. Thus, finite relative entropy ensures nontrivial lower bounds on power.
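Given a tail bound of the Weibull form above with known constants, converting an observed statistic into a super-uniform p-value is mechanical. A minimal sketch, with placeholder constants $c_1, c_2, \theta$ rather than values from the paper:

```python
import numpy as np

def weibull_pvalue(t, c1, c2, theta):
    """Super-uniform p-value from a tail bound P(T >= t) <= c1 * exp(-c2 * t^theta).
    Any upper bound on the null tail probability, clipped at 1, is a valid p-value."""
    return np.minimum(1.0, c1 * np.exp(-c2 * np.asarray(t, dtype=float) ** theta))
```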
5. Algorithmic Procedure and Practical Considerations
Population-level smooth-CVaR OC-SVM is implemented via the following high-level steps:
- Empirical expected signature: $\hat E = \frac{1}{m} \sum_{i=1}^m S(X_i)$
- Surrogate objective: $\hat f_n(\rho) = \rho + \frac{1}{1-\alpha} \langle Q_n^{\shuffle}(w - \rho 1), \hat E \rangle$
- Joint optimisation: $\min_{w, \rho}\ \frac{1}{2}\|w\|^2 + \hat f_n(\rho)$
- Solved by alternation over $(w, \rho)$ or, since the first-order condition in $\rho$ is polynomial, by explicit root-finding when the degree $n$ is small.
- Test statistic: $T(X) = \langle w^*, S(X) \rangle$
- Hypothesis rejection: Reject $H_0$ if $T(X)$ exceeds the calibrated threshold $\rho^*$.
For sample-based OC-SVM, standard primal/dual QP solvers with signature kernels are used (e.g., LIBSVM, ThunderSVM). At test time, $T(X)$ is compared to the learned bias $\rho^*$.
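For the empirical-calibration route, a standard held-out (conformal-style) construction produces finite-sample super-uniform p-values from calibration scores. A generic sketch, not code from the paper:

```python
import numpy as np

def heldout_pvalues(scores_cal, scores_test):
    """p(x) = (1 + #{calibration scores >= T(x)}) / (n_cal + 1),
    super-uniform under the null when calibration and test scores are exchangeable."""
    scores_cal = np.sort(np.asarray(scores_cal))
    n = scores_cal.size
    # number of calibration scores >= each test score, via binary search
    counts = n - np.searchsorted(scores_cal, np.asarray(scores_test), side="left")
    return (1.0 + counts) / (n + 1.0)
```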
6. Empirical Evaluation: Diffusion and Molecular Biology
Anomalous Diffusion
- Setup: Binary discrimination between standard Brownian motion and a “spiked-BM” alternative, a Brownian path carrying an added spike whose magnitude is governed by a spike parameter.
- Statistic: Signature-based distance between empirical expected signatures (a numpy sketch follows this results list); also, a linear form in the signature features.
- Results:
- AUROC versus the spike parameter: monotonic increase, with no sharp phase transition at the critical value.
- Type I and II control: Empirical p-values (held-out calibration) give marginal FDR control but high conditional variability. The Weibull tail-bound (Theorem 3.4) applied to finite samples yields super-uniform p-values and tighter FDR and FPR control.
- Comparison: Signature-based distance outperforms TAMSD and is competitive with kernelised OC-SVM.
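A numpy-only illustration of the signature-distance statistic referenced above: level-$\le 2$ signature terms are computed exactly for the piecewise-linear interpolation of each path, and the empirical expected signatures of two path samples are compared; all path parameters are illustrative.

```python
import numpy as np

def sig_level2(path):
    """Signature terms up to level 2 of a (T, d) path, exact for its
    piecewise-linear interpolation."""
    dX = np.diff(path, axis=0)                  # increments dX_i
    left = path[:-1] - path[0]                  # X(t_i) - X(0), left endpoints
    s1 = dX.sum(axis=0)                         # level-1 terms: total increment
    s2 = left.T @ dX + 0.5 * dX.T @ dX          # level-2 iterated integrals
    return np.concatenate([s1, s2.ravel()])

def expected_sig_distance(paths_a, paths_b):
    """|| E_hat[S(A)] - E_hat[S(B)] || between two samples of (T, d) paths."""
    Ea = np.mean([sig_level2(p) for p in paths_a], axis=0)
    Eb = np.mean([sig_level2(p) for p in paths_b], axis=0)
    return np.linalg.norm(Ea - Eb)

# Toy null check on two Brownian samples: the distance should be small.
rng = np.random.default_rng(0)
bm = lambda: np.cumsum(rng.normal(scale=0.1, size=(200, 2)), axis=0)
print(expected_sig_distance([bm() for _ in range(500)], [bm() for _ in range(500)]))
```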
RNA Modification Detection
- Data: Synthetic 100-nt oligos, three modifications (inosine, m5C, and one further modification) at fixed positions; Nanopore direct RNA reads (Leger et al. 2021).
- Preprocessing: Dorado basecalling, Uncalled4 event alignment, per-base segmentation.
- Methods:
- OC-SVM on signature features (truncated signatures with time-augmentation and invisibility-reset transforms), $3000$ unmodified reads per site, p-values from held-out reads.
- OC-SVM on standard 2D features (mean current and dwell time).
- Results: At BH–FDR level $0.20$, signature OC-SVM yields substantially higher recall (power) for all modification types, with type I error controlled at nominal level.
7. Connections, Scope, and Implications
These developments bridge hypothesis testing, path signatures (Lyons et al.), transportation-cost inequalities (Gasteratos and Jacquier 2023), and robust machine learning. The use of smooth CVaR surrogates via shuffle-product identities establishes new analytic techniques for risk calibration and empirical p-value calculation. Non-asymptotic bounds on error rates generalise beyond Gaussian settings to laws of rough differential equation solutions, supporting broader applications in anomalous diffusion analysis and molecular biology. A plausible implication is further cross-fertilisation with time-series anomaly detection and functional data analysis, leveraging closed-form population objectives and signature kernel methods.
The principal contribution is the integration of population-level risk surrogates, shuffle-product algebra, and theoretical guarantees for novelty detection (Gasteratos et al., 2 Dec 2025). This framework enables more refined control of type I and type II errors and supports robust calibration for high-dimensional non-Euclidean data spaces.