PaAno: Patch-Based Learning for Anomaly Detection

Updated 28 March 2026

The paper introduces PaAno, an unsupervised framework that segments time series into normalized, overlapping patches to enhance anomaly detection precision.
It employs a compact 1D-CNN encoder with triplet and pretext losses to create a discriminative embedding space that effectively isolates anomalous patterns.
Empirical evaluations on TSB-AD benchmarks demonstrate state-of-the-art performance with low computational overhead and strong resilience to concept drift.

Patch-based representation learning for time-series anomaly detection ("PaAno") refers to a family of unsupervised algorithms that segment time series into short, overlapping patches, encode each patch into a vector representation, and detect anomalies by how dissimilar new patch embeddings are from reference embeddings built from normal training data. This paradigm has emerged as an efficient yet accurate alternative to heavy transformer-based and foundation-model approaches. By imposing a locality bias and leveraging discriminative or self-supervised objectives, patch-based methods such as PaAno achieve state-of-the-art performance on modern, rigorously evaluated benchmarks, with strong resilience to overfitting and concept drift and with low real-world operating costs (Park et al., 1 Feb 2026).

1. Patch Extraction and Normalization

PaAno and similar methods use a sliding window of fixed length to extract overlapping patches from the input time series. For a univariate or multivariate sequence $\mathbf X = (\mathbf x_1,\dots,\mathbf x_N)$ with $\mathbf x_t\in\mathbb R^d$, a window of length $L$ moved with stride $s$ produces a collection of patches, each of which is then normalized. For stride $s=1$ the collection is:

$\mathcal P = \{\mathbf p_t = (\mathbf x_t,\dots,\mathbf x_{t+L-1})\}_{t=1}^{N-L+1}$

A key distinguishing feature is the application of instance normalization to each patch: for every channel, the patch mean is subtracted and the result is divided by the patch standard deviation, producing zero-mean, unit-variance segments. This operation empirically enhances robustness to nonstationarity and mitigates distributional shifts. In ablation, removing patch-level normalization degrades range-wise precision by approximately 5 percentage points (Park et al., 1 Feb 2026).
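The extraction and normalization steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `extract_patches` and the small constant `eps` (added to avoid division by zero on constant patches) are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def extract_patches(X, L, s=1, eps=1e-8):
    """Return instance-normalized patches of shape (num_patches, L, d)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:                      # treat a univariate series as d = 1
        X = X[:, None]
    # sliding_window_view appends the window axis last: (N - L + 1, d, L)
    windows = sliding_window_view(X, L, axis=0)
    patches = windows.transpose(0, 2, 1)[::s]   # (num_patches, L, d)
    # Per-patch, per-channel statistics over the temporal axis
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)        # eps is an assumed stabilizer


X = np.arange(20.0).reshape(10, 2)       # N = 10 time steps, d = 2 channels
P = extract_patches(X, L=4, s=2)
print(P.shape)                           # (4, 4, 2)
```

Each returned patch has zero mean and (up to `eps`) unit variance in every channel, so a level shift or rescaling of the raw series leaves the patches unchanged.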

2. Patch Encoder Architectures

Contemporary PaAno pipelines employ compact 1D-CNN encoders to map each normalized patch $\mathbf p\in\mathbb R^{L\times d}$ to a fixed embedding $\mathbf h\in\mathbb R^l$. A canonical architecture is:

Four stacked Conv1D-BatchNorm-ReLU blocks whose kernel sizes shrink from 7 to 3 and whose channel widths expand to 256 before narrowing to 64 (see the table below).
Global average pooling across the temporal axis, yielding a low-dimensional (e.g., $l=64$) patch-level vector.

Layer	Kernel Size	Channels	Activation
Conv1D-1	7	$d\to128$	ReLU
Conv1D-2	5	$128\to256$	ReLU
Conv1D-3	3	$256\to128$	ReLU
Conv1D-4	3	$128\to64$	ReLU
Global Avg. Pooling	—	64	—

Lightweight CNN encoders have been shown to outperform much heavier Transformer or foundation-model backbones when paired with appropriate objectives and memory-efficient scoring (Park et al., 1 Feb 2026). Only the CNN encoder is retained during inference; the auxiliary heads used by the training objectives (Section 3) are discarded.
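As a rough check on how compact this encoder is, the parameter count implied by the table can be worked out directly. This is a back-of-the-envelope sketch rather than a reported figure: it assumes each convolution has a bias term and each BatchNorm layer has two learnable parameters per channel, and the function names are illustrative.

```python
def conv_block_params(c_in, c_out, kernel):
    """Parameters in one Conv1D-BatchNorm-ReLU block."""
    conv = kernel * c_in * c_out + c_out   # weights + bias (bias is assumed)
    batch_norm = 2 * c_out                 # learnable scale and shift
    return conv + batch_norm


def encoder_params(d):
    """Total parameters of the four-block encoder for d input channels."""
    layers = [(d, 128, 7), (128, 256, 5), (256, 128, 3), (128, 64, 3)]
    return sum(conv_block_params(*layer) for layer in layers)


print(encoder_params(1))    # 289344
print(encoder_params(25))   # 310848
```

Under these assumptions the encoder stays near 0.3M parameters, and each additional input channel adds only 896 weights because only the first layer depends on $d$.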

3. Training Objectives: Metric Learning and Pretext Tasks

PaAno uses a compound objective to structure the patch embedding space for anomaly isolation:

Triplet Loss: For each anchor patch, a positive (temporally proximal) patch and the farthest negative (the batch sample at maximal cosine distance from the anchor) are encoded through a nonlinear head $g_\theta$. The loss

$\mathcal L_{\mathrm{triplet}} = \frac{1}{M}\sum_{i=1}^M \max\left\{0,\, d(\mathbf z_i,\mathbf z^+_i) - d(\mathbf z_i,\mathbf z^-_i) + \delta \right\}$

(with $d(u,v)=1-\frac{u^\top v}{\|u\|\|v\|}$ and margin $\delta=0.5$) encourages local temporal similarity and global dissimilarity for non-overlapping windows.
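The triplet loss with farthest-negative mining can be sketched in NumPy as below. This is an illustrative forward computation only, with no gradients. It assumes the negative candidates are the other anchors in the same batch, and the function names are hypothetical.

```python
import numpy as np


def _unit(A, eps=1e-12):
    """Row-normalize to unit length (eps guards against zero vectors)."""
    return A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)


def triplet_loss_farthest_negative(z, z_pos, delta=0.5):
    """z, z_pos: (M, l) projected anchor and positive embeddings."""
    zn, pn = _unit(z), _unit(z_pos)
    d_pos = 1.0 - np.sum(zn * pn, axis=1)      # d(z_i, z_i^+)
    d_all = 1.0 - zn @ zn.T                    # (M, M) anchor-to-batch distances
    d_neg = d_all.max(axis=1)                  # farthest sample in the batch
    return np.maximum(0.0, d_pos - d_neg + delta).mean()


z = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(triplet_loss_farthest_negative(z, z))    # 0.0: positives coincide with anchors
```

In the example every anchor has a batch member at least distance 1 away, so with identical positives the hinge is inactive and the loss is zero.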

Pretext Loss: A binary classification task on patch pairs. Each anchor and its preceding patch form a positive pair, while pairs formed with $U$ random patches serve as negatives. A classification head $c_\theta$ predicts $\operatorname{Pr}(\text{is-adjacent})$ and is trained with cross-entropy. This objective is weighted heavily during the early stages of training and annealed to zero thereafter, promoting encoder sensitivity to ordering in the patch manifold.

Combined, the final loss is

$\mathcal L = \mathcal L_{\mathrm{triplet}} + \lambda\,\mathcal L_{\mathrm{pretext}}$

with $\lambda$ decaying linearly to zero over the first 10% of iterations (Park et al., 1 Feb 2026).
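One way to realize this schedule is a linear ramp that reaches zero at the 10% mark and stays there. The sketch below is an assumption-laden illustration: the initial weight `lam0 = 1.0` is not given in the text, and the function name is hypothetical.

```python
def pretext_weight(step, total_steps, lam0=1.0, warm_frac=0.10):
    """Linearly anneal the pretext-loss weight to zero over the first
    warm_frac of training; lam0 is an assumed starting value."""
    cutoff = warm_frac * total_steps
    if step >= cutoff:
        return 0.0
    return lam0 * (1.0 - step / cutoff)


for step in (0, 50, 100, 500):
    print(step, pretext_weight(step, total_steps=1000))
# 0 1.0
# 50 0.5
# 100 0.0
# 500 0.0
```

After the cutoff the total loss reduces to the triplet term alone.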

This two-pronged approach underpins PaAno's discriminative power: the triplet loss prevents embedding collapse and separates anomalous patches, while the pretext loss imbues the encoder with sequential context. Sensitivity studies confirm that both terms contribute to robust representation learning.

4. Anomaly Scoring and Memory Bank Construction

After training, patch embeddings from normal training data comprise the reference memory bank $\mathcal M$. For efficiency and redundancy reduction, K-means clustering picks $K$ centroids, and only the nearest original embeddings to each centroid are retained as $\hat{\mathcal M}$. At inference, each test time step $t$ is scored by encoding all $L$ overlapping patches covering $t$:

Each patch $\mathbf p$ is embedded and compared (cosine distance) to its $k$ nearest reference patches in $\hat{\mathcal M}$.
The patch anomaly score is the mean distance to these $k$ neighbors.
The time-step anomaly score $s_t$ is the average of the patch-level scores.

No post-hoc temporal smoothing or local thresholding is applied; this direct scoring pipeline is a core design choice distinguishing PaAno from prior point-adjusted paradigms (Park et al., 1 Feb 2026).
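The memory-bank compression and scoring steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions: it uses plain Lloyd's k-means with random initialization, keeps a single nearest embedding per centroid, and averages over however many patches cover a time step near the series boundaries. All function names are hypothetical.

```python
import numpy as np


def _unit(A, eps=1e-12):
    return A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)


def compress_memory_bank(emb, K, iters=20, seed=0):
    """Keep, for each of K k-means centroids, its nearest original embedding."""
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), size=K, replace=False)].copy()
    for _ in range(iters):
        d2 = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, K)
        assign = d2.argmin(axis=1)
        for j in range(K):
            members = emb[assign == j]
            if len(members):                 # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    d2 = ((emb[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    keep = np.unique(d2.argmin(axis=0))      # one retained row per centroid
    return emb[keep]


def patch_scores(test_emb, bank, k=3):
    """Mean cosine distance from each test embedding to its k nearest bank rows."""
    dist = 1.0 - _unit(test_emb) @ _unit(bank).T      # (n_test, n_bank)
    k = min(k, bank.shape[0])
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]  # k smallest per row
    return nearest.mean(axis=1)


def timestep_scores(p_scores, starts, L, N):
    """Average the scores of all patches that cover each time step."""
    total, count = np.zeros(N), np.zeros(N)
    for score, t0 in zip(p_scores, starts):
        total[t0:t0 + L] += score
        count[t0:t0 + L] += 1
    return total / np.maximum(count, 1)


bank = np.array([[1.0, 0.0], [0.0, 1.0]])
test_emb = np.array([[1.0, 0.0], [-1.0, 0.0]])
p = patch_scores(test_emb, bank, k=1)                 # distances 0 and 1
s = timestep_scores(p, starts=[0, 1], L=2, N=3)       # 0, 0.5, 1
```

The dense distance matrices keep the sketch short; a large bank would normally call for a library k-means and an approximate nearest-neighbor index instead.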

5. Evaluation Protocols and Empirical Performance

PaAno is evaluated on the TSB-AD-U and TSB-AD-M benchmarks. Each consists of long univariate or multivariate series with separate normal-only training and mixed test splits. Metrics include the areas under the PR and ROC curves, Range-F1, and VUS-PR (the volume under the PR surface, which aggregates range-aware PR curves over a range of tolerance buffer lengths). PaAno achieves top ranks on every measured index: for example, VUS-PR 0.52 and point-F1 0.51 on the univariate benchmark, and 0.43 and 0.43 on the multivariate benchmark, with the next-best method trailing by 10–15% across all measures. Model size is consistently below 0.5M parameters; inference over 50–110k time steps takes under 15 s on commodity GPUs.

Ablations demonstrate:

Instance normalization: critical for drift and regime transitions.
Triplet and pretext loss: each confers 2–6% improvements; the farthest-negative triplet sampling is superior to InfoNCE.
Patch length and memory bank size tolerance: performance holds for $L\in[32,128]$, $k\in\{1,3,5,10\}$, and memory compressed to 1–20% of the original patches.
The compact CNN encoder and memory-based scoring admit real-time, resource-constrained deployment.

6. Comparative Algorithmic Landscape

The PaAno design (patch extraction, normalization, metric learning objective, nearest-neighbor memory scoring) distinguishes itself from many contemporaneous foundation-model and deep transformer approaches:

TimeRep (Han et al., 16 Sep 2025) leverages intermediate representations from frozen foundation models: it selects the best encoder layer and patch token via validation and forms a reference set of those representations for nearest-neighbor scoring. Its adaptive memory bank addresses concept drift, outperforming both classical and deep learning baselines on the UCR Anomaly Archive.
PatchAD (Zhong et al., 2024) employs multi-scale patch extraction, a four-headed MLP-mixer, and a symmetric KL-based contrastive loss to align inter-patch and intra-patch views, supplemented with a dual projection constraint. PatchAD achieves significant improvements in classical F1 and AUC scores, despite a small parameter budget.
MOMEMTO (Yoon et al., 23 Sep 2025) constructs explicit patch-level memory items, refining embeddings via attention and updating memory through few-shot adaptation and multi-domain fine-tuning, thus mitigating over-generalization.
TriP-LLM (Yu et al., 31 Jul 2025) combines multi-branch local/global patch tokenization with a frozen pretrained LLM, but at a substantially higher model and memory cost.
PatchTrAD (Vilhes et al., 10 Apr 2025) and transformer-based variants achieve competitive ROC-AUC but typically at higher inference and training complexities.
CPatchBLS (Li et al., 2024) and TransDe (Zhang et al., 19 Apr 2025) further expand the contrastive and multi-view patch paradigm, leveraging broad learning (CPatchBLS) and decomposition/multi-scale fusion (TransDe), and integrating KL-divergence contrastive objectives for rapid, lightweight anomaly detection.

These works collectively highlight that patch granularity, local normalization, and discriminative objectives produce more robust anomaly detectors than standard reconstruction- or forecasting-based frameworks, especially under rigorous, non–point-adjusted evaluation.

7. Discussion and Implications

The patch-based representation learning paradigm offers several generalizable observations. First, locality bias in patch extraction aligns with the empirical observation that most time-series anomalies exhibit short-range, highly local deviations. Second, discriminative metric learning (in particular, farthest-negative triplet mining) creates embedding spaces where anomalous patches are easily segregated from normal clusters, supporting reliable nonparametric scoring. Third, self-supervised pretext tasks, such as temporal continuity prediction, ensure that embeddings respect sequential structure, addressing the common failure mode of representation collapse in purely contrastive settings.

PaAno demonstrates that lightweight models with strong patch-level objectives can consistently outperform heavy architectures in both detection accuracy and computational efficiency, particularly under benchmarks designed to penalize overfitting, label leakage, and excessive smoothing (Park et al., 1 Feb 2026). This suggests that future research may see renewed attention toward locality-aware, memory-efficient, and discriminatively trained patch-level models for time-series anomaly detection in real-world, resource-constrained settings.