Deformable Patch Embedding in ConvTimeNet

Updated 10 June 2026

The paper introduces deformable patch embedding, enabling an end-to-end learned mechanism that adaptively shifts and resizes patches to capture variable-length temporal patterns.
It leverages a lightweight convolutional predictor to adjust patch offsets and scales, addressing the limitations of traditional fixed-window approaches in multivariate time series.
Empirical results show that this method improves classification accuracy by 2–4% on average over uniform patching, with gains from 66% to 68% on FingerMovements and from 54% to 66% on DDG audio.

Deformable patch embedding is a data-driven mechanism for adaptively extracting semantically meaningful, instance-specific patches from sequential or spatial data. In ConvTimeNet, this mechanism—termed the DePatch layer—is employed to address two central challenges of multivariate time series modeling: adaptive local perception and multi-scale dependency modeling. Deformable patch embedding learns, end-to-end, how to shift and resize candidate temporal patches, sampling subseries at non-uniform, learned locations to better capture the intrinsic, variable-length patterns and dynamics within the data. This model design draws inspiration from similar principles explored for image modeling in the Deformable Patch-based Transformer (DPT) for vision (Cheng et al., 2024, Chen et al., 2021).

1. Motivation and Conceptual Framework

In conventional time series analysis, representations are often formed from individual points or fixed-length patches sampled with a uniform stride. While fixed-length patching increases the semantic density of tokens relative to individual points, it fails to accommodate the temporal variability and heterogeneity of real-world sequences, where patterns (such as event onsets, peaks, and changing regimes) may not align with hard-segmented boundaries. Deformable patch embedding is introduced to overcome two key deficiencies:

Non-adaptive patching restricts local context to rigid, fixed-width segments; this inflexibility can fragment or miss semantically coherent events.
Low-level representations derived from individual time steps are insufficiently informative, leading to suboptimal feature hierarchies for downstream convolutional processing.

The DePatch mechanism automatically learns offsets (shifts) and scaling factors (variable widths) for each patch, producing a set of adaptively extracted sub-series that encode richer and more localized semantics. This approach parallels deformable patching in vision transformers, where it was shown to preserve object and region semantics by adaptively shaping patch boundaries in a data-driven manner (Chen et al., 2021, Cheng et al., 2024).

2. Mathematical Formulation

Given a multivariate time series $X \in \mathbb{R}^{C\times T}$ (with $C$ channels and length $T$), deformable patch embedding proceeds as follows:

Anchor patch initialization: With patch length $P$ and stride $S$, the number of anchor patches is

$N = \left\lfloor \frac{T - P}{S} \right\rfloor + 2$

Initial patch centers are $t_{c,i} = (i-1)S + \frac{P}{2}$, for $i = 1, \ldots, N$.
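
As a quick check of the anchor formulas, the following Python sketch computes the patch count and the initial centers for illustrative values. The settings T = 96, P = 16, S = 8 are assumptions chosen for the example, not values from the paper, and `anchor_centers` is a hypothetical helper name. In this example the last anchor window ends past the end of the series, so the input has to be padded on the right.

```python
def anchor_centers(T: int, P: int, S: int) -> list[float]:
    """Initial patch centers t_{c,i} = (i - 1) * S + P / 2 for i = 1..N."""
    n = (T - P) // S + 2  # patch count N, using the formula given in the text
    return [(i - 1) * S + P / 2 for i in range(1, n + 1)]


centers = anchor_centers(T=96, P=16, S=8)
print(len(centers))             # 12
print(centers[0], centers[-1])  # 8.0 96.0
print(centers[-1] + 16 / 2)     # 104.0, the right edge of the last window (T is 96)
```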

Offset and scale prediction: Each anchored patch window is processed via a lightweight feature extractor $g(\cdot)$ (typically one or two 1D convolutions), followed by a predictor $H(\cdot)$ (e.g., $1\times1$ convolution or MLP) that outputs:

$[\Delta t_{c,i}, \Delta P_{i}] = H\big( g( X[:, t_{c,i}-P/2 : t_{c,i}+P/2] ) \big)$

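Adaptive patch computation: The new patch boundaries and sampling positions are

$L_i = t_{c,i} + \Delta t_{c,i} - \frac{P + \Delta P_i}{2}$

$R_i = t_{c,i} + \Delta t_{c,i} + \frac{P + \Delta P_i}{2}$

$\hat{t}_{i,j} = L_i + \frac{j-1}{k-1}\,(R_i - L_i), \quad j = 1, \ldots, k$

where $k$ is the number of sampled points per patch; the values of $X$ at the (generally fractional) positions $\hat{t}_{i,j}$ are obtained by linear interpolation, giving the resampled patch $\hat{X}_i \in \mathbb{R}^{C \times k}$.

Patch embedding: The flattened, interpolated patch is projected to embedding space via a learned matrix $W \in \mathbb{R}^{D \times Ck}$,

$z_i = W \, \mathrm{vec}(\hat{X}_i)$

Stacking all $z_i$ yields $Z \in \mathbb{R}^{N \times D}$ for downstream processing.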

This construction is fully differentiable. No normalization is applied at the DePatch output, with BatchNorm applied downstream in the convolutional blocks (Cheng et al., 2024).
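
To make the boundary and sampling step concrete, here is a minimal NumPy sketch for a single patch. It assumes the deformed patch spans the interval from `center + delta_t - (width + delta_p) / 2` to `center + delta_t + (width + delta_p) / 2`, that `k` positions are spaced uniformly between the two edges, and that positions falling outside the series are clamped. The function name `sample_patch` and the clamping rule are illustrative choices, not taken from the paper.

```python
import numpy as np


def sample_patch(x, center, width, delta_t, delta_p, k):
    """Resample one deformed patch from x of shape (C, T) at k fractional positions."""
    left = center + delta_t - (width + delta_p) / 2
    right = center + delta_t + (width + delta_p) / 2
    pos = np.linspace(left, right, k)            # k uniformly spaced positions
    pos = np.clip(pos, 0, x.shape[1] - 1)        # assumption: clamp to the valid range
    lo = np.floor(pos).astype(int)               # left neighbour index
    hi = np.minimum(lo + 1, x.shape[1] - 1)      # right neighbour index
    frac = pos - lo                              # fractional part in [0, 1)
    return x[:, lo] * (1 - frac) + x[:, hi] * frac  # linear interpolation, shape (C, k)


x = np.arange(20.0)[None, :]  # one channel with x[t] = t
patch = sample_patch(x, center=8.0, width=4.0, delta_t=0.25, delta_p=1.0, k=6)
print(patch[0])  # on this ramp the values equal the positions 5.75, 6.75, ..., 10.75
```

Each output is a weighted sum of two neighbouring samples, with weights that depend linearly on the fractional position. This is why gradients can flow back to the predicted shift and width.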

3. Layerwise Architecture and Implementation

The deformable patch embedding procedure is explicitly staged:

Anchor patch extraction: $N$ overlapping windows of length $P$ and stride $S$ are extracted (with padding if necessary).
Offset/scaling prediction: Each patch $X_i$ is processed by $g(\cdot)$ (typically 1–2 1D convolutions with BatchNorm and GELU), then $H(\cdot)$ predicts offsets and widths, yielding $[\Delta t_{c,i}, \Delta P_i]$ without a final activation function, permitting both positive and negative shifts.
Recomputation and sampling: Patch boundaries $[L_i, R_i]$ (the left and right edges of patch $i$) are recomputed, then $k$ sample locations are uniformly generated and interpolated from $X$.
Patch projection: Each patch is flattened and mapped to its embedding $z_i \in \mathbb{R}^{D}$; embeddings are stacked into an $N \times D$ token matrix.

A compact pseudocode representation is provided in the primary source (Cheng et al., 2024).

This decomposition enables precise localization and scaling of tokens, supporting end-to-end learning under standard task losses (Cheng et al., 2024).
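
The four stages can be sketched end to end in NumPy as follows. This is an illustrative forward pass only, not the authors' pseudocode: a dense layer with tanh stands in for the convolutional extractor $g(\cdot)$, edge replication is an assumed padding rule, and `depatch_forward` and the keys of `params` are hypothetical names. The projection is stored as a (C * k, D) matrix applied to row vectors.

```python
import numpy as np


def depatch_forward(x, P, S, k, params):
    """Illustrative DePatch forward pass for one series x of shape (C, T)."""
    C, T = x.shape
    N = (T - P) // S + 2

    # Stage 1: anchor windows. Right-pad so the last window fits
    # (edge replication is an assumption of this sketch).
    pad = (N - 1) * S + P - T
    xp = np.pad(x, ((0, 0), (0, pad)), mode="edge")
    centers = np.arange(N) * S + P / 2
    windows = np.stack([xp[:, i * S : i * S + P] for i in range(N)])  # (N, C, P)

    # Stage 2: extractor g (dense + tanh stand-in) and linear predictor H.
    # H has no final activation, so shifts and width changes can be negative.
    feats = np.tanh(windows.reshape(N, -1) @ params["Wg"])  # (N, hidden)
    delta = feats @ params["Wh"]                             # (N, 2)
    d_t, d_p = delta[:, 0], delta[:, 1]

    # Stage 3: recompute boundaries, then sample k points per patch.
    left = centers + d_t - (P + d_p) / 2
    right = centers + d_t + (P + d_p) / 2
    grid = np.arange(xp.shape[1])
    tokens = []
    for l_i, r_i in zip(left, right):
        pos = np.linspace(l_i, r_i, k)
        # np.interp clamps positions outside the padded series to its end values.
        patch = np.stack([np.interp(pos, grid, ch) for ch in xp])  # (C, k)
        # Stage 4: flatten and project to the embedding space.
        tokens.append(patch.reshape(-1) @ params["W"])
    return np.stack(tokens), delta


rng = np.random.default_rng(0)
C, T, P, S, k, hidden, D = 3, 96, 16, 8, 16, 32, 64
params = {
    "Wg": rng.normal(scale=0.1, size=(C * P, hidden)),
    "Wh": rng.normal(scale=0.1, size=(hidden, 2)),
    "W": rng.normal(scale=0.1, size=(C * k, D)),
}
Z, delta = depatch_forward(rng.normal(size=(C, T)), P, S, k, params)
print(Z.shape, delta.shape)  # (12, 64) (12, 2)
```

In a trainable version the same steps would be written in an automatic-differentiation framework so that the task loss can update the extractor, the predictor, and the projection jointly.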

4. Integration with Downstream Hierarchies

Following deformable patch embedding, the output $Z \in \mathbb{R}^{N \times D}$ becomes the input to multi-stage, fully convolutional hierarchies:

Each stage comprises a stack of convolutional blocks, each combining depthwise 1D convolutions (with the kernel size increasing for deeper stages), pointwise convolutions, GELU nonlinearity, BatchNorm, and a residual connection.
A reparameterization trick merges a parallel small-kernel branch into the depthwise convolution during inference for efficiency.
The hierarchical design enables global temporal coverage, progressively enlarging receptive fields and capturing multi-scale dependencies within the now semantically enriched sequence of patch tokens.
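
The reparameterization step relies on the linearity of convolution: two parallel depthwise branches fed with the same input can be replaced by a single branch whose kernel is the sum of the large kernel and the zero-padded small kernel. The NumPy sketch below checks this numerically. It assumes odd kernel sizes, zero "same" padding, and no normalization inside the branches (any per-branch normalization would have to be folded into the kernels first). `depthwise_conv1d` and `merge_branches` are hypothetical helper names.

```python
import numpy as np


def depthwise_conv1d(x, w):
    """'Same'-padded depthwise correlation: x is (C, T), w is (C, K) with odd K."""
    C, T = x.shape
    K = w.shape[1]
    xp = np.pad(x, ((0, 0), (K // 2, K // 2)))  # zero padding on both sides
    out = np.empty((C, T))
    for c in range(C):
        # Each channel is filtered with its own kernel (depthwise).
        out[c] = np.correlate(xp[c], w[c], mode="valid")
    return out


def merge_branches(w_large, w_small):
    """Fold a small-kernel branch into the large kernel by centered zero padding."""
    pad = (w_large.shape[1] - w_small.shape[1]) // 2
    return w_large + np.pad(w_small, ((0, 0), (pad, pad)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 50))
w_large, w_small = rng.normal(size=(4, 9)), rng.normal(size=(4, 3))

two_branch = depthwise_conv1d(x, w_large) + depthwise_conv1d(x, w_small)
merged = depthwise_conv1d(x, merge_branches(w_large, w_small))
print(np.allclose(two_branch, merged))  # True
```

After merging, inference needs only one depthwise convolution per block, while training can still benefit from the extra small-kernel branch.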

No explicit normalization or regularization is imposed at the patch embedding output, with normalization deferred to downstream convolutional components (Cheng et al., 2024).

5. Empirical Evaluation and Comparative Analyses

Empirical ablations conducted over 10 classification datasets demonstrate:

Uniform (fixed) patching leads to a 3–5% accuracy improvement over pointwise input.
Deformable patch embedding (DePatch-Conv-Conv, i.e., two conv layers in $g(\cdot)$) provides a further 2–4% gain, consistently outperforming all other tested patching strategies.
On the FingerMovements dataset, pointwise, uniform, and deformable patching achieve 55%, 66%, and 68% accuracy, respectively.
On the DDG audio dataset, deformable patching increases accuracy from 54% (uniform) to 66%.

These results indicate that adaptive patching preserves local temporal semantics better than fixed slicing, improving performance on time series and, by analogy with DPT, on vision tasks (Cheng et al., 2024, Chen et al., 2021).

Setting	FingerMovements accuracy (%)	DDG accuracy (%)	Avg. accuracy gain (%)
Pointwise	55	n/a	n/a
Uniform Patch	66	54	+3–5 vs. pointwise
DePatch-Conv-Conv	68	66	+2–4 vs. uniform

6. Relation to Deformable Patching in Vision Transformers

The deformable patch paradigm originated for spatial data in vision transformers. In DPT (Chen et al., 2021), the DePatch module learns offsets and adaptive scales for each patch in 2D images, with predicted patch center shifts $(\delta x, \delta y)$ and scales $(s_w, s_h)$. Patch content is then sampled using a regular (e.g., $k \times k$) grid within the adaptive patch boundaries, followed by bilinear interpolation and linear embedding. The module adds marginal parameter and computation overhead, and demonstrates empirical improvements of 1–2.5% in classification and 1–3.5 mAP in detection tasks over the rigid-patch baseline. Module ablations reveal that both shifts and scales are critical; inclusion of both provides a cumulative accuracy improvement (Chen et al., 2021).
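
For comparison with the 1D case, the 2D sampling step can be sketched as follows. The sketch takes a center shift `dx, dy` and scales `sw, sh`, and assumes that the scales act as half-width and half-height around the shifted center, that the `k` by `k` grid includes the patch edges, and that out-of-range coordinates are clamped. `bilinear` and `sample_patch_2d` are hypothetical names, and a single-channel image is used for brevity.

```python
import numpy as np


def bilinear(img, ys, xs):
    """Bilinearly interpolate a single-channel image (H, W) at fractional coordinates."""
    H, W = img.shape
    ys = np.clip(ys, 0, H - 1)  # assumption: clamp to the image
    xs = np.clip(xs, 0, W - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def sample_patch_2d(img, cx, cy, dx, dy, sw, sh, k):
    """Sample a k x k grid inside the shifted, rescaled patch rectangle."""
    xs = np.linspace(cx + dx - sw, cx + dx + sw, k)
    ys = np.linspace(cy + dy - sh, cy + dy + sh, k)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")  # (k, k) coordinate grids
    return bilinear(img, gy, gx)


# Test image with img[y, x] = 10 * y + x, which bilinear interpolation reproduces exactly.
img = 10.0 * np.arange(12)[:, None] + np.arange(12)[None, :]
patch = sample_patch_2d(img, cx=5, cy=4, dx=0.5, dy=-0.25, sw=2, sh=1.5, k=3)
print(patch.shape)  # (3, 3)
```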

A plausible implication is that, across both temporal and spatial modalities, data-driven adaptation of patch location and size is broadly beneficial, especially for modeling variable-width or semantically diverse events.

7. Implementation Considerations and Limitations

The offset/scaling predictor $H(\cdot)$ is lightweight (1–2 convolutional or linear layers) and is initialized to behave as fixed patching in early training.
The entire DePatch layer, including offset/scale prediction and interpolation, is differentiable and integrates seamlessly into end-to-end training.
No explicit regularization is imposed on patch parameters; constraints arise from the parameterization and limits of batch statistics.
The deformable patch embedding is agnostic to the downstream sequence model (convolutional or self-attentive) and can theoretically be adapted to various architectures.
In both ConvTimeNet and DPT, parameter and computational overhead is modest relative to the observed gains.
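
One common way to make a predictor start out as fixed patching is to zero-initialize its final layer, so that every predicted shift and width change is exactly zero at the first step. Whether ConvTimeNet uses this particular scheme is not stated here, so the sketch below is an assumption. `init_predictor` and `predict_offsets` are hypothetical names, and the tanh hidden layer is a stand-in for the predictor body.

```python
import numpy as np


def init_predictor(in_dim, hidden, rng):
    """Random hidden layer, zero-initialized output layer (assumed scheme)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "W2": np.zeros((hidden, 2)),  # columns: [delta_t, delta_P]
        "b2": np.zeros(2),
    }


def predict_offsets(feats, p):
    """Map per-patch features (N, in_dim) to (N, 2) offsets; no final activation."""
    return np.tanh(feats @ p["W1"]) @ p["W2"] + p["b2"]


rng = np.random.default_rng(0)
p = init_predictor(in_dim=8, hidden=16, rng=rng)
delta = predict_offsets(rng.normal(size=(5, 8)), p)

# With zero offsets, the recomputed left edges equal the anchor left edges.
P, centers = 16, np.arange(5) * 8 + 8.0
left = centers + delta[:, 0] - (P + delta[:, 1]) / 2
print(np.array_equal(left, centers - P / 2))  # True
```

Gradients with respect to `W2` are nonzero because the hidden features are nonzero, so the offsets can move away from zero as training proceeds.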

No evidence is provided for explicit limitations or pathologies specific to deformable patching beyond the added architectural complexity. Further research directions may include optimization of patch predictor architectures and assessment of generalization across radically different domains (Cheng et al., 2024, Chen et al., 2021).


References

Cheng et al. (2024). ConvTimeNet: A Deep Hierarchical Fully Convolutional Model for Multivariate Time Series Analysis.

Chen et al. (2021). DPT: Deformable Patch-based Transformer for Visual Recognition.
