
Deformable Patch Embedding in ConvTimeNet

Updated 10 June 2026
  • The paper introduces deformable patch embedding, enabling an end-to-end learned mechanism that adaptively shifts and resizes patches to capture variable-length temporal patterns.
  • It leverages a lightweight convolutional predictor to adjust patch offsets and scales, addressing the limitations of traditional fixed-window approaches in multivariate time series.
  • Empirical results show that this method improves classification accuracy by a further 2–4% on average over uniform patching, with gains reported on datasets such as FingerMovements and DDG audio.

Deformable patch embedding is a data-driven mechanism for adaptively extracting semantically meaningful, instance-specific patches from sequential or spatial data. In ConvTimeNet, this mechanism, termed the DePatch layer, is employed to address two central challenges of multivariate time series modeling: adaptive local perception and multi-scale dependency modeling. Deformable patch embedding learns, end-to-end, how to shift and resize candidate temporal patches, sampling subseries at non-uniform, learned locations to better capture the intrinsic, variable-length patterns and dynamics within the data. The design draws on principles previously explored for images in the Deformable Patch-based Transformer (DPT) (Cheng et al., 2024, Chen et al., 2021).

1. Motivation and Conceptual Framework

In conventional time series analysis, representations are often formed from individual points or from fixed-length patches sampled with a uniform stride. Fixed-length patching increases the semantic density of tokens relative to individual points, but it fails to accommodate the temporal variability and heterogeneity of real-world sequences, where patterns (such as event onsets, peaks, and changing regimes) may not align with hard-segmented boundaries. Deformable patch embedding is introduced to overcome two key deficiencies:

  • Non-adaptive patching restricts local context to rigid, fixed-width segments; this inflexibility can fragment or miss semantically coherent events.
  • Low-level representations derived from individual time steps are insufficiently informative, leading to suboptimal feature hierarchies for downstream convolutional processing.

The DePatch mechanism automatically learns offsets (shifts) and scaling factors (variable widths) for each patch, producing a set of adaptively extracted sub-series that encode richer and more localized semantics. This approach parallels deformable patching in vision transformers, where it was shown to preserve object and region semantics by adaptively shaping patch boundaries in a data-driven manner (Chen et al., 2021, Cheng et al., 2024).

2. Mathematical Formulation

Given a multivariate time series $X \in \mathbb{R}^{C \times T}$ (with $C$ channels and length $T$), a patch length $P$, and a stride $S$, deformable patch embedding proceeds as follows:

  1. Anchor patch initialization: The number of anchor patches is

$$N = \left\lfloor \frac{T - P}{S} \right\rfloor + 2$$

Initial patch centers are $t_{c,i} = (i-1)S + \frac{P}{2}$, for $i = 1, \ldots, N$.

  2. Offset and scale prediction: Each anchored patch window is processed via a lightweight feature extractor $g(\cdot)$ (typically one or two 1D convolutions), followed by a predictor $H(\cdot)$ (e.g., a $1 \times 1$ convolution or an MLP) that outputs a center offset $\Delta t_{c,i}$ and a scale adjustment $\Delta P_i$:

$$[\Delta t_{c,i}, \Delta P_i] = H\big( g( X[:,\, t_{c,i} - P/2 : t_{c,i} + P/2] ) \big)$$

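  3. Adaptive patch computation: The predicted offset and scale update the patch center and length, which give the new patch boundaries:

$$t'_{c,i} = t_{c,i} + \Delta t_{c,i}, \qquad P'_i = P + \Delta P_i$$

$$L_i = t'_{c,i} - \frac{P'_i}{2}, \qquad R_i = t'_{c,i} + \frac{P'_i}{2}$$

$$\hat{X}_i = \big[\, X(:, \tau_{i,1}), \ldots, X(:, \tau_{i,k}) \,\big], \qquad \tau_{i,1} < \cdots < \tau_{i,k} \ \text{uniformly spaced in } [L_i, R_i]$$

where $k$ is the number of sampled points, and the values at fractional positions $\tau_{i,j}$ are obtained by linear interpolation.

  4. Patch embedding: The flattened, interpolated patch is projected to embedding space via a learned matrix $W \in \mathbb{R}^{D \times Ck}$,

$$z_i = W \, \mathrm{vec}(\hat{X}_i)$$

Stacking all $z_i$ yields $Z \in \mathbb{R}^{N \times D}$ for downstream processing.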

This construction is fully differentiable. No normalization is applied at the DePatch output, with BatchNorm applied downstream in the convolutional blocks (Cheng et al., 2024).
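As a worked example of the anchor and boundary arithmetic, the following minimal Python sketch computes the anchor centers for a toy configuration and then applies one predicted offset and scale adjustment. It assumes the scale adjustment is added to the patch length ($P + \Delta P_i$), which matches the $\Delta$ notation but is not confirmed by the text; the function names and toy values are illustrative and do not come from the source.

```python
def anchor_centers(T, P, S):
    """Return the N = floor((T - P) / S) + 2 anchor centers t_c,i = (i - 1) * S + P / 2."""
    N = (T - P) // S + 2
    return [(i - 1) * S + P / 2 for i in range(1, N + 1)]


def deformed_bounds(center, P, d_center, d_len):
    """Shift the center by d_center, resize the patch by d_len, return (left, right)."""
    new_center = center + d_center
    new_len = P + d_len  # assumption: additive length update
    return new_center - new_len / 2, new_center + new_len / 2


centers = anchor_centers(T=32, P=8, S=8)
# The last center lies beyond T - 1, so that anchor relies on padding.
left, right = deformed_bounds(centers[1], P=8, d_center=1.5, d_len=2.0)
print(centers)      # [4.0, 12.0, 20.0, 28.0, 36.0]
print(left, right)  # 8.5 18.5
```

Here the second anchor, originally covering [8, 16], moves right and widens to [8.5, 18.5]. The non-integer edges are the reason the patch content has to be resampled by interpolation rather than sliced directly.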

3. Layerwise Architecture and Implementation

The deformable patch embedding procedure is explicitly staged:

  1. Anchor patch extraction: $N$ overlapping windows of length $P$ and stride $S$ are extracted (with padding if necessary).
  2. Offset/scaling prediction: Each patch $X_i$ is processed by $g(\cdot)$ (typically 1–2 1D convolutions with BatchNorm and GELU), then $H(\cdot)$ predicts offsets and widths, yielding $[\Delta t_{c,i}, \Delta P_i]$ without a final activation function, permitting both positive and negative shifts.
  3. Recomputation and sampling: The left and right patch boundaries $[L_i, R_i]$ are recomputed, then $k$ sample locations are uniformly generated and interpolated from $X$.
  4. Patch projection: Each patch is flattened and mapped to its embedding $z_i \in \mathbb{R}^{D}$; embeddings are stacked into an $N \times D$ token matrix.

A compact pseudocode representation is provided in the primary source (Cheng et al., 2024).

This decomposition enables precise localization and scaling of tokens, supporting end-to-end learning under standard task losses (Cheng et al., 2024).
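The four stages can be sketched in plain Python as follows. This is not the pseudocode from the primary source, and it rests on several assumptions: the `predictor` callable is a hypothetical stand-in for $H(g(\cdot))$, `W` is a toy projection matrix, out-of-range positions are clamped to the series ends (one possible padding choice), the anchor window is resampled at `k` points before prediction, and the length update is additive.

```python
import math


def interp(row, pos):
    """Linearly interpolate one channel at a fractional position (clamped to the series)."""
    pos = min(max(pos, 0.0), len(row) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(row) - 1)
    return row[lo] + (pos - lo) * (row[hi] - row[lo])


def window(X, left, right, k):
    """Sample k >= 2 uniformly spaced points per channel on [left, right]."""
    pts = [left + (right - left) * j / (k - 1) for j in range(k)]
    return [[interp(row, p) for p in pts] for row in X]


def depatch(X, P, S, k, predictor, W):
    """Anchors -> (offset, scale) prediction -> resampling -> linear projection."""
    T = len(X[0])
    N = (T - P) // S + 2
    tokens = []
    for i in range(N):
        c = i * S + P / 2                              # stage 1: anchor center
        anchor = window(X, c - P / 2, c + P / 2, k)
        d_c, d_p = predictor(anchor)                   # stage 2: stand-in for H(g(.))
        c2, p2 = c + d_c, P + d_p                      # stage 3: deformed patch
        patch = window(X, c2 - p2 / 2, c2 + p2 / 2, k)
        flat = [v for row in patch for v in row]
        tokens.append([sum(w * v for w, v in zip(w_row, flat))  # stage 4: projection
                       for w_row in W])
    return tokens


# Toy input: 2 channels, T = 16, with X[0][t] = t and X[1][t] = 2t.
X = [[float(t) for t in range(16)], [2.0 * t for t in range(16)]]
# Toy projection (D = 2): each output sums the k samples of one channel.
W = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]
Z = depatch(X, P=4, S=4, k=3, predictor=lambda a: (1.0, 2.0), W=W)
print(len(Z), Z[0])  # 5 [9.0, 18.0]
```

A real layer would replace the constant `predictor` with learned convolutions and would run on a tensor library so that gradients flow through the interpolation weights.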

4. Integration with Downstream Hierarchies

Following deformable patch embedding, the output token matrix $Z \in \mathbb{R}^{N \times D}$ becomes the input to multi-stage, fully convolutional hierarchies:

  • Each stage comprises multiple convolutional blocks, each combining depthwise 1D convolutions (with kernel size increasing for deeper stages), pointwise convolutions, GELU nonlinearity, BatchNorm, and a residual connection.
  • A reparameterization trick merges a parallel small-kernel branch into the depthwise convolution during inference for efficiency.
  • The hierarchical design enables global temporal coverage, progressively enlarging receptive fields and capturing multi-scale dependencies within the now semantically enriched sequence of patch tokens.

No explicit normalization or regularization is imposed at the patch embedding output, with normalization deferred to downstream convolutional components (Cheng et al., 2024).
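The reparameterization step relies on the linearity of convolution: a parallel small-kernel branch can be folded into the large kernel by zero-padding and adding. The sketch below illustrates this for a single channel in plain Python. It is a simplified illustration under stated assumptions: the kernels are made-up values, the helper names are hypothetical, and the folding of BatchNorm statistics that a real implementation would perform first is omitted.

```python
def conv1d_same(x, kernel):
    """Zero-padded 'same' 1D cross-correlation for an odd-length kernel."""
    r = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - r
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def merge_kernels(large, small):
    """Zero-pad the small kernel to the large kernel's length (centered) and add."""
    pad = (len(large) - len(small)) // 2
    padded = [0.0] * pad + list(small) + [0.0] * pad
    return [a + b for a, b in zip(large, padded)]


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 4.0]
large = [0.5, -1.0, 2.0, 1.0, 0.25]  # hypothetical 5-tap depthwise kernel
small = [1.0, 0.0, -1.0]             # hypothetical parallel 3-tap branch

# Training-time view: two branches evaluated separately and summed.
two_branch = [a + b for a, b in zip(conv1d_same(x, large), conv1d_same(x, small))]
# Inference-time view: one convolution with the merged kernel.
merged = conv1d_same(x, merge_kernels(large, small))
print(all(abs(a - b) < 1e-12 for a, b in zip(two_branch, merged)))  # True
```

After merging, inference needs only one depthwise convolution per block, which is where the efficiency gain comes from.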

5. Empirical Evaluation and Comparative Analyses

Empirical ablations conducted over 10 classification datasets demonstrate:

  • Uniform (fixed) patching leads to a 3–5% accuracy improvement over pointwise input.
  • Deformable patch embedding (DePatch-Conv-Conv, i.e., two conv layers in $g(\cdot)$) provides a further 2–4% gain, consistently outperforming all other tested patching strategies.
  • On the FingerMovements dataset, pointwise, uniform, and deformable patching achieve 55%, 66%, and 68% accuracy, respectively.
  • On the DDG audio dataset, deformable patching increases accuracy from 54% (uniform) to 66%.

These results indicate that adaptive patching preserves local temporal semantics better than fixed slicing in time series, and analogous gains have been reported for vision tasks (Cheng et al., 2024, Chen et al., 2021).

Setting            | FingerMovements accuracy (%) | DDG accuracy (%) | Avg. accuracy gain (%)
Pointwise          | 55                           | -                | -
Uniform Patch      | 66                           | 54               | +3–5 over pointwise
DePatch-Conv-Conv  | 68                           | 66               | +2–4 over uniform

6. Relation to Deformable Patching in Vision Transformers

The deformable patch paradigm originated for spatial data in vision transformers. In DPT (Chen et al., 2021), the DePatch module learns offsets and adaptive scales for each patch in $H \times W$ images, with predicted patch center shifts $(\delta x, \delta y)$ and scales $(s_w, s_h)$. Patch content is then sampled using a regular (e.g., $k \times k$) grid within the adaptive patch boundaries, followed by bilinear interpolation and linear embedding. The module adds marginal parameter and computation overhead, and demonstrates empirical improvements of 1–2.5% in classification and 1–3.5 mAP in detection tasks over the rigid-patch baseline. Module ablations reveal that both shifts and scales are critical; inclusion of both provides a cumulative accuracy improvement (Chen et al., 2021).
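The 2D sampling step can be illustrated with a short plain-Python sketch: shift the patch center, set the patch width and height, and read a regular grid of points by bilinear interpolation. This is only an illustration under assumptions. The function and parameter names are hypothetical, the grid is cell-centered (the source does not state the exact grid placement), and out-of-image positions are clamped to the border.

```python
import math


def bilinear(img, y, x):
    """Bilinearly interpolate a 2D list-of-lists image at fractional (y, x), clamped."""
    H, W = len(img), len(img[0])
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    fy, fx = y - y0, x - x0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bot = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bot


def sample_deformed_patch(img, cx, cy, w, h, dx, dy, k):
    """Shift the center by (dx, dy), then sample a k x k grid over the w x h box."""
    cx, cy = cx + dx, cy + dy
    left, top = cx - w / 2, cy - h / 2
    grid = []
    for r in range(k):
        yy = top + h * (r + 0.5) / k  # cell-centered rows (assumption)
        grid.append([bilinear(img, yy, left + w * (c + 0.5) / k) for c in range(k)])
    return grid


# Toy 8 x 8 image with img[y][x] = 10 * y + x, so interpolated values are easy to check.
img = [[10.0 * y + x for x in range(8)] for y in range(8)]
grid = sample_deformed_patch(img, cx=4, cy=4, w=4, h=2, dx=0.5, dy=-1.0, k=2)
print(grid)  # [[28.5, 30.5], [38.5, 40.5]]
```

The temporal DePatch layer is the one-dimensional counterpart of this procedure, with linear interpolation in place of bilinear interpolation.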

A plausible implication is that data-driven adaptation of patch location and size is broadly beneficial across both temporal and spatial modalities, especially for modeling variable-width or semantically diverse events.

7. Implementation Considerations and Limitations

  • The offset/scaling predictor $H(\cdot)$ is lightweight (1–2 convolutional or linear layers) and is initialized to behave as fixed patching in early training.
  • The entire DePatch layer, including offset/scale prediction and interpolation, is differentiable and integrates seamlessly into end-to-end training.
  • No explicit regularization is imposed on patch parameters; constraints arise from the parameterization and limits of batch statistics.
  • The deformable patch embedding is agnostic to the downstream sequence model (convolutional or self-attentive) and can theoretically be adapted to various architectures.
  • In both ConvTimeNet and DPT, parameter and computational overhead is modest relative to the observed gains.
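One common way to make a predictor start out as fixed patching is to zero-initialize its final layer, so that the predicted offset and scale adjustment are exactly zero before any update. The source does not say which initialization is used, so the sketch below is an assumption, with a hypothetical linear head and made-up feature values.

```python
def make_head(in_dim):
    """Hypothetical linear head with two outputs (offset, scale delta), zero-initialized."""
    return {"W": [[0.0] * in_dim, [0.0] * in_dim], "b": [0.0, 0.0]}


def predict(head, feat):
    """Apply the linear head to a feature vector and return (offset, scale delta)."""
    return tuple(sum(w * f for w, f in zip(row, feat)) + b
                 for row, b in zip(head["W"], head["b"]))


def bounds(center, P, d_center, d_len):
    """Patch edges after shifting the center and resizing (additive update assumed)."""
    c, p = center + d_center, P + d_len
    return c - p / 2, c + p / 2


head = make_head(in_dim=4)
feat = [0.3, -1.2, 0.8, 2.0]             # stand-in for the features produced by g
d_c, d_p = predict(head, feat)
fixed = bounds(12.0, 8, 0.0, 0.0)        # plain uniform patch
initial = bounds(12.0, 8, d_c, d_p)      # deformable patch at initialization
print(initial == fixed)                  # True

head["W"][0][3] = 0.5                    # pretend a training step changed one weight
moved = bounds(12.0, 8, *predict(head, feat))
print(moved)                             # (9.0, 17.0)
```

Under this scheme the layer reproduces uniform patching at the start of training and departs from it only as the head's weights move away from zero.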

No evidence is provided for explicit limitations or pathologies specific to deformable patching beyond the added architectural complexity. Further research directions may include optimization of patch predictor architectures and assessment of generalization across radically different domains (Cheng et al., 2024, Chen et al., 2021).
