Inter-Patch Temporal Dependencies
- Inter-patch temporal dependencies are modeled by dividing sequences into patches and capturing both short-term and long-range temporal dynamics with tools such as self-attention and MLP-based mixing.
- Adaptive segmentation methods (fixed-length, dynamic, and multi-scale patching) improve the efficiency and accuracy of models in tasks like video understanding and time series forecasting.
- Empirical studies demonstrate that incorporating patch-level dependency modeling can significantly reduce errors and enhance performance across various spatiotemporal and anomaly detection applications.
Inter-patch temporal dependencies refer to the explicit modeling and integration of temporal relationships between discrete, localized segments (“patches”) within a sequence or video. Instead of analyzing entire sequences directly, modern architectures partition the input into patches—contiguous blocks spanning fixed or adaptive intervals—and then deploy specialized mechanisms (e.g., self-attention, learnable blending, MLP mixing) to capture the dynamics, interactions, and causal links between these segments. This paradigm has become central in video understanding, time series forecasting, anomaly detection, and spatio-temporal modeling, where both short-term and long-range dependencies across patches underpin robust prediction and representation learning.
1. Mathematical Formulations and Patch Construction
The definition of a patch varies by domain but generally entails dividing a sequence (length L, channels C) or video (T frames, N spatial patches per frame, embedding dimension D) into contiguous blocks of length P (for time series) or per-frame spatial regions (for video). Common constructions:
- Fixed-length, stride-based segmentation: patches of length P taken at stride S, yielding n = ⌊(L − P)/S⌋ + 1 patches per channel; non-overlapping when S = P, overlapping when S < P (Villaboni et al., 22 Mar 2025, Nagrath, 18 Jan 2026, Wang et al., 30 Nov 2025, Liu et al., 2024).
- Dynamic boundary detection: Patch boundaries at high-entropy transition points, found by thresholds over Shannon conditional entropy of quantized tokens (Abeywickrama et al., 30 Sep 2025).
- Multi-scale patching: Multiple patch sizes yield coarse and fine-grained partitions; stacking patch-mixer blocks produces features at varying temporal resolutions (Wu et al., 2024, Xie et al., 22 Jan 2025).
- Video patches: the input is arranged as a T × N × D token tensor; each frame is decomposed into N spatial patches, and PatchBlender injects a learnable blending matrix along the frame axis (Prato et al., 2022).
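As a concrete illustration of the fixed-length, stride-based construction above, the sketch below segments a multichannel series into (possibly overlapping) patches. The function name and shapes are illustrative, not taken from any of the cited models:

```python
import numpy as np

def make_patches(x, patch_len, stride):
    """Segment a (length, channels) series into (n_patches, patch_len, channels).

    Patches overlap when stride < patch_len and tile the series exactly
    when stride == patch_len; any trailing remainder is dropped.
    """
    L, C = x.shape
    n = (L - patch_len) // stride + 1                     # number of patches
    idx = stride * np.arange(n)[:, None] + np.arange(patch_len)[None, :]
    return x[idx]                                         # shape (n, patch_len, C)

x = np.arange(20, dtype=float).reshape(10, 2)             # length 10, 2 channels
patches = make_patches(x, patch_len=4, stride=2)
print(patches.shape)                                      # (4, 4, 2)
```

With stride 2 and patch length 4, consecutive patches share two time steps, which is the overlapping regime several of the cited forecasters use to avoid boundary artifacts.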
2. Architectures for Modeling Inter-Patch Dependencies
Self-Attention across Patches
- Global patch attention: Project patch embeddings into d-dimensional tokens. Multi-head self-attention computes Attn(Q, K, V) = softmax(QKᵀ / √d) V over all patch tokens, aggregating patch representations globally (Wang et al., 30 Nov 2025, Nagrath, 18 Jan 2026, Liu et al., 2024).
- Unified attention over flattened patch tokens: Flatten all patch (and channel) tokens into a single 2D array; the resulting attention weights model any patch-to-patch and channel-to-channel dependency in a single pass (Liu et al., 2024).
- Dispatcher modules: For high-dimensional data, context is aggregated via learnable dispatcher embeddings and distributed back (two cross-attention stages), reducing memory and compute from quadratic to linear in patch count (Liu et al., 2024).
- Blending and temporal smoothing: PatchBlender applies a learnable T × T blending matrix to linearly mix same-spatial-location patches across frames, creating a motion prior that complements attention (Prato et al., 2022).
- MLP-based mixing: Parallel intra- and inter-patch MLPs operate respectively along time within a patch (local, short-range) and across patches (global, long-range), promoting both local memory and global context (Ye et al., 2024, Wu et al., 2024).
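A minimal single-head sketch of self-attention over patch tokens, using identity Q/K/V projections for brevity; real models learn separate projection matrices per head and stack multiple heads:

```python
import numpy as np

def patch_self_attention(tokens):
    """Single-head self-attention over patch tokens of shape (n_patches, d).

    Q = K = V = tokens here (identity projections), so the output is each
    patch token re-expressed as a softmax-weighted mix of all patch tokens.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)        # (n, n) patch-to-patch scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # each row sums to 1
    return attn @ tokens                           # context-mixed patch tokens

tokens = np.random.default_rng(0).normal(size=(6, 8))  # 6 patches, d = 8
out = patch_self_attention(tokens)
print(out.shape)  # (6, 8)
```

The (n, n) score matrix is exactly the quadratic patch-to-patch cost that the dispatcher modules above reduce to linear.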
Specialization by Channel or Scale
- Dual-branch architectures: Separate attention over channel and temporal axes, with a global patch fusion module combining results for direct multi-step or collaborative prediction (Wang et al., 30 Nov 2025, Villaboni et al., 22 Mar 2025).
- Multi-scale patch mixing and scale fusion: Models stack patch-mixer blocks of varying sizes; a scale-fusion attention module merges predictions across scales for enhanced robustness (Wu et al., 2024, Xie et al., 22 Jan 2025).
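The multi-scale idea can be sketched as partitioning the same series at several patch sizes; how per-scale features are then mixed (patch-mixer blocks, scale-fusion attention) is model-specific and omitted here:

```python
import numpy as np

def multi_scale_patches(x, patch_sizes):
    """Partition a 1-D series into non-overlapping patches at several scales.

    Returns {patch_size: array of shape (n_patches, patch_size)}; any
    trailing remainder shorter than the patch size is dropped.
    """
    out = {}
    for p in patch_sizes:
        n = len(x) // p
        out[p] = x[: n * p].reshape(n, p)
    return out

x = np.sin(np.linspace(0, 4 * np.pi, 64))
scales = multi_scale_patches(x, patch_sizes=[4, 8, 16])
print({p: a.shape for p, a in scales.items()})  # {4: (16, 4), 8: (8, 8), 16: (4, 16)}
```

Small patch sizes preserve fine-grained local structure while large ones expose coarse trends, which is why the cited models fuse features from all scales rather than picking one.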
3. Adaptive and Information-Theoretic Boundary Strategies
Inter-patch temporal dependencies depend critically on where patches begin and end:
- Entropy-guided dynamic patching: Conditional entropy peaks define true transition points; boundaries are enforced where the conditional entropy H(x_t | x_{t−1}) of quantized tokens exceeds a global threshold and where it rises sharply relative to the preceding step (Abeywickrama et al., 30 Sep 2025). This preserves structure at natural event changes and improves representation learning relative to fixed-interval patching.
- Instance-wise patch normalization: DeCoP interpolates global (instance-level) and local (patch-level) means/variances for robust normalization, blending global stability with local specificity and mitigating non-stationary drift (Wu et al., 18 Sep 2025).
- Domain-aware patches: For spatial graphs and epidemic modeling, a “patch” may be a node in a mobility graph; learned adjacency and mobility rates are time-varying and induced by spatio-temporal attention (Mao et al., 2023).
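To make the entropy-guided boundary idea concrete, the toy sketch below estimates the conditional entropy of quantized tokens from bigram counts and places a boundary wherever it exceeds a single global threshold. This is illustrative only; EntroPE's actual estimator and thresholding rule differ:

```python
import numpy as np
from collections import Counter

def entropy_boundaries(tokens, threshold):
    """Place a patch boundary at each position t where the empirical
    conditional entropy H(x_t | x_{t-1}) exceeds `threshold` bits."""
    pairs = Counter(zip(tokens[:-1], tokens[1:]))   # bigram counts
    prev = Counter(tokens[:-1])                     # unigram counts of predecessors
    boundaries = []
    for t in range(1, len(tokens)):
        a = tokens[t - 1]
        # empirical distribution over successors of token a
        probs = [c / prev[a] for (x, y), c in pairs.items() if x == a]
        h = -sum(p * np.log2(p) for p in probs)
        if h > threshold:
            boundaries.append(t)
    return boundaries

# token 0 is followed by 1 twice and 2 once, so H(. | 0) ~ 0.918 bits;
# token 1 is always followed by 0, so H(. | 1) = 0 bits.
tokens = [0, 1, 0, 1, 0, 2]
print(entropy_boundaries(tokens, threshold=0.5))  # [1, 3, 5]
```

Boundaries land exactly where the next token is hard to predict, i.e. at high-entropy transition points, while predictable stretches stay inside one patch.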
4. Mechanisms for Modeling and Enforcing Inter-Patch Dynamics
Key methodologies include:
- Attention masks and blending matrices: Custom masks restrict attention to certain patch combinations; PatchBlender’s learnable matrix explicitly enables or disables temporal smoothing (Prato et al., 2022, Bian et al., 2024), enforcing model reliance on cross-patch information only where required.
- Contrastive objectives and loss functions: DeCoP applies instance-level contrastive loss, aligning representations from original and denoised windows, focusing global similarity on stable, cross-patch features (Wu et al., 18 Sep 2025).
- Order and similarity prediction: PSTRP frames inter-patch relationships as multi-label classification (temporal order) and distance matrix regression (similarity), directly capturing both sequentiality and likeness (Shen et al., 2024).
- Physical models as constraints: MPSTAN fuses GAT-based attention with the metapopulation SIR model, where learned inter-patch (mobility) rates govern ODEs coupling node states; this ensures neural estimates respect true transmission dynamics (Mao et al., 2023).
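The blending-matrix mechanism can be sketched as a linear mix of same-spatial-location tokens across frames; here the matrix is a fixed smoothing kernel for illustration, whereas PatchBlender learns it end to end:

```python
import numpy as np

def patch_blend(x, B):
    """Mix same-spatial-location patch tokens across frames.

    x: (T, N, D) token tensor (frames, spatial patches, embedding dim).
    B: (T, T) blending matrix; out[:, n, :] = B @ x[:, n, :] for every n.
    """
    return np.einsum('st,tnd->snd', B, x)

T, N, D = 4, 3, 2
x = np.random.default_rng(1).normal(size=(T, N, D))
assert np.allclose(patch_blend(x, np.eye(T)), x)   # identity blending is a no-op

# simple moving-average kernel over adjacent frames (a hand-crafted motion prior)
B = np.zeros((T, T))
for s in range(T):
    w = [t for t in (s - 1, s, s + 1) if 0 <= t < T]
    B[s, w] = 1.0 / len(w)
smoothed = patch_blend(x, B)
print(smoothed.shape)  # (4, 3, 2)
```

Zeroing entries of B disables mixing between the corresponding frame pairs, which is how a learned matrix can enable or disable temporal smoothing per position, as described above.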
5. Empirical Evidence and Ablation Studies
Across domains, modeling inter-patch temporal dependencies improves downstream performance and robustness, especially for tasks requiring long-range reasoning:
| Model | Task/Dataset | Dependency Handling | Key Result | Ablation Effect |
|---|---|---|---|---|
| PatchBlender (Prato et al., 2022) | Video action/MOVi-A, SSv2 | Learnable blending-matrix motion prior | 20–30% MSE reduction vs. ViT | Frame shuffling degrades performance; off-diagonal mixing critical |
| EntroPE (Abeywickrama et al., 30 Sep 2025) | Time series (ETTh1, Electricity) | Entropy-dynamic patching | 8–20% reduction in MSE, +25–30% efficiency | Removing EDP: +5–10% MSE; fixed length, no gain |
| IIP-Mixer (Ye et al., 2024) | Battery RUL | Parallel intra/inter MLP | Best RUL prediction | Dropping inter-patch mixer: major loss, confirming global role |
| D-CTNet (Wang et al., 30 Nov 2025) | MTS forecasting | Global patch attention fusion | +14–29% gain on long horizons | Removing fusion module: 13–29% worse |
| Sensorformer (Qin et al., 6 Jan 2025) | High-dim time series | Sensor compression + patch attention | 3–12% MSE gain vs. PatchTST/iTransformer | Removing compression or attention: 3–12% loss each |
| PSTRP (Shen et al., 2024) | Video anomaly | SSL order/similarity | AUROC improvement +1–2% | Removing dist.-matrix task hurts AUROC |
| DeCoP (Wu et al., 18 Sep 2025) | Pretraining | Multi-scale DCL/ICM | 3% lower MSE, 37% fewer FLOPs | No DCL: 18–20% drop in cross-domain F1 |
In summary, ablation studies confirm the necessity of explicit patch-level dependency modeling. Models that forgo inter-patch mechanisms—either by using only intra-patch operations, fixed patch boundaries, or non-adaptive blending—demonstrate statistically and practically inferior performance on both forecasting and representation tasks.
6. Applications and Domain Adaptations
- Video Transformers: PatchBlender, SCT, and PSTRP apply learnable blending, shifted attention, and self-supervised relation prediction to capture both short-term motion and long-range order, substantially boosting action recognition and anomaly detection accuracy (Prato et al., 2022, Zha et al., 2021, Shen et al., 2024).
- Long-horizon Time Series Forecasting: EntroPE, Sensorformer, Sentinel, and DeCoP achieve state-of-the-art results in electricity, weather, and financial series, explicitly modeling non-local dependencies, multi-scale patterns, causal lags, and adaptive boundaries (Abeywickrama et al., 30 Sep 2025, Qin et al., 6 Jan 2025, Villaboni et al., 22 Mar 2025, Wu et al., 18 Sep 2025).
- Graph and Epidemic Modeling: MPSTAN integrates learned inter-patch rates (mobility, infection, recovery) with neural predictions; the joint loss drives models to capture multi-patch transmission and dynamic adjacency (Mao et al., 2023).
- Anomaly Detection: Through coarse-grained, multi-scale attention over patch blocks and inter-variate fusion, models like MtsCID increase sensitivity to subtle, group-level anomalies missed by fine-grained step-wise approaches (Xie et al., 22 Jan 2025).
7. Limitations, Open Problems, and Future Directions
Despite their empirical success, patch-based dependency models confront several challenges:
- Granularity tradeoff: Fixed patch sizes may dilute important local structure; dynamic patching (EntroPE) mitigates this but introduces computational and batching complexity (Abeywickrama et al., 30 Sep 2025, Wu et al., 18 Sep 2025).
- Boundary effects and context leakage: Choice and adaptivity of boundaries determine temporal coherence; poor segmentation weakens inter-patch modeling (Abeywickrama et al., 30 Sep 2025, Bian et al., 2024).
- Computational cost and scalability: Large numbers of patches lead to quadratic complexity; dispatcher and global compression modules partially alleviate these costs but may underrepresent fine interactions (Liu et al., 2024, Qin et al., 6 Jan 2025).
- Cross-domain and multi-modal transferability: Robustness to non-stationary environments and distribution drift requires adaptive normalization and contrastive learning (DeCoP, D-CTNet); more work is needed to sustain performance in truly out-of-distribution settings (Wu et al., 18 Sep 2025, Wang et al., 30 Nov 2025).
Inter-patch temporal dependency modeling thus represents a powerful, general paradigm across spatiotemporal vision, time series, and graph domains. State-of-the-art architectures leverage adaptive boundary strategies, multi-stage attention, multi-scale fusion, and explicit physical/semantic knowledge embedding to synthesize expressive, robust representations with proven gains in accuracy, generalization, and efficiency.