
Multi-Granularity Temporal Modeling (MGTM)

Updated 16 January 2026
  • MGTM is a modeling framework that integrates multiple temporal resolutions to capture fine-scale dynamics and coarse structural trends.
  • It employs techniques such as cross-granularity attention and hierarchical fusion to effectively extract and combine temporal features.
  • MGTM has demonstrated improved accuracy in forecasting, video analysis, and spatio-temporal learning by outperforming single-granularity methods.

Multi-Granularity Temporal Modeling (MGTM) is a family of frameworks and principles in time series analysis, spatio-temporal learning, video, and event-based processing that explicitly leverages information from multiple temporal resolutions, or "granularities", within a single model. Unlike traditional approaches that operate at a fixed timescale (e.g., per-frame, per-timestep, or per-block), MGTM models fuse representations across several resolutions, enabling the capture of both fine-scale dynamics and coarse structural trends. This approach has demonstrated substantial improvements over single-granularity methods in generative modeling, forecasting, representation learning, and sequence classification, across diverse domains including time series, video, event-based perception, and spatio-temporal graph modeling.

1. Formal Definitions and Granularity Structures

MGTM methodologies define multiple resolutions by segmenting temporal data into non-overlapping or sliding windows at varying scales. Consider a multivariate time series $X \in \mathbb{R}^{w \times m} = [x_1, \ldots, x_w]$, where $w$ is the sequence length and $m$ is the channel dimensionality. MGTM divides $X$ into segments $S_i \in \mathbb{R}^{j \times m}$ at the fine grain (timestamp level) and aggregates these into larger windows for coarse-grain (segment-level) representations (Ye et al., 2023). In video or event data, this segmentation might involve chunking into sub-sequences at different temporal resolutions (e.g., per-frame, per-chunk/block) (Shi et al., 14 Apr 2025), or constructing complementary representations such as voxel-based and point-based streams in event-based vision (Lin et al., 2024). In the spatio-temporal domain, granularities can be minute, hour, and day sequences, each with associated temporal embeddings (Zhao et al., 2024).

A generalized formalism for $K$ granularities is:

  • $G = \{g_1, \ldots, g_K\}$, with $g_k$ denoting the $k$-th granularity.
  • Each $g_k$ corresponds to a different window size or chunking interval, with an associated look-back horizon and (possibly) prediction window.
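As an illustration, this fine-to-coarse segmentation can be sketched in a few lines of NumPy. The window sizes (1, 4, 16) and mean-pooling aggregation are illustrative choices for this sketch, not those of any specific paper:

```python
import numpy as np

def segment(X, window):
    """Split a (w, m) series into non-overlapping (window, m) segments,
    truncating any trailing remainder."""
    w, m = X.shape
    n = w // window
    return X[: n * window].reshape(n, window, m)

def multi_granularity_views(X, windows=(1, 4, 16)):
    """One view per granularity, from fine (timestamp-level) to coarse;
    each window is summarized by its mean (one common aggregation choice)."""
    return {g: segment(X, g).mean(axis=1) for g in windows}

X = np.random.randn(64, 3)   # w = 64 timestamps, m = 3 channels
views = multi_granularity_views(X)
# views[1]:  (64, 3) fine-grain view
# views[4]:  (16, 3) segment-level view
# views[16]: (4, 3)  coarse-grain view
```

Sliding (overlapping) windows or sum/max aggregation would slot into the same structure; the key point is that every granularity produces its own sequence of representations over the same underlying series.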

2. Model Architectures for Multi-Granularity Fusion

MGTM architectures integrate multiple temporal scales through parallel or hierarchical pathways. Modern strategies include:

  • Cross-Granularity Attention: Fine-level (timestamp/vector) representations and coarse-level (segment/global) embeddings are aligned via Transformer-style attention. For example, the Multi-Grain Unsupervised Graph (MUG) framework uses self-attention pooling within each segment to summarize timestamps, then cross-attention to combine with segment-level encoding (Ye et al., 2023).
  • Hierarchical/Sequential Fusion: Inputs are encoded for each granularity, linearly aligned, and concatenated for joint attention modeling. SIFM for sea ice forecasting processes daily, weekly, and monthly SIC map sequences through Swin Transformer backbones, aligns dimensions, concatenates representations, and performs Transformer-based intra- and inter-granularity fusion with per-granularity FFNs (Xu et al., 2024).
  • Parallel Branching and Lambda Attention: In tasks with prominent motion and structure (e.g., medical image interpolation), GaraMoSt employs fully parallel, multi-branch modules (e.g., MG-MSFE) with independent receptive field radii, extracting motion/structural features at coarse and fine granularities and fusing these via learned attention (Xu et al., 2024).
  • Chunk-based Pipeline for Video: Mavors encodes video as sequences of spatially high-res chunks (Intra-chunk Vision Encoder), preserving fine spatio-temporal detail, followed by aggregation via an Inter-chunk Feature Aggregator with chunk-level positional encodings (Shi et al., 14 Apr 2025).
  • Coarse-to-fine or Recursive Refinement: For trajectory prediction or sequence generation, MGTM can employ recursive refinement networks operating at successively finer granularities. MGTraj predicts coarse-grained goal endpoints and recursively refines intermediate trajectories using weight-shared transformers (Sun et al., 11 Sep 2025). In diffusion models, MG-TSD links granularity levels to specific steps of the forward noising process, guiding denoising via explicit coarse targets (Fan et al., 2024).
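A minimal, dependency-light sketch of the cross-granularity attention pattern above, with fine-grain tokens as queries and coarse-grain tokens as keys/values. It deliberately omits the learned projections and multi-head structure of a real Transformer layer (such as MUG's), keeping only the scaled dot-product fusion step:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_granularity_attention(fine, coarse):
    """Each fine-grain (timestamp-level) embedding attends over the
    coarse-grain (segment-level) embeddings, enriching it with global
    context. Single head, no learned Q/K/V projections."""
    d = fine.shape[-1]
    scores = fine @ coarse.T / np.sqrt(d)      # (n_fine, n_coarse)
    return softmax(scores, axis=-1) @ coarse   # (n_fine, d)

fine = np.random.randn(64, 8)    # timestamp-level embeddings
coarse = np.random.randn(4, 8)   # segment-level embeddings
fused = cross_granularity_attention(fine, coarse)
# fused has shape (64, 8): one context-enriched vector per timestamp
```

In a full model the fused output would typically be added residually to the fine-grain stream and passed through a feed-forward block, but the asymmetry shown here (fine queries, coarse keys/values) is the essential cross-granularity ingredient.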

3. Training Strategies and Objective Functions

MGTM frameworks impose supervision or self-supervision at multiple scales:

  • Retrieval-style Losses: MUG uses unsupervised retrieval, tasking the model with retrieving the corresponding coarse/fine representations from a candidate set, using a rank-normalized cross-entropy loss based on Spearman similarity (Ye et al., 2023).
  • Multi-Grain Losses: SIFM aggregates per-granularity MSE losses over all predicted temporal scales, possibly weighted by importance coefficients (Xu et al., 2024).
  • Guided Diffusion Losses: MG-TSD defines a multi-granularity guidance loss by assigning coarse-grain smoothed targets as supervision at predetermined diffusion steps, regularizing the denoising process through explicit scale correspondence (Fan et al., 2024).
  • Auxiliary Tasks: In trajectory modeling, explicit velocity prediction at each granularity serves as an auxiliary loss, promoting consistency and better dynamic modeling (Sun et al., 11 Sep 2025).
  • Cross-scale Consistency Constraints: In multi-scale spectrum prediction, consistency between integrated predictions at adjacent scales is strictly enforced (Rasti et al., 19 Feb 2025): $\sum_{i=0}^{M-1} \hat{d}^{(k-1)}_{(n+i)T_k} = \hat{d}^{(k)}_{n T_k}, \quad \forall k = 2, \ldots, K$

MGTM pipelines are often trained end-to-end, with either fixed architectures or shared weights across scales, supporting both deterministic and distributional outputs.
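Two of the objectives above are straightforward to sketch: a weighted per-granularity MSE (a simplified, SIFM-style aggregation) and the residual of the cross-scale consistency constraint. All names and values below are illustrative:

```python
import numpy as np

def multi_grain_loss(preds, targets, weights):
    """Weighted sum of per-granularity MSE terms; `weights` play the role
    of per-scale importance coefficients (simplified sketch)."""
    return sum(w * np.mean((p - t) ** 2)
               for (p, t), w in zip(zip(preds, targets), weights))

def consistency_gap(fine_pred, coarse_pred, M):
    """Residual of the cross-scale constraint: each coarse-scale value
    should equal the sum of the M finer-scale predictions it spans."""
    agg = fine_pred[: len(coarse_pred) * M].reshape(-1, M).sum(axis=1)
    return float(np.abs(agg - coarse_pred).max())

fine = np.array([1.0, 2.0, 3.0, 4.0])    # finer-scale predictions
coarse = np.array([3.0, 7.0])            # M = 2 finer steps per coarse step
gap = consistency_gap(fine, coarse, M=2) # 0.0: constraint satisfied
```

In practice the consistency constraint may be enforced as a hard architectural invariant (predict fine, aggregate to coarse) rather than penalized as a residual; this sketch only makes the relationship between the scales explicit.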

4. Applications and Empirical Impact

MGTM achieves systematic improvements across multiple domains:

  • Time Series Representation Learning: Multi-granularity fusion yields substantially higher classification and retrieval accuracy, notably improving robustness under noisy/contaminated conditions (Ye et al., 2023).
  • Long-Range Video Understanding: Chunk-based MGTM architectures, such as Mavors, preserve both spatial detail and long-range temporal reasoning, outperforming sparse sampling and token-compression baselines on captioning and QA tasks (Shi et al., 14 Apr 2025).
  • Spatio-Temporal Forecasting: Traffic and Arctic sea-ice prediction models integrating multi-granularity input outperform single-scale baselines in RMSE, MAE, and specialized domain metrics (e.g., sea-ice edge error) (Zhao et al., 2024, Xu et al., 2024).
  • Compression and Generation: Multi-granularity trajectory factorization for video coding (e.g., MTTF) dramatically reduces bitrate (>70% BD-rate saving vs. VVC), while maintaining or improving perceptual fidelity (Yin et al., 2024).
  • Motion Deblurring and Frame Interpolation: Event-based deblurring combines coarse (voxel) and fine (point-cloud) representations for state-of-the-art PSNR/SSIM/LPIPS (Lin et al., 2024). Medical frame interpolation via parallel granularity-specific Lambda attention improves both accuracy and artifact suppression (Xu et al., 2024).
  • Wireless Networking: Multi-granularity spectrum forecasting, embedded in O-RAN, increases utilization by >20% and reduces error by ~30% compared to single-scale approaches (Rasti et al., 19 Feb 2025).

5. Theoretical Insights and Fusion Across Domains

MGTM integrates with both discriminative and generative modeling paradigms. Critical theoretical insights include:

  • Diffusion–Smoothing Analogy: The forward diffusion process in DDPM mirrors the successive smoothing of fine-to-coarse temporal resolution; MG-TSD exploits this by aligning denoising targets with real data smoothed to match the diffusion-step granularity, inducing multi-scale regularization (Fan et al., 2024).
  • Frequency-Domain Discrepancy: Granularity variation alters the joint amplitude–frequency and phase–frequency distribution in time series. The General Time-series Model (GTM) encodes granularity metadata and operates in both temporal and frequency domains, leveraging Fourier Knowledge Attention for granularity-aware representation (He et al., 5 Feb 2025).
  • Cross-domain Flexibility: MGTM is adaptable to spatial-temporal graphs (STMGF for traffic, Mavors for long-range video), event-based vision, human motion, and multimodal architectures.
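The diffusion–smoothing analogy can be illustrated by the construction of coarse guidance targets: the same series smoothed with growing moving-average windows, the coarser versions standing in for supervision at later, noisier diffusion steps. This is a simplified sketch of MG-TSD's idea rather than its implementation, and the window sizes are illustrative:

```python
import numpy as np

def coarse_guidance_targets(x, windows=(1, 4, 16)):
    """Moving-average smoothings of a 1-D series at growing window sizes.
    Window 1 returns the raw series (finest target); larger windows give
    progressively coarser targets, mirroring how forward diffusion washes
    out fine-scale detail first."""
    return {g: np.convolve(x, np.ones(g) / g, mode="same") for g in windows}

# Noisy sinusoid as a toy series
x = np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.3 * np.random.randn(128)
targets = coarse_guidance_targets(x)
# targets[1] is the raw series; targets[16] is its coarsest smoothing
```

In MG-TSD each smoothing level is assigned to a predetermined range of diffusion steps, so the denoiser is regularized to recover coarse structure before fine detail.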

6. Architectural and Practical Considerations

Adoption of MGTM involves various design trade-offs:

  • Window or Chunk Size Selection: The choice of window/chunk size at each granularity critically impacts the ability to capture relevant temporal features (Ye et al., 2023, Shi et al., 14 Apr 2025).
  • Shared vs. Independent Weights: Some frameworks (e.g., MGTraj) enforce shared temporal encoders across scales; others maintain fully parallel branches (e.g., GaraMoSt’s Lambda layers (Xu et al., 2024)).
  • Computational Efficiency: Parallel architectures (e.g., MG-MSFE) support real-time applications with minimal computational overhead, while hierarchical fusion can increase parameter count.
  • Extendability: Most published methods currently consider two or three scales; extension to a hierarchy with more levels, or even continuous scale modeling, is a subject of ongoing research (Ye et al., 2023).
  • Regularization and Consistency: Several frameworks enforce cross-scale consistency either via hard constraints or architectural fusion, but explicit multi-granularity supervision is less common.

7. Limitations and Future Directions

While MGTM consistently delivers gains, it introduces challenges:

  • Determining optimal granularity levels often requires cross-validation or domain knowledge, and fixed a priori windows may not adapt to nonstationary or irregularly sampled data.
  • Certain implementations incur increased computational and memory costs owing to multi-branch fusion or additional encoders for each granularity.
  • Extending beyond two or three scales to accommodate more complex data hierarchies remains nontrivial.
  • For stochastic generative modeling, proper alignment of coarsened targets and diffusion steps poses new regularization and algorithmic questions.

Future work is likely to address adaptive granularity selection, application to self-supervised or unsupervised regimes, architectural unification across more modalities (event data, video, spatio-temporal graphs), and theoretical analysis of scale-regularized deep representations. MGTM is poised to remain a central paradigm for temporal modeling across time series analysis, video understanding, event-based perception, and multi-modal sequence learning.


Key cited works:

(Ye et al., 2023) Multi-Granularity Framework for Unsupervised Representation Learning of Time Series
(Xu et al., 2024) SIFM: A Foundation Model for Multi-granularity Arctic Sea Ice Forecasting
(Zhao et al., 2024) STMGF: An Effective Spatial-Temporal Multi-Granularity Framework for Traffic Forecasting
(Lin et al., 2024) Event-based Motion Deblurring via Multi-Temporal Granularity Fusion
(Shi et al., 14 Apr 2025) Mavors: Multi-granularity Video Representation for Multimodal LLM
(Yin et al., 2024) Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization
(He et al., 5 Feb 2025) General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data
(Fan et al., 2024) MG-TSD: Multi-Granularity Time Series Diffusion Models with Guided Learning Process
(Sun et al., 11 Sep 2025) MGTraj: Multi-Granularity Goal-Guided Human Trajectory Prediction with Recursive Refinement Network
(Xu et al., 2024) GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images
(Rasti et al., 19 Feb 2025) Highly Dynamic and Flexible Spatio-Temporal Spectrum Management with AI-Driven O-RAN: A Multi-Granularity Marketplace Framework
