Multiresolution Decoder-Only Model
- Multiresolution decoder-only models are deep learning architectures that integrate inputs at multiple resolutions using dedicated embedding, fusion, and autoregressive modules.
- They combine coarse- and fine-grained contexts to improve predictions in tasks such as time series forecasting and 3D shape reconstruction.
- Empirical studies show reduced forecasting errors and state-of-the-art accuracy, demonstrating their practical impact in capturing long-term dependencies.
A multiresolution decoder-only model is an architectural paradigm in deep learning that enhances sequence modeling and generative tasks by incorporating representations at multiple resolutions within a decoder-only backbone. Recent work in time series forecasting and 3D shape representation has demonstrated utility in fusing coarse- and fine-grained context, yielding improvements in zero-shot prediction, sequence completion, and progressive refinement. In these models, the decoder is designed to process inputs from distinct resolutions, often with specialized architectural modules for embedding, fusion, and autoregressive decoding.
1. Architectural Principles of Multiresolution Decoder-Only Models
Multiresolution decoder-only models extend standard decoder-only designs by processing multiple resolutions of input context simultaneously. In time series forecasting, the Cisco Time Series Model (TSM) exemplifies this approach by accepting two fixed-length contexts: a coarse context (e.g., hourly aggregates) and a fine context (e.g., minute-level data) (Gou et al., 25 Nov 2025). Both contexts are independently normalized and partitioned into non-overlapping patches. Each patch is mapped through a residual embedding block to a high-dimensional token representation. A learnable resolution embedding is added to each token embedding to indicate its provenance (coarse or fine), and a special separator token is inserted between the coarse and fine tokens; the resulting sequence is fed to a deep transformer-based decoder, typically with dozens of layers.
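The patching, residual embedding, resolution tagging, and separator insertion described above can be sketched end to end. This is an illustrative NumPy mock-up, not the released TSM code; the patch length (32), model width, and names such as `embed_patch` and `build_sequence` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, D = 32, 64  # patch length and model width (illustrative sizes)

# Hypothetical residual patch embedder: linear projection plus an MLP residual.
W_in = rng.normal(0, 0.02, (PATCH, D))
W_mlp = rng.normal(0, 0.02, (D, D))

def embed_patch(p):
    h = p @ W_in                   # linear projection to model width
    return h + np.tanh(h @ W_mlp)  # residual MLP branch

# Learnable (here: fixed random) resolution embeddings and separator token.
e_coarse, e_fine = rng.normal(size=D), rng.normal(size=D)
sep = rng.normal(size=D)

def build_sequence(coarse, fine):
    """Patch both contexts, tag each token with its resolution embedding,
    and join the two streams with a separator token."""
    zc = [embed_patch(p) + e_coarse for p in coarse.reshape(-1, PATCH)]
    zf = [embed_patch(p) + e_fine for p in fine.reshape(-1, PATCH)]
    return np.stack(zc + [sep] + zf)

# 512-step coarse and fine contexts -> 16 + 1 + 16 = 33 tokens of width 64.
tokens = build_sequence(rng.normal(size=512), rng.normal(size=512))
```

The separator token gives the decoder an explicit boundary marker, so attention can distinguish the two streams even though they share one sequence axis.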
In 3D shape modeling, the Multiresolution Deep Implicit Functions (MDIF) framework hierarchically organizes latent grids to support progressive decoder-only optimization. Each latent grid at level $l$ is associated with its own decoder $D_l$, enabling coarse-to-fine reconstruction (Chen et al., 2021).
2. Multiresolution Data Encoding and Fusion
Encoding in multiresolution models involves representing sequential or spatial data at distinct granularities and merging these representations for downstream decoding. In the Cisco TSM, normalization is performed separately on the coarse context $x^{(c)}$ and the fine context $x^{(f)}$, followed by patching each into sequences of length-32 vectors. Each patch is tokenized and augmented with a corresponding resolution embedding, and the two sequences are concatenated with an intervening special token:

$$z = \big[\,z^{(c)}_1, \dots, z^{(c)}_{N_c},\; z_{\mathrm{sep}},\; z^{(f)}_1, \dots, z^{(f)}_{N_f}\,\big],$$

where $z^{(c)}_i = E\big(p^{(c)}_i\big) + e_c$, $z^{(f)}_j = E\big(p^{(f)}_j\big) + e_f$, and $e_c$, $e_f$ are the learnable resolution embeddings for the coarse and fine streams.
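The per-context normalization and length-32 patching can be sketched as follows. This is a minimal sketch assuming plain standardization per context; the helper names `normalize` and `patchify` are illustrative.

```python
import numpy as np

def normalize(x):
    # Per-context standardization, applied independently to coarse and fine.
    return (x - x.mean()) / (x.std() + 1e-8)

def patchify(x, patch=32):
    # Non-overlapping length-32 patches; the context length must divide evenly.
    assert len(x) % patch == 0
    return normalize(x).reshape(-1, patch)

rng = np.random.default_rng(1)
coarse = patchify(rng.normal(size=512))  # 16 coarse patches of length 32
fine = patchify(rng.normal(size=512))    # 16 fine patches of length 32
```

Normalizing each context separately keeps the coarse and fine streams on comparable scales before they are embedded into a shared token space.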
For MDIF, trilinear interpolation allows latent-code access at arbitrary spatial coordinates across the hierarchically organized grids. Residual connections via upscaled global grids are concatenated with local codes at finer levels, fusing global and local structure.
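Trilinear sampling of a latent grid at a continuous query point can be sketched as below. This shows the generic interpolation mechanism, not MDIF's implementation; the grid layout `(R, R, R, C)` and the coordinate convention are assumptions.

```python
import numpy as np

def trilinear_sample(grid, p):
    """Sample a latent grid of shape (R, R, R, C) at continuous
    coordinates p in [0, R-1]^3 by blending the 8 surrounding cells."""
    i = np.floor(p).astype(int)
    i = np.clip(i, 0, np.array(grid.shape[:3]) - 2)  # keep the 2x2x2 cell in bounds
    f = p - i                                        # fractional offsets in the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Product of per-axis linear weights for this corner.
                w = ((1 - f[0]) if dx == 0 else f[0]) \
                  * ((1 - f[1]) if dy == 0 else f[1]) \
                  * ((1 - f[2]) if dz == 0 else f[2])
                out += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

# A 4x4x4 grid of 2-channel codes whose values are linear in the index,
# so sampling at a cell center returns the average of the 8 corners.
grid = np.arange(4 ** 3 * 2, dtype=float).reshape(4, 4, 4, 2)
code = trilinear_sample(grid, np.array([1.5, 1.5, 1.5]))
```

Because the weights are linear per axis, the sampled code varies smoothly with the query point, which is what lets an implicit decoder be evaluated at arbitrary 3D coordinates.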
3. Autoregressive Decoding and Progressive Inference
Multiresolution decoder-only architectures perform autoregressive decoding by generating outputs at the finest desired resolution while maintaining and updating the coarse context to inform long-range dependencies. In time series, after each forecasting step of fine-resolution tokens, new coarse patches are computed by aggregating the freshly generated fine values, e.g.,

$$\bar{x}^{(c)}_t = \frac{1}{r}\sum_{j=1}^{r} \hat{x}^{(f)}_{(t-1)r + j},$$

where $r$ is the number of fine steps per coarse step; these aggregated patches are appended to the coarse context sequence for subsequent autoregressive rounds (Gou et al., 25 Nov 2025).
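This coarse-context update during autoregressive rollout can be sketched as below. Mean aggregation and the 60:1 fine-to-coarse ratio (minutes per hour) are assumptions for illustration.

```python
import numpy as np

RATIO = 60  # fine steps per coarse step (e.g., minutes per hour); assumed value

def update_coarse(coarse_ctx, new_fine):
    """Aggregate freshly generated fine-resolution values into coarse steps
    and append them to the coarse context (mean aggregation is an assumption)."""
    assert len(new_fine) % RATIO == 0
    new_coarse = new_fine.reshape(-1, RATIO).mean(axis=1)
    return np.concatenate([coarse_ctx, new_coarse])

coarse = np.zeros(4)                 # existing coarse context
fine_forecast = np.ones(120)         # two coarse steps' worth of fine output
coarse = update_coarse(coarse, fine_forecast)
```

Keeping the coarse stream in sync with generated fine output is what lets later decoding rounds condition on long-range structure without re-reading the full fine history.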
For 3D shapes, progressive inference with MDIF proceeds by first decoding the coarsest representation and then recursively adding finer residuals up to a chosen detail level $L$:

$$f(p) = f_0(p) + \sum_{l=1}^{L} \Delta f_l(p),$$

where $f_0$ is the global (level-0) decoded field and each $\Delta f_l$ is the residual correction predicted at level $l$. This enables computational efficiency by restricting high-resolution decoding to regions of interest.
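The coarse-plus-residuals evaluation can be sketched with toy per-level decoders. The closures below are stand-ins for MDIF's learned decoders; the specific functions are illustrative.

```python
import numpy as np

# Toy residual decoders per level, each mapping a 3D point to a field value.
# Level 0 carries the bulk signal; finer levels add small corrections.
decoders = [
    lambda p: np.sin(p).sum(),            # level 0: coarse global field
    lambda p: 0.1 * np.cos(2 * p).sum(),  # level 1 residual
    lambda p: 0.01 * np.sin(4 * p).sum(), # level 2 residual
]

def decode(p, level):
    """Evaluate the implicit field at p using levels 0..level only."""
    return sum(d(p) for d in decoders[: level + 1])

p = np.array([0.3, 0.7, 0.1])
coarse_only = decode(p, 0)   # cheap preview
full = decode(p, 2)          # full-detail evaluation
```

Because each level only refines the previous one, the evaluation can stop early for a cheap preview and spend the extra levels only on regions of interest.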
4. Training Regimen, Regularization, and Loss Objectives
Multiresolution decoder-only models are typically trained with objectives that match predicted outputs to ground truth at multiple quantiles or resolutions. The Cisco TSM uses a zero-shot training protocol with a composite objective that averages the quantile (pinball) loss $\ell_q$ over a set of target quantiles $q$; regularization includes weight decay, gradient clipping, loss-value clipping, and cosine-annealed learning rates for the AdamW and Muon optimizers (Gou et al., 25 Nov 2025).
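The quantile (pinball) loss and its average over several quantile heads can be sketched as follows. The specific quantile set `(0.1, 0.5, 0.9)` is an assumption for illustration.

```python
import numpy as np

def pinball(y, yhat, q):
    """Quantile (pinball) loss: under-prediction is penalized with weight q,
    over-prediction with weight 1 - q."""
    diff = y - yhat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def multi_quantile_loss(y, preds, quantiles=(0.1, 0.5, 0.9)):
    # Average pinball loss across quantile heads; the quantile set is assumed.
    return np.mean([pinball(y, preds[q], q) for q in quantiles])

y = np.array([1.0, 2.0, 3.0])
preds = {0.1: y - 0.5, 0.5: y, 0.9: y + 0.5}  # one head per target quantile
loss = multi_quantile_loss(y, preds)
```

Training against multiple quantiles gives the model a calibrated predictive interval rather than a single point forecast, which matters for the monitoring use cases discussed below.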
MDIF employs latent-grid dropout during encoder–decoder training (with a fixed drop rate, not applied to global connections), promoting resilience to missing features. Decoder-only latent optimization minimizes a fit error plus a regularization term over visible and occluded regions, using Adam with fixed hyperparameters (Chen et al., 2021).
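Latent-grid dropout can be sketched as randomly zeroing cells of the non-global grids. The cell-wise masking granularity and the example drop rate are assumptions; exempting the level-0 grid mirrors MDIF's exclusion of global connections.

```python
import numpy as np

def grid_dropout(grids, rate, rng):
    """Zero whole latent cells in the non-global grids during training.
    grids[0] is the global grid and is never dropped."""
    out = [grids[0]]
    for g in grids[1:]:
        mask = rng.random(g.shape[:3]) >= rate  # per-cell keep mask
        out.append(g * mask[..., None])         # broadcast over channels
    return out

rng = np.random.default_rng(0)
grids = [np.ones((1, 1, 1, 8)), np.ones((4, 4, 4, 8))]  # global + one fine level
dropped = grid_dropout(grids, rate=0.5, rng=rng)
```

Zeroing entire cells (rather than individual channels) forces the decoder to reconstruct local detail from neighboring codes and the global residual path, which is what builds robustness to partial observations.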
5. Empirical Performance and Case Studies
Multiresolution decoder-only architectures have demonstrated empirical advantages over single-resolution or extended context decoder-only models:
| Model | Domain | MASE (lower is better) |
|---|---|---|
| Cisco TSM (512,512) | Observability | 0.4569 |
| TimesFM-2.5 (512) | Observability | 0.7290 |
| TimesFM-2.5 (1024) | Observability | 0.6159 |
| Cisco TSM (512,512) | GIFT-Eval | 0.7365 |
| TimesFM-2.5 (1024) | GIFT-Eval | 0.6828 |
| TimesFM-2.0 (1024) | GIFT-Eval | 0.7620 |
The Cisco TSM achieves substantial error reduction on observability data and retains competitive results on public forecasting benchmarks compared to extended-context single-resolution architectures (Gou et al., 25 Nov 2025). Qualitative studies reveal patterns where coarse resolution inputs allow the model to capture long-term periodicities and mitigate high-frequency noise. Increasing the coarse context (e.g., from 512 to 1024 hours) systematically improves forecast accuracy and enables discovery of cyclic structure not evident at finer resolutions.
MDIF obtains state-of-the-art accuracy in 3D auto-encoding, point-cloud completion, and voxel super-resolution, with the capacity for true progressive refinement by selection of detail level at inference. Completion accuracy is measured by Chamfer L2 distance and F-Score (Chen et al., 2021). Latent-grid dropout further enhances decoder-only optimization under partial data conditions.
6. Applications and Implications
In time series analysis, multiresolution decoder-only models facilitate zero-shot long-context forecasting, especially in observability and monitoring domains where long-term patterns (e.g., weekly cycles) are influential. The ability to incorporate both high-frequency and low-frequency information allows for more robust extrapolation without downstream fine-tuning.
For 3D shape reconstruction, the hierarchical latent grid approach in MDIF enables both encoder–decoder and decoder-only optimization, supporting applications in shape completion from partial observations, super-resolution, and progressive mesh refinement.
A plausible implication is that multiresolution fusion via decoder-only modules provides a general mechanism for leveraging heterogeneous contextual information in autoregressive generation tasks. The explicit architectural separation of resolutions, combined with trainable fusion schemes, offers a scalable pathway for supporting long-term dependencies in sequence and spatial modeling.
7. Related Research Directions
The multiresolution paradigm aligns with ongoing efforts to improve scalability and compositionality in deep generative models. Hierarchical representations, progressive decoding, and specialized regularization (e.g., latent dropout, quantile losses) are converging themes. Models such as TimesFM (Das et al., 2024) and IM-Net-style MLP architectures represent foundational designs extended by multiresolution capabilities. Future research may investigate joint modeling of additional resolutions, adaptive context selection, and broader transferability across domains, particularly in cross-modal or multi-task autoregressive frameworks (Gou et al., 25 Nov 2025, Chen et al., 2021).