Multiresolution Decoder-Only Model
- Multiresolution decoder-only models are deep learning architectures that integrate inputs at multiple resolutions using dedicated embedding, fusion, and autoregressive modules.
- They combine coarse- and fine-grained contexts to improve predictions in tasks such as time series forecasting and 3D shape reconstruction.
- Empirical studies show reduced forecasting errors and state-of-the-art accuracy, demonstrating their practical impact in capturing long-term dependencies.
A multiresolution decoder-only model is an architectural paradigm in deep learning that enhances sequence modeling and generative tasks by incorporating representations at multiple resolutions within a decoder-only backbone. Recent work in time series forecasting and 3D shape representation has demonstrated utility in fusing coarse- and fine-grained context, yielding improvements in zero-shot prediction, sequence completion, and progressive refinement. In these models, the decoder is designed to process inputs from distinct resolutions, often with specialized architectural modules for embedding, fusion, and autoregressive decoding.
1. Architectural Principles of Multiresolution Decoder-Only Models
Multiresolution decoder-only models extend standard decoder-only designs by processing multiple resolutions of input context simultaneously. In time series forecasting, the Cisco Time Series Model (TSM) exemplifies this approach by accepting two fixed-length contexts: a coarse context (e.g., hourly aggregates) and a fine context (e.g., minute-level data) (Gou et al., 25 Nov 2025). Both contexts are independently normalized and partitioned into non-overlapping patches. Each patch is mapped through a residual embedding block to a high-dimensional token representation. A learnable resolution embedding is added to each token embedding to indicate its provenance (coarse or fine), and a special separator token is inserted between the coarse and fine tokens; the resulting sequence is fed to a deep transformer-based decoder, typically with dozens of layers.
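The patching, residual embedding, resolution tagging, and separator insertion described above can be sketched end to end. This is an illustrative NumPy mock-up, not the released TSM code; the patch length (32), model width, and names such as `embed_patch` and `build_sequence` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, D = 32, 64  # patch length and model width (illustrative sizes)

# Hypothetical residual patch embedder: linear projection plus an MLP residual.
W_in = rng.normal(0, 0.02, (PATCH, D))
W_mlp = rng.normal(0, 0.02, (D, D))

def embed_patch(p):
    h = p @ W_in                   # linear projection to model width
    return h + np.tanh(h @ W_mlp)  # residual MLP branch

# Learnable (here: fixed random) resolution embeddings and separator token.
e_coarse, e_fine = rng.normal(size=D), rng.normal(size=D)
sep = rng.normal(size=D)

def build_sequence(coarse, fine):
    """Patch both contexts, tag each token with its resolution embedding,
    and join the two streams with a separator token."""
    zc = [embed_patch(p) + e_coarse for p in coarse.reshape(-1, PATCH)]
    zf = [embed_patch(p) + e_fine for p in fine.reshape(-1, PATCH)]
    return np.stack(zc + [sep] + zf)

# 512-step coarse and fine contexts -> 16 + 1 + 16 = 33 tokens of width 64.
tokens = build_sequence(rng.normal(size=512), rng.normal(size=512))
```

The separator token gives the decoder an explicit boundary marker, so attention can distinguish the two streams even though they share one sequence axis.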
In 3D shape modeling, the Multiresolution Deep Implicit Functions (MDIF) framework hierarchically organizes latent grids to support progressive decoder-only optimization. Each latent grid at level $l$ is associated with its own decoder $D_l$, enabling coarse-to-fine reconstruction (Chen et al., 2021).
2. Multiresolution Data Encoding and Fusion
Encoding in multiresolution models involves representing sequential or spatial data at distinct granularities and merging these representations for downstream decoding. In the Cisco TSM, normalization is performed separately on the coarse context $x^{(c)}$ and the fine context $x^{(f)}$, followed by patching each into sequences of length-32 vectors. Each patch is tokenized and augmented with a corresponding resolution embedding, and the two sequences are concatenated with an intervening special token:

$$z = \big[\,z^{(c)}_1, \dots, z^{(c)}_{N_c},\; z_{\mathrm{sep}},\; z^{(f)}_1, \dots, z^{(f)}_{N_f}\,\big],$$

where $z^{(c)}_i = E\big(p^{(c)}_i\big) + e_c$, $z^{(f)}_j = E\big(p^{(f)}_j\big) + e_f$, and $e_c$, $e_f$ are the learnable resolution embeddings for the coarse and fine streams.
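The per-context normalization and length-32 patching can be sketched as follows. This is a minimal sketch assuming plain standardization per context; the helper names `normalize` and `patchify` are illustrative.

```python
import numpy as np

def normalize(x):
    # Per-context standardization, applied independently to coarse and fine.
    return (x - x.mean()) / (x.std() + 1e-8)

def patchify(x, patch=32):
    # Non-overlapping length-32 patches; the context length must divide evenly.
    assert len(x) % patch == 0
    return normalize(x).reshape(-1, patch)

rng = np.random.default_rng(1)
coarse = patchify(rng.normal(size=512))  # 16 coarse patches of length 32
fine = patchify(rng.normal(size=512))    # 16 fine patches of length 32
```

Normalizing each context separately keeps the coarse and fine streams on comparable scales before they are embedded into a shared token space.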
For MDIF, trilinear interpolation allows latent-code access at arbitrary spatial coordinates across the hierarchically organized grids. Residual connections via upscaled global grids are concatenated with local codes at finer levels, fusing global and local structure.
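Trilinear sampling of a latent grid at a continuous query point can be sketched as below. This shows the generic interpolation mechanism, not MDIF's implementation; the grid layout `(R, R, R, C)` and the coordinate convention are assumptions.

```python
import numpy as np

def trilinear_sample(grid, p):
    """Sample a latent grid of shape (R, R, R, C) at continuous
    coordinates p in [0, R-1]^3 by blending the 8 surrounding cells."""
    i = np.floor(p).astype(int)
    i = np.clip(i, 0, np.array(grid.shape[:3]) - 2)  # keep the 2x2x2 cell in bounds
    f = p - i                                        # fractional offsets in the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Product of per-axis linear weights for this corner.
                w = ((1 - f[0]) if dx == 0 else f[0]) \
                  * ((1 - f[1]) if dy == 0 else f[1]) \
                  * ((1 - f[2]) if dz == 0 else f[2])
                out += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return out

# A 4x4x4 grid of 2-channel codes whose values are linear in the index,
# so sampling at a cell center returns the average of the 8 corners.
grid = np.arange(4 ** 3 * 2, dtype=float).reshape(4, 4, 4, 2)
code = trilinear_sample(grid, np.array([1.5, 1.5, 1.5]))
```

Because the weights are linear per axis, the sampled code varies smoothly with the query point, which is what lets an implicit decoder be evaluated at arbitrary 3D coordinates.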
3. Autoregressive Decoding and Progressive Inference
Multiresolution decoder-only architectures perform autoregressive decoding by generating outputs at the finest desired resolution while maintaining and updating the coarse context to inform long-range dependencies. In time series, after each forecasting step of fine-resolution tokens, new coarse patches are computed by aggregating the freshly generated fine values, e.g.,

$$\bar{x}^{(c)}_t = \frac{1}{r}\sum_{j=1}^{r} \hat{x}^{(f)}_{(t-1)r + j},$$

where $r$ is the number of fine steps per coarse step; these aggregated patches are appended to the coarse context sequence for subsequent autoregressive rounds (Gou et al., 25 Nov 2025).
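This coarse-context update during autoregressive rollout can be sketched as below. Mean aggregation and the 60:1 fine-to-coarse ratio (minutes per hour) are assumptions for illustration.

```python
import numpy as np

RATIO = 60  # fine steps per coarse step (e.g., minutes per hour); assumed value

def update_coarse(coarse_ctx, new_fine):
    """Aggregate freshly generated fine-resolution values into coarse steps
    and append them to the coarse context (mean aggregation is an assumption)."""
    assert len(new_fine) % RATIO == 0
    new_coarse = new_fine.reshape(-1, RATIO).mean(axis=1)
    return np.concatenate([coarse_ctx, new_coarse])

coarse = np.zeros(4)                 # existing coarse context
fine_forecast = np.ones(120)         # two coarse steps' worth of fine output
coarse = update_coarse(coarse, fine_forecast)
```

Keeping the coarse stream in sync with generated fine output is what lets later decoding rounds condition on long-range structure without re-reading the full fine history.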
For 3D shapes, progressive inference with MDIF proceeds by first decoding the coarsest representation and then recursively adding finer residuals up to a chosen detail level $L$:

$$f(p) = f_0(p) + \sum_{l=1}^{L} \Delta f_l(p),$$

where $f_0$ is the global (level-0) decoded field and each $\Delta f_l$ is the residual correction predicted at level $l$. This enables computational efficiency by restricting high-resolution decoding to regions of interest.
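The coarse-plus-residuals evaluation can be sketched with toy per-level decoders. The closures below are stand-ins for MDIF's learned decoders; the specific functions are illustrative.

```python
import numpy as np

# Toy residual decoders per level, each mapping a 3D point to a field value.
# Level 0 carries the bulk signal; finer levels add small corrections.
decoders = [
    lambda p: np.sin(p).sum(),            # level 0: coarse global field
    lambda p: 0.1 * np.cos(2 * p).sum(),  # level 1 residual
    lambda p: 0.01 * np.sin(4 * p).sum(), # level 2 residual
]

def decode(p, level):
    """Evaluate the implicit field at p using levels 0..level only."""
    return sum(d(p) for d in decoders[: level + 1])

p = np.array([0.3, 0.7, 0.1])
coarse_only = decode(p, 0)   # cheap preview
full = decode(p, 2)          # full-detail evaluation
```

Because each level only refines the previous one, the evaluation can stop early for a cheap preview and spend the extra levels only on regions of interest.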
4. Training Regimen, Regularization, and Loss Objectives
Multiresolution decoder-only models are typically trained with objectives that match predicted outputs to ground truth at multiple quantiles or resolutions. The Cisco TSM uses a zero-shot training protocol with a composite objective that averages the quantile (pinball) loss $\ell_q$ over a set of target quantiles $q$; regularization includes weight decay, gradient clipping, loss-value clipping, and cosine-annealed learning rates for the AdamW and Muon optimizers (Gou et al., 25 Nov 2025).
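The quantile (pinball) loss and its average over several quantile heads can be sketched as follows. The specific quantile set `(0.1, 0.5, 0.9)` is an assumption for illustration.

```python
import numpy as np

def pinball(y, yhat, q):
    """Quantile (pinball) loss: under-prediction is penalized with weight q,
    over-prediction with weight 1 - q."""
    diff = y - yhat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def multi_quantile_loss(y, preds, quantiles=(0.1, 0.5, 0.9)):
    # Average pinball loss across quantile heads; the quantile set is assumed.
    return np.mean([pinball(y, preds[q], q) for q in quantiles])

y = np.array([1.0, 2.0, 3.0])
preds = {0.1: y - 0.5, 0.5: y, 0.9: y + 0.5}  # one head per target quantile
loss = multi_quantile_loss(y, preds)
```

Training against multiple quantiles gives the model a calibrated predictive interval rather than a single point forecast, which matters for the monitoring use cases discussed below.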
MDIF employs latent-grid dropout during encoder–decoder training (with a fixed drop rate, not applied to global connections), promoting resilience to missing features. Decoder-only latent optimization minimizes a fit error plus a regularization term over visible and occluded regions, using Adam with fixed hyperparameters (Chen et al., 2021).
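Latent-grid dropout can be sketched as randomly zeroing cells of the non-global grids. The cell-wise masking granularity and the example drop rate are assumptions; exempting the level-0 grid mirrors MDIF's exclusion of global connections.

```python
import numpy as np

def grid_dropout(grids, rate, rng):
    """Zero whole latent cells in the non-global grids during training.
    grids[0] is the global grid and is never dropped."""
    out = [grids[0]]
    for g in grids[1:]:
        mask = rng.random(g.shape[:3]) >= rate  # per-cell keep mask
        out.append(g * mask[..., None])         # broadcast over channels
    return out

rng = np.random.default_rng(0)
grids = [np.ones((1, 1, 1, 8)), np.ones((4, 4, 4, 8))]  # global + one fine level
dropped = grid_dropout(grids, rate=0.5, rng=rng)
```

Zeroing entire cells (rather than individual channels) forces the decoder to reconstruct local detail from neighboring codes and the global residual path, which is what builds robustness to partial observations.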
5. Empirical Performance and Case Studies
Multiresolution decoder-only architectures have demonstrated empirical advantages over single-resolution or extended context decoder-only models:
| Model | Domain | MASE (lower is better) |
|---|---|---|
| Cisco TSM (512,512) | Observability | 0.4569 |
| TimesFM-2.5 (512) | Observability | 0.7290 |
| TimesFM-2.5 (1024) | Observability | 0.6159 |
| Cisco TSM (512,512) | GIFT-Eval | 0.7365 |
| TimesFM-2.5 (1024) | GIFT-Eval | 0.6828 |
| TimesFM-2.0 (1024) | GIFT-Eval | 0.7620 |
The Cisco TSM achieves substantial error reduction on observability data and retains competitive results on public forecasting benchmarks compared to extended-context single-resolution architectures (Gou et al., 25 Nov 2025). Qualitative studies reveal patterns where coarse resolution inputs allow the model to capture long-term periodicities and mitigate high-frequency noise. Increasing the coarse context (e.g., from 512 to 1024 hours) systematically improves forecast accuracy and enables discovery of cyclic structure not evident at finer resolutions.
MDIF obtains state-of-the-art accuracy in 3D auto-encoding, point-cloud completion, and voxel super-resolution, with the capacity for true progressive refinement by selection of detail level at inference. Completion accuracy is measured by Chamfer L2 distance and F-Score (Chen et al., 2021). Latent-grid dropout further enhances decoder-only optimization under partial data conditions.
6. Applications and Implications
In time series analysis, multiresolution decoder-only models facilitate zero-shot long-context forecasting, especially in observability and monitoring domains where long-term patterns (e.g., weekly cycles) are influential. The ability to incorporate both high-frequency and low-frequency information allows for more robust extrapolation without downstream fine-tuning.
For 3D shape reconstruction, the hierarchical latent grid approach in MDIF enables both encoder–decoder and decoder-only optimization, supporting applications in shape completion from partial observations, super-resolution, and progressive mesh refinement.
A plausible implication is that multiresolution fusion via decoder-only modules provides a general mechanism for leveraging heterogeneous contextual information in autoregressive generation tasks. The explicit architectural separation of resolutions, combined with trainable fusion schemes, offers a scalable pathway for supporting long-term dependencies in sequence and spatial modeling.
7. Related Research Directions
The multiresolution paradigm aligns with ongoing efforts to improve scalability and compositionality in deep generative models. Hierarchical representations, progressive decoding, and specialized regularization (e.g., latent dropout, quantile losses) are converging themes. Models such as TimesFM (Das et al., 2024) and IM-Net-style MLP architectures represent foundational designs extended by multiresolution capabilities. Future research may investigate joint modeling of additional resolutions, adaptive context selection, and broader transferability across domains, particularly in cross-modal or multi-task autoregressive frameworks (Gou et al., 25 Nov 2025, Chen et al., 2021).