TimePerceiver: Temporal Perception Framework

Updated 23 January 2026
  • TimePerceiver is a framework that unifies temporal forecasting, sequence modeling, and biologically-inspired time perception through flexible input-target segmentation.
  • It employs advanced encoder-decoder architectures with latent bottleneck tokens and query-based decoding to efficiently compress and process complex temporal data.
  • Empirical evaluations across time-series and video tasks show improved prediction accuracy and significant token compression, reinforcing its practical and scalable design.

The TimePerceiver framework encompasses a family of models and modules for temporal perception and sequence modeling, with technical instantiations in generalized time-series forecasting (Lee et al., 27 Dec 2025), biologically-inspired time perception in agents (Lourenço et al., 2023), and compact temporal-spatial encoding in vision-language models (Kong et al., 21 May 2025). These models are unified by their capacity to encode, compress, and leverage temporal dependencies, often under demanding constraints such as arbitrary input-target segmentation, low token budgets, or biologically realistic inference.

1. Unified Formalization of Temporal Perception and Forecasting

TimePerceiver generalizes the standard paradigm of time-series prediction by allowing arbitrary segmentation of input and target positions along the temporal axis. The formal task is, given a multivariate time series $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T] \in \mathbb{R}^{C \times T}$, to select arbitrary disjoint index sets $\mathcal{I}, \mathcal{J} \subseteq \{1, \ldots, T\}$ with $\mathcal{I} \cup \mathcal{J} = \{1, \ldots, T\}$, and learn $g_\theta$ such that

$$\widehat{\mathbf{X}}_{\mathcal{J}} = g_\theta(\mathbf{X}_{\mathcal{I}}, \mathcal{I}, \mathcal{J})$$

minimizes the normalized mean squared error

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{J}|\,C}\sum_{j\in \mathcal{J}}\|\widehat{\mathbf{x}}_j - \mathbf{x}_j\|^2_2.$$

This generalization subsumes classic forecasting (extrapolation), interpolation, and imputation as special cases, enabling the model to handle complex temporal prediction objectives and arbitrary positioning of missing data.
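To make the objective concrete, here is a minimal PyTorch sketch; the $(C, T)$ tensor layout and the `g_theta` call are illustrative assumptions, not an interface from the paper:

```python
import torch

def nmse_loss(x_hat: torch.Tensor, x_true: torch.Tensor, J: torch.Tensor) -> torch.Tensor:
    """L(theta) = 1/(|J| C) * sum over j in J of ||x_hat_j - x_j||^2,
    for x_hat, x_true of shape (C, T) and J a LongTensor of target indices."""
    diff = x_hat[:, J] - x_true[:, J]        # (C, |J|)
    return diff.pow(2).sum() / (J.numel() * x_true.shape[0])

# Classic forecasting is one special case of the (I, J) segmentation:
T, horizon = 96, 24
idx = torch.arange(T)
I, J = idx[:-horizon], idx[-horizon:]        # inputs precede targets
# x_hat = g_theta(x[:, I], I, J)             # hypothetical model interface
```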

2. Encoder and Compression Architectures

2.1. Time-Series Encoder-Decoder (Generalized Forecasting)

The encoder processes input patches via a hierarchical scheme:

  • Patch Tokenization: Time series are split into $N = T/P$ disjoint patches of length $P$.
  • Positional Embedding: Temporal ($\mathbf{E}^{\mathrm{tem}}$) and channel ($\mathbf{E}^{\mathrm{chan}}$) embeddings of dimension $D$ are added to each patch.
  • Latent Bottleneck Mechanism: The key innovation is the introduction of $M$ ($M \ll N$) learnable latent tokens $\mathbf{Z}^{(0)}$, which interact with patch embeddings via cross-attention, $\mathbf{Z}^{(1)} = \mathrm{AttnBlock}(\mathbf{Z}^{(0)}, \mathbf{H}^{(0)}, \mathbf{H}^{(0)})$, followed by $K$ layers of latent self-attention, $\mathbf{Z}^{(k+1)} = \mathrm{AttnBlock}(\mathbf{Z}^{(k)}, \mathbf{Z}^{(k)}, \mathbf{Z}^{(k)})$, and re-expansion by cross-attention back to the input tokens (see the sketch below).
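A minimal PyTorch sketch of this encoder; treating `nn.MultiheadAttention` as the `AttnBlock`, the hyperparameter values, and the omission of channel embeddings, residual connections, and feed-forward sublayers are all simplifying assumptions:

```python
import torch
import torch.nn as nn

class LatentBottleneckEncoder(nn.Module):
    """Perceiver-style encoder (Sec. 2.1): N patch tokens are read into
    M << N latents, processed, and written back to the token stream."""
    def __init__(self, n_patches, patch_len, d_model=128, n_latents=16, k_layers=2):
        super().__init__()
        self.patch_embed = nn.Linear(patch_len, d_model)                # patch tokenization
        self.pos_embed = nn.Parameter(torch.randn(n_patches, d_model))  # temporal embedding E^tem
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))    # Z^(0)
        self.read = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.latent_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, 4, batch_first=True) for _ in range(k_layers)])
        self.write = nn.MultiheadAttention(d_model, 4, batch_first=True)

    def forward(self, patches):                          # patches: (B, N, P)
        h = self.patch_embed(patches) + self.pos_embed   # H^(0): (B, N, D)
        z = self.latents.expand(h.size(0), -1, -1)       # (B, M, D)
        z, _ = self.read(z, h, h)                        # Z^(1) = AttnBlock(Z^(0), H, H)
        for layer in self.latent_layers:                 # K latent self-attention layers
            z, _ = layer(z, z, z)
        h_out, _ = self.write(h, z, z)                   # re-expansion back to N tokens
        return h_out                                     # H^(1): (B, N, D)
```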

2.2. Temporal-Spatial Compression in Vision Models

In video applications, such as the Clapper VLM (Kong et al., 21 May 2025), TimePerceiver operates in a slow-fast scheme combined with substantial token compression:

  • Slow Path: Key-frame pooling extracts high-resolution spatial tokens (e.g., $196$ tokens from $784$ per frame).
  • Fast Path (TimePerceiver Module): Ingests all $4$ frames per segment and compresses temporal dynamics via cross-attention into $M = 49$ tokens per $4$-frame segment (see the sketch after this list), using

$$Z = \mathrm{Softmax}(QK^\top/\sqrt{d})\,V,$$

where $Q, K, V$ are learned projections of the pooled features.

  • Compression Efficiency: Achieves a $13\times$ reduction (from $784$ to $\sim 61$ tokens per frame).
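A hedged sketch of the fast path as a cross-attention compressor; layer widths, head counts, and pooling details are assumptions about the published design:

```python
import torch
import torch.nn as nn

class SlowFastCompressor(nn.Module):
    """Slow-fast token compression (Sec. 2.2): 196 slow key-frame tokens plus
    49 fast tokens per 4-frame segment, i.e. ~61 tokens per frame."""
    def __init__(self, d_model=1024, n_fast=49):
        super().__init__()
        self.fast_queries = nn.Parameter(torch.randn(n_fast, d_model))   # M = 49
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

    def forward(self, segment_tokens, keyframe_tokens):
        # segment_tokens: (B, 4*784, D), every token of a 4-frame segment
        # keyframe_tokens: (B, 196, D), the pooled slow-path tokens
        q = self.fast_queries.expand(segment_tokens.size(0), -1, -1)     # (B, 49, D)
        fast, _ = self.cross_attn(q, segment_tokens, segment_tokens)     # Softmax(QK^T/sqrt(d))V
        return torch.cat([keyframe_tokens, fast], dim=1)                 # (B, 245, D)
```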

3. Decoder Design and Query-Based Retrieval

The decoder employs learnable queries corresponding to target temporal positions and channels:

  • For each target patch and channel, query embeddings are constructed and stacked as $\mathbf{Q}^{(0)}$.
  • Decoding proceeds via cross-attention against the encoded representations, $\mathbf{Q}^{(1)} = \mathrm{AttnBlock}(\mathbf{Q}^{(0)}, \mathbf{H}^{(1)}, \mathbf{H}^{(1)})$, followed by a learned projection mapping queries back to patch space, producing forecast patches.

This query-based mechanism enables flexible and efficient retrieval for arbitrary target sets, with a parameter count that is constant with respect to the prediction horizon.
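A matching decoder sketch (again PyTorch; the exact construction of query embeddings from (position, channel) pairs is an assumption):

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Query-based decoder (Sec. 3): one learnable query per target
    (patch position, channel) pair, cross-attending into the encoder output."""
    def __init__(self, max_patches, n_channels, d_model=128, patch_len=16):
        super().__init__()
        self.pos_table = nn.Embedding(max_patches, d_model)   # temporal part of Q^(0)
        self.chan_table = nn.Embedding(n_channels, d_model)   # channel part of Q^(0)
        self.cross_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.to_patch = nn.Linear(d_model, patch_len)         # projection to patch space

    def forward(self, h_enc, target_pos, target_chan):
        # h_enc: (B, N, D); target_pos, target_chan: (Q,) index tensors
        q = self.pos_table(target_pos) + self.chan_table(target_chan)   # Q^(0): (Q, D)
        q = q.unsqueeze(0).expand(h_enc.size(0), -1, -1)                # (B, Q, D)
        q, _ = self.cross_attn(q, h_enc, h_enc)   # Q^(1) = AttnBlock(Q^(0), H^(1), H^(1))
        return self.to_patch(q)                   # forecast patches: (B, Q, P)
```

Note that a longer horizon only adds query rows; no weight tensor grows with $|\mathcal{J}|$, which is the sense in which the parameterization is constant in the prediction horizon.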

4. Training, Complexity, and Scalability

The TimePerceiver training regime features single-stage, MSE-driven optimization over diversified temporal objectives. Training employs random sampling over input and target segments, handling contiguous, disjoint, and mixed patterns robustly; there is no distinct pre-training phase, and the encoder and decoder are jointly optimized. Instance normalization (RevIN) is used to address the distribution shifts typical of multivariate time series.
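A hypothetical sampler illustrating this randomization; the three modes, their mixing, and all constants are assumptions, and edge cases such as an empty target set are not handled:

```python
import torch

def sample_segments(T: int, min_target: int = 8):
    """Draw target indices J as a contiguous block, scattered points, or a
    mix; the input set I is the complement, so I and J partition {0..T-1}."""
    idx = torch.arange(T)
    mode = torch.randint(0, 3, ()).item()
    if mode == 0:                                   # contiguous block (forecasting-like)
        start = torch.randint(0, T - min_target, ()).item()
        length = torch.randint(min_target, T - start + 1, ()).item()
        J = idx[start:start + length]
    elif mode == 1:                                 # scattered points (imputation-like)
        J = idx[torch.rand(T) < 0.3]
    else:                                           # mixed: trailing block plus stragglers
        J = torch.unique(torch.cat([idx[-min_target:], idx[torch.rand(T) < 0.1]]))
    I = idx[~torch.isin(idx, J)]
    return I, J
```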

Complexity reduction is achieved primarily through latent bottleneck architectures and cross-attention:

  • Encoder: Reduces the quadratic cost $\mathcal{O}(N^2)$ of full self-attention to $\mathcal{O}(NM + KM^2)$.
  • Video VLMs: In Clapper, this translates to a reduction by over $160\times$ in FLOPs compared to naive full-token attention.
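The asymptotic gap is easy to make concrete; the sizes below are illustrative, not from the paper:

```python
# Pairwise-interaction counts for full self-attention vs. the latent bottleneck.
N, M, K = 512, 32, 4                  # patches, latent tokens, latent layers
full_self_attn = N * N                # O(N^2)
bottleneck = N * M + K * M * M        # O(NM + KM^2)
print(full_self_attn / bottleneck)    # 12.8, i.e. ~13x fewer interactions
```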

5. Biologically-Inspired Time Perception

The TimePerceiver framework in (Lourenço et al., 2023) formalizes dual-source time perception via:

  • External Timing (ET): Bayesian inference ($\hat\tau$ via a Gaussian process likelihood) from sensory streams (e.g., LIDAR vector observations), modeling perception of external clock rate.
  • Internal Timing (IT): Temporal-difference learning with exponentially decaying microstimuli, emulating dopaminergic reward-prediction error. The agent’s decision policies leverage eligibility traces and value functions

$$Q_t(s,a) = w_t^\top x_t(s,a),$$

with

$$\delta_t = r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t).$$

Integration of ET and IT enables cross-modal, biologically plausible timing behavior, matching animal psychometric curves and reward-prediction error dynamics.
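A compact numerical sketch of the internal-timing component; all constants are illustrative, and the features follow the standard microstimulus construction (a decaying memory trace read out through Gaussian basis functions):

```python
import numpy as np

def microstimuli(t: int, n: int = 10, decay: float = 0.985, sigma: float = 0.08):
    """Microstimulus features at discrete time t: the height of an
    exponentially decaying trace, filtered by n Gaussian basis functions."""
    trace = decay ** t                       # decaying trace in (0, 1]
    centers = np.linspace(1.0, 0.1, n)       # basis centers along the trace height
    return trace * np.exp(-((trace - centers) ** 2) / (2 * sigma ** 2))

# One linear TD update on microstimulus features (eligibility traces omitted):
w = np.zeros(10)                              # value weights
gamma, alpha, r = 0.98, 0.1, 0.0              # discount, learning rate, reward
x_t, x_next = microstimuli(5), microstimuli(6)
delta = r + gamma * (w @ x_next) - (w @ x_t)  # TD error, cf. delta_t above
w += alpha * delta * x_t                      # weight update toward reducing delta
```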

6. Empirical Evaluation and Performance

On the time-series benchmarks ETTh1/2, ETTm1/2, Weather, Solar, Electricity, and Traffic, TimePerceiver achieves:

  • Best MSE in $55/80$ settings; second-best in $17/80$.
  • Average rank $1.375$ (MSE), $1.550$ (MAE).
  • $8.5\%$ lower MSE vs. iTransformer; $5.6\%$ vs. CARD.

Ablation studies show generalized objectives (vs. standard forecasting) improve MSE by $5.0\%$ and MAE by $3.4\%$.

In the video setting (Clapper), TimePerceiver delivers significant compression (from $784$ to $\sim 61$ tokens per frame) while maintaining or improving QA accuracy:

  • TempCompass: $65.5\%$ vs. $63.1\%$ baseline.
  • MVBench: $57.2\%$ vs. $55.4\%$.
  • VideoMME: $59.3\%$ vs. $59.1\%$.
  • MLVU: $69.8\%$, using only $6$k visual tokens per video.

Ablations (compression variants) confirm optimal trade-offs at $13\times$ token reduction via TimePerceiver.

For the biologically-inspired timing variant:

  • TD-error traces reproduce animal reward-prediction-error dynamics.
  • Psychometric curves (fraction of “Long” choices) match those of mice: sigmoidal and centered at $\tau = 4$.
  • Weber's law (scalar timing) emerges from the external timing module.
  • A maximum-likelihood estimator reliably recovers intrinsic parameters (e.g., the microstimuli count) from empirical behavior.

7. Relation to Prior and Adjacent Work

TimePerceiver draws upon and extends architectures such as Perceiver IO, Crossformer, CARD, and iTransformer. Key advances include latent bottlenecking for attention cost reduction, unified query-driven decoding aligned with flexible temporal objectives, and integration of both external and internal timing for biologically realistic perception. In VLMs, TimePerceiver advances compact spatio-temporal tokenization and compression strategies for large-scale video understanding under strict token/FLOP constraints.

8. Tabular Summary of Technical Variants

| Domain | Encoder bottleneck | Decoder type | Compression |
| --- | --- | --- | --- |
| Time-series | Latent tokens ($M$) | Query cross-attn | $\mathcal{O}(NM)$ |
| Video VLM | Cross-attn module | MLP + LLM | $13\times$ tokens |
| Biological timing | Microstimuli bank | TD value function | N/A |

The framework accommodates both scientific investigation of time perception and practical deployment in high-throughput sequence models. Its modularity enables integration across domains, from neuroscience-inspired agents to deep learning for forecasting and video understanding.
