TimePerceiver: Temporal Perception Framework
- TimePerceiver is a framework that unifies temporal forecasting, sequence modeling, and biologically-inspired time perception through flexible input-target segmentation.
- It employs advanced encoder-decoder architectures with latent bottleneck tokens and query-based decoding to efficiently compress and process complex temporal data.
- Empirical evaluations across time-series and video tasks show improved prediction accuracy and significant token compression, reinforcing its practical and scalable design.
The TimePerceiver framework encompasses a family of models and modules for temporal perception and sequence modeling, with technical instantiations in generalized time-series forecasting (Lee et al., 27 Dec 2025), biologically-inspired time perception in agents (Lourenço et al., 2023), and compact temporal-spatial encoding in vision-language models (Kong et al., 21 May 2025). These models are unified by their capacity to encode, compress, and leverage temporal dependencies, often under demanding constraints such as arbitrary input-target segmentation, low token budgets, or biologically realistic inference.
1. Unified Formalization of Temporal Perception and Forecasting
TimePerceiver fundamentally generalizes the paradigm of time-series prediction by allowing arbitrary segmentation of input and target positions along the temporal axis. Formally, given a multivariate time series $X \in \mathbb{R}^{T \times C}$ over $T$ time steps and $C$ channels, the task is to select arbitrary disjoint index sets $\mathcal{I}, \mathcal{T} \subset \{1, \dots, T\}$ whose union covers the full interval, and to learn a mapping $f_\theta : X_{\mathcal{I}} \mapsto \hat{X}_{\mathcal{T}}$ that minimizes the normalized mean squared error between the prediction $\hat{X}_{\mathcal{T}}$ and the ground truth $X_{\mathcal{T}}$.
This generalization subsumes classic forecasting (extrapolation), interpolation, and imputation as special cases, enabling the model to handle complex temporal prediction objectives and arbitrary placement of missing data.
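The segmentation setup above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random-mask sampling, the mean-predictor baseline, and the variance-based NMSE normalization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multivariate series: T time steps, C channels.
T, C = 96, 3
x = rng.standard_normal((T, C))

# Arbitrary input/target segmentation: two disjoint index sets covering [0, T).
# Randomly marking positions as targets gives an imputation-style task;
# a contiguous suffix of targets would recover classic forecasting.
target_mask = rng.random(T) < 0.25
input_idx = np.flatnonzero(~target_mask)
target_idx = np.flatnonzero(target_mask)

def nmse(pred, true):
    """Normalized MSE: squared error scaled by the target variance."""
    return np.mean((pred - true) ** 2) / np.var(true)

# A trivial stand-in "model": predict each channel's mean over the inputs.
pred = np.tile(x[input_idx].mean(axis=0), (len(target_idx), 1))
score = nmse(pred, x[target_idx])
```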
2. Encoder and Compression Architectures
2.1. Time-Series Encoder-Decoder (Generalized Forecasting)
The encoder processes input patches via a hierarchical scheme:
- Patch Tokenization: The time series is split into disjoint patches of length $P$.
- Positional Embedding: Temporal and channel embeddings of dimension $d$ are added to each patch token.
- Latent Bottleneck Mechanism: The key innovation is a set of $L$ learnable latent tokens, with $L$ far smaller than the number of patch tokens. The latents first read from the patch embeddings via cross-attention, are then refined by several layers of latent self-attention, and are finally re-expanded by cross-attention back to the input-token positions.
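The read/refine/write pattern of the latent bottleneck can be sketched structurally as follows. This is a shape-level illustration only: learned query/key/value projections, residual connections, and MLP sublayers of a real Transformer block are omitted, and all sizes are arbitrary.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: q (M, d), k/v (N, d) -> (M, d)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
N, L, d = 256, 16, 32                    # N patch tokens, L << N latents
patches = rng.standard_normal((N, d))    # patch embeddings (+ positional terms)
latents = rng.standard_normal((L, d))    # learnable latent tokens

z = attend(latents, patches, patches)    # read: latents cross-attend to patches
for _ in range(2):                       # refine: latent self-attention layers
    z = attend(z, z, z)
out = attend(patches, z, z)              # write: patches cross-attend to latents
```

Note that the quadratic $N \times N$ interaction never occurs; all attention maps are $L \times N$, $L \times L$, or $N \times L$.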
2.2. Temporal-Spatial Compression in Vision Models
In video domain applications, as in Clapper VLM (Kong et al., 21 May 2025), TimePerceiver operates in a slow-fast scheme combined with substantial token compression:
- Slow Path: Key-frame pooling extracts high-resolution spatial tokens (e.g., $196$ tokens from $784$ per frame).
- Fast Path (TimePerceiver Module): Ingests all $4$ frames per segment and compresses their temporal dynamics via cross-attention into a compact set of latent tokens per $4$-frame segment, computed as $\mathrm{softmax}(QK^\top/\sqrt{d})\,V$, where $Q$, $K$, $V$ are learned projections of the pooled features.
- Compression Efficiency: Achieves roughly a $13\times$ reduction (from $784$ to $61$ tokens/frame).
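The token budget implied by these figures can be checked with back-of-envelope arithmetic. The fast-token count per segment below is inferred from the reported numbers, not stated explicitly in the source.

```python
# Back-of-envelope token budget for the slow-fast scheme.
raw_per_frame = 784        # uncompressed tokens per frame
slow_tokens = 196          # key-frame tokens per segment (slow path)
avg_per_frame = 61         # reported effective tokens per frame
frames_per_segment = 4

segment_budget = avg_per_frame * frames_per_segment   # tokens per 4-frame segment
fast_tokens = segment_budget - slow_tokens            # inferred fast-path tokens
ratio = raw_per_frame / avg_per_frame                 # compression factor
```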
3. Decoder Design and Query-Based Retrieval
The decoder employs learnable queries corresponding to target temporal positions and channels:
- For each target patch position and channel, a query embedding is constructed from the corresponding temporal and channel embeddings; the queries are stacked into a single query matrix.
- Decoding proceeds via cross-attention of the queries against the encoded representations, followed by a learned projection that maps each attended query back to patch space, producing the forecast patches.
This query-based mechanism enables flexible and efficient retrieval for arbitrary target sets, with parameter count constant with respect to the prediction horizon.
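A structural sketch of the query-based decoder, under the same caveats as before (projections and residual structure simplified; all sizes arbitrary). The point it illustrates is that the number of queries tracks the target set, so the horizon can change without changing any parameters.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, P = 32, 16                    # embedding dim, patch length
n_targets, n_enc = 12, 40        # target (patch, channel) pairs; encoder tokens

enc = rng.standard_normal((n_enc, d))        # encoder output tokens
# One query per target patch/channel position, built from positional
# embeddings; their count scales with the target set, not the parameters.
queries = rng.standard_normal((n_targets, d))

attended = softmax(queries @ enc.T / np.sqrt(d)) @ enc   # cross-attention
W_out = rng.standard_normal((d, P)) * 0.1                # learned projection (stand-in)
patches_hat = attended @ W_out                           # forecast patches
```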
4. Training, Complexity, and Scalability
The TimePerceiver training regime features single-stage, MSE-driven optimization over diversified temporal objectives. Training samples input and target segments at random, handling contiguous, disjoint, and mixed patterns robustly; there is no distinct pre-training phase, and encoder and decoder are optimized jointly. Reversible instance normalization (RevIN) addresses the distributional shifts typical of multivariate time series.
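The core of RevIN is per-instance, per-channel normalization that is inverted on the model's outputs, so predictions are made in a shift-invariant space but returned in the original scale. The sketch below shows only this normalize/denormalize pair; the learnable affine parameters of full RevIN are omitted.

```python
import numpy as np

def revin_norm(x, eps=1e-5):
    """Per-instance, per-channel normalization; returns stats for reversal."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denorm(y, stats):
    """Map model outputs back to the original scale of this instance."""
    mu, sigma = stats
    return y * sigma + mu

rng = np.random.default_rng(0)
x = 5.0 + 2.0 * rng.standard_normal((96, 3))   # shifted/scaled toy series
x_norm, stats = revin_norm(x)
x_back = revin_denorm(x_norm, stats)           # round-trips to the input
```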
Complexity reduction is achieved primarily through latent bottleneck architectures and cross-attention:
- Encoder: Reduces the $O(N^2)$ cost of full self-attention over $N$ patch tokens to $O(NL + L^2)$ with $L \ll N$ latent tokens.
- Video VLMs: In Clapper, this translates to a substantial reduction in FLOPs compared to naive full-token attention.
5. Biologically-Inspired Time Perception
The TimePerceiver framework in (Lourenço et al., 2023) formalizes dual-source time perception via:
- External Timing (ET): Bayesian inference with a Gaussian process likelihood over sensory streams (e.g., LIDAR vector observations), modeling perception of an external clock rate.
- Internal Timing (IT): Temporal-difference learning with exponentially decaying microstimuli, emulating dopaminergic reward-prediction error. The agent's value function is a linear readout of the microstimulus features, $V(t) = \mathbf{w}^\top \mathbf{x}(t)$, updated with eligibility traces:

$$\delta_t = r_t + \gamma V(t+1) - V(t), \qquad \mathbf{e}_t = \gamma \lambda\, \mathbf{e}_{t-1} + \mathbf{x}(t), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \mathbf{e}_t.$$
Integration of ET and IT enables cross-modal, biologically plausible timing behavior, matching animal psychometric curves and reward-prediction error dynamics.
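The internal-timing component can be sketched as TD(λ) over a microstimulus basis. This follows the general microstimuli construction (a decaying memory trace read out by Gaussian basis functions); the specific parameter values, basis placement, and task layout below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def microstimuli(t, n=10, tau=20.0, sigma=0.08):
    """Microstimulus features: a decaying trace y(t) read out by a bank
    of Gaussian basis functions spread along the trace's height."""
    y = np.exp(-t / tau)                      # exponentially decaying trace
    centers = np.linspace(1.0, 0.1, n)        # basis centers
    return y * np.exp(-((y - centers) ** 2) / (2 * sigma**2))

# TD(lambda) with linear value V(t) = w . x(t) and eligibility traces;
# a single reward arrives at the end of each T-step trial.
T, n = 50, 10
alpha, gamma, lam = 0.1, 0.98, 0.9
w = np.zeros(n)
for _ in range(200):                          # repeated trials
    e = np.zeros(n)
    for t in range(T - 1):
        x_t, x_next = microstimuli(t, n), microstimuli(t + 1, n)
        r = 1.0 if t + 1 == T - 1 else 0.0
        delta = r + gamma * w @ x_next - w @ x_t   # reward-prediction error
        e = gamma * lam * e + x_t                  # eligibility trace
        w += alpha * delta * e

# The learned value ramps up toward the expected reward time.
v_early = w @ microstimuli(5, n)
v_late = w @ microstimuli(T - 5, n)
```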
6. Empirical Evaluation and Performance
6.1. Time-Series Forecasting Benchmarks (Lee et al., 27 Dec 2025)
Across datasets ETTh1/2, ETTm1/2, Weather, Solar, Electricity, and Traffic, TimePerceiver achieves:
- Best MSE in $55/80$ settings; second-best $17/80$.
- Average rank $1.375$ (MSE), $1.550$ (MAE).
- Lower average MSE than both iTransformer and CARD.
Ablation studies show that generalized training objectives (vs. standard forecasting alone) improve both MSE and MAE.
6.2. Video Understanding (Clapper) (Kong et al., 21 May 2025)
TimePerceiver delivers significant compression (from $784$ to $61$ tokens/frame) while maintaining or improving QA accuracy:
- Accuracy on TempCompass, MVBench, and VideoMME matches or exceeds the uncompressed baseline.
- Competitive MLVU performance using only $6$k visual tokens per video.
Ablations over compression variants confirm that the TimePerceiver compression setting offers the best accuracy-efficiency trade-off.
6.3. Biological Timing Validity (Lourenço et al., 2023)
- TD-error traces reproduce animal reward-prediction-error dynamics.
- Psychometric curves (fraction of “Long” choices) match those of mice: sigmoidal and centered near the short/long category boundary.
- Weber’s law (scalar timing) emerges from the external timing module.
- Maximum-likelihood estimator recovers intrinsic parameters (e.g., microstimuli count) reliably from empirical behavior.
7. Relation to Prior and Adjacent Work
TimePerceiver draws upon and extends architectures such as Perceiver IO, Crossformer, CARD, and iTransformer. Key advances include latent bottlenecking for attention cost reduction, unified query-driven decoding aligned with flexible temporal objectives, and integration of both external and internal timing for biologically realistic perception. In VLMs, TimePerceiver advances compact spatio-temporal tokenization and compression strategies for large-scale video understanding under strict token/FLOP constraints.
8. Tabular Summary of Technical Variants
| Domain | Encoder Bottleneck | Decoder Type | Compression |
|---|---|---|---|
| Time-series | Latent tokens ($L \ll N$) | Query cross-attn | $O(N^2) \to O(NL + L^2)$ |
| Video VLM | Cross-attn module | MLP + LLM | $784 \to 61$ tokens/frame |
| Biological timing | Microstimuli bank | TD value function | N/A |
The framework accommodates both scientific investigation of time perception and practical deployment in high-throughput sequence models. Its modularity enables integration across domains, from neuroscience-inspired agents to deep learning for forecasting and video understanding.