Temporal Experts Averaging (TEA)

Updated 10 February 2026

Temporal Experts Averaging (TEA) is a learning framework that constructs temporally specialized experts using SI regularization to adapt to unseen future domains.
It employs PCA and ARIMA to forecast future weight configurations, enabling an adaptive averaging of expert parameters in the model’s weight space.
TEA demonstrates enhanced out-of-distribution accuracy with minimal computational overhead, scaling efficiently to large models and diverse datasets.

Temporal Experts Averaging (TEA) is a learning framework designed to address Temporal Domain Generalization (TDG)—the challenge of generalizing to unseen future domains under temporal distribution shift. Rather than attempting to explicitly predict the full set of model parameters for future domains, a task that is computationally infeasible for large-scale models, TEA leverages a strategy of regularized expert construction and adaptive ensembling in the weight space, enabling full-model adaptation within practical computational budgets (Liu et al., 30 Sep 2025).

1. Problem Setting and Objectives

The TDG problem considers a sequence of source domains $\{D_1, \dots, D_S\}$ indexed by timestamps $t_1 < \cdots < t_S$, with each domain $D_i$ drawn from a (possibly drifting) data distribution $p_i$. The goal is to learn a single parameter vector $\theta_{\rm TEA}$ that realizes a predictor $f(\cdot; \theta_{\rm TEA})$ exhibiting robust generalization to a future domain $D_f$ at time $t_f > t_S$, without access to that domain’s data or any retraining on $D_f$. TEA’s methodological focus is to maximize future generalization by strategically constructing and combining “experts” representing temporally localized optima, and to do so for the entire model, not just the classifier head.

2. Algorithmic Framework

2.1 Base Model Formation and Expert Construction

The process begins with empirical risk minimization (ERM) to learn a base model $\theta_{\rm base}$:

$\theta_{\rm base} = \arg\min_{\theta} \sum_{i=1}^S \mathbb{E}_{(x, y) \sim D_i}[\ell(f(x;\theta), y)]$

From this domain-agnostic base, $S$ “temporal experts” $\{\theta_i\}_{i=1}^S$ are sequentially derived via fine-tuning on each $D_i$ in reverse chronological order ($i = S \to 1$), each time initializing from $\theta_{i+1}$ (with $\theta_{S+1} \equiv \theta_{\rm base}$). Importantly, Synaptic Intelligence (SI) regularization is applied to preserve parameter locality, modifying the fine-tuning objective:

$\mathcal{L}_{\rm SI}(\theta) = \mathcal{L}_{\rm ERM}(\theta) + c_{\rm si}\sum_{j} \omega_j (\theta_j - \theta_{i+1, j})^2$

where $\omega_j$ denotes the SI-assigned importance of parameter $j$, and $c_{\rm si}$ controls the regularization strength. During fine-tuning, $K$ model checkpoints are sampled and averaged to yield each $\theta_i$, mitigating stochastic effects.
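The penalty term and the reverse-chronological schedule can be sketched in a few lines. This is a minimal illustration on flat Python lists, not the reference implementation: the function names (`si_penalty`, `si_loss`, `average_checkpoints`, `expert_schedule`) and the toy values are assumptions introduced here, and the ERM loss is treated as a number supplied by the caller.

```python
# Illustrative sketch only: all names and values below are hypothetical.

def si_penalty(theta, anchor, omega, c_si):
    """c_si * sum_j omega_j * (theta_j - anchor_j)^2, with anchor = theta_{i+1}."""
    return c_si * sum(w * (t - a) ** 2 for t, a, w in zip(theta, anchor, omega))

def si_loss(erm_loss, theta, anchor, omega, c_si):
    """Regularized objective: ERM loss plus the SI penalty."""
    return erm_loss + si_penalty(theta, anchor, omega, c_si)

def average_checkpoints(checkpoints):
    """Element-wise mean of K sampled checkpoints (each a flat list of floats)."""
    k = len(checkpoints)
    return [sum(values) / k for values in zip(*checkpoints)]

def expert_schedule(num_domains):
    """Pairs (i, i + 1): expert i is initialized from, and anchored to, expert i + 1.
    Index num_domains + 1 stands for the base model."""
    return [(i, i + 1) for i in range(num_domains, 0, -1)]

# Toy example with three parameters.
theta_base = [0.5, -1.0, 2.0]
omega = [1.0, 0.0, 4.0]      # the second parameter is unimportant, so it moves freely
theta = [0.7, 3.0, 1.5]
penalty = si_penalty(theta, theta_base, omega, c_si=0.1)
schedule = expert_schedule(3)  # [(3, 4), (2, 3), (1, 2)]
```

A parameter with $\omega_j = 0$ contributes nothing to the penalty, which is how the regularizer lets unimportant weights specialize while holding important ones near the previous expert.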

2.2 Principal Components Analysis of Weight Trajectories

For adaptive combination, TEA models the “temporal trajectory” of expert weights. Each expert’s deviation is defined as $\delta\theta_i = \theta_i - \theta_{\rm base}$, and these are aggregated column-wise into a matrix $\Delta \in \mathbb{R}^{d \times S}$. Principal component analysis (PCA) is performed on $\Delta\Delta^\top$, yielding a set of principal vectors $\{v_p\}_{p=1}^P$. Each expert is projected into this $P$-dimensional subspace by $c_i^p = \langle \delta\theta_i, v_p \rangle$, forming a low-dimensional trajectory representation $\mathbf{c}_i = (c_i^1, \dots, c_i^P)$ for each expert.
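Since the parameter count $d$ is typically far larger than the number of domains $S$, forming the $d \times d$ matrix $\Delta\Delta^\top$ explicitly is impractical. One standard linear-algebra route, not necessarily the one used in the reference, is to eigendecompose the $S \times S$ Gram matrix instead. The symbols $G$, $u_p$, and $\lambda_p$ are introduced here for illustration only:

```latex
\begin{aligned}
G &= \Delta^\top \Delta \in \mathbb{R}^{S \times S}, \qquad G u_p = \lambda_p u_p, \quad \lambda_p > 0, \quad \|u_p\|_2 = 1, \\
v_p &= \frac{\Delta u_p}{\sqrt{\lambda_p}} \;\Longrightarrow\; \Delta\Delta^\top v_p = \frac{\Delta (G u_p)}{\sqrt{\lambda_p}} = \lambda_p v_p, \qquad \|v_p\|_2^2 = \frac{u_p^\top G u_p}{\lambda_p} = 1, \\
c_i^p &= \langle \delta\theta_i, v_p \rangle = \frac{(G u_p)_i}{\sqrt{\lambda_p}} = \sqrt{\lambda_p}\,(u_p)_i .
\end{aligned}
```

Under this route the projections $c_i^p$ follow directly from the small eigenproblem, so the $d$-dimensional vectors $v_p$ never have to be materialized, and at most $S$ components carry nonzero variance.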

2.3 Adaptive Averaging via Temporal Forecasting

For each principal component, the sequence $\{(t_i, c_i^p)\}_{i=1}^S$ is treated as a univariate time series, and an ARIMA(1,1,1) model is used to forecast the future projection $\hat{c}^p(t_f)$. The forecasted vector $\mathbf{c}_f = (\hat{c}^1(t_f), \dots, \hat{c}^P(t_f))$ represents the anticipated configuration at time $t_f$. Distances $d_i = \|\mathbf{c}_i - \mathbf{c}_f\|_2$ are computed, and adaptive coefficients are defined by

$\alpha_i = \frac{(d_{\max} - d_i)^r}{\sum_{j=1}^S (d_{\max} - d_j)^r}$

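where $d_{\max} = \max_i d_i$ and $r > 0$ is a sharpness parameter that interpolates between near-uniform and winner-takes-all weighting. As $r \to 0$ the weights approach a uniform distribution over all experts except the farthest one, which always receives zero weight because $d_{\max} - d_i = 0$ for it; as $r \to \infty$ the weight concentrates on the expert nearest to the forecast.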
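The coefficient formula is simple enough to check by hand. The sketch below takes the projected expert coordinates and the forecast as given (the ARIMA step is not reproduced here) and evaluates $\alpha_i$ directly. The function name `adaptive_coefficients` and the toy coordinates are assumptions for illustration, and the uniform fallback for the degenerate case where every expert is equidistant from the forecast is a choice made here, since the formula is $0/0$ in that case.

```python
import math

def adaptive_coefficients(expert_coords, forecast, r=1.0):
    """alpha_i = (d_max - d_i)^r / sum_j (d_max - d_j)^r with Euclidean distances d_i."""
    if r <= 0:
        raise ValueError("sharpness r must be positive")
    dists = [math.dist(c, forecast) for c in expert_coords]
    d_max = max(dists)
    scores = [(d_max - d) ** r for d in dists]
    total = sum(scores)
    if total == 0.0:
        # Degenerate case (all experts equidistant): fall back to uniform weights.
        # This fallback is an assumption of this sketch, not part of the formula.
        return [1.0 / len(dists)] * len(dists)
    return [s / total for s in scores]

# Toy trajectory in a P = 2 subspace; the forecast coincides with the latest expert.
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
forecast = (6.0, 8.0)              # distances are 10, 5, 0

alphas_r1 = adaptive_coefficients(coords, forecast, r=1.0)   # [0, 1/3, 2/3]
alphas_r2 = adaptive_coefficients(coords, forecast, r=2.0)   # [0, 0.2, 0.8]
```

Raising $r$ from 1 to 2 shifts weight from the middle expert to the nearest one, while the farthest expert receives zero weight in both cases.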

2.4 Final Model Synthesis

The final TEA model is synthesized via weight-space averaging:

$\theta_{\rm TEA} = \sum_{i=1}^S \alpha_i \theta_i$

At inference time, no further adaptation is performed; $f(\cdot; \theta_{\rm TEA})$ is evaluated directly.
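As a concrete illustration of the averaging step, the sketch below merges experts stored as dictionaries that map a parameter name to a flat list of floats. The storage layout, the function name `average_experts`, and the toy values are assumptions made here; a real model would apply the same element-wise sum to its tensors.

```python
def average_experts(experts, alphas, tol=1e-8):
    """Weight-space average: theta_TEA = sum_i alpha_i * theta_i, parameter by parameter."""
    if len(experts) != len(alphas):
        raise ValueError("need exactly one coefficient per expert")
    if abs(sum(alphas) - 1.0) > tol:
        raise ValueError("coefficients must sum to 1")
    merged = {}
    for name, reference in experts[0].items():
        merged[name] = [
            sum(a * expert[name][k] for a, expert in zip(alphas, experts))
            for k in range(len(reference))
        ]
    return merged

# Two toy experts sharing the same parameter names and shapes.
expert_a = {"w": [1.0, 2.0], "b": [0.0]}
expert_b = {"w": [3.0, 6.0], "b": [1.0]}

theta_tea = average_experts([expert_a, expert_b], alphas=[0.25, 0.75])
# theta_tea == {"w": [2.5, 5.0], "b": [0.75]}
```

The result is a single parameter set, so inference costs the same as for one expert regardless of $S$.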

3. Theoretical Foundations

TEA’s advantage is supported by a theoretical analysis decomposing ensembling error under smooth drift. A first-order Taylor expansion demonstrates that weight-space ensembling approximates function-space ensembling up to $O(\Delta^2)$, where $\Delta = \max_i \|\theta_i - \theta_{\rm TEA}\|$ here denotes a scalar radius, not the deviation matrix of Section 2.2.
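The first-order argument can be written out explicitly. The sketch below assumes, for illustration, that $f(x;\theta)$ is twice continuously differentiable in $\theta$ with bounded second derivatives near $\theta_{\rm TEA}$, and it uses only $\sum_i \alpha_i = 1$ and $\theta_{\rm TEA} = \sum_i \alpha_i \theta_i$; the precise conditions used in the reference may differ.

```latex
\begin{aligned}
f(x;\theta_i) &= f(x;\theta_{\rm TEA}) + \nabla_\theta f(x;\theta_{\rm TEA})^\top (\theta_i - \theta_{\rm TEA}) + O\!\left(\|\theta_i - \theta_{\rm TEA}\|^2\right), \\
\sum_i \alpha_i f(x;\theta_i) &= f(x;\theta_{\rm TEA}) + \nabla_\theta f(x;\theta_{\rm TEA})^\top \Big(\sum_i \alpha_i \theta_i - \theta_{\rm TEA}\Big) + O(\Delta^2) \\
&= f(x;\theta_{\rm TEA}) + O(\Delta^2).
\end{aligned}
```

The linear term vanishes because the averaged weights equal $\theta_{\rm TEA}$ by construction, so the gap between the function-space ensemble and the weight-averaged model is second order in the spread of the experts. This is why keeping experts close together matters for the validity of weight averaging.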

The error is further characterized via the bias–variance–covariance–locality (BVCL) decomposition:

$\mathbb{E}_f\left[\left(\sum_i \alpha_i \mathrm{bias}_i\right)^2\right] + \sum_i \alpha_i^2 \mathrm{var}_i + \sum_{i \neq j} \alpha_i \alpha_j \mathrm{cov}_{i,j} + O(\overline{\Delta}^2)$

Optimal generalization is achieved by balancing: (1) functional diversity among experts to minimize covariance, effected by domain specialization; (2) parameter proximity to ensure ensembling validity, achieved via SI regularization. The adaptively estimated $\alpha_i$ coefficients interpolate between variance minimization (which prefers uniform averaging) and bias minimization (which prefers the expert nearest to the predicted future configuration). Key analytic results formalize these trade-offs, including Lemma 2 (variance minimization by uniform weights) and Lemma 3 (worst-case bias minimization by assigning all weight to the best-forecasted expert), with the full BVCL decomposition and proofs provided in the appendix of the reference (Liu et al., 30 Sep 2025).
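The variance side of this trade-off can be illustrated with a short calculation. The exact hypotheses of Lemma 2 are given in the reference; the special case below assumes, purely for illustration, equal variances $\mathrm{var}_i = \sigma^2$ and zero covariances, with nonnegative coefficients summing to one.

```latex
\begin{aligned}
\sum_{i=1}^S \alpha_i^2\, \mathrm{var}_i &= \sigma^2 \sum_{i=1}^S \alpha_i^2
\;\ge\; \sigma^2\, \frac{\big(\sum_{i=1}^S \alpha_i\big)^2}{S} = \frac{\sigma^2}{S}
&& \text{(Cauchy--Schwarz, equality iff } \alpha_i = 1/S\text{)}, \\
\alpha = e_k \;&\Longrightarrow\; \sum_{i=1}^S \alpha_i^2\, \mathrm{var}_i = \sigma^2
&& \text{(all weight on a single expert } k\text{)}.
\end{aligned}
```

In this special case uniform averaging reduces the variance term by a factor of $S$ relative to selecting one expert, which is the cost incurred when the coefficients are sharpened toward the best-forecasted expert to reduce bias.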

4. Empirical Performance and Benchmarks

TEA is systematically evaluated on seven TDG benchmarks spanning vision and text, various model architectures (ConvNet, CNN, DenseNet-121, ResNet-18/50, DistilBERT), and a range of scales (datasets from $\sim$30K to 2M samples; models from 29K to 66M parameters). Benchmarks include Rotated MNIST, Yearbook, FMoW, CLEAR-10/100, HuffPost, and Arxiv paper titles.

Across all settings, TEA demonstrates consistent superiority over previous TDG and classifier-head–only baselines (GI ’21, LSSAE ’22, DRAIN ’23, EvoS ’24, W-Diff ’24). The mean out-of-distribution (OOD)-averaged accuracy rises to 75.76% (from 69.74% for DiWA and 71.87% for Mixup), with relative gains of up to 69% on individual tasks. TEA sustains this performance with only $\sim 1.33\times$ the GPU hours of ERM, making it far cheaper than GI/LSSAE (7–12$\times$ slower than ERM) and W-Diff (81$\times$ slower).

Critical ablation studies confirm that both the construction of temporally specialized, SI-regularized experts and the data-driven adaptive averaging mechanism are necessary. Ablations reveal that uniform averaging or adaptive averaging on base snapshots yields lower OOD-avg (74.6–74.7%) compared to full TEA (75.8%). Partial fine-tuning (using only selected domains) and limited memory scenarios (continual domain generalization with 10% buffer) preserve most of TEA’s effectiveness, and TEA remains competitive even under abrupt domain shifts.

5. Scalability, Resource Considerations, and Limitations

TEA is designed to minimize the computational cost relative to methods that require full-model weight prediction for future domains. The end-to-end process, comprising initial ERM plus $S$ SI-regularized fine-tuning passes, incurs an overall wall-clock cost of approximately $1.33\times$ that of ERM alone. The cost of the PCA and ARIMA steps is negligible for $S \lesssim 50$ domains. The method is validated for models up to DistilBERT (66M parameters) and datasets of up to millions of samples.

The efficiency gains are coupled to the assumption of smooth temporal drift and the availability of clearly separated, discrete domains. When domains are shuffled to simulate abrupt shift, TEA's advantage narrows, though it remains on par with strong DG baselines. A plausible implication is that extensions would be required to address regimes characterized by sudden or adversarial domain transitions.

6. Extensions, Open Directions, and Implications

Several extensions are noted as promising avenues for further research. These include: (i) adapting TEA to continuous TDG regimes by employing continuous-time forecasting instead of ARIMA on discrete timestamps; (ii) employing alternative forecasting methods such as Gaussian Processes or vector autoregressive models in the PCA subspace; (iii) meta-learning the averaging coefficients $\alpha$ via a small neural network trained on validation data; and (iv) scaling TEA to billion-parameter LLMs to address temporal adaptation at that scale.

The analysis and empirical results collectively indicate that by structuring parameter-space diversity with locality-preserving SI and by adaptively forecasting parameter trajectories, TEA is able to generalize more effectively and efficiently to future domains under temporal distribution shift than previously proposed TDG approaches (Liu et al., 30 Sep 2025).

References (1)

Liu et al., Scaling Up Temporal Domain Generalization via Temporal Experts Averaging (30 Sep 2025)
