
Temporal Experts Averaging (TEA)

Updated 10 February 2026
  • Temporal Experts Averaging (TEA) is a learning framework that constructs temporally specialized experts using SI regularization to adapt to unseen future domains.
  • It employs PCA and ARIMA to forecast future weight configurations, enabling an adaptive averaging of expert parameters in the model’s weight space.
  • TEA demonstrates enhanced out-of-distribution accuracy with minimal computational overhead, scaling efficiently to large models and diverse datasets.

Temporal Experts Averaging (TEA) is a learning framework designed to address Temporal Domain Generalization (TDG)—the challenge of generalizing to unseen future domains under temporal distribution shift. Rather than attempting to explicitly predict the full set of model parameters for future domains, a task that is computationally infeasible for large-scale models, TEA leverages a strategy of regularized expert construction and adaptive ensembling in the weight space, enabling full-model adaptation within practical computational budgets (Liu et al., 30 Sep 2025).

1. Problem Setting and Objectives

The TDG problem considers a sequence of source domains $\{D_1, \dots, D_S\}$ indexed by timestamps $t_1 < \cdots < t_S$, with each domain $D_i$ drawn from a (possibly drifting) data distribution $p_i$. The goal is to learn a single parameter vector $\theta_{\rm TEA}$ that realizes a predictor $f(\cdot; \theta_{\rm TEA})$ exhibiting robust generalization to a future domain at time $t_f > t_S$, without access to that domain’s data or any retraining on $D_f$. TEA’s methodological focus is to maximize future generalization by strategically constructing and combining “experts” representing temporally localized optima, and to do so for the entire model, not just the classifier head.

2. Algorithmic Framework

2.1 Base Model Formation and Expert Construction

The process begins with empirical risk minimization (ERM) to learn a base model $\theta_{\rm base}$:

$$\theta_{\rm base} = \arg\min_{\theta} \sum_{i=1}^S \mathbb{E}_{(x, y) \sim D_i}[\ell(f(x;\theta), y)]$$

From this domain-agnostic base, $S$ “temporal experts” $\{\theta_i\}_{i=1}^S$ are sequentially derived via fine-tuning on each $D_i$ in reverse chronological order ($i = S \to 1$), each time initializing from $\theta_{i+1}$ (with $\theta_{S+1} \equiv \theta_{\rm base}$). Importantly, Synaptic Intelligence (SI) regularization is applied to preserve parameter locality, modifying the fine-tuning objective:

$$\mathcal{L}_{\rm SI}(\theta) = \mathcal{L}_{\rm ERM}(\theta) + c_{\rm si}\sum_{j} \omega_j (\theta_j - \theta_{i+1, j})^2$$

where $\omega_j$ denotes the SI-assigned importance for parameter $j$, and $c_{\rm si}$ controls regularization strength. $K$ model checkpoints are sampled during this process and averaged to yield each $\theta_i$, mitigating stochastic effects.
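As a rough illustration of the objective above, the sketch below evaluates the SI-regularized loss for flattened weight vectors and averages a set of checkpoints into one expert. It assumes the importances $\omega_j$ are already available (the SI path-integral computation is not shown), and the function names `si_regularized_loss` and `average_checkpoints` as well as all numeric values are hypothetical.

```python
import numpy as np

def si_regularized_loss(erm_loss, theta, theta_anchor, omega, c_si):
    """ERM loss plus the SI quadratic penalty around the previous expert.

    theta, theta_anchor, omega: 1-D arrays of equal length (flattened weights).
    theta_anchor plays the role of theta_{i+1} in the text.
    """
    penalty = c_si * np.sum(omega * (theta - theta_anchor) ** 2)
    return erm_loss + penalty

def average_checkpoints(checkpoints):
    """Average K flattened checkpoints element-wise to form one expert."""
    return np.mean(np.stack(checkpoints, axis=0), axis=0)

# Toy values (hypothetical): three parameters, anchor = theta_{i+1}.
theta_anchor = np.array([1.0, 2.0, 3.0])
theta = np.array([1.5, 2.0, 2.0])
omega = np.array([2.0, 1.0, 0.5])  # per-parameter importance, taken as given
loss = si_regularized_loss(erm_loss=0.25, theta=theta,
                           theta_anchor=theta_anchor, omega=omega, c_si=0.1)
expert = average_checkpoints([theta, theta_anchor])
```

In an actual training loop the penalty would be added to the mini-batch loss before backpropagation; here it is evaluated once to make the arithmetic visible.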

2.2 Principal Components Analysis of Weight Trajectories

For adaptive combination, TEA models the “temporal trajectory” of expert weights. Each expert’s deviation is defined as $\delta\theta_i = \theta_i - \theta_{\rm base}$, and these are aggregated into a matrix $\Delta \in \mathbb{R}^{d \times S}$. Principal component analysis (PCA) is performed on $\Delta\Delta^\top$, yielding a set of principal vectors $\{v_p\}_{p=1}^P$. Each expert is projected into this $P$-dimensional subspace by $c_i^p = \langle \delta\theta_i, v_p \rangle$, forming a low-dimensional trajectory representation $\mathbf{c}_i$ for each expert.
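The projection step can be sketched with a thin SVD, since the left singular vectors of $\Delta$ are the eigenvectors of $\Delta\Delta^\top$. This is a minimal sketch under stated assumptions: weights are flattened into rows of a NumPy array, no mean-centering is applied (the text describes PCA on $\Delta\Delta^\top$ directly), and the function name `project_experts` and the random toy data are hypothetical.

```python
import numpy as np

def project_experts(thetas, theta_base, num_components):
    """Project expert deviations onto the top principal directions.

    thetas: array of shape (S, d), one flattened expert per row.
    Returns (V, C): V has shape (d, P) with orthonormal columns v_p,
    and C has shape (S, P) with C[i, p] = <delta_theta_i, v_p>.
    """
    delta = (thetas - theta_base).T  # Delta, shape (d, S)
    # Left singular vectors of Delta are the eigenvectors of Delta @ Delta.T,
    # so the d x d matrix never has to be formed explicitly.
    U, _, _ = np.linalg.svd(delta, full_matrices=False)
    V = U[:, :num_components]
    C = delta.T @ V
    return V, C

rng = np.random.default_rng(0)
theta_base = rng.normal(size=50)
thetas = theta_base + 0.1 * rng.normal(size=(4, 50))  # S = 4 toy experts
V, C = project_experts(thetas, theta_base, num_components=2)
```

Working with the thin SVD keeps the cost proportional to $d \cdot S^2$, which matters when $d$ is in the millions and $S$ is small.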

2.3 Adaptive Averaging via Temporal Forecasting

For each principal component, the sequence $\{(t_i, c_i^p)\}_{i=1}^S$ is treated as a univariate time series, and an ARIMA(1,1,1) model is used to forecast the future projection $\hat{c}^p(t_f)$. The forecasted vector $\mathbf{c}_f = (\hat{c}^1(t_f), \dots, \hat{c}^P(t_f))$ represents the anticipated configuration at time $t_f$. Distances $d_i = \|\mathbf{c}_i - \mathbf{c}_f\|_2$ are computed, and adaptive coefficients are defined by

$$\alpha_i = \frac{(d_{\max} - d_i)^r}{\sum_{j=1}^S (d_{\max} - d_j)^r}$$

where $d_{\max} = \max_i d_i$ and $r > 0$ is a sharpness parameter that interpolates between uniform and winner-takes-all weighting.
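The coefficient formula above can be worked through on a toy trajectory. The sketch below takes the forecast point as given (in TEA it would come from the per-component ARIMA forecasts) and uses only the standard library. The function name `adaptive_coefficients`, the coordinates, and the uniform fallback for the degenerate case where all distances are equal are assumptions made for illustration, not details from the source.

```python
import math

def adaptive_coefficients(expert_coords, forecast_coords, r):
    """Compute alpha_i from distances to the forecast point in PCA space."""
    dists = [math.dist(c, forecast_coords) for c in expert_coords]
    d_max = max(dists)
    scores = [(d_max - d) ** r for d in dists]
    total = sum(scores)
    if total == 0.0:
        # All experts are equidistant, so the formula reads 0/0; fall back
        # to uniform weights (an assumption, not stated in the source).
        return [1.0 / len(dists)] * len(dists)
    return [s / total for s in scores]

# Toy 2-D trajectory (hypothetical values); distances to the forecast are 7, 4, 1.
coords = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
forecast = (7.0, 0.0)
alphas = adaptive_coefficients(coords, forecast, r=1.0)        # [0, 1/3, 2/3]
alphas_sharp = adaptive_coefficients(coords, forecast, r=2.0)  # [0, 0.2, 0.8]
```

Two properties of the formula show up here: the expert farthest from the forecast always receives zero weight, and increasing $r$ shifts mass toward the nearest expert.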

2.4 Final Model Synthesis

The final TEA model is synthesized via weight-space averaging:

$$\theta_{\rm TEA} = \sum_{i=1}^S \alpha_i \theta_i$$

At inference time, no further adaptation is performed; $f(\cdot; \theta_{\rm TEA})$ is evaluated directly.
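The averaging step is a per-parameter convex combination. The following sketch represents each expert as a dictionary from parameter name to a flat list of floats, loosely mirroring a framework state dictionary; that representation, the function name `average_experts`, and the toy values are assumptions for illustration.

```python
def average_experts(expert_states, alphas):
    """Return sum_i alpha_i * theta_i, computed per named parameter.

    expert_states: list of dicts mapping parameter name -> list of floats.
    alphas: one coefficient per expert, expected to sum to 1.
    """
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must sum to 1")
    merged = {}
    for name in expert_states[0]:
        size = len(expert_states[0][name])
        merged[name] = [
            sum(a * state[name][k] for a, state in zip(alphas, expert_states))
            for k in range(size)
        ]
    return merged

experts = [
    {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]},
    {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]},
]
theta_tea = average_experts(experts, alphas=[0.25, 0.75])
```

Because the result is a single set of weights, inference cost matches that of one model regardless of how many experts were combined.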

3. Theoretical Foundations

TEA’s advantage is supported by a theoretical analysis decomposing ensembling error under smooth drift. A first-order Taylor expansion demonstrates that weight-space ensembling approximates function-space ensembling up to $O(\Delta^2)$, where $\Delta = \max_i \|\theta_i - \theta_{\rm TEA}\|$ (a scalar here, distinct from the deviation matrix $\Delta$ of Section 2.2).
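The first-order argument can be written out in two lines. The sketch below assumes that $f(x;\theta)$ is twice differentiable in $\theta$ and that the coefficients satisfy $\sum_i \alpha_i = 1$; it is a standard Taylor argument and not necessarily the exact form used in the reference.

```latex
% Assumptions: f is twice differentiable in \theta, and \sum_i \alpha_i = 1.
\begin{align*}
f(x;\theta_i) &= f(x;\theta_{\rm TEA})
  + \nabla_\theta f(x;\theta_{\rm TEA})^\top (\theta_i - \theta_{\rm TEA})
  + O(\|\theta_i - \theta_{\rm TEA}\|^2), \\
\sum_i \alpha_i f(x;\theta_i) &= f(x;\theta_{\rm TEA})
  + \nabla_\theta f(x;\theta_{\rm TEA})^\top
    \underbrace{\Bigl(\sum_i \alpha_i \theta_i - \theta_{\rm TEA}\Bigr)}_{=\,0}
  + O(\Delta^2).
\end{align*}
```

The linear term vanishes because $\theta_{\rm TEA}$ is by definition the $\alpha$-weighted mean of the experts, so the gap between the function-space ensemble and the weight-averaged model is second order in the expert spread.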

The error is further characterized via the bias–variance–covariance–locality (BVCL) decomposition:

$$\mathbb{E}_f\left[\left(\sum_i \alpha_i \mathrm{bias}_i\right)^2\right] + \sum_i \alpha_i^2 \mathrm{var}_i + \sum_{i \neq j} \alpha_i \alpha_j \mathrm{cov}_{i,j} + O(\overline{\Delta}^2)$$

Optimal generalization is achieved by balancing: (1) functional diversity among experts to minimize covariance, effected by domain specialization; (2) parameter proximity to ensure ensembling validity, achieved via SI regularization. The adaptively estimated $\alpha_i$ coefficients interpolate between variance minimization (which prefers uniform averaging) and bias minimization (which prefers the expert nearest to the predicted future configuration). Key analytic results formalize these trade-offs, including Lemma 2 (variance minimization by uniform weights) and Lemma 3 (worst-case bias minimization by assigning all weight to the best-forecasted expert), with the full BVCL decomposition and proofs provided in the appendix of the reference (Liu et al., 30 Sep 2025).
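To see why uniform weights favor the variance term, consider a simplified special case. The assumptions here (equal variances $\sigma^2$ and zero covariances) are chosen for illustration and may differ from the exact conditions of Lemma 2 in the reference.

```latex
% Illustrative special case (an assumption): var_i = \sigma^2 for all i, cov_{i,j} = 0.
\begin{equation*}
\sum_i \alpha_i^2 \,\mathrm{var}_i
  = \sigma^2 \sum_i \alpha_i^2
  \;\ge\; \sigma^2 \,\frac{\bigl(\sum_i \alpha_i\bigr)^2}{S}
  = \frac{\sigma^2}{S}
\end{equation*}
```

The inequality is Cauchy–Schwarz, with equality exactly when $\alpha_i = 1/S$ for every $i$. Concentrating all weight on one expert gives variance $\sigma^2$ instead, which is the price paid for the lower bias that a well-forecasted expert may offer.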

4. Empirical Performance and Benchmarks

TEA is systematically evaluated on seven TDG benchmarks spanning vision and text, various model architectures (ConvNet, CNN, DenseNet-121, ResNet-18/50, DistilBERT), and dataset scales (from $\sim$30K to 2M samples; models from 29K to 66M parameters). Benchmarks include Rotated MNIST, Yearbook, FMoW, CLEAR-10/100, HuffPost, and Arxiv paper titles.

Across all settings, TEA demonstrates consistent superiority over previous TDG and classifier-head–only baselines (GI ’21, LSSAE ’22, DRAIN ’23, EvoS ’24, W-Diff ’24). The mean out-of-distribution (OOD)-averaged accuracy rises to 75.76% (from 69.74% for DiWA and 71.87% for Mixup), with maximum observed relative gains up to 69% on individual tasks. TEA sustains this performance with only $\sim 1.33\times$ the GPU hours of ERM, notably outperforming GI/LSSAE (7–12$\times$ slower) and W-Diff (81$\times$ slower).

Critical ablation studies confirm that both the construction of temporally specialized, SI-regularized experts and the data-driven adaptive averaging mechanism are necessary. Ablations reveal that uniform averaging or adaptive averaging on base snapshots yields lower OOD-avg (74.6–74.7%) compared to full TEA (75.8%). Partial fine-tuning (using only selected domains) and limited memory scenarios (continual domain generalization with 10% buffer) preserve most of TEA’s effectiveness, and TEA remains competitive even under abrupt domain shifts.

5. Scalability, Resource Considerations, and Limitations

TEA is designed to minimize the computational cost relative to methods that require full-model weight prediction for future domains. The end-to-end process, comprising initial ERM plus $S$ SI-regularized fine-tuning passes, incurs an overall wall-clock cost of approximately $1.33\times$ that of ERM alone. The cost of the PCA and ARIMA components is negligible for $S \lesssim 50$ domains. The method is validated for models up to DistilBERT (66M parameters) and datasets up to millions of points.

The efficiency gains are coupled to the assumption of smooth temporal drift and the availability of clearly separated, discrete domains. When domains are shuffled to simulate abrupt shift, TEA's advantage narrows, though it remains on par with strong DG baselines. A plausible implication is that extensions would be required to address regimes characterized by sudden or adversarial domain transitions.

6. Extensions, Open Directions, and Implications

Several extensions are noted as promising avenues for further research. These include: (i) adapting TEA to continuous TDG regimes by employing continuous-time forecasting instead of ARIMA on discrete timestamps; (ii) employing alternative forecasting methods such as Gaussian Processes or vector autoregressive models in the PCA subspace; (iii) meta-learning the averaging coefficients $\alpha$ via a small neural network trained on validation data; and (iv) scaling TEA to billion-parameter LLMs to address temporal adaptation at that scale.

The analysis and empirical results collectively indicate that by structuring parameter-space diversity with locality-preserving SI and by adaptively forecasting parameter trajectories, TEA is able to generalize more effectively and efficiently to future domains under temporal distribution shift than previously proposed TDG approaches (Liu et al., 30 Sep 2025).
