
Temporal Integrated Gradients (TIG)

Updated 9 January 2026
  • Temporal Integrated Gradients (TIG) is a principled attribution method that generalizes Integrated Gradients to time-series data by quantifying feature contributions at each time step.
  • It incorporates temporality-aware integration paths and segment-wise masking to mitigate out-of-distribution artifacts and preserve local temporal dependencies.
  • Empirical evaluations using metrics such as CPD and CPP demonstrate TIG's enhanced fidelity and robustness over traditional explainability methods for sequential models.

Temporal Integrated Gradients (TIG) is an axiomatic attribution method for interpreting neural sequence models, generalizing the well-established Integrated Gradients (IG) technique to time-series inputs. TIG quantifies the contribution of each input feature at every time step by accumulating gradients along a path from a user-chosen baseline to the observed temporal instance. Recent advancements address the unique challenges of temporal data—including out-of-distribution (OOD) interpolation artifacts, loss of context-sensitive temporal dependencies, and inadequate evaluation metrics—by introducing temporality-aware integration paths, segment-wise masking, and domain-specific aggregation. State-of-the-art frameworks such as TIMING (Jang et al., 5 Jun 2025) and IGBO (Fouladi et al., 2 Jan 2026) preserve key IG axioms while offering enhanced fidelity and robustness for sequential model explainability.

1. Mathematical Formulation and Core Principles

Let $x \in \mathbb{R}^{T \times D}$ be a multivariate time series with $T$ time steps and $D$ features. A baseline $x' \in \mathbb{R}^{T \times D}$ anchors the input path. Define a sequence model $F: \mathbb{R}^{T \times D} \to \mathbb{R}^m$. Standard TIG follows the straight-line path $\gamma(\alpha) = x' + \alpha(x - x')$ for $\alpha \in [0, 1]$. The TIG attribution for location $(t, d)$ is:

$$\mathrm{TIG}_{t,d}(x) = (x_{t,d} - x'_{t,d}) \int_0^1 \frac{\partial F(\gamma(\alpha))}{\partial x_{t,d}}\, d\alpha$$

This line-integral formulation computes the accumulation of local gradients with respect to each feature at every time point as the input transitions from baseline to its observed value. TIG generalizes IG from static inputs to time-indexed, high-dimensional input spaces.
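As a minimal worked example (not drawn from the cited papers), consider a linear model $F(x) = \sum_{t,d} w_{t,d}\, x_{t,d}$. The gradient $\partial F / \partial x_{t,d} = w_{t,d}$ is constant along the path, so the integral collapses to

$$\mathrm{TIG}_{t,d}(x) = (x_{t,d} - x'_{t,d})\, w_{t,d},$$

and summing over all $(t, d)$ recovers $F(x) - F(x')$ exactly, illustrating the completeness property discussed in Section 3.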

Generalizations to arbitrary domains $T(\cdot)$ (e.g., frequency, ICA) are possible:

$$\mathrm{TIG}_i^{\mathcal{D}_T}(x) = (z_i - z'_i) \int_0^1 \frac{\partial g(\gamma(\alpha))}{\partial z_i}\, d\alpha$$

where $z = T(x)$ and $g(z) = F(T^{-1}(z))$ (Kechris et al., 19 May 2025).
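A minimal PyTorch sketch of this domain-transformed variant, using a random orthogonal matrix as a stand-in for an invertible, differentiable transform such as a DFT or ICA unmixing (the transform, model, and shapes are illustrative assumptions, not the cited implementation):

import torch

T, D = 16, 4
Q, _ = torch.linalg.qr(torch.randn(T, T))      # orthogonal, so Q.T inverts it
transform = lambda x: Q @ x                    # z = T(x)
inverse = lambda z: Q.T @ z                    # x = T^{-1}(z)
model = torch.nn.Sequential(torch.nn.Flatten(0), torch.nn.Linear(T * D, 1))
g = lambda z: model(inverse(z))                # g(z) = F(T^{-1}(z))

x, x_base = torch.randn(T, D), torch.zeros(T, D)
z, z_base = transform(x), transform(x_base)
grads, n_steps = torch.zeros(T, D), 64
for alpha in torch.linspace(0.0, 1.0, n_steps):
    z_alpha = (z_base + alpha * (z - z_base)).detach().requires_grad_(True)
    grads += torch.autograd.grad(g(z_alpha).sum(), z_alpha)[0]
attr = (z - z_base) * grads / n_steps          # attributions in the transformed domain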

2. Temporality-Aware Path and Masking

Conventional IG's linear interpolation can traverse OOD regions never encountered during model training, especially in temporally dependent, nonlinear time series. The TIMING framework remedies this by introducing temporality-respecting masks:

  • Let $M \in \{0,1\}^{T \times D}$ denote a binary mask selecting $n$ contiguous segments of arbitrary length $s \in [s_{\min}, s_{\max}]$.
  • The masked baseline is $\tilde{x}(M) = (1 - M) \odot x$.
  • The path is $z(\alpha; M) = \alpha (M \odot x) + (1 - M) \odot x$.
  • MaskingIG for $(t, d)$ is:

$$\mathrm{MaskingIG}_{t,d}(x, M) = x_{t,d}\, M_{t,d} \int_0^1 \frac{\partial F(z(\alpha; M))}{\partial x_{t,d}}\, d\alpha$$

  • TIMING computes the conditional expectation over masks:

$$\mathrm{TIMING}_{t,d}(x; n, s_{\min}, s_{\max}) = \mathbb{E}_{M \sim G(n, s_{\min}, s_{\max})}\left[\, \mathrm{MaskingIG}_{t,d}(x, M) \mid M_{t,d} = 1 \,\right]$$

This approach enforces in-distribution sampling and preserves temporally local relationships, outperforming pointwise masking (as in RandIG).
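A sketch of the resulting Monte Carlo estimator in PyTorch (hyperparameters, helper names, and the choice to mask whole time segments across all features are illustrative assumptions, not TIMING's reference code):

import torch

def sample_mask(T, D, n_segments, s_min, s_max):
    # Binary mask marking n contiguous time segments (all features within each).
    M = torch.zeros(T, D)
    for _ in range(n_segments):
        s = torch.randint(s_min, s_max + 1, (1,)).item()
        t0 = torch.randint(0, max(T - s, 1), (1,)).item()
        M[t0:t0 + s] = 1.0
    return M

def timing_attr(model, x, n_masks=32, n_segments=3, s_min=2, s_max=8, n_steps=20):
    T, D = x.shape
    attr_sum, hit_count = torch.zeros(T, D), torch.zeros(T, D)
    for _ in range(n_masks):
        M = sample_mask(T, D, n_segments, s_min, s_max)
        grads = torch.zeros(T, D)
        for alpha in torch.linspace(0.0, 1.0, n_steps):
            # Path z(alpha; M): masked coordinates move from 0 to x, others stay at x.
            x_alpha = (alpha * M * x + (1 - M) * x).detach().requires_grad_(True)
            grads += torch.autograd.grad(model(x_alpha).sum(), x_alpha)[0]
        attr_sum += x * M * grads / n_steps   # MaskingIG for this mask
        hit_count += M
    # Conditional expectation: average each (t, d) only over masks with M[t,d] = 1.
    return attr_sum / hit_count.clamp(min=1.0)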

3. Axiomatic Guarantees and Limitations

TIG preserves IG’s canonical axioms given certain constraints:

  • Sensitivity: If only $(t, d)$ differs between $x$ and $x'$, and $F(x) \neq F(x')$, then $\mathrm{TIMING}_{t,d}(x) \neq 0$ (the $M_{t,d} = 1$ mask reduces to IG at that coordinate).
  • Implementation Invariance: Attributions depend only on the model's functional gradient along the chosen path; thus, functionally equivalent models yield identical TIG values.
  • Completeness: $\sum_{t,d} \mathrm{TIG}_{t,d}(x) = F(x) - F(x')$ holds for the straight-line path but not for aggregation over multiple masking contexts (a conscious trade-off in TIMING).

Nonlinear, adaptive, or data-manifold-aware paths have been proposed to further remedy OOD artifacts (Fouladi et al., 2 Jan 2026). The IGBO framework uses a learnable Oracle $G$ producing anchor sequences $\{p_i\}$, constructing piecewise-linear, validity-constrained paths to maintain gradient stability. Completeness and sensitivity persist in these generalizations, while practical implementations lose strict completeness for richer baseline diversity.
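The piecewise-linear construction can be abstracted as standard IG run segment by segment between consecutive anchors, with per-segment attributions summed. A minimal PyTorch sketch under that reading (the oracle producing the anchors is elided, and all names are illustrative rather than IGBO's reference code):

import torch

def segment_ig(model, a, b, n_steps=20):
    # Standard IG along the straight segment from point a to point b.
    grads = torch.zeros_like(a)
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        p = (a + alpha * (b - a)).detach().requires_grad_(True)
        grads += torch.autograd.grad(model(p).sum(), p)[0]
    return (b - a) * grads / n_steps

def piecewise_ig(model, x, anchors):
    # anchors = [baseline, p_1, ..., p_k]; the path ends at the input x.
    points = anchors + [x]
    attr = torch.zeros_like(x)
    for a, b in zip(points[:-1], points[1:]):
        attr += segment_ig(model, a, b)
    # For exact integration the per-segment sums telescope to F(x) - F(baseline).
    return attr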

4. Evaluation Metrics: CPD and CPP

Traditional simultaneous removal benchmarks are confounded by sign cancellation, where positive and negative attributions mask each other's global impact. TIMING introduces cumulative metrics to disambiguate feature ordering:

  • Cumulative Prediction Difference (CPD):

$$\mathrm{CPD}(x) = \sum_{k=0}^{K-1} \left\| F(x^{\uparrow}_k) - F(x^{\uparrow}_{k+1}) \right\|_1$$

where $x^{\uparrow}_k$ denotes the input after masking the $k$ most highly attributed points. CPD accumulates absolute prediction change for highly ranked locations.

  • Cumulative Prediction Preservation (CPP):

$$\mathrm{CPP}(x) = \sum_{k=0}^{K-1} \left\| F(x^{\downarrow}_k) - F(x^{\downarrow}_{k+1}) \right\|_1$$

where $x^{\downarrow}_k$ denotes the input after masking the $k$ least important points. A small CPP indicates robust retention of prediction accuracy under removal of low-ranked features.

These metrics systematically reward attribution methods that correctly rank both positively and negatively impactful locations, and they expose methods with sign or mask biases (Jang et al., 5 Jun 2025).
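Both metrics reduce to the same cumulative loop over a ranking. A sketch assuming zero-masking of selected points and an attribution map attr shaped like the input (the function name and the zero-masking choice are illustrative assumptions):

import torch

def cumulative_metric(model, x, attr, K, descending=True):
    # descending=True masks the most-attributed points first (CPD);
    # descending=False masks the least-attributed points first (CPP).
    order = attr.flatten().argsort(descending=descending)
    x_prev, total = x.clone(), 0.0
    with torch.no_grad():
        for k in range(K):
            x_next = x_prev.clone()
            x_next.view(-1)[order[k]] = 0.0   # mask the next-ranked point
            total += (model(x_prev) - model(x_next)).abs().sum().item()  # L1 change
            x_prev = x_next
    return total

# cpd = cumulative_metric(model, x, attr, K=100, descending=True)
# cpp = cumulative_metric(model, x, attr, K=100, descending=False)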

5. Implementation Details and Practical Optimization

An example TIG implementation in PyTorch, abstracted from TIMING and IGBO:

import torch

def tig(model, x, baseline, n_steps=50):
    # Riemann-sum approximation of the path integral from baseline to input.
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        x_alpha = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        grads += torch.autograd.grad(model(x_alpha).sum(), x_alpha)[0]
    return (x - baseline) * grads / n_steps

TIMING and IGBO further optimize by:

  • Sampling masks or anchor paths in Monte Carlo fashion.
  • Batching segment masks.
  • Parallelizing integration over $\alpha$ (see the batched sketch after this list).
  • Employing GAN-discriminators and RNN oracles to ensure generated anchors remain on the data manifold (Fouladi et al., 2 Jan 2026).
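A minimal sketch of the batched-$\alpha$ optimization, assuming the model accepts a leading batch dimension (illustrative, not the papers' reference code):

import torch

def tig_batched(model, x, baseline, n_steps=50):
    # All interpolation points are evaluated in one forward/backward pass.
    alphas = torch.linspace(0.0, 1.0, n_steps).view(-1, 1, 1)         # (N, 1, 1)
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)  # (N, T, D)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)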

The practical guidance is to select $K$ anchor points for path fidelity, $M$ integration steps for numerical resolution, and mask parameters $(n, s_{\min}, s_{\max})$ for temporal segment granularity. TIMING computes attributions in approximately 0.04 seconds per sample, comparable to vanilla IG and faster than retraining-based XAI methods (Jang et al., 5 Jun 2025).

6. Empirical Results and Domain Applications

Comprehensive benchmarking compares TIG and TIMING against FO, AFO, IG, GradSHAP, DeepLIFT, LIME, FIT, WinIT, Dynamask, Extrmask, ContraLSP, TimeX, TimeX++ on real-world datasets (e.g., MIMIC-III mortality, PAM, Boiler, Epilepsy, Wafer, Freezer):

Dataset   | CPD (IG) | CPD (TIMING) | Relative Gain
MIMIC-III | 0.342    | 0.366        | +7%
PAM       | ---      | ---          | +5%
Boiler    | ---      | ---          | +110%
Epilepsy  | ---      | ---          | +11%
Wafer     | ---      | ---          | +35%
Freezer   | ---      | ---          | +1%

TIMING outperforms IG and leading baselines under CPD, with robust CPP behavior. On synthetic Switch-Feature and State datasets, TIMING matches IG on true-map metrics (AUP, AUR), but achieves superior CPD (Jang et al., 5 Jun 2025). Qualitative case studies (MIMIC-III) demonstrate medically coherent signed attributions for risk and protective factors.

IGBO evaluates DAG Satisfaction Rate (DSR), accuracy trade-offs, OOD consistency, and variance reduction. TIG plus Oracle achieves >80% DSR with under 5% accuracy loss and significant variance reduction compared to linear baselines (Fouladi et al., 2 Jan 2026).

7. Limitations, Open Issues, and Prospective Directions

TIG suffers from several notable limitations:

  • Completeness may be lost under aggregation of masks or nonlinear baselines.
  • OOD sensitivity in high-dimensional time series can destabilize gradient estimates; manifold-aware path construction mitigates but does not eliminate this concern.
  • CPD and CPP metrics use $L_1$ distance on class probabilities; extension to structured outputs or alternative distance measures remains unexplored.
  • Mask parameter tuning and baseline selection are nontrivial and domain-dependent.
  • Adaptive, data-driven segment generators, higher-dimensional baselines, and hybrid completeness-restoring strategies are suggested avenues for further investigation (Jang et al., 5 Jun 2025, Fouladi et al., 2 Jan 2026).

Cross-domain TIG (e.g., Fourier or ICA domains) generalizes interpretability across semantically meaningful transforms, requiring an invertible, differentiable $T$ and inheriting all classical IG caveats such as gradient saturation and integration discretization error (Kechris et al., 19 May 2025).

In summary, TIG and its temporality-aware extensions constitute a principled, differentiable attribution framework for time-series neural networks, with demonstrated empirical, theoretical, and computational advantages over prior post-hoc explainability methods.
