Shifted Window Integrated Gradients
- The paper introduces SWING, enhancing traditional Integrated Gradients by using retrospective windows to capture temporal causal dependencies in time-series monitoring.
- SWING employs four piecewise-linear interpolation paths to keep the integration trajectory on the observed data manifold, thereby reducing out-of-distribution artifacts.
- Empirical evaluations across healthcare, finance, and activity recognition tasks demonstrate SWING’s improved attribution faithfulness, coherence, and computational feasibility compared to classical methods.
Shifted Window Integrated Gradients (SWING) is an extension of Integrated Gradients (IG) tailored to explain prediction changes in online time-series monitoring tasks. Unlike classical IG—which integrates gradients from a fixed “zero” or constant baseline to a given input—SWING defines its integration path using retrospective windows derived directly from the temporal context, thereby systematically capturing causal dependencies and mitigating out-of-distribution (OOD) artifacts. SWING is specifically designed to maintain the integration trajectory on the data manifold, thus producing more faithful attributions in dynamic temporal environments such as healthcare, finance, and activity recognition (Kim et al., 28 Nov 2025).
1. Motivations and Conceptual Foundations
The canonical IG method attributes a model’s prediction to its input features by integrating gradients along a straight path from a baseline $x'$ to the evaluated input $x$. In time-series monitoring settings, each input typically constitutes a window encompassing the most recent past observations. Applying IG directly with a zero or constant baseline in such settings induces two primary deficiencies: (1) the integration trajectory often strays far from the observed data manifold, resulting in OOD interpolations; and (2) it neglects the influence of preceding data windows, which are causally relevant for online prediction differences between time steps. As a consequence, classical IG fails to adequately explain why a prediction shifted between two time points, since both baseline selection and integration fail to reflect the true temporal evolution.
SWING addresses these deficiencies by replacing the zero baseline with a moving, retrospective window—specifically, the immediate historical sequence up to (but not including) the time step in question. This approach keeps the integration path within realistic, observed regions of the feature space and directly models the causal chain leading to the current prediction. Furthermore, SWING assembles its final attribution not from a single path but from a composite of four distinct, piecewise-linear paths reflecting all possible retrospective and forward transitions between two time indices.
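As a minimal illustration of this baseline choice (assuming a NumPy array `stream` of shape `(time, features)` and a window length `length`; the helper name and interface are hypothetical, not the paper's code), the retrospective baseline is simply the observed window shifted back by one step:

```python
import numpy as np

def swing_windows(stream: np.ndarray, t: int, length: int):
    """Return the current window x^(t) and its retrospective baseline x^(t-1).

    Both windows consist entirely of observed values, so interpolating between
    them stays close to the data manifold (unlike a zero baseline).
    """
    assert t >= length, "need at least one earlier observation for the baseline"
    current = stream[t - length + 1 : t + 1]   # observations t-length+1 .. t
    baseline = stream[t - length : t]          # same window, shifted back one step
    return current, baseline
```

A zero baseline, by contrast, would force the integration through all-zero windows that the model may never have encountered during training.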
2. Mathematical Formalism and Attribution Framework
Formally, SWING generalizes the IG path from a constant baseline to an arbitrary smooth curve $\gamma(\alpha)$, $\alpha \in [0,1]$, with $\gamma(0) = x'$ and $\gamma(1) = x$. The IG attribution to the $i$-th input coordinate along path $\gamma$ is:

$$\mathrm{IG}_i^{\gamma}(x) \;=\; \int_0^1 \frac{\partial F\big(\gamma(\alpha)\big)}{\partial \gamma_i}\,\frac{d\gamma_i(\alpha)}{d\alpha}\,d\alpha,$$

with completeness guaranteed by the Fundamental Theorem of Line Integrals:

$$\sum_i \mathrm{IG}_i^{\gamma}(x) \;=\; F\big(\gamma(1)\big) - F\big(\gamma(0)\big) \;=\; F(x) - F(x').$$
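A compact PyTorch sketch of this path integral, approximated with the trapezoid rule over $\alpha$ (the `model` is assumed to map a single window tensor to a scalar score; all names here are illustrative assumptions rather than the paper's reference implementation):

```python
import torch

def path_integrated_gradients(model, path_points, m=50):
    """Integrated Gradients along a piecewise-linear path through `path_points`
    (a list of window tensors), approximated with the trapezoid rule using
    m interpolation steps per linear segment."""
    attribution = torch.zeros_like(path_points[0])
    for start, end in zip(path_points[:-1], path_points[1:]):
        grads = []
        for a in torch.linspace(0.0, 1.0, m):
            x = (start + a * (end - start)).requires_grad_(True)  # point on the segment
            (grad,) = torch.autograd.grad(model(x), x)            # dF/dx at that point
            grads.append(grad)
        grads = torch.stack(grads)                                # shape (m, T, D)
        # Trapezoid rule: average adjacent gradient samples, then scale by the segment.
        mean_grad = (grads[:-1] + grads[1:]).sum(dim=0) / (2 * (m - 1))
        attribution = attribution + (end - start) * mean_grad
    return attribution
```

For classical straight-line IG this reduces to a single segment from the baseline to the input; SWING instead supplies paths built from observed windows.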
For two time windows $x^{(t)}$ and $x^{(t+\delta)}$, SWING employs the retrospective baseline windows $x^{(t-1)}$ and $x^{(t+\delta-1)}$. It defines four piecewise-linear paths $\gamma_1,\dots,\gamma_4$, one for each pairing of a baseline window with a target window, each interpolating through the sequence of observed windows between the relevant time indices. The SWING attribution for feature $i$ explaining the change from $t$ to $t+\delta$ is:

$$\mathrm{SWING}_i(t,\,t+\delta) \;=\; \tfrac{1}{2}\Big[\,\mathrm{IG}_i^{\gamma_1}\!\big(x^{(t-1)}\!\to x^{(t+\delta)}\big) + \mathrm{IG}_i^{\gamma_2}\!\big(x^{(t+\delta-1)}\!\to x^{(t+\delta)}\big) - \mathrm{IG}_i^{\gamma_3}\!\big(x^{(t-1)}\!\to x^{(t)}\big) - \mathrm{IG}_i^{\gamma_4}\!\big(x^{(t+\delta-1)}\!\to x^{(t)}\big)\Big],$$

where $x^{(a)} \to x^{(b)}$ denotes the piecewise-linear path through all observed windows from index $a$ to index $b$. This expression guarantees that the sum of attributions over all features equals $F\big(x^{(t+\delta)}\big) - F\big(x^{(t)}\big)$, thereby satisfying completeness and skew-symmetry.
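Summing the combination above over features and applying the line-integral identity to each of the four paths (a completeness check under the baseline–target pairing written here) gives

$$\begin{aligned}
\sum_i \mathrm{SWING}_i(t,\,t+\delta)
 &= \tfrac{1}{2}\Big[\big(F(x^{(t+\delta)}) - F(x^{(t-1)})\big) + \big(F(x^{(t+\delta)}) - F(x^{(t+\delta-1)})\big) \\
 &\qquad\;\; - \big(F(x^{(t)}) - F(x^{(t-1)})\big) - \big(F(x^{(t)}) - F(x^{(t+\delta-1)})\big)\Big] \\
 &= F\big(x^{(t+\delta)}\big) - F\big(x^{(t)}\big),
\end{aligned}$$

since the baseline terms cancel in pairs; exchanging $t \leftrightarrow t+\delta$ negates every term, which yields the skew-symmetry property.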
3. Algorithmic Implementation and Practical Considerations
The SWING algorithm consists of evaluating line integrals along the four defined piecewise-linear trajectories connecting the combinations of historical and current windows at $t$ and $t+\delta$. Windows are interpolated using segment-wise linear interpolation parameterized by $\alpha \in [0,1]$, partitioned into segments according to the local window indices. The integration is numerically approximated with the trapezoid rule, requiring $m$ samples per path (a modest number of steps typically suffices for stable results). Each path involves $m$ forward and backward model passes, giving a total computational complexity of $\mathcal{O}(4m)$ model evaluations per attribution query.
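A sketch of the full procedure, reusing the hypothetical `path_integrated_gradients` helper from Section 2 and the baseline–target pairing from the attribution formula (window indexing and function names are assumptions for illustration):

```python
def swing_attribution(model, windows, t, t_next, m=50):
    """SWING attribution for the prediction change between time indices t and
    t_next, combining four path integrals over observed windows.

    `windows[k]` is the observed input window x^(k) as a tensor; each path
    visits every observed window between its endpoints, so the integration
    trajectory never leaves the recorded data.
    Requires the `path_integrated_gradients` sketch defined earlier.
    """
    def observed_path(a, b):
        step = 1 if b >= a else -1
        return [windows[k] for k in range(a, b + step, step)]

    def ig(a, b):
        return path_integrated_gradients(model, observed_path(a, b), m)

    return 0.5 * (
        ig(t - 1, t_next) + ig(t_next - 1, t_next)   # paths ending at x^(t_next)
        - ig(t - 1, t) - ig(t_next - 1, t)           # paths ending at x^(t)
    )
```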
Practically, only two windows per jump are needed—those ending immediately before $t$ and $t+\delta$. This simplifies memory management in streaming applications. Empirical measurements indicate a runtime of roughly 0.35 seconds per sample on MIMIC-III with current high-performance computing resources.
4. Empirical Evaluation and Benchmarking
SWING was extensively evaluated across diverse real-world and synthetic time-series monitoring tasks, including:
- MIMIC-III decompensation prediction (binary mortality risk)
- PhysioNet 2019 sepsis detection (early warning)
- UCI activity recognition
- Synthetic tasks such as Delayed-Spike (causal delays) and Switch-Feature (state-dependent importance)
Performance was assessed using a suite of metrics targeting attribution faithfulness, sufficiency, and coherence:
- CPD, AUPD (Area under Point Drop): effect of removing the top-$k$ most salient points (a code sketch of this evaluation follows the list)
- CPP, AUPP: effect of removing least-salient points (lower is better)
- Macro-averaged MPD, AUMPD; MPP, AUMPP
- Correlation between attributed feature ordering and observed prediction differences
- Case-study coherence: alignment with domain knowledge
- Computational efficiency (runtime and memory)
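As referenced in the CPD/AUPD item above, a minimal sketch of the point-drop evaluation (masking the $k$ most-salient points with zeros and measuring the resulting prediction drop; the masking value and helper names are assumptions, not the benchmark's exact protocol):

```python
import torch

def point_drop(model, window, attribution, ks=(5, 10, 20, 50)):
    """Prediction drop after zero-masking the k most-salient (time, feature) points."""
    base = model(window).item()
    order = attribution.abs().flatten().argsort(descending=True)  # most salient first
    drops = []
    for k in ks:
        masked = window.clone().flatten()
        masked[order[:k]] = 0.0                                   # assumed masking value
        drops.append(base - model(masked.view_as(window)).item())
    return drops
```

Larger drops at small $k$ indicate more faithful attributions; the area under the drop curve gives the AUPD-style summary, and repeating the procedure with the least-salient points yields the CPP/AUPP counterpart.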
On MIMIC-III, using an LSTM backbone and the top-$k$ salient-point removal protocol, representative results were:
| Metric | SWING | IG | DeepLIFT | TIMING |
|---|---|---|---|---|
| AUPD (higher better) | 16.23 | 9.10 | 9.42 | 9.71 |
| AUPP (lower better) | 5.85 | 13.97 | – | 13.36 |
| Attribution-value correlation | 0.40 | 0.17 | – | 0.19 |
Comparable improvements were observed across other domains and backbone architectures (CNNs, Transformers). Qualitatively, SWING produces sharply localized heatmaps, with attributions concentrated on causal perturbations or clinically meaningful events such as sudden hypotensive episodes.
5. Use Cases, Strengths, and Limitations
SWING is particularly effective for tasks characterized by pronounced temporal causality—delayed effects or rolling-window dependencies in risk scores or event detection. Its advantage over baseline-anchored methods is most pronounced when common baselines (zero, constant, or global means) induce OOD effects or do not reflect the observed data trajectory.
Limitations center primarily on computational overhead: each attribution requires approximately four times the gradient evaluations of standard IG. Selection of integration step count and window offset is critical—insufficient samples or overly large window separation can compromise fidelity.
Outstanding challenges and ongoing research include:
- Adaptive sampling over $\alpha$, prioritizing regions with sharp gradient variations
- Learned baselines via auxiliary prediction networks for optimal reference selection
- Hybrid interpolation paths incorporating manifold learning techniques
- Higher-order quadrature or annealing for increased numerical precision
6. Theoretical Properties and Guarantees
SWING preserves core theoretical properties foundational to attribution methods:
- Completeness: The sum of attributions over all features equals the net prediction change between time windows.
- Implementation invariance: Attributions depend on the functional behavior of the model $F$, not its implementation.
- Skew-symmetry: SWING attributions strictly reflect the net increment or decrement in the output between specified temporal windows.
By integrating only via trajectories formed by observed window sequences, SWING maintains robust alignment with the data manifold and circumvents speculative extrapolations endemic to classical IG with inappropriate baselines.
7. Summary and Empirical Significance
Shifted Window Integrated Gradients offers a systematic, computationally feasible, and theoretically grounded solution to the problem of explaining online prediction changes in time-series monitoring. Across domains—clinical, industrial, or behavioral—SWING consistently demonstrates higher attribution faithfulness, sufficiency, and domain coherence than both classical IG and more recent, specialized temporal explainers. Its adoption is particularly beneficial wherever baseline selection and temporal causality are central to interpretability challenges (Kim et al., 28 Nov 2025).