- The paper introduces a novel provenance-based gradient guidance approach that directs model attention to task-relevant regions, reducing spurious correlations.
- It leverages synthetic data generation methods like image mixing and generative editing, using provenance masks to enhance localization and improve classification accuracy.
- Experimental results demonstrate improved robustness, faster convergence, and mitigated effects of background artifacts across various tasks.
Introduction
The use of synthetic data for regularizing and improving the generalization of deep neural networks (DNNs) is widespread, particularly in computer vision and related domains. However, existing synthetic learning approaches often improve robustness only indirectly, by diversifying data distributions, without explicitly informing models which regions of the input are relevant for the task. "Learning from Synthetic Data via Provenance-Based Input Gradient Guidance" (2604.02946) addresses this with a principled framework that leverages provenance information (annotation-free labels specifying the true origin of each input region) available during data synthesis. The proposed method uses these provenance signals to guide input gradients, constraining class predictions to depend on truly relevant regions and thereby mitigating the spurious correlations induced by synthetic artifacts or contextual biases.
The framework generalizes across various synthetic data generation strategies (e.g., mixing, editing) and applies to both vision and spatio-temporal action localization tasks. The paper presents a suite of experiments demonstrating that provenance-guided gradient supervision systematically improves discriminative performance, robustness, and convergence efficiency.
Methodology
The framework consists of three components: (1) Data Synthesis, (2) Provenance Extraction, and (3) Input Gradient Guidance.
Data Synthesis and Provenance Extraction
Synthetic data generation is accomplished via three paradigms, each of which yields a provenance mask as a byproduct of synthesis:
- Image mixing (e.g., CutMix), where the mask records which pixels originate from each source image.
- Video mixing (e.g., BatchMix), extending the same idea to spatio-temporal clips.
- Generative image editing (e.g., ALIA), where the mask separates unedited (target) regions from edited (non-target) regions.
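For mix-based synthesis, the provenance mask falls out of the mixing operation itself. Below is a minimal PyTorch sketch of CutMix-style mixing that also returns the mask; the function name and the convention that 1 marks pixels from the first source are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cutmix_with_provenance(x_a, x_b, lam):
    """Paste a random box from x_b into x_a and return the mixed batch,
    a provenance mask (1 = pixel from x_a, 0 = pixel from x_b), and the
    area-adjusted mixing weight for soft labels."""
    _, _, h, w = x_a.shape
    cut_ratio = (1.0 - lam) ** 0.5                  # box side ratio
    ch, cw = int(h * cut_ratio), int(w * cut_ratio)
    cy = torch.randint(h, (1,)).item()              # box center
    cx = torch.randint(w, (1,)).item()
    y1, y2 = max(cy - ch // 2, 0), min(cy + ch // 2, h)
    x1, x2 = max(cx - cw // 2, 0), min(cx + cw // 2, w)

    mixed = x_a.clone()
    mixed[:, :, y1:y2, x1:x2] = x_b[:, :, y1:y2, x1:x2]

    mask = torch.ones(1, 1, h, w, device=x_a.device)
    mask[:, :, y1:y2, x1:x2] = 0.0                  # provenance of the pasted box

    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)  # effective weight of x_a's class
    return mixed, mask, lam_adj
```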
Input Gradient Guidance
Rather than relying solely on indirect regularization via data augmentation, the method introduces a provenance loss, $\mathcal{L}_{\text{PG}}$, which penalizes the input gradients of class logits in regions that, according to the provenance mask, should not contribute to the class score. Specifically:
- For mix-based synthetic data, the gradient of each class logit is explicitly suppressed outside the mask corresponding to that class.
- For hard-label synthetic data (as in generative editing), the class logit's input gradient is suppressed over any non-target (altered) region.
The total loss is:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \alpha\,\mathcal{L}_{\text{PG}}$
where $\mathcal{L}_{\text{cls}}$ is the (soft or hard) cross-entropy classification loss and $\alpha$ balances the regularization strength.
This approach forces models to localize their evidence on semantically meaningful regions, increasing robustness to distribution shifts (e.g., spurious context cues).
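A minimal PyTorch sketch of the hard-label variant is given below, assuming a binary mask with 1 on target regions; the squared-gradient penalty and all names are assumptions for illustration, as the paper may use a different norm or normalization. For mix-based data, the same penalty would be applied per class logit, each with its own mask.

```python
import torch
import torch.nn.functional as F

def provenance_grad_loss(model, x, y, mask):
    """Penalize the input gradient of each sample's target-class logit
    outside its provenance mask (mask = 1 on regions that should
    support the class score)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                                  # (B, C)
    target = logits.gather(1, y[:, None]).sum()
    # create_graph=True makes the penalty itself differentiable, at the
    # cost of second-order backprop (the memory overhead noted in the
    # paper's ablations)
    (grads,) = torch.autograd.grad(target, x, create_graph=True)
    off_target = grads * (1.0 - mask)                  # gradient mass outside mask
    return (off_target ** 2).mean(), logits

def total_loss(model, x, y, mask, alpha=1.0):
    """L_total = L_cls + alpha * L_PG (hard-label case)."""
    l_pg, logits = provenance_grad_loss(model, x, y, mask)
    return F.cross_entropy(logits, y) + alpha * l_pg
```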
Experimental Results
The evaluation assesses localization and classification performance, as well as robustness to domain shift and spurious correlation. Datasets include CUB-200-2011 for fine-grained bird recognition and localization, iWildCam for domain shift, Waterbirds for background-class correlation, and UCF101-24 for spatio-temporal action localization.
Weakly Supervised Object Localization
Incorporating provenance-guided input gradient supervision into CutMix training across several backbones (VGG16, DeiT-S, SAT) yields a consistent boost in localization accuracy. For instance, CutMix+CAM on CUB improves from 62.3% to 65.1% mean MaxBoxAccV2, and the provenance loss lifts SAT from 91.5% to 92.1% mean accuracy.


Figure 2: Guided Grad-CAM visualizations (left) and predicted BBoxes overlaid on ground truth (middle) on CutMix-synthesized CUB images; localization accuracy as a function of α (right).
The qualitative results demonstrate increased focus of class activation maps on the actual target object, with reduced reliance on spurious background correlations. These trends are robust to variations in the α coefficient.
Weakly Supervised Spatio-Temporal Action Localization
For SKP with BatchMix augmentation on UCF101-24, provenance-based guidance boosts AP from 38.0% to 39.7%. This highlights the method's utility beyond static images, extending to spatio-temporal domains where action cues may be confounded by unrelated trajectories.
Image Classification under Distribution Shifts
When added to ALIA (image editing-based augmentation), provenance-based guidance further improves accuracy: CUB (71.7%→72.0%), iWildCam (83.5%→84.4%), and Waterbirds (71.4%→80.7%). The improvement on Waterbirds (9.3 pp) signals strong mitigation of background-based spurious correlation, a critical concern for robust model deployment.

Figure 3: Visualization of image editing synthesis steps and corresponding provenance masks deriving unedited (target) versus edited (non-target) regions.
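As the paper's future-work discussion suggests, one simple way to extract such a mask is to threshold a difference image between the original and its edited version. The sketch below illustrates that idea; the threshold value and smoothing step are assumptions, not the authors' exact extraction procedure.

```python
import torch
import torch.nn.functional as F

def provenance_from_diff(original, edited, thresh=0.1, blur=5):
    """Return a mask with 1 on unedited (target) regions and 0 on
    edited (non-target) regions, based on per-pixel differences."""
    diff = (edited - original).abs().mean(dim=1, keepdim=True)    # (B,1,H,W)
    diff = F.avg_pool2d(diff, blur, stride=1, padding=blur // 2)  # denoise
    return (diff < thresh).float()
```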
Ablations: Mask Quality and Regularization
- Mask dependence: Replacing the provenance mask with a random or full (uniform) mask causes accuracy to drop (see the controls in Tab. R1), confirming that the gains stem from semantically meaningful provenance rather than mere gradient suppression.
- Mask robustness: Modest dilation/erosion (±30%) of mask regions does not notably harm performance; the method tolerates some imprecision in mask extraction, especially for generative editing-based provenance (a sketch of this perturbation follows this list).
- Training Efficiency: The approach reduces epochs-to-convergence for both mixing-based (VGG, DeiT-S) and image editing-based setups—sometimes by over 3×—without increasing tuning complexity.
- Computational overhead: Extra memory (due to second-order differentiation for input gradients) is offset by fewer required epochs for convergence.
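The dilation/erosion perturbation referenced above can be reproduced with simple morphological operations; a sketch via max pooling is shown below (the kernel size is an arbitrary illustrative choice, not the paper's protocol).

```python
import torch.nn.functional as F

def dilate(mask, k=7):
    """Grow the 1-regions of a binary mask by roughly k//2 pixels."""
    return F.max_pool2d(mask, k, stride=1, padding=k // 2)

def erode(mask, k=7):
    """Shrink the 1-regions by dilating the complement."""
    return 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=k // 2)
```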
Theoretical and Practical Implications
This work makes a concrete case for leveraging annotation-free, synthesis-side provenance signals as auxiliary supervision in synthetic data pipelines. Its modularity ensures applicability to any modality and synthetic learning strategy that provides region-level provenance. Explicitly suppressing input gradient signal in non-target regions yields models that are better aligned with the task, less reliant on context- or artifact-based shortcuts, and more robust under domain shift.
A core implication is that the provenance-based regularization localizes the model's functional support and promotes interpretability, as visualized through activation maps and sensitivity heatmaps.
Given the reliance on provenance masks, future work should examine further avenues for provenance extraction, including unsupervised or model-attention-based mask generation, especially in generative augmentation pipelines where difference images may be insufficient. The paradigm also invites integration into more fine-grained multi-modal or continuous-provenance tasks, as well as scaling studies on large foundation models.
Conclusion
Provenance-based input gradient guidance transforms synthetic learning from indirect regularization into direct supervision, focusing model attention on the task-relevant input regions identified by the synthesis process. The framework generalizes across image and spatio-temporal domains, different architectures, and multiple synthesis strategies. The result is enhanced robustness to spurious correlations, improved convergence efficiency, and superior localization and classification metrics, without significant hyperparameter or annotation cost. Taken together, the results make a strong case for provenance-aware regularization as a standard component of synthetic data-driven training regimens.