Idis: Images with Distractors in Vision Research

Updated 3 December 2025
  • Images with Distractors (Idis) is a framework that systematically perturbs images with irrelevant visual elements to evaluate robustness in vision and reinforcement learning tasks.
  • Controlled experiments use distractors of varying semantic, numerical, spatial, and dynamic types to isolate task-relevant features in datasets.
  • Recent studies focus on formalizing distractor generation and mitigation strategies to reduce spurious correlations and enhance model performance.

Images with Distractors (Idis) denotes a key experimental and benchmarking paradigm in visual cognition, computer vision, and reinforcement learning, in which images are systematically constructed or perturbed to include distractor content: visual elements irrelevant to the task that can interfere with human or machine interpretation. The Idis concept underpins synthetic and real-world datasets, model evaluation protocols, and robust learning frameworks across visual recognition, visual question answering, attention modeling, change detection, and control. Recent research formalizes distractor generation, measurement, and mitigation, enabling controlled exploration of spurious correlations, bias, and robustness phenomena in contemporary vision and vision-language systems.

1. Taxonomy and Formal Definition of Distractors

Distractors in the Idis framework are image elements extraneous to the intended target or task-relevant signal. They may be objects, textures, lighting, or pose changes and are classified according to semantic, numerical, spatial, temporal, or low-level feature axes.

  • Semantic distractors: Object instances with varying relationship to the target (e.g., positively aligned/spuriously correlated, conflicting/negatively correlated, or truly irrelevant). In the Idis VQA dataset, semantic distraction is controlled by set membership: aligned (spuriously correlated), conflicting (antagonistic), and irrelevant (uncorrelated) (Bae et al., 26 Nov 2025).
  • Numerical distractors: Varying the count of distractors present, e.g., from $n=0$ to $n=4$ additional objects in a synthetic scene (Bae et al., 26 Nov 2025).
  • Spatial distractors: Manipulating distractor size/location (e.g., occupying 25–45% of image width) to assess the relative salience of competing image regions (Bae et al., 26 Nov 2025).
  • Low-level feature distractors: Differences in orientation, color, brightness, or geometric attributes (e.g., orientation search, color pop-out) in synthetic Idis patterns for saliency benchmarking (Berga et al., 2019).
  • Dynamic distractors: Transient (moving) objects not present in all views (e.g., people, vehicles, shadows) are especially critical in 3D vision and RL (Sabour et al., 2023, Sabour et al., 28 Jun 2024).

Formal construction: Given a clean base image $I_0$ and a specification of distractor parameters $(s, n, \alpha)$, a deterministic editing procedure produces $I^{(s, n, \alpha)} = \mathrm{Edit}(I_0, \{\text{distractors}\}, \alpha)$, ensuring tight experimental control (Bae et al., 26 Nov 2025).
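
A minimal sketch of such a deterministic editing procedure, assuming a pre-curated pool of RGBA object cutouts keyed by semantic set; the helper names and pool layout are illustrative, not the dataset's actual construction code:

```python
import random
from PIL import Image

def edit(base_img: Image.Image, pool: dict, s: str, n: int, alpha: float,
         seed: int = 0) -> Image.Image:
    """Paste n distractors from semantic set s onto a copy of base_img.

    s     : semantic relation of the distractors ("aligned", "conflicting", "irrelevant")
    n     : number of distractors to paste (e.g., 0..4)
    alpha : distractor width as a fraction of image width (e.g., 0.25..0.45)
    """
    rng = random.Random(seed)                    # fixed seed -> deterministic edit
    img = base_img.copy()
    W, H = img.size
    w = int(alpha * W)                           # distractor size from the spatial parameter
    for _ in range(n):
        cutout = rng.choice(pool[s])             # RGBA cutout from the curated pool
        d = cutout.resize((w, int(w * cutout.height / cutout.width)))
        x = rng.randint(0, max(0, W - d.width))  # algorithmically chosen location
        y = rng.randint(0, max(0, H - d.height))
        img.paste(d, (x, y), d)                  # alpha-composite using the cutout's mask
    return img

# Usage: one (s, n, alpha) cell of the factorial design
# edited = edit(Image.open("base.jpg"), pool, s="conflicting", n=3, alpha=0.35)
```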

2. Benchmarks and Dataset Construction

Several canonical datasets operationalize the Idis paradigm, with explicit distractor injection and complete annotation for task-relevant versus irrelevant content.

  • Idis for Vision-Language Reasoning: 64,800 images constructed from ImageNet-9 cores, systematically augmented with up to four distractor objects per image, controlling semantic alignment, count, and size. Each sample $(I, q, y)$ comprises the edited image, the fixed question "Which category best describes the main object in the image?", and a ground-truth label (Bae et al., 26 Nov 2025).
  • SID4VAM / Idis for Saliency: 230 synthetic images (1280x1024 px), each containing a target singleton and ~320 distractors laid out on a grid, spanning 15 feature categories with 7 defined contrast levels. Ground-truth masks precisely delineate salient target regions (Berga et al., 2019).
  • Distracting Control Suite: RL environments extending DM Control with distractors along three axes: camera pose ($\beta_{\rm cam}$), object color ($\beta_{\rm rgb}$), and background ($\beta_{\rm bg}$). Difficulty is continuously parameterized, and distractors may remain static or follow temporally varying processes; see the configuration sketch after this list (Stone et al., 2021).
  • 3D View Synthesis / NeRF/GS: Benchmarks such as the RobustNeRF and SpotlessSplats scenes include synthetic or real captures with dynamic distractors, e.g., moving pedestrians, vehicles, or environmental changes, with clean and distractor-injected splits for controlled ablation (Sabour et al., 2023, Sabour et al., 28 Jun 2024).
  • SIDOD: 144k stereo pairs with randomized lighting, object/distractor count, and camera pose; distractors are "flying" objects generated to challenge object recognition and pose estimation tools (Jalal et al., 2020).
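
The continuous difficulty parameterization can be pictured as a small per-axis configuration object; a hypothetical sketch (field names are illustrative, not the suite's actual API):

```python
from dataclasses import dataclass

@dataclass
class DistractorConfig:
    """Illustrative per-axis difficulty scales for a distracting-control benchmark.

    Each scale in [0, 1] controls how far the corresponding factor may drift from
    the clean environment; 0 disables that distractor axis entirely.
    """
    beta_cam: float = 0.0   # camera-pose perturbation scale
    beta_rgb: float = 0.0   # body/object color perturbation scale
    beta_bg: float = 0.0    # background replacement / video intensity
    dynamic: bool = False   # if True, distractors follow a temporal process

# Example: a moderately hard, temporally varying configuration
cfg = DistractorConfig(beta_cam=0.2, beta_rgb=0.3, beta_bg=1.0, dynamic=True)
```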

Table: Characteristics of Representative Idis Datasets

| Dataset | Distractor Types | Task Modality |
|---|---|---|
| Idis VQA | Semantic, count, spatial | VQA (classification) |
| SID4VAM | Low-level feature | Visual attention |
| Distracting Control Suite | Color, pose, background | Vision-based RL |
| RobustNeRF / SpotlessSplats | Dynamic scene elements | 3D novel view synthesis |

3. Experimental Protocols and Metrics

Evaluation on Idis datasets is meticulously structured to isolate the effects of distractors on model accuracy, reasoning, or attentional deployment.

  • Accuracy and Robustness: Measured as the fraction of correct predictions over distractor configurations; used to quantify degradation as a function of distractor number and type (Bae et al., 26 Nov 2025, Stone et al., 2021).
  • Reasoning Length and Attribute Analysis: For reasoning VLMs, the chain-of-thought token count ($\mathrm{RL}$) and the distribution of attribute mentions referring to the target vs. the distractors are tracked. The distractor-attribute ratio $r_{\mathrm{attr}}$ is highly predictive of accuracy: above 80% when $r_{\mathrm{attr}} < 20\%$ and near zero when $r_{\mathrm{attr}} > 50\%$; see the sketch after this list (Bae et al., 26 Nov 2025).
  • Saliency/AOI Metrics: Saliency performance is measured via AUC (Judd), NSS, CC, SIM, KL divergence, and the Saliency Index (SI), with an emphasis on how closely model outputs coincide with human-labeled (or synthetic ground-truth) targets in the presence of structured distractors (Berga et al., 2019).
  • Policy Robustness (RL): RL agents' cumulative reward is tracked as a function of distractor difficulty, under static and dynamic regimes, assorted distractor combinations, and train/test domain shifts (Stone et al., 2021).
  • PSNR/SSIM/LPIPS (3D Synthesis): Photometric and perceptual similarity scores evaluate reconstruction fidelity with and without distractors masked (Sabour et al., 2023, Ungermann et al., 21 Aug 2024, Sabour et al., 28 Jun 2024).
  • Adversarial Degradation (MC-VQA): For generated distractors in multiple-choice VQA, success is defined by the extent to which candidate distractors "fool" pretrained models, i.e., increase their error rate on perturbed question sets (Lu et al., 2019).
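
A minimal sketch of how a distractor-attribute ratio of this kind could be computed from a reasoning trace, assuming attribute mentions have already been mapped to target and distractor vocabularies (the token-level matcher is a stand-in for whatever attribute tagging the evaluation actually uses):

```python
def distractor_attribute_ratio(trace: str,
                               target_attrs: set,
                               distractor_attrs: set) -> float:
    """Fraction of attribute mentions in a reasoning trace that refer to distractors."""
    tokens = trace.lower().split()
    n_target = sum(tok in target_attrs for tok in tokens)
    n_distractor = sum(tok in distractor_attrs for tok in tokens)
    total = n_target + n_distractor
    return n_distractor / total if total else 0.0

# Traces with a low ratio (< 0.2) tend to be answered correctly,
# while a high ratio (> 0.5) is strongly associated with errors.
r_attr = distractor_attribute_ratio(
    "the foreground bird has a red crest while a dog and a ball sit nearby",
    target_attrs={"bird", "crest", "red"},
    distractor_attrs={"dog", "ball"},
)
```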

4. Methods for Distractor Generation and Modeling

Automated distractor generation is foundational to Idis. Approaches vary by modality and task.

  • Synthetic Editing and Compositionality: Paste-based image edits introduce objects at algorithmically defined locations, leveraging pre-curated pools with semantic constraints (Bae et al., 26 Nov 2025, Jalal et al., 2020).
  • Low-Level Feature Manipulation: Synthetic pattern generation modulates target and distractor parameters independently (e.g., $\Delta\theta$ for orientation, $r$ for size), enforcing a specified contrast between target and distractors (Berga et al., 2019).
  • Dynamic Simulation: Simulated environments allow sampling of pose, color, and background distractors with temporal coherence and controlled stochasticity (Stone et al., 2021).
  • Adversarial Generation: For MC-VQA, RL-based generator agents select distractor answers by maximizing downstream model error, with reward signals derived from pre-trained VQA models (Lu et al., 2019).
  • Residual and Feature-Based Masking (3D/GS/NeRF): Self-supervised identification of local reconstruction residuals—smoothed, statistically thresholded, or filtered using pretrained semantic features (e.g., Stable Diffusion, SAM)—is used to detect and ignore distractor pixels and regions during model fitting; see the residual-masking sketch after this list (Sabour et al., 2023, Ungermann et al., 21 Aug 2024, Sabour et al., 28 Jun 2024).
  • Separation of Dynamics (Imitation Learning): Latent state models factor environment dynamics into task-relevant and distractor-specific transitions, maintaining policy/planning entirely within the task-relevant subspace (Wan et al., 2023).
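
A hedged sketch of residual-based distractor masking in the spirit of these robust-fitting methods, assuming per-pixel photometric residuals for a training view are already available; the box-filter smoothing and quantile threshold are illustrative simplifications:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def distractor_mask(residuals: np.ndarray, keep_quantile: float = 0.8,
                    smooth: int = 7) -> np.ndarray:
    """Flag likely-distractor pixels from per-pixel reconstruction residuals.

    residuals     : (H, W) array of absolute photometric errors for one view
    keep_quantile : fraction of lowest-residual pixels treated as inliers
    smooth        : box-filter size enforcing spatially coherent decisions
    """
    smoothed = uniform_filter(residuals, size=smooth)   # local spatial averaging
    threshold = np.quantile(smoothed, keep_quantile)    # data-driven cutoff
    return smoothed > threshold                         # True = ignore during fitting

# During optimization, masked pixels are dropped or down-weighted:
# loss = (residuals[~mask] ** 2).mean()
```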

5. Empirical Phenomena and Analytical Findings

Idis stimuli have enabled discovery of mechanisms by which both human and artificial agents become susceptible—or robust—to visual distraction.

  • Inverse Scaling with Distractors: In reasoning VLMs, adding distractors reduces accuracy but leaves reasoning trace length largely unchanged (i.e., the length-accuracy curve shifts downward without "overthinking") (Bae et al., 26 Nov 2025). This contrasts with textual distractors, where longer reasoning often accompanies degradation.
  • Semantic vs. Spatial/Count Effects: Conflicting (negatively correlated) distractors exert the steepest accuracy drop; aligned distractors pose milder challenges. Larger spatial scale amplifies distractor attribute ratios and suppresses correct reasoning (Bae et al., 26 Nov 2025).
  • Saliency and Model Type: Spectral/Fourier-based saliency models exhibit SI(Ψ) curves strongly correlated with human psychophysics under Idis arrangements (Spearman’s ρ > 0.85), while deep saliency nets underperform and lack proper contrast scaling (Berga et al., 2019).
  • Outlier Robustness in 3D Synthesis: Both RobustNeRF and SLS/SpotlessSplats demonstrate substantial PSNR/SSIM gains by algorithmically masking or down-weighting regions of high residual or low-feature similarity, with ablations confirming that robust masking is critical for suppressing artifacts from transient distractors (Sabour et al., 2023, Sabour et al., 28 Jun 2024, Ungermann et al., 21 Aug 2024).
  • Generalization Failure Modes: In control domains, background distractors (even unseen video textures) cause rapid policy collapse, while color distractors are less impactful; aggregated distractors are harder than the sum of their single-type effects (Stone et al., 2021).
  • Adversarial Distractor Generation: Generator-produced distractors, optimized to maximally degrade model accuracy, substantially increase the difficulty of MC-VQA and, when used for data augmentation, improve robustness on both original and distractor-rich benchmarks (Lu et al., 2019).

6. Mitigation Strategies and Robust Model Design

Idis motivates diverse model and training interventions to immunize vision systems against distractor-induced error.

  • Robust Losses and Masking: Iteratively reweighted or trimmed losses discard a dynamically estimated fraction of high-residual pixels; feature-based or self-supervised MLPs provide finer-grained masking via semantic features (e.g., SLS's use of Stable Diffusion features; intersection with SAM object masks for spatial coherence) (Sabour et al., 2023, Sabour et al., 28 Jun 2024, Ungermann et al., 21 Aug 2024).
  • Task-Relevance Filtering: Learned filters decrease discriminability of distractor attributes while preserving task-relevant content, formalized via joint Fisher discriminant objectives and optimized differentiable filters (Whitehill et al., 2011).
  • Domain Randomization: Enforcing wide variability in non-task visual factors during training (randomized backgrounds, lighting, view, distractor count) can harden models against adverse test-time settings (Jalal et al., 2020).
  • Prompting and Attention Guidance: For VLMs, explicit prompt engineering to focus on the “foreground” or task-relevant attributes reduces bias-driven errors and improves conflict group accuracy (e.g., “Think step by step based on the foreground bird’s attributes.”) (Bae et al., 26 Nov 2025).
  • State-Space Separation: Decomposing latent dynamics into task-relevant and distractor dimensions, as in SeMAIL, enables model-based RL and imitation agents to plan in distractor-free representations, restoring near-expert performance under severe visual contamination (Wan et al., 2023).
  • Contrastive and Channelwise Decorrelation: Change captioning under distractors is enhanced by forcing per-channel alignment and decorrelation of before/after representations (DIRL), and by cross-modal contrastive regularization tying difference features to generated words; a simplified sketch follows this list (Tu et al., 16 Jul 2024).
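
A simplified sketch of per-channel alignment with cross-channel decorrelation for the before/after features of a change-captioning pair, in the spirit of DIRL but reduced to a Barlow-Twins-style cross-correlation objective; the cited work's exact formulation differs:

```python
import torch

def align_decorrelate_loss(feat_before: torch.Tensor, feat_after: torch.Tensor,
                           off_diag_weight: float = 5e-3) -> torch.Tensor:
    """Encourage matching channels to agree and distinct channels to decorrelate.

    feat_before, feat_after : (N, C) pooled features of the 'before' and 'after' images
    """
    # Standardize each channel over the batch
    zb = (feat_before - feat_before.mean(0)) / (feat_before.std(0) + 1e-6)
    za = (feat_after - feat_after.mean(0)) / (feat_after.std(0) + 1e-6)
    n = zb.shape[0]
    corr = zb.T @ za / n                                   # (C, C) cross-correlation
    on_diag = (torch.diagonal(corr) - 1).pow(2).sum()      # align matching channels
    off_diag = (corr - torch.diag(torch.diagonal(corr))).pow(2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag
```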

7. Significance, Limitations, and Future Directions

The Idis paradigm underlies empirical advances across robust recognition, visual reasoning, RL, and synthetic attention modeling. It provides an experimental substrate for probing failure modes, spurious correlation, and the dynamics of model robustness in a systematically variable and reproducible manner.

Significant findings reveal (i) current model architectures often leverage spurious or distractor-derived cues when available, (ii) robust masking/attention yields marked improvements especially in dynamic or out-of-distribution settings, and (iii) simple prompt or architecture modifications (structured masking, feature-based filters, reward shaping) can confer substantial robustness with minimal overhead.

Limitations include potential over-masking of relevant but rare features, reliance on semantic feature extractors which may themselves be distractible, and the difficulty of distinguishing truly task-irrelevant from subtly informative distractors, especially in semantically complex scenes.

Emerging trends point toward (a) ever more sophisticated distractor generation via generative models and adversarial optimization, (b) integration of multi-task and uncertainty-aware filtering in robust learning, (c) scaling of Idis toward multi-modal and interactive domains, and (d) the use of Idis benchmarks to expose and diagnose hidden model vulnerabilities that evade detection in i.i.d. settings.


References:

  • Bae et al., 26 Nov 2025
  • Berga et al., 2019
  • Stone et al., 2021
  • Sabour et al., 2023
  • Sabour et al., 28 Jun 2024
  • Ungermann et al., 21 Aug 2024
  • Lu et al., 2019
  • Jalal et al., 2020
  • Wan et al., 2023
  • Whitehill et al., 2011
  • Tu et al., 16 Jul 2024
