Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Published 6 Apr 2026 in cs.CV | (2604.04838v2)

Abstract: Recent advancements in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment, where DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. Perceptual phenomena addresses various machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.


Authors (5)

Summary

The paper presents Degradation-Driven Prompting (DDP), which uses aggressive input downsampling and tool invocation to enforce global structural reasoning in VQA.
The methodology integrates multi-stage image degradation with agentic tool use and Chain-of-Thought prompting to reduce reliance on spurious local cues.
Empirical results show DDP achieves up to a 10% improvement over baselines, notably enhancing performance on adversarial and high-frequency distractor tasks.

Degradation-Driven Prompting for VQA: Agentic Perception via Detail Reduction

Motivation and Problem Statement

Modern Vision-Language Models (VLMs) score highly on Visual Question Answering (VQA) benchmarks, yet they systematically fail on images containing visual illusions, occlusions, and high-frequency distractors. These failures are intrinsic, rooted in the models' tendency to exploit local textures and spurious pixel-level cues instead of reasoning about global structure and semantics. The resulting perception-logic gap between human and machine visual understanding manifests as hallucinations and brittle predictions when VLMs are exposed to adversarial or ambiguous stimuli. The paper "Less Detail, Better Answers: Degradation-Driven Prompting for VQA" (2604.04838) introduces Degradation-Driven Prompting (DDP), an agentic, hierarchical framework that combines multi-stage input degradation, targeted visual prompting, and external tool invocation, with the explicit goal of enforcing global structural reasoning.

Methodological Framework

DDP’s architecture is motivated by dual-process theory from cognitive science: it moves VLMs from passive, single-shot inference to iterative, active, tool-augmented perception.

Input Degradation and Task Decomposition

The pipeline begins with aggressive input downsampling: Gaussian smoothing followed by a systematic reduction of input resolution to suppress high-frequency details. An initial classifier then routes each image-query pair into one of two tracks: Physical Attributes (e.g., size, color, geometric properties) or Perceptual Phenomena (e.g., illusions, occlusions, motion artifacts). This early separation allows specialized toolsets and prompts to be allocated according to the input’s visual complexity.
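The smooth-then-subsample step can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: the function names, the list-of-rows grayscale representation, and the [1, 2, 1] / 4 binomial kernel standing in for a Gaussian. The source does not specify kernel sizes or downsampling factors, and a real pipeline would more likely rely on an image library.

```python
def binomial_blur(img):
    """Smooth a grayscale image (list of rows) with a separable [1, 2, 1] / 4 kernel.

    The binomial kernel is a small Gaussian approximation (an assumption;
    the source only says "Gaussian smoothing").
    """

    def smooth(line):
        n = len(line)
        # Edge pixels are replicated so the output keeps the input size.
        return [
            (line[max(i - 1, 0)] + 2 * line[i] + line[min(i + 1, n - 1)]) / 4
            for i in range(n)
        ]

    rows = [smooth(row) for row in img]                # horizontal pass
    cols = [smooth(list(col)) for col in zip(*rows)]   # vertical pass
    return [list(row) for row in zip(*cols)]           # transpose back to rows


def degrade(img, factor=2, blur_passes=1):
    """Blur first, then keep every `factor`-th pixel in both directions."""
    for _ in range(blur_passes):
        img = binomial_blur(img)
    return [row[::factor] for row in img[::factor]]


# A one-pixel checkerboard is pure high-frequency content.
checker = [[255 * ((r + c) % 2) for c in range(8)] for r in range(8)]
small = degrade(checker, factor=2)
```

On the checkerboard, the interior of the degraded image collapses to a uniform mid-gray: the high-frequency pattern is gone before the resolution is reduced. Subsampling without the blur would instead alias the pattern rather than remove it.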

Figure 1: A low-resolution DDP pipeline eliminates background noise, achieving ≈50% reduction in response time and ≈50% improvement in accuracy on basic physical attribute tasks.

Agentic Tool Invocation

In the second stage, the Tool Manager, acting as an autonomous agent, iteratively applies context-dependent visual primitives: auxiliary lines for geometric rectification, cropping for context isolation, white-out masks to neutralize global distractors, blurring to attenuate local textures, and contrast enhancement for fragile feature extraction. Each tool is deployed based on the output of the prior classifier and the current state of the visual evidence set.

Figure 2: The DDP toolchain overcomes visual reasoning bottlenecks, e.g., resolving occlusion-based illusions via divide-and-conquer and region-specific tool application.

This agentic, programmatic approach shifts the VLM’s role from direct pixel-to-answer mapping toward hypothesis generation and verification, grounded in manipulated and purified evidence.
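A rough sketch of how such a tool registry might be organized is given below. The tool names, signatures, and the `apply_tools` helper are hypothetical; the source does not describe the Tool Manager's interface or how it selects tools. Images are again plain lists of rows of grayscale values, and the plan is hard-coded where the real system would have an agent choose each step.

```python
WHITE = 255


def crop(img, top, left, height, width):
    """Isolate a region of interest."""
    return [row[left:left + width] for row in img[top:top + height]]


def white_out(img, top, left, height, width):
    """Mask a distracting region with white, leaving the input untouched."""
    out = [list(row) for row in img]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = WHITE
    return out


def horizontal_line(img, row, value=0):
    """Draw an auxiliary horizontal reference line."""
    out = [list(r) for r in img]
    out[row] = [value] * len(out[row])
    return out


def stretch_contrast(img):
    """Linearly rescale intensities to the full 0..255 range."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [list(row) for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


# Hypothetical registry: tool name -> callable.
TOOLS = {
    "crop": crop,
    "white_out": white_out,
    "horizontal_line": horizontal_line,
    "stretch_contrast": stretch_contrast,
}


def apply_tools(img, plan):
    """Apply (name, kwargs) steps in order; return every intermediate image."""
    evidence = [img]
    for name, kwargs in plan:
        evidence.append(TOOLS[name](evidence[-1], **kwargs))
    return evidence
```

Keeping every intermediate image, rather than only the last one, mirrors the idea of an evidence set that later stages can inspect. Whether each tool should operate on the previous output or on the original image is a design choice this sketch makes arbitrarily.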

Structural Bottleneck and Chain-of-Thought Prompting

In the final inference step, all evidence (raw, degraded, and tool-augmented images) is further downsampled to at most 80 pixels along its longest side, establishing an information bottleneck that excludes nearly all textural noise. The "Critic" module then performs Chain-of-Thought (CoT) reasoning, using task-specific prompts and explicit alignment templates to verify the evidence logically and deduce the final answer.
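The size constraint and prompt assembly can be sketched as follows. Only the 80-pixel cap comes from the text; the function names, the choice never to upscale, and the prompt wording are illustrative assumptions, not the paper's actual templates.

```python
def bottleneck_size(width, height, max_side=80):
    """Return (width, height) scaled so the longest side is at most `max_side`.

    Aspect ratio is preserved and images that are already small are left
    unchanged (an assumption; the source does not say whether to upscale).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


def build_critic_prompt(question, evidence_labels):
    """Assemble a hypothetical Chain-of-Thought prompt over the evidence set."""
    lines = ["You are given several low-resolution views of the same image:"]
    lines += [f"  [{i}] {label}" for i, label in enumerate(evidence_labels, start=1)]
    lines += [
        "Reason step by step about global structure only.",
        "Cross-check each view before committing to an answer.",
        f"Question: {question}",
    ]
    return "\n".join(lines)


size = bottleneck_size(1920, 1080)
prompt = build_critic_prompt(
    "Which of the two lines is longer?",
    ["raw", "degraded", "tool-augmented (auxiliary lines)"],
)
```

Under these assumptions a 1920×1080 frame is reduced to 80×45, which leaves coarse layout intact but removes almost all texture.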

Figure 3: The DDP-based enhancement framework integrates task classification, agentic tool application, and low-res CoT reasoning in a structured inference pipeline.

Empirical Evaluation and Results

The DDP framework is subjected to extensive evaluation on multiple international benchmarks, including MMBench, SEED-Bench, ScienceQA, VQAv2, and the adversarial V*Bench and ColorBlind datasets. It is tested as an augmentation layer on leading VLM backbones such as Gemini-3-Pro and GPT-4o.

Key quantitative findings include:

Across standard and adversarial VQA benchmarks, DDP delivers 3–10% absolute improvements over state-of-the-art VLM backbones in zero-shot and perturbed settings.
On MMBench, SEED-Bench, and VQAv2, DDP (with Gemini-3-Pro) achieves 92.1%, 94.5%, and 89.4% accuracy, significantly outperforming the unmodified backbone (up to +8.7%).
On highly challenging visual tasks (ColorBlind, V*Bench), all standard VLMs achieve close to 0% on Pass@1; DDP achieves 29.33%.
Robustness to noise and adversarial perturbation is empirically validated: e.g., on perturbed images (DataCV CVPR Challenge), DDP demonstrates a +20% increase in accuracy over the baseline.

Ablation studies demonstrate that aggressive degradation (downsampling/blurring), autonomous tool invocation, and prompt engineering each contribute substantial, non-redundant gains. Removal of the image degradation yields the sharpest drop in performance (−8.7%), confirming the central hypothesis of the work.

Figure 4: DDP pipeline case: external tools and degradation yield purified intermediate images, enabling robust reasoning on perception-intensive edge cases.

Theoretical and Practical Implications

The primary theoretical contribution is the demonstration that deliberately constraining perceptual bandwidth via multi-level degradation compels VLMs to suppress spurious local features, thereby enforcing semantic and structural reasoning. This setup is motivated by the Data Processing Inequality: because predictions depend on the input only through the degraded image, the mutual information between high-frequency noise and model predictions is bounded above by whatever noise information survives downsampling.
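One way to write the Data Processing Inequality argument out is sketched below. The notation is assumed rather than taken from the source: N is the high-frequency noise, X the full-resolution image, X̃ = g(X) the degraded image, and Ŷ the model's prediction computed from X̃ alone.

```latex
% Assumed Markov chain: the prediction sees the noise only through the degraded image.
N \;\to\; X \;\to\; \tilde{X} = g(X) \;\to\; \hat{Y}
\quad\Longrightarrow\quad
I(N; \hat{Y}) \;\le\; I(N; \tilde{X}) \;\le\; I(N; X)
```

The inequality gives an upper bound only: degradation cannot increase the information the prediction carries about the noise, but it equally cannot increase the information about the task-relevant signal, so the benefit depends on the degradation removing noise faster than it removes structure.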

Practically, the DDP framework validates agentic, tool-augmented inference—treating modern VLMs not as static classifiers but as active, recursive reasoners capable of self-correction and hypothesis testing. The strategy is inherently extensible: new tools, prompts, and domain-specific augmentations can be incorporated to target additional failure modes.

Prospects and Future Directions

DDP establishes that "less is more" for VQA: detail reduction outperforms parameter scaling for tasks dominated by local noise and adversarial visual structure. Its agentic design paradigm aligns closely with future directions in multi-modal cognition, human-in-the-loop decision making, and robustness to distributional shift.

Potential future developments include automated toolset expansion (via meta-learning or neural-symbolic search), dynamic resolution tuning driven by uncertainty estimation, and deeper integration with external physical measurement or simulation engines for real-world robotic perception. Furthermore, interpretability and auditability are substantially enhanced, as the pipeline exposes intermediate reasoning states and tool invocations, directly supporting diagnostics and regulatory transparency.

Conclusion

Degradation-Driven Prompting (DDP) reframes multi-modal vision from passive observation to active, tool-rich reasoning, providing empirical and theoretical evidence that strategic downsampling and agentic tool-use systematically resolve failure cases endemic to modern VLMs. By focusing on global structure and leveraging modular evidence synthesis, DDP not only advances VQA accuracy but also opens explicit pathways for interpretable, reliable, and robust artificial vision architectures.
