An Intelligent Agentic System for Complex Image Restoration Problems (2410.17809v2)
Abstract: Real-world image restoration (IR) is inherently complex and often requires combining multiple specialized models to address diverse degradations. Inspired by human problem-solving, we propose AgenticIR, an agentic system that mimics the human approach to image processing by following five key stages: Perception, Scheduling, Execution, Reflection, and Rescheduling. AgenticIR leverages large language models (LLMs) and vision-language models (VLMs) that interact via text generation to dynamically operate a toolbox of IR models. We fine-tune VLMs for image quality analysis and employ LLMs for reasoning, guiding the system step by step. To compensate for LLMs' lack of specific IR knowledge and experience, we introduce a self-exploration method that lets the LLM observe restoration results and summarize them into referenceable documents. Experiments demonstrate AgenticIR's potential in handling complex IR tasks, representing a promising path toward achieving general intelligence in visual processing.
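The five stages can be illustrated with a deliberately toy Python sketch. Everything here is an assumption made for illustration and is not the AgenticIR implementation. An "image" is reduced to the set of degradation labels it still contains. VLM perception and LLM scheduling are replaced by trivial placeholder rules. The names `make_tool`, `ToyAgenticIR`, `experience`, and `max_rounds` are hypothetical. The `experience` dictionary only loosely echoes the self-exploration idea: the paper describes LLM-written summaries, not a ranked list of tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

Image = FrozenSet[str]           # toy "image": the set of degradations it still contains
Tool = Callable[[Image], Image]  # a restoration model maps an image to a new image


def make_tool(removes: str, works: bool) -> Tool:
    """Hypothetical tool: strips one degradation if `works`, otherwise returns the input unchanged."""
    def tool(img: Image) -> Image:
        return img - {removes} if works else img
    return tool


@dataclass
class ToyAgenticIR:
    toolbox: Dict[str, Dict[str, Tool]]  # degradation -> {tool name: tool}
    experience: Dict[str, List[str]] = field(default_factory=dict)  # degradation -> tools that passed reflection

    def perceive(self, img: Image) -> List[str]:
        # Perception: placeholder for a fine-tuned VLM naming the degradations it sees.
        return sorted(img)

    def schedule(self, degradations: List[str]) -> List[str]:
        # Scheduling: placeholder for LLM planning; handle degradations with past successes first.
        return sorted(degradations, key=lambda d: -len(self.experience.get(d, [])))

    def candidate_tools(self, degradation: str) -> List[str]:
        # Try tools that worked before first (a crude stand-in for consulting past experience).
        seen = self.experience.get(degradation, [])
        rest = [t for t in self.toolbox.get(degradation, {}) if t not in seen]
        return seen + rest

    def reflect(self, before: Image, after: Image, target: str) -> bool:
        # Reflection: accept only if the target is gone and no new degradation appeared.
        return target not in after and after <= before

    def run(self, img: Image, max_rounds: int = 3) -> Tuple[Image, List[str]]:
        trace: List[str] = []
        for _ in range(max_rounds):
            # Rescheduling: every round re-perceives the current image and builds a fresh plan.
            plan = self.schedule(self.perceive(img))
            progress = False
            for target in plan:
                if target not in img:
                    continue
                for name in self.candidate_tools(target):
                    out = self.toolbox[target][name](img)  # Execution
                    if self.reflect(img, out, target):
                        img, progress = out, True
                        trace.append(f"{target}:{name}")
                        tools = self.experience.setdefault(target, [])
                        if name not in tools:
                            tools.append(name)
                        break
                    # Otherwise discard `out` (roll back) and try the next tool.
            if not progress:
                break  # nothing improved this round, so a new plan would not help
        return img, trace


toolbox = {
    "noise": {"denoiser_a": make_tool("noise", works=False),
              "denoiser_b": make_tool("noise", works=True)},
    "blur": {"deblurrer": make_tool("blur", works=True)},
    "haze": {"dehazer": make_tool("haze", works=False)},
}
agent = ToyAgenticIR(toolbox)
final, trace = agent.run(frozenset({"noise", "blur", "haze"}))
print(sorted(final), trace)  # ['haze'] ['blur:deblurrer', 'noise:denoiser_b']
```

Because `denoiser_b` is now recorded in `agent.experience`, a later call tries it before `denoiser_a`. The haze that no tool can remove is left in place rather than accepted from a tool that fails reflection.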