- The paper introduces Factorized Diffusion, a method that decomposes a diffusion model's noise estimate into components of an image decomposition, each conditioned on its own text prompt, to generate hybrid images and related perceptual illusions.
- It modifies the standard diffusion process by estimating noise separately for image components based on specific text prompts, leading to controlled visual variations.
- Empirical evaluations using CLIP scores show superior fidelity and perceptual coherence, suggesting potential applications in adaptive digital signage and augmented reality.
Factorized Diffusion for Generating Perceptual Illusions with Noise Decomposition
Introduction to Factorized Diffusion
The paper introduces a novel technique termed "Factorized Diffusion" for manipulating image components through diffusion models without the need for additional training or guidance networks. This method leverages image decomposition into various sub-components (e.g., frequency bands, color channels, motion blur components), allowing each to be independently controlled via distinct text prompts during the diffusion process. This approach enables the generation of "hybrid images," which exhibit different appearances under varying perceptual conditions such as viewing distance, color perception under different lighting, or motion blur effects.
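To make the decomposition concrete, below is a minimal sketch of one possible frequency-band split using a Gaussian low-pass filter. The function name `decompose_frequencies` and the `kernel_size`/`sigma` values are illustrative assumptions, not taken from the paper.

```python
# Illustrative frequency decomposition: low-pass + residual high-pass.
# The exact filters used in the paper may differ; this only demonstrates
# that the two components sum back to the original tensor.
import torch
import torchvision.transforms.functional as TF

def decompose_frequencies(img: torch.Tensor, kernel_size: int = 33, sigma: float = 3.0):
    """Split an image (or a noise estimate) into low- and high-frequency parts."""
    low = TF.gaussian_blur(img, kernel_size=kernel_size, sigma=sigma)  # low-pass component
    high = img - low                                                   # residual high-pass component
    return low, high

# The decomposition is exact: the components recombine to the original.
x = torch.randn(1, 3, 64, 64)
low, high = decompose_frequencies(x)
assert torch.allclose(low + high, x, atol=1e-5)
```

Because the components sum back to the original, per-component noise estimates can be recombined into a single valid noise estimate, which is what the next section describes.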
Technical Insights on Implementation
- Diffusion Model Adaptation: The method modifies the noise-estimation step of the diffusion process. A standard sampler iteratively estimates and removes noise to synthesize progressively cleaner images. Factorized Diffusion instead constructs a composite noise estimate from multiple components, each conditioned on a different text prompt, giving independent control over each image component.
- Component-Specific Noise Estimation: For an image decomposition into N components, noise is estimated once per component, each time conditioned on that component's text prompt. The corresponding component is then extracted from each estimate and the pieces are summed into a modified noise estimate that guides generation, so each component of the final image adheres to its specified prompt (see the sketch after this list).
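A minimal sketch of that recombination step is shown below. It assumes a pretrained text-conditional noise predictor exposed as a `denoiser(x_t, t, prompt)` callable and a list of decomposition functions (such as the low-/high-pass pair sketched earlier); the function name and signature are illustrative, and details such as classifier-free guidance are omitted.

```python
import torch

def factorized_noise_estimate(denoiser, x_t, t, prompts, components):
    """Build a composite noise estimate from per-component, per-prompt estimates.

    `components` is a list of functions f_i that each extract one part of a
    tensor, chosen so the parts sum back to the input (e.g. low-pass and
    high-pass). Each component of the composite estimate comes from a noise
    prediction conditioned on that component's own prompt.
    """
    assert len(prompts) == len(components)
    composite = torch.zeros_like(x_t)
    for prompt, f_i in zip(prompts, components):
        eps_i = denoiser(x_t, t, prompt)    # noise estimate conditioned on prompt i
        composite = composite + f_i(eps_i)  # keep only component i of that estimate
    return composite  # used in place of the usual noise estimate at this sampling step
```

The composite estimate then drives an otherwise unmodified sampling step, which is why no retraining or auxiliary guidance network is required.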
Key Contributions and Results
- The method controls the components of an image decomposition in a zero-shot manner, using off-the-shelf pretrained diffusion models without additional training.
- The generated hybrid images are convincing, with interpretations that change markedly with the observer's viewing conditions.
- The work extends beyond traditional hybrid images by introducing triple hybrids and new perceptual illusions such as color hybrids (which change with lighting or color perception) and motion hybrids (which change under motion blur); a color-decomposition sketch follows this list.
- Empirical evaluations indicate superior performance over existing methods, with the produced images displaying higher fidelity and adherence to specified conditions.
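As a second illustration of a decomposition supporting these illusions, here is a hedged sketch of a grayscale/color split; using the channel mean as the grayscale component is an assumption made for illustration, not necessarily the paper's exact choice.

```python
import torch

def decompose_color(img: torch.Tensor):
    """Split an RGB tensor into a grayscale (luminance) component and a
    residual color component; the two sum back to the original image."""
    gray = img.mean(dim=-3, keepdim=True).expand_as(img)  # channel-mean grayscale
    color = img - gray                                     # residual color offsets
    return gray, color
```

Conditioning the grayscale component on one prompt and the color residual on another yields images whose interpretation shifts when color perception is reduced.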
Comparative Analysis
- Quality Comparisons: The paper benchmarks against the classical hybrid-image approach of Oliva et al.; the new method yields more realistic and perceptually coherent images, and the hybrids it produces align more closely with human perception across the different viewing conditions.
- Quantitative Metrics: Using CLIP scores to measure alignment with the text prompts under the corresponding viewing conditions, Factorized Diffusion surpasses previous methods, suggesting more effective control of perception-driven image attributes (a minimal evaluation sketch follows this list).
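For reference, a minimal sketch of a CLIP-based alignment score is below. It assumes the Hugging Face `transformers` library and the `openai/clip-vit-base-patch32` checkpoint; the paper's exact evaluation protocol (model variant, prompts, image transforms) may differ.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(image, prompt: str) -> float:
    """Cosine similarity between an image (e.g. a blurred or full-resolution
    view of a hybrid) and the text prompt that view is supposed to depict."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    return F.cosine_similarity(img_feat, txt_feat).item()
```

Scoring the low-frequency (blurred) view against its prompt and the full-resolution view against the other prompt gives one number per viewing condition, which can then be averaged over samples to compare methods.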
Theoretical and Practical Implications
- Broader Implications: The ability to manipulate image perception through simple model modifications opens up new avenues in personalized content creation, adaptive digital signage, and augmented reality, where visual content may need to dynamically adjust to user perceptions or environmental conditions.
- Future Direction Speculation: The paper sets the stage for further research on integrating more complex decompositions and exploring additional perceptual cues. Future models could potentially incorporate auditory or tactile elements, expanding the sensory manipulation capabilities of generative models.
Considerations on Limitations
While promising, the method has a relatively low success rate: not every sample is a consistently high-quality illusion. This limitation stems from the difficulty of generating out-of-distribution compositions and from the absence of mechanisms to prevent the prompts for different components from interfering with one another. Future enhancements could focus on refining the decomposition techniques or developing more robust conditioning strategies to improve success rates and overall image quality.
In summary, the paper presents a significant advancement in the field of image synthesis, proposing a refined approach that leverages the inherent capabilities of diffusion models to produce perceptually varied images based solely on textual control over decomposed image components.