Posterior Distillation Sampling (2311.13831v3)
Abstract: We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces.
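The core idea above — optimizing a parametric target so that its stochastic latent matches the source's latent under a shared noise draw — can be illustrated with a deliberately simplified toy. Everything in this sketch is an assumption for illustration, not the paper's implementation: `eps_pred` is a hand-made linear "denoiser" that pulls noised samples toward a conditioning vector `c` (standing in for the target attribute), and `theta` directly parameterizes a 1-D "image" (in the paper, `theta` would be NeRF or SVG parameters rendered through a differentiable pipeline).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                       # toy "image" dimensionality
alpha_bar = 0.5              # cumulative noise-schedule value at one fixed timestep

x_src = rng.standard_normal(D)   # source content
c = np.ones(D)                   # direction encoding the target attribute (assumption)
theta = x_src.copy()             # target parameters, initialized at the source

def eps_pred(x_t, cond):
    """Stand-in for a conditioned denoiser: invert the forward noising
    as if the clean sample were the conditioning vector `cond`."""
    return (x_t - np.sqrt(alpha_bar) * cond) / np.sqrt(1.0 - alpha_bar)

def latent_residual(x0, cond, eps):
    """Noise x0 with a shared epsilon, then return predicted minus injected
    noise -- a simplified analogue of the stochastic latent matched by PDS.
    (With this linear denoiser the eps term cancels analytically; with a
    real network it would not.)"""
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return eps_pred(x_t, cond) - eps

for _ in range(200):
    eps = rng.standard_normal(D)                 # shared noise for both branches
    z_src = latent_residual(x_src, x_src, eps)   # source branch, source condition
    z_tgt = latent_residual(theta, c, eps)       # target branch, edit condition
    theta -= 0.1 * (z_tgt - z_src)               # update: match the two latents

# theta is pulled toward the attribute direction c, while subtracting the
# source latent anchors each update to the source's generative trajectory
print(np.abs(theta - c).max())
```

Because the two branches share the same noise sample `eps`, the update direction isolates how the target's latent deviates from the source's, which is the mechanism the abstract describes for balancing attribute conformity against identity preservation.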