The paper "Instruction-based Image Manipulation by Watching How Things Move" presents a novel methodology termed InstructMove, which addresses the challenge of instruction-based image editing using real video frames in tandem with multimodal LLMs (MLLMs) for construction of instruction datasets. This approach stands in contrast to previous methods which relied heavily on synthetically generated datasets, often resulting in challenges such as preserving content during complex, non-rigid transformations.
Key Contributions:
- Dataset Construction Pipeline:
- The authors propose a data construction pipeline that samples pairs of frames from videos. These frame pairs capture realistic transformations, including non-rigid subject motion and complex camera movement, that are difficult to model from static images alone.
- MLLMs then generate editing instructions that describe each transformation, yielding source-target-instruction triplets (see the sketch below). Training on these triplets equips the model to perform complex edits, such as pose adjustments, viewpoint changes, and subject repositioning, while keeping the rest of the content consistent.
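A minimal sketch of the triplet-construction idea, assuming OpenCV for frame access; `query_mllm`, the frame gap, and the prompt wording are hypothetical placeholders rather than the paper's exact settings:

```python
# Sample two frames a short interval apart, then ask an MLLM to describe the
# change between them as a single editing instruction.
import cv2

def sample_frame_pair(video_path, gap_frames=30):
    """Return (source, target) frames separated by `gap_frames`."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = max(0, total // 2 - gap_frames // 2)  # pick a pair near the middle
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    ok1, source = cap.read()
    cap.set(cv2.CAP_PROP_POS_FRAMES, start + gap_frames)
    ok2, target = cap.read()
    cap.release()
    if not (ok1 and ok2):
        raise ValueError(f"Could not read a frame pair from {video_path}")
    return source, target

def build_triplet(video_path, query_mllm):
    """Build one (source, target, instruction) training triplet."""
    source, target = sample_frame_pair(video_path)
    prompt = ("Describe, as a single imperative editing instruction, "
              "how the first image should be changed to match the second.")
    instruction = query_mllm(prompt, images=[source, target])  # hypothetical MLLM wrapper
    return {"source": source, "target": target, "instruction": instruction}
```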
- Spatial Conditioning Strategy:
- A spatial conditioning strategy is introduced in which the reference image is concatenated with the noise map along the spatial dimension, rather than channel-wise, at the model input. This allows the network to make flexible structural changes without losing the contextual grounding of the source content.
- Spatial conditioning also gives the model finer-grained control during manipulation, since it can attend to the source image consistently throughout the editing process (a sketch contrasting the two layouts follows).
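A minimal sketch contrasting channel-wise and spatial conditioning, assuming latent tensors of shape (batch, channels, height, width); the denoiser call and the final cropping step are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def channel_concat_input(noisy_latent, source_latent):
    # (B, 2C, H, W): doubles the channel count, so the denoiser's first
    # convolution must be widened to accept the extra channels.
    return torch.cat([noisy_latent, source_latent], dim=1)

def spatial_concat_input(noisy_latent, source_latent):
    # (B, C, H, 2W): channel count unchanged; the source occupies half the
    # canvas, so attention layers can reference it directly.
    return torch.cat([source_latent, noisy_latent], dim=3)

def denoise_step(unet, noisy_latent, source_latent, timestep, text_emb):
    """One denoising step with spatial conditioning (illustrative only)."""
    joint = spatial_concat_input(noisy_latent, source_latent)
    pred = unet(joint, timestep, encoder_hidden_states=text_emb).sample
    # Keep only the half of the prediction corresponding to the edited image.
    return pred[..., source_latent.shape[-1]:]
```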
- Integration with Additional Controls:
- The model combines readily with mask-based localization and auxiliary spatial control mechanisms such as ControlNet, enabling region-specific edits or fine-grained control via sketches and pose inputs (one common masking recipe is sketched after this list).
- This integration makes the model more precise and more useful in real-world applications that require targeted, detailed adjustments.
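Below is a minimal sketch of mask-based localization under the assumption that edits are blended at the latent level with a binary mask aligned to the latent grid; this is one common recipe, not necessarily the paper's exact mechanism:

```python
import torch

def apply_region_mask(edited_latent, source_latent_noised, mask):
    """Keep edits inside `mask` (1 = editable), restore the source elsewhere.

    `mask` is a binary tensor of shape (B, 1, H, W) broadcast over channels;
    `source_latent_noised` is the source latent noised to the current timestep.
    """
    return mask * edited_latent + (1.0 - mask) * source_latent_noised
```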
Experiments and Evaluation:
- The model, fine-tuned on the new dataset with these strategies, achieves state-of-the-art performance on several image-manipulation benchmarks.
- Quantitative evaluation with CLIP-based metrics, such as CLIP feature distances, shows stronger alignment with textual instructions and better content fidelity than existing models such as InstructPix2Pix, MagicBrush, and UltraEdit (a sketch of such CLIP scoring follows this list).
- A user study shows a significant preference for results produced with this approach, further validating its effectiveness on real-image editing tasks.
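A minimal sketch of CLIP-based scoring, assuming Hugging Face transformers and the openai/clip-vit-base-patch32 checkpoint: it computes edited-image-to-caption similarity as a proxy for instruction alignment and source-to-edited-image similarity as a proxy for content fidelity. The paper's exact metric definitions may differ; this only illustrates the idea.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(source_img, edited_img, target_caption):
    inputs = processor(text=[target_caption], images=[source_img, edited_img],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])
    # Normalize so dot products are cosine similarities.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    alignment = (img_feats[1] @ txt_feats[0]).item()  # edited image vs. caption
    fidelity = (img_feats[0] @ img_feats[1]).item()   # source vs. edited image
    return {"clip_text_alignment": alignment, "clip_image_fidelity": fidelity}
```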
Limitations and Future Work:
While the InstructMove method advances over existing techniques by leveraging real transformations found in videos, it has limitations: unintended edits occasionally occur because MLLM-generated instructions can be imprecise, and since the dataset focuses on realistic transformations, the model is less effective for artistic modifications such as style transfer. Future work could improve the filtering of generated instructions and combine this dataset with others to broaden its applicability across diverse image-editing scenarios.
Overall, this work introduces significant methodological innovations in utilizing video data for scalable and realistic image manipulation training, paving the way for more robust and versatile models in the domain of image editing.