IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Published 2 Jun 2025 in cs.CV | (2506.01949v1)

Abstract: Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at https://github.com/muzishen/IMAGHarmony.


Authors (7)

Summary

The paper introduces QL-Edit, a novel task for precise control of object counts and spatial arrangements in image editing.
It proposes the IMAGHarmony framework that leverages harmony-aware attention and preference-guided noise selection to enhance semantic and structural fidelity.
The study establishes HarmonyBench to benchmark multi-object editing, demonstrating superior alignment and generation stability over existing methods.

The paper “IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout” addresses multi-object image editing, where precise control over object categories, counts, and spatial layouts is essential. Recent diffusion models have greatly improved image editing, yet they often fail to maintain structural consistency when a scene contains multiple objects. The paper introduces Quantity-and-Layout Consistent Image Editing (QL-Edit) as a new task, aiming for fine-grained control of object quantity and spatial structure in complex scenes.

IMAGHarmony is the proposed solution: a structure-aware framework that combines harmony-aware attention (HA) with a preference-guided noise selection (PNS) strategy. The HA module integrates multimodal semantics and explicitly models object counts and layouts to improve editing accuracy and structural consistency. The PNS strategy addresses the sensitivity of diffusion models to initial noise: it chooses semantically aligned initial noise samples based on vision-language matching, which improves generation stability and layout consistency.
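The selection step behind PNS can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the candidate seed list, and the toy scoring function are all hypothetical. In a real pipeline the score would presumably come from a vision-language model comparing a preview generated from each seed against the edit prompt; here a deterministic stand-in is used so the example runs on its own.

```python
import random
from typing import Callable, Sequence


def select_preferred_seed(seeds: Sequence[int],
                          score_fn: Callable[[int], float]) -> int:
    """Return the candidate seed with the highest score under score_fn.

    Hypothetical helper: score_fn is assumed to measure how well the
    output produced from a seed matches the target prompt.
    """
    if not seeds:
        raise ValueError("need at least one candidate seed")
    return max(seeds, key=score_fn)


def toy_score(seed: int) -> float:
    # Stand-in for a vision-language matching score (an assumption).
    # It only maps each seed to a reproducible pseudo-random value.
    return random.Random(seed).random()


candidate_seeds = list(range(8))
best_seed = select_preferred_seed(candidate_seeds, toy_score)
print("selected seed:", best_seed)
```

The design point is that selection happens before the full editing run: several initial noise samples are scored cheaply, and only the best-aligned one is kept.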

Key Contributions:

QL-Edit Task Introduction: By defining QL-Edit, the paper formalizes image editing that must adhere strictly to specified object counts and spatial arrangements.
IMAGHarmony Framework: Combining HA and PNS, IMAGHarmony improves semantic and structural fidelity in multi-object scenarios. The HA module explicitly encodes object quantity and implicitly captures spatial relations, while PNS selects initial noise based on vision-language matching.
HarmonyBench Benchmark: The paper also introduces HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios, which supports systematic assessment of structural alignment and semantic accuracy.
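To make the idea of quantity consistency concrete, the sketch below scores edited images by exact count match per category. It is an illustrative assumption, not HarmonyBench's actual metric: the function names are hypothetical, and the detected labels are assumed to come from some external object detector that is not shown.

```python
from collections import Counter
from typing import Mapping, Sequence, Tuple


def counts_match(detected_labels: Sequence[str],
                 target_counts: Mapping[str, int]) -> bool:
    """True if every target category appears exactly the requested number of times."""
    detected = Counter(detected_labels)
    return all(detected.get(cat, 0) == n for cat, n in target_counts.items())


def count_accuracy(
    samples: Sequence[Tuple[Sequence[str], Mapping[str, int]]]
) -> float:
    """Fraction of samples whose detected counts match the target counts."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(counts_match(labels, target) for labels, target in samples)
    return hits / len(samples)


# Two hypothetical edited images: the first meets its target, the second does not.
samples = [
    (["cat", "cat", "dog"], {"cat": 2, "dog": 1}),
    (["cat"], {"cat": 2}),
]
accuracy = count_accuracy(samples)
print(accuracy)
```

A layout score would need object positions as well as labels, which this sketch deliberately leaves out.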

Experimental Results:

In extensive experiments, IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy across diverse multi-object editing scenarios.

Implications and Future Work:

Practically, this research could benefit fields that require precise and consistent image modifications, including creative industries and personalized content creation. Theoretically, IMAGHarmony lays a foundation for further work on structural consistency in image generation models. Future work may improve the semantic understanding and adaptability of such models when handling larger and more intricate multi-object scenes.

In summary, IMAGHarmony addresses the structural consistency problems that current diffusion models face in image editing. It improves controllability so that object count, category, and layout stay coherent and faithful to the user’s instructions.
