Step1X-Edit: A Practical Framework for General Image Editing (2504.17761v4)

Published 24 Apr 2025 in cs.CV

Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which provides performance comparable to closed-source models such as GPT-4o and Gemini2 Flash. More specifically, we adopt a Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

This paper, "Step1X-Edit: A Practical Framework for General Image Editing" (Liu et al., 24 Apr 2025 ), addresses the notable performance gap between state-of-the-art proprietary image editing models, such as GPT-4o and Gemini2 Flash, and existing open-source solutions. The authors introduce Step1X-Edit, an open-source framework designed for general image editing based on natural language instructions, aiming to achieve performance comparable to closed-source systems.

The core contributions of the work are threefold:

  1. The release of the Step1X-Edit model itself, promoting further research and development in the open-source community.
  2. A scalable and flexible data generation pipeline capable of producing large-scale, high-quality instruction-image triplets for training image editing models.
  3. The development of GEdit-Bench, a novel benchmark based on real-world user instructions, for more authentic and comprehensive evaluation of image editing models.

Data Creation and Pipeline: Recognizing the limitations of existing datasets in terms of scale and quality, the authors designed a comprehensive data pipeline. This pipeline begins with web crawling for diverse image editing examples, categorizing them into 11 major tasks (e.g., object addition/removal, replacement, color/material modification, text editing, motion change, portrait editing, style transfer, tone transformation). Generating training triplets (source image, instruction, target image) involves multiple steps utilizing various state-of-the-art vision models, LLMs, and tools. For instance, tasks like subject addition/removal leverage Florence-2 (Ballesteros et al., 10 Apr 2024 ) for annotation, SAM-2 (Ravi et al., 1 Aug 2024 ) for segmentation, and ObjectRemovalAlpha (Kalogeras et al., 28 Apr 2025 ) for inpainting. Other tasks employ models like Qwen2.5-VL (Bai et al., 19 Feb 2025 ), the Recognize-Anything Model (Zhang et al., 2023 ), FLUX-Fill (Li et al., 25 Apr 2024 ), ControlNet (Mou et al., 2023 ) with Stable Diffusion 3.5 (Lima et al., 7 Mar 2024 ), ZoeDepth (Bhat et al., 2023 ), PP-OCR (Du et al., 2020 ), BiRefNet (Nguyen, 8 Mar 2024 ), and RAFT (Kimura, 2020 ).
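
As an illustration of how such tools are chained for a single task, the following Python sketch walks one image through an object-removal triplet. The three helper functions are hypothetical stand-ins for Florence-2 (detection), SAM-2 (segmentation), and an inpainting model; their real APIs are not reproduced here.

```python
# Illustrative sketch of one task in the data pipeline: object removal.
# The helper functions are hypothetical placeholders, not the actual
# Florence-2 / SAM-2 / ObjectRemovalAlpha interfaces.
from typing import Tuple


def detect_object(image_path: str, label: str) -> Tuple[int, int, int, int]:
    """Hypothetical detection wrapper: return a bounding box for `label`."""
    raise NotImplementedError


def segment(image_path: str, box: Tuple[int, int, int, int]):
    """Hypothetical segmentation wrapper: return a binary mask inside the box."""
    raise NotImplementedError


def inpaint(image_path: str, mask) -> str:
    """Hypothetical inpainting wrapper: return the path of the edited image."""
    raise NotImplementedError


def make_removal_triplet(image_path: str, label: str):
    """Produce a (source image, instruction, target image) triplet for object removal."""
    box = detect_object(image_path, label)
    mask = segment(image_path, box)
    target_path = inpaint(image_path, mask)
    instruction = f"Remove the {label} from the image."
    return image_path, instruction, target_path
```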

The data generation process yields over 20 million triplets, which are then filtered down to over 1 million high-quality examples using a combination of Multimodal LLMs (like Step-1o (Wang et al., 10 Feb 2025 ) and GPT-4o (Chan et al., 26 Mar 2025 )) and human annotators. A redundancy-enhanced, multi-round annotation strategy with MLLMs helps refine instructions and mitigate hallucinations. Stylized annotation via contextual examples ensures consistency, while a cost-efficient pipeline uses GPT-4o for initial annotation and Step-1o for scaling. The dataset supports bilingual annotation (Chinese-English), crucial for broader applicability.
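
A minimal sketch of the filtering stage is shown below; the `score_triplet` judge is a hypothetical stand-in for the MLLM raters (e.g., Step-1o or GPT-4o), and the acceptance threshold and majority-vote scheme are illustrative assumptions rather than the authors' exact recipe.

```python
# Sketch of MLLM-based triplet filtering with redundant (multi-round) judging.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EditTriplet:
    source_image: str   # path to the source image
    instruction: str    # natural-language editing instruction
    target_image: str   # path to the edited (target) image


def filter_triplets(
    triplets: List[EditTriplet],
    score_triplet: Callable[[EditTriplet], float],  # hypothetical MLLM judge, returns a 0-1 score
    threshold: float = 0.8,
    rounds: int = 3,
) -> List[EditTriplet]:
    """Keep a triplet only if the judge accepts it in a majority of rounds.

    Querying the judge several times per sample (redundancy-enhanced annotation)
    reduces the impact of individual MLLM hallucinations.
    """
    kept = []
    for triplet in triplets:
        votes = sum(score_triplet(triplet) >= threshold for _ in range(rounds))
        if votes > rounds // 2:
            kept.append(triplet)
    return kept
```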

Model Architecture: Step1X-Edit adopts a unified architecture combining a Multimodal LLM (MLLM) with a Diffusion Transformer (DiT). The proposed framework consists of three key components:

  • MLLM: Processes the reference image and the editing instruction (e.g., using Qwen-VL (Bai et al., 19 Feb 2025 )) and extracts the semantic information needed for the edit.
  • Connector Module: A lightweight module (e.g., a token refiner (Ma et al., 17 Jun 2024 , Kong et al., 3 Dec 2024 )) that receives the MLLM's output embeddings (after filtering out prefix tokens) and restructures them into a compact textual feature representation.
  • Diffusion Transformer (DiT): A diffusion model operating in the latent space (e.g., based on FLUX (Li et al., 25 Apr 2024 )), responsible for generating the final image.

The workflow is as follows: The reference image and instruction are fed into the MLLM. The instruction-related MLLM embeddings are processed by the connector module, and the resulting refined representation replaces the text embedding that a T5 (Sun et al., 2020 ) encoder would typically provide to the DiT. Additionally, a global visual guidance vector is derived from the mean of the MLLM's output embeddings via a linear projection, further conditioning the DiT.
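
The conditioning path can be sketched in PyTorch as follows. The hidden sizes, prefix-token handling, and token-refiner design are illustrative assumptions; the paper specifies only that the filtered MLLM embeddings pass through a connector and that the global guidance vector is a linear projection of their mean.

```python
# Minimal sketch of the connector module described above (assumed shapes and layers).
import torch
import torch.nn as nn


class Connector(nn.Module):
    def __init__(self, mllm_dim: int = 3584, dit_text_dim: int = 4096, n_layers: int = 2):
        super().__init__()
        # Token refiner: a small Transformer encoder that restructures the MLLM's
        # instruction-related embeddings into a compact textual feature.
        layer = nn.TransformerEncoderLayer(d_model=mllm_dim, nhead=8, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_text = nn.Linear(mllm_dim, dit_text_dim)    # stands in for the T5 text embedding
        self.to_global = nn.Linear(mllm_dim, dit_text_dim)  # global visual guidance vector

    def forward(self, mllm_tokens: torch.Tensor, num_prefix: int):
        # Drop system/prefix tokens, keeping the instruction-related embeddings.
        tokens = mllm_tokens[:, num_prefix:, :]
        text_feat = self.to_text(self.refiner(tokens))          # [B, T, dit_text_dim]
        global_feat = self.to_global(mllm_tokens.mean(dim=1))   # [B, dit_text_dim]
        return text_feat, global_feat
```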

For training, the connector and DiT are optimized jointly. The process is initialized with pretrained weights from in-house MLLM and text-to-image DiT models. A token concatenation mechanism, inspired by FLUX-Fill (Li et al., 25 Apr 2024 ), is used during training: the VAE-encoded latent of the target image (with noise) is concatenated with the VAE-encoded latent of the reference image (without noise). This fused visual input helps the model reason over contrastive contexts. A learning rate of 1e-5 is used. A practical advantage of this architecture is that it does not require explicit masks for editing, relying solely on the instruction and reference image.
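
A compact sketch of the token-concatenation step is given below; the rectified-flow-style noising of the target latent is an assumption for illustration, since the exact patchification and noise schedule are not restated here.

```python
# Sketch of the training-time token concatenation: a noisy target latent is
# concatenated with a clean reference latent along the token axis.
import torch


def build_dit_input(target_latent: torch.Tensor,
                    ref_latent: torch.Tensor,
                    t: torch.Tensor) -> torch.Tensor:
    """target_latent, ref_latent: [B, N, C] VAE latent tokens; t: [B] noise levels in [0, 1]."""
    noise = torch.randn_like(target_latent)
    # Noise only the target branch; the reference branch stays clean so the DiT
    # can reason over the contrast between the two.
    noisy_target = (1.0 - t)[:, None, None] * target_latent + t[:, None, None] * noise
    return torch.cat([noisy_target, ref_latent], dim=1)  # concatenate along the token axis
```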

Benchmark and Evaluation (GEdit-Bench): To provide a more realistic evaluation, the authors collected GEdit-Bench. It comprises 606 testing examples with real-world reference images and editing instructions sourced from the internet (e.g., Reddit). Instructions are manually classified into the same 11 categories used for data creation. To protect user privacy, a de-identification protocol is implemented, involving reverse image searches to find public alternatives or modifying instructions to fit public domain images.

The benchmark is used to evaluate Step1X-Edit against various open-source (Instruct-Pix2Pix (Brooks et al., 2022 ), MagicBrush (Duenas-Vidal et al., 2023 ), AnyEdit (Yu et al., 24 Nov 2024 ), OmniGen (Xiao et al., 17 Sep 2024 )) and closed-source (GPT-4o (Chan et al., 26 Mar 2025 ), Gemini2 Flash (Jung et al., 25 Mar 2025 ), Doubao (Shi et al., 11 Nov 2024 )) models. Evaluation utilizes VIEScore (Ku et al., 2023 ) metrics: Semantic Consistency (SC), Perceptual Quality (PQ), and Overall (O), evaluated by GPT-4.1 and Qwen2.5-VL (Bai et al., 19 Feb 2025 ) for reproducibility. Results are reported for both English and Chinese instructions and on two sets: an "Intersection subset" (where all models returned valid responses) and the "Full set". Step1X-Edit significantly outperforms existing open-source models and shows comparable or superior performance to some leading closed-source models, particularly excelling in Chinese editing instructions and certain tasks like style and color changes.
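
For reference, VIEScore defines the Overall score as the geometric mean of SC and PQ; the small sketch below shows only this combination step, since the per-sample SC and PQ values come from the MLLM judge (GPT-4.1 or Qwen2.5-VL here) and are not reproduced.

```python
# Combining VIEScore-style sub-scores into the Overall metric.
import math


def overall_score(sc: float, pq: float) -> float:
    """SC and PQ are judge scores on a 0-10 scale; Overall is their geometric mean."""
    return math.sqrt(sc * pq)


# Example: SC = 8, PQ = 6  ->  Overall ~= 6.93
print(overall_score(8.0, 6.0))
```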

A user study with 55 participants also indicates that Step1X-Edit produces results with subjective quality comparable to proprietary models. While Gemini2 Flash received high scores partly due to its strong identity preservation, Step1X-Edit demonstrates competitive overall user preference.

In conclusion, the Step1X-Edit framework, coupled with its large-scale data pipeline and real-world benchmark, represents a significant step towards closing the gap between open-source and proprietary general image editing systems. Its practical design, leveraging the strengths of MLLMs and DiTs without requiring masks, makes it a valuable contribution for developers and researchers in the field. The planned public release of the model and benchmark data is expected to accelerate further advancements.

Authors (24)
  1. Shiyu Liu
  2. Yucheng Han
  3. Peng Xing
  4. Fukun Yin
  5. Rui Wang
  6. Wei Cheng
  7. Jiaqi Liao
  8. Yingming Wang
  9. Honghao Fu
  10. Chunrui Han
  11. Guopeng Li
  12. Yuang Peng
  13. Quan Sun
  14. Jingwei Wu
  15. Yan Cai
  16. Zheng Ge
  17. Ranchen Ming
  18. Lei Xia
  19. Xianfang Zeng
  20. Yibo Zhu