PICA-100K: Physics-Aware Image Editing Dataset
- PICA-100K is a large-scale synthetic dataset that pairs source images with edited images exhibiting accurate physical effects such as shadows, deformations, and state transitions.
- It employs a fully automated pipeline integrating text-to-image and image-to-video models, with GPT-5 generating nuanced editing instructions and annotations.
- Models fine-tuned using PICA-100K show measurable gains in physical consistency and overall editing realism, setting a new standard in physics-aware visual editing.
PICA-100K is a large-scale synthetic dataset engineered to advance physically realistic image editing. Developed to address the persistent gap in modeling physical effects within edited imagery, PICA-100K supplies paired examples that reflect not only the content transformations prescribed by editing instructions but also the associated physical consequences, such as accurate shadows, deformations, and state transitions. The dataset’s organization, generation pipeline, and subsequent usage in model fine-tuning collectively position it as a central resource in the pursuit of physics-aware visual editing.
1. Dataset Composition and Organization
PICA-100K comprises 105,085 instruction-based editing samples. Each sample consists of a source image and an edited image that embodies a specific physical transformation in response to a textual instruction. Instructions are provided in three variants (superficial, intermediate, and explicit), generated and refined with GPT-5. The dataset is organized across eight physics categories aligned with three overarching dimensions: Optics (light propagation, reflection, refraction, and light-source effects), Mechanics (deformation and causality), and State Transition (global and local scene changes). Each category is constructed so that edits adhere to physical laws rather than merely satisfying the instruction semantically; a schematic sample record is sketched after the table below.
| Dimension | Examples of Physical Phenomena | Example Edit Instructions |
|---|---|---|
| Optics | Shadow, reflection, refraction, light-source | "Remove the lamp, including all shadows/reflections" |
| Mechanics | Deformation, causality | "Bend the pole; ensure metal deformation" |
| State Transition | Local/global changes | "Change from day to night, keeping correct color shifts" |
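For concreteness, the sketch below shows one plausible way a PICA-100K sample could be represented in code. The field names (source_image, edited_image, the three instruction variants, dimension, category) are illustrative assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical schema for a single PICA-100K sample; field names are
# illustrative and not taken from the released dataset files.
@dataclass
class PicaSample:
    source_image: str                # path to the source image
    edited_image: str                # path to the physically edited image
    dimension: Literal["optics", "mechanics", "state_transition"]
    category: str                    # one of the eight physics categories
    instruction_superficial: str     # e.g. "add a tree"
    instruction_intermediate: str    # e.g. "add a tree with realistic shadow"
    instruction_explicit: str        # e.g. "add a tree; shadow must match the light direction"

sample = PicaSample(
    source_image="scenes/00042_src.png",
    edited_image="scenes/00042_edit.png",
    dimension="optics",
    category="shadow",
    instruction_superficial="Add a lamp on the desk",
    instruction_intermediate="Add a lamp on the desk with a consistent shadow",
    instruction_explicit="Add a lamp on the desk; cast its shadow away from the window light",
)
```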
2. Data Generation Pipeline
PICA-100K is synthesized via a fully automated pipeline that integrates text-to-image and image-to-video generative models. Scene and subject creation is handled by FLUX.1-Krea-dev (text-to-image), while physical state changes are simulated by Wan2.2-14B (image-to-video). GPT-5 is employed in multiple capacities: to generate nuanced prompts for scene composition, to produce motion-oriented editing instructions, and to automatically annotate each source-edited image pair with the intended transformation. Each generated video simulates a plausible physical state change; the source and final frames are extracted to serve as a paired example. This methodology ensures both the diversity and the physical plausibility of the visual phenomena encoded in the dataset.
The core stages of the pipeline are summarized in Figure 1 of the referenced paper (Pu et al., 20 Oct 2025), which traces each step from textual prompt through video synthesis to frame extraction.
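The sketch below restates this flow as code under stated assumptions: a text-to-image model synthesizes the source frame, an image-to-video model simulates the physical state change, the final frame becomes the edited image, and an LLM writes the instruction and annotation. The helper names are placeholders, not the actual APIs of FLUX.1-Krea-dev, Wan2.2-14B, or GPT-5.

```python
from typing import List, Tuple

# Placeholder hooks standing in for the models used in the pipeline
# (FLUX.1-Krea-dev for text-to-image, Wan2.2-14B for image-to-video,
# GPT-5 for instructions and annotations). Replace with real model calls.
def generate_image(prompt: str):
    raise NotImplementedError("call a text-to-image model here")

def animate(image, motion_prompt: str) -> List:
    raise NotImplementedError("call an image-to-video model here")

def caption_pair(source, edited, motion_prompt: str) -> Tuple[str, str]:
    raise NotImplementedError("call an LLM to write instruction + annotation")

def build_sample(scene_prompt: str, motion_prompt: str) -> dict:
    source = generate_image(scene_prompt)      # synthesize the source scene
    frames = animate(source, motion_prompt)    # simulate the physical state change
    edited = frames[-1]                        # final frame serves as the edited image
    instruction, annotation = caption_pair(source, edited, motion_prompt)
    return {"source_image": source, "edited_image": edited,
            "instruction": instruction, "annotation": annotation}
```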
3. Instructional Design and Physical Supervision
Every sample in PICA-100K is equipped with an instruction intentionally constructed to elicit not only the “edit” but also the physically consistent effects associated with that edit. Instruction variants range from superficial ("add a tree") through intermediate ("add a tree with realistic shadow"), to explicit ("add a tree ensuring shadow aligns with light direction and ground deformation underneath the roots"). GPT-5 automates both the linguistic generation and the annotation process, yielding high-quality, coherent supervision signals that instruct editing models to mimic real-world physics.
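As an illustration of this three-tier design, the template below shows how an LLM could be prompted to expand a superficial instruction into intermediate and explicit variants. The prompt wording is an assumption made for illustration, not the prompt actually used with GPT-5 in the paper.

```python
# Illustrative prompt template for expanding a superficial edit instruction
# into intermediate and explicit variants; the wording is an assumption,
# not the paper's actual GPT-5 prompt.
VARIANT_PROMPT = """You are writing image-editing instructions.
Superficial instruction: "{superficial}"
Physics category: {category}

1. Rewrite it as an INTERMEDIATE instruction that names the main physical
   effect the edit should produce (e.g. shadow, reflection, deformation).
2. Rewrite it as an EXPLICIT instruction that states exactly how the physical
   effect must behave (light direction, material response, state change).
Return the two rewrites on separate lines."""

def build_variant_prompt(superficial: str, category: str) -> str:
    return VARIANT_PROMPT.format(superficial=superficial, category=category)

print(build_variant_prompt("add a tree", "shadow"))
```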
This design explicitly targets limitations in prior datasets such as Mira400K, which are sourced from natural video and may lack consistently enforced physical coherence. PICA-100K’s synthetic construction overcomes the annotation and supervision ambiguity common in natural data sources.
4. Applications in Model Training and Evaluation
PICA-100K is principally used to fine-tune image editing models toward physically consistent realism. For example, FLUX.1-Kontext-dev is fine-tuned with LoRA on supervised pairs drawn from PICA-100K. Experimental results show measurable improvements from this fine-tuning: overall accuracy increases by 1.71% over the baseline, and physical consistency rises from 24.57 dB to 25.23 dB.
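A minimal configuration sketch of LoRA adaptation in the spirit described above, using the Hugging Face peft library, is shown below. The rank, alpha, target module names, and the base-model loading step are assumptions, not the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup for adapting an image-editing backbone; the rank,
# alpha, and target module names are assumptions, not the paper's settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (assumed naming)
    lora_dropout=0.0,
)

# `transformer` would be the editing model's denoising backbone, e.g. loaded
# from a FLUX.1-Kontext-dev checkpoint; loading and the training loop over
# PICA-100K pairs are omitted here.
# transformer = get_peft_model(transformer, lora_config)
# transformer.print_trainable_parameters()
```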
The primary evaluation approach leverages PICABench and PICAEval (Pu et al., 20 Oct 2025), both designed to systematically assess physical realism in model outputs. The benchmarks employ VLM-as-a-judge scoring, coupled with region-level human annotation, targeting both completion of edit instructions and accompanying physical effects.
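The sketch below illustrates one way such region-level VLM-as-a-judge scoring could be implemented; the question format, region cropping, and the `ask_vlm` hook are illustrative assumptions rather than the exact PICAEval protocol.

```python
from typing import Callable, Dict, List

# Schematic region-level VLM-as-a-judge scoring. `ask_vlm` is a placeholder
# for whichever vision-language model serves as the judge; the question and
# answer format are assumptions, not the exact PICAEval protocol.
def score_edit(
    edited_image,
    regions: List[Dict],                           # human-annotated regions with physics questions
    ask_vlm: Callable[[object, str], str],
) -> float:
    correct = 0
    for region in regions:
        crop = edited_image.crop(region["bbox"])   # focus the judge on the annotated region
        answer = ask_vlm(crop, region["question"]) # e.g. "Is the shadow direction consistent?"
        correct += int(answer.strip().lower().startswith(region["expected"]))
    return correct / max(len(regions), 1)          # fraction of physics checks passed
```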
5. Addressed Challenges and Key Findings
PICA-100K confronts several central challenges in physics-aware image editing:
- Ensuring that edited outputs correctly adhere to real-world physical laws, such as shadow removal upon object deletion, gravitational support after content change, and realistic material deformation on attribute edits.
- Overcoming limitations in naturally acquired datasets that cannot guarantee precise, coherent supervision or capture subtle physical phenomena.
- Providing paired data that enables the internalization of physical principles by deep learning models, as evidenced by improved output realism and measurable gains in domain-specific metrics.
Key findings indicate that models fine-tuned on PICA-100K reliably reproduce not only the prescribed edits but also the indirect physical consequences, setting a new standard for supervised physics-centric image editing.
6. Context, Significance, and Future Directions
The introduction of PICA-100K marks a substantive shift toward physically consistent realism in image editing. Unlike prior efforts that focus narrowly on instruction completion, the dataset’s design and annotation pipeline systematically prioritize comprehensive physical effects. This approach addresses previously overlooked aspects of editing realism and opens avenues for future development in generative visual systems.
A plausible implication is that further scaling and diversification of synthetic, physics-centric training data will catalyze advances not only in image editing realism but also in related tasks, including simulation, video prediction, and virtual environment construction. The integration of multi-turn editing, reinforcement learning with physics-based reward signals, and joint video-image dataset construction represents potential future research directions.
PICA-100K provides a reproducible, extensible foundation for researchers seeking to close the realism gap in image editing, setting a reference standard for both dataset construction and evaluation protocol in the field.