DataEvolver: Let Your Data Build and Improve Itself via Goal-Driven Loop Agents

Published 3 May 2026 in cs.AI | (2605.01789v1)

Abstract: Constructing controllable visual data is a major bottleneck for image editing and multimodal understanding. Useful supervision is rarely produced by a single rendering pass; instead it emerges through iterative generation, inspection, correction, filtering, and export. We present DataEvolver, a closed-loop visual data engine that organizes this process around explicit goals, persistent artifacts, bounded corrective actions, and acceptance decisions. DataEvolver supports multiple artifact types, including RGB images, masks, depth maps, normal maps, meshes, poses, trajectories, and review traces. In the current release, the system operates through two coupled loops: generation-time self-correction within each sample and validation-time self-expansion across dataset rounds. We validate the framework on an image-level object-rotation setting. With a fixed Qwen-Edit LoRA probe, our final Ours+DualGate model outperforms both the unadapted base model and a public multi-angle LoRA on SpatialEdit and a held-out evaluation set. Ablations show a consistent improvement path from scene-aware generation to feedback-driven correction and dual-gated validation. Beyond the released rotation data, our main contribution is a reusable framework for building visual datasets through explicit goal tracking, review, correction, and acceptance loops.

Authors (10)

Summary

The paper presents a formal, goal-driven loop-agent paradigm that iteratively constructs and corrects visual data via explicit artifact graphs.
It features a dual-loop mechanism where an inner loop handles sample-specific corrections and an outer loop leverages downstream feedback for quality assurance.
Evaluation on scene-aware tasks shows that the Ours +DualGate configuration outperforms baselines on metrics like PSNR, SSIM, and LPIPS.

DataEvolver: Formal Analysis of Goal-Driven Loop-Agent Visual Data Construction

Motivation and Engine Abstraction

The paper introduces a formal and extensible paradigm for controllable visual data construction that transcends conventional one-pass render pipelines. DataEvolver operationalizes this paradigm, shifting the locus of dataset creation from loose script chaining to explicit, goal-centric loop agents that iteratively inspect, correct, and accept visual artifacts. This approach imbues the construction process with persistent state and bounded controller actions, enabling traceability and explicit quality assurance across complex artifact bundles including RGB images, masks, depth maps, normals, meshes, poses, trajectories, transformation scripts, and review traces.

The need for closed-loop construction follows from how failures propagate in multi-artifact export. A single scene-level mistake can corrupt several supervision channels at once, and a one-pass pipeline has no means to diagnose, correct, or track such failures before they reach downstream model training. DataEvolver addresses this by encoding operational knowledge into artifact graphs and review channels rather than relegating validation and correction to non-reproducible post-hoc processes.

Figure 1: The goal-driven-loop-agent workflow turns a visual data request into an explicit artifact graph, reviewed and routed via VLM/CV signals, verdicts, and probes.

Dual-Loop Self-Evolution and Artifact Schema

DataEvolver's construction logic is implemented via dual interconnected loop agents—a generation-time inner loop and a validation-time outer loop. The inner loop executes sample-specific correction using review signals (lighting, grounding, pose, mask integrity, object identity) and bounded controller actions, iterating until acceptance, rejection, or plateau. The outer loop aggregates downstream validation feedback (e.g., weak angle or category coverage, action-label mismatch), informing prioritized dataset expansion or targeted resampling.
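The inner loop's stopping behavior (acceptance, rejection, or plateau under a bounded number of corrective actions) can be illustrated with a minimal sketch. This is not the paper's implementation: the `[0, 1]` score scale, the thresholds, the step budget, the rule of correcting the worst-scoring channel first, and the names `inner_loop`, `review`, and `correct` are all assumptions made for illustration. Only the five review channels come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Review channels named in the text; scoring them in [0, 1] is an assumption.
CHANNELS = ("lighting", "grounding", "pose", "mask_integrity", "object_identity")

Sample = Dict[str, float]


@dataclass
class LoopResult:
    verdict: str                 # "accept", "reject", or "plateau"
    steps: int                   # corrective actions applied before the verdict
    history: List[float] = field(default_factory=list)  # worst score per review


def inner_loop(
    sample: Sample,
    review: Callable[[Sample], Dict[str, float]],
    correct: Callable[[Sample, str], Sample],
    accept_at: float = 0.8,      # hypothetical acceptance threshold
    reject_below: float = 0.2,   # hypothetical hard-failure threshold
    max_steps: int = 5,          # bound on corrective actions
    min_gain: float = 0.01,      # smaller improvement counts as a plateau
) -> LoopResult:
    history: List[float] = []
    for step in range(max_steps + 1):
        scores = review(sample)
        worst = min(CHANNELS, key=lambda c: scores[c])
        history.append(scores[worst])
        if scores[worst] >= accept_at:
            return LoopResult("accept", step, history)
        if scores[worst] < reject_below:
            return LoopResult("reject", step, history)
        if len(history) >= 2 and history[-1] - history[-2] < min_gain:
            return LoopResult("plateau", step, history)
        if step < max_steps:
            # Bounded action: fix only the weakest channel this round.
            sample = correct(sample, worst)
    # Budget exhausted without acceptance; treated as a rejection here.
    return LoopResult("reject", max_steps, history)
```

In a toy setting where `review` returns the sample's own scores and `correct` raises one channel by a fixed amount, a sample with a single weak channel is accepted after a few steps, while a `correct` that changes nothing ends in a plateau after one step.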

Figure 2: The inner loop corrects sample artifacts at generation time; the outer loop translates downstream evaluation into data-construction decision-making.
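One way to picture how the outer loop might turn downstream evaluation into a data-construction decision is a deficit-proportional allocation: coverage buckets whose mean probe score falls short of a target receive a larger share of the next round's sample budget. The sketch below is an assumption-laden illustration, not the paper's procedure. The function name `plan_next_round`, the bucket labels, the target value, and the proportional rule are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def plan_next_round(
    probe_results: Iterable[Tuple[str, float]],
    budget: int,
    target: float = 0.8,  # hypothetical per-bucket quality target
) -> Dict[str, int]:
    """Split `budget` new samples across buckets that score below `target`.

    `probe_results` holds (bucket, score) pairs, where a bucket could be an
    angle range or an object category. Allocation is proportional to each
    bucket's shortfall and rounded down, so a few samples may stay unspent.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for bucket, score in probe_results:
        totals[bucket] += score
        counts[bucket] += 1

    # Shortfall of each bucket's mean score relative to the target.
    deficits = {b: max(0.0, target - totals[b] / counts[b]) for b in totals}
    total_deficit = sum(deficits.values())
    if total_deficit == 0:
        return {}  # every bucket meets the target; nothing to expand
    return {
        b: int(budget * d / total_deficit)
        for b, d in deficits.items()
        if d > 0
    }
```

Buckets that already meet the target are left out of the plan entirely, which keeps expansion focused on weak angles or categories rather than growing the dataset uniformly.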

Every sample is formulated as an artifact graph, comprising scene context, asset identity, action program, rendered outputs, geometry artifacts, temporal trajectories, review channels (CV, VLM, programmatic checks), verdicts, and export records. This affords traceability—each artifact is tethered to explicit targets, review outcomes, and bounded corrective steps.
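A per-sample artifact graph of this kind could be represented as a small record type. The sketch below is a simplified illustration: the field names, the `Review` type, and the accept-only-if-all-checks-pass rule are assumptions, and geometry artifacts, temporal trajectories, and export records are folded into the `rendered` mapping for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Review:
    channel: str   # "cv", "vlm", or "programmatic"
    check: str     # e.g. "mask_integrity" (hypothetical check name)
    passed: bool
    note: str = ""


@dataclass
class ArtifactGraph:
    sample_id: str
    scene_context: Dict[str, str]
    asset_id: str
    action_program: List[Dict[str, object]]
    rendered: Dict[str, str] = field(default_factory=dict)  # artifact name -> path
    reviews: List[Review] = field(default_factory=list)
    corrections: List[str] = field(default_factory=list)    # applied bounded actions
    verdict: Optional[str] = None

    def failed_checks(self) -> List[str]:
        """Names of checks that did not pass, in review order."""
        return [r.check for r in self.reviews if not r.passed]

    def decide(self) -> str:
        """Accept only if at least one review exists and none failed."""
        ok = bool(self.reviews) and not self.failed_checks()
        self.verdict = "accept" if ok else "reject"
        return self.verdict
```

Keeping reviews and corrections on the same record as the rendered outputs is what makes a verdict traceable: the reason a sample was accepted or rejected can be read back from the sample itself.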

Engine Implementation and Pipeline Components

The instantiation of DataEvolver begins with asset preparation through concept expansion and foreground segmentation. 3D asset quality and geometry richness are maintained across a pipeline informed by advances in SAM 3 segmentation (Carion et al., 20 Nov 2025), Hunyuan3D (Hunyuan3D et al., 18 Jun 2025; Lai et al., 19 Jun 2025), and simulation-ready mesh generation. Scene setup ensures physical plausibility, stable camera semantics, and normalization, avoiding confounds in downstream transformation contracts.

Action programs encode explicit transformation primitives (rotation, translation, scaling, camera motion, composition), sampled as source-target pairs or dense trajectories depending on export mode. Rendering yields complete artifact bundles—not just RGB, but per-frame mask, depth, normals, pose, and transformation metadata. The export schema supports image pairs, multi-view sets, video sequences, geometry packages, trajectory datasets, preference records, and diagnostic logs.
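The two sampling modes for a rotation primitive, source-target pairs and dense trajectories, can be sketched as follows. The function names, the use of azimuth in degrees, and the evenly spaced trajectory are illustrative assumptions rather than details taken from the paper.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def rotation_pairs(azimuths: Sequence[int]) -> List[Tuple[int, int, int]]:
    """All ordered (source, target, delta) triples over the given azimuths.

    `delta` is the rotation from source to target, wrapped into [0, 360).
    """
    return [(s, t, (t - s) % 360) for s, t in permutations(azimuths, 2)]


def dense_trajectory(start: float, end: float, steps: int) -> List[float]:
    """Evenly spaced azimuths from `start` to `end`, both endpoints included."""
    if steps < 2:
        raise ValueError("a trajectory needs at least two steps")
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]
```

Pair mode suits image-editing supervision, where each record is one source view, one target view, and the transformation between them. Trajectory mode suits video or multi-view export, where every intermediate pose is rendered.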

Minimal Case Study: Scene-Aware Object Rotation

The study validates closed-loop data construction on a controlled image-level task: scene-aware object rotation with a fixed scene and camera. Because the target transformation is known, failures of geometry, grounding, identity, and viewpoint consistency can be checked directly, which gives a fair basis for the four-stage ablation and the external benchmark comparison.

The engine is evaluated using a downstream Qwen-Image-Edit-2511 LoRA probe (Wu et al., 4 Aug 2025), contrasting Ours +DualGate (final DataEvolver configuration) against the unadapted base model and a public multi-angle LoRA baseline on SpatialEdit-Bench (Xiao et al., 6 Apr 2026) and held-out Eval1 test sets.

Figure 3: Ours +DualGate outperforms baselines on SpatialEdit-Bench across PSNR, SSIM, LPIPS, CLIP-I, and DINO metrics.

Figure 4: Superior performance of Ours +DualGate on the Eval1 Test Set, with consistent improvement across all metrics.

Qualitative comparisons further corroborate robust viewpoint accuracy and object consistency, with ablation variants showing visible artifacts and degraded structure.

Figure 5: Ours +DualGate achieves stronger target azimuth tracking and preserves object appearance across in-domain rotation views.

Figure 6: Improved viewpoint fidelity and object structure on out-of-domain SpatialEdit-Bench rotate examples with Ours +DualGate.

Figure 7: Enhanced appearance preservation and target rotation match in a different object category, highlighting artifact reduction.

Ablation Chain Analysis and Quality Gates

A four-stage ablation demonstrates monotonic improvement: Ours-Base (scene-aware baseline), Ours +Feedback (outer feedback loop control), Ours +InnerGate (internal quality gating), and Ours +DualGate (integrated VLM post gating). Each stage adds explicit workflow constraints that propagate into metric and qualitative gains.
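Reading the four stages as cumulative, the effect of the two gates on which samples reach the training set can be sketched as a simple filter. The flag names, the per-sample `inner_ok` and `vlm_ok` fields, and the cumulative interpretation are assumptions made for illustration. The paper's actual gate criteria are not described in this summary.

```python
from typing import Dict, List

# Hypothetical flags per ablation stage, assuming each stage adds one mechanism.
STAGES: Dict[str, Dict[str, bool]] = {
    "Ours-Base":       {"feedback": False, "inner_gate": False, "post_gate": False},
    "Ours +Feedback":  {"feedback": True,  "inner_gate": False, "post_gate": False},
    "Ours +InnerGate": {"feedback": True,  "inner_gate": True,  "post_gate": False},
    "Ours +DualGate":  {"feedback": True,  "inner_gate": True,  "post_gate": True},
}


def accepted_ids(samples: List[Dict[str, object]], stage: str) -> List[str]:
    """Return ids of samples that pass every gate enabled for `stage`.

    The "feedback" flag is listed for completeness only: in this reading it
    changes which samples get generated, not how finished samples are filtered.
    """
    cfg = STAGES[stage]
    kept: List[str] = []
    for s in samples:
        if cfg["inner_gate"] and not s["inner_ok"]:
            continue  # failed the generation-time quality gate
        if cfg["post_gate"] and not s["vlm_ok"]:
            continue  # failed the VLM post gate
        kept.append(str(s["id"]))
    return kept
```

Under this reading, each added gate can only shrink the accepted set, so any metric gain from gating comes from training on fewer but cleaner samples.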

Traditional image metrics (PSNR, SSIM, LPIPS) and semantic/identity alignment scores (CLIP-I, DINO, VIE) show consistent improvement across the ablation stages. The inner-loop self-correction and outer-loop feedback enhance editability and perceptual quality, so that the accepted training samples conform more closely to their explicit transformation contracts.
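Of the metrics listed, PSNR is the simplest to state: it is `10 * log10(MAX^2 / MSE)`, where `MSE` is the mean squared pixel error between the reference and the edited image and `MAX` is the largest possible pixel value. A minimal sketch over flattened pixel sequences follows. The function name and the 8-bit default are assumptions, and this is not the paper's evaluation code.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    edited: Sequence[float],
    max_value: float = 255.0,  # assumes 8-bit pixel values
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) == 0 or len(reference) != len(edited):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the edited image is closer to the reference pixel by pixel. It says nothing about perceptual or semantic agreement, which is why it is reported alongside SSIM, LPIPS, CLIP-I, and DINO.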

Figure 8: Full closed-loop gating in Ours +DualGate yields more coherent side/back transitions and stable object structure.

Implications, Extension Roadmap, and Limitations

The abstraction is extensible: the items on the extension roadmap (static geometry image export, rotation/translation video, compositional edits, multi-object relational data) are supported by the existing artifact schema and loop logic without architectural overhaul. Because quality gates operate inside the engine, samples that fail review or contain visible artifacts are kept out of downstream training. Quality assurance is treated as a first-class acceptance criterion, not a diagnostic afterthought.

Practically, the DataEvolver paradigm aligns with emerging trends in geometry-rich supervision, structured editing logic, and self-correcting scene-aware construction. Theoretically, it formalizes persistent state, review signals, and explicit verdicts, enabling traceable, correctable, and generalizable visual supervision that can underpin scalable multi-modal foundation models.

Video-level validation, compositional and relational editing, and real-data integration remain unbenchmarked, marking limits to current scope. Temporal quality control, action-label consistency, and upstream asset quality are bottlenecks for future development.

Conclusion

DataEvolver establishes a reusable, inspectable data-construction engine in which persistent artifact graphs, bounded controller actions, explicit review channels, and acceptance verdicts drive closed-loop, goal-centric supervision. Even in minimal image-level object rotation, integrated feedback and gating mechanisms deliver consistent metric gains and stronger editability relative to the unadapted base model and a public multi-angle LoRA. The abstraction is designed to extend to images, videos, and geometry, although only the image-level setting has been validated so far, paving the way for scalable, traceable, multi-artifact visual data construction under formal goal-driven loop agents (2605.01789).
