CoF-Evol-Instruct Dataset
- The paper introduces CoF-Evol-Instruct, a 64K-sample dataset that uses three-frame chains to model visual refinement from defective draft to high-fidelity output.
- It employs a fixed three-frame design with F1 as a semantically flawed draft, F2 for corrected semantics, and F3 as the polished final image.
- The dataset enables CoF-T2I to achieve significant gains on benchmarks by using independent frame encoding for clear, progressive visual reasoning.
CoF-Evol-Instruct is a curated 64K-sample dataset of three-frame Chain-of-Frame (CoF) trajectories for text-to-image generation, introduced as the supervision substrate for CoF-T2I. Each sample pairs a prompt with an ordered sequence $(F_1, F_2, F_3)$ that models progressive visual refinement from semantic correction to aesthetic enhancement: $F_1$ is a defective draft, $F_2$ is semantically grounded but visually unrefined, and $F_3$ is the high-fidelity final target (Tong et al., 15 Jan 2026). Unlike ordinary text-to-image corpora, CoF-Evol-Instruct is trajectory supervision rather than endpoint supervision; intermediate frames are intended to function as explicit visual reasoning steps.
1. Definition, scope, and naming
CoF-Evol-Instruct is defined within the CoF-T2I framework as a dataset for training a video generation model to act as a pure visual reasoner for text-to-image generation. Its central object is not a rewritten natural-language instruction in the standard instruction-tuning sense, but an ordered visual chain whose temporal axis is repurposed as a refinement process rather than physical motion (Tong et al., 15 Jan 2026).
The dataset name inherits the Evol-Instruct lineage, but its operational role differs materially from classical language-model instruction evolution. In the original WizardLM formulation, Evol-Instruct starts from an instruction-response dataset and iteratively produces evolved datasets through prompt-based rewriting, with in-depth and in-breadth operators designed to increase difficulty and diversity (Xu et al., 2023). CoF-Evol-Instruct instead uses a prompt-conditioned, model-curated pipeline to build visual trajectories whose states encode draft, semantic repair, and aesthetic polish (Tong et al., 15 Jan 2026).
A common source of confusion is that the acronym “CoF” has no standing in the earlier code-oriented Evol-Instruct literature. In particular, the term does not appear anywhere in the WizardCoder paper, even though that work contains iterative instruction rewriting, added reasoning steps, debugging-based misdirection, and a benchmark-based outer feedback loop (Luo et al., 2023). Accordingly, CoF-Evol-Instruct should be treated as a term introduced in the CoF-T2I setting rather than as a standard synonym for code-domain Evol-Instruct.
2. Trajectory semantics and supervision format
The dataset uses a fixed three-frame design because the authors identify two essential refinement stages—semantic correction and aesthetic improvement—and choose the shortest sequence that preserves a consistent causal progression (Tong et al., 15 Jan 2026).
| Frame | Role | Interpretation |
|---|---|---|
| $F_1$ | Defective draft | Semantically misaligned starting point |
| $F_2$ | Intermediate | Semantically correct but visually unrefined |
| $F_3$ | Final target | High-fidelity, aesthetically refined output |
This structure is explicitly described as a progression “from semantics to aesthetics.” In practical terms, $F_1$ may retain semantic violations such as incorrect object placement or attribute binding, $F_2$ is expected to correct those semantic defects while remaining weak in texture, lighting, or realism, and $F_3$ serves as the polished target image (Tong et al., 15 Jan 2026).
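Formally, the model is trained to generate a latent trajectory
$$
(z_1, z_2, z_3) \sim p_\theta(\,\cdot \mid c),
$$
where $c$ is the text prompt and $z_k$ denotes the latent of frame $F_k$. The user-visible output is only the decoded terminal state,
$$
\hat{I} = \mathcal{D}(z_3),
$$
with $\mathcal{D}$ the latent decoder.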
This formulation makes the intermediate frames internal visual reasoning states rather than auxiliary outputs for separate supervision (Tong et al., 15 Jan 2026).
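The supervision format can be illustrated with a minimal sketch. The record layout, the field names, and the `visible_output` helper below are hypothetical and chosen only for illustration; the source does not specify a storage schema or an API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Latent = Tuple[float, ...]  # placeholder latent type, for illustration only


@dataclass(frozen=True)
class CoFSample:
    """One hypothetical training record: a prompt plus its ordered three-frame chain."""
    prompt: str
    frames: Tuple[str, str, str]  # identifiers for F1 (draft), F2 (semantic fix), F3 (final)


def visible_output(trajectory: Sequence[Latent], decode: Callable[[Latent], str]) -> str:
    """Decode only the terminal latent; the earlier states remain internal."""
    if len(trajectory) != 3:
        raise ValueError("expected a three-state trajectory")
    return decode(trajectory[-1])


sample = CoFSample("a red cube left of a blue sphere", ("f1.png", "f2.png", "f3.png"))
toy_trajectory = [(0.1,), (0.5,), (0.9,)]  # stand-ins for the three latent states
result = visible_output(toy_trajectory, decode=lambda z: f"image<{z[0]:.1f}>")
```

In this sketch all three frames are present in the training record, but only the last latent state is ever decoded for the user.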
The dataset is therefore neither a conventional instruction-following corpus nor a generic image-editing set. It is defect-aware trajectory supervision. The instruction-like component enters during data construction through edit planning, but the learned supervision is ultimately over the ordered frame chain itself. This suggests that CoF-Evol-Instruct is best understood as a dataset for sequence-level latent supervision over refinement processes, not as a direct descendant of text-only instruction tuning.
3. Construction pipeline
CoF-Evol-Instruct is built by a quality-aware generation pipeline with six main stages: prompt collection and deduplication, anchor generation, stage classification, route-specific trajectory completion, filtering, and retention of final sequences (Tong et al., 15 Jan 2026).
The prompt pool comes from three sources: 24K prompts from Flow-GRPO, 37K prompts from Echo-4o, and 12K self-generated prompts adapted from GenEval. After prompt-level deduplication, the pipeline keeps 68K unique prompts (Tong et al., 15 Jan 2026).
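The pooling and deduplication step can be sketched as follows. The source does not say how duplicates are detected, so the case- and whitespace-insensitive key used here is purely an assumption, and `deduplicate_prompts` is a hypothetical helper. The rounded counts imply that roughly 5K of the 73K pooled prompts were removed.

```python
def deduplicate_prompts(*sources):
    """Merge prompt lists, keeping the first occurrence of each normalized prompt."""
    seen = set()
    unique = []
    for source in sources:
        for prompt in source:
            # Assumed normalization: lowercase and collapse runs of whitespace.
            key = " ".join(prompt.lower().split())
            if key not in seen:
                seen.add(key)
                unique.append(prompt)
    return unique


# Rounded counts reported in the text above.
pooled = 24_000 + 37_000 + 12_000   # Flow-GRPO + Echo-4o + GenEval-adapted prompts
removed = pooled - 68_000           # prompts dropped by deduplication (approximate)

toy_unique = deduplicate_prompts(["A cat", "a  cat"], ["a dog"])
```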
For each prompt, one anchor image is produced by sampling from three text-to-image model tiers, each with its own sampling probability: Wan2.1 as the weak tier, Qwen-Image as the medium tier, and Nano-Banana as the strong tier (Tong et al., 15 Jan 2026). These anchors are then classified into one of three quality stages: Semantically Misaligned (the $F_1$ stage), Visually Unrefined (the $F_2$ stage), or High Fidelity (the $F_3$ stage). The paper is inconsistent here: the main text says the quality assessor is Qwen3-VL-8B, whereas Appendix B states Qwen3-VL-7B (Tong et al., 15 Jan 2026).
Before editing, each prompt is assigned one of five semantic categories: Attribute Binding, Object Combination, Quantity Control, Spatial Arrangement, and Context Manipulation. This category is used to constrain edit intent (Tong et al., 15 Jan 2026).
The core construction mechanism is the Unified Editing Primitive (UEP), a closed-loop editing system with three agents: a Planner (Qwen3-VL-32B), an Editor (Qwen-Image-Edit-2509), and a Verifier (Qwen3-VL-32B). Given the current frame, target stage, prompt, transition direction, category label, and optional previous frame, the planner emits a concise edit instruction; the editor applies it; and the verifier returns a binary success signal. When verification fails, the pipeline retries up to a fixed maximum number of attempts; if the retries still fail, Appendix B states that the system falls back to direct regeneration with Qwen-Image (Tong et al., 15 Jan 2026).
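The plan, edit, verify loop with retry and fallback can be sketched as below. This is a schematic under stated assumptions, not the authors' implementation: the function name, the callable signatures, and the `max_attempts` default are hypothetical, and the four callables stand in for the planner, editor, verifier, and fallback generator.

```python
from typing import Callable, Optional, Tuple


def unified_editing_primitive(
    frame: str,
    prompt: str,
    target_stage: str,
    direction: str,
    category: str,
    plan: Callable[..., str],
    edit: Callable[[str, str], str],
    verify: Callable[[str, str, str], bool],
    regenerate: Callable[[str], str],
    max_attempts: int = 3,  # hypothetical budget; the actual limit is not given here
    previous_frame: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (frame, provenance), where provenance is 'edited' or 'regenerated'."""
    for _ in range(max_attempts):
        # Planner: turn the context into a concise edit instruction.
        instruction = plan(frame, prompt, target_stage, direction, category, previous_frame)
        # Editor: apply the instruction to the current frame.
        candidate = edit(frame, instruction)
        # Verifier: binary success signal for the target stage.
        if verify(candidate, prompt, target_stage):
            return candidate, "edited"
    # All attempts failed: fall back to direct regeneration from the prompt.
    return regenerate(prompt), "regenerated"


attempts = []


def toy_verify(candidate: str, prompt: str, target_stage: str) -> bool:
    attempts.append(candidate)
    return len(attempts) >= 2  # toy verifier: accept on the second attempt


result = unified_editing_primitive(
    frame="draft",
    prompt="two red apples",
    target_stage="F2",
    direction="forward",
    category="Quantity Control",
    plan=lambda *args: "add one apple",
    edit=lambda frame, instruction: f"{frame}+{instruction}",
    verify=toy_verify,
    regenerate=lambda prompt: "fresh",
)
```

Returning the provenance alongside the frame is a design choice of this sketch; it would make it possible to audit how many frames in a chain came from the fallback path.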
Trajectory completion depends on the stage of the anchor:
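- Forward Refinement (anchor at $F_1$): $F_1 \rightarrow F_2 \rightarrow F_3$
- Bidirectional Completion (anchor at $F_2$): $F_1 \leftarrow F_2 \rightarrow F_3$
- Backward Synthesis (anchor at $F_3$): $F_3 \rightarrow F_2 \rightarrow F_1$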
This routing allows the pipeline to exploit anchors drawn from different quality tiers while still reconstructing a coherent three-stage chain (Tong et al., 15 Jan 2026).
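The routing logic can be sketched as a small dispatch function. The sketch assumes, based on the route names, that an anchor classified at the first, second, or third stage triggers forward, bidirectional, or backward completion respectively; the stage labels `"F1"`, `"F2"`, `"F3"` and the `step` callable (a stand-in for one verified edit) are hypothetical.

```python
from typing import Callable, Tuple


def complete_trajectory(
    anchor: str,
    stage: str,
    step: Callable[[str, str, str], str],
) -> Tuple[str, str, str]:
    """Build an ordered (F1, F2, F3) chain around an anchor of known stage.

    step(frame, source_stage, target_stage) stands in for one verified edit.
    """
    if stage == "F1":  # Forward Refinement: improve the draft twice
        f2 = step(anchor, "F1", "F2")
        f3 = step(f2, "F2", "F3")
        return anchor, f2, f3
    if stage == "F2":  # Bidirectional Completion: degrade and polish the anchor
        f1 = step(anchor, "F2", "F1")
        f3 = step(anchor, "F2", "F3")
        return f1, anchor, f3
    if stage == "F3":  # Backward Synthesis: degrade the final image twice
        f2 = step(anchor, "F3", "F2")
        f1 = step(f2, "F2", "F1")
        return f1, f2, anchor
    raise ValueError(f"unknown stage: {stage!r}")


def toy_step(frame: str, source: str, target: str) -> str:
    return f"{frame}>{target}"


forward = complete_trajectory("a", "F1", toy_step)
bidirectional = complete_trajectory("a", "F2", toy_step)
backward = complete_trajectory("a", "F3", toy_step)
```

Whatever the route, the returned tuple is always ordered from draft to final image, which is what lets anchors of different quality feed the same training format.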
Appendix B adds a resolution policy aligned with the semantics-to-aesthetics decomposition: one resolution setting is used for the $F_1 \leftrightarrow F_2$ semantic-stage transitions, while a separate setting is preserved for the $F_2 \leftrightarrow F_3$ aesthetic-stage transitions (Tong et al., 15 Jan 2026).
After filtering failed or incomplete chains, the final dataset contains 64K high-quality CoF sequences (Tong et al., 15 Jan 2026).
4. Role in CoF-T2I and empirical effects
CoF-Evol-Instruct is the training supervision that enables CoF-T2I to recast image generation as latent sequence generation over reasoning trajectories. The model is initialized from Wan2.1-T2V-14B and fine-tuned on the 64K dataset for 1,800 steps with batch size 64 on square images, with the learning rate and weight decay set as reported in the paper (Tong et al., 15 Jan 2026).
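The training objective remains the standard rectified-flow formulation inherited from the video backbone. With the clean latent trajectory $x_1$, Gaussian noise $x_0 \sim \mathcal{N}(0, I)$, a timestep $t \in [0, 1]$, and interpolation
$$
x_t = t\,x_1 + (1 - t)\,x_0,
$$
the model minimizes
$$
\mathcal{L} = \mathbb{E}_{x_0,\, x_1,\, c,\, t}\left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2,
$$
where $c$ is the text prompt and $v_\theta$ is the predicted velocity.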
The distinctive contribution of CoF-Evol-Instruct is therefore not a novel loss, but a dataset that turns the video model’s temporal dimension into explicit refinement supervision (Tong et al., 15 Jan 2026).
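A single-sample version of a rectified-flow loss can be sketched in plain Python. The sketch assumes the common convention in which the noisy state is a linear blend of the clean latent and Gaussian noise, and the regression target is their difference; the function name, the flat-list latent, and the `velocity_model` callable are hypothetical stand-ins for the real network and tensors.

```python
import random
from typing import Callable, List


def rectified_flow_loss(
    x1: List[float],
    velocity_model: Callable[[List[float], float, str], List[float]],
    prompt: str,
    rng: random.Random,
) -> float:
    """Mean squared error between predicted and target velocity for one sample."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]               # Gaussian noise
    t = rng.random()                                      # timestep in [0, 1)
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]  # linear interpolation
    target = [a - b for a, b in zip(x1, x0)]              # straight-line velocity
    pred = velocity_model(xt, t, prompt)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


clean = [0.2, -0.4, 1.0]  # toy stand-in for a flattened three-frame latent


def zero_model(xt, t, prompt):
    return [0.0 for _ in xt]


def oracle_model(xt, t, prompt):
    # Recovers the exact target by using the known clean latent (toy check only).
    return [(a - x) / (1.0 - t) for a, x in zip(clean, xt)]


baseline_loss = rectified_flow_loss(clean, zero_model, "a prompt", random.Random(0))
oracle_loss = rectified_flow_loss(clean, oracle_model, "a prompt", random.Random(0))
```

The oracle is only a sanity check that the interpolation and the target are mutually consistent; nothing in the loss itself refers to frame order, which comes entirely from how the clean trajectory is assembled.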
A key implementation choice is independent frame encoding. The native Wan2.1 VAE uses causal spatiotemporal compression, which is appropriate for motion but undesirable for logical refinement states. CoF-T2I therefore encodes each frame independently so that adjacent reasoning states do not become entangled through temporal compression. The ablation reported in Table 4 shows 0.83 on GenEval for “CoF w/o Independent VAE” versus 0.86 for CoF-T2I (Tong et al., 15 Jan 2026).
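The difference between the two encoding strategies can be illustrated with a toy example. The causal encoder below is a deliberately crude stand-in that mixes each latent with the previous one; it is not the Wan2.1 VAE's actual mechanism, and all names are hypothetical. It only shows why independent encoding keeps one reasoning state from leaking into the next.

```python
from typing import Callable, List

Frame = List[float]
Latent = List[float]


def encode_independently(frames: List[Frame], encode_image: Callable[[Frame], Latent]) -> List[Latent]:
    """Encode every frame on its own, so no latent depends on another frame."""
    return [encode_image(frame) for frame in frames]


def toy_causal_encode(
    frames: List[Frame],
    encode_image: Callable[[Frame], Latent],
    mix: float = 0.5,
) -> List[Latent]:
    """Crude stand-in for temporal compression: each latent blends in its predecessor."""
    latents: List[Latent] = []
    carry = None
    for frame in frames:
        z = encode_image(frame)
        if carry is not None:
            z = [(1.0 - mix) * a + mix * b for a, b in zip(z, carry)]
        latents.append(z)
        carry = z
    return latents


def toy_encoder(frame: Frame) -> Latent:
    return [float(sum(frame))]


chain_a = [[1.0], [2.0], [3.0]]
chain_b = [[9.0], [2.0], [3.0]]  # same F2 and F3, different F1

independent_a = encode_independently(chain_a, toy_encoder)
independent_b = encode_independently(chain_b, toy_encoder)
causal_a = toy_causal_encode(chain_a, toy_encoder)
causal_b = toy_causal_encode(chain_b, toy_encoder)
```

Under independent encoding the final-frame latent is identical for both chains, whereas under the toy causal scheme it changes with the draft frame.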
The reported empirical gains are substantial. On GenEval, the base Wan2.1-T2V-14B scores 0.55, while CoF-T2I reaches 0.86. On Imagine-Bench, the same comparison is 5.939 to 7.468 (Tong et al., 15 Jan 2026). The paper also isolates the effect of full trajectory supervision: Target-only SFT, which keeps only $F_3$, reaches 0.81 on GenEval and 6.755 on Imagine-Bench, whereas full CoF trajectory training reaches 0.86 and 7.468 respectively (Tong et al., 15 Jan 2026).
Frame-wise evaluation further supports the intended semantics of the dataset. On GenEval, the generated chain improves monotonically from 0.56 at Frame 1 to 0.79 at Frame 2 and 0.86 at Frame 3. Appendix results on Imagine-Bench show the same pattern: 6.015, 7.187, and 7.468 (Tong et al., 15 Jan 2026). This is direct evidence that the learned sequence behaves like a refinement trajectory rather than a collection of arbitrary variants.
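The frame-wise numbers can be checked with a few lines of arithmetic. The helper names below are hypothetical; the scores are the ones quoted above.

```python
from typing import List


def stepwise_gains(scores: List[float]) -> List[float]:
    """Score difference between each frame and the one before it."""
    return [round(later - earlier, 3) for earlier, later in zip(scores, scores[1:])]


def is_strictly_increasing(scores: List[float]) -> bool:
    return all(later > earlier for earlier, later in zip(scores, scores[1:]))


geneval = [0.56, 0.79, 0.86]          # Frame 1, Frame 2, Frame 3
imagine_bench = [6.015, 7.187, 7.468]

geneval_gains = stepwise_gains(geneval)              # [0.23, 0.07]
imagine_bench_gains = stepwise_gains(imagine_bench)  # [1.172, 0.281]
monotonic = is_strictly_increasing(geneval) and is_strictly_increasing(imagine_bench)
```

On both benchmarks the first transition accounts for most of the gain, which is in line with Frame 2 being the stage at which semantic defects are corrected.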
5. Position within the broader Evol-Instruct literature
CoF-Evol-Instruct inherits the intuition that evolution can synthesize supervision unavailable from raw seed data alone, but it relocates that intuition from natural-language prompt rewriting to ordered visual-state construction. In that sense it sits downstream of a broader research program rather than duplicating any one predecessor.
WizardLM established the original Evol-Instruct pattern of iterative instruction rewriting, using in-depth and in-breadth operators to create a difficulty-balanced mixture of instructions from a seed corpus (Xu et al., 2023). WizardCoder adapted that scheme to code by narrowing the operator set, adding code debugging and code time-space complexity constraints, and selecting the best evolution depth through an outer HumanEval pass@1 loop (Luo et al., 2023). WizardMath then combined math-specific upward and downward evolution with an Instruction Reward Model, a Process-supervised Reward Model, and PPO, under the name RLEIF (Luo et al., 2023).
Subsequent work diversified what “evolution” could mean. Magicoder’s OSS-Instruct changed the source of variation by grounding synthesis in open-source code snippets rather than only evolving prior instructions (Wei et al., 2023). Instruction Fusion replaced single-parent mutation with GPT-4-Turbo-based recombination of two seed prompts, explicitly targeting smoother difficulty growth and improved diversity (Guo et al., 2023). Tag-Evol moved from fixed heuristic strategies to knowledge tag injection under an explicit difficulty budget, thereby making evolution combinatorial and one-shot rather than strictly iterative (Wang et al., 30 May 2025). Loong and VeriEvol emphasized executable grounding and verification, the former through question-answer-code triples and a dual-solution consistency filter, the latter through type-aware evolution and the falsification-oriented HTV-Agent verifier (Huang et al., 3 Sep 2025, Li et al., 22 Jun 2026).
Against this background, CoF-Evol-Instruct occupies a distinct point in the design space. It does not evolve text instructions into harder text instructions; it constructs visual reasoning trajectories from prompts and anchor images. A plausible implication is that CoF-Evol-Instruct extends the Evol-Instruct idea from “rewrite the task” to “materialize the latent refinement path by which a model should solve the task.” That interpretation is not stated in those terms by the paper, but it is strongly suggested by the move from endpoint supervision to structured, ordered chains (Tong et al., 15 Jan 2026).
6. Limitations, ambiguities, and open directions
The CoF-T2I paper explicitly states that it has not systematically explored extension to text-to-video, text-to-3D, longer temporal reasoning chains, or reinforcement-learning-based refinement (Tong et al., 15 Jan 2026). This means that CoF-Evol-Instruct, as introduced, is tied to a fixed three-frame design and to the text-to-image setting.
Several dataset-level details remain underreported. The paper does not provide per-route statistics for Forward Refinement, Bidirectional Completion, and Backward Synthesis; it does not present a human evaluation of the dataset itself; and it does not compare alternative chain lengths beyond the conceptual argument for choosing three frames (Tong et al., 15 Jan 2026). These omissions matter because they limit direct analysis of route imbalance, annotation reliability, and sensitivity to temporal granularity.
There is also a documented inconsistency in the reported size of the quality assessor, with the main text naming Qwen3-VL-8B and Appendix B naming Qwen3-VL-7B (Tong et al., 15 Jan 2026). For strict reproduction, this ambiguity is material.
More broadly, CoF-Evol-Instruct should not be generalized beyond what the paper demonstrates. It is strong evidence that trajectory supervision can improve text-to-image generation when a video model is trained to interpret frames as reasoning states. It is not direct evidence that longer chains, alternative stage taxonomies, or non-visual Evol-Instruct variants will inherit the same gains. This suggests that the principal encyclopedic significance of CoF-Evol-Instruct is methodological: it provides a concrete dataset design showing how visual refinement trajectories can serve as the analogue of evolved supervision in a model class whose native inductive bias is temporal sequence modeling (Tong et al., 15 Jan 2026).