CraftBench: Dual-Domain Benchmark

Updated 4 July 2026

CraftBench is an ambiguous benchmark label in machine learning used in two distinct domains: scientific figure generation and procedural video understanding.
The scientific figure benchmark assesses generation quality across three figure styles and four input tasks, using a referenced VLM-as-judge protocol on samples vetted by multiple annotators.
The procedural video benchmark and its CraftBench-Fact extension verify factual accuracy with action-aligned captions and structured fact annotation protocols.

CraftBench is an ambiguous benchmark name in recent machine-learning literature. In the provided arXiv sources, it denotes at least two distinct evaluation resources: a benchmark for scientific figure generation across multiple figure types and input conditions, and a procedural-video benchmark for furniture crafting and assembly that was later extended into a fact-verification resource called CraftBench-Fact (Zhao et al., 28 May 2026, Oguz et al., 28 Apr 2026). This multiplicity matters because nearby work in crafting, construction, creativity, and Minecraft-inspired evaluation is often lexically similar but not equivalent, and the term therefore has no single canonical referent across all current arXiv usage.

1. Nomenclature and referential scope

The clearest documented uses of the name fall into two separate domains. One use appears in scientific-figure generation, where CraftBench is introduced as a benchmark spanning three figure types and four input conditions. The other appears in procedural video understanding, where CraftBench refers to crafting videos with temporally aligned action-level captions and later to a structured factuality extension, CraftBench-Fact (Zhao et al., 28 May 2026, Oguz et al., 28 Apr 2026).

This ambiguity is reinforced by several adjacent names that are explicitly not aliases. The paper introducing CreativeBench states that it is about CreativeBench, not “CraftBench,” and provides no evidence that “CraftBench” is an alternate name, abbreviation, prior name, or subset of the same project (Wang et al., 12 Mar 2026). CrafText is strongly related at the level of open-ended crafting environments and instruction following, but it is presented as CrafText, not CraftBench (Volovikova et al., 17 May 2025). Likewise, BuilderBench, MinePlanner, and the Minecraft builder-dialog benchmark are construction- or planning-oriented resources relevant by theme, not by identity (Ghugare et al., 7 Oct 2025, Hill et al., 2023, Madge et al., 2024).

Name in source	Domain	Relation to “CraftBench”
CraftBench	Scientific figure generation	Direct use of the name
CraftBench / CraftBench-Fact	Procedural crafting videos	Direct use of the name
CreativeBench	Code-generation creativity	Explicitly not shown to be CraftBench
CrafText	Craftax instruction following	Related but differently named
BuilderBench	Embodied block construction	Adjacent, not equivalent

A plausible implication is that “CraftBench” should be treated as a homonymous benchmark label whose meaning depends on the paper and domain rather than as a single benchmark family.

2. CraftBench in scientific figure generation

In the scientific-figure literature, CraftBench is introduced to evaluate figure generation beyond narrow text-to-image academic-diagram settings. Its defining scope is three figure types (Academic figures, Posters, and Infographics) crossed with four input conditions (Text-to-image, Mask-completion, Sketch-conditioned generation, and Key-element composition). The benchmark contains 279 samples in total. By input condition, these are 179 text-to-image samples, 30 mask-completion samples, 40 sketch-conditioned samples, and 30 key-element-composition samples; by figure type, they are 140 Academic figures, 109 Posters, and 30 Infographics (Zhao et al., 28 May 2026).

Component	Count
Total samples	279
Academic figures	140
Posters	109
Infographics	30
Text-to-image	179
Mask-completion	30
Sketch-conditioned generation	40
Key-element composition	30
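
The two breakdowns in the table partition the same 279 samples, which can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the table, while the dictionary names and the `shares` helper are assumptions introduced here, not part of the benchmark's released tooling.

```python
# Counts copied from the table above; all names are illustrative.
BY_FIGURE_TYPE = {"Academic figures": 140, "Posters": 109, "Infographics": 30}
BY_INPUT_CONDITION = {
    "Text-to-image": 179,
    "Mask-completion": 30,
    "Sketch-conditioned generation": 40,
    "Key-element composition": 30,
}
TOTAL_SAMPLES = 279


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of the total across `counts`."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Both breakdowns should sum to the same benchmark total.
    assert sum(BY_FIGURE_TYPE.values()) == TOTAL_SAMPLES
    assert sum(BY_INPUT_CONDITION.values()) == TOTAL_SAMPLES
    for name, fraction in shares(BY_FIGURE_TYPE).items():
        print(f"{name}: {fraction:.1%}")
```

Printing the shares makes the skew across figure types visible at a glance.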

The benchmark is curated from published papers across 18 research areas, award-tier conference posters, and research blogs. The appendix describes a seven-stage quality pipeline: caption keyword filtering, strict content classification, complexity rescoring, alignment verification, first-pass QA, evidence-required QA, and manual review. For reference-conditioned inputs, every sample is reviewed by three graduate-level annotators and accepted only on unanimous agreement. The paper does not report train/validation/test splits for CraftBench; it is presented as an evaluation benchmark of curated samples rather than a supervised training corpus (Zhao et al., 28 May 2026).

Evaluation uses a referenced VLM-as-judge protocol with Gemini 3.5 Flash. Candidate outputs and human targets are scored independently rather than in side-by-side comparison. For text-to-image, the scored aspects are content faithfulness, readability, and one style-specific format aspect; for reference-conditioned tasks, the aspects are content faithfulness, readability, and input fidelity. The benchmark-level score is a lenient win-rate obtained by mapping verdicts to Model → 100, Tie → 50, and Human → 0, then averaging. In the main reported results, Crafter (w/ Nano Banana Pro) achieves 52.30 overall, compared with 29.00 for PaperBanana (w/ Nano Banana Pro) and 22.40 for Nano Banana Pro alone; under a controlled same-backbone comparison with Nano Banana 2, Crafter scores 50.20 versus 28.00 for PaperBanana (Zhao et al., 28 May 2026).
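
The lenient win-rate described above reduces to a mapping followed by a mean. The following is a minimal sketch of that arithmetic, assuming one verdict string per judged item; the function name `lenient_win_rate`, the verdict spellings, and the example verdicts are hypothetical, and the source does not specify how verdicts are pooled across aspects or samples before averaging.

```python
from collections.abc import Iterable

# Verdict-to-score mapping as described in the text.
VERDICT_SCORES = {"Model": 100.0, "Tie": 50.0, "Human": 0.0}


def lenient_win_rate(verdicts: Iterable[str]) -> float:
    """Average the mapped verdict scores; a tie counts as half a win."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    if not scores:
        raise ValueError("at least one verdict is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical verdicts, not data from the paper.
    example = ["Model", "Tie", "Human", "Model"]
    print(lenient_win_rate(example))  # (100 + 50 + 0 + 100) / 4 = 62.5
```

Under this mapping, a score of 50 corresponds to parity with the human target, so values below 50 indicate that the human figure was preferred more often than the model output.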

Within this literature, CraftBench functions as evidence for cross-type and cross-condition generalization. It is explicitly designed to expose failures of pipelines that perform adequately on narrow academic-diagram text-to-image settings but degrade on sketches, masks, or key-element preservation.

3. CraftBench in procedural video understanding and CraftBench-Fact

A distinct use of the name appears in procedural video understanding. Here, CraftBench is a benchmark built around furniture and utility crafting videos, with materials such as wood and metal and tools such as saws, drills, clamps, and fasteners. Videos are selected so that the procedure is visually demonstrated by a narrator and the narration provides step-by-step instructions aligned with performed actions. The original CraftBench setup is described as containing action-level, temporally aligned captions for crafting videos (Oguz et al., 28 Apr 2026).

The later extension, CraftBench-Fact, adds a structured factual layer through three steps: Clause decomposition, Implicit argument augmentation (VIA), and Structured fact annotation. VIA expands omitted but visually grounded patient/object, tool, and location arguments. From the augmented captions, annotators derive two kinds of facts: contextual facts, which are grounded predicate-argument relations, and conceptual facts, which are abstract semantic role assignments. The paper gives the example

$\mathcal{F}_g^{ctx} = \{\text{stir(soup)}, \text{stir(with spoon)}, \text{stir(in pot)}\}$

and

$\mathcal{F}_g^{con} = \{\text{Action = stirring}, \text{Ingredient = soup}, \text{Tool = spoon}, \text{Location = pot}\}.$

It then uses these structures for dual-layer factuality verification (Oguz et al., 28 Apr 2026).
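
The two fact layers in the example above can be held in simple typed records. This is a sketch under stated assumptions: the class names `ContextualFact` and `ConceptualFact` and their field names are invented here for illustration and do not reflect the dataset's actual file format, which the source does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextualFact:
    """Grounded predicate-argument relation, e.g. stir(with spoon)."""
    predicate: str
    argument: str

    def __str__(self) -> str:
        return f"{self.predicate}({self.argument})"


@dataclass(frozen=True)
class ConceptualFact:
    """Abstract semantic role assignment, e.g. Tool = spoon."""
    role: str
    value: str

    def __str__(self) -> str:
        return f"{self.role} = {self.value}"


# The example from the text, encoded as two sets of facts.
contextual = {
    ContextualFact("stir", "soup"),
    ContextualFact("stir", "with spoon"),
    ContextualFact("stir", "in pot"),
}
conceptual = {
    ConceptualFact("Action", "stirring"),
    ConceptualFact("Ingredient", "soup"),
    ConceptualFact("Tool", "spoon"),
    ConceptualFact("Location", "pot"),
}

if __name__ == "__main__":
    print(sorted(str(f) for f in contextual))
    print(sorted(str(f) for f in conceptual))
```

Frozen dataclasses are hashable, so each layer can be stored as a set and compared against a model's extracted facts with ordinary set operations.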

The published dataset statistics for CraftBench-Fact are explicit. The Train split contains 120 videos, 1,735 clips, 2,132 VIA annotations, 4,468 conceptual facts, and 3,429 contextual facts. The Test split contains 100 videos, 1,468 clips, 1,888 VIA annotations, 4,197 conceptual facts, and 3,108 contextual facts. Overall, the benchmark comprises 220 videos and 3,203 clips (Oguz et al., 28 Apr 2026).

The evaluation framework distinguishes DualFact-T, which verifies facts against textual evidence, from DualFact-V, which verifies facts against video-grounded evidence. Fact-level support is aggregated into MultiFactScore:

$\text{MultiFactScore} = \frac{ |\{ f_i \in F : \hat{y}_i = \text{SUPPORTED} \}| }{ |F| }.$
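
The ratio above is straightforward to compute once each fact has a predicted label. The snippet below is a minimal sketch, assuming labels arrive as strings with one entry per fact in $F$; the function name `multi_fact_score` and the `"UNSUPPORTED"` label spelling are assumptions for illustration, and the verifier that produces the labels is out of scope.

```python
from collections.abc import Sequence

SUPPORTED = "SUPPORTED"


def multi_fact_score(predicted_labels: Sequence[str]) -> float:
    """Fraction of facts whose predicted label is SUPPORTED."""
    if not predicted_labels:
        raise ValueError("the fact set F must be non-empty")
    supported = sum(1 for label in predicted_labels if label == SUPPORTED)
    return supported / len(predicted_labels)


if __name__ == "__main__":
    # Hypothetical verifier outputs for four facts.
    labels = ["SUPPORTED", "UNSUPPORTED", "SUPPORTED", "SUPPORTED"]
    print(multi_fact_score(labels))  # 3 / 4 = 0.75
```

The same function applies to either verification mode, since DualFact-T and DualFact-V differ in the evidence used to assign labels rather than in the aggregation.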

A central finding is that standard caption metrics obscure omissions and role-level inconsistencies, while video-grounded verification shows that caption-only evaluation can overestimate hallucination by treating visually grounded but task-irrelevant mentions as unsupported. On CraftBench, VIA improves lexical metrics: BLEU rises from 1.17 to 1.66, ROUGE from 18.00 to 21.74, and SPICE from 7.47 to 11.39. Fact extraction itself is near-ceiling, which indicates that downstream factuality errors are not mainly due to extraction noise (Oguz et al., 28 Apr 2026).

4. Annotation and evaluation paradigms

The two CraftBench usages differ sharply in annotation object, supervision protocol, and evaluation target. The scientific-figure benchmark centers on multimodal generation under conditional inputs; the procedural-video benchmark centers on action-aligned captioning and role-aware factual verification. Their annotation pipelines therefore operationalize different notions of correctness (Zhao et al., 28 May 2026, Oguz et al., 28 Apr 2026).

Aspect	Scientific-figure CraftBench	Procedural CraftBench-Fact
Primary object	Generated figure image	Procedural caption facts
Inputs	Caption, paper context, and optionally mask/sketch/key elements	Video clip and aligned caption
Human review	Three graduate-level annotators; unanimous agreement for reference-conditioned samples	Four trained annotators with calibration and adjudication
Automatic evaluation	Referenced VLM-as-judge with lenient win-rate	Textual and video-grounded fact verification
Output structure	Image-level scores	Conceptual and contextual facts

The scientific-figure benchmark validates benchmark construction through manual review and validates automatic evaluation through a blind human study. The reported agreement between the automatic judge and majority human verdict is 72%, with Cohen’s $\kappa = 0.58$ (Zhao et al., 28 May 2026). The procedural-video extension reports annotation reliability directly at the structural-label level: Cohen’s $\kappa = 0.93$ for clause segmentation, 0.87 for VIA, 0.92 for conceptual facts, and 0.97 for contextual facts (Oguz et al., 28 Apr 2026).
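
Raw agreement and Cohen's $\kappa$ differ because $\kappa$ discounts the agreement expected by chance, $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below implements the standard two-rater definition; it is not taken from either paper, which report the statistic without describing their computation, and the function name and example labels are hypothetical.

```python
from collections import Counter
from collections.abc import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical judge and human verdicts, not data from the paper.
    judge = ["Model", "Model", "Tie", "Human", "Human", "Model"]
    human = ["Model", "Tie", "Tie", "Human", "Model", "Model"]
    print(round(cohens_kappa(judge, human), 3))
```

Because chance agreement depends on how skewed the label distribution is, the same raw agreement rate can correspond to quite different $\kappa$ values, which is why both figures are reported.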

This contrast is methodologically important. One CraftBench treats evaluation as referenced multimodal quality assessment; the other treats evaluation as fact support under textual and video evidence. The name is shared, but the operational semantics of “benchmark performance” are substantially different.

5. Related and adjacent benchmarks

Several benchmark families surround the CraftBench name but should be distinguished from it. CreativeBench is a benchmark for machine creativity in code generation with subsets CreativeBench-Combo and CreativeBench-Explore; the paper explicitly states that there is no textual evidence that CreativeBench is also called CraftBench (Wang et al., 12 Mar 2026). CrafText is a Craftax-based multimodal benchmark for instruction following in a dynamic open-ended world, with 3,924 instructions, 12 scenarios, and 496 goals, but it is presented as a related benchmark rather than a CraftBench alias (Volovikova et al., 17 May 2025).

In embodied construction, BuilderBench targets open-ended block building with a hardware-accelerated simulator, a task suite with over 42 diverse target structures, and an emphasis on self-supervised exploration. It is described as strongly relevant to the construction-oriented subset of a hypothetical CraftBench, but not as CraftBench itself (Ghugare et al., 7 Oct 2025). MinePlanner is a planning-oriented benchmark for long-horizon Minecraft tasks compiled into PDDL, with 45 tasks overall and support for automatically creating propositional and numeric instances; it is relevant as a symbolic planning reference, not as a direct CraftBench definition (Hill et al., 2023). The paper “A LLM Benchmark based on the Minecraft Builder Dialog Agent Task” proposes a synthetic Minecraft-inspired spatial benchmark organized around Absolute Addressing, Relative Addressing, and Primitive Shapes; it is explicitly characterized as related background rather than CraftBench (Madge et al., 2024).

A separate line of work around the Craft Assembly Task is also adjacent rather than identical. “Component Selection for Craft Assembly Tasks” formalizes assembly from imperfect components given a single RGB image and available primitive scene objects, using segmentation, template retrieval, pose optimization, primitive simplification, and correspondence search (Isume et al., 2024). “Prompt2Craft” later formulates an LLM-centered structured assembly planner for functional craft assemblies, but again does not define CraftBench itself (Isume et al., 4 Dec 2025). These papers are important because they show that “craft” in current literature spans multiple problem classes—scientific illustration, procedural video understanding, assembly from imperfect parts, and open-world instruction following—without a unified benchmark namespace.

6. Significance, limitations, and current status

The main significance of CraftBench lies in what the two benchmark lines reveal about benchmarking practice. In scientific figure generation, CraftBench broadens evaluation beyond text-only synthesis and introduces cross-condition testing with masks, sketches, and key-element composition (Zhao et al., 28 May 2026). In procedural video understanding, CraftBench-Fact shows that fluent captions can remain factually incomplete and that role-sensitive verification can expose omissions, hallucinations, salience errors, and role inconsistencies that lexical overlap metrics do not capture (Oguz et al., 28 Apr 2026).

Both lines also expose important limitations. The scientific-figure benchmark is relatively small at 279 samples, is imbalanced toward academic and poster figures, and relies on a closed-source judge, Gemini 3.5 Flash (Zhao et al., 28 May 2026). The procedural benchmark is currently narrow in domain—furniture crafting and assembly—and its factual extension explicitly notes that it does not model attribute-oriented facts such as size or spatial properties (Oguz et al., 28 Apr 2026). More fundamentally, the shared name obscures substantial differences in task definition, annotation schema, metric design, and intended use.

This suggests that “CraftBench” should always be disambiguated by citation and domain. In current arXiv usage, it is not a single benchmark with stable semantics, but a label attached to at least two distinct research artifacts: one concerned with multimodal scientific figure generation, and one concerned with procedural crafting video understanding and factual verification.