CrochetBench: Procedural Evaluation for Crochet

Updated 14 November 2025

CrochetBench is an evaluation benchmark that assesses procedural reasoning in multimodal large language models (MLLMs) using structured crochet patterns.
It integrates the CrochetPARADE DSL to enable structural validation and executable synthesis of crochet instructions from visual and textual inputs.
The benchmark evaluates tasks ranging from stitch recognition to full DSL translation, revealing a gap between surface-level perception and deep procedural performance.

CrochetBench is an evaluation benchmark designed to measure the procedural competence of multimodal LLMs (MLLMs) in the complex domain of crochet. Unlike traditional benchmarks, which primarily assess surface-level perception (such as vision-language alignment or captioning), CrochetBench specifically targets the ability to perform fine-grained, low-level procedural reasoning—from visual recognition of stitches to structured, executable crochet instruction synthesis. The benchmark is notable for its integration of a domain-specific language (DSL), CrochetPARADE, which enables robust structural validation and functional program execution, thus facilitating comprehensive assessment of both syntactic and semantic model capabilities (Li et al., 12 Nov 2025).

1. Motivation and Scope

Crochet tasks require intertwined reasoning over three modalities: symbolic representation (stitch abbreviations, counts), natural language (full-text instructions, materials, gauge), and visual inputs (finished product images). CrochetBench shifts the evaluation emphasis from a descriptive paradigm (describing what is seen) to a procedural one (doing what is required to physically realize the pattern). This paradigm is motivated by the intrinsic, program-like structure of crochet patterns, where local stitch choices recursively propagate to determine global mesh topology.

The benchmark comprises 6,085 real-world crochet patterns sourced from Yarnspirations, spanning 55 project categories (e.g., blankets, garments, hats) and stratified across four skill levels. Each item includes structured metadata (materials, gauge, abbreviation tables), full natural language (NL) instructions (ranging from 20,000 to 30,000 characters), and paired high-resolution product images (image coverage: 98.77%). The dataset is compatible with the CrochetPARADE DSL, which is used for downstream parsing, compilation, rendering, and evaluation.
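As a rough illustration of the item structure described above, a single benchmark record could be modeled as follows. This is a minimal sketch: the class name, field names, and the `image_coverage` helper are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrochetPattern:
    """One benchmark item. All field names are hypothetical."""
    title: str
    category: str       # one of the 55 project categories, e.g. "hats"
    skill_level: str     # one of the four skill levels
    materials: list[str] = field(default_factory=list)
    gauge: str = ""
    abbreviations: dict[str, str] = field(default_factory=dict)
    instructions: str = ""             # full NL instructions
    image_path: Optional[str] = None   # None when no product image is paired


def image_coverage(patterns: list[CrochetPattern]) -> float:
    """Fraction of patterns that have a paired product image."""
    if not patterns:
        return 0.0
    return sum(p.image_path is not None for p in patterns) / len(patterns)
```

Applied to the full dataset, a helper like `image_coverage` would be one way to reproduce a coverage statistic such as the 98.77% quoted above.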

2. CrochetPARADE DSL: Structure and Semantics

CrochetPARADE is a domain-specific language created for the formalization and execution of crochet instructions. A CrochetPARADE program is composed of sequence-labeled lines, utilizing unique control constructs for repetition, anchoring, and stitch operations. The core syntax features:

Lines: Prefaced with a paragraph marker “¶”, each representing one procedural step.
Commands: Elementary operations include ch (chain), sc (single crochet), dc (double crochet), tr (treble), sc2inc (single crochet increase), ss (slip stitch), among others. These correspond directly to U.S. standard crochet abbreviations.
Repeat Groups: Expressed as “[ … ]n”, these enforce exactly n repetitions of the enclosed sequence.
Anchoring and Label References: Addresses such as “@A” allow explicit reference to prior stitch locations, necessary for loops, joins, or working in the round.

A sample excerpt of the CrochetPARADE grammar in LaTeX:

$\begin{array}{rcl}
\langle Program\rangle &\to& \langle Line\rangle\;\bigl|\;\langle Program\rangle\;\langle Line\rangle \\
\langle Line\rangle &\to& \verb|¶|\;\langle CmdSeq\rangle \\
\langle CmdSeq\rangle &\to& \langle Cmd\rangle\;[@\langle Label\rangle]\;(\verb|,|\;\langle Cmd\rangle\;[@\langle Label\rangle])^* \\
\langle Cmd\rangle &\to& \textsf{ch}\;|\;\textsf{sc}\;|\;\textsf{dc}\;|\;\textsf{tr}\;|\;\textsf{sc2inc}\;|\;\dots\;|\;[\langle CmdSeq\rangle]\;\langle Number\rangle
\end{array}$

This structured representation permits compilation, parsing, and 2D/3D mesh rendering, as well as precise error detection regarding syntax, anchor resolution, and stitch consistency.
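To make the grammar excerpt concrete, the following is a minimal recursive-descent parser sketch for it. It covers only the productions shown above, with a small assumed stitch set; the real CrochetPARADE language is richer, and the function names (`tokenize`, `parse_line`, `parse_seq`, `parse_cmd`) and output structure are illustrative choices rather than part of the actual toolchain.

```python
import re

# Assumption: only the stitches named in the text above are recognized.
STITCHES = {"ch", "sc", "dc", "tr", "sc2inc", "ss"}
TOKEN = re.compile(r"\s*(¶|\[|\]|,|@[A-Za-z]\w*|\d+|[a-z][a-z0-9]*)")


def tokenize(text):
    """Split one DSL line into tokens, raising on unrecognized characters."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_line(text):
    """Line -> '¶' CmdSeq. Returns a list of (command, label) pairs."""
    tokens = tokenize(text)
    if not tokens or tokens[0] != "¶":
        raise SyntaxError("line must start with ¶")
    seq, pos = parse_seq(tokens, 1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return seq


def parse_seq(tokens, pos):
    """CmdSeq -> Cmd [@Label] (',' Cmd [@Label])*"""
    items = []
    while True:
        cmd, pos = parse_cmd(tokens, pos)
        label = None
        if pos < len(tokens) and tokens[pos].startswith("@"):
            label = tokens[pos][1:]
            pos += 1
        items.append((cmd, label))
        if pos < len(tokens) and tokens[pos] == ",":
            pos += 1
        else:
            return items, pos


def parse_cmd(tokens, pos):
    """Cmd -> stitch | '[' CmdSeq ']' Number"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of line")
    tok = tokens[pos]
    if tok == "[":
        body, pos = parse_seq(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != "]":
            raise SyntaxError("unbalanced '['")
        pos += 1
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError("repeat group needs a count")
        return ("repeat", body, int(tokens[pos])), pos + 1
    if tok in STITCHES:
        return tok, pos + 1
    raise SyntaxError(f"unknown command {tok!r}")
```

For example, `parse_line("¶ ch, [sc, sc2inc@A]3")` yields a chain followed by a repeat group of three, while an unclosed bracket such as `"¶ [sc, dc"` raises a `SyntaxError`, which is the kind of syntax-level error detection the text refers to.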

3. Task Suite and Evaluation Protocol

CrochetBench defines four primary evaluation tasks, incrementally increasing in procedural complexity and demanding both visual and symbolic synthesis:

Task	Input	Output	Test Size	Metric(s)
A: Stitch Classification	Single image	List of U.S. stitch abbreviations	6,009	F1 (per-example, averaged)
B: Instruction Grounding	Image + four NL instruction candidates	Single letter (A/B/C/D)	6,003	Accuracy
C: NL Instruction Generation	Image	Full NL crochet pattern	6,009	BLEU$_n$, ROUGE-L, ChrF
D: NL $\to$ DSL Translation	(a) step: contextual NL–DSL pairs + NL; (b) proj: full NL pattern + image + reference DSL	(a) one-line DSL code; (b) full DSL program	(a) 119; (b) 100	CSR$_\text{step}$, CSR$_\text{proj}$, PER (Partial Executable Rate)
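The Task A metric in the table (per-example F1, averaged) can be sketched as below. The sketch assumes that predictions and references are compared as unordered sets of stitch abbreviations; the benchmark's exact matching rules are not specified here, and the function names are illustrative.

```python
def stitch_f1(predicted, reference):
    """Set-based F1 between predicted and reference stitch abbreviations."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # assumed convention: both empty counts as a perfect match
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_f1(examples):
    """Average the per-example F1 over (predicted, reference) pairs."""
    scores = [stitch_f1(pred, ref) for pred, ref in examples]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, predicting `["sc", "dc", "ch"]` against a reference of `["sc", "ch", "tr", "ss"]` gives precision 2/3 and recall 1/2, hence an F1 of 4/7 ≈ 0.571.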

Evaluation proceeds via compilation and validation in the CrochetPARADE environment. Structural validity is established when the DSL is syntactically correct and maintains state consistency (e.g., defined anchors, valid stitch references, correctly balanced repeats). Functional correctness is assessed via mesh rendering, with the rendered output optionally compared against the target image using vision-language models such as CLIP. Error taxonomies include bracket balance, undefined tokens, anchor misuse, and row operation errors.
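The error taxonomy above can be approximated with a lightweight linter, sketched below. This is not the CrochetPARADE compiler: the stitch set is limited to the abbreviations named earlier, and the convention that a label is defined with a `.A` suffix and referenced with `@A` is an assumption of this sketch, as are the error category names.

```python
import re

# Assumption: only these stitch abbreviations are considered defined.
KNOWN_STITCHES = {"ch", "sc", "dc", "tr", "sc2inc", "ss"}
# A stitch name, optionally followed by ".Label" (assumed definition syntax)
# and/or "@Label" (reference to a previously defined label).
WORD = re.compile(r"([a-z][a-z0-9]*)(?:\.([A-Za-z]\w*))?(?:@([A-Za-z]\w*))?")


def check_program(lines):
    """Return a list of (line_number, error_category) pairs."""
    errors = []
    defined = set()
    for number, line in enumerate(lines, start=1):
        # Bracket balance: depth must never go negative and must end at zero.
        depth = 0
        for char in line:
            if char == "[":
                depth += 1
            elif char == "]":
                depth -= 1
                if depth < 0:
                    break
        if depth != 0:
            errors.append((number, "bracket_balance"))
        # Undefined tokens and anchors, scanned left to right.
        for name, new_label, ref in WORD.findall(line):
            if name not in KNOWN_STITCHES:
                errors.append((number, "undefined_token"))
            if ref and ref not in defined:
                errors.append((number, "undefined_anchor"))
            if new_label:
                defined.add(new_label)
    return errors
```

A program would count as structurally valid under this sketch when `check_program` returns an empty list; a real compiler additionally checks stitch consistency across rows.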

4. Baseline Model Performance

CrochetBench reports substantial variance between surface-level and deep procedural metrics across both open-source and closed-source VLMs/MLLMs. Key results:

Task A (Stitch Recognition, F1): BLIP-2 Flan-T5 XL (0.2250), Qwen2-VL (0.5816), DeepSeek-VL (0.6060), Claude Sonnet 4 (0.6094).
Task B (Grounding, Accuracy): BLIP-2 Flan-T5 XL (0.2562), Qwen2-VL (0.4196), GPT-4o (0.5811), Claude Sonnet 4 (0.5739).
Task C (NL Generation): BLEU scores and related string-similarity metrics are low for all models; for example, Gemini 2.5 scores a BLEU of 0.0482.
Task D (NL→DSL, Step-level CSR): BLIP-2 Flan-T5 XL (4.2%), Qwen2-VL (35.3%), Claude Sonnet 4 (52.1%).
Task D (Project-level CSR): Qwen2-VL (21.0%), DeepSeek-VL (8.1%), GPT-4o (4.0%).

Performance declines sharply from recognition and selection (Tasks A, B) to executable synthesis (Task D). Open-source models are more susceptible to syntax errors (unbalanced brackets, undefined stitches), whereas closed-source models more frequently produce programs that are structurally valid but lack semantic faithfulness. Notably, Qwen2-VL's 21% project-level CSR, the highest among the project-level results listed above, suggests that its architecture or pretraining may favor symbolic generalization.
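The CSR figures above are success rates over the Task D test sets. As a small worked example, the sketch below computes such a rate from per-example outcomes and back-calculates the number of successes implied by a reported percentage. The success counts are inferred from the step-level percentages and the test size of 119, not stated in the source, and the function names are illustrative.

```python
def compile_success_rate(outcomes):
    """Fraction of examples whose generated DSL compiled successfully."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(bool(ok) for ok in outcomes) / len(outcomes)


def implied_successes(rate_percent, test_size):
    """Nearest whole number of successes consistent with a reported rate."""
    return round(rate_percent / 100 * test_size)


# Step-level test size from the task table; the counts are inferred, not reported.
STEP_TEST_SIZE = 119
for model, reported in [("BLIP-2 Flan-T5 XL", 4.2),
                        ("Qwen2-VL", 35.3),
                        ("Claude Sonnet 4", 52.1)]:
    count = implied_successes(reported, STEP_TEST_SIZE)
    rate = 100 * count / STEP_TEST_SIZE
    print(f"{model}: {count}/{STEP_TEST_SIZE} = {rate:.1f}%")
```

Under these assumptions the reported step-level rates correspond to 5, 42, and 62 successful compilations out of 119, respectively.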

5. Structural and Functional Evaluation Insights

CrochetBench distinguishes two evaluation axes:

Structural Validity: Syntactic and referential correctness; all bracketed repeat groups, label references, and stitch counts must resolve during compilation.
Functional Correctness: The semantics of the generated program produce a mesh that matches the intended object geometry, often checked via automatic rendered-image–to–target-image similarity metrics using pretrained VLMs.

This dual regime exposes systematic model failure points. Syntax errors (bracket/parenthesis mismatches), undefined operations (unknown stitch abbreviations), label/reference misresolution (missing anchors), and looping/turning errors are common in open-source models. In contrast, semantically incoherent but structurally valid outputs are more characteristic of closed-source models.
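The functional-correctness check can be illustrated with a similarity comparison between two image embeddings. The sketch below assumes the rendered mesh and the target image have already been embedded by some pretrained vision-language model (that step is not shown), and the 0.8 threshold is a hypothetical value chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def functionally_matches(rendered_embedding, target_embedding, threshold=0.8):
    """Hypothetical pass/fail rule: similarity at or above the threshold."""
    return cosine_similarity(rendered_embedding, target_embedding) >= threshold
```

A structurally valid program whose rendering scores low under such a check would fall into the "valid but semantically unfaithful" failure mode described above.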

6. Procedural Reasoning Challenges and Future Research

Current VLMs show measurable competence on isolated perceptual or selection tasks but are limited by deficits in long-range, stateful procedural synthesis. Crochet is inherently stateful: a model must track stitch position, mesh topology, and recursive groupings, which calls for persistent symbolic memory, explicit tracking of both global and local state, and enforcement of topological invariants (such as mesh closure or repeated motifs).
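The kind of state tracking described above can be illustrated with a small stitch-count checker. The sketch assumes a simplified model in which each stitch consumes a fixed number of stitches from the previous row and produces a fixed number in the current one; turning chains, joins, and anchors are ignored, and the names `EFFECT`, `row_effect`, and `check_rows` are illustrative.

```python
# (consumed from previous row, produced in current row) per stitch.
# Simplified assumption: a chain consumes nothing; an increase works
# two single crochets into one stitch.
EFFECT = {"ch": (0, 1), "sc": (1, 1), "dc": (1, 1), "tr": (1, 1), "sc2inc": (1, 2)}


def row_effect(items):
    """Total (consumed, produced) for a row of stitches and repeat groups.

    Each item is a stitch name or a ("repeat", body, n) tuple.
    """
    consumed = produced = 0
    for item in items:
        if isinstance(item, tuple):
            _, body, times = item
            c, p = row_effect(body)
            consumed += c * times
            produced += p * times
        else:
            c, p = EFFECT[item]
            consumed += c
            produced += p
    return consumed, produced


def check_rows(rows):
    """Verify each row consumes exactly what the previous row produced."""
    available = None
    counts = []
    for number, row in enumerate(rows, start=1):
        consumed, produced = row_effect(row)
        if available is not None and consumed != available:
            raise ValueError(
                f"row {number} consumes {consumed} stitches, "
                f"but {available} are available"
            )
        counts.append(produced)
        available = produced
    return counts


rounds = [
    [("repeat", ["ch"], 6)],            # 6 foundation stitches
    [("repeat", ["sc2inc"], 6)],        # increase in every stitch
    [("repeat", ["sc", "sc2inc"], 6)],  # increase in every second stitch
]
print(check_rows(rounds))  # [6, 12, 18]
```

Even this toy invariant requires carrying a count across rows and through nested repeats, which is the kind of persistent state a model must maintain implicitly when generating a full pattern.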

Future research directions identified include:

Hybrid Neuro-Symbolic Architectures: Integration of external memory or scratchpad modules to maintain stitch states, count tracking, and history for contextually aware procedural generation.
Procedural and Topological Pretraining: Enriching model pretraining data with structured procedural artifacts (instructional videos, mesh or CAD representations, annotated construction logs).
Renderer-in-the-Loop Training: Backpropagation or reinforcement via differentiable or semi-automated renderers (using CrochetPARADE’s mesh output) to provide geometric/physical feedback.
CAD/CAM Integration: Utilizing DSL intermediates as bridges between NL/design and automated manufacturing pipelines.

These findings suggest that fine-grained procedural synthesis in tactile creative domains remains an unsolved challenge, with CrochetBench serving as a rigorous diagnostic and research accelerator for models that must reason symbolically, geometrically, and sequentially in tandem (Li et al., 12 Nov 2025).


References (1)

CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain? (2025)
