StarFlow: Structured Workflow Extraction
- StarFlow is a generative framework that automatically converts hand-drawn or digital workflow diagrams into structured, executable JSON.
- Its hybrid encoder–decoder pipeline uses a frozen vision encoder and a transformer decoder to capture intricate workflow logic from diverse input modalities.
- Empirical evaluations show that finetuning boosts performance substantially, with FlowSim rising by roughly 30–50 points (on a 0–100 scale) over the same models without adaptation.
StarFlow refers to a family of distinct methodologies and frameworks across diverse domains, unified by their emphasis on complex structured modeling. They range from automated workflow extraction via vision-language models to advanced generative modeling in astrophysics and computer vision. This article focuses on the system "StarFlow: Generating Structured Workflow Outputs From Sketch Images" (Bechard et al., 27 Mar 2025), with comparative reference to adjacent research threads in the "STARFlow" and "STARFlow-V" literature.
1. Definition and Scope
StarFlow (Bechard et al., 27 Mar 2025) is an end-to-end generative framework designed to automatically convert workflow diagrams, whether sketched by hand or rendered by computer, directly into structured, executable JSON representations. Inputs can come from several modalities (sketches, digital illustrations, screenshots). The framework addresses the need for robust orchestration and automation in enterprise platforms while reducing the manual configuration effort typically associated with low-code or visual programming environments. The core novelty is the application of finetuned vision-language models (VLMs) to interpret ambiguous, style-varied diagrammatic inputs and infer executable workflow logic in a fully automated, modality-flexible manner.
2. Input-to-Output Pipeline
The StarFlow pipeline is a hybrid encode–decode architecture, tailored for visual-to-structured conversion. The main stages are as follows:
- Preprocessing: Input diagrams are cropped to the flow’s bounding box, resized (max height/width=1024 px), and normalized for the chosen vision encoder. No explicit OCR is performed; text is implicitly embedded via the vision encoder.
- Vision Encoding: A frozen image encoder (e.g., LLaMA-3 Vision, Qwen-2.5-VL) produces a patch-based embedding grid. A small, trainable connector transforms these embeddings into the initial key-value pairs for the transformer decoder.
- Structured Decoding: The decoder comprises a pre-trained LLM (e.g., LLaMA 3.2, Phi-4), extended to cross-attend on the image-derived prefix. It autoregressively emits tokens specifying workflow elements: the trigger (type, annotation, inputs) and a fully ordered list of components (actions, flowlogic blocks), each with explanatory fields (definition, scope, ordering, block/nesting index, detailed inputs). Inference is a single end-to-end generation pass: one autoregressive decode produces the complete workflow JSON, with no intermediate stages.
This modality-agnostic design enables robust mapping from free-form diagrammatic input to executable structure, handling significant variance in diagram style, annotation, and logical complexity.
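The resize rule from the preprocessing stage can be sketched as follows. This is a minimal illustration, assuming the 1024 px cap applies to the longer side, that aspect ratio is preserved, and that small images are not upscaled; the function name `fit_within` and the rounding choice are hypothetical and not taken from the paper.

```python
from __future__ import annotations


def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled down so that neither side exceeds max_side.

    Assumptions: aspect ratio is preserved; images already within the cap
    are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to the nearest pixel, but never collapse a side to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2048, 1536))  # (1024, 768)
print(fit_within(800, 600))    # (800, 600)
```

The returned size would then be passed to whatever image library performs the actual resampling and encoder-specific normalization.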
3. Formal Representation of Workflows
StarFlow models workflows as directed graphs $G = (V, E)$, where:
- $V$ encodes nodes (trigger or individual components).
- $E$ specifies directed edges, capturing both execution sequencing and nesting logic (e.g., FOREACH → IF → ELSE → action).
Each node $v \in V$ is associated with a $d$-dimensional feature vector $x_v \in \mathbb{R}^{d}$.
All node features are collected in a matrix $X \in \mathbb{R}^{|V| \times d}$, and the adjacency matrix $A \in \{0,1\}^{|V| \times |V|}$ encodes execution dependencies. For practical generation and evaluation, $G$ is linearized into a rooted, ordered tree $T$, whose preorder traversal defines the corresponding JSON token emission sequence.
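To make the linearization concrete, the sketch below flattens a small rooted, ordered workflow tree by preorder traversal and serializes the result as JSON. The `Node` class, the field names (`kind`, `name`, `depth`, `order`), and the example workflow are all hypothetical; they illustrate the traversal order only and are not the schema used in the paper.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str  # hypothetical categories: "trigger", "flowlogic", "action"
    name: str
    children: List["Node"] = field(default_factory=list)


def preorder(root: Node) -> List[dict]:
    """Flatten the tree in preorder, recording nesting depth and emission order."""
    out: List[dict] = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        out.append({"kind": node.kind, "name": node.name,
                    "depth": depth, "order": len(out)})
        # Push children reversed so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return out


workflow = Node("trigger", "record_created", [
    Node("flowlogic", "FOREACH", [
        Node("flowlogic", "IF", [Node("action", "send_email")]),
        Node("flowlogic", "ELSE", [Node("action", "log_message")]),
    ]),
    Node("action", "update_record"),
])

flat = preorder(workflow)
print(json.dumps(flat, indent=2))
```

Because the tree is ordered, the preorder sequence plus each node's depth is enough to rebuild the nesting, which is why a flat token stream can carry the full structure.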
4. Model Architecture and Training Regimen
- Architecture: The vision-language pipeline couples a frozen vision encoder to a small connector and a trainable transformer LLM decoder, with deep cross-attention integration at each LM layer.
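- Objective Function: Structured workflow generation is formulated as an image-conditioned, token-level language modeling task that minimizes the negative log-likelihood of the serialized JSON tokens $y_1, \dots, y_n$ given the input image $I$:
$$\mathcal{L}(\theta) = -\sum_{t=1}^{n} \log p_\theta\left(y_t \mid y_{<t}, I\right)$$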
No explicit structural or graph-matching loss is needed; the model learns implicit structure from serialized JSON supervision.
- Training Details:
- Optimizer: AdamW with weight decay
- LR schedule: cosine annealing with a 30-step warmup
- Hardware: 16× H100 80GB GPUs, mixed precision (bf16), fully sharded data parallelism
- Frozen vision encoder; only the transformer decoder and connector are updated. Early stopping based on validation loss.
This regimen supports high-capacity adaptation to a diverse range of visual workflow representations while strictly leveraging off-the-shelf vision and language modules.
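Two pieces of this regimen are easy to illustrate in isolation: the token-level negative log-likelihood and a warmup-plus-cosine learning-rate schedule. The sketch below is a plain-Python illustration under stated assumptions: warmup is linear, the cosine decays to zero, and `peak_lr` and `total_steps` are placeholder arguments whose values are not taken from the paper. Both function names are hypothetical.

```python
import math
from typing import Sequence


def token_nll(step_probs: Sequence[float]) -> float:
    """Sum of -log p(y_t | y_<t, image) over the target tokens.

    step_probs[t] is the probability the model assigned to the
    reference token at decoding step t.
    """
    return -sum(math.log(p) for p in step_probs)


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 30) -> float:
    """Linear warmup to peak_lr (assumed), then cosine decay to zero (assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Placeholder values for illustration only.
for s in (0, 29, 30, 515, 1000):
    print(s, lr_at(s, total_steps=1000, peak_lr=1.0))
```

In a real training loop these would typically be replaced by the framework's own cross-entropy loss and scheduler; the point here is only the shape of the objective and the schedule.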
5. Dataset Curation and Annotation Schema
StarFlow’s dataset (23,310 samples across the train, validation, and test splits) is partitioned by source modality and rigorously annotated:
| Source | Train | Valid | Test |
|---|---|---|---|
| Synthetic | 12,376 | 1,000 | 1,000 |
| Manual | 3,035 | 333 | 865 |
| Digital | 2,613 | 241 | 701 |
| Whiteboard | 484 | 40 | 46 |
| UI | 373 | 116 | 87 |
| Total | 18,881 | 1,730 | 2,699 |
- Annotation: Each input is paired with JSON scaffolding: trigger metadata (type, annotation, inputs), and an ordered component list with detailed structure (category, block/indexing, definition, scope, nesting, and all inputs). This curation supports granular structural supervision across both real and synthetic diagrammatic variance.
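As an illustration of what such an annotation record might look like, the sketch below defines a hypothetical record and a small structural check. The key names (`trigger`, `components`, `category`, `definition`, `scope`, `order`, `block`, `inputs`) mirror the fields listed above but are assumptions about spelling and layout, not the dataset's actual schema; the example values are invented.

```python
# Hypothetical key names, chosen to mirror the fields described in the text.
REQUIRED_TRIGGER = {"type", "annotation", "inputs"}
REQUIRED_COMPONENT = {"category", "definition", "scope", "order", "block", "inputs"}


def validate_annotation(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    trigger = record.get("trigger")
    if not isinstance(trigger, dict):
        problems.append("trigger must be an object")
    else:
        for key in sorted(REQUIRED_TRIGGER - set(trigger)):
            problems.append(f"trigger is missing '{key}'")

    components = record.get("components")
    if not isinstance(components, list):
        problems.append("components must be a list")
        return problems

    for i, comp in enumerate(components):
        if not isinstance(comp, dict):
            problems.append(f"component {i} must be an object")
            continue
        for key in sorted(REQUIRED_COMPONENT - set(comp)):
            problems.append(f"component {i} is missing '{key}'")
    return problems


example = {
    "trigger": {"type": "record_created",
                "annotation": "When an incident is created", "inputs": []},
    "components": [
        {"category": "flowlogic", "definition": "IF", "scope": "global",
         "order": 1, "block": 0, "inputs": []},
        {"category": "action", "definition": "send_email", "scope": "global",
         "order": 2, "block": 1, "inputs": []},
    ],
}
print(validate_annotation(example))  # []
```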
6. Evaluation Metrics and Empirical Results
Four principal metrics quantify workflow generation fidelity:
- Flow Similarity (FlowSim): Similarity between generated and reference workflows, derived from their tree edit distance.
- TreeBLEU: Node- and depth-aware overlap on depth-1 subtrees.
- Trigger Match (TM): Exact trigger type agreement.
- Component Match (CM): Jaccard similarity over component sets.
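The two set-based metrics are simple enough to sketch directly. The code below is a minimal illustration, assuming each workflow is a dict with a `trigger.type` string and a `components` list whose items carry a `definition` name; those key names, and the choice to score two empty component sets as 1.0, are assumptions rather than details from the paper. FlowSim and TreeBLEU are omitted because their exact normalization is not specified here.

```python
def trigger_match(pred: dict, ref: dict) -> float:
    """1.0 if the predicted trigger type equals the reference type, else 0.0."""
    return 1.0 if pred["trigger"]["type"] == ref["trigger"]["type"] else 0.0


def component_match(pred: dict, ref: dict) -> float:
    """Jaccard similarity |P ∩ R| / |P ∪ R| over component definition names."""
    p = {c["definition"] for c in pred["components"]}
    r = {c["definition"] for c in ref["components"]}
    if not p and not r:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(p & r) / len(p | r)


reference = {
    "trigger": {"type": "record_created"},
    "components": [{"definition": "IF"}, {"definition": "send_email"},
                   {"definition": "update_record"}],
}
prediction = {
    "trigger": {"type": "record_created"},
    "components": [{"definition": "IF"}, {"definition": "send_email"},
                   {"definition": "log_message"}],
}

print(trigger_match(prediction, reference))    # 1.0
print(component_match(prediction, reference))  # 0.5
```

Note that a set-based Jaccard score ignores ordering and duplicates, which is why it is reported alongside the tree-aware metrics rather than instead of them.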
Empirical results show:
| Model | FlowSim (test) |
|---|---|
| Pixtral-12B (unfinetuned) | 0.632 |
| LLaMA-3.2-11B (unft.) | 0.466 |
| Qwen-2.5-VL-7B (unft.) | 0.614 |
| GPT-4o | 0.786 |
| Gemini Flash 2.0 | 0.780 |
| Qwen-2.5-VL-7B (finetuned) | 0.957 |
| LLaMA-3.2-11B (finetuned) | 0.955 |
| Pixtral-12B (finetuned) | 0.952 |
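Finetuning on the StarFlow dataset yields a large improvement over the same VLMs without adaptation: absolute FlowSim gains range from 0.320 (Pixtral-12B) to 0.489 (LLaMA-3.2-11B), i.e. roughly 32 to 49 points on a 0–100 scale. The finetuned open models also surpass the proprietary GPT-4o and Gemini Flash 2.0 on held-out samples.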
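The per-model gains can be recomputed directly from the table. The short script below uses only the FlowSim values listed above; the variable names are illustrative.

```python
# FlowSim (test) values copied from the table above.
unfinetuned = {"Pixtral-12B": 0.632, "LLaMA-3.2-11B": 0.466, "Qwen-2.5-VL-7B": 0.614}
finetuned = {"Pixtral-12B": 0.952, "LLaMA-3.2-11B": 0.955, "Qwen-2.5-VL-7B": 0.957}
proprietary = {"GPT-4o": 0.786, "Gemini Flash 2.0": 0.780}

# Absolute gain from finetuning, per model.
gains = {m: round(finetuned[m] - unfinetuned[m], 3) for m in unfinetuned}

# Smallest margin between any finetuned model and the best proprietary model.
margin = round(min(finetuned.values()) - max(proprietary.values()), 3)

for model, gain in gains.items():
    print(f"{model}: +{gain:.3f} FlowSim ({gain * 100:.1f} points)")
print(f"Smallest margin over the best proprietary model: +{margin:.3f}")
```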
7. Ablations, Limitations, and Prospects
- Ablation Studies:
- Manual and whiteboard sketches are most challenging (0.50–0.60 FlowSim); synthetic/UI input is easier (0.90+ post-finetune).
- Landscape layouts reduce performance by 2–3 points compared to portrait.
- Excessive image resolution slightly degrades performance, especially for unfinetuned models.
- End-to-end sketch-to-JSON modeling outperforms decomposed, staged approaches.
- Limitations:
- Weak generalization to unseen UIs or workflow logic patterns (FlowSim drops to 0.20–0.30).
- No explicit OCR or retrieval-augmented tool call integration; component names and table scopes may be hallucinated.
- Evaluation quantifies only structural similarity—not semantic workflow executability.
- Future Extensions:
- Incorporation of retrieval-augmented tool calls for component grounding.
- Hybrid OCR+vision integration to enhance text extraction from complex diagrams.
- Executable-aware evaluation metrics (e.g., HumanEval-style) to assess functional correctness.
- Dataset expansion to include more diverse logic and legacy user interfaces.
A plausible implication is that while StarFlow demonstrates robust structure extraction, actual end-to-end workflow automation will also require semantic grounding and functionally-executable output verification.
References:
- StarFlow: Generating Structured Workflow Outputs From Sketch Images (Bechard et al., 27 Mar 2025)