
STARFlow: Structured Workflow Extraction

Updated 29 March 2026
  • StarFlow is a generative framework that automatically converts hand-drawn or digital workflow diagrams into structured, executable JSON.
  • Its hybrid encoder–decoder pipeline uses a frozen vision encoder and a transformer decoder to capture intricate workflow logic from diverse input modalities.
  • Empirical evaluations show that finetuning boosts performance significantly, with test FlowSim rising by roughly 30 to 50 points over the same models without adaptation.

StarFlow refers to a family of distinct methodologies and frameworks across diverse domains, unified by their emphasis on complex structured modeling, ranging from automated workflow extraction via vision-language models to advanced generative modeling in astrophysics and computer vision. This article focuses on the system "StarFlow: Generating Structured Workflow Outputs From Sketch Images" (Bechard et al., 27 Mar 2025), with comparative reference to adjacent research threads in the "STARFlow" and "STARFlow-V" literature.

1. Definition and Scope

StarFlow (introduced in Bechard et al., 27 Mar 2025) is an end-to-end generative framework that automatically converts arbitrary workflow diagrams, whether sketched by hand or rendered by computer, directly into structured, executable JSON representations. Inputs can come from several modalities (sketches, digital illustrations, screenshots), which addresses the need for robust orchestration and automation in enterprise platforms and reduces the manual configuration effort typical of low-code and visual programming environments. The core novelty is the application of finetuned vision-language models (VLMs) to interpret ambiguous, style-varied diagrammatic inputs and infer executable workflow logic fully automatically, regardless of input modality.

2. Input-to-Output Pipeline

The StarFlow pipeline is a hybrid encode–decode architecture, tailored for visual-to-structured conversion. The main stages are as follows:

  • Preprocessing: Input diagrams are cropped to the flow’s bounding box, resized (max height/width=1024 px), and normalized for the chosen vision encoder. No explicit OCR is performed; text is implicitly embedded via the vision encoder.
  • Vision Encoding: A frozen image encoder (e.g., LLaMA-3 Vision, Qwen-2.5-VL) produces a patch-based embedding grid, $\mathbb{F} \in \mathbb{R}^{H \times W \times C}$. A small, trainable connector transforms these embeddings into the initial key-value pairs for the transformer decoder.
  • Structured Decoding: The decoder comprises a pre-trained LLM (e.g., LLaMA 3.2, Phi-4), extended to cross-attend on the image-derived prefix. It autoregressively emits tokens specifying workflow elements: the trigger (type, annotation, inputs), and a fully ordered list of components (actions, flowlogic blocks) each with explanatory fields (definition, scope, ordering, block/nesting index, detailed inputs). Inference is a single autoregressive decoding run that produces the complete workflow JSON, with no intermediate stages.

This modality-agnostic design enables robust mapping from free-form diagrammatic input to executable structure, handling significant variance in diagram style, annotation, and logical complexity.
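
The resizing rule in the preprocessing step can be sketched as follows. This is a minimal illustration, assuming the 1024 px cap applies to the longer side and that the aspect ratio is preserved. The function name `fit_within` and the choice not to upscale small images are assumptions made for the example, not details from the source.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so that neither side exceeds max_side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        # Assumption: images already within the cap are left untouched.
        return width, height
    scale = max_side / longest
    # Round to the nearest pixel, keeping each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(3000, 2000)  # longer side is capped at 1024
small = fit_within(800, 600)        # already within the cap
print(landscape, small)
```

The returned size would then be passed to whatever image library performs the actual resampling before the encoder-specific normalization.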

3. Formal Representation of Workflows

StarFlow models workflows as directed graphs $G=(V,E)$, where:

  • $V=\{v_0,\ldots,v_n\}$ encodes nodes (trigger or individual components).
  • $E \subset V \times V$ specifies directed edges, capturing both execution sequencing and nesting logic (e.g., FOREACH → IF → ELSE → action).

Each node $v$ is associated with a $d$-dimensional feature vector:

$$x_v = \big[\text{one-hot}(\text{category}_v);\ \text{one-hot}(\text{definition}_v);\ \text{embed}(\text{annotation}_v)\big] \in \mathbb{R}^d$$

All node features are collected in a matrix $X = [x_{v_1}, \ldots, x_{v_n}] \in \mathbb{R}^{d \times n}$, and the adjacency matrix $A \in \{0,1\}^{n \times n}$ encodes execution dependencies. For practical generation and evaluation, $G$ is linearized into a rooted, ordered tree $T(F)$, whose preorder traversal defines the corresponding JSON token emission sequence.
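
The linearization described above can be illustrated with a small sketch. It assumes the workflow is already available as a rooted, ordered tree and treats every edge as a parent-to-child link, which is a simplification of the edge set $E$ in the text. The `Node` class, its field names, and the example component names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    category: str    # e.g. "trigger", "action", "flowlogic"
    definition: str  # hypothetical component identifier
    children: list["Node"] = field(default_factory=list)


def preorder(root: Node) -> list[Node]:
    """Visit a node first, then its children from left to right."""
    order = [root]
    for child in root.children:
        order.extend(preorder(child))
    return order


def adjacency(root: Node) -> list[list[int]]:
    """Build a 0/1 matrix with a 1 at [i][j] for each parent -> child edge."""
    nodes = preorder(root)
    index = {id(node): i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for node in nodes:
        for child in node.children:
            matrix[index[id(node)]][index[id(child)]] = 1
    return matrix


workflow = Node("trigger", "record_created", [
    Node("flowlogic", "FOREACH", [
        Node("flowlogic", "IF", [Node("action", "send_email")]),
    ]),
    Node("action", "log_message"),
])

sequence = [node.definition for node in preorder(workflow)]
A = adjacency(workflow)
print(sequence)
```

Rows and columns of `A` follow the preorder index, so the same ordering drives both the matrix and the emission sequence.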

4. Model Architecture and Training Regimen

  • Architecture: The vision-language pipeline couples a frozen vision encoder to a small connector and a trainable transformer LLM decoder, with deep cross-attention integration at each LM layer.
  • Objective Function: Structured workflow generation is formulated as a token-level language modeling task conditioned on the input image $I$:

$$L_{CE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, I)$$

No explicit structural or graph-matching loss is needed; the model learns implicit structure from serialized JSON supervision.
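
To make the objective concrete, the sketch below evaluates the summed negative log-likelihood for a three-token target. The per-step distributions are made-up stand-ins for decoder outputs conditioned on the image and the previously emitted tokens. The function name `sequence_nll` is hypothetical.

```python
import math


def sequence_nll(step_probs: list[dict[str, float]], target: list[str]) -> float:
    """Sum of -log p(y_t | y_<t, I) over the target tokens."""
    if len(step_probs) != len(target):
        raise ValueError("one distribution per target token is required")
    return -sum(math.log(dist[token]) for dist, token in zip(step_probs, target))


# Toy values only: each dict is the model's distribution at one decoding step.
target = ["{", '"trigger"', "}"]
step_probs = [
    {"{": 0.9, "[": 0.1},
    {'"trigger"': 0.8, '"components"': 0.2},
    {"}": 0.5, ",": 0.5},
]

loss = sequence_nll(step_probs, target)
print(round(loss, 4))
```

Because the loss is just the sum over serialized JSON tokens, structural mistakes are penalized only through the tokens they change, which is why no separate graph-matching term appears.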

  • Training Details:
    • Optimizer: AdamW ($\beta_1=0.95$, $\beta_2=0.999$, weight decay $=10^{-6}$, $\epsilon=10^{-8}$)
    • LR schedule: cosine anneal with 30-step warmup, peak lr $=2\times10^{-5}$
    • 16 $\times$ H100 80GB GPUs, mixed precision (bf16), fully sharded data parallelism
    • Frozen vision encoder; only the transformer decoder and connector are updated. Early stopping based on validation loss.

This regimen supports high-capacity adaptation to a diverse range of visual workflow representations while strictly leveraging off-the-shelf vision and language modules.
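
The learning-rate schedule listed in the training details can be sketched as below. The 30-step warmup and the $2\times10^{-5}$ peak come from the text. The linear warmup shape, the total step count, and a final rate of zero are assumptions for illustration, since the source does not state them.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 2e-5,
                  warmup_steps: int = 30, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Assumption: warmup rises linearly and reaches the peak on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 29, 30, 515, 1000):
    print(s, learning_rate(s, total))
```

In a real training loop the same function would typically be wrapped in the framework's scheduler interface rather than called by hand.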

5. Dataset Curation and Annotation Schema

StarFlow’s dataset (23,310 samples across the train, validation, and test splits) is partitioned by source modality and rigorously annotated:

| Source | Train | Valid | Test |
|---|---|---|---|
| Synthetic | 12,376 | 1,000 | 1,000 |
| Manual | 3,035 | 333 | 865 |
| Digital | 2,613 | 241 | 701 |
| Whiteboard | 484 | 40 | 46 |
| UI | 373 | 116 | 87 |
| Total | 18,881 | 1,730 | 2,699 |

  • Annotation: Each input is paired with JSON scaffolding: trigger metadata (type, annotation, inputs), and an ordered component list with detailed structure (category, block/indexing, definition, scope, nesting, and all inputs). This curation supports granular structural supervision across both real and synthetic diagrammatic variance.
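
As a rough picture of what one annotation might look like, the sketch below builds a target record and serializes it. The field groups follow the description above, but every key name, value, and component identifier here is hypothetical. The source does not publish the exact schema in this article.

```python
import json

# Hypothetical annotation for a two-component workflow.
example = {
    "trigger": {
        "type": "record_created",
        "annotation": "When an incident is created",
        "inputs": [{"name": "table", "value": "incident"}],
    },
    "components": [
        {
            "category": "flowlogic",
            "definition": "IF",
            "scope": "global",
            "order": 1,
            "block": 0,  # top level: no enclosing flowlogic block
            "annotation": "If priority is high",
            "inputs": [{"name": "condition", "value": "priority == 1"}],
        },
        {
            "category": "action",
            "definition": "send_email",
            "scope": "global",
            "order": 2,
            "block": 1,  # nested inside the component with order 1
            "annotation": "Notify the on-call engineer",
            "inputs": [{"name": "to", "value": "oncall@example.com"}],
        },
    ],
}

serialized = json.dumps(example, indent=2)
restored = json.loads(serialized)
print(serialized)
```

The serialized string is what a decoder would be trained to emit token by token, so key order and nesting indices directly shape the supervision signal.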

6. Evaluation Metrics and Empirical Results

Four principal metrics quantify workflow generation fidelity:

  • Flow Similarity (FlowSim): Measures tree-edit-distance similarity between generated and reference workflows: $\mathrm{FlowSim}(F,F_r) = 1 - \mathrm{TED}(F,F_r)/(|F|+|F_r|)$.
  • TreeBLEU: Node- and depth-aware overlap on depth-1 subtrees.
  • Trigger Match (TM): Exact trigger type agreement.
  • Component Match (CM): Jaccard similarity over component sets.
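
The FlowSim and Component Match definitions can be tried out on toy trees with the sketch below. It assumes unit costs for insert, delete, and relabel in the tree edit distance and uses a simple memoized recursion suited only to small trees. The source does not specify the cost model or the algorithm, and the tree encoding and labels are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.


def size(forest) -> int:
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2) -> int:
    """Ordered tree edit distance between two forests with unit costs."""
    if not f1 or not f2:
        return size(f1) + size(f2)  # insert or delete everything that remains
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(kids1, kids2)
             + forest_distance(f1[:-1], f2[:-1])
             + (label1 != label2))
    return min(delete, insert, match)


def flow_sim(pred, ref) -> float:
    """FlowSim = 1 - TED / (|F| + |F_r|)."""
    ted = forest_distance((pred,), (ref,))
    return 1.0 - ted / (size((pred,)) + size((ref,)))


def component_match(pred_components, ref_components) -> float:
    """Jaccard similarity over the sets of component names."""
    a, b = set(pred_components), set(ref_components)
    return len(a & b) / len(a | b) if (a | b) else 1.0


ref = ("trigger", (("action_a", ()), ("if", (("action_b", ()),))))
pred = ("trigger", (("action_a", ()), ("if", ())))  # action_b is missing

print(flow_sim(pred, ref), component_match(["action_a", "if"], ["action_a", "if", "action_b"]))
```

For workflows with many components, a dedicated algorithm such as Zhang-Shasha would be the more practical choice than this recursion.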

Empirical results show:

| Model | FlowSim (test) |
|---|---|
| Pixtral-12B (unfinetuned) | 0.632 |
| LLaMA-3.2-11B (unfinetuned) | 0.466 |
| Qwen-2.5-VL-7B (unfinetuned) | 0.614 |
| GPT-4o | 0.786 |
| Gemini Flash 2.0 | 0.780 |
| Qwen-2.5-VL-7B (finetuned) | 0.957 |
| LLaMA-3.2-11B (finetuned) | 0.955 |
| Pixtral-12B (finetuned) | 0.952 |

Finetuning on the StarFlow dataset raises test FlowSim by 0.32 to 0.49 (32 to 49 points) over the same models without adaptation, and all three finetuned models surpass the proprietary GPT-4o and Gemini Flash 2.0 on held-out samples.
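
The per-model gains can be recomputed directly from the reported test FlowSim values. The snippet below is only arithmetic on the figures in the table. The variable names are illustrative.

```python
# Test FlowSim values copied from the results table.
unfinetuned = {"Pixtral-12B": 0.632, "LLaMA-3.2-11B": 0.466, "Qwen-2.5-VL-7B": 0.614}
finetuned = {"Pixtral-12B": 0.952, "LLaMA-3.2-11B": 0.955, "Qwen-2.5-VL-7B": 0.957}
best_proprietary = max(0.786, 0.780)  # GPT-4o and Gemini Flash 2.0

# Absolute gain from finetuning, per model.
gains = {name: round(finetuned[name] - unfinetuned[name], 3) for name in unfinetuned}

# Margin of each finetuned model over the stronger proprietary baseline.
margins = {name: round(score - best_proprietary, 3) for name, score in finetuned.items()}

for name in gains:
    print(f"{name}: gain {gains[name]:+.3f}, margin over proprietary {margins[name]:+.3f}")
```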

7. Ablations, Limitations, and Prospects

  • Ablation Studies:
    • Manual and whiteboard sketches are most challenging (~0.50–0.60 FlowSim); synthetic/UI input is easier (~0.90+ post-finetune).
    • Landscape layouts reduce performance by 2–3 points compared to portrait.
    • Excessive image resolution slightly degrades performance, especially for unfinetuned models.
    • End-to-end sketch-to-JSON modeling outperforms decomposed, staged approaches.
  • Limitations:
    • Weak generalization to unseen UIs or workflow logic patterns (FlowSim drops to 0.20–0.30).
    • No explicit OCR or retrieval-augmented tool call integration; component names and table scopes may be hallucinated.
    • Evaluation quantifies only structural similarity—not semantic workflow executability.
  • Future Extensions:
    • Incorporation of retrieval-augmented tool calls for component grounding.
    • Hybrid OCR+vision integration to enhance text extraction from complex diagrams.
    • Executable-aware evaluation metrics (e.g., HumanEval-style) to assess functional correctness.
    • Dataset expansion to include more diverse logic and legacy user interfaces.

A plausible implication is that while StarFlow demonstrates robust structure extraction, actual end-to-end workflow automation will also require semantic grounding and verification that the generated output actually executes.

