Med-CRAFT: Medical Video Reasoning Framework
- Med-CRAFT is a neuro-symbolic framework that generates interpretable, multi-hop benchmarks for medical video reasoning.
- It integrates visual extraction, semantic graph construction, and logic-based query synthesis to overcome limitations in existing medical video datasets.
- The framework achieves precise Chain-of-Thought provenance and scalable, expert-level QA benchmarks, enhancing model evaluation for surgical video analysis.
Med-CRAFT (Medical Cross-modal Reasoning And Fine-grained Tracking) is a neuro-symbolic, deterministic data engineering framework for the automated construction of interpretable, multi-hop medical video reasoning benchmarks. It addresses key limitations of prior approaches—most notably the scarcity of scalable, logically annotated video datasets for medical Multi-Modal LLM (MLLM) evaluation—by reframing benchmark synthesis as a dynamic knowledge graph traversal problem. Med-CRAFT enables large-scale, expert-level video question-answer (QA) workloads with precise, verifiable Chain-of-Thought (CoT) provenance and fine-grained logical complexity, as exemplified by its instantiation in the M³-Med-Auto benchmark (Liu et al., 30 Nov 2025).
1. Motivation and Conceptual Foundations
The development and assessment of MLLMs in medicine is constrained by several intrinsic data challenges:
- Scarcity of Expert Annotations: Manual curation of surgical videos is expensive and non-scalable due to the necessity for domain expertise, accurate temporal boundaries, and logical consistency.
- Simulation-Based Synthesis Gaps: Physics-based video simulators (e.g., CLEVR, VisualRoad) are inadequate for surgical domains due to their inability to model the stochastic, deformable dynamics of human tissue and surgical actions, resulting in a significant sim-to-real domain gap.
- Limitations of Black-Box Generative Approaches: End-to-end LLM/MLLM query generation often yields hallucinated entities or implausible event sequences, and it lacks deterministic provenance, complicating failure analysis.
Med-CRAFT bridges these gaps by integrating structured visual primitive extraction, deterministic knowledge graph instantiation, and logic-grounded multi-hop query generation—yielding highly controllable, interpretable, and scalable video benchmarks.
2. Pipeline Architecture and Methodology
Med-CRAFT operates across three abstraction layers:
- Visual Extraction Layer (Pixel-Level): Raw surgical video is ingested alongside synchronized audio and on-screen text via automatic speech recognition (ASR) and optical character recognition (OCR). These modalities inform an open-set grounding detector (Grounding DINO), which produces frame-level bounding box detections and CLIP-ViT embeddings for semantic association. Primitives are temporally associated across frames using affinity scores that combine semantic similarity and spatial overlap (Generalized IoU), resolved frame-to-frame by global bipartite matching (Hungarian algorithm), yielding spatiotemporal tubelets (a minimal association sketch follows this list).
- Graph Construction Layer (Semantic Level): Tubelets become nodes in a dynamic spatiotemporal knowledge graph (KG) $G = (V, E)$, where $V$ is the set of entity nodes linked by typed, directed edges $E$, each temporally annotated. Edge candidacy is determined by overlap scores, trajectory similarity (DTW-based), and semantic affinity, with relation types assigned via MLLM (Qwen3-VL) analysis of the corresponding video sub-sequences.
- Query Synthesis Layer (Logic Level): Question generation reduces to deterministic graph traversal, where each path (e.g., $v_1 \xrightarrow{r_1} v_2 \xrightarrow{r_2} \cdots \xrightarrow{r_n} v_{n+1}$) encodes an $n$-hop reasoning requirement. Depth-first search (DFS) extracts graph paths up to $N$ hops, forming structured query templates. Each path prompts a generative MLLM to synthesize a video QA pair explicitly coupled with its CoT trace, reflecting the symbolic reasoning chain. An adversarial critic (e.g., GPT-4V) filters non-answerable or ambiguous queries (a toy construction-and-traversal sketch also follows this list).
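The cross-frame association step can be illustrated with a minimal sketch. It assumes each detection carries a bounding box and a unit-norm CLIP embedding; the affinity weights (`w_sem`, `w_giou`) and the gating threshold are hypothetical placeholders, not the paper's tuned values.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def generalized_iou(a, b):
    """GIoU for boxes in (x1, y1, x2, y2) format; returns a value in [-1, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box for the GIoU penalty term
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (hull - union) / hull

def associate(prev_dets, curr_dets, w_sem=0.6, w_giou=0.4, min_affinity=0.3):
    """Link detections across consecutive frames by maximizing combined affinity.

    Each detection is a dict with 'box' (x1, y1, x2, y2) and a unit-norm
    'emb' CLIP embedding (np.ndarray). Returns matched (prev_idx, curr_idx) pairs.
    """
    affinity = np.zeros((len(prev_dets), len(curr_dets)))
    for i, p in enumerate(prev_dets):
        for j, c in enumerate(curr_dets):
            sem = float(np.dot(p["emb"], c["emb"]))      # cosine similarity
            geo = generalized_iou(p["box"], c["box"])    # spatial overlap
            affinity[i, j] = w_sem * sem + w_giou * geo
    rows, cols = linear_sum_assignment(-affinity)        # Hungarian: maximize affinity
    return [(i, j) for i, j in zip(rows, cols) if affinity[i, j] >= min_affinity]
```

Running this per consecutive frame pair and chaining the matched indices yields the spatiotemporal tubelets described above.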
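Likewise, a toy sketch of the graph-construction and DFS-based path-extraction layers, using networkx. The node names, relation labels, and timestamps are invented to echo the worked examples in Section 3; the real pipeline assigns relation types via MLLM analysis rather than by hand.

```python
import networkx as nx

# Toy KG: nodes are tubelets; edges are typed relations with time spans (seconds).
kg = nx.MultiDiGraph()
kg.add_edge("cotton_swab", "iodine", relation="infiltrates", t=(12.0, 15.5))
kg.add_edge("iodine", "skin", relation="smears", t=(15.5, 19.0))
kg.add_edge("scalpel", "capsule", relation="incises", t=(40.0, 47.0))
kg.add_edge("capsule", "nodule", relation="exposes", t=(47.0, 52.0))

def dfs_paths(graph, node, max_hops, path=()):
    """Yield all simple relation paths of 1..max_hops hops starting at `node`."""
    if path:
        yield path
    if len(path) == max_hops:
        return
    visited = {node} | {s for s, _, _, _ in path}
    for _, nxt, data in graph.out_edges(node, data=True):
        if nxt in visited:
            continue  # keep paths simple (acyclic)
        step = (node, data["relation"], nxt, data["t"])
        yield from dfs_paths(graph, nxt, max_hops, path + (step,))

# Every extracted path becomes one structured query template;
# its length fixes the hop count of the resulting QA item.
for start in kg.nodes:
    for p in dfs_paths(kg, start, max_hops=4):
        print(len(p), "hops:", [f"{s} -{r}-> {o}" for s, r, o, _ in p])
```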
3. M³-Med-Auto: Benchmark Instantiation and Complexity Profiling
Med-CRAFT is instantiated on the M³-Med corpus to generate M³-Med-Auto, a large-scale QA benchmark for medical video reasoning. This dataset is profiled along three core axes:
- Temporal Selectivity Ratio (TSR): $\mathrm{TSR} = \frac{t_{\text{end}} - t_{\text{start}}}{T_{\text{video}}}$, the duration of the ground-truth answer segment relative to the full video. Lower TSR indicates tighter temporal localization.
- Semantic Contextual Confusion (SCC): a measure of how semantically similar the correct answer segment is to the surrounding distractor segments. Higher SCC corresponds to increased semantic challenge in distinguishing correct answer segments.
- Logic Depth Distribution: $\{N_k\}$, where $N_k$ denotes the number of $k$-hop queries, enabling precise control over logical complexity. (A short computation sketch for TSR and the depth histogram follows this list.)
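A minimal computation sketch for two of these axes, assuming TSR is the duration ratio defined above; SCC is omitted because it depends on the embedding model used to score segment similarity.

```python
from collections import Counter

def tsr(answer_span, video_duration):
    """Temporal Selectivity Ratio: answer-segment duration over video duration."""
    start, end = answer_span
    return (end - start) / video_duration

def logic_depth_distribution(queries):
    """Histogram of queries per hop depth, i.e. the N_k counts."""
    return Counter(q["hops"] for q in queries)

queries = [
    {"hops": 2, "span": (15.5, 19.0)},
    {"hops": 3, "span": (40.0, 52.0)},
]
print(tsr((15.5, 19.0), video_duration=300.0))  # ~0.0117: tight localization
print(logic_depth_distribution(queries))         # Counter({2: 1, 3: 1})
```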
Empirical results show that M³-Med-Auto achieves lower TSR, higher SCC, and a balanced distribution of 1- to 4-hop queries compared to previous datasets, confirming both fine-grained temporal selectivity and multi-hop logical challenge.
Representative Query Examples with CoT (a hypothetical record layout follows these examples):
- 2-Hop: "After infiltrating iodine with a cotton swab, what is the immediate next action on the skin?" (CoT: 1. Identify 'infiltrates' event; 2. Resolve 'smears' action.)
- 3-Hop: "Once the capsule is incised to expose the nodule, which tool is used next to biopsy the tissue?" (CoT: 1. Incision detection; 2. Nodule exposure; 3. Biopsy action resolution.)
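For concreteness, one plausible record layout coupling such a QA item to its symbolic provenance; the field names are illustrative and not the released schema.

```python
# Hypothetical QA record: the KG path gives verifiable, per-hop provenance
# for the natural-language question and its CoT trace.
qa_item = {
    "question": "After infiltrating iodine with a cotton swab, "
                "what is the immediate next action on the skin?",
    "answer": "smears",
    "hops": 2,
    "kg_path": [  # one typed, time-stamped edge per reasoning hop
        ("cotton_swab", "infiltrates", "iodine", (12.0, 15.5)),
        ("iodine", "smears", "skin", (15.5, 19.0)),
    ],
    "cot": [
        "1. Identify the 'infiltrates' event (cotton swab -> iodine).",
        "2. Resolve the subsequent 'smears' action (iodine -> skin).",
    ],
}
```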
4. Experimental Evaluation and Logic Alignment
Two core evaluations are conducted following M³-Med protocols:
- Temporal Answer Grounding in Single Video (TAGSV): measured by Recall@$K$ at temporal IoU (tIoU) thresholds.
- Video Corpus Retrieval + Grounding (TAGVC): assessed by top-$K$ mIoU, combining corpus-level retrieval with segment localization. (A tIoU/recall computation sketch follows.)
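A minimal sketch of the underlying grounding metric, assuming segments are (start, end) pairs in seconds; `recall_at_k` here is an illustrative helper, not the benchmark's official scorer.

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, gt, k=1, threshold=0.5):
    """1.0 if any of the top-k predicted segments reaches the tIoU threshold."""
    return float(any(t_iou(p, gt) >= threshold for p in ranked_preds[:k]))

# Usage: one query with ranked segment predictions against the ground truth
preds = [(14.0, 20.0), (100.0, 110.0)]
print(recall_at_k(preds, gt=(15.5, 19.0), k=1, threshold=0.5))  # 1.0
```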
Evaluated models include:
- Supervised specialist (MutualSL),
- Heuristic CLIP baseline,
- Chain-of-Thought oracle (GLM-4.5V).
Results indicate:
- Comparable model performance on simple (1-hop) queries.
- Marked performance divergence on complex (multi-hop) queries: the CLIP baseline fails, whereas GLM-4.5V retains high recall at tIoU = 0.5.
- Automated M³-Med-Auto matches or exceeds expert-curated complexity with substantially reduced annotation cost.
Logic Alignment Analysis: The relationship between the prescribed hop count (given by the KG path length) and the inferred hop count (recovered from the oracle's CoT trace) shows a strong Pearson correlation and high alignment accuracy (sketched below). Heatmap analysis confirms that Med-CRAFT’s symbolic KG traversals directly correspond to LLM reasoning steps.
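The alignment statistic itself is straightforward to reproduce in a sketch; the hop counts below are toy values, not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical per-query records: hop count prescribed by the KG path vs.
# hop count inferred from the oracle model's CoT trace.
prescribed = [1, 2, 2, 3, 3, 4, 4, 4]
inferred   = [1, 2, 2, 3, 2, 4, 4, 3]

r, _ = pearsonr(prescribed, inferred)
accuracy = sum(p == i for p, i in zip(prescribed, inferred)) / len(prescribed)
print(f"Pearson r = {r:.3f}, exact-hop alignment accuracy = {accuracy:.2%}")
```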
5. Advantages, Limitations, and Prospective Extensions
Advantages:
- Provenance and Interpretability: Each QA item is uniquely traceable to a specified KG path, providing unambiguous reasoning provenance and eliminating black-box hallucination phenomena.
- Logical and Temporal Control: Difficulty can be tuned by sampling over desired hop counts and temporal windows, facilitating curriculum design and model stress testing.
- Scalability: Visual primitive extraction decouples benchmark scaling from human annotation, supporting dataset expansion.
- Domain Fidelity: Operating on real surgical video mitigates the sim-to-real gap inherent in synthetic environments.
Limitations and Extensions:
- Detection Errors: The integrity of tubelet linking and KG relation labeling is contingent on visual detector robustness; rare or subtle events may be missed.
- Computational Overhead: Graph construction and adversarial QA validation entail significant computational costs.
- Modal Scope: Current pipeline operates on video plus audio/text. Incorporation of additional modalities (e.g., CT, EHR) may expand diagnostic reasoning depth.
- Iterative Adversarial Refinement: An active adversarial loop could leverage observed model failures to generate incrementally challenging queries.
- Privacy-Preserving Deployment: Federated pipeline deployment may be developed to allow local execution within healthcare institutions, protecting sensitive video data.
6. Impact and Future Directions
Med-CRAFT introduces a robust and scalable paradigm for constructing interpretable, multi-hop video QA benchmarks in the medical domain by casting the task as a deterministic traversal over structured knowledge graphs grounded in pixel-level extraction. Its neuro-symbolic methodology establishes strong logic alignment with state-of-the-art MLLM reasoning and enables systematic, low-cost generation of complex, explainable evaluation protocols. Extensions toward multi-modal integration, privacy-preserving deployment, and adversarial difficulty optimization are identified as future research directions for maximizing impact in domain-critical clinical applications.
Key Reference:
Med-CRAFT: Automated Construction of Interpretable and Multi-Hop Video Workloads via Knowledge Graph Traversal (Liu et al., 30 Nov 2025)