Physics Question Scene Graph (PQSG)

Updated 27 June 2026

PQSG is a formally defined hierarchical graph structure that decomposes scene analysis into object existence, action verification, and physics-plausibility nodes.
It employs data-driven methodologies using vision-language and language models to generate atomic yes/no queries with explicit dependency rules.
Empirical evaluations show that PQSG scores correlate well with human judgments, effectively discriminating model performance in physical realism.

A Physics Question Scene Graph (PQSG) is a formally specified, hierarchical, question-based graph structure designed to provide fine-grained, contextually valid evaluation of physical plausibility in generated or observed scenes, particularly in the context of text-to-video (T2V) generation and scientific visual reasoning. PQSGs are constructed either from textual prompts or images, decomposing scene understanding into object existence, action verification, and physics-plausibility verification nodes, with logical dependencies that reflect the compositional and causal nature of physical law adherence. This approach enables precise localization and categorical specification of physical law violations within visual data or model outputs.

1. Formal Structure and Hierarchical Design

A PQSG is defined as a directed acyclic graph $G = (V, E)$ where the node set $V$ is partitioned into three disjoint subsets:

$V_o = \{ o_1, \ldots, o_{|V_o|} \}$ (object-existence nodes)
$V_a = \{ a_1, \ldots, a_{|V_a|} \}$ (action-verification nodes)
$V_p = \{ p_1, \ldots, p_{|V_p|} \}$ (physics-plausibility nodes)

Each node $v \in V$ is associated with a specific Boolean verification question $q_v$, such as "Are there two pillows?" (object), "Does one grabber tool release the ball?" (action), or "Do the pillows visibly deform upon impact?" (physics) (Pothiraj et al., 24 Jun 2026). The edge set $E \subseteq V \times V$ encodes logical prerequisites with three permitted edge types: object-to-action, action-to-physics, and intra-category dependencies. An edge $(u \to v)$ specifies that $v$'s question should only be considered if $u$ was answered affirmatively. The strict hierarchy $V_o \to V_a \to V_p$ is enforced, ensuring that no physics node is reachable except via a chain of action and object ancestors.

For any $p \in V_p$, there exist $o \in V_o$ and $a \in V_a$ such that a path $o \to \cdots \to a \to \cdots \to p$ exists. If at evaluation time any parent $u$ of $v$ is answered "no," $v$ and all of its descendants are automatically set to "no," preserving logical consistency. This construction guarantees that physics queries are only posed for scene elements with verified presence and action context.
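The structure described above can be sketched as a small Python container. This is only an illustration under stated assumptions: the class name `PQSG`, its methods, the node identifiers, and the questions marked hypothetical are not taken from the source, and acyclicity is assumed rather than checked.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Permitted (parent category, child category) pairs, as listed in the text:
# object-to-action, action-to-physics, and intra-category dependencies.
ALLOWED_EDGES = {
    ("object", "action"), ("action", "physics"),
    ("object", "object"), ("action", "action"), ("physics", "physics"),
}


@dataclass
class PQSG:
    """Hypothetical container for a question scene graph (names are assumptions)."""
    kinds: Dict[str, str] = field(default_factory=dict)         # node id -> category
    questions: Dict[str, str] = field(default_factory=dict)     # node id -> yes/no question
    parents: Dict[str, Set[str]] = field(default_factory=dict)  # node id -> parent ids

    def add_node(self, node_id: str, kind: str, question: str) -> None:
        if kind not in {"object", "action", "physics"}:
            raise ValueError(f"unknown category: {kind}")
        self.kinds[node_id] = kind
        self.questions[node_id] = question
        self.parents[node_id] = set()

    def add_edge(self, parent: str, child: str) -> None:
        pair = (self.kinds[parent], self.kinds[child])
        if pair not in ALLOWED_EDGES:
            raise ValueError(f"edge type {pair} is not permitted")
        self.parents[child].add(parent)

    def ancestors(self, node_id: str) -> Set[str]:
        # Iterative walk over parent links; the 'seen' set avoids revisiting nodes.
        seen: Set[str] = set()
        stack = list(self.parents[node_id])
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(self.parents[u])
        return seen

    def physics_nodes_grounded(self) -> bool:
        # Every physics node needs at least one action and one object ancestor.
        for v, kind in self.kinds.items():
            if kind == "physics":
                ancestor_kinds = {self.kinds[u] for u in self.ancestors(v)}
                if not {"object", "action"} <= ancestor_kinds:
                    return False
        return True


g = PQSG()
g.add_node("o1", "object", "Are there two pillows?")
g.add_node("o2", "object", "Is there a grabber tool holding a ball?")  # hypothetical
g.add_node("a1", "action", "Does one grabber tool release the ball?")
g.add_node("a2", "action", "Does the ball land on the pillows?")       # hypothetical
g.add_node("p1", "physics", "Do the pillows visibly deform upon impact?")
g.add_edge("o2", "a1")  # illustrative dependencies, not from the source
g.add_edge("o1", "a2")
g.add_edge("a1", "a2")
g.add_edge("a2", "p1")
```

Rejecting disallowed edge types at insertion time means a direct object-to-physics edge can never be added, so a physics node can only be reached through an action node.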

2. Methodologies for PQSG Generation

PQSG construction is data-driven, leveraging large vision-language models (VLMs) or large language models (LLMs) guided by structured system instructions and high-quality in-context examples (Pothiraj et al., 24 Jun 2026, Haque et al., 28 May 2026). The methodologies differ depending on input modality:

Text-to-Video Prompting: The text prompt is passed to a VLM together with detailed instructions and sample JSON PQSGs. The response lists atomic, non-overlapping yes/no questions for each level of the hierarchy, along with the dependency edges. The output JSON schema explicitly partitions nodes by class.
Scientific Visual Reasoning (PhysScene): For image-based scenes, PQSGs are constructed as labeled, directed graphs $G = (V, E)$, where nodes represent objects, instruments, or regions, and edges are labeled with predicates from a physics-specific vocabulary (Zou et al., 8 Jun 2026). Features are extracted using deep CNNs for bounding box regions, and edge relations are computed based on contextual embeddings and geometric features.
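As a rough illustration of the class-partitioned JSON output mentioned for text-to-video prompting, the sketch below parses one possible layout. The source does not give the actual schema, so the key names (`object`, `action`, `physics`, `edges`, `id`, `question`), the helper `parse_pqsg`, and the edges themselves are all assumptions.

```python
import json

# Hypothetical VLM response; the real schema is not specified in the source.
raw = """
{
  "object":  [{"id": "o1", "question": "Are there two pillows?"}],
  "action":  [{"id": "a1", "question": "Does one grabber tool release the ball?"}],
  "physics": [{"id": "p1", "question": "Do the pillows visibly deform upon impact?"}],
  "edges":   [["o1", "a1"], ["a1", "p1"]]
}
"""


def parse_pqsg(text):
    """Return (kinds, questions, edges) from a class-partitioned JSON string."""
    data = json.loads(text)
    kinds, questions = {}, {}
    for kind in ("object", "action", "physics"):
        for node in data.get(kind, []):
            node_id = node["id"]
            if node_id in kinds:
                raise ValueError(f"duplicate node id: {node_id}")
            kinds[node_id] = kind
            questions[node_id] = node["question"]
    edges = [tuple(edge) for edge in data.get("edges", [])]
    for parent, child in edges:
        # Every edge endpoint must refer to a declared node.
        if parent not in kinds or child not in kinds:
            raise ValueError(f"edge references unknown node: {(parent, child)}")
    return kinds, questions, edges


kinds, questions, edges = parse_pqsg(raw)
```

Keeping nodes partitioned by class in the JSON makes the category of each question explicit, so no separate classification step is needed when the graph is loaded.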

For more symbolic problem contexts (e.g., diagram generation), scene graphs are typed, attributed, and heterogeneous, including nodes of objects, surfaces, actions, forces, spatial and constraint types, with deterministic chain-of-thought pipeline extraction (Haque et al., 28 May 2026).

3. Scoring, Evaluation, and Benchmark Datasets

The PQSG evaluation pipeline systematically queries each node in topological order, utilizing either a VLM QA model or human annotators. For each node $v \in V$:

If all parents have affirmative answers, query $q_v$ on the generated video (or image) and record the answer $s_v \in \{0, 1\}$.
Otherwise, set $s_v = 0$.

Per-category and overall scores are defined as:

$S_c = \frac{1}{|V_c|} \sum_{v \in V_c} s_v, \quad c \in \{o, a, p\}$

$S = \sum_{c \in \{o, a, p\}} w_c S_c$

with $w_c = |V_c| / |V|$ (thus, the score reduces to the mean $s_v$ across all nodes).
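The evaluation procedure can be sketched in Python as follows: nodes are visited in topological order, a node is only queried when all of its parents were answered "yes" (otherwise it is recorded as 0), each category score is the fraction of "yes" answers in that category, and the overall score is the mean over all nodes. The function name `evaluate_pqsg` and the `ask` callback (a stand-in for a VLM QA model or a human annotator) are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

CATEGORIES = ("object", "action", "physics")


def evaluate_pqsg(kinds, parents, ask):
    """kinds: node -> category; parents: node -> set of parent nodes;
    ask: callable(node) -> bool, a stand-in for the QA model or annotator.
    Assumes a non-empty graph whose parents all appear in `kinds`."""
    graph = {v: set(parents.get(v, ())) for v in kinds}
    answers = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(graph).static_order():
        if all(answers[u] == 1 for u in graph[v]):
            answers[v] = 1 if ask(v) else 0
        else:
            answers[v] = 0  # gated: a parent was answered "no"
    per_category = {}
    for c in CATEGORIES:
        nodes = [v for v in kinds if kinds[v] == c]
        if nodes:
            per_category[c] = sum(answers[v] for v in nodes) / len(nodes)
    overall = sum(answers.values()) / len(answers)
    return answers, per_category, overall


# Toy graph: the action fails, so both physics nodes are gated to 0
# even though the raw QA answers for them would have been "yes".
kinds = {"o1": "object", "a1": "action", "p1": "physics", "p2": "physics"}
parents = {"a1": {"o1"}, "p1": {"a1"}, "p2": {"a1"}}
raw_answers = {"o1": True, "a1": False, "p1": True, "p2": True}

answers, per_category, overall = evaluate_pqsg(kinds, parents, raw_answers.__getitem__)
# answers == {"o1": 1, "a1": 0, "p1": 0, "p2": 0}; overall == 0.25
```

Because a gated node is recorded as 0 before its own children are visited, a single "no" propagates to every descendant without any explicit descendant traversal.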

The FinePhyEval dataset enables rigorous benchmarking of PQSG on generated videos. It comprises 65 physics-focused prompts (from Physics-IQ), with videos synthesized by three state-of-the-art T2V models (Sora 2, Veo 3, Wan 2.1), annotated by humans for both categorical and holistic evaluation. Inter-annotator agreement is excellent (ICC = 0.84), validating the dataset’s reliability (Pothiraj et al., 24 Jun 2026).

4. Empirical Outcomes and Comparative Performance

Key empirical findings from PQSG evaluation on FinePhyEval and related datasets include:

Correlation with human judgments: PQSG scores exhibit higher Pearson correlation coefficients with human Likert ratings than prior metrics (e.g., r = 0.478 for PQSG with GPT-5.5, compared to 0.382 for Direct VQA and 0.346 for VideoPhy-2 AutoEval).
Category-wise agreement: PQSG achieves per-category correlations of 0.59 (object), 0.68 (action), and 0.48 (physics) in auto-QA, with improvements using human-QA.
Model ranking by physical realism: PQSG discriminates effectively between models, ranking Veo 3 (0.80 ± 0.018) and Sora 2 (0.78 ± 0.057) above Wan 2.1 (0.59 ± 0.042), consistent with human annotation trends.
Question generation and answering: Large VLMs attain high question-generation precision/recall (up to 95.2%), but lag in physics-specific answering (QA accuracy: 64.6% for GPT-5.5 in the physics category), suggesting limits in VLM physical reasoning (Pothiraj et al., 24 Jun 2026).
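For readers unfamiliar with the agreement measure reported above, the following sketch computes a Pearson correlation coefficient between per-video metric scores and human ratings. The numbers in the example are toy values invented purely for illustration and are not drawn from FinePhyEval; the function name `pearson_r` is likewise an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data only: hypothetical per-video metric scores and 1-5 Likert ratings.
toy_metric_scores = [0.20, 0.50, 0.80, 1.00, 0.60]
toy_human_ratings = [1, 3, 4, 5, 2]
r = pearson_r(toy_metric_scores, toy_human_ratings)
```

A value of r near 1 means the metric orders and spaces the videos much as the human raters do, which is the sense in which one metric is said to agree better with human judgment than another.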

Ablation studies demonstrate that the explicit dependency structure and fine-grained decomposition of PQSG are critical for maximizing agreement with human evaluators; coarser or edge-agnostic variants underperform by at least 0.04 in correlation.

5. PQSGs in Scientific and Diagrammatic Reasoning

PQSG methodology extends beyond video generation to disciplines involving complex, domain-specific physical scenes:

Diagram Generation (PhyDrawGen): Scene graphs are extracted with node types encoding objects, surfaces, actions, forces, spatial anchors, and constraints. These are converted via deterministic solvers into Planar Straight-Line Graphs (PSLGs) which enforce explicit physical laws (e.g., force closure, Snell’s Law, Gauss’s Law) (Haque et al., 28 May 2026). Propose–verify loops with vision-language correction models automatically amend any constraint violations, and iterative application leads to convergence in over 78% of perturbed samples.
Experimental Scene Understanding (PhysScene): PQSGs provide the foundation for modeling laboratory setups, capturing the high-density semantic relations and specialized node types inherent in experimental physics (e.g., Beaker, BunsenBurner, SpectrometerBody, Pulley) and their spatial, attribute, and human-object interaction predicates. Baseline scene graph methods show high predictive performance (up to Recall@100: 55.8 in PredCls) but also pronounced drops under novelty or open-vocabulary evaluation, underscoring domain complexity (Zou et al., 8 Jun 2026).
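The propose-verify idea mentioned for diagram generation can be illustrated with a deliberately simplified loop. This is not PhyDrawGen's solver: a net-force-zero check stands in for the physical-law constraints, a damped numeric update stands in for the vision-language correction model, and all function names, the step size, and the tolerance are assumptions.

```python
def net_force(forces):
    """Sum 2D force vectors given as a dict of name -> (fx, fy)."""
    return (sum(f[0] for f in forces.values()),
            sum(f[1] for f in forces.values()))


def verify(forces, tol=1e-6):
    # Toy constraint: the forces must balance (net force within tolerance of zero).
    rx, ry = net_force(forces)
    return (rx * rx + ry * ry) ** 0.5 <= tol


def correct(forces, adjustable, step=0.5):
    # Toy corrector: move one designated force part of the way toward balance.
    rx, ry = net_force(forces)
    fx, fy = forces[adjustable]
    updated = dict(forces)
    updated[adjustable] = (fx - step * rx, fy - step * ry)
    return updated


def propose_verify(forces, adjustable, max_iters=50):
    """Alternate verification and correction until the constraint holds
    or the iteration budget runs out. Returns (forces, iterations, converged)."""
    for iteration in range(max_iters):
        if verify(forces):
            return forces, iteration, True
        forces = correct(forces, adjustable)
    return forces, max_iters, verify(forces)


# A block resting on a surface, with a perturbed (too small) normal force.
proposal = {"weight": (0.0, -9.8), "normal": (0.0, 9.0)}
result, iterations, converged = propose_verify(proposal, adjustable="normal")
```

With a step of 0.5 the residual halves on every correction, so the loop converges geometrically; the iteration cap covers cases where a corrector fails to reduce the violation.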

Sample PQSGs encode both object configurations and procedural steps, enabling tasks such as visual question answering, anomaly detection, and procedural tutoring in physics education and analysis.

6. Limitations and Prospects for PQSG-based Evaluation

Current PQSGs are prompt- or scene-constrained, covering only entities and actions specified in the textual or visual context. They do not attempt open-ended inference about background or implicit physical phenomena. Rapid actions, nuanced material interactions, and non-explicit physics remain challenging for QA modules. The upper-bound correlation achievable with human QA (r = 0.80) indicates room for improvement in automated QA. Scaling PQSGs to unconstrained vision or multimodal tasks, and to non-physics STEM domains (e.g., chemistry or biology laboratories), represents an active area for future extension (Pothiraj et al., 24 Jun 2026, Zou et al., 8 Jun 2026).

7. Applications and Broader Impact

PQSGs enable fine-grained, interpretable, and logically grounded evaluation in a range of domains:

Graded and local analysis of physical law adherence in generative models, facilitating diagnosis and benchmarking.
Automated tutoring systems through multimodal scene-to-graph and instruction-to-graph mappings.
Visual question answering and procedural step derivation in complex experiment scenes.
Constraint-driven diagram generation, with semantic and physical correctness checks at every abstraction layer.

Through these properties, PQSGs furnish a standardized bridge between human intuitive evaluation, symbolic physical laws, and the outputs of emerging large-scale vision-language and language models, contributing to the development of physically reliable AI systems in science and education (Pothiraj et al., 24 Jun 2026, Haque et al., 28 May 2026, Zou et al., 8 Jun 2026).