DualXrayBench: Dual-View X-Ray Benchmark

Updated 30 November 2025
  • DualXrayBench is a dual-view X-ray benchmark that integrates top and side images to enable explicit geometric and semantic fusion for security inspection.
  • It comprises 45,613 annotated image pairs across 12 object categories, structured via a hierarchical chain-of-thought protocol.
  • The benchmark evaluates tasks like counting, recognition, and spatial reasoning using metrics such as Accuracy, F1, and mIoU, showcasing state-of-the-art gains.

DualXrayBench is a comprehensive, expert-verified benchmark for dual-view X-ray image understanding, supporting cross-modal and geometric-semantic reasoning for security inspection contexts. It establishes a structured evaluation suite that reflects the practical workflow of human inspectors, who routinely employ both top- and side-view X-ray imagery to detect prohibited items. Unlike prior approaches that rely on single-view visual input or augment visual QA with language, DualXrayBench formulates the second view as a "language-like modality," enabling explicit geometric and semantic fusion in automated inspection systems (Peng et al., 23 Nov 2025).

1. Dataset Composition and Annotation Protocol

DualXrayBench contains 45,613 dual-view X-ray image pairs, termed the DualXrayCap corpus, spanning 12 security-relevant object categories: Mobile Phone (MP), Orange Liquid (OL), Portable Charger (prismatic/cylindrical cell, PC1/PC2), Laptop (LA), Green Liquid (GL), Tablet (TA), Blue Liquid (BL), Columnar Orange Liquid (CO), Nonmetallic Lighter (NL), Umbrella (UM), and Columnar Green Liquid (CG).

Each pair is annotated via a hierarchical protocol, yielding:

  • scene_top: Free-form description of top-view image.
  • scene_side: Free-form description of side-view image.
  • objects: Array comprising per-object entries (category, normalized bounding box, and cross-view spatial descriptors such as “flat in top, tall in side”).
  • summary: Holistic description integrating complementary information from both views.

After human verification, these JSON-structured captions are transformed into Chain-of-Thought (CoT) sequences marked by three tokens, <top>, <side>, and <conclusion>, which prompt models to reason over (i) top-view evidence, (ii) side-view evidence, and (iii) the resulting semantic fusion.
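As an illustration, the conversion from a verified JSON record to a CoT sequence might look like the minimal sketch below; the field names mirror the schema above, while the exact serialization and token formatting used by the authors are assumptions.

```python
# Hypothetical sketch: turn a verified DualXrayCap record into a
# <top>/<side>/<conclusion> chain-of-thought string. The record layout
# mirrors the schema above; the serialization details are assumptions.
def record_to_cot(record: dict) -> str:
    object_notes = "; ".join(
        f"{obj['category']} at {obj['bbox']} ({obj['cross_view']})"
        for obj in record["objects"]
    )
    return (
        f"<top> {record['scene_top']} "
        f"<side> {record['scene_side']} "
        f"<conclusion> Objects: {object_notes}. {record['summary']}"
    )

example = {
    "scene_top": "A laptop lies flat next to a mobile phone.",
    "scene_side": "The laptop appears as a thin slab; the phone stands upright.",
    "objects": [
        {"category": "LA", "bbox": [0.12, 0.20, 0.55, 0.60],
         "cross_view": "flat in top, thin in side"},
        {"category": "MP", "bbox": [0.60, 0.25, 0.72, 0.48],
         "cross_view": "small in top, tall in side"},
    ],
    "summary": "A flat laptop and an upright mobile phone packed together.",
}
print(record_to_cot(example))
```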

A typical operational partition (no fixed public split) is used:

| Set | Data Type | Size |
|-----------------|--------------------|----------|
| GSXray train | CoT sequences | ~44,019 |
| GSXray val | CoT sequences | ~4,402 |
| DualXrayBench | QA evaluation set | 1,594 |

2. Diagnostic Tasks, Protocol, and Metrics

DualXrayBench encompasses 1,594 expert-validated dual-view visual question–answer (QA) pairs, spanning eight tasks categorized in four diagnostic families:

  1. Counting (CT): Infer count of objects of a class under occlusion.
  2. Object Recognition (OR): Identify category of partly occluded instances.
  3. Spatial Relation (SR): Judge relative positioning (e.g., ‘above/below’) between objects.
  4. Spatial Distance (SD): Estimate which object is closer along the depth axis.
  5. Occluded Area (OA): Localize occluded area via binary mask prediction.
  6. Contact–Occlusion (CO): Infer if object pairs are in contact or occluding each other.
  7. Placement Attribute (PA): Classify object pose (flat/upright/tilted).
  8. Spatial Attribute (SA): Determine an object's position within the bag (top/middle/bottom).

Each evaluation instance comprises a tuple (top-view image, side-view image, question), with either a discrete answer or a structured prediction as the target.
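For concreteness, a single evaluation instance might be represented as in the sketch below; the field names and answer formats are illustrative assumptions rather than the released schema.

```python
# Hypothetical representation of one DualXrayBench QA instance.
# Field names and answer formats are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class DualViewQA:
    top_image: str              # path to the top-view X-ray image
    side_image: str             # path to the side-view X-ray image
    task: str                   # one of CT, OR, SR, SD, OA, CO, PA, SA
    question: str
    answer: Union[str, int, List[List[int]]]  # label, count, or binary mask

sample = DualViewQA(
    top_image="pairs/000123_top.png",
    side_image="pairs/000123_side.png",
    task="SR",
    question="Is the mobile phone above or below the laptop?",
    answer="below",
)
```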

Evaluation employs the following metrics (a minimal implementation sketch is given after the list):

  • Accuracy (Acc): $\mathrm{Acc} = \frac{\text{Correct Predictions}}{\text{Total}} \times 100\%$
  • F1 Score: $\mathrm{F1} = \frac{2PR}{P+R} \times 100\%$, with $P$ (precision) and $R$ (recall).
  • Mean Intersection over Union (mIoU): for segmentation and spatial ordering tasks, $\mathrm{IoU}_i = \frac{|\mathrm{pred}_i \cap \mathrm{gt}_i|}{|\mathrm{pred}_i \cup \mathrm{gt}_i|}$ and $\mathrm{mIoU} = \frac{1}{N} \sum_i \mathrm{IoU}_i$.
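The metrics admit straightforward implementations. The sketch below assumes discrete labels for Acc and F1 and binary NumPy masks for mIoU; it is not the official scorer.

```python
# Minimal sketch of the evaluation metrics; assumes discrete labels and
# binary masks, not the official DualXrayBench scoring code.
import numpy as np

def accuracy(preds, gts):
    return 100.0 * np.mean([p == g for p, g in zip(preds, gts)])

def f1_score(preds, gts, positive):
    tp = sum(p == positive and g == positive for p, g in zip(preds, gts))
    fp = sum(p == positive and g != positive for p, g in zip(preds, gts))
    fn = sum(p != positive and g == positive for p, g in zip(preds, gts))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 100.0 * 2 * precision * recall / denom if denom else 0.0

def mean_iou(pred_masks, gt_masks):
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        ious.append(np.logical_and(pred, gt).sum() / union if union else 1.0)
    return 100.0 * float(np.mean(ious))
```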

3. Chain-of-Thought Supervision and the GSXray Dataset

Leveraging the annotated DualXrayCap corpus, 44,019 dual-view QA samples are generated to form the GSXray training set. For each QA instance, a three-stage Chain-of-Thought annotated sequence is created:

  • <top>: Geometric and semantic cues derived from the top view.
  • <side>: Corresponding cues from the side view.
  • <conclusion>: Integrated inference yielding the answer.

During supervised fine-tuning, LLMs are conditioned on these CoT stages, thereby enforcing a reasoning trajectory that begins with separate geometric interpretations and concludes with informed semantic inference.
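A minimal sketch of how one GSXray training example might be assembled for supervised fine-tuning follows; the prompt template, the `-100` ignore index, and the `tokenizer.encode` interface are assumptions about a typical HuggingFace-style setup rather than the authors' exact pipeline.

```python
# Hypothetical sketch of one GSXray SFT example: the question is the prompt,
# the three CoT stages form the supervised target, and prompt tokens are
# masked out of the loss (-100 is the conventional ignore index).
def build_sft_example(tokenizer, question, cot_top, cot_side, cot_conclusion):
    prompt = f"Question: {question}\nAnswer: "
    target = f"<top> {cot_top} <side> {cot_side} <conclusion> {cot_conclusion}"
    prompt_ids = tokenizer.encode(prompt)
    target_ids = tokenizer.encode(target)
    input_ids = prompt_ids + target_ids
    labels = [-100] * len(prompt_ids) + target_ids  # loss only on CoT tokens
    return {"input_ids": input_ids, "labels": labels}
```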

4. Geometric–Semantic Reasoner (GSR) Model Architecture

The GSR model architecture fuses dual-view vision and language modalities as follows:

  • Vision Encoder $E$ (ViT-L/14 with 3D convolutional patch embedding, shared weights across views):

$$\mathbf{f}_{\text{top}} = E(\mathbf{x}_{\text{top}}), \quad \mathbf{f}_{\text{side}} = E(\mathbf{x}_{\text{side}})$$

where $\mathbf{f}_i \in \mathbb{R}^{m \times n}$ are dense visual tokens.

  • Adapter $A$ (projection module):

$$\mathbf{f}'_{\text{top}} = A(\mathbf{f}_{\text{top}}), \quad \mathbf{f}'_{\text{side}} = A(\mathbf{f}_{\text{side}})$$

projecting the features of both views into the LLM's embedding space.

  • Language Reasoner $L$ (Qwen3-VL-MoE decoder):

$$[\langle \mathrm{top} \rangle\, \mathbf{f}'_{\text{top}},\ \langle \mathrm{side} \rangle\, \mathbf{f}'_{\text{side}},\ \langle \mathrm{conclusion} \rangle\, \text{Question}]$$

is processed to autoregressively generate the answer.

  • Training Objective:

The total loss combines cross-entropy for semantic output and optional geometric contrastive loss for top/side feature alignment:

$$\mathcal{L} = \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}} + \lambda_{\mathrm{sem}} \mathcal{L}_{\mathrm{sem}}$$

with geometric contrastive loss defined as

$$\mathcal{L}_{\mathrm{geo}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(\mathbf{f}'_{\text{top},i}, \mathbf{f}'_{\text{side},i})/\tau)}{\sum_j \exp(\mathrm{sim}(\mathbf{f}'_{\text{top},i}, \mathbf{f}'_{\text{side},j})/\tau)}$$

and answer generation cross-entropy

$$\mathcal{L}_{\mathrm{sem}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, \mathbf{f}'_{\text{top}}, \mathbf{f}'_{\text{side}})$$
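A compact PyTorch sketch of this combined objective is given below; the InfoNCE-style formulation, the pooling of each view to a single feature vector, and the default weights are assumptions consistent with the formulas above, not the released training code.

```python
# Sketch of the combined objective: an InfoNCE-style contrastive loss that
# aligns pooled top/side features, plus token-level cross-entropy on the
# generated answer. Pooling, weights, and temperature are assumptions.
import torch
import torch.nn.functional as F

def geometric_contrastive_loss(f_top, f_side, tau=0.07):
    # f_top, f_side: [N, d] pooled, projected view features for N image pairs
    z_top = F.normalize(f_top, dim=-1)
    z_side = F.normalize(f_side, dim=-1)
    logits = z_top @ z_side.t() / tau           # cosine similarity / temperature
    targets = torch.arange(f_top.size(0), device=f_top.device)
    return F.cross_entropy(logits, targets)     # matched pair i <-> i is positive

def total_loss(answer_logits, answer_targets, f_top, f_side,
               lambda_geo=0.1, lambda_sem=1.0):
    # answer_logits: [B, T, V]; answer_targets: [B, T]
    l_sem = F.cross_entropy(answer_logits.flatten(0, 1), answer_targets.flatten())
    l_geo = geometric_contrastive_loss(f_top, f_side)
    return lambda_geo * l_geo + lambda_sem * l_sem
```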

A salient aspect is treating the side-view tokens, wrapped in the <side> marker, analogously to a textual sentence—thereby forcing the LLM to treat multiple views as distinct but complementary “paragraphs,” with final semantic fusion cued by the <conclusion> token.
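The sketch below illustrates this interleaving in schematic PyTorch: a shared encoder, a projection adapter, and marker embeddings for <top>, <side>, and <conclusion> prepended to each view's token block. Module names, tensor shapes, and the `inputs_embeds` interface are illustrative assumptions, not the released GSR implementation.

```python
# Schematic sketch of the GSR forward pass: a shared vision encoder, a
# projection adapter, and marker embeddings wrapping each view's tokens,
# mirroring [<top> f'_top, <side> f'_side, <conclusion> question].
# Placeholder modules stand in for ViT-L/14 and the Qwen3-VL-MoE decoder.
import torch
import torch.nn as nn

class DualViewGSR(nn.Module):
    def __init__(self, vision_encoder, adapter, llm, marker_embeds):
        super().__init__()
        self.encoder = vision_encoder   # shared weights for top and side views
        self.adapter = adapter          # maps visual tokens to the LLM embedding space
        self.llm = llm                  # autoregressive language reasoner
        self.marker = marker_embeds     # dict of [1, 1, d] embeddings for the CoT markers

    def forward(self, x_top, x_side, question_embeds):
        f_top = self.adapter(self.encoder(x_top))     # [B, m, d]
        f_side = self.adapter(self.encoder(x_side))   # [B, m, d]
        B = x_top.size(0)
        seq = torch.cat([
            self.marker["top"].expand(B, -1, -1), f_top,
            self.marker["side"].expand(B, -1, -1), f_side,
            self.marker["conclusion"].expand(B, -1, -1), question_embeds,
        ], dim=1)
        # The decoder consumes the interleaved sequence and generates the answer.
        return self.llm(inputs_embeds=seq)
```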

5. Benchmarking and Quantitative Performance

DualXrayBench provides baselines from state-of-the-art multimodal models for question answering under security-inspection constraints. GSR-8B achieves substantial improvements in both semantic and geometric tasks compared with off-the-shelf LLM-based baselines. Reported mean results:

| Metric | Best Baseline (Qwen3-VL-235B) | GSR-8B |
|--------------|-------------------------------|--------|
| Accuracy (%) | 58.8 | 65.4 |
| F1 (%) | 65.5 | 70.6 |
| mIoU (%) | 26.0 | 52.3 |

Per-task accuracy (best baseline vs. GSR-8B):

| Task | Best Baseline (%) | GSR-8B (%) |
|------|------------------------|------------|
| CT | 56.3 (Gemini-2.5-Pro) | 63.5 |
| OR | 55.8 (Qwen3-VL-235B) | 65.8 |
| SR | 50.5 (GPT-o3) | 44.0¹ |
| SD | 61.5 (Gemini-2.5-Pro) | 61.0 |
| OA | 56.6 (Qwen3-VL-235B) | 67.9 |
| CO | 74.6 (Gemini-2.5-Pro) | 80.4 |
| PA | 64.3 (Gemini-2.5-Pro) | 64.4 |
| SA | 79.0 (Qwen3-VL-235B) | 76.5 |

¹ On SR, GSR-8B's raw accuracy trails the open-vocabulary LLM baselines, but its F1 consistency is superior.

Overall, GSR-8B improves on the strongest baseline by +6.6 accuracy points, +5.1 F1 points, and +26.3 mIoU points, and outperforms the best per-task baseline on 5 of the 8 sub-tasks.

6. Implications and Distinctive Contributions

DualXrayBench establishes, for the first time, a rigorous multi-view multimodal X-ray evaluation protocol, supplementing previous paradigms that were limited to single-view images or language–visual fusion. The approach of treating a second-view image as a “language-like modality,” structured with <top>, <side>, and <conclusion> chain-of-thought stages, introduces explicit geometric–semantic interplay, enabling models to tackle practical inspection tasks with greater fidelity to human expert workflow.

A plausible implication is that this framework generalizes to multi-view reasoning in other domains beyond X-ray security, wherever structured spatial and semantic fusion is integral to expert decision-making (Peng et al., 23 Nov 2025).
