DualXrayBench: Dual-View X-Ray Benchmark
- DualXrayBench is a dual-view X-ray benchmark that integrates top and side images to enable explicit geometric and semantic fusion for security inspection.
- It comprises 45,613 annotated image pairs across 12 object categories, structured via a hierarchical chain-of-thought protocol.
- The benchmark evaluates tasks such as counting, recognition, and spatial reasoning with Accuracy, F1, and mIoU metrics, on which the accompanying GSR-8B model reports state-of-the-art gains.
DualXrayBench is a comprehensive, expert-verified benchmark for dual-view X-ray image understanding, supporting cross-modal and geometric-semantic reasoning for security inspection contexts. It establishes a structured evaluation suite that reflects the practical workflow of human inspectors, who routinely employ both top- and side-view X-ray imagery to detect prohibited items. Unlike prior approaches that rely on single-view visual input or augment visual QA with language, DualXrayBench formulates the second view as a "language-like modality," enabling explicit geometric and semantic fusion in automated inspection systems (Peng et al., 23 Nov 2025).
1. Dataset Composition and Annotation Protocol
DualXrayBench contains 45,613 dual-view X-ray image pairs, termed the DualXrayCap corpus, spanning 12 security-relevant object categories: Mobile Phone (MP), Orange Liquid (OL), Portable Charger with prismatic cell (PC1), Portable Charger with cylindrical cell (PC2), Laptop (LA), Green Liquid (GL), Tablet (TA), Blue Liquid (BL), Columnar Orange Liquid (CO), Nonmetallic Lighter (NL), Umbrella (UM), and Columnar Green Liquid (CG).
Each pair is annotated via a hierarchical protocol, yielding:
- scene_top: Free-form description of top-view image.
- scene_side: Free-form description of side-view image.
- objects: Array comprising per-object entries (category, normalized bounding box, and cross-view spatial descriptors such as “flat in top, tall in side”).
- summary: Holistic description integrating complementary information from both views.
After human verification, these JSON-structured captions are transformed into Chain-of-Thought (CoT) sequences marked by three tokens, <top>, <side>, and <conclusion>, which prompt models to reason over (i) top-view evidence, (ii) side-view evidence, and (iii) the resulting semantic fusion.
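The following is a minimal sketch of how a verified annotation could be serialized into such a CoT sequence. The field names `scene_top`, `scene_side`, `objects`, and `summary` follow the annotation schema above; the `bbox` and `cross_view` keys, the function name, and the exact textual template are assumptions for illustration, not the authors' released tooling.

```python
def caption_to_cot(ann: dict) -> str:
    """Convert a DualXrayCap-style JSON annotation into a three-stage
    Chain-of-Thought sequence marked by <top>, <side>, <conclusion>.
    Field names beyond the documented schema are illustrative."""
    object_notes = "; ".join(
        f"{o['category']} at {o['bbox']} ({o['cross_view']})" for o in ann["objects"]
    )
    return (
        f"<top> {ann['scene_top']} "
        f"<side> {ann['scene_side']} Objects: {object_notes}. "
        f"<conclusion> {ann['summary']}"
    )

example = {
    "scene_top": "A laptop lies flat beside a mobile phone.",
    "scene_side": "The laptop appears as a thin slab; the phone stands upright.",
    "objects": [
        {"category": "LA", "bbox": [0.12, 0.20, 0.55, 0.60],
         "cross_view": "flat in top, thin slab in side"},
        {"category": "MP", "bbox": [0.60, 0.25, 0.72, 0.55],
         "cross_view": "small in top, tall in side"},
    ],
    "summary": "One laptop lying flat and one mobile phone standing upright beside it.",
}
print(caption_to_cot(example))
```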
A typical operational partition (no fixed public split) is used:

| Set | Data Type | Size |
|---|---|---|
| GSXray train | CoT sequences | ~44,019 |
| GSXray val | CoT sequences | ~4,402 |
| DualXrayBench | QA evaluation set | 1,594 |
2. Diagnostic Tasks, Protocol, and Metrics
DualXrayBench encompasses 1,594 expert-validated dual-view visual question–answer (QA) pairs, spanning eight tasks categorized in four diagnostic families:
- Counting (CT): Infer count of objects of a class under occlusion.
- Object Recognition (OR): Identify category of partly occluded instances.
- Spatial Relation (SR): Judge relative positioning (e.g., ‘above/below’) between objects.
- Spatial Distance (SD): Estimate which object is closer along the depth axis.
- Occluded Area (OA): Localize occluded area via binary mask prediction.
- Contact–Occlusion (CO): Infer if object pairs are in contact or occluding each other.
- Placement Attribute (PA): Classify object pose (flat/upright/tilted).
- Spatial Attribute (SA): Determine bag position (top/middle/bottom) of an object.
Each evaluation instance comprises a tuple (top-view image, side-view image, question), with discrete answer or structured prediction.
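As an illustration of this tuple structure, a single evaluation instance could be represented as follows; the class and field names are assumptions for clarity, not the benchmark's official schema.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class DualViewQA:
    top_image: str            # path to the top-view X-ray image
    side_image: str           # path to the side-view X-ray image
    question: str             # task-specific question text
    task: str                 # one of CT, OR, SR, SD, OA, CO, PA, SA
    answer: Union[str, list]  # discrete answer or structured prediction (e.g., a mask or box)
```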
Evaluation employs:
- Accuracy (Acc): $\mathrm{Acc} = \frac{\#\,\text{correct answers}}{\#\,\text{questions}}$.
- F1 Score: $F_1 = \frac{2PR}{P + R}$, with $P$ (precision) and $R$ (recall).
- Mean Intersection over Union (mIoU): for segmentation and spatial-ordering tasks, $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ and $\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{IoU}(A_i, B_i)$.
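A short sketch of these metrics under their standard definitions is given below; the benchmark's official evaluation scripts may aggregate differently (e.g., macro-averaging F1 across classes), so this is illustrative only.

```python
import numpy as np

def accuracy(preds, golds):
    """Fraction of exactly matching discrete answers."""
    return float(np.mean([p == g for p, g in zip(preds, golds)]))

def f1_score(preds, golds, positive):
    """Binary F1 = 2PR / (P + R) for a given positive label."""
    tp = sum(p == positive == g for p, g in zip(preds, golds))
    fp = sum(p == positive != g for p, g in zip(preds, golds))
    fn = sum(g == positive != p for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def miou(pred_masks, gold_masks):
    """Mean IoU over binary masks (e.g., Occluded Area predictions)."""
    ious = []
    for p, g in zip(pred_masks, gold_masks):
        p, g = p.astype(bool), g.astype(bool)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(ious))
```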
3. Chain-of-Thought Supervision and the GSXray Dataset
Leveraging the annotated DualXrayCap corpus, 44,019 dual-view QA samples are generated to form the GSXray training set. For each QA instance, a three-stage Chain-of-Thought annotated sequence is created:
- <top>: Geometric and semantic cues derived from the top view.
- <side>: Corresponding cues from the side view.
- <conclusion>: Integrated inference yielding the answer.
During supervised fine-tuning, LLMs are conditioned on these CoT stages, thereby enforcing a reasoning trajectory that begins with separate geometric interpretations and concludes with informed semantic inference.
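A minimal sketch of this conditioning, assuming a HuggingFace-style tokenizer, is shown below: the prompt tokens are masked out so that cross-entropy is computed only on the <top>/<side>/<conclusion> reasoning and the final answer. The function name and masking convention are assumptions, not the authors' released training pipeline.

```python
def build_sft_example(tokenizer, prompt: str, cot_target: str, ignore_index: int = -100):
    """Concatenate the dual-view prompt and the three-stage CoT target;
    label prompt positions with ignore_index so the loss covers only the
    <top>/<side>/<conclusion> reasoning tokens."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(cot_target + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + target_ids
    labels = [ignore_index] * len(prompt_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}
```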
4. Geometric–Semantic Reasoner (GSR) Model Architecture
The GSR model architecture fuses dual-view vision and language modalities as follows:
- Vision Encoder (ViT-L/14 with 3D conv. patch embedding, shared weights across views): $V_{\mathrm{top}} = E_v(I_{\mathrm{top}})$ and $V_{\mathrm{side}} = E_v(I_{\mathrm{side}})$, where $V_{\mathrm{top}}$ and $V_{\mathrm{side}}$ are dense visual tokens.
- Alignment Module (2-layer MLP + GeLU): $H_{v} = W_2\,\mathrm{GeLU}(W_1 V_{v})$ for $v \in \{\mathrm{top}, \mathrm{side}\}$, projecting visual features into the LLM's embedding space.
- Language Reasoner (Qwen3-VL-MoE decoder): the interleaved sequence $[\texttt{<top>};\, H_{\mathrm{top}};\, \texttt{<side>};\, H_{\mathrm{side}};\, Q]$ is processed to autoregressively generate the answer.
- Training Objective:
The total loss combines cross-entropy for semantic output and an optional geometric contrastive loss for top/side feature alignment, $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{geo}}$,
with the geometric contrastive loss (written here in a standard InfoNCE form over paired top/side features) $\mathcal{L}_{\mathrm{geo}} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(h_i^{\mathrm{top}}, h_i^{\mathrm{side}})/\tau)}{\sum_{j=1}^{B}\exp(\mathrm{sim}(h_i^{\mathrm{top}}, h_j^{\mathrm{side}})/\tau)}$
and the answer-generation cross-entropy $\mathcal{L}_{\mathrm{CE}} = -\sum_{t} \log p_\theta(y_t \mid y_{<t}, H_{\mathrm{top}}, H_{\mathrm{side}}, Q)$; a schematic sketch of this objective follows the list.
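The PyTorch sketch below combines the two terms under the assumptions above (mean-pooled, L2-normalized per-view features and an InfoNCE-style contrastive term); weights, temperature, and tensor shapes are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def gsr_training_loss(logits, labels, h_top, h_side, lam=0.1, tau=0.07):
    """Hypothetical combination of answer cross-entropy and a top/side
    contrastive alignment term; hyperparameters are illustrative."""
    # Answer-generation cross-entropy (ignore_index masks prompt tokens).
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                         ignore_index=-100)
    # Pooled, L2-normalized per-view features for the geometric term.
    zt = F.normalize(h_top.mean(dim=1), dim=-1)   # (B, d)
    zs = F.normalize(h_side.mean(dim=1), dim=-1)  # (B, d)
    sim = zt @ zs.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(zt.size(0), device=zt.device)
    geo = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
    return ce + lam * geo
```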
A salient aspect is treating the side-view tokens, wrapped in the <side> marker, analogously to a textual sentence—thereby forcing the LLM to treat multiple views as distinct but complementary “paragraphs,” with final semantic fusion cued by the <conclusion> token.
5. Benchmarking and Quantitative Performance
DualXrayBench provides comprehensive baselining of state-of-the-art multimodal question-answering models under security inspection constraints. GSR-8B achieves substantial improvements in both semantic and geometric tasks compared with off-the-shelf LLM-based baselines. Reported mean results:
| Metric | Best Baseline (Qwen3-VL-235B) | GSR-8B |
|---|---|---|
| Accuracy (%) | 58.8 | 65.4 |
| F1 (%) | 65.5 | 70.6 |
| mIoU (%) | 26.0 | 52.3 |
Per-task accuracy (best baseline vs. GSR-8B):
| Task | Baseline (%) | GSR-8B (%) |
|---|---|---|
| CT | 56.3 (Gemini-2.5-Pro) | 63.5 |
| OR | 55.8 (Qwen3-VL-235B) | 65.8 |
| SR | 50.5 (GPT-o3) | 44.0¹ |
| SD | 61.5 (Gemini-2.5-Pro) | 61.0 |
| OA | 56.6 (Qwen3-VL-235B) | 67.9 |
| CO | 74.6 (Gemini-2.5-Pro) | 80.4 |
| PA | 64.3 (Gemini-2.5-Pro) | 64.4 |
| SA | 79.0 (Qwen3-VL-235B) | 76.5 |
¹ On SR, GSR-8B's raw accuracy trails the open-vocabulary LLM baselines, but its F1 remains superior, indicating more consistent predictions.
Overall, GSR-8B improves on the strongest baseline by +6.6 points in accuracy, +5.1 points in F1, and +26.3 points in mIoU, achieving state-of-the-art results on 6 of 8 sub-tasks.
6. Implications and Distinctive Contributions
DualXrayBench establishes, for the first time, a rigorous multi-view multimodal X-ray evaluation protocol, supplementing previous paradigms that were limited to single-view images or language–visual fusion. The approach of treating a second-view image as a “language-like modality,” structured with <top>, <side>, and <conclusion> chain-of-thought stages, introduces explicit geometric–semantic interplay, enabling models to tackle practical inspection tasks with greater fidelity to human expert workflow.
A plausible implication is that this framework generalizes to multi-view reasoning in other domains beyond X-ray security, wherever structured spatial and semantic fusion is integral to expert decision-making (Peng et al., 23 Nov 2025).