GSXray: Dual-View X-Ray Dataset
- The GSXray dataset is a dual-view X-ray corpus that pairs top- and side-view images with structured chain-of-thought supervision and detailed bounding-box annotations.
- It features 44,019 samples partitioned for training, validation, and testing, covering 12 object categories with precise labeling and occlusion metadata.
- The design leverages the side view as a language-like modality, significantly enhancing multimodal model performance on security inspection benchmarks.
The GSXray dataset is a dual-view, fine-grained corpus designed to advance cross-view geometric reasoning and cross-modal semantic understanding for X-ray prohibited item inspection. Built on top of the DualXrayCap caption corpus and the LDXray dual-view imagery, GSXray provides paired top- and side-view X-ray images, structured chain-of-thought (CoT) supervision, detailed bounding-box annotations, and granular object categorizations. Its architecture facilitates the supervision of large multimodal models that incorporate geometric and semantic information, including reasoning where the second view is treated as a language-like modality. GSXray underpins all diagnostic tasks and benchmarks in DualXrayBench, enabling rigorous evaluation and reproducible research in automated security inspection (Peng et al., 23 Nov 2025).
1. Dataset Composition and Partitioning
GSXray comprises 44,019 dual-view samples, each consisting of:
- One top-view X-ray image
- One side-view X-ray image
The dataset is constructed by selecting paired imagery from the LDXray dual-view X-ray set and leveraging caption annotations from DualXrayCap (45,613 dual-view caption pairs; LDXray contains 146,997 pairs in total). GSXray samples are partitioned into 80% training (35,215 samples), 10% validation (4,402 samples), and 10% test (4,402 samples) splits.
GSXray covers 12 object categories, with the following total instance counts (across both views):
| Category | Abbreviation | Instance Count |
|---|---|---|
| Mobile Phone | MP | 59,003 |
| Orange Liquid | OL | 18,684 |
| Portable Charger 1 | PC1 | 9,998 |
| Portable Charger 2 | PC2 | 17,274 |
| Laptop | LA | 13,453 |
| Green Liquid | GL | 11,482 |
| Tablet | TA | 6,796 |
| Blue Liquid | BL | 1,677 |
| Columnar Orange Liquid | CO | 513 |
| Nonmetallic Lighter | NL | 487 |
| Umbrella | UM | 298 |
| Columnar Green Liquid | CG | 296 |
Each image pair contains an average of 3.07 objects, and occlusion is common: 22,772 pairs contain overlapping objects as measured by intersection-over-minimum (IoM).
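For reference, intersection-over-minimum can be computed for two axis-aligned boxes as in the sketch below (a minimal illustration assuming an (x1, y1, x2, y2) box format; the occlusion threshold used to count the 22,772 pairs is not specified here):

```python
def iom(box_a, box_b):
    """Intersection-over-minimum for two axis-aligned boxes (x1, y1, x2, y2).

    IoM divides the intersection area by the smaller of the two box areas,
    so a small object fully covered by a larger one yields IoM = 1.0.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / min(area_a, area_b)
```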
2. Annotation Schema and Chain-of-Thought Supervision
Each GSXray sample is encapsulated in a single JSON object comprising:
"pair_id": Unique sample identifier"top_image"/"side_image": Filepaths for top and side images"bboxes": List of object bounding boxes (each as category label + rectangle coordinates)"cot": Chain-of-thought fields with three components:"<top>": Scene description for the top view"<side>": Scene description for the side view"<conclusion>": Unified semantic summary and inferred relationships
"question"/"answer": Diagnostic inspection query and ground-truth response
CoT sequences were generated via prompting LLMs (Qwen3-VL and GPT-4o), automatically filtered for fact coverage and alignment, then verified by humans. The CoT structure explicitly separates geometric perception (via "<top>" and "<side>" fields) from scene-level semantic fusion ("<conclusion>").
Bounding boxes are specified as axis-aligned rectangles with normalized coordinates. Annotators followed the LDXray labeling protocol: each box is the minimal rectangle covering its object, and overlaps and occlusions are labeled per view.
3. Cross-View and Cross-Modal Modeling Design
GSXray operationalizes the side view as a language-like modality, motivated by the complementary spatial information human inspectors derive from both views. This is reflected in the tokenization pipeline, where side-view visual features are prefixed with a special "<side>" reasoning token analogous to textual embeddings.
The multimodal inference architecture comprises:
- A ViT-L/14 vision encoder, shared across the top and side views, which produces visual tokens for each image;
- A two-layer MLP projector that maps visual tokens into the LLM embedding space;
- A Qwen3-VL-MoE language decoder conditioned on the structured multimodal input.
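The following PyTorch-style sketch illustrates how such a composition could be wired together (module names, dimensions, and the learnable "<side>" prefix token are assumptions for exposition, not the released implementation):

```python
import torch
import torch.nn as nn


class DualViewAdapter(nn.Module):
    """Sketch: encode both views with a shared backbone, project into the LLM
    embedding space, and prefix the side-view stream with a reasoning token."""

    def __init__(self, vision_encoder, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.encoder = vision_encoder                     # shared for top and side views
        self.projector = nn.Sequential(                   # two-layer MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.side_token = nn.Parameter(torch.zeros(1, 1, llm_dim))  # learnable "<side>" prefix

    def forward(self, img_top, img_side):
        # Assumes the encoder returns patch tokens of shape (B, N, vis_dim).
        top_tok = self.projector(self.encoder(img_top))
        side_tok = self.projector(self.encoder(img_side))
        prefix = self.side_token.expand(side_tok.size(0), -1, -1)
        side_tok = torch.cat([prefix, side_tok], dim=1)   # side view treated as a language-like stream
        return top_tok, side_tok
```

The resulting visual tokens would then be interleaved with the text embeddings of the question and CoT fields before being passed to the decoder.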
Model training minimizes a cross-entropy loss over the generated CoT sequence tokens and the final answer. Geometry consistency is optionally encouraged by enforcing alignment between top- and side-view tokens; no additional auxiliary losses are introduced.
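Schematically, with notation introduced here for exposition, the objective is standard next-token prediction over the concatenated CoT and answer tokens:

$$
\mathcal{L}_{\text{CE}} \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, v^{\text{top}},\, v^{\text{side}},\, q\right),
$$

where $y_{1:T}$ concatenates the "<top>", "<side>", and "<conclusion>" fields and the final answer, $q$ is the question, and $v^{\text{top}}$, $v^{\text{side}}$ are the projected visual tokens of the two views.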
4. Supported Tasks and Benchmarking Protocols
GSXray's CoT supervision is tied to eight diagnostic tasks defined in DualXrayBench:
- Counting (CT)
- Object Recognition (OR)
- Spatial Relation (SR)
- Spatial Distance (SD)
- Occluded Area Recognition (OA)
- Contact-Occlusion Judgment (CO)
- Placement Attribute (PA)
- Spatial Attribute (SA)
Each sample encodes a question-answer pair corresponding to these inspection challenges. Evaluation employs accuracy (Acc) for classification/QA, F1-Score for multi-label tasks, and mean Intersection-over-Union (mIoU) for spatial correspondence assessment.
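A minimal sketch of how these metrics could be computed is shown below (the metric implementations and data layouts are assumptions, using scikit-learn for Acc/F1 and a plain box-IoU average for mIoU):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def box_iou(a, b):
    """IoU for axis-aligned boxes (x1, y1, x2, y2) in normalized coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds, golds, pred_boxes, gold_boxes):
    acc = accuracy_score(golds, preds)            # classification / QA tasks
    f1 = f1_score(golds, preds, average="macro")  # macro-averaged F1 assumed here
    miou = float(np.mean([box_iou(p, g) for p, g in zip(pred_boxes, gold_boxes)]))
    return acc, f1, miou
```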
Empirical comparisons show that fine-tuning with GSXray substantially improves task performance, as demonstrated on the test split:
| Model Variant | Acc (%) | F1 (%) | mIoU (%) |
|---|---|---|---|
| Qwen3-VL-8B (no fine-tuning) | 53.5 | 56.6 | 25.4 |
| GSR-8B (fine-tuned) | 65.4 | 70.6 | 52.3 |
This suggests that leveraging the dual-view CoT and geometric-semantic structure of GSXray yields significant enhancements in both reasoning and spatial localization (Peng et al., 23 Nov 2025).
5. Preprocessing, Access, and Implementation
Images are uniformly resized to a fixed resolution, preserving aspect ratio with zero padding as needed. Bounding-box coordinates are already normalized; rescaling to the local pixel grid can be performed if required by downstream models. Tokenization uses the Qwen3 tokenizer, with reserved IDs for the special CoT tokens ("<top>", "<side>", "<conclusion>").
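A possible preprocessing pipeline consistent with this description is sketched below (the target resolution, padding strategy, and special-token registration are assumptions rather than the released preprocessing code):

```python
import torchvision.transforms as T
from PIL import Image


def pad_to_square(img, fill=0):
    """Zero-pad a PIL image to a square canvas so resizing preserves aspect ratio."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new(img.mode, (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas


IMG_SIZE = 448  # placeholder resolution; substitute the value specified in the release

transform = T.Compose([
    T.Lambda(pad_to_square),
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
])

# Reserved CoT tokens can be registered with the Qwen3 tokenizer, e.g.:
# tokenizer.add_special_tokens({"additional_special_tokens": ["<top>", "<side>", "<conclusion>"]})
```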
A sample PyTorch dataset implementation is provided:
```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class GSXrayDataset(Dataset):
    """Loads paired top/side images, CoT text, and QA fields from a GSXray JSON list."""

    def __init__(self, json_list, img_root, tokenizer, transform=None):
        with open(json_list) as f:
            self.records = json.load(f)
        self.img_root = img_root
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]
        img_top = Image.open(os.path.join(self.img_root, rec['top_image']))
        img_side = Image.open(os.path.join(self.img_root, rec['side_image']))
        if self.transform:
            img_top = self.transform(img_top)
            img_side = self.transform(img_side)
        # Assemble the CoT supervision string, prefixing each field with its reserved token.
        cot_input = ("<top> " + rec['cot']['<top>'] +
                     " <side> " + rec['cot']['<side>'] +
                     " <conclusion> " + rec['cot']['<conclusion>'])
        tokens = self.tokenizer(cot_input, return_tensors='pt')
        return {
            'img_top': img_top,
            'img_side': img_side,
            'tokens': tokens,
            'question': rec['question'],
            'answer': rec['answer'],
            'bboxes': rec['bboxes'],
        }
```
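A minimal usage example under these assumptions (file paths and the tokenizer checkpoint are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name; substitute the Qwen3 tokenizer used in the release.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
dataset = GSXrayDataset("gsxray_train.json", "images/", tokenizer,
                        transform=None)  # plug in the preprocessing transform as needed

sample = dataset[0]
print(sample['question'], '->', sample['answer'])
print(sample['tokens']['input_ids'].shape)
```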
Recommended fine-tuning uses LLaMA-Factory or the HuggingFace Trainer with the AdamW optimizer (cosine learning-rate decay, 10% warmup), bfloat16 mixed precision, an effective batch size of 256 (with gradient accumulation as needed), and 2–3 epochs for convergence.
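As an illustration, this schedule roughly maps onto HuggingFace `TrainingArguments` as follows (the learning rate and per-device batch size are placeholders; only the scheduler shape, precision, warmup, and effective batch size follow the text above):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gsxray-finetune",
    num_train_epochs=3,                 # 2-3 epochs reported for convergence
    per_device_train_batch_size=8,      # placeholder; combined with accumulation below
    gradient_accumulation_steps=32,     # 8 x 32 = 256 effective batch size (single device assumed)
    learning_rate=2e-5,                 # placeholder value
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # 10% warmup
    bf16=True,                          # bfloat16 mixed precision
    optim="adamw_torch",
    logging_steps=50,
)
```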
GSXray is released under a CC-BY-4.0 license and is accessible at https://github.com/BJTU-DualXray/GSXray and via AWS S3 at s3://bjtu-dualxray/gsxray.tar.gz.
6. Research Utility and Reproducibility
GSXray enables direct fine-tuning of large multimodal models for chain-of-thought reasoning over dual-view data, advancing cross-view geometric and cross-modal semantic tasks in X-ray security inspection. It supports reproducible benchmarking on all DualXrayBench tasks and aligns with protocols established in recent work on geometric–semantic alignment in multimodal learning (Peng et al., 23 Nov 2025).
With its structured annotation schema, comprehensive object and occlusion coverage, and explicit CoT supervision, GSXray provides a foundation for developing and evaluating advanced models for real-world multi-view inspection and reasoning applications.