GSXray: Dual-View X-Ray Dataset
- The GSXray dataset is a dual-view X-ray corpus that pairs top- and side-view images with structured chain-of-thought supervision and detailed bounding-box annotations.
- It features 44,019 samples partitioned for training, validation, and testing, covering 12 object categories with precise labeling and occlusion metadata.
- The design leverages the side view as a language-like modality, significantly enhancing multimodal model performance on security inspection benchmarks.
The GSXray dataset is a dual-view, fine-grained corpus designed to advance cross-view geometric reasoning and cross-modal semantic understanding for X-ray prohibited item inspection. Built on top of the DualXrayCap caption corpus and the LDXray dual-view imagery, GSXray provides paired top- and side-view X-ray images, structured chain-of-thought (CoT) supervision, detailed bounding-box annotations, and granular object categorizations. Its architecture facilitates the supervision of large multimodal models that incorporate geometric and semantic information, including reasoning where the second view is treated as a language-like modality. GSXray underpins all diagnostic tasks and benchmarks in DualXrayBench, enabling rigorous evaluation and reproducible research in automated security inspection (Peng et al., 23 Nov 2025).
1. Dataset Composition and Partitioning
GSXray comprises 44,019 dual-view samples, each consisting of:
- One top-view X-ray image
- One side-view X-ray image
The dataset is constructed by selecting paired imagery from the LDXray dual-view X-ray set and leveraging caption annotations from DualXrayCap (45,613 dual-view caption pairs; LDXray contains 146,997 pairs in total). GSXray samples are partitioned into 80% training (35,215 samples), 10% validation (4,402 samples), and 10% test (4,402 samples) splits.
GSXray covers 12 object categories, with the following total instance counts (across both views):
| Category | Abbreviation | Instance Count |
|---|---|---|
| Mobile Phone | MP | 59,003 |
| Orange Liquid | OL | 18,684 |
| Portable Charger 1 | PC1 | 9,998 |
| Portable Charger 2 | PC2 | 17,274 |
| Laptop | LA | 13,453 |
| Green Liquid | GL | 11,482 |
| Tablet | TA | 6,796 |
| Blue Liquid | BL | 1,677 |
| Columnar Orange Liquid | CO | 513 |
| Nonmetallic Lighter | NL | 487 |
| Umbrella | UM | 298 |
| Columnar Green Liquid | CG | 296 |
Each image pair contains an average of 3.07 objects, and occlusion is common: 22,772 pairs contain overlapping objects as measured by intersection-over-minimum (IoM).
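For reference, intersection-over-minimum can be computed for two axis-aligned boxes as in the sketch below (a minimal illustration assuming an (x1, y1, x2, y2) box format; the occlusion threshold used to count the 22,772 pairs is not specified here):

```python
def iom(box_a, box_b):
    """Intersection-over-minimum for two axis-aligned boxes (x1, y1, x2, y2).

    IoM divides the intersection area by the smaller of the two box areas,
    so a small object fully covered by a larger one yields IoM = 1.0.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / min(area_a, area_b)
```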
2. Annotation Schema and Chain-of-Thought Supervision
Each GSXray sample is encapsulated in a single JSON object comprising:
"pair_id": Unique sample identifier"top_image"/"side_image": Filepaths for top and side images"bboxes": List of object bounding boxes (each as category label + rectangle coordinates)"cot": Chain-of-thought fields with three components:"<top>": Scene description for the top view"<side>": Scene description for the side view"<conclusion>": Unified semantic summary and inferred relationships
"question"/"answer": Diagnostic inspection query and ground-truth response
CoT sequences were generated via prompting LLMs (Qwen3-VL and GPT-4o), automatically filtered for fact coverage and alignment, then verified by humans. The CoT structure explicitly separates geometric perception (via "<top>" and "<side>" fields) from scene-level semantic fusion ("<conclusion>").
Bounding boxes are specified as axis-aligned rectangles with normalized coordinates. Annotators followed the LDXray labeling protocol: each box is the minimal rectangle covering its object, and overlaps and occlusions are labeled per view.
3. Cross-View and Cross-Modal Modeling Design
GSXray operationalizes the side view as a language-like modality, motivated by the complementary spatial information human inspectors derive from both views. This is reflected in the tokenization pipeline, where side-view visual features are prefixed with a special "<side>" reasoning token analogous to textual embeddings.
The multimodal inference architecture comprises:
- A ViT-L/14 vision encoder, shared across the top and side views, which produces visual tokens for each image;
- A two-layer MLP projector that maps visual tokens into the LLM embedding space;
- A Qwen3-VL-MoE language decoder conditioned on the structured multimodal input.
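The following PyTorch-style sketch illustrates how such a composition could be wired together (module names, dimensions, and the learnable "<side>" prefix token are assumptions for exposition, not the released implementation):

```python
import torch
import torch.nn as nn


class DualViewAdapter(nn.Module):
    """Sketch: encode both views with a shared backbone, project into the LLM
    embedding space, and prefix the side-view stream with a reasoning token."""

    def __init__(self, vision_encoder, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.encoder = vision_encoder                     # shared for top and side views
        self.projector = nn.Sequential(                   # two-layer MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.side_token = nn.Parameter(torch.zeros(1, 1, llm_dim))  # learnable "<side>" prefix

    def forward(self, img_top, img_side):
        # Assumes the encoder returns patch tokens of shape (B, N, vis_dim).
        top_tok = self.projector(self.encoder(img_top))
        side_tok = self.projector(self.encoder(img_side))
        prefix = self.side_token.expand(side_tok.size(0), -1, -1)
        side_tok = torch.cat([prefix, side_tok], dim=1)   # side view treated as a language-like stream
        return top_tok, side_tok
```

The resulting visual tokens would then be interleaved with the text embeddings of the question and CoT fields before being passed to the decoder.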
Model training minimizes a cross-entropy loss over the generated CoT sequence tokens and the final answer. Geometry consistency is optionally encouraged by enforcing alignment between top- and side-view tokens; no additional auxiliary losses are introduced.
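Schematically, with notation introduced here for exposition, the objective is standard next-token prediction over the concatenated CoT and answer tokens:

$$
\mathcal{L}_{\text{CE}} \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, v^{\text{top}},\, v^{\text{side}},\, q\right),
$$

where $y_{1:T}$ concatenates the "<top>", "<side>", and "<conclusion>" fields and the final answer, $q$ is the question, and $v^{\text{top}}$, $v^{\text{side}}$ are the projected visual tokens of the two views.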
4. Supported Tasks and Benchmarking Protocols
GSXray's CoT supervision is tied to eight diagnostic tasks defined in DualXrayBench:
- Counting (CT)
- Object Recognition (OR)
- Spatial Relation (SR)
- Spatial Distance (SD)
- Occluded Area Recognition (OA)
- Contact-Occlusion Judgment (CO)
- Placement Attribute (PA)
- Spatial Attribute (SA)
Each sample encodes a question-answer pair corresponding to these inspection challenges. Evaluation employs accuracy (Acc) for classification/QA, F1-Score for multi-label tasks, and mean Intersection-over-Union (mIoU) for spatial correspondence assessment.
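A minimal sketch of how these metrics could be computed is shown below (the metric implementations and data layouts are assumptions, using scikit-learn for Acc/F1 and a plain box-IoU average for mIoU):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def box_iou(a, b):
    """IoU for axis-aligned boxes (x1, y1, x2, y2) in normalized coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def evaluate(preds, golds, pred_boxes, gold_boxes):
    acc = accuracy_score(golds, preds)            # classification / QA tasks
    f1 = f1_score(golds, preds, average="macro")  # macro-averaged F1 assumed here
    miou = float(np.mean([box_iou(p, g) for p, g in zip(pred_boxes, gold_boxes)]))
    return acc, f1, miou
```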
Empirical comparisons show that fine-tuning with GSXray substantially improves task performance, as demonstrated on the test split:
| Model Variant | Acc (%) | F1 (%) | mIoU (%) |
|---|---|---|---|
| Qwen3-VL-8B (no fine-tuning) | 53.5 | 56.6 | 25.4 |
| GSR-8B (fine-tuned) | 65.4 | 70.6 | 52.3 |
This suggests that leveraging the dual-view CoT and geometric-semantic structure of GSXray yields significant enhancements in both reasoning and spatial localization (Peng et al., 23 Nov 2025).
5. Preprocessing, Access, and Implementation
Images are uniformly resized to a fixed resolution, preserving aspect ratio with zero padding as needed. Bounding-box coordinates are already normalized; rescaling to the local pixel grid can be performed if required by downstream models. Tokenization uses the Qwen3 tokenizer, with reserved IDs for the special CoT tokens ("<top>", "<side>", "<conclusion>").
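A possible preprocessing pipeline consistent with this description is sketched below (the target resolution, padding strategy, and special-token registration are assumptions rather than the released preprocessing code):

```python
import torchvision.transforms as T
from PIL import Image


def pad_to_square(img, fill=0):
    """Zero-pad a PIL image to a square canvas so resizing preserves aspect ratio."""
    w, h = img.size
    side = max(w, h)
    canvas = Image.new(img.mode, (side, side), fill)
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas


IMG_SIZE = 448  # placeholder resolution; substitute the value specified in the release

transform = T.Compose([
    T.Lambda(pad_to_square),
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
])

# Reserved CoT tokens can be registered with the Qwen3 tokenizer, e.g.:
# tokenizer.add_special_tokens({"additional_special_tokens": ["<top>", "<side>", "<conclusion>"]})
```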
A sample PyTorch dataset implementation is provided:
```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class GSXrayDataset(Dataset):
    """Loads paired top/side images, CoT text, and QA fields from a GSXray JSON list."""

    def __init__(self, json_list, img_root, tokenizer, transform=None):
        with open(json_list) as f:
            self.records = json.load(f)
        self.img_root = img_root
        self.tokenizer = tokenizer
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, i):
        rec = self.records[i]
        img_top = Image.open(os.path.join(self.img_root, rec['top_image']))
        img_side = Image.open(os.path.join(self.img_root, rec['side_image']))
        if self.transform:
            img_top = self.transform(img_top)
            img_side = self.transform(img_side)
        # Assemble the CoT supervision string, prefixing each field with its reserved token.
        cot_input = ("<top> " + rec['cot']['<top>'] +
                     " <side> " + rec['cot']['<side>'] +
                     " <conclusion> " + rec['cot']['<conclusion>'])
        tokens = self.tokenizer(cot_input, return_tensors='pt')
        return {
            'img_top': img_top,
            'img_side': img_side,
            'tokens': tokens,
            'question': rec['question'],
            'answer': rec['answer'],
            'bboxes': rec['bboxes'],
        }
```
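A minimal usage example under these assumptions (file paths and the tokenizer checkpoint are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name; substitute the Qwen3 tokenizer used in the release.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
dataset = GSXrayDataset("gsxray_train.json", "images/", tokenizer,
                        transform=None)  # plug in the preprocessing transform as needed

sample = dataset[0]
print(sample['question'], '->', sample['answer'])
print(sample['tokens']['input_ids'].shape)
```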
Recommended fine-tuning uses LLaMA-Factory or the HuggingFace Trainer with the AdamW optimizer (cosine learning-rate decay, 10% warmup), bfloat16 mixed precision, an effective batch size of 256 (with gradient accumulation as needed), and 2–3 epochs for convergence.
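As an illustration, this schedule roughly maps onto HuggingFace `TrainingArguments` as follows (the learning rate and per-device batch size are placeholders; only the scheduler shape, precision, warmup, and effective batch size follow the text above):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gsxray-finetune",
    num_train_epochs=3,                 # 2-3 epochs reported for convergence
    per_device_train_batch_size=8,      # placeholder; combined with accumulation below
    gradient_accumulation_steps=32,     # 8 x 32 = 256 effective batch size (single device assumed)
    learning_rate=2e-5,                 # placeholder value
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                   # 10% warmup
    bf16=True,                          # bfloat16 mixed precision
    optim="adamw_torch",
    logging_steps=50,
)
```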
GSXray is released under a CC-BY-4.0 license and is accessible at https://github.com/BJTU-DualXray/GSXray and via AWS S3 at s3://bjtu-dualxray/gsxray.tar.gz.
6. Research Utility and Reproducibility
GSXray enables direct fine-tuning of large multimodal models for chain-of-thought reasoning over dual-view data, advancing cross-view geometric and cross-modal semantic tasks in X-ray security inspection. It supports reproducible benchmarking on all DualXrayBench tasks and aligns with protocols established in recent work on geometric–semantic alignment in multimodal learning (Peng et al., 23 Nov 2025).
With its structured annotation schema, comprehensive object and occlusion coverage, and explicit CoT supervision, GSXray provides a foundation for developing and evaluating advanced models for real-world multi-view inspection and reasoning applications.