SPADES Dataset: Parsing & Pose Benchmarks

Updated 4 July 2026

SPADES is a dataset name applied to distinct benchmarks: one for NLP semantic parsing with 93,319 cloze queries, and others for spacecraft pose estimation using event sensing and RGB sequences.
In the semantic parsing domain, the dataset leverages Freebase grounding and CCG-based logical forms to evaluate slot-filling performance through automated graph matching.
The spacecraft variants, including SPADES-RGB, combine real and simulated data with detailed calibration, offering comprehensive metrics for 6-DoF pose recovery using both event and visual sensing.

SPADES is a reused dataset name in multiple, unrelated research programs. In grounded semantic parsing, SPADES abbreviates Semantic PArsing of DEclarative Sentences and denotes a large-scale, Freebase-grounded cloze corpus for entity slot filling (Bisk et al., 2016). In spacecraft pose estimation, SPADES denotes a dataset built around a PROBA-2 mock-up, with real event data acquired in a controlled laboratory environment and simulated event data generated with matched camera intrinsics (Rathinam et al., 2023). A later monocular pose-estimation study reports SPADES-RGB, a dataset of temporal RGB sequences with per-frame 2D and 3D supervision for 6-DoF recovery (Sosa et al., 7 Sep 2025).

1. Terminological scope

The label SPADES does not refer to a single canonical benchmark across arXiv literature. Instead, it names at least two datasets with different modalities, tasks, and formalisms. The semantic-parsing SPADES is a Freebase-grounded NLP corpus, whereas the spacecraft-pose SPADES is a robotics and computer-vision dataset centered on event sensing. SPADES-RGB extends the spacecraft-pose line to monocular RGB sequences. A separate 2025 resource named SPADE concerns machine-generated dialogue detection and should not be conflated with SPADES (Bisk et al., 2016, Rathinam et al., 2023, Sosa et al., 7 Sep 2025, Li et al., 19 Mar 2025).

Name	Research area	Core contents
SPADES	Grounded semantic parsing	93,319 cloze questions paired with Freebase answers
SPADES	Spacecraft pose estimation using event sensing	Real event data, simulated event data, camera calibration, poses, masks
SPADES-RGB	Monocular spacecraft pose estimation	300 RGB temporal sequences with bounding boxes, keypoints, and 6-DoF poses
SPADE	Machine-generated text detection	14 synthetic dialogue datasets

A common misconception is that references to “the SPADES dataset” identify a single benchmark. In practice, the intended referent must be recovered from the surrounding domain vocabulary: Freebase and CCG indicate the semantic-parsing corpus; PROBA-2, event sensing, optical flow, or 6-DoF pose indicate the spacecraft datasets.

2. Semantic SPADES as a Freebase-grounded cloze corpus

In the semantic-parsing literature, the dataset is formally defined as

$D = \{(s_i, q_i, a_i)\} \quad \text{for } i = 1 \ldots N,$

where $s_i$ is an original declarative sentence containing at least two Freebase entity mentions, $q_i$ is the corresponding fill-in-the-blank question obtained by randomly blanking out one of those mentions, and $a_i$ is the gold Freebase-entity answer (Bisk et al., 2016).

Each question is derived from an underlying Freebase triple $(e_s, r, e_o)$ in which both subject and object appear in the sentence. Masking one entity yields a cloze query whose answer is the removed entity. The corpus was built from the FACC1-annotated portion of the ClueWeb09 corpus, using only declarative sentences that mention two or more Freebase entities. Semantic compatibility filtering then retains only sentences for which an ungrounded semantic graph $G_u(s)$ has at least one graph-isomorphic Freebase graph $G_f$, and for which $G_u(s)$ has no free variables. From each remaining sentence, one entity mention is chosen uniformly at random and replaced by $\langle\text{blank}\rangle$. No manual verification was performed; all filtering and graph-matching steps are fully automatic.
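The blanking step can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Mention` class, the `make_cloze` helper, the example sentence, and the entity identifiers are all hypothetical, and the semantic compatibility filter is assumed to have been applied already.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int      # index of the first token of the mention
    end: int        # index one past the last token of the mention
    entity_id: str  # placeholder identifier, not a real Freebase id


def make_cloze(tokens, mentions, rng):
    """Blank one entity mention chosen uniformly at random.

    Returns the question tokens and the identifier of the removed entity.
    """
    if len(mentions) < 2:
        raise ValueError("a sentence needs at least two entity mentions")
    target = rng.choice(mentions)
    question = tokens[:target.start] + ["<blank>"] + tokens[target.end:]
    return question, target.entity_id


# Illustrative sentence with two annotated mentions.
tokens = "Google acquired Nest".split()
mentions = [Mention(0, 1, "m.entity_google"), Mention(2, 3, "m.entity_nest")]
question, answer = make_cloze(tokens, mentions, random.Random(0))
print(" ".join(question), "->", answer)
```

Passing an explicit `random.Random` instance keeps the choice of blanked mention reproducible across runs.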

The dataset contains 93,319 cloze questions, with 79,247 training instances, 4,763 development instances, and 9,309 test instances. The training partition contains 685,922 tokens, 69,095 word types, and 37,606 entity types; the development partition contains 41,102 tokens, 9,306 word types, and 4,358 entity types; the test partition contains 80,437 tokens, 15,180 word types, and 7,431 entity types. The average training sentence length is reported as

$\mu_{\text{tokens}} = 685{,}922 / 79{,}247 \simeq 8.66.$
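The split sizes and the mean length can be checked directly from the figures quoted above. The short script below is only an arithmetic check; extending the tokens-per-question ratio to the development and test partitions is an assumption that mirrors the training-set calculation.

```python
# Partition statistics as quoted in the text.
splits = {
    "train": {"questions": 79_247, "tokens": 685_922},
    "dev": {"questions": 4_763, "tokens": 41_102},
    "test": {"questions": 9_309, "tokens": 80_437},
}

total_questions = sum(s["questions"] for s in splits.values())
mean_length = {name: s["tokens"] / s["questions"] for name, s in splits.items()}

print(total_questions)  # 93319
for name, value in mean_length.items():
    print(f"{name}: {value:.2f} tokens per question")
# train: 8.66, dev: 8.63, test: 8.64
```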

Each sentence originally contained between 2 and 4 entities, and approximately one-third of the test-set entities never appear in the training set. This design reduces the extent to which slot-filling systems can rely purely on memorization.

The dataset was released with graph-parsing code at https://github.com/sivareddyg/graph-parser. The summary indicates a permissive research license, with precise licensing terms delegated to the repository.

3. Formal grounding and parser evaluation in semantic SPADES

Although SPADES is exposed as a slot-filling benchmark, every question is tied to an explicit Freebase graph and thus to a formal logical form (Bisk et al., 2016). The grounding pipeline begins with CCG-driven lambda-calculus composition. Using a CCG derivation of the sentence $s$, each lexical item is assigned a $\lambda$-expression. The summary gives the example for acquired:

$s_i$ 2

Composition under the CCG tree yields an ungrounded logical form, or equivalently an ungrounded directed graph $G_u(s)$.

Grounding proceeds by retrieving Freebase subgraphs $G_f$ that are isomorphic to $G_u(s)$. Each such graph corresponds bijectively to a grounded logical form. A structured perceptron ranks the candidate grounded graphs using features derived from alignments between ungrounded and Freebase predicates. The top-ranked graph is taken as the final semantic parse; executing it returns a list of possible entities, and the first is compared with the gold answer. The paper gives the following grounded logical form as an example:

$s_i$ 8
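The rank-then-execute evaluation described above can be sketched as a linear scorer followed by a top-1 comparison. This is an illustration under stated assumptions: the feature names, entity identifiers, weights, and the dictionary layout of a candidate are hypothetical, and perceptron training itself is omitted.

```python
def score(weights, features):
    """Linear score used to rank one candidate grounded graph."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def predict(weights, candidates):
    """Return the first entity produced by the top-scoring candidate graph."""
    best = max(candidates, key=lambda c: score(weights, c["features"]))
    return best["answers"][0] if best["answers"] else None


def accuracy(weights, dataset):
    """Fraction of questions whose predicted entity equals the gold entity."""
    correct = sum(predict(weights, ex["candidates"]) == ex["gold"] for ex in dataset)
    return correct / len(dataset)


# Toy data: alignment features between an ungrounded predicate and two
# placeholder Freebase relations (names are illustrative only).
weights = {"align:acquired~rel_1": 0.8, "align:acquired~rel_2": 0.1}
dataset = [
    {
        "gold": "m.entity_a",
        "candidates": [
            {"features": {"align:acquired~rel_1": 1.0}, "answers": ["m.entity_a", "m.entity_c"]},
            {"features": {"align:acquired~rel_2": 1.0}, "answers": ["m.entity_b"]},
        ],
    },
    {
        "gold": "m.entity_d",
        "candidates": [
            {"features": {"align:acquired~rel_1": 1.0}, "answers": ["m.entity_e"]},
            {"features": {"align:acquired~rel_2": 1.0}, "answers": ["m.entity_d"]},
        ],
    },
]
print(accuracy(weights, dataset))
```

Only the first returned entity is scored, so a candidate graph that contains the gold entity further down its answer list still counts as an error.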

SPADES was introduced as an extrinsic, task-based evaluation for supervised and unsupervised CCG parsing regimes. On the test set, a simple bag-of-words baseline achieves 31.4\% overall accuracy. The reported slot-filling accuracies for four parsing regimes are 24.8\% for unsupervised CCG, 27.3\% for semi-supervised CCG with a POS-to-CCG lexicon, 28.4\% for semi-supervised CCG with a word-to-CCG lexicon, and 30.9\% for fully supervised CCG. When performance is stratified by sentence complexity, the supervised parser reaches up to approximately 32.7\% on two-entity sentences and drops to approximately 20.2\% on four-entity sentences. This suggests that the dataset is sensitive both to syntactic quality and to the ability to compose multi-relation Freebase queries.

4. Event-based SPADES for spacecraft pose estimation

In spacecraft pose estimation, SPADES was introduced as “A Realistic Spacecraft Pose Estimation Dataset using Event Sensing” and comprises real event data acquired in a controlled laboratory environment together with simulated event data using the same camera intrinsics (Rathinam et al., 2023). The sensing platform is a Prophesee Metavision EVK4-HD with a SONY IMX636ES (HD) sensor, 1280 $\times$ 720 spatial resolution, a 6 mm fixed-focus lens, a horizontal field of view of approximately 54.6$^\circ$, maximum throughput of 3 Gevents/s, and power draw of 0.5–1.5 W; an upper bound on temporal resolution is also specified. Camera intrinsics are calibrated via grayscale image reconstruction using E2VID and MATLAB’s camera toolbox; extrinsics are obtained through hand-eye calibration between the internal camera frame and the OptiTrack marker frame.

Real data were acquired in the SnT Zero-G Lab, University of Luxembourg, with a 5 $\times$ 3 $\times$ 2.3 m volume. The environment includes two UR10 robotic arms on linear rails, an eight-camera OptiTrack motion capture system, and an Aputure LS-600D-PRO LED lamp with Fresnel F10 lens, specified as a 15$^\circ$ beam, 120,000 lux at 1.5 m, and 5800 K color temperature. The target is a PROBA-2 satellite replica at 1:2.5 scale, with dimensions 0.64 $\times$ 0.24 $\times$ 0.416 m, mass approximately 7 kg, and surface materials chosen to match CAD texture fidelity.

The synthetic pipeline uses Unreal Engine with the UnrealCV plugin, the ESA PROBA-2 CAD model, a Blue Marble 16k Earth texture, P-brdf shading, and Rayleigh scattering. Trajectory generation samples start and end poses with uniformly distributed quaternions, a bounded distance in metres, and the remaining position components determined by the field of view; interpolation is 80\% spline / 20\% helix. Event simulation is performed with the ICNS event camera emulator, with per-pixel latency, shot noise, dark noise, and bandwidth limits characterized as in Joubert et al. (2021), and with intrinsic matching to the real sensor.

The dataset contains two modalities. The synthetic modality has 300 trajectories with approximately 598 poses per trajectory, yielding 179,400 labeled frames, with pose range 3.5–12 m, varied background, and varied lighting. The real modality has 31 trajectories with 500 poses each, yielding 15,500 frames, with range 3.5–9 m, static versus dynamic camera motion, 4 lighting setups, and 3 camera positions. The recommended protocol uses synthetic data for training and in-domain testing, with the real modality reserved for out-of-domain evaluation; no cross-domain validation split is provided. The downloadable contents include RGB + event frames, camera intrinsics/extrinsics, ground-truth poses in the camera frame, and segmentation masks.

5. Filtering, representations, and baseline results in event-based SPADES

A central design issue in the event-based SPADES pipeline is that many accumulation windows can contain too few events on the target silhouette, which harms training (Rathinam et al., 2023). The paper therefore introduces mask-based KL filtering, which compares two distributions defined over the pixels of the target mask. A uniform distribution over mask pixels is defined as

$a_i$ 6

An event-weighted distribution is then defined by

$a_i$ 7

The divergence

$a_i$ 8

is computed, and frames are rejected when it exceeds a global threshold selected by sweeping on the synthetic set. A simpler bounding-box filtering baseline rejects samples when the event density inside the 2D detection box falls below a threshold.
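A minimal sketch of a mask-based KL filter is given below. It is not the authors' implementation. The direction of the divergence (event-weighted distribution relative to uniform), the treatment of a mask with no events as an automatic rejection, and the function names are all assumptions made for illustration.

```python
import math


def mask_kl_divergence(event_counts, mask):
    """KL(E || U) over mask pixels, where E is event-weighted and U is uniform.

    Assumption: the divergence direction is chosen for illustration only.
    Returns math.inf when no events fall inside the mask.
    """
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, inside in enumerate(row)
        if inside
    ]
    if not pixels:
        raise ValueError("the mask contains no pixels")
    total = sum(event_counts[r][c] for r, c in pixels)
    if total == 0:
        return math.inf
    uniform = 1.0 / len(pixels)
    kl = 0.0
    for r, c in pixels:
        p = event_counts[r][c] / total
        if p > 0.0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p / uniform)
    return kl


def keep_frame(event_counts, mask, threshold):
    """Keep a frame only if its divergence does not exceed the threshold."""
    return mask_kl_divergence(event_counts, mask) <= threshold


mask = [[1, 1], [1, 1]]
even = [[2, 2], [2, 2]]      # events spread evenly over the target
peaked = [[8, 0], [0, 0]]    # all events on a single pixel
print(mask_kl_divergence(even, mask), mask_kl_divergence(peaked, mask))
```

In this direction the divergence is zero for evenly spread events and grows to the logarithm of the mask size when all events fall on one pixel, so a single global threshold separates well-covered targets from sparse ones.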

The paper also evaluates event representations. Existing choices are E2F, LNES, and TS, where the time surface is

$(e_s, r, e_o)$ 1

The proposed 3-channel (3C) representation splits an accumulation window into three equal sub-windows and computes

$(e_s, r, e_o)$ 2

then stacks the three resulting channels as a pseudo-RGB frame. With Faster-RCNN + MobileNetV3, synthetic detection performance at the first reported threshold is 0.98 for E2F, 0.98 for LNES, 0.98 for TS, and 0.99 for 3C; at the 0.75 threshold the values are 0.74 / 0.73 / 0.74 / 0.95, so 3C yields +0.21 at 0.75. On real data, 3C scores 0.71 versus 0.63 for TS.
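The sub-window idea behind the 3C representation can be sketched as below. The per-channel quantity used here, a raw per-pixel event count, is an assumption standing in for the paper's formula, and the event tuple layout `(t, x, y)` and the function name are hypothetical.

```python
def three_channel_frame(events, t_start, t_end, width, height):
    """Stack per-pixel event counts from three equal sub-windows.

    events: iterable of (t, x, y) tuples; polarity is ignored in this sketch.
    Returns a height x width x 3 nested list, one channel per sub-window.
    """
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    frame = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]
    span = (t_end - t_start) / 3.0
    for t, x, y in events:
        if not (t_start <= t < t_end):
            continue  # event lies outside the accumulation window
        channel = min(int((t - t_start) / span), 2)
        frame[y][x][channel] += 1
    return frame


events = [(0.0, 1, 0), (0.4, 1, 0), (0.9, 0, 1), (1.5, 0, 0)]
frame = three_channel_frame(events, 0.0, 1.0, width=2, height=2)
print(frame[0][1], frame[1][0])
```

Because each channel covers a different part of the window, the direction of apparent motion is encoded in the channel order, which a single accumulated frame cannot express.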

For pose estimation, the paper reports two baseline families: a Direct two-branch CNN and a Hybrid pipeline consisting of Faster-RCNN (ResNet-50), HigherHRNet for 2D keypoints, and BPnP for PnP optimization. Pose errors are measured as

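$E_T = \dfrac{\lVert \mathbf{t} - \hat{\mathbf{t}} \rVert_2}{\lVert \mathbf{t} \rVert_2},$

$E_R = 2 \arccos\left( \left| \langle \mathbf{q}, \hat{\mathbf{q}} \rangle \right| \right),$

$E_P = E_T + E_R,$

where $\mathbf{t}$ and $\mathbf{q}$ are the ground-truth translation and unit quaternion, hats denote estimates, and $E_R$ enters the pose score $E_P$ in radians.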

On synthetic test data, the Direct model returns valid poses for 97.3\% of frames, whereas the Hybrid model returns valid poses for 24.0\% of frames, with 3.2\% translation error, 6.7$^\circ$ rotation error, and a pose score of 0.15. On real data, the Direct model reports 73.3\% coverage, 5.1\% translation error, 81.1$^\circ$ rotation error, and a pose score of 1.47, while the Hybrid model reports 17.3\%, 3.3\%, 79.0$^\circ$, and 1.41. The paper further notes that the Hybrid approach yields lower error but only produces valid poses on approximately 25\% of frames or less, whereas the Direct approach covers approximately 70\% of real frames or more but with larger orientation error.
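The figures quoted above are numerically consistent with a combined score equal to the range-normalised translation error plus the rotation error in radians. The sketch below implements that reading. It is an inference from the reported numbers rather than a confirmed definition, and the function names are hypothetical.

```python
import math


def translation_error(t_true, t_pred):
    """Euclidean translation error divided by the ground-truth range."""
    return math.dist(t_true, t_pred) / math.hypot(*t_true)


def rotation_error(q_true, q_pred):
    """Geodesic angle in radians between two unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_true, q_pred)))
    return 2.0 * math.acos(min(1.0, dot))


def pose_score(e_t, e_r_rad):
    """Assumed combined score: normalised translation error plus radians."""
    return e_t + e_r_rad


# Consistency check against the values quoted in the text.
print(round(pose_score(0.032, math.radians(6.7)), 2))   # Hybrid, synthetic: 0.15
print(round(pose_score(0.051, math.radians(81.1)), 2))  # Direct, real: 1.47
print(round(pose_score(0.033, math.radians(79.0)), 2))  # Hybrid, real: 1.41
```

The absolute value in `rotation_error` accounts for the fact that a quaternion and its negation describe the same rotation.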

6. SPADES-RGB and temporal monocular 6-DoF supervision

SPADES-RGB is a monocular RGB dataset used in a motion-aware ViT framework for spacecraft pose estimation (Sosa et al., 7 Sep 2025). It consists of monocular RGB stills, organised in temporal sequences of a moving spacecraft, collected in a “simulated orbital environment” using a PROBA-2 scaled mock-up in a Zero-G lab. Although described as a simulation, the summary states that the images are physically acquired with a real camera pointed at a physical mock-up, over a variety of backgrounds and lighting conditions. The spatial resolution is not explicitly stated.

The dataset contains 300 sequences, each with approximately 600 RGB frames, for approximately 180,000 images in total. The split is 210 training sequences, 45 validation sequences, and 45 test sequences. Per-frame annotations include a bounding box around the spacecraft, eight predefined 3D keypoints on the PROBA-2 mock-up with coordinates known from the 3D CAD model, and a ground-truth 6-DoF pose with translation

$\mathbf{t} \in \mathbb{R}^3$

and rotation represented as the unit quaternion

$\mathbf{q} \in \mathbb{R}^4, \quad \lVert \mathbf{q} \rVert_2 = 1.$

The supervision is explicitly two-fold. First, the model predicts 2D keypoint heatmaps in each image, where each keypoint is rendered as a Gaussian or elliptical Gaussian for motion encoding. Second, 6-DoF pose is recovered with Perspective-n-Point (PnP) from estimated 2D locations and known 3D coordinates. The inference equation is

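$\lambda_i \, \mathbf{x}_i = \mathbf{K} \left( \mathbf{R} \mathbf{X}_i + \mathbf{t} \right), \quad i = 1, \ldots, 8,$

where $\mathbf{x}_i$ is the $i$-th 2D keypoint measurement in pixel homogeneous coordinates, $\mathbf{X}_i$ is the known 3D location in the object frame, $\mathbf{K}$ is the camera intrinsics matrix, $(\mathbf{R}, \mathbf{t})$ is the 6-DoF pose, and $\lambda_i$ is the projective scale of the $i$-th point.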
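A PnP solver searches for the pose whose projected 3D keypoints best match the 2D detections. The sketch below shows only the quantity such a solver minimises, the squared reprojection error, for one candidate pose. It assumes a zero-skew pinhole model, and the intrinsics, points, and function names are illustrative rather than taken from the dataset.

```python
def project(point_3d, rotation, translation, intrinsics):
    """Pinhole projection of an object-frame point to pixel coordinates."""
    fx, fy, cx, cy = intrinsics
    cam = [
        sum(rotation[i][j] * point_3d[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0.0:
        raise ValueError("point lies behind the camera")
    return (fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy)


def reprojection_cost(keypoints_2d, points_3d, rotation, translation, intrinsics):
    """Sum of squared pixel residuals for one candidate pose."""
    cost = 0.0
    for (u, v), point in zip(keypoints_2d, points_3d):
        u_hat, v_hat = project(point, rotation, translation, intrinsics)
        cost += (u - u_hat) ** 2 + (v - v_hat) ** 2
    return cost


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (100.0, 100.0, 50.0, 50.0)  # fx, fy, cx, cy (illustrative values)
points_3d = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
true_t = (0.0, 0.0, 5.0)
observed = [project(p, identity, true_t, intrinsics) for p in points_3d]

print(reprojection_cost(observed, points_3d, identity, true_t, intrinsics))
print(reprojection_cost(observed, points_3d, identity, (0.0, 0.0, 6.0), intrinsics))
```

The cost is zero at the pose that generated the observations and positive for the perturbed translation, which is the signal an iterative solver follows.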

Training samples are frame triplets

$(I_{t_1}, I_{t_2}, I_{t_3}), \quad t_2 - t_1 = t_3 - t_2 = 7,$

with frames drawn at a fixed interval of seven frames apart to expose non-trivial motion. Each image is cropped around its ground-truth bounding box before resizing to the network input resolution. No additional explicit photometric augmentations are reported. The optical-flow module is kept frozen, while the ViT encoder is fully finetuned on SPADES-RGB.
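Triplet sampling with a fixed stride can be sketched as follows. Centring each triplet on the middle frame is an assumption, since the text fixes only the seven-frame spacing, and `frame_triplets` is a hypothetical helper.

```python
def frame_triplets(num_frames, stride=7):
    """Index triplets (t - stride, t, t + stride) that fit inside one sequence."""
    return [(t - stride, t, t + stride) for t in range(stride, num_frames - stride)]


triplets = frame_triplets(600)
print(len(triplets), triplets[0], triplets[-1])
# 586 (0, 7, 14) (585, 592, 599)
```

For a sequence of about 600 frames this yields 586 overlapping triplets, so the stride changes the motion between frames without greatly reducing the number of training samples.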

Evaluation uses Percentage of Correct Keypoints (PCK) at thresholds of 1\%, 5\%, and 10\% of the bounding-box diagonal:

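$\mathrm{PCK}@\alpha = \dfrac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ \lVert \hat{\mathbf{x}}_i - \mathbf{x}_i \rVert_2 \le \alpha \, d \right], \quad \alpha \in \{0.01, 0.05, 0.10\},$

where $\hat{\mathbf{x}}_i$ and $\mathbf{x}_i$ are the predicted and ground-truth 2D locations of keypoint $i$, $N$ is the number of evaluated keypoints, and $d$ is the bounding-box diagonal.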
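PCK at a threshold expressed as a fraction of the bounding-box diagonal can be computed as in the sketch below. The bounding-box layout `(x_min, y_min, x_max, y_max)`, the coordinates, and the function name are assumptions made for illustration.

```python
import math


def pck(predicted, ground_truth, bbox, alpha):
    """Fraction of keypoints within alpha times the bounding-box diagonal."""
    x_min, y_min, x_max, y_max = bbox
    limit = alpha * math.hypot(x_max - x_min, y_max - y_min)
    hits = sum(math.dist(p, g) <= limit for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)


bbox = (0.0, 0.0, 300.0, 400.0)  # diagonal of 500 pixels
ground_truth = [(10.0, 10.0), (100.0, 100.0), (200.0, 200.0), (250.0, 300.0)]
predicted = [(13.0, 10.0), (100.0, 110.0), (200.0, 230.0), (250.0, 400.0)]

for alpha in (0.01, 0.05, 0.10):
    print(alpha, pck(predicted, ground_truth, bbox, alpha))
```

Normalising by the diagonal makes the threshold scale with the apparent size of the spacecraft, so distant and close views are judged on comparable terms.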

Pose errors are

$G_f$ 6

$G_f$ 7

$G_f$ 8

The paper reports that sequences span a wide range of viewpoints and lighting conditions, and that random attitude changes yield coverage of nearly the full $SO(3)$ rotation space together with a moderate range of translational offsets, though no quantitative histograms or exact pose-distribution curves are given. In the associated benchmark study, the motion-aware method improves over single-image baselines in both 2D keypoint localisation and 6-DoF pose estimation, and shows promising generalisation on real and synthetic data from SPARK-2024.