S-300K: Spatial Instruction Dataset
- S-300K is a spatial instruction and trajectory dataset that provides multi-step tool use traces and memory-augmented spatial evidence.
- It decomposes trajectories into final-answer, turn-level, and expert tool-use samples to train spatial reasoning in vision-language models.
- The dataset is generated via a teacher-student paradigm, enabling fine-tuning of models like Qwen3-VL-8B for advanced spatial tasks.
S-300K is a spatial instruction and trajectory dataset introduced as the supervised fine-tuning substrate linking the tool-augmented S-Agent teacher to the compact S-Agent-8B student model. Rather than a benchmark, it is a collection of spatial tool-use trajectories and derived training samples constructed by running zero-shot S-Agent instantiated with GPT-5.4 on a subset of SenseNova-SI-800K, quality-filtering the resulting traces, and decomposing them into multiple supervision granularities. In the reported system, this process yields 292,391 supervised fine-tuning samples, denoted “S-300K,” which are then used to fine-tune Qwen3-VL-8B into S-Agent-8B (Dai et al., 18 Jun 2026).
1. Definition and functional role
S-300K denotes the spatial instruction / trajectory dataset generated from S-Agent teacher trajectories and prepared for supervised fine-tuning. Its role is explicitly intermediate: S-Agent provides the teacher paradigm, S-300K provides the distilled training data, and S-Agent-8B is the student obtained by fine-tuning Qwen3-VL-8B-Instruct on those data (Dai et al., 18 Jun 2026).
This positioning matters because S-300K is not designed as an evaluation suite and is not merely a corpus of question–answer pairs. Its raw units are agent traces containing the original question, visual inputs, intermediate planner responses, issued tool calls, returned tool observations, and the final answer. A trajectory is therefore a stateful record of iterative spatial evidence acquisition rather than a single-step annotation. This makes S-300K closer to a distilled process dataset for spatial reasoning than to a conventional vision-language instruction set.
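The raw unit described above can be pictured as a small record type. The following is a minimal sketch only: the class and field names (`Trajectory`, `Turn`, `ToolCall`, and so on) are hypothetical and chosen for illustration, not taken from the paper or any released schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One issued tool call and what came back (hypothetical layout)."""
    name: str                          # e.g. "detect_objects_tool"
    arguments: dict[str, Any]
    observation: Optional[str] = None  # None if the tool returned nothing


@dataclass
class Turn:
    """One planner step: a response plus an optional tool call."""
    planner_response: str
    tool_call: Optional[ToolCall] = None  # None when no tool is requested


@dataclass
class Trajectory:
    """A stateful trace from question to final answer."""
    question: str
    visual_inputs: list[str]           # image or frame paths
    turns: list[Turn] = field(default_factory=list)
    final_answer: Optional[str] = None


# Toy example with made-up content.
traj = Trajectory(question="How many chairs are visible?", visual_inputs=["frame_000.jpg"])
traj.turns.append(
    Turn(
        planner_response="Detect chairs first.",
        tool_call=ToolCall("detect_objects_tool", {"label": "chair"}, "3 boxes"),
    )
)
traj.final_answer = "<answer>3</answer>"
```

Keeping the tool call optional per turn reflects the point that a turn may consist of a planner decision alone.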
A common misconception is that “300K” identifies an exact corpus size. The paper states that the actual total is 292,391 supervised fine-tuning samples and that “300K” is a rounded name. Another common misconception is that S-300K is a benchmark analogous to MMSI-Bench or ViewSpatial; the paper is explicit that it is a training dataset.
2. Dataset scale, provenance, and decomposition
The dataset originates from 100,000 sampled questions drawn from SenseNova-SI-800K. After teacher execution and filtering, 51,596 trajectories are retained, and these are decomposed into three distinct supervision types whose union forms S-300K (Dai et al., 18 Jun 2026).
| Component | Count |
|---|---|
| Sampled questions from SenseNova-SI-800K | 100,000 |
| Quality-filtered trajectories | 51,596 |
| Final-answer trajectories | 51,596 |
| Turn-level trajectories | 154,590 |
| Nontrivial tool/expert trajectories | 86,205 |
| Total SFT samples in S-300K | 292,391 |
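The counts in the table can be cross-checked with simple arithmetic. The short script below only re-derives quantities from the reported numbers; the per-trajectory averages are illustrative ratios computed here, not figures stated in the paper.

```python
# Reported counts from the table above.
sampled_questions = 100_000
retained_trajectories = 51_596
counts = {
    "final_answer": 51_596,
    "turn_level": 154_590,
    "tool_expert": 86_205,
}

# The three supervision types should sum to the reported total.
total = sum(counts.values())
print(total)  # 292391

# Derived ratios (illustrative, not reported by the paper).
retention_rate = retained_trajectories / sampled_questions
turn_samples_per_trajectory = counts["turn_level"] / retained_trajectories
tool_samples_per_trajectory = counts["tool_expert"] / retained_trajectories
print(f"retention: {retention_rate:.1%}")
print(f"turn-level samples per trajectory: {turn_samples_per_trajectory:.2f}")
print(f"tool/expert samples per trajectory: {tool_samples_per_trajectory:.2f}")
```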
The decomposition is central to the dataset’s design. Final-answer trajectories preserve the entire reasoning chain from the initial prompt to the terminal answer. Turn-level trajectories isolate individual planner decisions together with localized context, which reduces excessively long contexts and teaches iterative tool-use planning. Tool/expert-level trajectories focus on specific expert invocations such as metric measurement, counting, or relative-position computation, but are included only when the input is complete, the tool response is available, and the result can be verified.
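The three granularities can be sketched as a single decomposition function. This is a rough illustration under stated assumptions: the dictionary keys, the `window` parameter for local context, and the `verified` flag are all hypothetical stand-ins for whatever the actual pipeline uses.

```python
from typing import Any


def decompose(traj: dict[str, Any], window: int = 2) -> list[dict[str, Any]]:
    """Split one trajectory into final-answer, turn-level, and tool/expert samples."""
    turns = traj["turns"]
    samples: list[dict[str, Any]] = []

    # 1) Final-answer sample: the whole chain from prompt to terminal answer.
    samples.append({
        "type": "final_answer",
        "prompt": traj["question"],
        "target": [t["planner"] for t in turns] + [traj["final_answer"]],
    })

    # 2) Turn-level samples: one planner decision with only local context.
    for i, turn in enumerate(turns):
        local_context = turns[max(0, i - window):i]
        samples.append({
            "type": "turn_level",
            "prompt": traj["question"],
            "context": [t["planner"] for t in local_context],
            "target": turn["planner"],
        })

    # 3) Tool/expert samples: kept only when the input is complete, the
    #    tool response is available, and the result was verified.
    for turn in turns:
        call = turn.get("tool_call")
        if call is None:
            continue
        complete = call.get("arguments") is not None
        available = call.get("observation") is not None
        if complete and available and call.get("verified", False):
            samples.append({
                "type": "tool_expert",
                "tool": call["name"],
                "prompt": call["arguments"],
                "target": call["observation"],
            })
    return samples


# Toy trajectory with made-up content.
example = {
    "question": "Is the lamp left of the sofa?",
    "final_answer": "<answer>A</answer>",
    "turns": [
        {"planner": "Ground lamp and sofa.",
         "tool_call": {"name": "vlm_ground_objects",
                       "arguments": {"labels": ["lamp", "sofa"]},
                       "observation": "2 boxes",
                       "verified": True}},
        {"planner": "Boxes suffice; answer now.", "tool_call": None},
    ],
}
result = decompose(example)
```

In this toy case one trajectory yields one final-answer sample, two turn-level samples, and one tool/expert sample.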
This multi-granularity structure suggests that S-300K is intended to supervise both global trajectory execution and local decision policies. A plausible implication is that its effectiveness depends not only on the correctness of final answers but also on the fidelity of intermediate tool-use patterns.
3. Construction pipeline and quality control
The construction pipeline has three stages: trajectory generation, trajectory filtering, and trajectory decomposition. Questions are sampled from SenseNova-SI-800K with preference for cases that are challenging for the student Qwen3-VL-8B and likely to require spatial tool use, including measurement, counting, relative position, camera/viewpoint reasoning, and grounding (Dai et al., 18 Jun 2026).
For each selected question, a frozen teacher S-Agent instantiated with GPT-5.4 generates a complete trajectory. At each reasoning step, the planner produces a request conditioned on the question, the visual input, Scene Memory, and Agent Memory. Tool execution returns an observation, and the memory state is updated with that observation. The paper further describes an evidence/context split of the memory state, with Scene Memory as reusable scene evidence and Agent Memory as process context.
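The planner, tool, and memory interaction in this stage can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the request format, the `max_steps` cap, and the rule that every observation is appended to both memories are assumptions made for the example.

```python
from typing import Any, Callable


def run_agent(
    question: str,
    visual_input: Any,
    planner: Callable[..., dict[str, Any]],
    tools: dict[str, Callable[..., Any]],
    max_steps: int = 8,  # hypothetical cap, not taken from the paper
):
    """Iterate planner -> tool -> memory update until the planner answers."""
    scene_memory: list[Any] = []             # reusable scene evidence
    agent_memory: list[dict[str, Any]] = []  # process context

    for step in range(max_steps):
        # The planner conditions on the question, visual input, and both memories.
        request = planner(question, visual_input, scene_memory, agent_memory)
        if request["action"] == "answer":
            return request["content"], scene_memory, agent_memory

        # Execute the requested tool and record the observation.
        observation = tools[request["tool"]](**request["arguments"])
        scene_memory.append(observation)
        agent_memory.append({"step": step, "request": request, "observation": observation})

    return None, scene_memory, agent_memory  # no final answer within the budget


# Toy planner and tool, for illustration only.
def toy_planner(question, visual_input, scene_memory, agent_memory):
    if not scene_memory:
        return {"action": "tool", "tool": "count", "arguments": {"label": "chair"}}
    return {"action": "answer", "content": str(scene_memory[-1]["count"])}


def toy_count(label: str) -> dict[str, Any]:
    return {"label": label, "count": 3}


answer, scene, process = run_agent("How many chairs?", None, toy_planner, {"count": toy_count})
```

A run that hits the step budget returns `None` as the answer, which corresponds to the missing-final-answer case that filtering later removes.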
Filtering is type-specific. Invalid trajectories with failed execution, unrecovered errors, or missing final answers are removed. For multiple-choice questions, the predicted option letter parsed from <answer>…</answer> must exactly match the ground-truth option letter. For numeric questions, parsed floating-point predictions are retained only if they meet the paper's Mean Relative Accuracy (MRA) threshold. For free-form text, normalization is applied and a trajectory is retained only if the prediction matches the ground truth exactly or contains it as a substring.
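The type-specific checks can be sketched as three small predicates. Several details here are assumptions: the MRA formula follows the convention commonly used for spatial benchmarks (averaging a relative-error test over confidence levels 0.50 to 0.95), the `min_mra` threshold is left as a required parameter, and the text normalization is one plausible choice rather than the paper's exact procedure.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_answer(response: str) -> Optional[str]:
    """Extract the content of the first <answer>...</answer> span, if any."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None


def keep_multiple_choice(response: str, gt_letter: str) -> bool:
    """Keep only when the parsed option letter exactly matches the ground truth."""
    return parse_answer(response) == gt_letter


def mean_relative_accuracy(pred: float, gt: float) -> float:
    """Assumed MRA: fraction of levels t in {0.50, ..., 0.95} with rel. error < 1 - t."""
    if gt == 0:
        return 1.0 if pred == 0 else 0.0
    rel_err = abs(pred - gt) / abs(gt)
    levels = [0.50 + 0.05 * i for i in range(10)]
    return sum(rel_err < 1.0 - t for t in levels) / len(levels)


def keep_numeric(response: str, gt: float, min_mra: float) -> bool:
    """Keep when the parsed float reaches the caller-supplied MRA threshold."""
    parsed = parse_answer(response)
    if parsed is None:
        return False
    try:
        pred = float(parsed)
    except ValueError:
        return False
    return mean_relative_accuracy(pred, gt) >= min_mra


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (illustrative choice)."""
    return " ".join(re.sub(r"[^\w\s]", "", text.lower()).split())


def keep_free_form(response: str, gt: str) -> bool:
    """Keep on exact match or when the ground truth is a substring of the prediction."""
    parsed = parse_answer(response)
    if parsed is None:
        return False
    pred, target = normalize(parsed), normalize(gt)
    return pred == target or target in pred
```

The substring rule is permissive by design: a prediction such as "The red chair." is kept for the ground truth "red chair".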
The paper also states that tool use is not itself a hard filtering criterion. A high-quality trajectory may be retained even if the planner decides that tools are unnecessary. This is significant because S-300K is not defined as a dataset of compulsory tool-calling traces; it is a dataset of accepted agent behaviors under a tool-augmented regime.
4. Internal representation and annotation content
S-300K is fundamentally a serialized corpus of tool-augmented trajectories with rich spatial annotations. In raw form, each trajectory contains natural-language fields such as the question, intermediate planner thoughts and decisions, and the final answer, typically wrapped in a structured <answer>…</answer> field (Dai et al., 18 Jun 2026).
Its visual and spatial content includes image inputs, multi-view image sets, and video frames inherited from the source corpus. Tool-grounded annotations span several levels. At the 2D level, the traces include bounding boxes, labels, confidences, textual location descriptions, and visualization overlays from detect_objects_tool and vlm_ground_objects. At the 2D-to-3D level, they include depth maps, per-point depth values, estimated 3D coordinates, camera poses, and depth visualizations from depth_estimation_tool and metric_depth3d_tool. At the higher semantic level, they include aggregated spatial knowledge such as counts, distances, sizes, relative directions, orientations, and view-conditioned relations computed by Level-3 spatial experts.
The state representation is split into Scene Memory and Agent Memory. Scene Memory stores entity-centric entries such as aliases, supporting frames, boxes, accumulated 3D attributes, and derived spatial facts. Agent Memory stores planner thoughts, tool calls, observations, failures, and intermediate conclusions. In the student model these are not separate latent modules; they are serialized as textual summaries in the context window.
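Serializing the two memories as text might look like the sketch below. The section markers, field names (`alias`, `frames`, `facts`, `tool`, `summary`), and line layout are hypothetical; the source only states that both memories are rendered as textual summaries in the context window.

```python
from typing import Any


def serialize_memory(
    scene_memory: list[dict[str, Any]],
    agent_memory: list[dict[str, Any]],
) -> str:
    """Render Scene Memory and Agent Memory as one plain-text block."""
    lines = ["[Scene Memory]"]
    for entity in scene_memory:
        frames = ", ".join(str(f) for f in entity["frames"])
        facts = "; ".join(entity.get("facts", []))
        lines.append(f"- {entity['alias']} (frames {frames}): {facts}")

    lines.append("[Agent Memory]")
    for index, step in enumerate(agent_memory, start=1):
        lines.append(f"{index}. {step['tool']} -> {step['summary']}")
    return "\n".join(lines)


# Toy memories with made-up content.
text = serialize_memory(
    [{"alias": "chair_1", "frames": [0, 3], "facts": ["left of table"]}],
    [{"tool": "detect_objects_tool", "summary": "found 1 chair"}],
)
print(text)
```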
The logged trajectory content is correspondingly broad: planner prompts and responses, tool calls, tool observations, intermediate artifacts, memory states, final answers, and evaluation results. This means that S-300K supervises not only answer generation but also a text-serialized approximation to memory maintenance and evidence integration.
5. Task coverage and spatial competencies
S-300K inherits task diversity from SenseNova-SI-800K and is further biased toward spatially demanding instances where tool use helps. The paper explicitly highlights counting, metric measurement, relative position, orientation, view-dependent reasoning, camera/viewpoint reasoning, grounding-dependent questions, and temporal or multi-frame counting and tracking (Dai et al., 18 Jun 2026).
Counting supervision covers both single-object counts and condition-aware counts over multiple frames. Metric measurement includes camera–object distance, object–object distance, object size, and room size. Relative-position tasks include left/right, front/back, cardinal directions, and distinctions between egocentric and world reference frames. Orientation tasks target intrinsic facing direction or pose. View-dependent reasoning covers object-centric front/back/left/right views, while camera/viewpoint reasoning concerns camera perspective, person perspective, and perspective-based relative direction.
The dataset also spans single images, multi-view image sets, and videos. Temporal structure enters through keyframe selection tools, Scene Memory linking entities across frames, and tasks involving motion-based relations, route planning, or counts accumulated over time. This organization aligns S-300K with the broader S-Agent formulation of spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction.
Because the source traces are generated under active planner control, the spatial competencies represented in S-300K are not limited to static geometric labels. They include policies for when to ground objects, when to invoke depth or 3D lifting, when to call specialized experts, and when to terminate with a final answer.
6. Training use, empirical effect, and limitations
S-300K is used to fine-tune Qwen3-VL-8B-Instruct with standard supervised next-token prediction over assistant responses, including serialized tool-use trajectories, tool observations, and final answers (Dai et al., 18 Jun 2026). The student is trained with LLaMA-Factory using the qwen3_vl_nothink conversation template on 8 × B200 GPUs, with a sequence length of 8192 tokens, cosine learning-rate decay with 3% warmup, and 1 epoch over S-300K.
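A cosine schedule with 3% warmup can be sketched as a small function of the step index. This is an illustrative approximation, not the scheduler used by LLaMA-Factory: the linear warmup shape and decay to zero are assumptions, and `peak_lr` is a required argument with no value assumed.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.03) -> float:
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Example with a placeholder peak of 1.0, so values read as fractions of the peak.
schedule = [lr_at(s, total_steps=1000, peak_lr=1.0) for s in range(1000)]
```

With 1,000 total steps, the first 30 steps are warmup and the remaining 970 follow the cosine curve.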
The paper characterizes this as imitation learning of the S-Agent teacher’s behavior: when and which tools to call, how to interpret tool outputs, how to maintain coarse memory, and how to produce final answers. No architectural modifications are introduced; S-Agent-8B remains a purely fine-tuned VLM, with tool calls, tool outputs, and memory summaries serialized as text.
The reported effect is substantial. Qwen3-VL-8B-Instruct scores 31.1 on MMSI, 42.2 on ViewSpatial, and 49.1 on ReVSI. Wrapping the same model with zero-shot S-Agent tools does not consistently help, yielding 30.7, 44.1, and 49.5. After fine-tuning on S-300K, S-Agent-8B reaches 41.6 on MMSI, 46.8 on ViewSpatial, and 52.8 on ReVSI. The paper also reports that S-Agent-8B is competitive with advanced closed-source models on these spatial benchmarks, with 41.6 versus GPT-5.4’s 41.9 and Gemini 3 Pro’s 45.2 on MMSI, and 46.8 versus GPT-5.4’s 45.6 and Gemini 3 Pro’s 50.4 on ViewSpatial.
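The size of these effects is easier to read as differences against the base model. The snippet below uses only the scores quoted above and computes the point changes; the dictionary keys are illustrative labels.

```python
# Scores quoted in the text, per benchmark.
scores = {
    "MMSI":        {"base": 31.1, "zero_shot_tools": 30.7, "s_agent_8b": 41.6},
    "ViewSpatial": {"base": 42.2, "zero_shot_tools": 44.1, "s_agent_8b": 46.8},
    "ReVSI":       {"base": 49.1, "zero_shot_tools": 49.5, "s_agent_8b": 52.8},
}

deltas = {}
for name, s in scores.items():
    deltas[name] = {
        # Change from wrapping the base model with tools, without training.
        "tools_only": round(s["zero_shot_tools"] - s["base"], 1),
        # Change after fine-tuning on S-300K.
        "fine_tuned": round(s["s_agent_8b"] - s["base"], 1),
    }
    print(name, deltas[name])
```

The tools-only change is negative on MMSI and small elsewhere, while the fine-tuned change is positive on all three benchmarks.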
Several limitations are implicit. S-300K inherits domain biases from SenseNova-SI-800K. Its trajectories depend on the quality of GPT-5.4 planning and of tools such as GroundingDINO and Depth Anything 3, so intermediate errors may survive even when final-answer filtering succeeds. Numeric filtering by an MRA threshold may bias retention toward teacher-confident cases. The 3D evidence is derived from learned depth models and geometric heuristics rather than explicit real-world 3D ground truth. Finally, the paper does not explicitly announce public release or licensing for S-300K; based on the provided text, it is presented as the internal dataset used to train S-Agent-8B rather than as an explicitly released public resource.
Within the broader landscape of spatial datasets, this places S-300K in a distinct category. Unlike evaluation suites such as BLINK, 3DSRBench, EmbSpatial, MMSI-Bench, VSI-Bench, ViewSpatial, ReVSI, and VSI-SUPER, it is purely for training. Unlike datasets described as large spatial instruction corpora, such as SenseNova-SI-800K or Cambrian-S, its distinguishing feature is teacher-generated multi-step tool use with explicit memory and geometric computation.