
Scanford: Robot-Powered Data Acquisition

Updated 2 December 2025
  • Scanford is a robot-powered data acquisition system that leverages an iterative 'data flywheel' for continual fine-tuning of foundation models.
  • It integrates a holonomic mobile base, Franka Emika FR3 manipulator, RGB-D camera, and LiDAR to autonomously scan, curate, and annotate shelf imagery.
  • Quantitative performance improvements include book identification accuracy rising from 32.4% to 71.8% and significant gains in multilingual OCR performance.

Scanford is a robot-powered data acquisition system designed for continual adaptation of foundation models through in-the-wild deployment. Instantiated as a mobile manipulation platform in an academic library environment, Scanford autonomously scans bookshelves, performs vision-language identification tasks, and automatically generates domain- and task-relevant training data. Its workflow operationalizes the "Robot-Powered Data Flywheel" paradigm: robots equipped with foundation models become real-world data generators, advancing both domain-specific performance (e.g., book identification) and domain-adjacent generalization (e.g., multilingual OCR) via iterative collection, curation, and model fine-tuning (Grannen et al., 24 Nov 2025).

1. System Architecture

Scanford is built on a holonomic mobile base (TidyBot++, 21-inch width) equipped with a Franka Emika FR3 manipulator for vertical reach. Sensory hardware includes an Intel RealSense D435 RGB-D camera (wrist-mounted, 810×1080 px) for shelf imagery and a Unitree L2 LiDAR (base-mounted) for navigation and drift correction. The software stack orchestrates the following modules:

  • Robot Controller: A pre-scripted "stop-and-scan" routine advances the robot in 0.3 m increments. At each waypoint, the manipulator sequentially moves through seven different shelf heights to maximize visual coverage, collecting RGB images at each pose. LiDAR-based drift correction is performed by fitting parallel planes to detected shelf faces and recentering the base as needed.
  • Vision-Language Model (VLM): The Qwen2.5-VL (7B) model is deployed in a Retrieval-Augmented Generation (RAG) configuration. Prompt context is dynamically constructed from candidate book titles and call numbers in the library catalog corresponding to the spatial range currently being scanned.
  • Data Curation Pipeline: Predicted labels $\hat L$ from the VLM undergo Gestalt pattern matching (string similarity) against catalog entries. Local ordering checks enforce left-to-right book-order consistency with the catalog, and predictions falling below a predefined similarity threshold are discarded (see the sketch after this list).
  • Model Fine-tuning: The accumulated curated dataset $\mathcal D_t$ is used to fine-tune Qwen2.5-VL. Each adaptation iteration runs for five epochs with the AdamW optimizer (learning rate $2\times10^{-7}$, batch size 16, weight decay 0.01, cosine learning-rate schedule, bfloat16).
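
A minimal sketch of the curation step referenced above, assuming Python and the standard-library difflib (whose SequenceMatcher implements Gestalt, i.e. Ratcliff/Obershelp, matching). The similarity threshold, catalog format, and function names are illustrative assumptions rather than details from the paper.

```python
# Hypothetical sketch of Scanford's curation step: Gestalt string similarity
# against catalog entries plus a left-to-right ordering check.
from difflib import SequenceMatcher  # Ratcliff/Obershelp ("Gestalt") matching

SIM_THRESHOLD = 0.8  # assumed value; the paper only states "a predefined threshold"


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt pattern-matching similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def curate(predictions: list[str], catalog_titles: list[str]) -> list[tuple[str, str]]:
    """Keep predictions that match a catalog title above threshold and that
    appear in the same left-to-right order as the catalog shelf listing."""
    kept: list[tuple[str, str]] = []  # (predicted label, matched catalog title)
    last_idx = -1                     # enforces monotone left-to-right ordering
    for pred in predictions:
        # Best-matching catalog entry for this prediction.
        idx, score = max(
            ((i, gestalt_similarity(pred, title)) for i, title in enumerate(catalog_titles)),
            key=lambda pair: pair[1],
        )
        # Discard low-similarity or out-of-order predictions.
        if score >= SIM_THRESHOLD and idx > last_idx:
            kept.append((pred, catalog_titles[idx]))
            last_idx = idx
    return kept
```

The ordering check mirrors the left-to-right consistency constraint described above; a production version would presumably apply the same filtering to call numbers as well as titles, with rejected predictions simply omitted from the curated dataset.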

2. Iterative Data Collection and Training Loop

The operational pipeline is governed by an iterative loop, comprising robot deployment, data curation, and model fine-tuning:

  1. RobotDeploy: Scanford traverses library aisles, pausing in 0.3 m increments. At each stop, it collects RGB images across multiple shelf heights.
  2. Label Prediction: The previous epoch’s fine-tuned VLM predicts book titles/call numbers in left-to-right order, leveraging real-time RAG context from the library catalog.
  3. Curate: Raw predictions $D_t^{\mathrm{raw}}=\{(I_t^{(n)},\hat L_t^{(n)})\}_{n=1}^{N_t}$ are filtered via Gestalt string similarity and ordering checks, yielding the curated dataset $D_t=\{(I_t^{(n)},L_t^{(n)})\}_{n=1}^{M_t}$.
  4. FineTune: The cumulative dataset $\mathcal D_t = \bigcup_{k=1}^{t} D_k$ is used to fine-tune the model, producing updated VLM parameters.
  5. Iterate: The loop continues until all aisles are scanned or a time horizon (typically due to battery limits) is reached.

The fine-tuning objective per iteration is:

$$\mathcal L(\theta) = -\sum_{(I,L)\in\mathcal D_t}\log p_\theta(L \mid I) + \lambda\|\theta\|^2$$
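
For concreteness, the reported recipe (five epochs, AdamW, learning rate $2\times10^{-7}$, batch size 16, weight decay 0.01, cosine schedule, bfloat16) could be expressed with the Hugging Face transformers TrainingArguments API as below; the paper does not name its training framework, so this mapping, including the output path, is an assumption.

```python
# Illustrative fine-tuning configuration matching the hyperparameters reported for
# Scanford's per-iteration adaptation of Qwen2.5-VL. Using the Hugging Face Trainer
# stack is an assumption of this sketch, not a detail from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2_5vl_scanford_iter_t",  # hypothetical output path
    num_train_epochs=5,                      # five epochs per adaptation iteration
    per_device_train_batch_size=16,          # batch size 16 (single device assumed)
    learning_rate=2e-7,                      # AdamW learning rate
    weight_decay=0.01,                       # decoupled weight decay, playing the role of the λ‖θ‖² term
    lr_scheduler_type="cosine",              # cosine learning-rate schedule
    bf16=True,                               # bfloat16 training
    optim="adamw_torch",                     # AdamW optimizer
    logging_steps=50,                        # arbitrary logging cadence
)

# A Trainer would then be constructed with the current cumulative dataset D_t and a
# VLM-appropriate data collator, e.g.:
# Trainer(model=vlm, args=training_args, train_dataset=curated_D_t, data_collator=collator).train()
```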

3. Quantitative Performance and Evaluation

Scanford’s deployment encompassed 2,103 library shelves, yielding 8,232 raw images and a curated set of 5,019 annotated images. Human intervention requirements were minimal (26 resets over 10 days; ∼2.6/day, typically less than 5 minutes each). Estimated manual labor savings totaled approximately 18.7 hours.

Book Identification Accuracy

Model              | Zero-Shot (%) | After Fine-Tune (%)
Qwen2.5-VL (7B)    | 32.4          | 71.8
Gemini (baseline)  | 43.7          | —

Most performance gain (>35 percentage points) was achieved within the first 1.5 hours (∼1,350 images), after which gains plateaued.

Multilingual OCR Accuracy on Challenging Subsets

Language | Model      | Zero-Shot (%) | After Fine-Tune (%)
English  | Qwen2.5-VL | 24.8          | 46.6
English  | Gemini     | 30.7          | —
Chinese  | Qwen2.5-VL | 30.8          | 38.0
Chinese  | Gemini     | 3.4           | —

These results indicate that continual adaptation via the Scanford framework significantly improves both task-focused performance and transfer to adjacent multilingual OCR tasks.

4. Robot-Powered Data Flywheel Paradigm

Scanford exemplifies the "Robot-Powered Data Flywheel," a closed-loop workflow wherein robots: (1) collect underrepresented, in-the-wild data; (2) enable automatic curation and annotation by leveraging structured sources (such as library catalogs) and VLMs; (3) use this data to fine-tune and adapt the exact deployed foundation model; and (4) redeploy the improved model, further reducing noise and increasing yield in subsequent iterations. Repeated cycles drive the model toward robustness against occlusion, multimodal content, damage, and lighting variation. This paradigm reduces human annotation effort and enables continual domain adaptation of foundation models (Grannen et al., 24 Nov 2025).
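
A minimal control-flow sketch of this closed loop is given below; the injected callables stand in for the modules described in Sections 1 and 2 and are hypothetical placeholders, not an API from the paper.

```python
# Control-flow sketch of the robot-powered data flywheel. Each cycle's improved
# model labels the next batch of in-the-wild data, and the curated dataset grows
# monotonically across iterations.
from typing import Any, Callable, Iterable


def data_flywheel(
    model: Any,
    aisles: Iterable[Any],
    robot_deploy: Callable,    # collects shelf images for one aisle (step 1)
    predict_labels: Callable,  # VLM + catalog-RAG label prediction (step 2)
    curate: Callable,          # Gestalt similarity + ordering filter (step 3)
    fine_tune: Callable,       # five-epoch AdamW adaptation (step 4)
):
    cumulative = []  # cumulative dataset D_t, grows each iteration
    for aisle in aisles:
        images = robot_deploy(aisle)
        raw = [(img, predict_labels(model, img)) for img in images]
        cumulative.extend(curate(raw))
        model = fine_tune(model, cumulative)  # the improved model is redeployed next cycle
    return model, cumulative
```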

5. Limitations and Open Challenges

Several substantive limitations and open problems remain:

  • Engineering Overhead: The method requires moderate task-specific heuristics for LiDAR-based drift correction, shelf-height scripting, and tuned curation thresholds.
  • Performance Ceiling: Even after extensive fine-tuning, neither book identification (peaking at 71.8% accuracy) nor OCR (46.6% English, 38.0% Chinese) comes close to saturating the task.
  • Human Interventions: While infrequent, human resets (26 over the 10-day deployment) are still required.
  • Pipeline Generality: The current deployment targets only a VLM. Extension to LLMs or vision-language-action (VLA) models will necessitate novel workflows and data-collection strategies.
  • Curation Bias: Dependence on catalog ordering and aggressive filtering discards edge cases, thereby limiting data diversity and coverage.

Planned directions include expanding to new task domains—particularly those within the “Zone of Proximal Development” for foundation models, such as grocery and healthcare—and exploring integration of robot-flywheel data with pre-training corpora and more advanced curation variants (including self-supervised and human-in-the-loop approaches) (Grannen et al., 24 Nov 2025).

6. Context and Significance

Scanford addresses the brittleness of large foundation models pre-trained primarily on internet data by bridging the gap to unstructured, noisy, real-world environments. Its autonomous loop supports rapid, minimally supervised data accumulation, enabling direct adaptation to domain-specific distributions. Quantitative improvements in both task and transfer metrics point to broader applicability for embodied continual-learning frameworks, suggesting that robot-powered data acquisition could become a critical mechanism for expanding the coverage and robustness of future foundation models across a wide array of settings.
