
FS-COCO: Scene Sketch Dataset

Updated 21 November 2025
  • FS-COCO is a large-scale dataset of freehand scene sketches paired with photographs and textual captions, capturing complex scene details.
  • Sketches are stored as temporally ordered vector sequences; a two-level LSTM-based decoder exploits this stroke order and spatial structure for reconstruction-based pre-training.
  • Benchmark results demonstrate FS-COCO’s effectiveness in fine-grained sketch-based image retrieval and captioning, highlighting challenges in cross-modal adaptation.

FS-COCO is the first large-scale dataset of freehand scene sketches, each paired with both a corresponding reference photograph and a human-written text description. Designed specifically to address the need for scene-level, freehand sketch understanding, FS-COCO enables research on fine-grained sketch-based image retrieval, sketch captioning, and multi-modal representation learning in complex naturalistic settings. Unlike prior synthetically assembled or object-centric datasets, FS-COCO consists entirely of hand-drawn sketches by non-expert individuals, offering both object- and scene-level abstraction with temporally ordered vector data, making it a unique and critical resource for cross-modal understanding and practical visual applications (Chowdhury et al., 2022).

1. Dataset Construction and Composition

FS-COCO comprises 10,000 unique vector scene sketches, each created by one of 100 non-expert volunteers (100 scenes per participant). For each scene:

  • Source images are sampled from MS-COCO, representing both "thing" and "stuff" categories.
  • After viewing the scene's reference photograph for 60 seconds, subjects sketch from memory, with a strict time limit of 180 seconds per scene (average 1.7 attempts per scene).
  • Each sketch is immediately captioned by the same participant, following COCO-style guidelines (minimum five words, no "There is..." phrasing, color only if shown).

A non-expert judge checks each sketch for recognizability, requiring redraws when necessary.

Scenes span an estimated 92–150 object categories (a lower bound from caption word occurrences, an upper bound from the source images' segmentation labels). Typical objects include natural elements ("tree," "grass"), humans, animals, and common indoor items. On average, each sketch covers 7.17 object categories by segmentation labels, or 1.37 by caption word counts. The dataset is divided into 7,000 training and 3,000 test sketch–photo–caption triplets, with established comparison subsets from SketchyScene (2,724 pairs) and SketchyCOCO (1,225 pairs).

2. Data Format and Preprocessing

Each FS-COCO sketch is represented as a temporally ordered vector sequence:

  • Each sketch is a list of strokes; each stroke is an ordered sequence of points.
  • A point is $P_t = (x_t, y_t, q^1_t, q^2_t, q^3_t)$, with $(x_t, y_t) \in [0,1]^2$ (absolute coordinates normalized to canvas size) and a one-hot state vector $(q^1, q^2, q^3)$ indicating "pen down," "pen up," or "end of sketch."
  • Stroke order is preserved, encoding the subject's drawing sequence and thus coarse-to-fine scene abstraction.

No smoothing or subsampling is performed, preserving the sketches' native complexity (median: ~64 strokes, ~2,437 points per sketch). For raster-based architectures, vector sketches are rendered at a fixed resolution (e.g., $224 \times 224$ pixels) using anti-aliased lines.
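
As a concrete illustration of this format, the following minimal sketch rasterizes one $(N, 5)$ point sequence. It assumes NumPy and Pillow; the function name, supersampling factor, and stroke width are illustrative, and the released rendering code and exact pen-state convention may differ.

```python
import numpy as np
from PIL import Image, ImageDraw

def rasterize(points: np.ndarray, size: int = 224, width: int = 2, ss: int = 4) -> Image.Image:
    """Render an (N, 5) point sequence (x, y, q1, q2, q3) to a grayscale raster.

    Coordinates are assumed normalized to [0, 1]; drawing at `ss`x resolution and
    downsampling approximates anti-aliased lines. Whether a pen-state flag applies
    to the segment before or after its point should be checked against the released
    data -- this is one plausible reading, not the official tooling.
    """
    big = size * ss
    canvas = Image.new("L", (big, big), color=255)
    draw = ImageDraw.Draw(canvas)
    prev = None
    for x, y, pen_down, pen_up, end in points:
        cur = (float(x) * (big - 1), float(y) * (big - 1))
        if prev is not None:
            draw.line([prev, cur], fill=0, width=width * ss)
        # A pen-up or end-of-sketch flag breaks the polyline; the next point starts a new stroke.
        prev = None if (pen_up or end) else cur
        if end:
            break
    return canvas.resize((size, size), Image.LANCZOS)
```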

3. Hierarchical Encoding and Pre-training Architecture

To leverage the intrinsic temporal hierarchy and spatial complexity of scene sketches, FS-COCO introduces a two-level LSTM-based sketch decoder:

  • Global (stroke-level) LSTM: Encodes global context and sequentially predicts a latent for each stroke from the pooled raster-encoder feature vector $l_R$.
  • Local (point-level) LSTM: For each predicted stroke latent $S_i$, recursively outputs the point sequence of stroke $i$.

Decoding halts for strokes upon predicting the pen-up bit, and for the sketch once the “end” token is produced. Training employs mean-squared error for positions and cross-entropy for the state bits, reconstructing the sketch from the encoder's latent as a pre-training (“pre-text”) task:

$$L_\mathrm{recon} = \sum_{i, t} \left\| (x_{i,t}, y_{i,t}) - (\hat{x}_{i,t}, \hat{y}_{i,t}) \right\|^2 + \mathrm{CE}\left(q_{i,t}, \hat{q}_{i,t}\right)$$

Following pre-training, the encoder is fine-tuned for downstream classification or retrieval tasks.
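
A minimal PyTorch sketch of this two-level decoding and reconstruction objective is given below. Dimensions, fixed decoding lengths, and the teacher-free point loop are assumptions for compactness, not the authors' exact architecture; at inference the local loop stops on the predicted pen-up bit and the global loop on the end-of-sketch bit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelSketchDecoder(nn.Module):
    """Illustrative two-level decoder: a global (stroke-level) LSTM emits one latent
    S_i per stroke from the pooled raster-encoder feature l_R, and a local
    (point-level) LSTM emits the point sequence of each stroke."""

    def __init__(self, feat_dim=512, stroke_dim=256, hidden=512, max_points=128):
        super().__init__()
        self.max_points = max_points
        self.global_lstm = nn.LSTMCell(feat_dim, hidden)    # stroke-level
        self.to_stroke = nn.Linear(hidden, stroke_dim)      # hidden state -> stroke latent S_i
        self.local_lstm = nn.LSTMCell(stroke_dim, hidden)   # point-level
        self.to_point = nn.Linear(hidden, 5)                # (x, y, q1, q2, q3)

    def forward(self, l_R, n_strokes):
        # Fixed-length unrolling for clarity; a full implementation would also feed
        # back the previously predicted point and stop on pen-up / end bits.
        B = l_R.size(0)
        hg = cg = l_R.new_zeros(B, self.global_lstm.hidden_size)
        strokes = []
        for _ in range(n_strokes):
            hg, cg = self.global_lstm(l_R, (hg, cg))
            s_i = self.to_stroke(hg)                         # latent for stroke i
            hl = cl = l_R.new_zeros(B, self.local_lstm.hidden_size)
            points = []
            for _ in range(self.max_points):
                hl, cl = self.local_lstm(s_i, (hl, cl))
                out = self.to_point(hl)
                xy = torch.sigmoid(out[:, :2])               # coordinates in [0, 1]
                points.append(torch.cat([xy, out[:, 2:]], dim=-1))
            strokes.append(torch.stack(points, dim=1))       # (B, T, 5)
        return torch.stack(strokes, dim=1)                   # (B, n_strokes, T, 5)

def recon_loss(pred, target):
    """L_recon: squared position error summed over strokes i and points t, plus
    cross-entropy between predicted and ground-truth pen-state bits."""
    mse = F.mse_loss(pred[..., :2], target[..., :2], reduction="sum")
    ce = F.cross_entropy(pred[..., 2:].reshape(-1, 3),
                         target[..., 2:].argmax(dim=-1).reshape(-1),
                         reduction="sum")
    return mse + ce
```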

4. Benchmarks and Evaluation Protocols

Sketch-Based and Cross-Modal Retrieval

FS-COCO supports fine-grained image retrieval and caption-based retrieval benchmarks:

  • Pipeline: Sketches and photos are mapped to a shared embedding space using Siamese architectures (e.g., Siam-VGG16) or multi-modal transformers (e.g., CLIP).
  • Metrics: Performance is reported as Recall@K (the fraction of test queries whose ground-truth reference image is ranked in the top $K$), with $K = 1, 10$ standard; a minimal computation sketch follows this list.
  • Training splits: 7,000/3,000 (train/test) for FS-COCO; corresponding splits for SketchyScene and SketchyCOCO benchmarks.
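
A generic sketch of this Recall@K computation, assuming one ground-truth photo per sketch query (paired by index) and L2-normalized embeddings; this is not the released evaluation script.

```python
import numpy as np

def recall_at_k(sketch_emb: np.ndarray, photo_emb: np.ndarray, k: int = 10) -> float:
    """Recall@K for fine-grained retrieval where query i's true photo is row i.
    Embeddings are assumed L2-normalized so the dot product is cosine similarity."""
    sims = sketch_emb @ photo_emb.T                 # (N_queries, N_gallery)
    gt_scores = np.diag(sims)[:, None]              # similarity to the true match
    ranks = (sims > gt_scores).sum(axis=1)          # 0-based rank of the true match
    return float((ranks < k).mean())
```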

Key results include:

  • Zero-shot CLIP (ViT-B/32) yields ~1% R@1; CLIP* (with only LayerNorm parameters fine-tuned) raises this to ~5.5% R@1 and 26.5% R@10 (see the fine-tuning sketch after this list).
  • Siam-VGG16 trained on FS-COCO achieves ~23% R@1 and 52% R@10.
  • Models trained on SketchyScene/SketchyCOCO generalize poorly to FS-COCO (R@1 < 2%).
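
The CLIP* setting above fine-tunes only CLIP's LayerNorm parameters while the rest of the backbone stays frozen. A minimal sketch of that freezing pattern, assuming the OpenAI `clip` package and an illustrative optimizer setting:

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Freeze everything, then re-enable gradients only for LayerNorm affine parameters.
for p in model.parameters():
    p.requires_grad = False
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():
            p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-6)  # learning rate is an assumption
```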

Caption-Based Retrieval

CNN-RNN baselines reach ~11% R@1 (image captions) and ~7% R@1 (sketch captions); CLIP* on image captions approaches 22% R@1, similar to sketch–image retrieval with Siam-VGG16. The concatenation or addition of sketch and text embeddings consistently yields small but nontrivial improvements (e.g., CLIP*-add: 23.9% R@1).
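
The late fusion mentioned here can be as simple as summing or concatenating L2-normalized query embeddings; the following is a generic sketch, not the paper's exact CLIP*-add implementation.

```python
import numpy as np

def fuse_queries(sketch_emb: np.ndarray, text_emb: np.ndarray, mode: str = "add") -> np.ndarray:
    """Late fusion of sketch and caption query embeddings of equal dimensionality.
    "add" sums L2-normalized embeddings (in the spirit of CLIP*-add); "concat" would
    require gallery embeddings to be concatenated the same way before retrieval."""
    def l2(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    if mode == "add":
        fused = l2(sketch_emb) + l2(text_emb)
    elif mode == "concat":
        fused = np.concatenate([l2(sketch_emb), l2(text_emb)], axis=-1)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return l2(fused)
```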

Stroke-Order Salience

Empirical ablation demonstrates that stroke sequence encodes salience: removing the first (coarse, longer) strokes degrades retrieval accuracy much more than masking later (finer) strokes.
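
A hypothetical helper for this kind of ablation, assuming the $(N, 5)$ point format above; the paper's exact masking protocol may differ.

```python
import numpy as np

def drop_first_strokes(points: np.ndarray, n_drop: int) -> np.ndarray:
    """Remove the first n_drop strokes from an (N, 5) point sequence, where a stroke
    ends at a point whose pen-up bit (q2) is set."""
    if n_drop <= 0:
        return points
    pen_up = points[:, 3] > 0.5
    stroke_ends = np.flatnonzero(pen_up)      # index of the last point of each stroke
    if len(stroke_ends) <= n_drop:
        return points[:0]                     # all strokes dropped
    return points[stroke_ends[n_drop - 1] + 1:]
```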

Scene Sketch Captioning

Classic image captioners (Show-Attend-Tell, AG-CVAE, LNFMM) adapted to sketch input yield BLEU-4 of ~16–17 and CIDEr of ~90–95. Pre-training the encoder–decoder on the raster-to-vector reconstruction task further improves CIDEr to ~95.

5. Key Empirical Findings and Domain Insights

  • There is a strong domain gap between freehand scene sketches and prior semi-synthetic approaches (SketchyScene, SketchyCOCO), which do not generalize effectively to FS-COCO.
  • Scene sketches contain rich, spatially explicit visual detail—often surpassing captions in conveying scene structure, object layout, and fine-grained relations.
  • Textual modalities (captions) encode complementary information, such as object attributes and color, which are difficult to convey in monochrome sketches. Combining visual and textual modalities produces consistent synergistic improvements.
  • The coarse-to-fine, length-based temporal hierarchy found in single-object sketches is preserved in scene contexts.

6. Applications and Resources

FS-COCO enables research and development in diverse areas:

  • Large-scale sketch-based photo search and retrieval.
  • Human–computer interaction, notably rapid scene sketch interpretation.
  • Cross-modal generation (sketch ↔ photograph).
  • Scene sketch retrieval from natural language queries and vice versa.

Resources available include the full dataset, codebase, pretrained models, and annotation tools under CC BY-NC 4.0 at https://fscoco.github.io and accompanying GitHub repositories. Evaluation scripts and pre-trained checkpoints for siamese and transformer models are provided (Chowdhury et al., 2022).

7. Impact, Limitations, and Future Directions

FS-COCO establishes a foundation for scene-level sketch understanding built on realistic, temporally rich, paired multi-modal data. Its protocol exposes limitations in domain adaptation and synthetic-to-real transfer for scene sketches. By providing high-fidelity, temporally aware vector representations and benchmarks for multi-modal retrieval and captioning, FS-COCO sets a new standard for evaluating freehand scene sketch analysis. Ongoing work may focus on tighter integration with vision–language pretraining, multi-modal generation, and interactive vision systems.

References

  • Chowdhury, P. N., Sain, A., Bhunia, A. K., Xiang, T., Gryaditskaya, Y., & Song, Y.-Z. (2022). FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context. ECCV 2022.
