FS-COCO: Scene Sketch Dataset
- FS-COCO is a large-scale dataset of freehand scene sketches paired with photographs and textual captions, capturing complex scene details.
- It stores sketches as temporally ordered vector sequences and uses a two-level LSTM-based decoder as a pre-training task that exploits stroke order and spatial layout.
- Benchmark results demonstrate FS-COCO’s effectiveness in fine-grained sketch-based image retrieval and captioning, highlighting challenges in cross-modal adaptation.
FS-COCO is the first large-scale dataset of freehand scene sketches, each paired with both a corresponding reference photograph and a human-written text description. Designed specifically to address the need for scene-level, freehand sketch understanding, FS-COCO enables research on fine-grained sketch-based image retrieval, sketch captioning, and multi-modal representation learning in complex naturalistic settings. Unlike prior synthetically assembled or object-centric datasets, FS-COCO consists entirely of hand-drawn sketches by non-expert individuals, offering both object- and scene-level abstraction with temporally ordered vector data, making it a unique and critical resource for cross-modal understanding and practical visual applications (Chowdhury et al., 2022).
1. Dataset Construction and Composition
FS-COCO comprises 10,000 unique vector scene sketches, each created by one of 100 non-expert volunteers (100 scenes per participant). For each scene:
- Source images are sampled from MS-COCO, representing both "thing" and "stuff" categories.
- After viewing the scene's reference photograph for 60 seconds, subjects sketch from memory, with a strict time limit of 180 seconds per scene (average 1.7 attempts per scene).
- Each sketch is immediately captioned by the same participant, following COCO-style guidelines (minimum five words, no "There is..." phrasing, color only if shown).
A non-expert judge checks each sketch for recognizability, requiring redraws when necessary.
Scenes span an estimated 150 categories (by segmentation upper bound) or 92 (by caption word occurrence lower bound). Typical objects include natural elements ("tree," "grass"), humans, animals, and common indoor items. Average object label coverage per sketch is 7.17 (segmentation) or 1.37 (caption-based word count). The dataset is divided into 7,000 training and 3,000 test sketch–photo–caption triplets, with established comparison subsets from SketchyScene (2,724 pairs) and SketchyCOCO (1,225 pairs).
2. Data Format and Preprocessing
Each FS-COCO sketch is represented as a temporally ordered vector sequence:
- Each sketch is a list of strokes; each stroke is an ordered sequence of points.
- A point is represented as $p_i = (x_i, y_i, q_i^1, q_i^2, q_i^3)$, with $(x_i, y_i)$ absolute coordinates normalized to the canvas size and $(q_i^1, q_i^2, q_i^3)$ a one-hot state vector indicating "pen down," "pen up," or "end of sketch."
- Stroke order is preserved, encoding the subject's drawing sequence and thus coarse-to-fine scene abstraction.
No smoothing or subsampling is performed, maintaining the sketches' native complexity (median ~64 strokes and ~2,437 points per sketch). For raster-based architectures, vector sketches are rendered onto a fixed-resolution canvas using anti-aliased lines.
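To make the format concrete, the following is a minimal sketch, assuming a stroke-5-style point layout (absolute $x, y$ plus the three pen-state bits described above); the helper names, canvas size, and rasterization routine are illustrative rather than the official FS-COCO loader, and anti-aliasing is omitted for brevity.

```python
import numpy as np
from PIL import Image, ImageDraw

# Illustrative point layout: (x, y, pen_down, pen_up, end_of_sketch), with
# x, y absolute coordinates normalized to [0, 1] and a one-hot pen state.

def split_into_strokes(points: np.ndarray) -> list:
    """Split an (N, 5) point array into strokes at 'pen up' / 'end' points."""
    strokes, current = [], []
    for p in points:
        current.append(p[:2])
        if p[3] == 1 or p[4] == 1:      # pen lifted or sketch finished
            strokes.append(np.array(current))
            current = []
        if p[4] == 1:                   # 'end of sketch' bit terminates the sequence
            break
    return strokes

def rasterize(points: np.ndarray, size: int = 256) -> Image.Image:
    """Render a vector sketch onto a fixed-size canvas.
    size=256 is a placeholder; the paper's exact raster resolution is not reproduced here."""
    canvas = Image.new("L", (size, size), color=255)
    draw = ImageDraw.Draw(canvas)
    for stroke in split_into_strokes(points):
        xy = [(int(x * (size - 1)), int(y * (size - 1))) for x, y in stroke]
        if len(xy) > 1:
            draw.line(xy, fill=0, width=2)
    return canvas
```

Splitting at pen-up points recovers the stroke list that the benchmarks and ablations below operate on.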
3. Hierarchical Encoding and Pre-training Architecture
To leverage the intrinsic temporal hierarchy and spatial complexity of scene sketches, FS-COCO introduces a two-level LSTM-based sketch decoder:
- Global (stroke-level) LSTM: Encodes global context and sequentially predicts a latent vector $z_s$ for each stroke $s$ from the pooled raster-encoder feature vector $v$.
- Local (point-level) LSTM: Conditioned on the predicted latent $z_s$, recursively outputs the point sequence of stroke $s$.
Decoding halts for a stroke once the pen-up bit is predicted, and for the whole sketch once the "end" token is produced. Training reconstructs the sketch from the encoder's latent as a pre-training ("pretext") task, using mean-squared error for positions and cross-entropy for the state bits: $\mathcal{L}_{\text{recon}} = \sum_i \lVert (\hat{x}_i, \hat{y}_i) - (x_i, y_i) \rVert_2^2 + \mathrm{CE}(\hat{q}_i, q_i)$.
Following pre-training, the encoder is fine-tuned for downstream classification or retrieval tasks.
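The following is a compact PyTorch sketch of this two-level decoding idea under teacher forcing; hidden sizes, module names, and the unweighted loss sum are assumptions for illustration rather than the released implementation, and the inference-time halting on pen-up/end predictions is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalSketchDecoder(nn.Module):
    """Two-level decoder sketch: a stroke-level (global) LSTM emits one latent per
    stroke from the pooled raster-encoder feature, and a point-level (local) LSTM
    decodes each stroke's points from that latent. Dimensions are illustrative."""

    def __init__(self, feat_dim=512, stroke_dim=256, point_dim=5, hidden=256):
        super().__init__()
        self.stroke_rnn = nn.LSTM(feat_dim, stroke_dim, batch_first=True)
        self.point_rnn = nn.LSTM(point_dim + stroke_dim, hidden, batch_first=True)
        self.pos_head = nn.Linear(hidden, 2)    # predicts (x, y)
        self.state_head = nn.Linear(hidden, 3)  # pen-down / pen-up / end logits

    def forward(self, feat, target_points):
        # feat:          (B, feat_dim) pooled raster-encoder feature
        # target_points: (B, S, P, 5) ground-truth points, used for teacher forcing
        B, S, P, _ = target_points.shape
        # Global LSTM: one latent z_s per stroke, conditioned on the image feature.
        z, _ = self.stroke_rnn(feat.unsqueeze(1).expand(B, S, -1).contiguous())
        z = z.unsqueeze(2).expand(B, S, P, -1)
        # Teacher forcing: feed the previous ground-truth point, predict the next one.
        prev = torch.cat([torch.zeros_like(target_points[:, :, :1]),
                          target_points[:, :, :-1]], dim=2)
        h, _ = self.point_rnn(torch.cat([prev, z], dim=-1).reshape(B * S, P, -1))
        return (self.pos_head(h).reshape(B, S, P, 2),
                self.state_head(h).reshape(B, S, P, 3))

def reconstruction_loss(xy, state_logits, target_points):
    """MSE on positions plus cross-entropy on the one-hot pen-state bits."""
    pos_loss = F.mse_loss(xy, target_points[..., :2])
    state_loss = F.cross_entropy(state_logits.reshape(-1, 3),
                                 target_points[..., 2:].argmax(-1).reshape(-1))
    return pos_loss + state_loss
```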
4. Benchmarks and Evaluation Protocols
Sketch-Based and Cross-Modal Retrieval
FS-COCO supports fine-grained image retrieval and caption-based retrieval benchmarks:
- Pipeline: Sketches and photos are mapped to a shared embedding space using Siamese architectures (e.g., Siam-VGG16) or multi-modal transformers (e.g., CLIP).
- Metrics: Performance is reported as Recall@K (the fraction of test queries whose ground-truth reference image is ranked in the top $K$), with R@1 and R@10 as the standard settings; a minimal computation sketch follows this list.
- Training splits: 7,000/3,000 (train/test) for FS-COCO; corresponding splits for SketchyScene and SketchyCOCO benchmarks.
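For concreteness, here is a small sketch of the Recall@K protocol, assuming L2-normalized embeddings compared by cosine similarity and a gallery indexed in the same order as the queries; the function name is illustrative.

```python
import numpy as np

def recall_at_k(sketch_emb: np.ndarray, photo_emb: np.ndarray, k: int) -> float:
    """Fraction of sketch queries whose paired photo (same row index) is
    ranked among the k most similar gallery photos."""
    # Cosine similarity via dot products of L2-normalized embeddings.
    s = sketch_emb / np.linalg.norm(sketch_emb, axis=1, keepdims=True)
    p = photo_emb / np.linalg.norm(photo_emb, axis=1, keepdims=True)
    sims = s @ p.T                              # (n_queries, n_gallery)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k best matches
    hits = (topk == np.arange(len(s))[:, None]).any(axis=1)
    return float(hits.mean())

# With the 3,000-item FS-COCO test gallery, one would report
# recall_at_k(S, P, 1) and recall_at_k(S, P, 10) as R@1 and R@10.
```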
Key results include:
- Zero-shot CLIP (ViT-B/32) yields 1% R@1; CLIP* (LayerNorm fine-tuned) raises this to 5.5% R@1 and 26.5% R@10.
- Siam-VGG16 trained on FS-COCO achieves 23% R@1 and 52% R@10.
- Models trained on SketchyScene/SketchyCOCO generalize poorly to FS-COCO (R@1 of roughly 2% or less).
Caption-Based Retrieval
CNN-RNN baselines reach 11% R@1 (image captions) and 7% R@1 (sketch captions); CLIP* on image captions approaches 22% R@1, similar to sketch-image retrieval with Siam-VGG16. The concatenation or addition of sketch and text embeddings consistently yields small but nontrivial improvements (e.g., CLIP*-add: 23.9% R@1).
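A minimal sketch of these two late-fusion variants, assuming both modalities are already embedded in (or projected to) a shared space; the projection layer and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Combine sketch and caption embeddings by addition or concatenation
    before retrieval; dimensions and the projection layer are illustrative."""

    def __init__(self, dim: int = 512, mode: str = "add"):
        super().__init__()
        self.mode = mode
        # Concatenation doubles the width, so project back into the shared space.
        self.proj = nn.Linear(2 * dim, dim) if mode == "concat" else nn.Identity()

    def forward(self, sketch_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        if self.mode == "add":
            fused = sketch_emb + text_emb
        else:
            fused = self.proj(torch.cat([sketch_emb, text_emb], dim=-1))
        return nn.functional.normalize(fused, dim=-1)
```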
Stroke-Order Salience
Empirical ablation demonstrates that stroke sequence encodes salience: removing the first (coarse, longer) strokes degrades retrieval accuracy much more than masking later (finer) strokes.
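One way such an ablation can be run, assuming the stroke-list representation from Section 2; the masking policy below is an illustrative reconstruction rather than the paper's exact protocol.

```python
import numpy as np

def mask_strokes(strokes: list, fraction: float, from_start: bool) -> list:
    """Drop a fraction of strokes from the start (early, coarse strokes)
    or from the end (late, fine strokes) before re-rendering and retrieval."""
    n_drop = int(round(fraction * len(strokes)))
    return strokes[n_drop:] if from_start else strokes[:len(strokes) - n_drop]

# Comparing retrieval accuracy with the first 25% of strokes removed versus the
# last 25% removed probes how much salient content the early strokes carry.
```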
Scene Sketch Captioning
Classic image captioners (Show-Attend-Tell, AG-CVAE, LNFMM) adapted to sketch input yield BLEU-4 16–17 and CIDEr 90–95. Pre-training the encoder–decoder on the raster-to-vector reconstruction task further improves CIDEr to 95.
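As a reference point for the metric, a minimal BLEU-4 computation with NLTK is sketched below; the exact evaluation toolkit used by the paper is not specified here, so treat this as illustrative (scores are conventionally reported on a 0-100 scale).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu4(references: list, hypotheses: list) -> float:
    """Corpus-level BLEU-4 over tokenized captions.
    references: one list of reference token lists per hypothesis;
    hypotheses:  one token list per generated caption."""
    return corpus_bleu(references, hypotheses,
                       weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=SmoothingFunction().method1)

# Example with hypothetical captions; multiply by 100 for the 0-100 reporting scale.
refs = [[["a", "man", "rides", "a", "horse", "on", "a", "beach"]]]
hyps = [["a", "person", "rides", "a", "horse", "near", "the", "sea"]]
print(100 * bleu4(refs, hyps))
```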
5. Key Empirical Findings and Domain Insights
- There is a strong domain gap between freehand scene sketches and prior semi-synthetic datasets (SketchyScene, SketchyCOCO); models trained on the latter do not generalize effectively to FS-COCO.
- Scene sketches contain rich, spatially explicit visual detail—often surpassing captions in conveying scene structure, object layout, and fine-grained relations.
- Textual modalities (captions) encode complementary information, such as object attributes and color, which are difficult to convey in monochrome sketches. Combining visual and textual modalities produces consistent synergistic improvements.
- The coarse-to-fine, length-based temporal hierarchy found in single-object sketches is preserved in scene contexts.
6. Applications and Resources
FS-COCO enables research and development in diverse areas:
- Large-scale sketch-based photo search and retrieval.
- Human–computer interaction, notably rapid scene sketch interpretation.
- Cross-modal generation (sketch ↔ photograph).
- Scene sketch retrieval from natural language queries and vice versa.
Resources include the full dataset, codebase, pre-trained models, and annotation tools, released under CC BY-NC 4.0 at https://fscoco.github.io and the accompanying GitHub repositories. Evaluation scripts and pre-trained checkpoints for the Siamese and transformer models are provided (Chowdhury et al., 2022).
7. Impact, Limitations, and Future Directions
FS-COCO establishes the foundation for scene-level sketch understanding using realistic, temporally-rich and paired multi-modal data. Its protocol exposes limitations in domain adaptation and synthetic-to-real transfer for scene sketches. By supporting high-fidelity, temporally-aware vector representations, and benchmarking multi-modal retrieval and captioning, FS-COCO sets a new standard for evaluating freehand scene sketch analysis. Ongoing work may focus on further integration with vision–language pretraining, multi-modal generation, and application to interactive vision systems.