
A Neural Representation of Sketch Drawings

Published 11 Apr 2017 in cs.NE, cs.LG, and stat.ML (arXiv:1704.03477v4)

Abstract: We present sketch-rnn, a recurrent neural network (RNN) able to construct stroke-based drawings of common objects. The model is trained on thousands of crude human-drawn images representing hundreds of classes. We outline a framework for conditional and unconditional sketch generation, and describe new robust training methods for generating coherent sketch drawings in a vector format.


Summary

  • The paper introduces sketch-rnn, a novel RNN framework that encodes sketches into latent vectors for coherent vector image generation.
  • It employs a bidirectional encoder with an autoregressive decoder, enabling both conditional and unconditional sketch creation.
  • Experimental results reveal effective latent space interpolation, highlighting the model’s potential for creative and educational applications.


The paper "A Neural Representation of Sketch Drawings" by David Ha and Douglas Eck investigates the generation of vector-based sketch images using recurrent neural networks (RNNs). This work presents sketch-rnn, a novel model that demonstrates the capability of RNNs to learn, represent, and generate simple sketches based on sequences of pen strokes.

Core Contributions

The paper introduces several key advancements:

  1. Framework for Sketch Generation: It outlines a framework capable of both unconditional and conditional generation of vector images.
  2. Robust Training Methods: The authors detail training procedures designed to improve the ability of RNNs to generate coherent sketches.
  3. Latent Space Analysis: They explore the latent space developed by the model to understand how it represents and interpolates between different sketch images.
  4. Release of Data and Tools: The implementation of sketch-rnn and a large dataset of hand-drawn sketches from the Quick, Draw! A.I. Experiment are made available to encourage further research.

Methodology

The model comprises three core components:

  1. Encoder: A bidirectional RNN encodes a given sketch into a latent vector z.
  2. Decoder: An autoregressive RNN decodes z to generate a sketch.
  3. Latent Space Interpolation: Techniques such as spherical interpolation help visualize how different sketches transition within the latent space.
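The spherical interpolation mentioned above can be sketched in a few lines. This is a generic implementation (not the authors' code), assuming latent vectors are NumPy arrays:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent vectors at fraction t."""
    z0 = np.asarray(z0, dtype=float)
    z1 = np.asarray(z1, dtype=float)
    # Angle between the two vectors.
    cos_theta = np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)
```

Unlike straight-line interpolation, slerp keeps intermediate vectors at a comparable magnitude, which matters when the latent prior is a Gaussian shell-like distribution.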

The dataset used consists of 75 classes, each containing 70K training samples and 2.5K samples each for validation and test. Sketches are represented as sequences of vectors denoting pen offsets and pen states, allowing the model to learn and generate coherent sketches.
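One common encoding for such sequences, consistent with the description above, gives each point five elements: an (dx, dy) offset from the previous point plus three one-hot pen states (pen down, pen lifted, end of sketch). The converter below, with the hypothetical name `to_stroke5`, is a minimal sketch of that idea:

```python
def to_stroke5(strokes):
    """Convert strokes (lists of absolute (x, y) points) into 5-element rows:
    [dx, dy, p1, p2, p3], where p1 = pen down, p2 = pen lifted after this
    point, p3 = end of the whole sketch."""
    seq = []
    prev = (0.0, 0.0)
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            pen_up = 1 if i == len(stroke) - 1 else 0  # last point of a stroke
            seq.append([x - prev[0], y - prev[1], 1 - pen_up, pen_up, 0])
            prev = (x, y)
    if seq:
        seq[-1][2:] = [0, 0, 1]  # mark the final point as end-of-sketch
    return seq
```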

Experimental Results

Loss Metrics and Reconstruction Quality

Experimental results were discussed in terms of two primary metrics:

  • Reconstruction Loss (L_R): Represents the log-likelihood of generating training data sketches.
  • Kullback-Leibler Divergence (L_KL): Measures how closely the distribution of the latent vector z matches a Gaussian prior.

Models trained with varying w_KL settings show a tradeoff between these metrics. Lower reconstruction loss does not necessarily imply higher quality, as the models with better L_KL often produced more coherent images.
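As a rough sketch of how the two terms combine (assuming a diagonal Gaussian posterior; the paper additionally anneals the KL weight during training, which is omitted here):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma_sq):
    """KL divergence from N(mu, sigma^2) to N(0, I), averaged over latent dims."""
    return -0.5 * np.mean(1.0 + log_sigma_sq - mu**2 - np.exp(log_sigma_sq))

def total_loss(l_r, mu, log_sigma_sq, w_kl):
    # Larger w_kl pushes the latent distribution toward the Gaussian prior
    # (lower L_KL) at some cost in reconstruction quality (higher L_R).
    return l_r + w_kl * kl_to_standard_normal(mu, log_sigma_sq)
```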

Conditional and Unconditional Generation

Conditional generation allows the model to reproduce and manipulate sketches with encoded latent vectors. For instance, a model conditioned on a cat's head can be used to generate multiple similar images with minor variations, maintaining coherence. The study demonstrated that vector arithmetic in the latent space could generate meaningful combinations and transitions of features.
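The vector arithmetic can be illustrated as follows; the latent vectors here are random stand-ins, whereas in practice each would come from the trained encoder applied to a real sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for three sketches.
z_cat_head = rng.standard_normal(128)
z_pig_head = rng.standard_normal(128)
z_pig_full = rng.standard_normal(128)

# "Full pig minus pig head" isolates a body-adding direction in latent space,
# which can then be applied to a cat head before decoding.
z_cat_full = z_cat_head + (z_pig_full - z_pig_head)
```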

In the unconditional setting, the model can be primed with a partial sketch and generate multiple plausible completions. The paper highlights examples where the model produces varied endings for incomplete sketches, demonstrating its ability to infer missing strokes.

Latent Space Features

The latent space properties were examined through interpolation tasks, where the model successfully blended features between different sketches, such as transitioning between a cat and a pig. This capacity for interpolation signifies the model's understanding of underlying sketch structure rather than mere replication.

Practical and Theoretical Implications

The ability of sketch-rnn to generate coherent images from simple sequences has implications for both creative industries and educational tools. Sketch-rnn can assist artists by suggesting completions or variations of their drawings, and educational applications can leverage it to teach drawing skills. Furthermore, the methodology could be combined with other generative models to explore novel applications in design and entertainment.

Conclusion

The research on sketch-rnn marks an important step in understanding how neural networks can learn and generate abstract representations of sketches. By encoding sketches as vectors and leveraging RNNs, the authors have demonstrated a new way to think about image generation that aligns more closely with human conceptual understanding rather than pixel-wise image synthesis. Future research could explore enhancing the quality and variety of generated sketches, incorporating more sophisticated latent variable models, or applying hybrid techniques combining sketch-rnn with other generative models.
