PixelNN: Example-based Image Synthesis (1708.05349v1)

Published 17 Aug 2017 in cs.CV, cs.GR, and cs.LG

Abstract: We present a simple nearest-neighbor (NN) approach that synthesizes high-frequency photorealistic images from an "incomplete" signal such as a low-resolution image, a surface normal map, or edges. Current state-of-the-art deep generative models designed for such conditional image synthesis lack two important things: (1) they are unable to generate a large set of diverse outputs due to the mode collapse problem, and (2) they are not interpretable, making it difficult to control the synthesized output. We demonstrate that NN approaches potentially address such limitations, but suffer in accuracy on small datasets. We design a simple pipeline that combines the best of both worlds: the first stage uses a convolutional neural network (CNN) to map the input to an (overly-smoothed) image, and the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner. We demonstrate our approach for various input modalities, and for various domains ranging from human faces to cats-and-dogs to shoes and handbags.


Summary

  • The paper introduces a two-stage framework that first applies CNN-based regression to produce a smoothed image and then refines it with pixel-wise nearest neighbor matching.
  • It addresses GAN limitations by ensuring diversity, interpretability, and user control through compositional matching with deep hypercolumn features.
  • Evaluations on varied datasets confirm that PixelNN achieves competitive performance and traceable dense pixel correspondences for high-quality image synthesis.

This paper, "PixelNN: Example-based Image Synthesis" (1708.05349), presents a practical approach for generating photorealistic images from incomplete input signals such as low-resolution images, surface normal maps, or edge maps. The authors argue that state-of-the-art generative adversarial networks (GANs), while powerful, suffer from limitations like mode collapse (generating limited diversity) and lack of interpretability or user control. To address these issues, PixelNN proposes a simple two-stage pipeline combining a convolutional neural network (CNN) regressor with a pixel-wise nearest neighbor (NN) method.

The core idea is that a traditional regressor (like a CNN trained with L2 loss) tends to produce smoothed outputs lacking high-frequency details. The PixelNN pipeline aims to add these high-frequency details by "copy-pasting" them from a training dataset using a nearest-neighbor search. This approach is inspired by classic non-parametric methods but overcomes their limitations (data scarcity, lack of distance metric, computational cost) by leveraging deep features and a compositional strategy.

The implementation involves the following two stages:

  1. Stage 1: Regression with a CNN: A CNN is trained to directly map the incomplete input $x$ to a corresponding output image $f(x)$. This is typically done using an L2 loss. The paper uses a fully-convolutional network architecture like PixelNet [PixelNet] for this task. This first stage provides a baseline, smoothed image that captures the overall structure and mid-frequency content implied by the input $x$. For training, paired data $(x_n, y_n)$ are used, where $x_n$ is the incomplete input and $y_n$ is the target high-quality image.

    # Conceptual sketch of Stage 1 training (PyTorch-style). PixelNetRegressor is a
    # placeholder for the fully-convolutional PixelNet-like architecture used in the
    # paper, and dataloader is assumed to yield (incomplete input, target image) pairs.
    import torch
    import torch.nn as nn

    model = PixelNetRegressor()
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.MSELoss()  # L2 regression loss

    for input_batch, target_batch in dataloader:
        optimizer.zero_grad()
        output_batch = model(input_batch)  # smoothed prediction f(x)
        loss = criterion(output_batch, target_batch)
        loss.backward()
        optimizer.step()

    # After training, compute the smoothed output f(x_query) for a query input
    with torch.no_grad():
        smoothed_output = model(x_query)
  2. Stage 2: Pixel-wise Nearest Neighbor Composition: Instead of using global image matching, the method performs nearest neighbor search on individual pixels. For each pixel $i$ in the smoothed output $f(x)$ of the query input $x$, the system finds the most similar pixel $j$ in the smoothed output $f(x_k)$ of a training example $k$. Similarity is measured using a multi-scale deep descriptor extracted from the CNN. Once the best match $(j, k)$ is found for pixel $i$, the corresponding high-frequency content is transferred: the final output pixel at location $i$ is constructed as $\mathrm{Comp}_i(x) = f_i(x) + (y_{jk} - f_j(x_k))$, where $y_{jk}$ is the ground-truth pixel $j$ from training image $k$. The term $(y_{jk} - f_j(x_k))$ represents the high-frequency detail difference between the ground truth and the smoothed output at the matched location.

    To implement the pixel-wise matching, the paper leverages hypercolumn features [Hariharan15]. These features are formed by concatenating features from multiple layers of the pre-trained CNN used in Stage 1 (specifically, layers conv-{1_2, 2_2, 3_3, 4_3, 5_3} of a PixelNet model trained for semantic segmentation). The authors empirically found that features from a network trained for semantic segmentation are more effective for pixel-level correspondence and capturing nuanced details than those from classification networks. Cosine distance is used to measure similarity between pixel descriptors (a sketch of this descriptor extraction appears after the Stage 2 pseudocode below).

    To generate multiple diverse outputs from a single input, the system does not rely on finding just the single best pixel match for each location independently. Instead, the process involves:

    • First, finding the $K$ globally nearest training examples based on their Stage 1 smoothed outputs (using global features such as conv-5).
    • Then, for each pixel in the query's smoothed output, searching for pixel-wise matches only within a local window ($T \times T$ pixels) around the corresponding location in the $K$ selected training examples' smoothed outputs.
    • Different combinations of $K$ (e.g., 1 to 10) and $T$ (e.g., 1 up to the image size) yield different compositional strategies, leading to diverse results. A small $T$ (e.g., $T = 1$) approximates a global exemplar match, while a large $T$ combined with $K > 1$ allows for richer composition. The paper generated 72 candidate outputs by varying these parameters (a sketch of this parameter sweep appears after the Stage 2 pseudocode below).

# Conceptual sketch of Stage 2: pixel-wise nearest-neighbor composition.
# extract_hypercolumn_features, find_k_global_neighbors and pixels_in_window are
# placeholders for the feature extraction, global exemplar retrieval and T x T
# windowing described in the text; K, T and num_pixels follow the same notation.
# The train_* arrays are assumed to be precomputed:
#   train_smoothed_outputs[k]        = model(train_inputs[k])           # f(x_k)
#   train_hypercolumn_features[k][j] = hypercolumn descriptor of pixel j in example k
#   train_ground_truth_outputs[k][j] = ground-truth pixel y_{jk}
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity between two pixel descriptors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

query_smoothed_output = model(query_input)                        # f(x)
query_hypercolumn_features = extract_hypercolumn_features(query_smoothed_output)

final_output = np.zeros_like(query_smoothed_output)

# Restrict the pixel-wise search to the K globally most similar training examples
global_neighbor_indices = find_k_global_neighbors(
    query_smoothed_output, train_smoothed_outputs, K)

for i in range(num_pixels):  # iterate over pixels of the query's smoothed output
    query_pixel_feature = query_hypercolumn_features[i]
    best_match = None
    min_distance = np.inf

    # Nearest-neighbor search within a T x T window around pixel i
    # in each of the K selected training examples
    for k in global_neighbor_indices:
        for j in pixels_in_window(i, T):
            d = cosine_distance(query_pixel_feature,
                                train_hypercolumn_features[k][j])
            if d < min_distance:
                min_distance = d
                best_match = (k, j)

    # Copy the high-frequency residual from the matched training pixel:
    # Comp_i(x) = f_i(x) + (y_{jk} - f_j(x_k))
    k_match, j_match = best_match
    final_output[i] = (query_smoothed_output[i]
                       + train_ground_truth_outputs[k_match][j_match]
                       - train_smoothed_outputs[k_match][j_match])

Practical Applications and Implementation Considerations:

  • Diversity: The primary benefit is generating diverse outputs for the same input, unlike many GANs that suffer from mode collapse (illustrated in Figure 1 and Figure 6, 7, 8). This is crucial for tasks where multiple plausible outcomes exist (e.g., generating textures, different hairstyles from an edge map).
  • Interpretability and Control: Because the output is a composition of training pixels, it is inherently interpretable – you can trace where each output pixel came from (Figure 3). This transparency enables user control: a user can guide the synthesis by selecting specific training examples or even specifying regions from which to copy features (Figure 9). This is implemented by pruning the training set used for NN search (a minimal sketch of such pruning appears after this list).
  • Input Modalities: The method is general and applies to various inputs, including low-resolution images (for super-resolution), surface normal maps, and edge maps, across different domains like faces, animals, shoes, and handbags (Figure 4, 5, 6).
  • Datasets: The paper demonstrates results on CelebA [liu2015faceattributes], the Oxford-IIIT Pet dataset [parkhi12a], and datasets of shoes [fine-grained] and handbags [zhu2016generative].
  • Performance: Quantitatively, using metrics such as angular error for surface normals and Average Precision (AP) for edges, PixelNN is shown to be competitive with, or to surpass, a strong GAN baseline (pix2pix) [pix2pix2016], especially when evaluating the best-case output among multiple generations (Table 1, 2). The ability to generate multiple outputs allows selecting the best one, significantly improving performance over a single deterministic output.
  • Computational Requirements: While NN search can be computationally expensive, the paper's strategy of limiting the search space (to a window around the K global neighbors) makes it more tractable. For deployment in real-time applications, further optimizations or specialized hardware might be necessary, as the authors suggest (mentioning systems like Scanner).
  • Limitations: The method's primary failure mode occurs when suitable matching neighbors are not available in the training set for a particular pixel or configuration (Figure 10). The quality of the generated output is directly dependent on the diversity and quality of the training dataset and the effectiveness of the learned pixel descriptors.
  • Dense Correspondences: An interesting byproduct is the generation of dense pixel-level correspondences between the synthesized image and the training examples. This could be useful for other tasks like label transfer if the training data is augmented with semantic masks.
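
As a concrete illustration of the user control mentioned above, pruning the training set amounts to restricting which exemplars the global and pixel-wise searches may draw from. The sketch below reuses the placeholder find_k_global_neighbors from the Stage 2 pseudocode; the variable names and the selection interface are assumptions, not the paper's implementation.

# Minimal sketch of user control by pruning the exemplar set before NN search.
allowed = [3, 17, 42]   # hypothetical user-selected training indices
allowed_smoothed = [train_smoothed_outputs[k] for k in allowed]

# Retrieve global neighbors only among the approved exemplars, then map the
# positions back to original training indices; the pixel-wise loop is unchanged,
# so every output pixel is guaranteed to be copied from a user-chosen image.
local_indices = find_k_global_neighbors(query_smoothed_output, allowed_smoothed,
                                        min(K, len(allowed)))
global_neighbor_indices = [allowed[idx] for idx in local_indices]

Because the composition loop records a (k, j) match for every output pixel, the same bookkeeping also provides the dense correspondences noted above, which could, for instance, transfer a semantic mask from the matched exemplars onto the synthesized image.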

In summary, PixelNN offers a practical alternative to pure generative models for conditional image synthesis, emphasizing diversity, interpretability, and user control through a two-stage CNN-NN pipeline and compositional pixel matching using multi-scale deep features.