- The paper introduces a CNN-based approach for intrinsic image decomposition that directly predicts albedo and shading from single RGB patches without relying on depth data.
- The research employs a multiscale CNN architecture with scale-invariant and gradient loss functions, trained on synthetic datasets to capture both local and contextual image features.
- Experiments on MPI Sintel and MIT datasets demonstrate that the method achieves lower MSE and LMSE compared to traditional physics-based approaches.
Direct Intrinsics: CNN-Based Albedo and Shading Decomposition
The paper "Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression" introduces a novel methodology for intrinsic image decomposition, focusing on the estimation of albedo and shading from single RGB image patches using a data-driven approach. This approach leverages convolutional neural networks (CNNs) to predict both albedo and shading directly, differentiating itself from conventional methods that incorporate physics-based priors and graph-based inference mechanisms.
Intrinsic image decomposition aims to separate an image into its albedo and shading components, where albedo represents the reflectivity and inherent color of surfaces, and shading encapsulates light interactions and shadow variations. The intrinsic model thus treats an image as the pixelwise product of these two components. Historically, this decomposition problem was tackled with various physically-driven priors and relied heavily on depth information for enhanced accuracy. The proposed method, however, operates on RGB inputs alone, challenging the notion that depth maps are necessary for high-quality intrinsics recovery.
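Written out per pixel (and typically per color channel), the observed image I factors into albedo A and shading S; this is the relation the network's two outputs must jointly satisfy:

```latex
I(x, y) = A(x, y) \cdot S(x, y)
```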
Methodology
The core innovation of this paper lies in the use of synthetic training data, specifically the large-scale MPI Sintel dataset, to train a multiscale CNN architecture. The network consists of interconnected layers designed to capture both local detail and broader contextual information from image patches. Building on the depth-prediction networks of Eigen and Fergus, the authors adapt similar structures while integrating scale-invariant and gradient-based loss functions to refine the network's predictions. A minimal sketch of such an architecture appears below.
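The following PyTorch sketch illustrates the overall pattern only; it is not the authors' network. Layer counts, channel widths, and kernel sizes are placeholder assumptions, and only the structure (a coarse context branch, a fine detail branch, PReLU activations, deconvolutional upsampling, and joint albedo/shading heads) mirrors the elements listed below.

```python
import torch
import torch.nn as nn

class TwoScaleIntrinsicsNet(nn.Module):
    """Hypothetical two-scale regressor: a coarse branch gathers global
    context, a fine branch preserves local detail, and two heads jointly
    predict albedo and shading from shared features."""

    def __init__(self):
        super().__init__()
        # Coarse branch: strided convolutions shrink the image for context,
        # then a deconvolution (transposed conv) upsamples back.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.PReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.PReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=4), nn.PReLU(),
        )
        # Fine branch: sees the raw RGB input alongside the coarse features.
        self.fine = nn.Sequential(
            nn.Conv2d(3 + 32, 32, kernel_size=3, padding=1), nn.PReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.PReLU(),
        )
        # Joint prediction: both heads share the same fine features.
        self.albedo_head = nn.Conv2d(32, 3, kernel_size=3, padding=1)
        self.shading_head = nn.Conv2d(32, 3, kernel_size=3, padding=1)

    def forward(self, rgb):
        # Assumes spatial dimensions divisible by 4 so the branches align.
        context = self.coarse(rgb)
        feats = self.fine(torch.cat([rgb, context], dim=1))
        return self.albedo_head(feats), self.shading_head(feats)

# Usage: albedo, shading = TwoScaleIntrinsicsNet()(torch.randn(1, 3, 64, 64))
```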
Key architectural elements include:
- Multiscale CNN Regression (MSCR): Convolution layers operating at different resolutions jointly predict the albedo and shading outputs. Information shared across scales provides context while maintaining local precision.
- Convolutional Methods: The network utilizes Parametric Rectified Linear Units (PReLUs) for activation functions and employs deconvolutional layers for upsampling.
- Loss Functions: The paper details the formulation of scale-invariant and gradient loss functions, crucial for mitigating global scale ambiguity and encouraging piecewise smoothness in albedo predictions; see the sketch after this list.
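As a rough sketch of these losses, in the style of Eigen and Fergus rather than the paper's exact weighting or log-space handling: with d the per-pixel log-space difference between prediction and ground truth, the scale-invariant term is mean(d^2) - lam * mean(d)^2, and the gradient term penalizes mismatched spatial derivatives of d.

```python
import torch

def scale_invariant_loss(pred_log: torch.Tensor, target_log: torch.Tensor,
                         lam: float = 0.5) -> torch.Tensor:
    """Scale-invariant MSE in log space: lam=0 reduces to plain MSE,
    lam=1 discounts any global scale offset entirely."""
    d = pred_log - target_log              # per-pixel log difference
    n = d.numel()
    return d.pow(2).sum() / n - lam * d.sum().pow(2) / (n * n)

def gradient_loss(pred_log: torch.Tensor, target_log: torch.Tensor) -> torch.Tensor:
    """Penalizes mismatched horizontal and vertical gradients of the log
    difference, encouraging piecewise-smooth predictions whose edges
    align with the ground truth."""
    d = pred_log - target_log
    dx = d[..., :, 1:] - d[..., :, :-1]    # horizontal finite differences
    dy = d[..., 1:, :] - d[..., :-1, :]    # vertical finite differences
    return dx.pow(2).mean() + dy.pow(2).mean()
```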
Results
Evaluated on both the synthetic MPI Sintel dataset and the real-world MIT intrinsic images dataset, Direct Intrinsics delivers significant improvements over previous approaches that rely on color-depth fusion. In particular, compared with methods like that of Chen and Koltun, this RGB-only solution achieves lower mean squared error (MSE) and local mean squared error (LMSE) on the synthetic benchmark. When adapted to real images, the approach is also competitive with physics-based systems in accurately recovering the shading and albedo components.
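For reference, LMSE (from the MIT intrinsic images benchmark of Grosse et al.) is an MSE computed over overlapping local windows, where each window's prediction is first rescaled by its least-squares-optimal factor so that only local structure, not local scale, is penalized. A sketch under assumed window and stride sizes:

```python
import torch

def lmse(pred: torch.Tensor, target: torch.Tensor,
         window: int = 20, stride: int = 10) -> torch.Tensor:
    """Local MSE: per-window least-squares rescaling alpha = <p,t>/<p,p>
    removes local scale before comparing prediction to ground truth."""
    h, w = pred.shape[-2:]
    errors = []
    for i in range(0, h - window + 1, stride):
        for j in range(0, w - window + 1, stride):
            p = pred[..., i:i + window, j:j + window].reshape(-1)
            t = target[..., i:i + window, j:j + window].reshape(-1)
            alpha = (p @ t) / p.pow(2).sum().clamp(min=1e-8)
            errors.append((alpha * p - t).pow(2).mean())
    return torch.stack(errors).mean()
```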
The use of synthetic datasets for training represents a practical shift, sidestepping the data-collection constraints of real-world albedo and shading ground truth. It also suggests that properly managed synthetic data can generalize to real-world complexity, although thorough domain adaptation strategies may still be required for open-world scenarios.
Implications and Future Work
This methodology marks a shift in how a foundational computer vision task handles the decomposition problem, with possible extensions to material recognition and image-based rendering. The results suggest that carefully constructed synthetic data, paired with a robust CNN framework, can replicate and possibly surpass conventional methods that rely on physics-based inference.
The paper also sparks discussion around the utility of purely data-driven models in decomposing complex intrinsic components without explicit physical priors, highlighting possible avenues for enhancing CNN capabilities through better data representation and hypercolumn connection strategies.
Future research could investigate bridging domain gaps further, integrating domain adaptation techniques, and experimenting with alternative synthetic datasets for training. This work lays the groundwork for more sophisticated scene-understanding models, revisiting mainstream assumptions about the necessity of depth information in intrinsic image tasks.