- The paper introduces a novel framework for intrinsic image decomposition that learns from unlabeled video sequences by exploiting temporal consistency.
- This is achieved with new loss functions designed for sequences, yielding competitive performance, e.g., 20.3% WHDR on the IIW dataset, without any supervised annotations.
- The approach marks a step toward unsupervised learning in computer vision tasks where labeled data is scarce, and applies naturally to dynamic environments.
An Innovative Learning Framework for Intrinsic Image Decomposition
Intrinsic image decomposition, a long-standing problem in computer vision, separates an image into two layers: reflectance (the intrinsic surface color, or albedo) and shading (the effects of illumination and geometry). This decomposition enables more nuanced understanding and manipulation of visual content. The paper "Learning Intrinsic Image Decomposition from Watching the World" by Zhengqi Li and Noah Snavely presents a novel approach that leverages temporal image sequences with shifting illumination to address this highly ill-posed problem.
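Concretely, intrinsic decomposition assumes a multiplicative image-formation model, in which each pixel is the product of reflectance and shading. The toy sketch below (illustrative only, not the authors' code) shows the model and why the inverse problem is ill-posed:

```python
import numpy as np

# Toy image formation: I = R * S (per pixel, per channel).
# Reflectance R holds surface albedo; shading S holds illumination.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.2, 0.9, size=(4, 4, 3))  # albedo layer
shading = rng.uniform(0.1, 1.0, size=(4, 4, 1))      # grayscale shading

image = reflectance * shading  # observed image

# The inverse problem is ill-posed: any scalar k > 0 gives an
# equally valid decomposition (k * R, S / k) of the same image.
k = 2.0
assert np.allclose((k * reflectance) * (shading / k), image)
```

This scale ambiguity is one reason single-image decomposition needs strong priors or, as here, extra constraints from sequences.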
Methodology
The paper introduces a framework that trains convolutional neural networks (CNNs) on unannotated image sequences gathered from various sources. Unlike traditional supervised methods, which require large quantities of labeled ground-truth data, this approach exploits the consistency inherent in sequence data: when the scene remains static but the illumination changes, the model can learn to infer reflectance and shading without explicit supervision.
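The structure such sequences expose can be seen in a toy model (a minimal numpy sketch, not the paper's pipeline): with reflectance fixed across frames and only shading varying, reflectance cancels out of inter-frame ratios, which is the kind of constraint that lets unlabeled sequences supervise the decomposition.

```python
import numpy as np

# A static scene under changing light: every frame shares one
# reflectance R, while the shading S_t varies, so I_t = R * S_t.
rng = np.random.default_rng(1)
R = rng.uniform(0.2, 0.9, size=(4, 4))                      # shared reflectance
S = [rng.uniform(0.1, 1.0, size=(4, 4)) for _ in range(3)]  # per-frame shading
frames = [R * s for s in S]                                 # observed sequence

# Ratios between frames cancel R entirely: I_t / I_s = S_t / S_s,
# so cross-frame comparisons constrain the decomposition even
# though no frame is ever labeled.
assert np.allclose(frames[0] / frames[1], S[0] / S[1])
```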
This is achieved through several innovative components:
- Loss Functions: The methodology introduces new loss functions that efficiently evaluate decompositions over entire sequences. These include an all-pairs weighted least squares loss and a dense spatio-temporal smoothness loss, which operate across all pixels and frames to enforce temporal consistency and smoothness.
- Reflectance Consistency and Image Reconstruction: Although trained on sequences, the model operates on individual images at inference time. Training enforces reflectance consistency across each sequence, making the learned decomposition robust to diverse illumination conditions.
- Dataset Compilation: The authors created an extensive dataset leveraging videos across multiple environments — both indoor and outdoor settings. This dataset serves as the backbone for training, offering wide applicability and generalization.
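The "all-pairs" consistency idea above can be evaluated efficiently because an all-pairs sum of squared differences factors into per-pixel sums, avoiding explicit enumeration of frame pairs. The following unweighted sketch is illustrative (the paper's actual loss carries per-pixel weights and operates on log-reflectance predictions):

```python
import numpy as np

def all_pairs_consistency(log_R):
    """Sum of squared differences of per-frame log-reflectance over
    all frame pairs, in O(n) per pixel rather than O(n^2).

    log_R: array of shape (n_frames, H, W).
    Uses the identity sum_{i<j} (x_i - x_j)^2
                      = n * sum_i x_i^2 - (sum_i x_i)^2.
    """
    n = log_R.shape[0]
    sq = (log_R ** 2).sum(axis=0)   # per-pixel sum of squares
    s = log_R.sum(axis=0)           # per-pixel sum
    return (n * sq - s ** 2).sum()

# Sanity check against the brute-force O(n^2) computation:
rng = np.random.default_rng(2)
x = rng.normal(size=(5, 2, 2))
brute = sum(((x[i] - x[j]) ** 2).sum()
            for i in range(5) for j in range(i + 1, 5))
assert np.isclose(all_pairs_consistency(x), brute)
```

The factored form is what makes it practical to penalize disagreement between every pair of frames in a long sequence.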
Results
The method achieves competitive results across several benchmarks: MIT Intrinsic Images, Intrinsic Images in the Wild (IIW), and Shading Annotations in the Wild (SAW). Notably, it reaches this performance without access to any supervised annotations.
Numerical Evaluation
The network delivers promising quantitative results. On the IIW dataset, for instance, it achieves a weighted human disagreement rate (WHDR) of 20.3%, a notable result given that it is trained purely on unlabeled sequences.
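WHDR scores a predicted reflectance map against human pairwise judgments: each judgment states which of two points has darker reflectance (or that they are roughly equal), and the prediction disagrees when its reflectance ratio crosses a threshold delta (0.10 in the standard IIW protocol). A simplified sketch of the metric (the pixel indexing and example data here are illustrative, not the official evaluation code):

```python
def whdr(reflectance, judgments, delta=0.10):
    """Weighted Human Disagreement Rate over pairwise judgments.

    reflectance: mapping from point id to predicted reflectance value.
    judgments: list of (p1, p2, label, weight), where label is
    'E' (roughly equal), '1' (point 1 darker), or '2' (point 2 darker),
    and weight is annotator confidence.
    """
    err = total = 0.0
    for p1, p2, label, w in judgments:
        r1, r2 = reflectance[p1], reflectance[p2]
        if r1 / r2 > 1.0 + delta:
            pred = '2'   # point 1 clearly brighter, so point 2 darker
        elif r2 / r1 > 1.0 + delta:
            pred = '1'   # point 2 clearly brighter, so point 1 darker
        else:
            pred = 'E'   # ratio within threshold: roughly equal
        err += w * (pred != label)
        total += w
    return err / total

# Hypothetical example: two judgments agree, one disagrees.
reflectance = {0: 1.0, 1: 0.5, 2: 1.02}
judgments = [
    (0, 1, '2', 1.0),  # point 1 darker: prediction agrees
    (0, 2, 'E', 1.0),  # roughly equal: prediction agrees
    (1, 0, 'E', 1.0),  # labeled equal, but prediction says point 1 darker
]
score = whdr(reflectance, judgments)
```

With these toy inputs, one of three equally weighted judgments disagrees, so the score is 1/3; lower WHDR is better.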
Implications and Future Prospects
The approach indicates a significant shift towards unsupervised and semi-supervised learning methodologies in computer vision, especially in contexts where labeled data is challenging to acquire. The exploitation of sequence-based data for intrinsic image decomposition opens possibilities for improved modeling in dynamic and uncontrolled environments, extending beyond traditional single-frame learning paradigms.
Future directions could include:
- Enhanced Multi-domain Generalization: Further enhancements in generalization across diverse visual domains could be pursued by augmenting the network with annotations when available, blending supervised signals with the sequence-based learning paradigm.
- Optimization Integration: Incorporating the outputs of this sequence-based learning approach into optimization algorithms could refine intrinsic image decomposition further, addressing minor discrepancies or noise present in single-frame predictions.
This work serves as a foundational step towards learning rich, informative visual representations from the constant backdrop of our visual world, marking a substantial contribution to the field of computer vision and intrinsic image analysis.