- The paper presents a unified multi-task framework that trains a single ResNet-101 model on four self-supervised tasks: relative position, colorization, exemplar learning, and motion segmentation.
- It systematically evaluates these tasks on ImageNet classification, PASCAL VOC object detection, and NYU depth prediction, highlighting their contributions to learning semantic and geometric features.
- Results show that naive task combinations nearly match supervised pre-training for object detection and even surpass it for depth prediction.
Multi-task Self-Supervised Visual Learning
In the paper "Multi-task Self-Supervised Visual Learning" by Carl Doersch and Andrew Zisserman, the authors explore strategies to enhance visual feature learning through self-supervision without manual labeling. The research aims to consolidate multiple self-supervised tasks into a unified visual representation, employing a ResNet-101 architecture.
Methodological Overview
The paper begins with a systematic analysis of four self-supervised tasks: relative position, colorization, exemplar learning, and motion segmentation. The goal is to establish a comparable framework in which all four tasks are implemented and evaluated on the same deep network architecture.
- Relative Position: Predicts the spatial offset between two patches sampled from the same image, i.e., which of eight neighboring positions one patch occupies relative to the other (a sketch follows this list).
- Colorization: Predicts the colors of an image given only its grayscale version.
- Exemplar Learning: Treats heavily augmented crops of each image as a pseudo-class and trains the network to discriminate these pseudo-classes from one another.
- Motion Segmentation: Predicts which pixels will move in future frames of a video.
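To make the first task concrete, below is a minimal sketch of relative-position training in PyTorch. The patch size, gap, toy encoder, and the names `sample_pair` and `RelPosHead` are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

# The 8 possible positions of the second patch relative to the first,
# as (row_offset, col_offset) on a 3x3 grid around the center patch.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, -1),           (0, 1),
           (1, -1),  (1, 0),  (1, 1)]

def sample_pair(image, patch=64, gap=16):
    """Crop a center patch and one of its 8 neighbors; the neighbor's index is the label."""
    _, h, w = image.shape
    cy, cx = h // 2, w // 2
    label = torch.randint(len(OFFSETS), (1,)).item()
    dy, dx = OFFSETS[label]
    step = patch + gap  # the gap discourages the network from matching patch edges
    y0, x0 = cy - patch // 2, cx - patch // 2
    y1, x1 = y0 + dy * step, x0 + dx * step
    return (image[:, y0:y0 + patch, x0:x0 + patch],
            image[:, y1:y1 + patch, x1:x1 + patch], label)

class RelPosHead(nn.Module):
    """Embeds both patches with a shared encoder, then classifies their relative offset."""
    def __init__(self, encoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(2 * feat_dim, len(OFFSETS))

    def forward(self, a, b):
        return self.classifier(torch.cat([self.encoder(a), self.encoder(b)], dim=1))

# Toy usage with a random 224x224 image and a tiny stand-in encoder:
enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
a, b, y = sample_pair(torch.rand(3, 224, 224))
logits = RelPosHead(enc, feat_dim=16)(a.unsqueeze(0), b.unsqueeze(0))  # shape (1, 8)
```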
Each task is evaluated on three common benchmarks: ImageNet image classification, PASCAL VOC object detection, and NYU depth prediction. These evaluations gauge how effectively the learned features capture semantic and geometric aspects of images.
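For the ImageNet evaluation, the pre-trained network is frozen and only a classifier is trained on top of its features. Below is a minimal sketch of such a frozen-feature protocol, assuming a PyTorch trunk that outputs flat feature vectors; the optimizer, learning rate, and epoch count are illustrative.

```python
import torch
import torch.nn as nn

def evaluate_frozen(trunk, feat_dim, num_classes, loader, epochs=5, lr=1e-3):
    """Train only a classifier on top of frozen trunk features.

    Assumes `trunk` maps images to (batch, feat_dim) vectors and `loader`
    yields (images, labels) pairs.
    """
    for p in trunk.parameters():   # freeze the representation under test
        p.requires_grad = False
    trunk.eval()

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():  # no gradients flow into the frozen trunk
                feats = trunk(images)
            loss = loss_fn(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```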
Multi-task Learning Approach
A significant contribution of the paper is the examination of multi-task learning. The authors investigate various combinations of the tasks, utilizing a shared trunk architecture with task-specific heads. The results indicate that deeper networks are more effective, and that even naive combination strategies can improve performance.
- Naive Multi-task Architecture: Combines tasks in a straightforward joint training setup, with a shared trunk and a lightweight head per task (see the sketch after this list).
- Lasso Regularization: Learns a sparse, L1-penalized linear combination of features drawn from different network layers, letting each evaluation task select the layers most relevant to it.
- Harmonization: Resolves task-specific conflicts by modifying inputs so all tasks see consistent data, for example converting the relative-position task's inputs to grayscale so it cannot exploit chromatic aberration cues and matches colorization's input modality.
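Here is a minimal sketch of the shared-trunk design and the naive combination strategy, assuming a PyTorch trunk plus per-task heads and loss functions; the `LassoMix` module is likewise an illustrative rendering of the lasso idea, with the layer set and penalty weight as assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared trunk with one lightweight head per self-supervised task."""
    def __init__(self, trunk, heads):
        super().__init__()
        self.trunk = trunk                 # shared feature extractor (e.g., a ResNet)
        self.heads = nn.ModuleDict(heads)  # e.g., {"relpos": ..., "color": ...}

    def forward(self, x, task):
        return self.heads[task](self.trunk(x))

def naive_joint_step(model, optimizer, batches, loss_fns):
    """Naive combination: sum each task's loss on its own batch, one shared update."""
    optimizer.zero_grad()
    total = 0.0
    for task, (x, y) in batches.items():
        total = total + loss_fns[task](model(x, task), y)
    total.backward()
    optimizer.step()
    return float(total)

class LassoMix(nn.Module):
    """Lasso-style readout: an L1-penalized linear mix of features from several layers."""
    def __init__(self, num_layers):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, layer_feats):        # list of (batch, dim) tensors, one per layer
        stacked = torch.stack(layer_feats)              # (layers, batch, dim)
        return (self.alpha.view(-1, 1, 1) * stacked).sum(dim=0)

    def l1_penalty(self, weight=1e-4):     # add to the task loss to encourage sparsity
        return weight * self.alpha.abs().sum()
```

Summing the task losses and taking a single shared gradient step leaves the balance between tasks entirely to their native loss scales, which is why this strategy is described as naive.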
Results
The empirical results show improvements when tasks are combined, with relative position and colorization being especially beneficial. A notable observation is that combining self-supervised tasks nearly matches supervised ImageNet pre-training for PASCAL object detection and surpasses it for NYU depth prediction. However, harmonization and lasso-based techniques yielded minimal improvements over naive task combinations.
Implications and Future Directions
The findings underscore the viability of self-supervision as an alternative to manual annotation, especially where labeled data is expensive or impractical to obtain. Broader application will likely depend on future work that optimizes task combinations and improves input harmonization. Exploring other architectures, such as VGG-16 or deeper networks like ResNet-152 and DenseNet, might provide additional insight into how depth and capacity affect self-supervised learning.
Overall, this paper advances our understanding of multi-task learning in the domain of computer vision, demonstrating promising directions for self-supervised learning in real-world applications.