- The paper introduces a dual-classifier approach that distinguishes between actual visual presence and subjective human annotation relevance.
- It employs a shared CNN architecture with two specialized heads to effectively handle noisy labels in large-scale datasets like MS COCO and Yahoo Flickr 100M.
- The model yields significant gains in mean average precision (mAP) and Precision at Human Recall (PHR) on noisy, user-generated datasets.
Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels
The paper "Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels" tackles the challenge of human reporting bias in visual annotations: the discrepancy between what is visually present in an image and what human annotators choose to mention. The authors show how noisy, human-centric labels can nonetheless be used to train more accurate image classifiers.
Overview and Methodology
Human reporting bias stems from the subjective judgment annotators exercise when deciding which elements of an image to mention, which forces a model to distinguish between "what's in the image" and "what's worth saying." This distinction is particularly important when applying deep learning classifiers to large-scale image datasets, where inconsistent or missing tags are common. The authors propose a dual-classifier approach: a visual presence classifier and a relevance classifier, structured as distinct "heads" of a shared convolutional neural network (CNN).
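The decoupling described above can be viewed as marginalizing over a latent "visually present" variable when predicting whether a human would report a concept. The sketch below illustrates that idea only; the function name and probabilities are illustrative, not taken from the paper.

```python
def human_label_prob(p_visual, p_report_given_present, p_report_given_absent):
    """Probability that a human annotator labels a concept, marginalized
    over the latent visual-presence variable v:
        p(y=1) = p(v=1) * p(y=1 | v=1) + p(v=0) * p(y=1 | v=0)

    All arguments are probabilities in [0, 1] (illustrative values):
      p_visual                -- p(v=1): the concept is actually in the image
      p_report_given_present  -- annotator mentions it when it is present
      p_report_given_absent   -- annotator mentions it when it is absent
    """
    return (p_visual * p_report_given_present
            + (1.0 - p_visual) * p_report_given_absent)

# A concept can be clearly visible (0.9) yet rarely "worth saying" (0.5),
# so the observed human label is noisy relative to visual presence.
p = human_label_prob(0.9, 0.5, 0.01)
```

This factorization is what lets the two heads specialize: one estimates the first factor (presence), the other the conditional reporting behavior.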
The model's key formulation separates human-centric predictions from actual visual-presence predictions. By decoupling these two factors, the model can estimate both whether a concept is visually present and whether a human would mention it. The two heads share a CNN backbone and are jointly optimized with a multi-label objective; because the model is trained to reproduce the observed human annotations through this factored prediction, it can learn effectively from noisy human labels.
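A minimal sketch of the two-head structure follows, assuming a fixed shared feature vector in place of the CNN backbone. The class name, weight shapes, and the simplified combination rule (presence AND reporting, with no term for reporting absent concepts) are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TwoHeadClassifier:
    """Sketch: a shared feature representation feeds two linear heads.
    One head scores visual presence, the other scores relevance
    ("worth mentioning" given presence). Weights are random placeholders
    standing in for a trained CNN's final layers."""

    def __init__(self, dim, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.w_visual = rng.normal(scale=0.01, size=(dim, n_labels))
        self.w_relevance = rng.normal(scale=0.01, size=(dim, n_labels))

    def forward(self, features):
        # Per-label probability the concept is visually present.
        p_visual = sigmoid(features @ self.w_visual)
        # Per-label probability a human reports it, given it is present.
        p_rel_given_present = sigmoid(features @ self.w_relevance)
        # Simplified human-label prediction: present AND reported.
        p_human = p_visual * p_rel_given_present
        return p_visual, p_human

model = TwoHeadClassifier(dim=4, n_labels=3)
p_visual, p_human = model.forward(np.ones((2, 4)))
```

At training time, the loss would be applied to `p_human` against the noisy annotations, while `p_visual` is the de-biased output used at test time.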
Results
The results demonstrate enhanced classification performance on MS COCO and Yahoo Flickr 100M. The proposed model outperforms conventional classifiers trained directly on the noisy labels, in some scenarios doubling their performance. Notably, it accurately predicts both human-centric labels and visually grounded labels, as evidenced by improvements in mean average precision (mAP) and Precision at Human Recall (PHR).
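The idea behind precision-at-a-recall-level metrics such as PHR can be sketched as follows: rank predictions by score, then report precision at the smallest cutoff whose recall reaches a target level (for PHR, the recall that human annotators themselves achieve). This is a simplified illustration; the paper's exact evaluation protocol may differ.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision of a score-ranked list at the smallest cutoff whose
    recall reaches target_recall. `labels` are 0/1 ground truth; assumes
    at least one positive label."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    tp = 0
    for k, i in enumerate(order, start=1):
        tp += labels[i]
        if tp / total_pos >= target_recall:
            return tp / k  # precision at this cutoff
    return tp / len(scores)

# Ranked by score: [1, 0, 1, 0]; full recall is first reached at k=3,
# where 2 of 3 retrieved items are positive.
p = precision_at_recall([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], target_recall=1.0)
```

Evaluating at the human recall level makes the comparison fair: the model is judged at the operating point humans actually reach, rather than at an arbitrary threshold.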
Moreover, the model's application to a vast dataset like Yahoo Flickr 100M underscores its ability to generalize beyond curated datasets to images annotated through user-generated content, establishing its utility in more varied and less controlled environments. Its superior performance highlights the potential for widespread application across real-world datasets where traditional models would falter without clean annotations.
Implications and Future Work
The implications of successfully disentangling human-centric labels from visual presence are manifold. Practically, this method ensures improved image classification and captioning, allowing more reliable applications in areas ranging from autonomous systems to digital content moderation and retrieval. Theoretically, it offers insights into managing noisy data in machine learning, presenting an avenue for future exploration.
Future work could entail investigating alternative approaches to latent variable estimation, including incorporating constraints or employing an Expectation-Maximization framework. Moreover, this paper opens up the potential for further exploration into the psycholinguistic factors influencing human annotation decisions and how such factors can be integrated into machine learning frameworks.
In conclusion, this paper delivers a substantive contribution to the field of computer vision by addressing the often overlooked yet significant issue of human reporting bias. Through a creative and structured approach, the authors advance the state-of-the-art in leveraging noisy annotations for visual classification, with substantial implications for both practical applications and theoretical explorations in machine learning.