- The paper introduces a novel two-stage CNN framework that uses LNet for face localization and ANet for accurate attribute prediction.
- It leverages weak supervision: LNet is pre-trained on ImageNet and ANet on a large face identity dataset, and both are fine-tuned with only image-level attribute tags, bypassing face bounding-box and landmark annotations.
- Empirical results show significant gains, improving prior state-of-the-art accuracy by 8% on CelebA and 13% on LFWA.
Deep Learning Face Attributes in the Wild
The paper "Deep Learning Face Attributes in the Wild" by Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, presents a robust, two-stage Convolutional Neural Network (CNN) framework for predicting face attributes in unconstrained settings. This work focuses on the integration of two specialized CNNs, LNet and ANet, which are fine-tuned jointly but pre-trained with different datasets to enhance localization and attribute prediction capabilities respectively.
Core Contributions
The core contributions of this work are multi-faceted:
- Novel Framework Design:
  - LNet is pre-trained on ImageNet categories to improve face localization.
  - ANet is pre-trained on a large face identity dataset to better predict facial attributes.
- Weak Supervision:
  - LNet is fine-tuned with image-level attribute tags only, requiring no face bounding boxes or landmarks, in contrast to traditional methods that rely on exact face part annotations (see the sketch after this list).
- Efficiency Enhancements:
  - A new method for fast feed-forward evaluation in CNNs with locally shared filters significantly reduces computational overhead.
- Empirical Performance:
  - The proposed method surpasses state-of-the-art attribute prediction accuracy by combining the specialized pre-training of LNet and ANet.
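As referenced in the Weak Supervision item above, the following is a minimal sketch (assuming PyTorch; layer sizes and names are illustrative, not the paper's configuration) of training a localizer from image-level attribute tags only: multi-label binary cross-entropy on globally pooled attribute response maps, with a face-response map emerging as a by-product.

```python
# Weak-supervision sketch: no boxes or landmarks, only image-level attribute tags.
import torch
import torch.nn as nn

class WeaklySupervisedLocalizer(nn.Module):
    def __init__(self, num_attrs=40):
        super().__init__()
        self.trunk = nn.Sequential(               # stand-in for ImageNet-pretrained layers
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.attr_maps = nn.Conv2d(64, num_attrs, 1)   # one response map per attribute

    def forward(self, x):
        maps = self.attr_maps(self.trunk(x))      # (N, 40, H, W)
        logits = maps.mean(dim=(2, 3))            # global pooling -> image-level logits
        return logits, maps

model = WeaklySupervisedLocalizer()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One training step on a dummy batch: images plus image-level attribute tags only.
images = torch.rand(8, 3, 128, 128)
tags = torch.randint(0, 2, (8, 40)).float()
optimizer.zero_grad()
logits, maps = model(images)
loss = criterion(logits, tags)
loss.backward()
optimizer.step()

# At test time, averaging the per-attribute maps yields a face-response map that
# can be thresholded to propose a face window, without any box supervision.
face_response = maps.mean(dim=1, keepdim=True)
```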
Numerical Results
The paper reports substantial improvements in attribute classification accuracy on the CelebA and LFWA (attribute-labeled Labeled Faces in the Wild) benchmarks:
- The method improves existing accuracies by 8% on CelebA and 13% on LFWA.
- For example, on CelebA, PANDA-l achieves an average accuracy of 85% while the proposed LNets+ANet reaches 87%; the larger 8% and 13% margins are measured against baselines that, like LNets+ANet, do not use ground-truth face part annotations.
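The averages quoted above are per-attribute binary accuracies averaged over the 40 attributes; the tiny sketch below shows that computation (an assumption about the reporting convention, not code from the paper).

```python
# Mean attribute accuracy: per-attribute binary accuracy, then mean over attributes.
import numpy as np

def mean_attribute_accuracy(pred, gt):
    """pred, gt: (num_images, num_attrs) arrays of 0/1 labels."""
    per_attr = (pred == gt).mean(axis=0)   # accuracy of each attribute
    return per_attr.mean()                 # average over all attributes

pred = np.random.randint(0, 2, (1000, 40))
gt = np.random.randint(0, 2, (1000, 40))
print(f"mean accuracy: {mean_attribute_accuracy(pred, gt):.3f}")
```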
Theoretical and Practical Implications
The theoretical contributions extend our understanding of how specialized pre-training regimes shape the representational capabilities of CNNs:
- Specialized Pre-training: LNet, pre-trained on general object categories, learns robust localization features from rich supervisory signals, while ANet, pre-trained on face identities, captures high-level semantic concepts intrinsic to face attributes.
- Feature Discovery in Pre-trained Networks: Hidden neurons in ANet, after pre-training on face identities alone, implicitly discover semantic concepts relevant to face attributes, indicating that identity supervision induces attribute-related features without any explicit attribute labels (a minimal probing sketch follows this list).
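As noted in the Feature Discovery item, one simple way to probe such a claim is to ask how well a single hidden neuron's activation separates a binary attribute at its best threshold. The sketch below (assuming NumPy, with a hypothetical setup rather than the paper's analysis protocol) implements that test.

```python
# Single-neuron probing sketch: best balanced accuracy from thresholding one neuron.
import numpy as np

def neuron_separability(activations, labels):
    """activations: (num_images,) responses of one neuron; labels: 0/1 attribute tags."""
    best = 0.5
    for t in np.unique(activations):
        pred = (activations > t).astype(int)
        pos_acc = (pred[labels == 1] == 1).mean()
        neg_acc = (pred[labels == 0] == 0).mean()
        acc = 0.5 * (pos_acc + neg_acc)
        best = max(best, acc, 1.0 - acc)   # 1 - acc covers the flipped polarity
    return best

# Dummy data standing in for one hidden neuron's responses and an attribute tag.
rng = np.random.default_rng(0)
acts = rng.normal(size=500)
tags = (acts + 0.5 * rng.normal(size=500) > 0).astype(int)
print(f"best single-neuron balanced accuracy: {neuron_separability(acts, tags):.3f}")
```

A neuron that scores well above chance on an attribute it was never trained on is evidence that the pre-training objective induced that concept implicitly.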
Practically, this work simplifies data preparation by removing the need for face landmark or bounding-box annotations, and its attribute predictions remain robust to variations in pose, illumination, and occlusion, making the approach applicable to real-world imagery. The proposed fast feed-forward algorithm further speeds up attribute recognition, enabling near real-time use; a hedged illustration of the dense-evaluation idea behind such speedups follows.
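The sketch below (assuming PyTorch) shows where this kind of speedup comes from: the fully connected classifier is folded into an equivalent convolution so the whole image is processed in one pass instead of one pass per overlapping window. This is the standard fully-convolutional trick, not the paper's exact interweaved operation for locally shared filters; all sizes and names are illustrative.

```python
# Dense window evaluation: per-crop loop vs. one fully convolutional pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, kernel_size=3)           # shared feature layer (no padding)
fc = nn.Linear(16 * 30 * 30, 40)                 # classifier trained on 32x32 windows

img = torch.rand(1, 3, 128, 128)
slow = []
with torch.no_grad():
    # Slow path: crop every 32x32 window (stride 4) and run the net per crop.
    for y in range(0, 128 - 32 + 1, 4):
        for x in range(0, 128 - 32 + 1, 4):
            patch = img[..., y:y + 32, x:x + 32]
            slow.append(fc(conv(patch).flatten(1)))      # (1, 40) per window

    # Fast path: fold the fully connected layer into an equivalent convolution
    # and evaluate the whole image once; each output location is one window.
    fc_as_conv = nn.Conv2d(16, 40, kernel_size=30, stride=4)
    fc_as_conv.weight.data = fc.weight.data.view(40, 16, 30, 30)
    fc_as_conv.bias.data = fc.bias.data
    fast = fc_as_conv(conv(img))                         # (1, 40, 25, 25)

slow = torch.stack(slow).view(25, 25, 40)
fast = fast[0].permute(1, 2, 0)                          # (25, 25, 40)
print(torch.allclose(slow, fast, atol=1e-3))             # True, up to float reordering
```

The fast path reuses every intermediate feature shared by overlapping windows, which is the same intuition behind the paper's efficiency contribution for locally shared filters.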
Future Directions in AI
Future directions prompted by this research include:
- Enhancing and generalizing pre-training strategies to other attribute recognition tasks beyond face attributes.
- Extending the weak supervision concept to other areas of object detection and classification where bounding box annotation is challenging.
- Exploring more efficient techniques to further reduce computational costs in CNNs without compromising accuracy, particularly for mobile and embedded applications.
In summary, the innovative framework and methodologies introduced in this paper contribute substantially to the field of face attribute recognition in challenging environments and provide a strong foundation for future explorations in deep learning-based attribute prediction.