
Trunk-Branch Ensemble Convolutional Neural Networks for Video-based Face Recognition

Published 19 Jul 2016 in cs.CV (arXiv:1607.05427v2)

Abstract: Human faces in surveillance videos often suffer from severe image blur, dramatic pose variations, and occlusion. In this paper, we propose a comprehensive framework based on Convolutional Neural Networks (CNN) to overcome challenges in video-based face recognition (VFR). First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to account for a shortfall in real-world video training data. Using training data composed of both still images and artificially blurred data, CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance robustness of CNN features to pose variations and occlusion, we propose a Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low- and middle-level convolutional layers between the trunk and branch networks. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. Systematic experiments justify the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces. With the proposed techniques, we also obtain the first place in the BTAS 2016 Video Person Recognition Evaluation.

Citations (380)

Summary

  • The paper presents TBE-CNN, a trunk-branch ensemble architecture that extracts both holistic and localized facial features to overcome blur, occlusion, and pose variations.
  • It employs artificial blur during training to develop blur-insensitive feature learning essential for video frame analysis.
  • The study introduces a Mean Distance Regularized Triplet Loss that improves inter- and intra-class separability, achieving superior performance on standard video face recognition benchmarks.


The paper "Trunk-Branch Ensemble Convolutional Neural Networks for Video-based Face Recognition" addresses critical challenges in video-based face recognition (VFR), particularly focusing on image blur, pose variations, and occlusions often present in videos from surveillance cameras. The authors propose a novel CNN architecture, Trunk-Branch Ensemble CNN (TBE-CNN), tailored to enhance the robustness of face recognition systems when dealing with video data.

Overview of Contributions

The study delivers several key contributions to the field of VFR:

  1. Artificial Blur in Training Data: Addressing the dearth of high-quality, real-world video training data, the authors apply artificial blur to still images during training to mimic the degraded quality of real video frames. This strategy promotes the learning of blur-insensitive features, since the CNN must map both the clear and the blurred version of an image to the same identity.
  2. TBE-CNN Architecture: The proposed TBE-CNN architecture is designed to efficiently extract robust features by sharing low- and middle-level layers between a trunk network and several branch networks. The branch networks focus on localized facial regions, complementing the holistic view provided by the trunk network, thus enhancing recognition performance amid pose variations and occlusions.
  3. Enhanced Representation Learning: The paper introduces a refined triplet loss function, the Mean Distance Regularized Triplet Loss (MDR-TL), which imposes additional constraints on the distances between mean representations of different subjects. This regularization encourages a uniform distribution of inter- and intra-class distances, improving the discriminative power of the learned embeddings.
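The artificial-blur augmentation in item 1 can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's exact blur model: it assumes a simple linear motion-blur kernel and grayscale images, whereas the authors' blur parameters and kernel family may differ.

```python
import numpy as np

def motion_blur_kernel(length=7, angle_deg=0.0):
    """Build a normalized linear motion-blur kernel (toy parameters;
    the paper's exact blur model is not reproduced here)."""
    k = np.zeros((length, length))
    c = length // 2
    rad = np.deg2rad(angle_deg)
    for t in np.linspace(-c, c, length * 4):
        row = int(round(c + t * np.sin(rad)))
        col = int(round(c + t * np.cos(rad)))
        if 0 <= row < length and 0 <= col < length:
            k[row, col] = 1.0
    return k / k.sum()

def blur_image(img, kernel):
    """2-D 'same' convolution of a grayscale image with the kernel."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def augment_with_blur(batch):
    """Return clear + blurred copies; both share the same identity label,
    which is what pushes the CNN toward blur-insensitive features."""
    blurred = [blur_image(x, motion_blur_kernel(7, np.random.uniform(0, 180)))
               for x in batch]
    return list(batch) + blurred
```

Because the blurred copy keeps its original label, the network is penalized whenever clear and blurred versions of the same face map to different identities.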
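The weight-sharing idea in item 2 can also be sketched. The toy model below uses plain linear layers with ReLU in place of the paper's convolutional stacks, and made-up dimensions (`D_IN`, `D_SHARED`, `D_OUT`); what it preserves is the structure: shared low/mid-level layers computed once, a holistic trunk head plus per-component branch heads on top, and a concatenated, L2-normalized embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_SHARED, D_OUT = 256, 128, 32   # toy sizes, not the paper's

# Shared low/mid-level layers (one matrix here; conv stacks in the paper).
W_shared = rng.standard_normal((D_IN, D_SHARED)) * 0.05
# Trunk head over the holistic face.
W_trunk = rng.standard_normal((D_SHARED, D_OUT)) * 0.05
# One branch head per facial-component patch (e.g. eyes, nose, mouth).
W_branch = [rng.standard_normal((D_SHARED, D_OUT)) * 0.05 for _ in range(3)]

def relu(x):
    return np.maximum(x, 0.0)

def tbe_embed(face):
    """Trunk-branch ensemble embedding: the shared layers run once,
    trunk and branch heads are applied on top, and their outputs are
    concatenated into a single face representation."""
    shared = relu(face @ W_shared)                      # shared once
    trunk = relu(shared @ W_trunk)                      # holistic cue
    branches = [relu(shared @ Wb) for Wb in W_branch]   # component cues
    emb = np.concatenate([trunk] + branches)
    return emb / np.linalg.norm(emb)
```

Sharing the early layers is what makes the ensemble efficient: the expensive low-level computation is amortized across the trunk and all branches instead of being repeated per sub-network.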
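Item 3 combines a standard triplet term with a regularizer on per-subject mean representations. The sketch below conveys the spirit of MDR-TL under assumed margins and a squared-Euclidean distance; the paper's exact formulation of the mean-distance constraint may differ.

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Standard triplet term: same-identity pairs pulled together,
    different-identity pairs pushed apart by a margin."""
    d_ap = np.sum((anchor - pos) ** 2)
    d_an = np.sum((anchor - neg) ** 2)
    return max(0.0, d_ap - d_an + margin)

def mean_distance_regularizer(embeddings, labels, margin=0.5):
    """Regularizer in the spirit of MDR-TL: penalize pairs of subjects
    whose mean embeddings fall closer than `margin`, encouraging a more
    uniform spread of inter-class distances."""
    classes = sorted(set(labels))
    means = {c: np.mean([e for e, l in zip(embeddings, labels) if l == c],
                        axis=0)
             for c in classes}
    reg = 0.0
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            d = np.sum((means[ci] - means[cj]) ** 2)
            reg += max(0.0, margin - d)
    return reg
```

The regularizer only fires when two subjects' means drift too close, so it complements rather than replaces the per-triplet constraint.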

Experimental Results

The efficacy of these methodologies is demonstrated through extensive experiments on three prominent video databases: PaSC, COX Face, and YouTube Faces. The proposed methods consistently outperform state-of-the-art approaches, particularly highlighted by:

  • Achieving significant improvements in verification rates at a 1% false acceptance rate on the PaSC database, suggesting robust performance in unconstrained conditions.
  • Showing notably high rank-1 identification rates in both video-to-still (V2S) and still-to-video (S2V) tasks on the COX Face database, emphasizing the method's applicability to scenarios where the gallery might consist of still images, such as identification from surveillance footage.
  • Maintaining competitive performance on the YouTube Faces benchmark, despite the challenging verification tasks posed by low-resolution video content.

Practical and Theoretical Implications

The implications of this work are substantial for both theoretical exploration and practical application. From a practical perspective, the enhancements in blur, pose, and occlusion robustness make the TBE-CNN suitable for real-time surveillance and security systems, where varied and harsh environmental conditions prevail. Theoretically, the integration of simulated video frames in training elucidates a pathway for improving CNN generalization in VFR—an approach that could be adapted for other domains experiencing a lack of video data.

Future Prospects

The findings open several avenues for future research and improvements in AI applications:

  • Extension to Other Modalities: Similar trunk-branch architectures could be adapted for other vision tasks, such as action recognition from video.
  • Domain Adaptation: Augmenting the TBE-CNN with domain adaptation techniques might further enhance generalization to more disparate data sources without comprehensive retraining.
  • Efficient Training: Given the computational demands, developing more efficient training paradigms that maintain performance on limited hardware may foster wider adoption.

In summary, the paper presents a compelling advancement in VFR, particularly by adapting CNN architectures and training processes to better handle the complexities and variances inherent in video data. The techniques proposed not only achieve superior performance on modern benchmarks but also set the stage for more resilient and adaptable face recognition systems.
