
Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition (2407.19472v2)

Published 28 Jul 2024 in cs.CV

Abstract: We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, for periocular recognition. These architectures have demonstrated significant success in various computer vision tasks beyond the ones for which they were designed. This work builds on our previous study using off-the-shelf Convolutional Neural Network (CNN) and extends it to include the more recently proposed Vision Transformers (ViT). Despite being trained for generic object classification, middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images. We also demonstrate that CNNs and ViTs are highly complementary since their combination results in boosted accuracy. In addition, we show that a small portion of these pre-trained models can achieve good accuracy, resulting in thinner models with fewer parameters, suitable for resource-limited environments such as mobiles. This efficiency improves if traditional handcrafted features are added as well.

Summary

  • The paper demonstrates that combining CNN and ViT features significantly improves periocular recognition performance.
  • It shows that CNNs extract optimal features from deeper layers while ViTs provide robust early representations, offering complementary benefits.
  • The study reports that fusing features from the two families yields low error rates (e.g., 7.72% EER for R50 combined with the base ViT), while truncated, thinner models keep the approach efficient for biometric recognition.

Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition

This paper rigorously examines the use of pre-trained Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for periocular recognition, with features extracted off-the-shelf. The authors build on foundational work that used CNN configurations, now incorporating ViTs to evaluate their efficacy in biometrics, particularly for identifying individuals from images of the periocular region.

Overview of the Research

The periocular region offers a less intrusive biometric modality, suited to non-cooperative scenarios where traditional facial recognition may be hindered by occlusion or demand an impractical degree of cooperation. The paper deploys several CNN architectures (the ResNet variants R18, R50, and R101) and ViTs (the tiny, small, and base versions) to explore the capabilities of both families of models. All were originally trained for the ImageNet Large Scale Visual Recognition Challenge, a benchmark noted for the scale and variety of its object classification task.
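
As a concrete illustration, the sketch below extracts middle-layer features from both families of pretrained models. It is a minimal approximation of the off-the-shelf pipeline, assuming PyTorch with torchvision and timm; the specific model variants, the tapped layers (layer3 of ResNet-50, block 3 of the tiny ViT), and the pooling choices are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import timm
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.feature_extraction import create_feature_extractor

# CNN: tap a deeper stage; the paper finds CNN features peak in later layers.
cnn = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
cnn_tap = create_feature_extractor(cnn, return_nodes={"layer3": "feat"})

# ViT: hook an early block; the paper finds ViT features are useful early on.
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True).eval()
early = {}
vit.blocks[3].register_forward_hook(lambda mod, inp, out: early.update(feat=out))

@torch.no_grad()
def descriptors(img: torch.Tensor):
    """img: (1, 3, 224, 224) normalized periocular crop -> (cnn_vec, vit_vec)."""
    cnn_vec = cnn_tap(img)["feat"].mean(dim=(2, 3)).squeeze(0)  # GAP over H, W
    vit(img)                                                    # hook fills `early`
    vit_vec = early["feat"].mean(dim=1).squeeze(0)              # average over tokens
    return cnn_vec, vit_vec
```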

Key findings include the high complementarity between CNNs and ViTs, evidenced by improved recognition performance when features from the two families are combined. Notably, the optimal layers for feature extraction in CNNs lie deep in the network, whereas effective feature vectors in ViTs emerge much earlier, often within the first third of the layers. This points to an efficiency advantage for ViTs, which reach comparable levels of abstraction and accuracy with far less of the network.
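
One simple way to exploit this complementarity is score-level fusion: compare a pair of images separately with each feature family and average the resulting distances. The sketch below assumes the `descriptors` tuples from the previous example; the fusion rule and the weight `w` are hypothetical, not the authors' stated method.

```python
import torch.nn.functional as F

def fused_distance(a, b, w=0.5):
    """Weighted average of per-family cosine distances between two samples.
    `a` and `b` are (cnn_vec, vit_vec) tuples; `w` balances the two families."""
    (a_cnn, a_vit), (b_cnn, b_vit) = a, b
    d_cnn = 1.0 - F.cosine_similarity(a_cnn, b_cnn, dim=0)
    d_vit = 1.0 - F.cosine_similarity(a_vit, b_vit, dim=0)
    return (w * d_cnn + (1.0 - w) * d_vit).item()
```

With such distances computed for genuine and impostor pairs, verification reduces to thresholding.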

Numerical Results and Implications

Quantitatively, CNN and ViT combinations yield superior accuracy across various configurations. For instance, the integration of R50 CNN and base ViT achieves a competitive Equal Error Rate (EER) of 7.72%. When traditional hand-crafted features are also utilized — specifically, Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), and Scale-Invariant Feature Transform (SIFT) — additional gains in accuracy are observed, reinforcing the paper’s assertion of feature complementarity.
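
For reference, the EER is the operating point where false acceptances and false rejections are equally likely; lower is better. A minimal sketch of how it can be estimated from genuine and impostor distance scores (standard practice, not the paper's evaluation code):

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Find the threshold where FAR ~= FRR; scores are distances (lower = more similar)."""
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine > t).mean() for t in thresholds])    # rejected genuines
    far = np.array([(impostor <= t).mean() for t in thresholds])  # accepted impostors
    i = int(np.argmin(np.abs(far - frr)))
    return float((far[i] + frr[i]) / 2.0)
```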

Moreover, the research shows that models like the tiny ViT, with significantly fewer parameters (around 1.33 million), approach the performance of heavier CNN models (e.g., R18 with 11.7 million parameters). Larger models therefore do not necessarily deliver accuracy gains proportional to their complexity, which underscores the need to weigh computational efficiency against raw model size, especially in resource-constrained environments such as mobile devices.
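
The "small portion of these pre-trained models" idea from the abstract can be realized by simply truncating a network at the depth whose features are used. A hedged sketch, again assuming timm's tiny ViT; the cut point is an illustrative choice, and the resulting parameter count will not match the paper's reported figure.

```python
import timm

# Keep only the first 4 of 12 transformer blocks of a pretrained tiny ViT,
# discarding the depth that is never used for feature extraction.
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True).eval()
vit.blocks = vit.blocks[:4]  # slicing an nn.Sequential yields an nn.Sequential
print(f"{sum(p.numel() for p in vit.parameters()) / 1e6:.2f}M parameters")
```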

Theoretical and Practical Implications

Practically, the paper underscores the potential of leveraging existing pre-trained models to circumvent data scarcity, a prevalent challenge in biometrics. Theoretically, the findings suggest a design guideline for deep networks: choose the depth at which features are extracted according to the architecture type (CNN vs. ViT) and the application context.

Future developments may optimize and prune these architectures further, or combine more diverse models to uncover additional complementarities. This work sets a precedent for applying similar methodologies to other biometric modalities and to architectures beyond ResNet or baseline ViTs.

In conclusion, the paper makes a strong case for repurposing existing neural network architectures beyond their original design intent, offering a methodology and results that will resonate with researchers aiming to enhance biometric recognition systems using off-the-shelf deep learning frameworks.
