- The paper demonstrates that combining CNN and ViT features significantly improves periocular recognition performance.
- It shows that CNNs extract optimal features from deeper layers while ViTs provide robust early representations, offering complementary benefits.
- The study highlights that combining CNN and ViT features yields lower error rates (e.g., a 7.72% EER for R50 with the base ViT), while lightweight models remain competitive, emphasizing efficiency in biometric recognition.
Combined CNN and ViT features off-the-shelf: Another baseline for recognition
This paper undertakes a rigorous examination of pre-trained Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for periocular recognition using features extracted off-the-shelf. The authors build on foundational work that relied on CNN configurations, now incorporating ViTs to evaluate their efficacy in biometrics, particularly for identifying individuals from images of the periocular region.
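As an illustration of the off-the-shelf idea, the following sketch (assuming PyTorch and torchvision, with placeholder image paths that are not from the paper) uses a frozen ImageNet-pretrained backbone, unmodified, as a feature extractor and compares two periocular images by cosine similarity:

```python
# Minimal off-the-shelf feature extraction sketch (assumed setup, not the paper's code).
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Frozen ImageNet-pretrained backbone; the classifier is dropped to expose the 2048-d embedding.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

def embed(path: str) -> torch.Tensor:
    """Return an off-the-shelf descriptor for one image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(img)

# Hypothetical file names used purely for illustration.
score = F.cosine_similarity(embed("periocular_a.png"), embed("periocular_b.png"))
print(f"similarity: {score.item():.3f}")
```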
Overview of the Research
The periocular region offers a less intrusive biometric modality, suitable for non-cooperative scenarios where traditional facial recognition may be obstructed or require an undesired level of cooperation. The paper deploys several CNN architectures (ResNet variants R18, R50, and R101) and ViTs (tiny, small, and base versions) to compare the two model families. All models are pre-trained on the ImageNet Large Scale Visual Recognition Challenge, a benchmark noted for its scale and variety in object classification.
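A minimal sketch of instantiating these backbones as headless feature extractors, assuming the timm library (the exact model identifiers below are my assumption, not taken from the paper's code):

```python
# Instantiate the six ImageNet-pretrained backbones named above as feature extractors.
import timm

backbones = {
    "R18":       "resnet18",
    "R50":       "resnet50",
    "R101":      "resnet101",
    "ViT-tiny":  "vit_tiny_patch16_224",
    "ViT-small": "vit_small_patch16_224",
    "ViT-base":  "vit_base_patch16_224",
}

models = {}
for short_name, timm_name in backbones.items():
    # num_classes=0 removes the classification head, leaving a pooled feature vector.
    m = timm.create_model(timm_name, pretrained=True, num_classes=0)
    m.eval()
    models[short_name] = m
    print(f"{short_name}: {m.num_features}-dim off-the-shelf features")
```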
Key findings include the high complementarity between CNNs and ViTs, evidenced by improved recognition performance when features from the two model families are combined. Notably, the best layers for feature extraction in CNNs lie deep within the networks, whereas in ViTs effective feature vectors emerge much earlier, often within the first third of the layers. This points to an efficiency advantage for ViTs, which reach similar levels of abstraction and accuracy while using fewer layers.
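The sketch below illustrates how intermediate activations can be tapped with forward hooks, assuming PyTorch/timm models; the specific tap points (a deep ResNet stage versus an early ViT block) are illustrative choices reflecting the deep-vs-early finding, not the paper's exact layers:

```python
# Tap a deep CNN stage and an early ViT block, then pool them into fixed-length descriptors.
import torch
import timm

def tap(module):
    """Register a forward hook that stores the module's output."""
    store = {}
    module.register_forward_hook(lambda m, i, o: store.update(out=o))
    return store

cnn = timm.create_model("resnet50", pretrained=True).eval()
vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

cnn_tap = tap(cnn.layer4)        # deep CNN stage
vit_tap = tap(vit.blocks[3])     # early ViT block (within the first third of 12 blocks)

x = torch.randn(1, 3, 224, 224)  # stand-in for a pre-processed periocular image
with torch.no_grad():
    cnn(x)
    vit(x)

# Pool spatial / token dimensions into feature vectors.
cnn_feat = cnn_tap["out"].mean(dim=(2, 3))    # (1, 2048)
vit_feat = vit_tap["out"][:, 1:].mean(dim=1)  # (1, 768), averaging patch tokens
print(cnn_feat.shape, vit_feat.shape)
```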
Numerical Results and Implications
Quantitatively, CNN and ViT combinations yield superior accuracy across various configurations. For instance, the integration of R50 CNN and base ViT achieves a competitive Equal Error Rate (EER) of 7.72%. When traditional hand-crafted features are also utilized — specifically, Local Binary Patterns (LBP), Histogram of Oriented Gradients (HOG), and Scale-Invariant Feature Transform (SIFT) — additional gains in accuracy are observed, reinforcing the paper’s assertion of feature complementarity.
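A hedged sketch of how such combinations could be evaluated follows: features are fused by L2-normalization and concatenation, and the EER is estimated from genuine and impostor comparison scores. This is a generic protocol, not necessarily the paper's exact one, and the scores below are toy data:

```python
# Feature-level fusion and Equal Error Rate (EER) estimation sketch.
import numpy as np
from sklearn.metrics import roc_curve

def fuse(cnn_feat: np.ndarray, vit_feat: np.ndarray) -> np.ndarray:
    """Concatenate L2-normalized CNN and ViT descriptors."""
    f1 = cnn_feat / np.linalg.norm(cnn_feat)
    f2 = vit_feat / np.linalg.norm(vit_feat)
    return np.concatenate([f1, f2])

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where the false accept rate equals the false reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = genuine pair, 0 = impostor pair
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2)

# Toy usage with synthetic scores; real scores would come from comparing fused
# descriptors of image pairs (e.g., cosine similarity).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.6, 0.1, 500), rng.normal(0.4, 0.1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER = {equal_error_rate(scores, labels):.2%}")
```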
Moreover, the research shows that models like the tiny ViT, with significantly fewer parameters (around 1.33 million), approach the performance of heavier CNN models (e.g., R18 with 11.7 million parameters). This indicates that larger models do not offer accuracy gains proportional to their complexity, and hints at the need to weigh computational efficiency against sheer model size, which is especially pertinent in resource-constrained environments such as mobile devices.
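For reference, parameter counts of standard backbones can be checked as below (a sketch assuming timm; counts for full off-the-shelf models may differ from the paper's figures if only part of a network is actually used for feature extraction):

```python
# Count trainable parameters of a few backbones to compare model sizes.
import timm

for name in ["vit_tiny_patch16_224", "resnet18", "resnet50", "vit_base_patch16_224"]:
    m = timm.create_model(name, pretrained=False, num_classes=0)
    n_params = sum(p.numel() for p in m.parameters())
    print(f"{name}: {n_params / 1e6:.2f}M parameters")
```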
Theoretical and Practical Implications
Practically, the paper underscores the potential of leveraging existing pre-trained models to circumvent data scarcity, a prevalent challenge in biometrics. Theoretically, the findings inform deep network design, emphasizing selective use of network depth depending on architecture type (CNN vs. ViT) and application context.
Future developments may optimize and prune these architectures further, or combine more diverse models to identify additional complementarities. This work sets a precedent for applying similar methodologies to other biometric modalities and for exploring architectures beyond ResNet or baseline ViTs.
In conclusion, the paper makes a strong case for repurposing existing neural network architectures beyond their original design intents, offering a methodology and results that will resonate with researchers aiming to enhance biometric recognition systems using off-the-shelf deep learning features.