MiVOLO: Multi-input Transformer for Age and Gender Estimation (2307.04616v2)

Published 10 Jul 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Age and gender recognition in the wild is a highly challenging task: apart from the variability of conditions, pose complexities, and varying image quality, there are cases where the face is partially or completely occluded. We present MiVOLO (Multi Input VOLO), a straightforward approach for age and gender estimation using the latest vision transformer. Our method integrates both tasks into a unified dual input/output model, leveraging not only facial information but also person image data. This improves the generalization ability of our model and enables it to deliver satisfactory results even when the face is not visible in the image. To evaluate our proposed model, we conduct experiments on four popular benchmarks and achieve state-of-the-art performance, while demonstrating real-time processing capabilities. Additionally, we introduce a novel benchmark based on images from the Open Images Dataset. The ground truth annotations for this benchmark have been meticulously generated by human annotators, resulting in high accuracy answers due to the smart aggregation of votes. Furthermore, we compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges. Finally, we grant public access to our models, along with the code for validation and inference. In addition, we provide extra annotations for used datasets and introduce our new benchmark.

Summary

The paper introduces a dual-input transformer that fuses facial and body features to enhance recognition even when one modality is occluded.
MiVOLO achieves state-of-the-art performance on benchmarks like IMDB-Clean and surpasses human-level accuracy in age estimation.
The study establishes a new benchmark, LAGENDA, and demonstrates real-time processing capabilities essential for practical applications.

Overview of "MiVOLO: Multi-input Transformer for Age and Gender Estimation"

The paper "MiVOLO: Multi-input Transformer for Age and Gender Estimation" introduces an approach to age and gender recognition in computer vision, focusing on images captured in uncontrolled environments. The proposed model, MiVOLO, builds on the VOLO vision transformer architecture and takes both facial and body information as input to improve recognition performance. This dual-input design aims to improve the model's generalization, including in scenarios where the face is partially or completely hidden.

The authors show that MiVOLO achieves state-of-the-art (SOTA) performance across several established benchmarks, which supports their dual input/output design. The model also runs in real time, which is essential for practical applications such as surveillance and retail analytics.

Methodology and Contributions

The MiVOLO architecture is built on a multi-input configuration that processes both face and body crops. It uses cross-attention for cross-view feature fusion, so that features from each input can be enriched with information from the other. The fusion is intended to keep age and gender recognition robust, especially when one of the inputs is absent or occluded.
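
The cross-view fusion idea can be illustrated with a minimal sketch of scaled dot-product cross-attention in plain Python. This is not the authors' implementation: the function names (`cross_attention`, `fuse`), the absence of learned projections, and the feature-wise concatenation at the end are all assumptions made for illustration only.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    # Each query token attends over all key/value tokens of the other input.
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def fuse(face_tokens, body_tokens):
    # Hypothetical fusion: face tokens query body tokens and vice versa,
    # then the two enriched token sets are concatenated feature-wise.
    # Assumes both inputs have the same number of tokens.
    face_enriched = cross_attention(face_tokens, body_tokens, body_tokens)
    body_enriched = cross_attention(body_tokens, face_tokens, face_tokens)
    return [f + b for f, b in zip(face_enriched, body_enriched)]


face = [[1.0, 0.0], [0.0, 1.0]]   # toy face-crop tokens
body = [[0.5, 0.5], [1.0, -1.0]]  # toy body-crop tokens
fused = fuse(face, body)
print(len(fused), len(fused[0]))
```

In a real model the queries, keys, and values would come from learned linear projections of the patch embeddings; the sketch omits them to keep the attention arithmetic visible.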

A significant contribution of this work is a new benchmark, LAGENDA, derived from the Open Images Dataset. The benchmark is annotated for age and gender by human annotators, which helps address the biases present in existing celebrity-focused datasets. The annotations use a weighted mean aggregation of user votes to attain high accuracy, giving a more reliable ground truth for measuring performance.
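
As a rough illustration of weighted mean vote aggregation, the sketch below combines several annotators' age guesses into a single label. The function name `aggregate_age` and the weights themselves are hypothetical; this summary does not say how the weights were derived, so the example simply assumes that a higher weight means a more trusted annotator.

```python
def aggregate_age(votes, weights):
    # Weighted mean of age votes; the weights are assumed, not taken from the paper.
    if not votes or len(votes) != len(weights):
        raise ValueError("votes and weights must be non-empty and equal in length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(votes, weights)) / total


# Three annotators guess 30, 34, and 40; the first is trusted twice as much.
label = aggregate_age([30, 34, 40], [2.0, 1.0, 1.0])
print(label)  # 33.5
```

Compared with a plain average (about 34.7 here), the weighted mean pulls the label toward the more trusted annotator's vote.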

Additionally, the paper compares the model's age estimation with human-level accuracy and reports that MiVOLO outperforms humans across most age ranges. This result is notable given how much human age estimates typically vary, particularly for natural, unposed photographs.

Experimental Results

The MiVOLO model is evaluated in extensive experiments on benchmarks such as IMDB-Clean, UTKFace, Adience, FairFace, and AgeDB. For instance, on the IMDB-Clean dataset, MiVOLO achieves a lower age Mean Absolute Error (MAE) than existing methods. The model trained with LAGENDA shows improved adaptability and performance, which supports the value of diverse training sets.
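
For reference, MAE is the average absolute difference between predicted and true ages, so lower is better. The following minimal sketch computes it; the function name and the sample ages are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    # Average of |prediction - ground truth| over all samples.
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and equal in length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Toy ages, in years: errors are 2, 2, and 0.
mae = mean_absolute_error([25, 40, 61], [27, 38, 61])
print(round(mae, 3))  # 1.333
```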

Furthermore, the paper reports a high processing rate of 971 FPS on an NVIDIA V100 GPU, demonstrating the model's suitability for real-time deployments.
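
A throughput figure of this kind is the number of images processed divided by the elapsed wall-clock time. The helper below is a generic sketch of that measurement, not the authors' benchmarking code; `measure_fps`, `run_batch`, and the injectable `clock` parameter are hypothetical names.

```python
import time


def measure_fps(run_batch, num_batches, batch_size, clock=time.perf_counter):
    # Images per second = total images processed / elapsed seconds.
    start = clock()
    for _ in range(num_batches):
        run_batch()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return num_batches * batch_size / elapsed


# Deterministic demo with a fake clock: 4 batches of 50 images in 2 seconds.
ticks = iter([0.0, 2.0])
fps = measure_fps(lambda: None, num_batches=4, batch_size=50,
                  clock=lambda: next(ticks))
print(fps)  # 100.0
```

At 971 FPS, the average cost works out to roughly 1.03 ms per image. When timing a GPU model, `run_batch` would also need to wait for the device to finish, since GPU work is typically launched asynchronously.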

Implications and Future Directions

The introduction of MiVOLO and the associated benchmarks has substantial implications for advancing age and gender recognition technologies. Its dual-input mechanism offers a framework that can be expanded upon for other multi-modal recognition tasks in AI. Practically, the model's ability to handle images with obscured or non-visible facial features renders it applicable to industries where such conditions are commonplace.

Looking forward, this research opens avenues for further refinement via advanced segmentation techniques to ensure precise body feature extraction. The potential integration with self-supervised learning models like Masked Autoencoders could address the data-intensive nature of training robust age estimation models. Continued work could also focus on enriching datasets with underrepresented demographics to mitigate biases further.

Overall, the MiVOLO paper makes a notable contribution to computer vision, offering methods and insights that align with the broader goal of improving human-like perceptual capabilities in AI systems.
