RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework (2504.10018v1)

Published 14 Apr 2025 in cs.CV and cs.AI

Abstract: Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians' external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task, drawing inspiration from the advantages of event cameras, namely robustness in low-light and high-speed scenarios and low power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released at https://github.com/Event-AHU/OpenPAR

Summary

  • The paper introduces a novel benchmark dataset, EventPAR, with 100,000 paired RGB frames and event streams annotated with 50 attributes, including six emotions.
  • The paper proposes an asymmetric RWKV fusion framework that efficiently combines spatial RGB features with temporal event data using a similarity-based token filtering strategy.
  • The paper demonstrates state-of-the-art performance on EventPAR, outperforming 18 baselines, and confirms enhanced robustness under challenging visual conditions.

This paper addresses the limitations of traditional RGB-based Pedestrian Attribute Recognition (PAR) systems, which struggle with challenging lighting conditions and motion blur, and typically ignore pedestrian emotions. The authors propose a novel multi-modal approach using both RGB and event cameras to improve robustness and incorporate emotional attributes.

The main contributions are twofold:

  1. EventPAR Dataset: A new large-scale benchmark dataset for multi-modal PAR.
    • Content: Contains 100,000 paired, spatio-temporally aligned RGB frames and event streams captured using a DVS346 camera.
    • Attributes: Annotates 50 attributes across 12 groups, including standard appearance attributes (gender, age, clothing, accessories, posture, activity) and, uniquely, six basic human emotions (Happiness, Sadness, Anger, Surprise, Fear, Disgust).
    • Diversity: Collected over several months, covering different seasons (summer/winter), scenes (day/night), and weather conditions (sunny/rainy).
    • Challenges: Includes naturally challenging conditions (illumination variations, motion blur, occlusion) and synthetically introduced degradations (noise, adversarial attacks) to simulate complex real-world scenarios.
    • Benchmark: Provides baseline results by retraining and evaluating 18 existing PAR models on this new dataset.
  2. Asymmetric RWKV Fusion Framework: A novel deep learning architecture designed for RGB-Event PAR.
    • Backbone: Uses Vision-RWKV (VRWKV) encoders, chosen for their efficiency and ability to model sequential data, to extract features from both RGB frames and event data streams independently. Event streams are stacked into frames aligned with RGB frames.
    • OTN-RWKV Fusion Module: An asymmetric fusion module designed to handle the characteristics of RGB and event data:
      • Event Token Filtering: Addresses the high density and redundancy often present in event data. It applies a similarity-based filtering mechanism (KNP_filter) to the event tokens (O_e') to select the most informative ones (O_e''), reducing computational overhead and ambiguity.
      • Interactive Fusion: Fuses the RGB features (O_r') and the filtered event features (O_e'') using a modified bidirectional WKV (Bi-WKV) mechanism inspired by cross-attention and the RWKV architecture. This allows for efficient interaction between the spatial information from RGB and the temporal information from events.
    • Prediction: Fused features are passed through an average pooling layer and a linear classifier to predict the attributes.
    • Training: Uses a Weighted Cross-Entropy (WCE) loss function to handle the inherent imbalance in attribute distribution (a simplified sketch of the filtering, fusion, and loss steps follows this list).
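
A minimal sketch of the components above may help make the pipeline concrete. This is not the authors' implementation: the VRWKV encoders are assumed to be arbitrary token encoders producing (batch, tokens, dim) features, the Bi-WKV interaction is approximated with standard cross-attention, the similarity-based filtering is written as one plausible reading of the KNP_filter step (dropping the most mutually redundant event tokens), and the weighted loss is expressed as the weighted binary cross-entropy commonly used for multi-label PAR. Names such as filter_event_tokens, FusionHead, and weighted_bce_loss are hypothetical.

```python
# Simplified sketch of the asymmetric RGB-Event fusion idea (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def filter_event_tokens(event_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Similarity-based token filtering: drop event tokens that are highly
    redundant with the rest of the sequence, keeping the more informative ones.

    event_tokens: (B, N, D) encoded event tokens (O_e' in the paper's notation).
    Returns filtered tokens (O_e'') of shape (B, K, D) with K = N * keep_ratio.
    """
    normed = F.normalize(event_tokens, dim=-1)                    # (B, N, D)
    sim = normed @ normed.transpose(1, 2)                         # (B, N, N) cosine similarity
    redundancy = sim.mean(dim=-1)                                 # high value = similar to many tokens
    k = max(1, int(event_tokens.shape[1] * keep_ratio))
    keep_idx = redundancy.topk(k, dim=1, largest=False).indices   # keep least redundant tokens
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, event_tokens.shape[-1])
    return torch.gather(event_tokens, 1, keep_idx)


class FusionHead(nn.Module):
    """RGB-Event fusion and attribute prediction (cross-attention stands in for Bi-WKV)."""

    def __init__(self, dim: int = 256, num_attributes: int = 50, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_attributes)

    def forward(self, rgb_tokens: torch.Tensor, event_tokens: torch.Tensor) -> torch.Tensor:
        event_tokens = filter_event_tokens(event_tokens)                    # event token filtering
        fused, _ = self.cross_attn(rgb_tokens, event_tokens, event_tokens)  # RGB queries the events
        pooled = fused.mean(dim=1)                                          # average pooling over tokens
        return self.classifier(pooled)                                      # multi-label attribute logits


def weighted_bce_loss(logits: torch.Tensor, targets: torch.Tensor, pos_freq: torch.Tensor) -> torch.Tensor:
    """Weighted binary cross-entropy for imbalanced multi-label attributes.

    pos_freq: per-attribute positive frequency in the training set, shape (num_attributes,).
    Rare attributes receive larger weights, a common weighting scheme in PAR.
    """
    weights = torch.where(targets == 1, torch.exp(1 - pos_freq), torch.exp(pos_freq))
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)


if __name__ == "__main__":
    B, N, D, A = 2, 196, 256, 50
    rgb = torch.randn(B, N, D)        # tokens from the RGB encoder
    evt = torch.randn(B, N, D)        # tokens from the event-frame encoder
    head = FusionHead(dim=D, num_attributes=A)
    logits = head(rgb, evt)
    targets = torch.randint(0, 2, (B, A)).float()
    pos_freq = torch.full((A,), 0.3)  # placeholder attribute frequencies
    print(weighted_bce_loss(logits, targets, pos_freq))
```

In this sketch, the RGB tokens act as queries over the filtered event tokens, which is one simple way to realize the asymmetric interaction described above; the actual OTN-RWKV module replaces the cross-attention with a bidirectional WKV mechanism.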

Experiments and Results:

  • The proposed framework was evaluated on the new EventPAR dataset and two existing datasets (MARS-Attribute, DukeMTMC-VID-Attribute) where event data was simulated.
  • On EventPAR, the proposed model achieved state-of-the-art results, significantly outperforming 18 baseline methods. Performance was particularly strong when fusing RGB and event data (e.g., 87.66 mA, 89.07 F1) compared to using only RGB (79.32 mA, 83.22 F1) or only event data (87.10 mA, 88.91 F1); see the metric sketch after this list for how mA and F1 are computed.
  • On the simulated MARS and DukeMTMC-VID datasets, the method showed competitive performance compared to other state-of-the-art PAR algorithms.
  • Ablation studies validated the contributions of different components:
    • Using both RGB and event modalities improved performance over single modalities.
    • The proposed OTN-RWKV fusion outperformed simple fusion methods like concatenation or addition.
    • The similarity-based event aggregation strategy was superior to max/mean pooling or GNN-based methods.
    • The RWKV backbone outperformed ViT and ResNet-50 backbones in this task.
  • Visualizations showed that fusing the two modalities corrects prediction errors made with single-modality inputs, and illustrated the effectiveness of the event token filtering strategy.
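
As a reference for the numbers reported above, the following sketch shows how mA and F1 are commonly computed in PAR benchmarks (label-based mean accuracy and instance-based F1). This reflects the standard PAR evaluation protocol rather than a detail stated in this summary, and the function names are hypothetical.

```python
# Common PAR evaluation metrics: label-based mA and instance-based F1 (sketch).
import numpy as np


def mean_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Label-based mA: average of per-attribute positive and negative recall.

    pred, gt: binary arrays of shape (num_samples, num_attributes).
    """
    eps = 1e-12
    pos_recall = ((pred == 1) & (gt == 1)).sum(0) / ((gt == 1).sum(0) + eps)
    neg_recall = ((pred == 0) & (gt == 0)).sum(0) / ((gt == 0).sum(0) + eps)
    return float(((pos_recall + neg_recall) / 2).mean())


def instance_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Instance-based F1: precision and recall computed per sample, then averaged."""
    eps = 1e-12
    tp = ((pred == 1) & (gt == 1)).sum(1)
    precision = (tp / (pred.sum(1) + eps)).mean()
    recall = (tp / (gt.sum(1) + eps)).mean()
    return float(2 * precision * recall / (precision + recall + eps))
```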

Conclusion:

The paper introduces EventPAR, a valuable resource for multi-modal PAR research, and proposes an effective RWKV-based framework (OTN-RWKV) for fusing RGB and event data. This approach successfully leverages the complementary strengths of both sensor types, improving PAR accuracy and robustness, especially in challenging conditions, and uniquely incorporates emotion recognition. Future work aims to explore learnable event representations.