- The paper demonstrates that saliency evaluation metrics exhibit distinct sensitivities to false positives and negatives.
- It reveals how metrics like sAUC compensate for center bias while metrics like KL and IG sharply penalize missed fixations.
- The study recommends aligning metric selection with specific tasks to ensure fair and effective assessments.
Saliency Model Evaluation: Insights and Recommendations
The evaluation of saliency models is a crucial aspect of computer vision research, particularly when predicting which regions of an image attract human attention. A paper by Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand, titled "What do different evaluation metrics tell us about saliency models?", provides a comprehensive analysis of the metrics used to evaluate saliency models. It systematically examines eight saliency evaluation metrics and their implications, offering a roadmap for more informed model selection and assessment.
Overview of Metrics
The paper explores the characteristics and behaviors of eight commonly used evaluation metrics:
- Area under ROC Curve (AUC)
- Shuffled AUC (sAUC)
- Normalized Scanpath Saliency (NSS)
- Similarity or Histogram Intersection (SIM)
- Pearson’s Correlation Coefficient (CC)
- Earth Mover’s Distance (EMD)
- Kullback-Leibler Divergence (KL)
- Information Gain (IG)
Through systematic experiments and visualizations, the authors characterize how these metrics respond to properties of the input such as false positives, false negatives, blur, and spatial deviations. This analysis helps explain why different metrics rank saliency models differently and indicates the evaluation contexts in which each metric is most suitable.
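To make the metric definitions concrete, here is a minimal sketch of two of the listed metrics, NSS and SIM, on toy arrays. It follows the standard formulas rather than the paper's reference implementation, and the toy inputs are purely illustrative.

```python
# Minimal sketch of two of the listed metrics on toy inputs; standard
# formulas, not the paper's reference code.
import numpy as np

def nss(saliency, fixation_mask):
    # Normalized Scanpath Saliency: mean of the z-scored saliency map
    # at fixated pixels; 0 is chance level, higher is better.
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s[fixation_mask.astype(bool)].mean()

def sim(saliency, fixation_density):
    # Similarity / histogram intersection: sum of pixel-wise minima of
    # the two maps, each normalized to sum to 1; 1 means identical.
    p = saliency / saliency.sum()
    q = fixation_density / fixation_density.sum()
    return np.minimum(p, q).sum()

rng = np.random.default_rng(0)
pred = rng.random((32, 32))    # toy saliency prediction
fix = np.zeros((32, 32))
fix[10:12, 14:16] = 1          # toy fixation mask
print(nss(pred, fix), sim(pred, fix))
```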
Key Findings and Implications
Sensitivity to False Positives and Negatives
The authors find that metrics such as KL, IG, and SIM are particularly sensitive to false negatives. This sensitivity arises because these metrics heavily penalize models that assign zero (or near-zero) saliency at locations where fixations actually occur, effectively interpreting a zero value as a claim that a fixation there is impossible. Metrics like AUC, by contrast, are driven primarily by high-valued predictions at fixated locations and are largely insensitive to low-valued false positives.
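A small numeric demonstration (not from the paper) makes this asymmetry visible: a near-zero prediction at a single fixated region dominates the KL score, while well-covered regions contribute almost nothing.

```python
# Toy demonstration of KL's sensitivity to false negatives; the
# four-region densities are made up for illustration.
import numpy as np

def kl_div(pred, gt, eps=1e-12):
    # KL(GT || pred): penalizes a predicted density that fails to
    # cover the ground-truth fixation density.
    p = pred / pred.sum()
    q = gt / gt.sum()
    return np.sum(q * np.log(eps + q / (p + eps)))

gt = np.array([0.25, 0.25, 0.25, 0.25])          # four equally fixated regions
covers_all = np.array([0.25, 0.25, 0.25, 0.25])
misses_one = np.array([0.33, 0.33, 0.33, 1e-9])  # near-zero at a fixated region

print(kl_div(covers_all, gt))   # ~0.0
print(kl_div(misses_one, gt))   # ~4.6: the single missed region dominates
```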
Center Bias Handling
How center bias is treated is another important consideration in saliency model evaluation. The sAUC metric is designed to penalize models that incorporate a strong center bias, ensuring a fair assessment when such bias is not expected in the model design. IG instead measures performance relative to a baseline center-bias model, explicitly accounting for systematic viewing biases. EMD stands apart by incorporating spatial distance into its score, favoring models that hedge their bets spatially to approximate ground-truth fixations more closely.
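As a sketch of the IG idea, the snippet below scores a prediction in bits per fixation relative to a center-bias baseline, following the usual information-gain definition; the Gaussian prior and toy fixations are assumptions for illustration, not the paper's setup.

```python
# Sketch of Information Gain over a center-bias baseline; the Gaussian
# prior and toy fixations are illustrative assumptions.
import numpy as np

def info_gain(pred, baseline, fixation_mask, eps=1e-12):
    # Average improvement (in bits per fixation) of the predicted
    # probability over the baseline probability at fixated pixels.
    p = pred / pred.sum()
    b = baseline / baseline.sum()
    f = fixation_mask.astype(bool)
    return np.mean(np.log2(eps + p[f]) - np.log2(eps + b[f]))

h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
center = np.exp(-((yy - h/2)**2 + (xx - w/2)**2) / (2 * (h/6)**2))
fix = np.zeros((h, w))
fix[30:34, 30:34] = 1                       # toy central fixations

rng = np.random.default_rng(1)
pred = center + 0.1 * rng.random((h, w))    # prediction close to the baseline
print(info_gain(pred, center, fix))         # near 0: little gain over center bias
```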
Metric Correlation and Model Ranking
A Spearman rank correlation analysis reveals that metrics like NSS, CC, AUC, EMD, and SIM form a similarity cluster, indicating high correlation in their rankings of saliency models. KL and IG diverge from this cluster due to their extreme sensitivity to zero-valued predictions at fixation locations. This divergence underscores the importance of understanding the underlying assumptions and properties of each metric when interpreting model performance.
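The analysis itself is straightforward to reproduce in outline: score a set of models under each metric, then rank-correlate the scores. The model scores below are invented purely to illustrate the procedure.

```python
# Illustrative rank-correlation comparison; the five model scores are
# fabricated for the example, only the procedure mirrors the paper's.
from scipy.stats import spearmanr

nss_scores = [2.34, 1.98, 1.75, 1.41, 1.10]  # five models under NSS
cc_scores  = [0.81, 0.74, 0.70, 0.58, 0.45]  # same models under CC
kl_scores  = [0.52, 1.90, 0.66, 0.95, 2.40]  # same models under KL (lower is better)

rho_cc, _ = spearmanr(nss_scores, cc_scores)
rho_kl, _ = spearmanr(nss_scores, [-k for k in kl_scores])  # negate: lower KL is better
print(rho_cc)   # 1.0: NSS and CC rank the models identically
print(rho_kl)   # 0.7: KL partially disagrees with the cluster
```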
Practical Recommendations
Given the diversity in metric behaviors, the paper offers several recommendations for designing new saliency benchmarks:
- Defining Expected Input: Clearly specify whether the saliency models should be probabilistic and how systematic dataset biases, such as center bias, will be handled. Models should be regularized appropriately to align with the evaluation metrics used.
- Handling Dataset Bias: Systematic dataset biases must either be modeled by the saliency models or be accounted for during evaluation. For instance, providing a training dataset lets modelers optimize parameters such as blur, scale, and center bias (see the sketch after this list).
- Task-Specific Evaluation: Tailor the choice of evaluation metrics to the end application. For detection tasks, AUC, KL, and IG are suitable because they directly reflect detection failures. For ranking the relative importance of image regions, NSS or SIM may be more appropriate.
- Metric Selection: Choose metrics whose assumptions match the model's definition of saliency. For general saliency predictions, metrics such as CC or NSS give a comprehensive and fair evaluation, while KL and IG are preferred when models are evaluated as probability distributions.
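A hedged sketch of the training-set recommendation referenced above: grid-search a blur width and a center-bias mixing weight that maximize mean NSS on training images. The helper names and toy data are hypothetical, not the paper's protocol.

```python
# Hypothetical sketch: tune blur and center-bias weight on training data.
import numpy as np
from scipy.ndimage import gaussian_filter

def nss(saliency, fixation_mask):
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s[fixation_mask.astype(bool)].mean()

def tune(pred_maps, fix_masks, center_prior, sigmas, weights):
    # Grid-search the (blur sigma, center-bias weight) pair that
    # maximizes mean NSS over the training images.
    best = (None, None, -np.inf)
    for sigma in sigmas:
        for w in weights:
            scores = [nss((1 - w) * gaussian_filter(p, sigma) + w * center_prior, f)
                      for p, f in zip(pred_maps, fix_masks)]
            if np.mean(scores) > best[2]:
                best = (sigma, w, np.mean(scores))
    return best

rng = np.random.default_rng(0)
h, w = 32, 32
yy, xx = np.mgrid[0:h, 0:w]
center = np.exp(-((yy - h/2)**2 + (xx - w/2)**2) / (2 * (h/5)**2))
preds = [rng.random((h, w)) for _ in range(4)]          # toy model outputs
fixes = []
for _ in range(4):
    f = np.zeros((h, w))
    f[rng.integers(10, 22), rng.integers(10, 22)] = 1   # one toy fixation each
    fixes.append(f)
print(tune(preds, fixes, center, sigmas=[0.5, 1, 2, 4], weights=[0.0, 0.3, 0.6]))
```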
Conclusion
The paper provides a critical examination of saliency model evaluation, elucidating the nuanced behaviors of different metrics and their implications. By offering clear recommendations, it aims to standardize the evaluation process and reduce inconsistencies across benchmarks. This analysis is invaluable for researchers looking to optimize saliency models and align their evaluation methods with specific applications or theoretical frameworks. As the field continues to evolve, adopting these recommendations could lead to more meaningful comparisons and advancements in saliency prediction research.