What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis

Published 3 Apr 2019 in cs.CV | (1904.01906v4)

Abstract: Many new proposals for scene text recognition (STR) models have been introduced in recent years. While each claim to have pushed the boundary of the technology, a holistic and fair comparison has been largely missing in the field due to the inconsistent choices of training and evaluation datasets. This paper addresses this difficulty with three major contributions. First, we examine the inconsistencies of training and evaluation datasets, and the performance gap results from inconsistencies. Second, we introduce a unified four-stage STR framework that most existing STR models fit into. Using this framework allows for the extensive evaluation of previously proposed STR modules and the discovery of previously unexplored module combinations. Third, we analyze the module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets. Such analyses clean up the hindrance on the current comparisons to understand the performance gain of the existing modules.

Authors (8)

Citations (441)

Summary

The paper diagnoses major benchmarking issues in scene text recognition by revealing how inconsistent dataset selections can alter performance by over 15%.
It introduces a unified STR framework that categorizes models into transformation, feature extraction, sequence modeling, and prediction stages for fair comparisons.
The analysis shows that attention-based prediction boosts accuracy while ResNet feature extraction balances performance with resource efficiency.

Overview of "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis"

The paper "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis" critically examines how scene text recognition (STR) models are compared. It exposes deficiencies in current benchmarking practice that stem from inconsistent training and evaluation datasets. The authors present three primary contributions: a diagnosis of dataset inconsistencies, a unified STR framework under which existing models are evaluated, and a detailed analysis of how each module contributes to STR performance.

Analysis of Dataset Inconsistencies

The authors investigate the inconsistent use of training and evaluation datasets across various STR studies. By evaluating key datasets like MJSynth, SynthText, and several real-world datasets (e.g., IIIT5K, SVT, IC03), they identify performance discrepancies linked to varying dataset selections. Notably, some benchmarks, such as IC13 and IC15, are evaluated using different subsets within the community, which significantly affects model performance comparison. The authors demonstrate performance variances that exceed 15% due to dataset selection alone, emphasizing the need for standardized datasets in STR benchmarking.
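To make the subset effect concrete, the following is a minimal sketch rather than the authors' evaluation code. The samples, the `alphanumeric_only` filter, and the function names are hypothetical illustrations; the point is only that the same predictions score differently once a filtering rule changes which images count.

```python
def word_accuracy(pairs):
    """Fraction of (prediction, label) pairs that match exactly."""
    if not pairs:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(pred == label for pred, label in pairs) / len(pairs)


def evaluate(samples, keep=lambda label: True):
    """Score only the samples whose label passes the `keep` filter."""
    return word_accuracy([(p, g) for p, g in samples if keep(g)])


def alphanumeric_only(label):
    """Hypothetical subset rule: drop labels containing other symbols."""
    return label.isalnum()


# Toy (prediction, label) pairs, invented for illustration only.
samples = [
    ("street", "street"),
    ("c0ffee", "coffee"),
    ("exit", "exit"),
    ("open", "open!"),
    ("bank", "bank"),
]

full_set = evaluate(samples)                   # 3 correct out of 5
subset = evaluate(samples, alphanumeric_only)  # 3 correct out of 4
print(f"full: {full_set:.2f}, filtered: {subset:.2f}")
```

Two papers reporting these two numbers would appear to differ by 15 points on this toy data even though the model is identical, which is why the subset used has to be reported alongside the score.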

Unified STR Framework

To remedy the issue of inconsistent evaluation, the authors propose a unified STR framework. This framework decomposes STR models into four stages: Transformation, Feature Extraction, Sequence Modeling, and Prediction. Mapping existing models onto these stages enables a comprehensive evaluation of module combinations, including ones not previously explored. This approach facilitates fair comparisons across models and shows how the module chosen at each stage affects overall performance.
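The four-stage decomposition can be sketched as a chain of interchangeable callables. This is a toy illustration under stated assumptions, not the authors' implementation: the `STRPipeline` class, the type aliases, and the stand-in stages are all hypothetical, and real stages would be neural network modules.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]     # toy stand-in for an image tensor (rows x columns)
Features = List[List[float]]  # toy stand-in for a sequence of feature vectors


@dataclass
class STRPipeline:
    """Four stages applied in order; any stage can be swapped independently."""
    transformation: Callable[[Image], Image]
    feature_extraction: Callable[[Image], Features]
    sequence_modeling: Callable[[Features], Features]
    prediction: Callable[[Features], str]

    def __call__(self, image: Image) -> str:
        normalized = self.transformation(image)
        features = self.feature_extraction(normalized)
        contextual = self.sequence_modeling(features)
        return self.prediction(contextual)


def identity(x):
    """Placeholder for a stage that is switched off."""
    return x


def columns_as_features(image: Image) -> Features:
    """Toy extractor: treat each image column as one feature vector."""
    return [list(column) for column in zip(*image)]


def argmax_decode(features: Features, alphabet: str = "abc") -> str:
    """Toy predictor: emit the highest-scoring character for each column."""
    return "".join(
        alphabet[max(range(len(vec)), key=vec.__getitem__)] for vec in features
    )


pipeline = STRPipeline(identity, columns_as_features, identity, argmax_decode)
toy_image = [
    [0.9, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.0, 0.1, 0.7],
]
print(pipeline(toy_image))
```

Because each stage only depends on the type it receives from the previous one, replacing a single stage leaves the rest of the pipeline untouched, which is what makes module-by-module comparison possible.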

Module-wise Contributions and Performance Analysis

Under this unified framework, the paper evaluates 24 STR module combinations. Each combination is assessed for accuracy, inference speed, and memory usage under one consistent set of training and evaluation datasets. The analysis identifies the Prediction module, particularly the use of the Attention mechanism, as critical for accuracy but costly in terms of speed and memory. The Feature Extraction module, especially the ResNet configuration, emerges as a pivotal factor in balancing accuracy with resource consumption.
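The count of 24 follows from taking one option per stage. The sketch below enumerates the grid; the option names are taken from the paper itself rather than from this summary (two transformation choices, three feature extractors, two sequence-modeling choices, two predictors), and the dictionary layout and function name are illustrative assumptions.

```python
from itertools import product

# Options per stage as listed in the paper; "None" means the stage is skipped.
STAGE_OPTIONS = {
    "transformation": ["None", "TPS"],
    "feature_extraction": ["VGG", "RCNN", "ResNet"],
    "sequence_modeling": ["None", "BiLSTM"],
    "prediction": ["CTC", "Attn"],
}


def enumerate_combinations(options):
    """Return every way of picking exactly one option per stage."""
    stage_names = list(options)
    choices = product(*(options[name] for name in stage_names))
    return [dict(zip(stage_names, choice)) for choice in choices]


combinations = enumerate_combinations(STAGE_OPTIONS)
print(len(combinations))  # 2 * 3 * 2 * 2 = 24
print("-".join(combinations[0].values()))
```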

The feature extractor and prediction components are identified as having major impacts on time and memory respectively, whereas the transformation and sequence modeling stages have a smaller effect on accuracy. This level of detail allows practitioners to choose modules based on their resource constraints and accuracy requirements.
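One way to act on such trade-offs is to keep only the combinations that no other combination beats on both accuracy and cost. The sketch below is an illustration under stated assumptions: the candidate names and numbers are invented, and `pareto_front` is a hypothetical helper, not a procedure described in the paper.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (higher accuracy, lower cost)."""
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"]
            and o["cost"] <= c["cost"]
            and (o["accuracy"] > c["accuracy"] or o["cost"] < c["cost"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["cost"])


# Invented numbers for illustration; cost could be milliseconds or parameters.
candidates = [
    {"name": "combo_a", "accuracy": 70.0, "cost": 2.0},
    {"name": "combo_b", "accuracy": 75.0, "cost": 5.0},
    {"name": "combo_c", "accuracy": 74.0, "cost": 6.0},   # worse than combo_b on both
    {"name": "combo_d", "accuracy": 80.0, "cost": 20.0},
]

front_names = [c["name"] for c in pareto_front(candidates)]
print(front_names)
```

Running the same filter once with inference time as the cost and once with memory as the cost gives two separate shortlists, which matters when the modules that dominate each cost differ.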

Implications and Future Directions

This paper provides critical insight into STR model evaluation practices and advocates for standardized benchmarking datasets. The implications are significant for both academia and industry, where a consistent evaluation methodology leads to more reliable performance assessments. The analysis of module contributions also shows researchers which modules to prioritize when improving accuracy or reducing computational overhead.

For future research, the remaining failure cases, such as complex fonts, occlusions, and vertical text, still need to be addressed. Balancing high accuracy with resource-efficient designs will continue to be a focal area. The potential of novel data augmentation techniques and of adaptive learning methods that adjust to varying dataset characteristics also warrants exploration.

Overall, the authors contribute a meticulous and comprehensive critique of current STR model evaluation methodologies and furnish a robust framework that can standardize future research in this domain.