The paper introduces a benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. It addresses the question of whether VLMs can replace traditional, domain-specific OCR systems. To this end, the authors present a new dataset and a comprehensive benchmarking analysis.
A novel dataset of 1,477 manually annotated video frames is introduced, covering domains such as code editors, news broadcasts, YouTube videos, and advertisements. This dataset is used to benchmark three VLMs (Claude-3 Sonnet, Gemini-1.5 Pro, and GPT-4o) alongside two traditional computer vision OCR systems, EasyOCR and RapidOCR. Performance is evaluated using Word Error Rate (WER), Character Error Rate (CER), and accuracy.
The paper is structured to provide a comprehensive overview of the topic. It includes:
- A review of related work in traditional OCR.
- An overview of the VLMs considered.
- A detailed description of the dataset creation and curation process.
- An explanation of the benchmarking methodology.
- A presentation of the evaluation results with visualizations.
- A summary of the findings, implications, and potential directions for future research.
The traditional approach to OCR is represented by RapidOCR and EasyOCR. RapidOCR is presented as a high-performance framework using ONNXRuntime, OpenVINO, and PaddlePaddle for fast inference across various platforms, supporting multilingual OCR tasks. EasyOCR is described as a lightweight toolkit employing a two-stage approach: text detection using the Character Region Awareness for Text Detection (CRAFT) algorithm and text recognition using a Convolutional Recurrent Neural Network (CRNN) with a Connectionist Temporal Classification (CTC) decoder.
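To make the traditional pipeline concrete, the sketch below runs a single extracted frame through EasyOCR, whose `readtext` call wraps the CRAFT detection and CRNN/CTC recognition stages described above; the file name `frame.png` is a placeholder.

```python
import easyocr

# Initialize an English reader; this loads the CRAFT detector and the
# CRNN + CTC recognizer that EasyOCR uses under the hood.
reader = easyocr.Reader(["en"], gpu=False)

# Detect and recognize text in one extracted video frame.
# Each result is a (bounding_box, text, confidence) tuple.
results = reader.readtext("frame.png")

for bbox, text, confidence in results:
    print(f"{confidence:.2f}\t{text}")
```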
The VLMs are described as integrating advancements in computer vision and natural language processing, potentially serving as universal solutions for tasks traditionally requiring separate models. The VLMs considered are:
- Claude-3 Sonnet, which focuses on intelligence and speed, integrating a visual encoder with a large language decoder.
- Gemini-1.5 Pro, which combines visual feature extraction with text generation in a multimodal transformer architecture, pre-trained on video-text datasets.
- GPT-4o, an evolution of the GPT-4 architecture, featuring an expanded context window and enhanced processing speed, extending its capabilities to multimodal tasks (see the request sketch after this list).
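For illustration only, the sketch below shows one way a frame could be sent to GPT-4o for text extraction using the OpenAI Python client; the prompt wording and the file name are assumptions, not the prompt used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode an extracted video frame as base64 (file name is a placeholder).
with open("frame.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# The instruction below is an illustrative choice, not the paper's prompt.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe all text visible in this image. Return only the text."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```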
The dataset creation leverages VideoDB for image extraction and organization, automating the process of creating datasets from videos. The custom dataset comprises 1,477 frames across domains including code editors, news channels, YouTube videos, advertisements, talk shows, online lectures, and traffic rules.
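The paper's extraction is handled by VideoDB; as a rough stand-in that shows the kind of frame sampling involved, here is a minimal OpenCV sketch. The one-frame-per-second rate and the file naming are assumptions, not the authors' pipeline.

```python
import cv2

def extract_frames(video_path: str, out_prefix: str = "frame", every_n_seconds: float = 1.0) -> int:
    """Save roughly one frame every `every_n_seconds` seconds; returns the count saved."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))

    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:05d}.png", frame)
            saved += 1
        index += 1

    cap.release()
    return saved

# Example: extract_frames("news_clip.mp4")  # file name is illustrative
```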
The evaluation metrics are defined as follows:
- Character Error Rate (CER): CER = (S + D + I) / N, where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the total number of characters in the ground truth.
- Word Error Rate (WER): similar to CER, but at the word level.
- Accuracy: the proportion of correctly recognized text, expressed as a percentage (a worked example of these metrics follows this list).
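As a worked example of these definitions, the self-contained sketch below computes the edit distance (substitutions + deletions + insertions) and derives CER and WER from it; the sample strings are invented.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(
                prev_row[j] + 1,             # deletion
                cur_row[j - 1] + 1,          # insertion
                prev_row[j - 1] + (r != h),  # substitution (0 cost on match)
            ))
        prev_row = cur_row
    return prev_row[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

ref = "BREAKING NEWS: Markets rally on rate cut"
hyp = "BREAKING NEWS Markets rally on rat cut"   # invented model output
print(f"CER = {cer(ref, hyp):.3f}, WER = {wer(ref, hyp):.3f}")
```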
The qualitative results compare model outputs with ground truth, analyzing differences in sentence structure, content preservation, and clarity, including character-level additions, substitutions, and omissions. The analysis indicates that all models struggle with interpreting occluded text.
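A character-level comparison of this kind can be illustrated with Python's difflib, which labels aligned spans as substitutions, omissions (deletions), or additions (insertions); the ground-truth and output strings below are invented.

```python
import difflib

ground_truth = "Speed limit 50 km/h on urban roads"
model_output = "Speed limit 60 km/h on urban road"   # invented model output

# get_opcodes() tags each aligned span as 'equal', 'replace' (substitution),
# 'delete' (omission), or 'insert' (addition).
matcher = difflib.SequenceMatcher(None, ground_truth, model_output)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag:7s} gt[{i1}:{i2}]={ground_truth[i1:i2]!r} -> out[{j1}:{j2}]={model_output[j1:j2]!r}")
```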
The quantitative results, as shown in Table 1 of the paper, indicate that GPT-4o achieves the highest overall accuracy (76.22%), while Gemini-1.5 Pro demonstrates the lowest word error rate (23.85%). RapidOCR and EasyOCR perform poorly, with higher error rates and lower accuracy. GPT-4o also performs strongly across all domains, with accuracy between 65% and 80%, and particularly excels in legal/educational content (approximately 84%). Gemini-1.5 Pro shows more variable performance, struggling with finance/business/news content (around 50% accuracy). The traditional OCR solutions consistently underperform the VLMs.
The authors note that VLMs are susceptible to content security policies that can prevent the generation of any output if the input content triggers security flags.
The paper concludes that VLMs outperform traditional computer vision models on dynamic video data, but further work is needed to improve their robustness against variations in video quality, font styles, and complex backgrounds. The authors suggest expanding the dataset, fine-tuning VLMs, and evaluating the effect of prompt variations to improve adaptability and performance. They also propose extending the research to other tasks typically performed by traditional computer vision models.