- The paper introduces a resampler-based token compression method that cuts visual tokens by 16x while preserving fine-grained perception for OCR and grounding.
- It employs a unified visual encoder, reinforced through LVLM co-training, to excel at tasks such as Chinese OCR and visual grounding.
- Experiments show leading results across OCR, document, chart, and grounding benchmarks among similarly sized models, validating the method's efficiency and scalability for real-world applications.
Insightful Overview: TextHawk2
Abstract and Motivation
The paper introduces TextHawk2, a bilingual Large Vision-Language Model (LVLM) that excels at Optical Character Recognition (OCR) and grounding tasks while cutting the number of tokens per image by a factor of sixteen relative to previous models. This reduction makes training and deployment cheaper and more efficient. TextHawk2 focuses on two primary research questions: how to enhance OCR performance with limited computational resources, and how to train an LVLM with a unified visual encoder capable of handling multimodal understanding, OCR, and grounding.
Technical Contributions
- Token Compression: A novel resampler compresses visual tokens by a factor of sixteen without compromising fine-grained perception. This considerable reduction in token count directly improves computational efficiency and scalability (see the resampler sketch after this list).
- Visual Encoder Reinforcement: TextHawk2 improves upon its predecessor by strengthening its visual encoder through LVLM co-training, enabling tasks such as Chinese OCR and grounding that challenged earlier models (see the co-training sketch below).
- Data Diversity: While keeping the pre-training dataset at 100 million samples, TextHawk2 draws on more diverse data sources to improve generalization. This diversification allows the model to outperform others of similar scale, as demonstrated across multiple benchmarks.
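To make the token-compression idea concrete, here is a minimal, illustrative sketch of a resampler that reduces a visual token sequence by a fixed ratio of sixteen. The pooling-then-cross-attention design, module names, and dimensions are assumptions for illustration, not TextHawk2's actual architecture.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Compress (batch, n, dim) visual tokens to (batch, n // ratio, dim)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8, ratio: int = 16):
        super().__init__()
        # Coarse queries come from average-pooling groups of `ratio` tokens.
        self.pool = nn.AvgPool1d(kernel_size=ratio, stride=ratio)
        # Cross-attention lets each compressed token gather fine-grained
        # detail back from the full-resolution token sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n, dim), with n divisible by `ratio`.
        queries = self.pool(tokens.transpose(1, 2)).transpose(1, 2)
        keys_values = self.norm_kv(tokens)
        out, _ = self.attn(self.norm_q(queries), keys_values, keys_values)
        return queries + out  # residual keeps the coarse content


# 1,024 patch tokens shrink to 64 tokens fed to the language model.
x = torch.randn(2, 1024, 1024)
assert TokenCompressor()(x).shape == (2, 64, 1024)
```

The co-training point can be sketched in the same spirit: rather than freezing the vision encoder, as many earlier LVLMs did, its parameters receive gradients from the LVLM objective, typically at a reduced learning rate. The parameter grouping and learning rates below are illustrative assumptions, not the paper's training recipe.

```python
def build_cotraining_optimizer(vision_encoder: nn.Module,
                               resampler: nn.Module,
                               language_model: nn.Module,
                               base_lr: float = 1e-4,
                               encoder_lr: float = 2e-5) -> torch.optim.AdamW:
    # Unfreeze the encoder so OCR/grounding losses reshape its features.
    for p in vision_encoder.parameters():
        p.requires_grad = True
    return torch.optim.AdamW([
        # A smaller learning rate protects pre-trained visual features.
        {"params": vision_encoder.parameters(), "lr": encoder_lr},
        {"params": resampler.parameters(), "lr": base_lr},
        {"params": language_model.parameters(), "lr": base_lr},
    ], weight_decay=0.05)
```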
Experiments and Results
TextHawk2 is assessed across several benchmarks. Key results include:
- 78.4% accuracy on OCRBench for text recognition.
- 81.4% accuracy on ChartQA.
- 89.6% ANLS on DocVQA.
- 88.1% [email protected] on RefCOCOg-test for visual grounding.
These results indicate leading performance in both OCR and grounding among models of comparable scale, showing that the aggressive token compression and unified visual encoder come at little cost to accuracy. A short sketch of how the two benchmark metrics above are computed follows.
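For readers unfamiliar with the quoted metrics, the sketch below shows how ANLS (Average Normalized Levenshtein Similarity, used by DocVQA) and the IoU test behind [email protected] (used by RefCOCOg) are typically computed. These are simplified reference implementations, not the official benchmark scoring scripts.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Normalized similarity, zeroed below the standard 0.5 threshold."""
    pred, gold = pred.strip().lower(), gold.strip().lower()
    if not pred and not gold:
        return 1.0
    score = 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
    return score if score >= threshold else 0.0

def iou(box_a, box_b) -> float:
    """IoU of two (x1, y1, x2, y2) boxes; [email protected] checks iou >= 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

print(anls("TextHawk2", "texthawk2"))       # 1.0: exact match after normalization
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143: below the 0.5 threshold, a miss
```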
Implications and Speculation
- Practical Implications: The token compression method allows efficient handling of high-resolution images, making TextHawk2 resource-efficient for real-world applications in document intelligence, GUI agents, and visual assistance technologies.
- Theoretical Implications: Achieving state-of-the-art results with a unified visual encoder points toward simpler LVLM architectures. The model challenges the prevalent use of separate encoders for different tasks, showcasing the benefits of a single, integrated encoder.
- Future Directions: The findings encourage further exploration of full-parameter pre-training and native-resolution visual encoders to amplify OCR and grounding capabilities. Additional research could integrate Reinforcement Learning from Human Feedback (RLHF) to mitigate hallucinations and improve robustness.
Conclusion
TextHawk2 delivers significant advances in OCR and grounding while remaining computationally efficient thanks to its token compression. The model sets a precedent for future work on improving LVLM efficiency and capability through unified encoder frameworks and diversified data curation. Addressing current limitations could involve refining data for scene-text recognition and deepening multimodal reasoning capabilities. Overall, the paper marks a strong step forward in vision-language integration.