
H2OVL-Mississippi Vision Language Models Technical Report (2410.13611v1)

Published 17 Oct 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Smaller vision-language models (VLMs) are becoming increasingly important for privacy-focused, on-device applications due to their ability to run efficiently on consumer hardware for processing enterprise commercial documents and images. These models require strong language understanding and visual capabilities to enhance human-machine interaction. To address this need, we present H2OVL-Mississippi, a pair of small VLMs trained on 37 million image-text pairs using 240 hours of compute on 8 x H100 GPUs. H2OVL-Mississippi-0.8B is a tiny model with 0.8 billion parameters that specializes in text recognition, achieving state-of-the-art performance on the Text Recognition portion of OCRBench and surpassing much larger models in this area. Additionally, we are releasing H2OVL-Mississippi-2B, a 2 billion parameter model for general use cases, exhibiting highly competitive metrics across various academic benchmarks. Both models build upon our prior work with the H2O-Danube language models, extending their capabilities into the visual domain. We release them under the Apache 2.0 license, making VLMs accessible to everyone, democratizing document AI and visual language models.

Summary

  • The paper demonstrates that H2OVL-Mississippi models achieve state-of-the-art OCR and document processing performance while utilizing resource-efficient architectures.
  • It details the integration of a Vision Transformer with an LLM through an MLP projector, employing dynamic resolution and multi-scale adaptive cropping, and describes a two-phase training process.
  • The report outlines future research directions in multilingual support, additional modality integration, and enhanced agent-based reasoning for broad application use.

Overview of "H2OVL-Mississippi Vision Language Models Technical Report"

The paper "H2OVL-Mississippi Vision Language Models Technical Report" introduces two small vision-language models (VLMs) designed for privacy-focused, on-device applications. These models, H2OVL-Mississippi-0.8B and H2OVL-Mississippi-2B, are trained to handle enterprise commercial documents and images efficiently, providing versatility for a range of document-centric and multi-modal tasks. The paper presents their design, training methodologies, and evaluation results across several benchmarks to illustrate their capabilities and performance.

Model Architecture and Training

The models pair a Vision Transformer (ViT) image encoder with a language model through a Multi-Layer Perceptron (MLP) projector, a design inspired by works such as LLaVA and InternVL. The architecture incorporates dynamic resolution strategies and multi-scale adaptive cropping to improve image-processing efficiency while keeping computational resource demands in check.
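
To make the dynamic-resolution and multi-scale cropping idea concrete, below is a minimal sketch of InternVL-style adaptive tiling: the input image is matched to the closest supported aspect ratio within a tile budget, split into fixed-size tiles, and a downscaled thumbnail is appended for global context. The tile size, tile budget, and thumbnail step are illustrative assumptions, not values taken from the report.

```python
from PIL import Image

def dynamic_tiles(image: Image.Image, tile: int = 448, max_tiles: int = 6):
    """Split an image into aspect-ratio-aware tiles plus a global thumbnail.

    A minimal sketch of InternVL-style dynamic resolution; the tile size,
    tile budget, and thumbnail step are illustrative, not the paper's values.
    """
    w, h = image.size
    aspect = w / h

    # Enumerate (cols, rows) grids within the tile budget and pick the one
    # whose aspect ratio is closest to the input image's.
    grids = [(c, r) for c in range(1, max_tiles + 1)
                    for r in range(1, max_tiles + 1)
             if c * r <= max_tiles]
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))

    # Resize so the image divides evenly into a (cols x rows) grid of tiles.
    resized = image.resize((cols * tile, rows * tile))
    tiles = [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
             for r in range(rows) for c in range(cols)]

    # Append a low-resolution thumbnail so the model keeps a global view.
    if len(tiles) > 1:
        tiles.append(image.resize((tile, tile)))
    return tiles
```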

The H2OVL-Mississippi-0.8B model specializes in Optical Character Recognition (OCR), with a particular focus on text recognition tasks. Its training proceeds in two phases: broad pre-training on diverse image-text data, followed by focused fine-tuning on OCR-centric tasks. In contrast, H2OVL-Mississippi-2B is designed for general use, balancing performance across a range of vision-language tasks and targeting document understanding at a larger scale.
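
As a rough illustration of such a two-phase recipe, the sketch below alternates which components are trainable across stages: a broad pre-training pass over general image-text pairs, then fine-tuning on task-specific (e.g., OCR-centric) data. The submodule names (vit, projector, llm), freezing choices, and learning rates are assumptions for illustration and are not prescribed by the report.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def run_stage(model, loader, lr: float, *, train_vit: bool, train_llm: bool):
    """One training stage; which blocks are unfrozen differs per stage.

    `model` is assumed to expose `vit`, `projector`, and `llm` submodules
    and to return an HF-style output with a `.loss`; these names are
    illustrative, not taken from the H2OVL codebase.
    """
    set_trainable(model.vit, train_vit)
    set_trainable(model.projector, True)  # assume the MLP connector trains in both stages
    set_trainable(model.llm, train_llm)
    opt = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for batch in loader:
        loss = model(**batch).loss  # next-token loss on the answer span
        loss.backward()
        opt.step()
        opt.zero_grad()

# Phase 1: broad pre-training on diverse image-text pairs.
# run_stage(model, pretrain_loader, lr=1e-3, train_vit=False, train_llm=False)
# Phase 2: focused fine-tuning on OCR-centric conversations.
# run_stage(model, sft_loader, lr=2e-5, train_vit=True, train_llm=True)
```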

Evaluation and Performance

The models were evaluated on a comprehensive set of benchmarks, including general vision-language tasks, OCR, and document-specific tasks. H2OVL-Mississippi-0.8B demonstrates exceptional competence in text recognition on OCRBench, achieving state-of-the-art scores despite its relatively small size. H2OVL-Mississippi-2B offers competitive results across benchmarks, particularly excelling in multi-modal reasoning and text-centric VQA tasks, outperforming many larger models.

Their performance on document-specific information extraction tasks further validates these capabilities, with top-tier results on receipts and other structured document analyses. Compared with existing state-of-the-art models, the H2OVL series offers a compelling balance between efficiency and accuracy, making it well suited to environments where resource constraints are a priority.

Implications and Future Directions

The introduction of these smaller VLMs has significant implications for the democratization of AI, enabling more accessible deployment for various applications, including mobile devices and edge computing. The release under the Apache 2.0 license emphasizes the commitment to open-source contributions, facilitating research and practical applications worldwide.
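
Because the weights are openly released, loading them should follow the standard Hugging Face pattern sketched below. The repository ID and the chat convenience method are assumptions based on similar InternVL-derived releases; consult the model card for the exact interface.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repository ID; confirm the exact name on the model card.
model_id = "h2oai/h2ovl-mississippi-2b"

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,  # the checkpoint ships custom modeling code
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# InternVL-style chat interface (assumed); the signature may differ per release.
# response = model.chat(tokenizer, image_file,
#                       "Extract the total from this receipt.",
#                       generation_config=dict(max_new_tokens=256))
```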

Future research directions outlined in the paper suggest enhancements in multilingual capabilities, integration of additional modalities such as audio and video, and scaling of model sizes to tackle more complex tasks. The authors also propose exploring agent-based tasks and improving fine-grained visual capabilities for a broader spectrum of applications.

In conclusion, the H2OVL-Mississippi models represent a substantial contribution to the VLM landscape, offering scalable and efficient solutions for multi-modal AI applications. Their open availability supports continued development and adoption across diverse sectors, marking a noteworthy step in vision-language integration.
