
ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Published 3 Mar 2025 in cs.CL, cs.AI, and cs.IR | (2503.01151v1)

Abstract: We present ReaderLM-v2, a compact 1.5 billion parameter LLM designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding LLMs. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.

Summary

Overview of ReaderLM-v2: A Small Language Model for HTML to Structured Formats

The paper presents ReaderLM-v2, a 1.5 billion parameter language model developed specifically for parsing and restructuring HTML content into Markdown and JSON. The model addresses the challenge of converting messy web data into clean, structured formats. Its key innovations are a three-stage data synthesis pipeline, termed "Draft-Refine-Critique", and a multi-objective training framework that together yield both efficiency and accuracy exceeding that of much larger models such as GPT-4o-2024-08-06.

Model Contributions

  1. Three-Stage Data Synthesis Pipeline:

    • Draft: Generates initial synthetic data by converting HTML documents into target formats.
    • Refine: Cleans and structures the drafted data, removing inconsistencies and ensuring adherence to format specifications.
    • Critique: Evaluates refined data for accuracy and structural integrity, iterating until high-quality outputs are achieved.
  2. Comprehensive Training Framework:

    • Combines continuous pre-training with multi-objective optimization techniques, including supervised fine-tuning, direct preference optimization (DPO), and iterative self-play tuning, all aimed at refining the model's output quality.
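The Draft-Refine-Critique pipeline can be sketched as an iterative loop. The following is an illustrative sketch only: the function bodies below are trivial placeholders, whereas in the paper each stage is performed by LLM calls over real web documents.

```python
# Illustrative sketch of a Draft-Refine-Critique loop. The stage functions
# are placeholder stand-ins for the paper's LLM-driven stages.

def draft(html: str) -> str:
    # Draft: first-pass conversion of HTML into the target format.
    return html.replace("<h1>", "# ").replace("</h1>", "\n")

def refine(markdown: str) -> str:
    # Refine: clean and restructure the draft, dropping empty lines.
    return "\n".join(line.rstrip() for line in markdown.splitlines() if line.strip())

def critique(markdown: str) -> bool:
    # Critique: accept only output free of leftover HTML tags.
    return "<" not in markdown

def synthesize(html: str, max_rounds: int = 3) -> str:
    # Iterate refine/critique until the critique passes or rounds run out.
    candidate = draft(html)
    for _ in range(max_rounds):
        candidate = refine(candidate)
        if critique(candidate):
            break
    return candidate

print(synthesize("<h1>Title</h1>"))  # prints "# Title"
```

The value of the loop is that only samples passing the critique stage enter the training set, which is how the pipeline keeps synthetic data quality high.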

Evaluation and Results

ReaderLM-v2 demonstrates substantial improvements in structured content extraction compared to other models. Specifically, it outperforms GPT-4o-2024-08-06 and other larger models by approximately 15-20% on curated benchmarks, with the gains most pronounced on documents exceeding 100K tokens. Key metrics in this evaluation include ROUGE-L, Levenshtein distance, and Jaro-Winkler similarity. The results underline that despite its smaller size, ReaderLM-v2 matches or exceeds the performance of proprietary models, a result attributable to its data synthesis pipeline and training strategy.
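Of the metrics named above, Levenshtein distance is the simplest to state precisely: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming implementation (not taken from the paper) looks like this:

```python
# Levenshtein edit distance via the standard two-row dynamic program.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

print(levenshtein("markdown", "markup"))  # → 4
```

In an HTML-to-Markdown evaluation, a lower distance between model output and the reference conversion indicates a more faithful extraction; ROUGE-L and Jaro-Winkler similarity complement it with overlap- and prefix-weighted views of the same comparison.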

Implications and Future Directions

This research exemplifies the potential of highly specialized language models to efficiently address complex, large-context data processing tasks. The success of ReaderLM-v2 suggests that smaller models, when rigorously optimized, can deliver competitive accuracy without the computational overhead associated with larger models.

Looking forward, advancements could focus on expanding ReaderLM-v2's capabilities to other structured data formats and modalities while continuing to ensure computational efficiency. Furthermore, the model's public availability on Hugging Face opens avenues for further exploration and integration in various applications requiring structured data extraction from web content, particularly in industries reliant on document processing and knowledge management.
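Since the model is distributed on Hugging Face, it can in principle be run with the transformers library. The sketch below is hypothetical: the model id "jinaai/ReaderLM-v2" and the plain-English conversion instruction are assumptions, so the model card should be consulted for the exact identifier and prompt format.

```python
# Hypothetical usage sketch for ReaderLM-v2 via Hugging Face transformers.
# The model id and instruction wording are assumptions; check the model card.

def build_messages(html: str) -> list:
    # A chat-style request asking the model to convert HTML to Markdown.
    return [{
        "role": "user",
        "content": f"Extract the main content of the following HTML as Markdown:\n{html}",
    }]

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "jinaai/ReaderLM-v2"  # assumed Hugging Face id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer.apply_chat_template(
        build_messages("<h1>Hello</h1><p>World</p>"),
        return_tensors="pt",
        add_generation_prompt=True,
    )
    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Keeping the prompt construction separate from model loading makes the conversion request easy to adapt if the published checkpoint expects a different instruction format.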

The introduction of ReaderLM-v2 represents an important contribution to the field of natural language processing and structured data extraction, showcasing the effective use of compact models in achieving high-performance results with efficient resource utilization.
