DocReward: Evaluating Document Style & Layout
- DocReward is a document reward model designed to evaluate visual professionalism by focusing on layout, formatting, and stylistic polish independent of text.
- It is trained on a dataset of 117K document pairs spanning 32 domains and 267 document types, using a multimodal model with a pairwise preference loss.
- The model achieves superior human alignment with up to 89% accuracy, significantly outperforming previous baselines in document layout evaluation.
DocReward is a document reward model specifically designed to evaluate and guide the generation of professional documents by quantifying the quality of their visual structure and style, independent of textual content. Unlike prior approaches that primarily target textual correctness or fluency, DocReward fills a critical gap by providing an explicit and granular reward signal for document layout, formatting, and stylistic polish, enabling agentic generation workflows to produce outputs that not only convey correct information but also exhibit superior readability and visual professionalism (Liu et al., 13 Oct 2025).
1. Objective and Motivation
DocReward targets the assessment of “professionalism,” focusing on document structure (layout, alignment, organization) and style (typography, spacing, consistency), rather than semantic or textual accuracy. While LLMs and workflow agents have made substantial progress in modeling and generating high-quality text, these systems have historically overlooked the visual presentation of documents—a component essential for engagement, usability, and human preference, especially in technical, governmental, and professional domains. DocReward’s textual-quality-agnostic nature allows it to robustly evaluate and rank documents that are textually identical but visually distinct, providing actionable signals for improving non-textual qualities during automatic document synthesis.
2. Dataset Construction and Annotation
To support both the training and evaluation of DocReward, the DocPair dataset was created, characterized by:
- 117,000 document pairs, each pair consisting of two documents containing exactly the same textual content but differing in structure and style.
- Coverage of 32 domains and 267 document types, encompassing technical reports, government forms, academic papers, business communications, and more.
- Paired documents derived from both real human-authored exemplars (with high professionalism) and synthesized outputs generated by agentic workflows using models such as GPT-4o, GPT-5, and LLM-based refinement chains.
- Supervised annotation selecting a “winner” (higher professionalism) and “loser” (lower professionalism) in each pair, with annotators permitted to judge only on differences in visual structure and style.
This construction ensures that learned reward signals are insensitive to textual content and maximize discriminative power along visual dimensions.
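A minimal sketch of what one DocPair training record might look like. The field names and example values below are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DocPairRecord:
    """One training example: two renderings of identical text.

    Field names are hypothetical; the released dataset may differ.
    """
    domain: str              # one of the 32 domains, e.g. "academic"
    doc_type: str            # one of the 267 types, e.g. "technical report"
    winner_pages: List[str]  # page images of the more professional rendering
    loser_pages: List[str]   # page images of the less professional rendering

# Example record (paths are placeholders):
record = DocPairRecord(
    domain="academic",
    doc_type="technical report",
    winner_pages=["report_human/page_1.png", "report_human/page_2.png"],
    loser_pages=["report_synth/page_1.png", "report_synth/page_2.png"],
)
```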
3. Model Architecture and Training Paradigm
DocReward is based on Qwen-2.5-VL, a multimodal model that processes images end to end. Each document is rendered into a sequence of page images (D_img) and fed to the model; a regression head appended to the backbone outputs a single scalar professionalism score per document.
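A schematic of this scoring path in PyTorch. The `backbone` here stands in for a Qwen-2.5-VL-style vision-language encoder; its interface and the class below are assumptions for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class DocRewardModel(nn.Module):
    """A VLM backbone plus a scalar regression head.

    `backbone` is assumed to map a batch of page-image sequences to a
    pooled hidden state of size `hidden_dim`; the exact interface of the
    real model may differ.
    """
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone
        self.reward_head = nn.Linear(hidden_dim, 1)  # scalar professionalism score

    def forward(self, page_images: torch.Tensor) -> torch.Tensor:
        # page_images: (batch, num_pages, channels, height, width)
        pooled = self.backbone(page_images)           # (batch, hidden_dim)
        return self.reward_head(pooled).squeeze(-1)   # (batch,)
```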
The model is trained on DocPair using the Bradley-Terry pairwise preference loss. Let $D_w$ denote the higher-ranked (“winner”) and $D_l$ the lower-ranked (“loser”) document images. The training objective is:

$$\mathcal{L} = -\log \sigma\big(r_\theta(D_w) - r_\theta(D_l)\big)$$

where $r_\theta(\cdot)$ is the predicted score and $\sigma$ is the sigmoid. This loss drives the score of the structurally and stylistically superior document above that of the visually inferior one, directly penalizing rank reversals. During evaluation, DocReward supports both pairwise and pointwise assessment.
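The pairwise objective is straightforward to express in code; a minimal PyTorch sketch, illustrative rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_winner: torch.Tensor,
                       score_loser: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(D_w) - r(D_l)), averaged over the batch.

    logsigmoid is used instead of log(sigmoid(.)) for numerical stability.
    """
    return -F.logsigmoid(score_winner - score_loser).mean()

# Usage with the model sketched above (both scores have shape (batch,)):
# loss = bradley_terry_loss(model(winner_pages), model(loser_pages))
```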
4. Performance Evaluation
Evaluation measures the agreement of DocReward’s scoring with human annotator rankings over bundles of documents, including challenging scenarios where all documents in a bundle share the same text. The principal metric is human preference accuracy: the proportion of pairs where the model’s ranking matches the annotators’.
Key results:
| Setting | DocReward-7B (%) | GPT-5 (%) | GPT-4o (%) |
|---|---|---|---|
| Pointwise (All) | 89.22 | 69.77 | 58.62 |
| Real vs. Synth | 97.42 | — | — |
| Synth vs. Synth | — | — | — |
| Extrinsic Win Rate | 60.8 | 37.7 | — |
- In pointwise human alignment, DocReward-7B exceeds GPT-5 by 19.4 percentage points and GPT-4o by 30.6 points.
- In extrinsic model selection tasks (where DocReward is used to pick the most professional candidate from among generations), its win rate is 60.8% compared to GPT-5’s 37.7%.
- Performance is particularly notable in “Real vs. Synth” scenarios, where DocReward almost perfectly aligns with human judgments.
The training objective, in the paper’s notation, is

$$\min_{\theta}\; \mathbb{E}_{(D_w,\, D_l)}\!\left[-\log \sigma\big(r_\theta(D_w) - r_\theta(D_l)\big)\right]$$

subject to the constraint that $D_w$ and $D_l$ carry identical textual content, ensuring a style- and structure-only assessment.
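Human preference accuracy, the headline metric above, reduces to a simple count. A sketch under the assumption that model scores for each annotated pair are available, with the human-preferred document's score listed first:

```python
from typing import Sequence, Tuple

def preference_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the model scores the human-preferred
    document (first element of each tuple) strictly higher."""
    correct = sum(1 for preferred, other in pairs if preferred > other)
    return correct / len(pairs)

# Example: the model agrees with annotators on two of three pairs.
print(preference_accuracy([(0.9, 0.2), (0.4, 0.7), (0.8, 0.1)]))  # 0.666...
```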
5. Applications and Impact
DocReward’s domain- and content-agnostic design enables integration into diverse agentic document generation pipelines:
- Document self-refinement: Generation agents can synthesize multiple versions of a document and use DocReward for selection, ensuring outputs meet human standards of professionalism in structure and style (a minimal selection sketch appears at the end of this section).
- Automated documentation and reporting: Technical, governmental, academic, and business document systems can leverage DocReward during post-processing or as an auxiliary reward for reinforcement learning agents.
- Large-scale evaluation: Benchmarking and improving generative models’ layout and typographic polish becomes possible without labor-intensive human review.
This addresses an acute need in production systems, where user trust, information readability, and professional aesthetics are essential but hard to guarantee with text-centric LLM metrics.
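As a concrete instance of the self-refinement use case above, a best-of-n selection step might look like the following. The `score_document` callable stands in for a call to a trained DocReward model and is hypothetical:

```python
from typing import Callable, List

def select_most_professional(candidates: List[str],
                             score_document: Callable[[str], float]) -> str:
    """Pick the candidate document (e.g. a rendered file path) with the
    highest DocReward score: simple best-of-n reranking."""
    return max(candidates, key=score_document)

# Usage sketch: an agent generates n stylistic variants of the same text,
# renders each to page images, scores them, and keeps the winner.
# best = select_most_professional(["v1.docx", "v2.docx", "v3.docx"],
#                                 score_document)
```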
6. Limitations and Future Directions
DocReward’s development highlights several research vectors:
- Dataset expansion to include additional document types and more granular stylistic variations may further increase reward model robustness and generalization.
- Enhancement with additional signals, such as OCR-based content checkers or advanced spatial/layout encoders, could further disentangle structure from surface appearance.
- Multi-dimensional reward assessment—jointly evaluating text quality as a modular component alongside style and structure—could extend applicability, provided modular independence is preserved.
- Transfer of the modeling paradigm into adjacent domains (web design, graphics) where visual structure is paramount offers a plausible trajectory.
This suggests that modular reward modeling, with fine-grained control over which document modality is optimized, will become central to the next generation of automated content creation and assessment systems.
7. Conclusion
DocReward establishes a rigorous methodology and empirical foundation for document professionalism reward modeling, decisively outperforming strong prior baselines (GPT-5, GPT-4o) in both intrinsic and extrinsic evaluations. Leveraging a large, carefully annotated multi-domain dataset and a vision-language architecture trained with pairwise preference learning, DocReward enables generation agents to produce outputs exhibiting human-preferred structural and stylistic qualities, a dimension previously neglected in document automation workflows. This advancement paves the way for research and practical systems in which not only the semantic but also the visual integrity of documents is optimized according to explicit, learned human standards (Liu et al., 13 Oct 2025).