- The paper presents a multimodal, hierarchy-aware pipeline that fuses OCR, layout parsing, and LLM-based sectioning for chunking long documents.
- It reports retrieval and QA gains over chunking baselines across diverse industrial datasets, including 8–15% higher retrieval precision on DUDE and MPVQA (with comparable nDCG gains) and a 2–3% ANLS uplift.
- The method’s robust integration of visual and textual data addresses challenges like context fragmentation and OCR-induced noise in complex documents.
MultiDocFusion: Hierarchical and Multimodal Chunking for Enhanced RAG on Long Industrial Documents
Motivation and Context
Processing complex, long-form industrial documents in RAG-based QA systems presents unique challenges. Existing chunking strategies (fixed-length or shallow semantic splits) are fundamentally text-centric and do not account for the intricate structural and visual layouts inherent to real-world industrial documents, such as scanned PDFs, multi-page reports, and visually dense financial statements. Additionally, OCR artifacts introduce noise and misalignments that exacerbate information fragmentation, further hampering downstream retrieval and QA fidelity. The problem is particularly acute when semantic continuity and hierarchical context are critical for robust QA, as in multi-page regulatory, legal, and technical documents.
MultiDocFusion Pipeline Architecture
MultiDocFusion introduces a comprehensive multimodal chunking pipeline that integrates visual layout segmentation and explicit document hierarchy analysis to mitigate context fragmentation and semantic loss. The pipeline comprises four sequential modules:
- Document Parsing (DP): Leveraging vision-based object detection models (e.g., DETR, DiT, VGT), DP extracts spatially coherent regions (titles, section headers, text blocks, tables, figures) from each page, generating structured metadata with bounding boxes and segment types.
- Optical Character Recognition (OCR): Various OCR systems (e.g., EasyOCR, Tesseract, TrOCR) extract text from DP-identified regions, associating recognized text with corresponding structural metadata, forming an Annotated Layout.
- Document Section Hierarchical Parsing with LLM (DSHP-LLM): Instruction-tuned LLMs (e.g., fine-tuned Mistral-8B) reconstruct explicit section hierarchies by inferring parent-child relationships among candidate headers from OCR outputs. The model integrates LoRA-based PEFT for parameter efficiency, outputting JSON-structured hierarchical trees. General nodes (tables, figures, text blocks) are attached as children based on spatial order, providing precise logical structuring.
- DFS-based Grouping: A depth-first search algorithm traverses the hierarchical tree, aggregating parent and child content into hierarchically coherent chunks, splitting at a maximum token threshold. Chunks are explicitly marked by Markdown headers corresponding to tree depth, aligning semantic structure and visual context.
This pipeline systematically fuses visual layout, textual content, and structural hierarchy, generating multi-granular, semantically faithful chunks tailored for downstream RAG-based retrieval and QA tasks.
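The DFS-based grouping step can be illustrated with a minimal sketch. Everything named here (`Node`, `count_tokens`, `dfs_chunks`) is hypothetical and not taken from the paper; the whitespace token counter stands in for a real tokenizer, and repeating the ancestor headers at the top of each chunk is only one plausible reading of "aggregating parent and child content".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the section tree: a header or a general block."""
    kind: str                      # "root", "header", "text", "table", or "figure"
    text: str
    children: list["Node"] = field(default_factory=list)


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for a real tokenizer.
    return len(text.split())


def dfs_chunks(root: Node, max_tokens: int = 50) -> list[str]:
    """Walk the tree depth-first and emit Markdown chunks under a token budget."""
    chunks: list[str] = []

    def visit(node: Node, path: list[str]) -> None:
        if node.kind == "header":
            path = path + [node.text]
        # One Markdown header line per ancestor, with '#' count equal to depth.
        heading = "\n".join("#" * (i + 1) + " " + t for i, t in enumerate(path))
        body: list[str] = []
        used = count_tokens(heading)

        def flush() -> None:
            nonlocal body, used
            if body:
                chunks.append("\n".join(([heading] if heading else []) + body))
            body, used = [], count_tokens(heading)

        for child in node.children:
            if child.kind == "header":
                flush()                 # close the current chunk before descending
                visit(child, path)
            else:
                n = count_tokens(child.text)
                if body and used + n > max_tokens:
                    flush()             # budget exceeded: start a new chunk
                body.append(child.text)
                used += n
        flush()

    visit(root, [])
    return chunks


doc = Node("root", "", [
    Node("header", "1 Safety", [
        Node("text", "Valves must be inspected monthly."),
        Node("header", "1.1 Pumps", [Node("table", "pump | rating")]),
    ]),
])
chunks = dfs_chunks(doc)
```

In this sketch a single block larger than the budget is still emitted as its own chunk rather than being cut mid-block; a production chunker would need an explicit policy for that case.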
Experimental Evaluation
MultiDocFusion was extensively validated across diverse datasets representing the spectrum of industrial and academic document complexity: DocHieNet (mixed industrial/reports), HRDH (academic papers), DUDE (financial/manuals), MPVQA (multi-page VQA), CUAD (legal contracts), and MOAMOB (industrial/nuclear technical documents). The pipeline was benchmarked against fixed-length, semantic, LLM-based, and structure-based chunking baselines.
Fine-tuned DSHP-LLM models (especially Mistral-8B) significantly outperform general-purpose LLMs (e.g., GPT-4) in section hierarchy parsing tasks. On DocHieNet, DSHP-LLM yields TEDS 0.8230 (+16.71% over Mistral-8B baseline), and on HRDH, TEDS 0.9199 (+52.25% over Mistral-8B baseline). These results point to the necessity of dataset-specific instruction tuning for robust hierarchical parsing.
MultiDocFusion achieves strong retrieval gains: across DUDE and MPVQA, it improves retrieval precision by 8–15% over all baselines, with comparable nDCG gains. On legal contracts (CUAD), it yields the highest precision (0.8651) and nDCG (0.8819), surpassing even specialized chunkers. On MOAMOB, an extreme-case nuclear technical corpus, MultiDocFusion outperforms all baselines in Recall (0.6758), Precision (0.6184), and nDCG (0.6554).
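For readers unfamiliar with the retrieval metrics quoted above, the following is a minimal sketch of precision, recall, and nDCG at a cutoff k. It assumes binary relevance labels per chunk; the cutoff and relevance scheme used in the paper are not stated in this summary, and the function names and chunk IDs are illustrative.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in ranked[:k] if c in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    return sum(1 for c in ranked[:k] if c in relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain normalised by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, c in enumerate(ranked[:k]) if c in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical retrieval result: chunk IDs in ranked order.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
p = precision_at_k(ranked, relevant, k=3)
r = recall_at_k(ranked, relevant, k=3)
n = ndcg_at_k(ranked, relevant, k=3)
```

nDCG rewards placing relevant chunks early: here the single hit sits at rank 2, so it scores lower than it would at rank 1 even though precision is unchanged.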
In QA evaluation (ANLS, ROUGE-L, METEOR), MultiDocFusion consistently leads across all datasets, with a 2–3% ANLS uplift over baselines, a meaningful effect given task complexity and answer requirements.
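ANLS (Average Normalized Levenshtein Similarity) scores each predicted answer by its edit distance to the closest reference answer and zeroes out predictions that are too far off. The sketch below follows the commonly used definition with a threshold of 0.5 and case-insensitive matching; whether the paper uses exactly these settings is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any reference."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Similarity counts only when the normalised distance is below tau.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


# Hypothetical answers: an exact match, a near miss, and an unrelated answer.
score = anls(["Reactor A", "12 bar", "unknown"],
             [["reactor a"], ["15 bar"], ["15 bar"]])
```

The threshold is what makes ANLS tolerant of small OCR-style character errors while still giving no credit to answers that are simply wrong.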
Robustness Analysis
MultiDocFusion sustains performance under variations in DP, OCR, and embedding models, consistently delivering the highest average nDCG. It is robust to OCR degradation (e.g., TrOCR noise), DP model variation, and embedding model changes, underscoring practical reliability for deployment in heterogeneous industrial environments.
Implications and Future Directions
Practical Impact
The empirical improvements affirm that hierarchy-aware, multimodal chunking is critical for high-fidelity retrieval and QA on long, complex industrial documents in RAG pipelines. By explicitly integrating both visual and structural signals, MultiDocFusion minimizes context breakage and achieves more faithful answer generation than text-only approaches.
Theoretical Considerations
MultiDocFusion induces an explicit document graph with typed hierarchical relations, enabling eventual extensions to graph-based retrieval (GraphRAG). While not directly evaluated in this work, such a formulation is theoretically advantageous for multi-hop and compositional reasoning tasks requiring traversal and aggregation across document structure.
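As a rough illustration of what a document graph with typed hierarchical relations might look like, the sketch below stores edges as (source, relation, target) triples and walks them in both directions. The relation names `has_section` and `has_block`, the node IDs, and the helper functions are all hypothetical; the paper does not evaluate graph-based retrieval, so this is only one way such an extension could start.

```python
from collections import defaultdict

# Hypothetical typed edges: (source, relation, target).
edges = [
    ("doc", "has_section", "sec1"),
    ("sec1", "has_section", "sec1.1"),
    ("sec1", "has_section", "sec1.2"),
    ("sec1.1", "has_block", "table3"),
]

HIERARCHICAL = {"has_section", "has_block"}


def build_index(edges):
    """Index hierarchical edges as parent and children lookups."""
    parent, children = {}, defaultdict(list)
    for src, rel, dst in edges:
        if rel in HIERARCHICAL:
            parent[dst] = src
            children[src].append(dst)
    return parent, children


def ancestors(node, parent):
    """Walk upward from a node to the root, nearest ancestor first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def subtree(node, children):
    """Collect a node and all of its descendants in depth-first order."""
    out = [node]
    for child in children.get(node, []):
        out.extend(subtree(child, children))
    return out


parent, children = build_index(edges)
context = ancestors("table3", parent)
section = subtree("sec1", children)
```

Upward traversal recovers the section context of a retrieved block, while downward traversal gathers everything under a section, the two moves a multi-hop query over document structure would likely combine.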
Limitations and Open Problems
- DSHP-LLM's visual grounding is limited; fine-grained layout features (font size, color, column structure) are not utilized. Enhanced multimodal encoders or VLM backbones could further improve hierarchy inference in visually complex documents.
- The pipeline is serial and modular, and therefore susceptible to error propagation; end-to-end multimodal models may improve cross-modal consistency but would entail controllability/interpretability trade-offs.
- Hierarchical chunking increases retrieval/storage overhead due to context duplication across parent/child nodes. Efficient chunk caching and budget-aware pruning remain necessary for large-scale deployment.
Ethical Considerations
Applications targeting sensitive industrial documents necessitate rigorous privacy and security compliance. As retrieval-augmented answers may propagate OCR or hierarchy-induced errors, deployment models must include mechanisms for accountability, bias suppression, and transparent provenance.
Conclusion
MultiDocFusion demonstrates that explicit modeling of document hierarchy and visual layout improves both retrieval and QA in RAG pipelines for long, information-dense industrial documents (2604.12352). The integration of vision-based parsing, fine-tuned hierarchical LLMs, and DFS-based chunking creates a structurally faithful, semantically coherent document representation that outperforms the text-only and shallow-structure baselines evaluated. Future work points toward multimodal visual grounding, graph-structured retrieval, and end-to-end hierarchical document processing for more robust, efficient, and contextually aware AI systems.