Evaluation of Embedding-Based and Generative Methods for LLM-Driven Document Classification: Opportunities and Challenges

Published 5 Apr 2026 in cs.IR, cs.AI, cs.CL, cs.CV, and cs.LG | (2604.04997v1)

Abstract: This work presents a comparative analysis of embedding-based and generative models for classifying geoscience technical documents. Using a multi-disciplinary benchmark dataset, we evaluated the trade-offs between model accuracy, stability, and computational cost. We find that generative Vision-LLMs (VLMs) like Qwen2.5-VL, enhanced with Chain-of-Thought (CoT) prompting, achieve superior zero-shot accuracy (82%) compared to state-of-the-art multimodal embedding models like QQMM (63%). We also demonstrate that while supervised fine-tuning (SFT) can improve VLM performance, it is sensitive to training data imbalance.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper shows that generative VLMs using advanced CoT prompting achieve 82% accuracy, outperforming embedding methods by nearly 20 points.
It demonstrates that tailored prompt engineering significantly boosts performance in both embedding-based and generative models.
Findings reveal that while generative models offer higher accuracy, they require more computational resources and exhibit nondeterminism compared to lightweight embedding approaches.

Comparative Evaluation of Embedding-Based and Generative VLMs for LLM-Driven Document Classification

Introduction

This paper presents a thorough empirical comparison between embedding-based and generative VLM paradigms for classifying technical geoscience documents, addressing a domain featuring substantial multimodality, suboptimal OCR, and document heterogeneity. The study encapsulates benchmarking on a proprietary, multi-class geoscience corpus using both established and state-of-the-art models, and systematically analyzes prompt design, fine-tuning effects, accuracy, stability, and resource implications. The analysis is substantiated with robust metrics, and it provides concise recommendations for deploying LLM-driven classification solutions in operational information management settings.

Experimental Methodology

Dataset and Benchmark Task

The dataset consists of technical documents classified into eight geoscience-centric categories (e.g., Geology, Petrophysics, Petroleum Engineering). To normalize model input, only the first page from multi-page documents is used, capturing the signal-rich initial context and canonical layouts. Source formats are diverse, including OCR-impaired scanned images and native PDFs, enforcing a realistic assessment of multimodal machine perception performance in industrial repositories.

Models and Workflows

Two primary model paradigms are evaluated:

Embedding-based Methods: These generate vector representations for both documents and class labels, leveraging domain-specific class definition prompts for enhanced embedding quality. Classification is realized through similarity voting based on cosine distance.
Generative VLMs: Vision-LLMs directly output class labels in response to document-image plus prompt input. Prompt engineering ranges from simple persona-based cues to sophisticated CoT formats tailored for geoscience semantics and workflow.

Additionally, the effect of SFT (Supervised Fine-Tuning) is studied on Qwen2.5-VL-7B, probing the trade-off between distributional data scarcity and overfitting risks.

Standard performance metrics include overall accuracy, macro F1-score, and clustering separability/compactness measures (e.g., intra/inter-cluster distance, silhouette score).

Main Results

Classification Accuracy and Prompt Sensitivity

The head-to-head evaluation decisively indicates that generative VLMs, particularly Qwen2.5-VL series, achieve higher zero-shot classification accuracy and F1 than their embedding-based counterparts on this realistic document classification task. Specifically, Qwen2.5-VL-72B with an advanced CoT "plus" prompt yields 82% accuracy and macro F1, outperforming the best embedding baseline (QQMM) by nearly 20 points.

Figure 1: Classification performance for both embedding and VLMs under various configurations. VLMs generally outperform embedding models, especially after fine-tuned with sufficient training samples. Dots are individual models' results.

Prompt engineering is unequivocally effective. Class definition prompting boosts QQMM's F1 from 0.55 to 0.64; CoT-style prompting uplifts Qwen2.5-VL-7B's F1 by 10 points. These gains underscore the insufficiency of vanilla zero-shot interaction and highlight the importance of incorporating domain knowledge through prompt design.

Fine-Tuning and Data Imbalance

SFT on the VLM substantially improves classification for head classes (F1/accuracy of 0.93 for classes with >150 samples); however, it introduces pronounced degradation on tail classes (<100 samples), underscoring the risk of overfitting and the persistent challenge of imbalanced industrial datasets. Thus, domain adaptation is promising but nontrivial; rigorous data curation remains a bottleneck.

Performance/Resource Trade-offs

Generative VLMs realize superior end-to-end performance but at the cost of significant inference resource overhead and occasional nondeterminism, complicating deployment at operational scale where reproducibility and throughput are critical. In contrast, embedding-based models deliver deterministic, lightweight inference suitable for batch scenarios and resource-constrained environments, albeit with lower ultimate accuracy.

Theoretical and Practical Implications

The empirical results highlight that contemporary generative VLMs possess a substantial advantage for complex, multimodal industrial document triage tasks, primarily through deeper image-text layout integration and contextualized prompt-conditioned reasoning.

Practically, engineering robust prompts or templates is as vital as model selection, offering substantial gains with minimal computational burden. Nevertheless, the clear sensitivity of SFT to class imbalance raises unresolved issues regarding robust generalization and the need for future research on data augmentation, re-weighting, and curriculum training techniques.

From a system design standpoint, a hybrid architecture leveraging lightweight embedding models for large-scale, resource-sensitive pre-filtering—followed by VLM reranking or "hard" instance validation—could offer pragmatic balance between accuracy and cost.

Future Directions

Data Imbalance Mitigation: Algorithms for low-shot/long-tail fine-tuning need to be developed to stabilize SFT performance across class distributions.
Prompt Automation: Methods for automated, dataset-specific prompt synthesis could further lower user burden and enhance reproducibility.
Resource-Efficient VLMs: Research in distillation or model compression for VLMs may make their deployment more attractive for production-scale applications.

Conclusion

This paper presents robust evidence that generative VLMs, especially those enhanced with CoT prompting, decisively outperform embedding-based approaches for multimodal document classification in real-world, OCR-noisy, visual/textual hybrid archives. However, these benefits are counterbalanced by resource and determinism limitations. Prompt engineering materially improves both paradigms. SFT enables state-of-the-art results under balanced data, but head-tail disparity persists. Results point to best practices for deployment and future research toward scalable, robust document understanding systems in industrial AI (2604.04997).

Markdown Report Issue