DocSLM: Compact Vision–Language Model
- DocSLM is a compact vision–language model that uses aggressive compression and entropy-driven streaming to enable long-document understanding under tight memory and compute constraints.
- It employs a Hierarchical Multimodal Compressor to fuse visual, textual, and layout features into a fixed-length embedding, ensuring efficient per-page processing.
- DocSLM achieves competitive accuracy with reduced latency and GPU memory usage, making it ideal for edge deployments and resource-constrained environments.
DocSLM is a compact vision–language model (VLM) specifically engineered for long multimodal document understanding under stringent memory and compute constraints. Its core innovation lies in aggressive compression strategies and an entropy-driven streaming approach, enabling reliable performance on multi-page, visually complex documents with a small parameter footprint. DocSLM achieves comparable or superior accuracy to large-scale models on standard benchmarks while offering dramatically reduced latency, parameter count, and GPU memory usage, making it suitable for deployment on edge devices (Hannan et al., 14 Nov 2025).
1. Model Architecture
DocSLM comprises approximately 2 billion parameters and follows a modular workflow optimized for memory efficiency. At inference, a document is processed sequentially page-by-page. Each page is encoded via the Hierarchical Multimodal Compressor (HMC), yielding a unified, fixed-length embedding of 576 tokens per page. The resulting page embeddings are grouped into contiguous segments, each covering several pages. The document is then processed segment-by-segment by a small language model (SLM) that produces a segment-level answer distribution along with an uncertainty score. Segment predictions are filtered and combined into a single answer by a Streaming Abstention mechanism, ensuring constant memory utilization irrespective of document length. A minimal sketch of this end-to-end flow is given below.
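The control flow can be summarized in a few lines of Python. This is a minimal sketch under assumed interfaces, not the released implementation: `encode_page`, `slm_answer`, and `pages_per_segment` are hypothetical names standing in for the HMC, the SLM decoding step, and the segmentation granularity described above.

```python
# Minimal sketch of DocSLM's streaming inference loop (hypothetical helper names;
# the real encoders and SLM are the paper's components, not shown here).
from typing import Callable, List, Tuple

def answer_document(
    pages: List["PageImage"],          # raw page images of the document
    query: str,
    encode_page: Callable,             # HMC: page -> fixed-length (576-token) embedding
    slm_answer: Callable,              # SLM: (segment, query) -> (answer text, entropy)
    pages_per_segment: int = 4,        # illustrative segmentation granularity
) -> str:
    """Process a document page-by-page, then segment-by-segment,
    keeping only one segment's activations alive at a time."""
    # 1) Compress every page into a fixed-length embedding.
    page_embeddings = [encode_page(p) for p in pages]

    # 2) Group page embeddings into contiguous segments.
    segments = [
        page_embeddings[i : i + pages_per_segment]
        for i in range(0, len(page_embeddings), pages_per_segment)
    ]

    # 3) Score each segment independently; keep only (answer, entropy) pairs.
    candidates: List[Tuple[str, float]] = []
    for segment in segments:
        answer, entropy = slm_answer(segment, query)   # activations freed after this call
        if answer != "Not Answerable":                 # abstention filter
            candidates.append((answer, entropy))

    # 4) Final answer = lowest-entropy non-abstaining candidate.
    if not candidates:
        return "Not Answerable"
    return min(candidates, key=lambda c: c[1])[0]
```

Because each segment is scored independently and only its (answer, entropy) pair is retained, nothing outside the current segment's embeddings needs to stay resident, which is what bounds memory for arbitrarily long documents.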
2. Hierarchical Multimodal Compressor (HMC)
HMC employs a two-stage cross-modal fusion to integrate visual, textual, and layout signals:
- Spatial Preprocessing: Each page image is partitioned into an $n \times n$ grid of patches $\{P_i\}$, augmented with a global downsampled image $G$ that preserves page-level context.
- Feature Extraction:
- A vision encoder $f_v$ extracts patch and global features, $v_i = f_v(P_i)$ and $v_g = f_v(G)$.
- OCR yields word–box pairs $\{(w_k, b_k)\}$, and the word tokens are embedded as $t_k = f_t(w_k)$.
- Local Text–Vision Fusion: For each patch $P_i$, the OCR tokens spatially aligned with it via intersection-over-union thresholding ($\mathrm{IoU}(b_k, P_i) > \tau$) are aggregated as $z_i = \mathrm{CA}\big(v_i, \{t_k : \mathrm{IoU}(b_k, P_i) > \tau\}\big)$, where $\mathrm{CA}$ is a cross-attention module.
- Global Compression: Cross-attention is applied between the global visual feature and the compressed local patch features, $z_g = \mathrm{CA}(v_g, \{z_i\})$. The page embedding is formed by concatenation, $E_p = [z_g;\, z_1;\, \dots;\, z_{n^2}]$, yielding a fixed-length page representation of 576 tokens.
This scheme jointly preserves fine-grained local semantics and global document context while enforcing a fixed per-page token budget.
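The two-stage fusion can be illustrated with a simplified PyTorch module. This is a schematic sketch under assumed interfaces, not the paper's code: the class name `HMCSketch`, the feature dimension, the IoU threshold default, and the use of `nn.MultiheadAttention` as the cross-attention block are illustrative choices, and the sketch does not reproduce the exact 576-token budget (which depends on the grid size and compression ratio).

```python
# Schematic HMC fusion in PyTorch (a simplified sketch, not the paper's code).
import torch
import torch.nn as nn

class HMCSketch(nn.Module):
    """Two-stage cross-modal fusion: (1) per-patch text-to-vision fusion over OCR
    tokens whose boxes overlap the patch, (2) global compression by cross-attending
    the global image feature over the fused patch features."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.local_ca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_ca = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    @staticmethod
    def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
        """IoU between two [x1, y1, x2, y2] boxes."""
        x1 = torch.maximum(box_a[0], box_b[0]); y1 = torch.maximum(box_a[1], box_b[1])
        x2 = torch.minimum(box_a[2], box_b[2]); y2 = torch.minimum(box_a[3], box_b[3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-6)

    def forward(self, patch_feats, patch_boxes, global_feat, text_feats, text_boxes, tau=0.1):
        # patch_feats: [P, dim], global_feat: [1, dim], text_feats: [K, dim]
        fused = []
        for i in range(patch_feats.shape[0]):
            # Select OCR tokens spatially aligned with patch i (IoU > tau).
            keep = [k for k in range(text_feats.shape[0])
                    if self.iou(text_boxes[k], patch_boxes[i]) > tau]
            if keep:
                kv = text_feats[keep].unsqueeze(0)        # [1, K_i, dim]
                q = patch_feats[i].view(1, 1, -1)         # [1, 1, dim]
                z_i, _ = self.local_ca(q, kv, kv)         # local text->vision fusion
                fused.append(z_i.squeeze(0))
            else:
                fused.append(patch_feats[i].unsqueeze(0)) # no aligned text for this patch
        z_local = torch.cat(fused, dim=0)                 # [P, dim]

        # Global compression: the global feature attends over the fused patch features.
        z_g, _ = self.global_ca(global_feat.unsqueeze(0),
                                z_local.unsqueeze(0), z_local.unsqueeze(0))
        # Page embedding: global token concatenated with local tokens (fixed length).
        return torch.cat([z_g.squeeze(0), z_local], dim=0)  # [P + 1, dim]
```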
3. Streaming Abstention Mechanism
DocSLM enables scaling to arbitrary document length using sequential streaming and uncertainty-based answer selection:
- Segmentation: The set of compressed page embeddings $\{E_p\}$ is partitioned into contiguous segments $S_1, \dots, S_M$.
- Segmentwise Processing: Each segment $S_j$ is independently processed by the SLM, which decodes a candidate answer $y_j$ under the distribution $p_\theta(y \mid S_j, q)$ for the query $q$.
- Uncertainty Estimation: The average token-level entropy of the segment's generated answer, $H_j = -\tfrac{1}{T_j} \sum_{t=1}^{T_j} \sum_{v} p_\theta(v \mid y_{<t}, S_j, q)\, \log p_\theta(v \mid y_{<t}, S_j, q)$, serves as the abstention score for segment $S_j$.
- Supervision of Abstention: During fine-tuning, segments that do not contain the answer receive "Not Answerable" labels, calibrating the model to abstain reliably.
- Aggregation: Candidates from non-abstaining segments (low $H_j$) are pooled, and the final answer is the lowest-entropy candidate: $\hat{y} = y_{j^*}$ with $j^* = \arg\min_j H_j$, the minimum taken over the retained segments.
- Memory Efficiency: After each segment's forward pass, activations are discarded, preventing accumulation of key–value caches and guaranteeing constant memory regardless of document length. A minimal sketch of the scoring and aggregation logic is given after this list.
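The sketch below assumes the SLM exposes per-token logits for each segment's generated answer; the function names and data layout are illustrative rather than the paper's API.

```python
# Sketch of entropy-based uncertainty scoring and answer aggregation.
# `logits_per_step` would come from the SLM's decoding of a segment;
# all names here are illustrative, not the paper's API.
import math
from typing import List, Tuple

def mean_token_entropy(logits_per_step: List[List[float]]) -> float:
    """Average token-level entropy over the generated answer tokens."""
    total = 0.0
    for logits in logits_per_step:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]     # numerically stable softmax
        z = sum(exps)
        probs = [e / z for e in exps]
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / max(len(logits_per_step), 1)

def aggregate_segments(
    segment_outputs: List[Tuple[str, List[List[float]]]],  # (answer text, per-token logits)
) -> str:
    """Drop abstaining segments, return the lowest-entropy remaining answer."""
    scored = [
        (answer, mean_token_entropy(logits))
        for answer, logits in segment_outputs
        if answer != "Not Answerable"                       # abstention filter
    ]
    if not scored:
        return "Not Answerable"
    return min(scored, key=lambda s: s[1])[0]
```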
4. Efficiency Metrics
DocSLM is quantitatively benchmarked against state-of-the-art LVLMs, with substantial resource reductions:
| Metric | DocSLM | Docopilot-2B | InternVL2-8B / DocVLM-7B |
|---|---|---|---|
| Visual tokens/page | 576 | 3,133 | 1,088–3,133 |
| Parameter count | ≈2B | 2B | 8B |
| Latency (ms/sample) | 32.1 | – | 113.4 (InternVL2+RAG) |
| Peak GPU memory | ≈14 GB (plateaus beyond ~10 pages) | Grows linearly with pages | Grows linearly with pages |
- DocSLM uses 82% fewer visual tokens per page than comparably sized models.
- The parameter count is reduced by 75% relative to large LVLMs (2B vs. 8B).
- Latency is reduced by 71.7% compared to InternVL2+RAG (32.1 ms vs. 113.4 ms).
- Peak GPU memory remains nearly constant (≈14 GB) as document length increases, unlike large models whose requirements scale linearly with input size (Hannan et al., 14 Nov 2025); a short arithmetic check of these reductions is given below.
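The percentage reductions quoted above follow directly from the raw figures in the table; the snippet below assumes only those reported numbers.

```python
# Quick check that the reported reductions follow from the raw figures in the table.
docslm_tokens, docopilot_tokens = 576, 3_133
docslm_params, lvlm_params = 2, 8                     # billions of parameters
docslm_latency, rag_latency = 32.1, 113.4             # ms per sample

print(f"token reduction:   {1 - docslm_tokens / docopilot_tokens:.1%}")  # ~81.6% -> "82% fewer"
print(f"param reduction:   {1 - docslm_params / lvlm_params:.1%}")       # 75.0%
print(f"latency reduction: {1 - docslm_latency / rag_latency:.1%}")      # ~71.7%
print(f"speedup:           {rag_latency / docslm_latency:.1f}x")         # ~3.5x
```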
5. Benchmark Performance
Performance is evaluated on multiple long-document and multimodal benchmarks, summarized below:
| Dataset | Metric | DocSLM | Docopilot-2B | InternVL2-8B | DocVLM-7B | DocOwl2-8B | LayTokenLLM-8B |
|---|---|---|---|---|---|---|---|
| MMLongDoc | Accuracy (%) | 22.7 | 21.8 | 17.4 | – | 13.4 | – |
| MP-DocVQA | ANLS | 70.0 | 76.2* | 79.3 | – | – | – |
| DUDE | ANLS | 47.6 | – | – | 47.4 | – | 52.0 |
| NewsVideoQA | ANLS | 66.2 | – | – | – | – | – |
*Note: Docopilot-2B requires 3,133 tokens per image.
DocSLM matches or exceeds the accuracy of models with several-fold larger token or parameter budgets on key tasks: it outperforms Docopilot-2B by +0.9 percentage points (pp) on MMLongDoc, DocOwl2-8B by +9.3 pp under comparable token constraints, and sets the state of the art among small models on NewsVideoQA. Processing is also 3.5× faster than retrieval-augmented (RAG) baselines under identical inference conditions.
6. Deployment Considerations
DocSLM's design supports practical deployment in constrained compute environments:
- The fixed 576-token-per-page encoding and streaming segmentation constrain GPU memory usage to ≈14 GB for documents 50–120 pages long, supporting edge-GPU deployment.
- Releasing activations after segment inference avoids key–value cache growth, enabling constant memory use irrespective of document length (a rough measurement sketch is given after this list).
- The SLM backbone (2B parameters) can run on most edge or high-end mobile GPUs due to the aggressive reduction in page and model size.
- These characteristics collectively facilitate robust long-document multimodal question answering and reasoning capabilities on resource-constrained platforms without significant degradation in performance compared to substantially larger models (Hannan et al., 14 Nov 2025).
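One way to validate the constant-memory claim in practice is to record peak GPU allocation after each segment's forward pass. The sketch below is a rough measurement harness under assumed interfaces (`model` and `segments` are placeholders, and a CUDA device is required); it is not part of the DocSLM release.

```python
# Rough sketch: verify that peak GPU memory stays flat across segments by
# processing one segment at a time, discarding per-segment tensors, and
# reading the peak allocation after each forward pass.
import torch

def peak_memory_per_segment(model, segments, query_ids):
    assert torch.cuda.is_available(), "measurement requires a CUDA device"
    peaks = []
    for segment in segments:
        torch.cuda.reset_peak_memory_stats()
        with torch.inference_mode():
            out = model(segment, query_ids)            # placeholder: one segment forward pass
        del out                                        # discard activations / KV cache
        torch.cuda.empty_cache()
        peaks.append(torch.cuda.max_memory_allocated() / 1e9)  # GB
    return peaks                                       # should stay roughly flat
```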
7. Technical Significance and Implications
DocSLM demonstrates that with hierarchical multimodal token compression and entropy-based segmental abstention, it is possible to achieve long-document vision–language understanding with large-model-level accuracy at a fraction of standard token, parameter, and memory budgets. This suggests a shift in the feasible deployment scenarios for multimodal models, notably enabling real-time and on-device processing of extended complex documents for practical applications, such as autonomous document analysis and mobile knowledge retrieval, in environments previously limited by hardware resources (Hannan et al., 14 Nov 2025).