DocSLM: Compact Vision–Language Model

Updated 17 November 2025
  • DocSLM is a compact vision–language model that uses aggressive compression and entropy-driven streaming to enable long-document understanding under tight memory and compute constraints.
  • It employs a Hierarchical Multimodal Compressor to fuse visual, textual, and layout features into a fixed-length embedding, ensuring efficient per-page processing.
  • DocSLM achieves competitive accuracy with reduced latency and GPU memory usage, making it ideal for edge deployments and resource-constrained environments.

DocSLM is a compact vision–language model (VLM) specifically engineered for long multimodal document understanding under stringent memory and compute constraints. Its core innovation lies in aggressive compression strategies and an entropy-driven streaming approach, enabling reliable performance on multi-page, visually complex documents with a small parameter footprint. DocSLM achieves comparable or superior accuracy to large-scale models on standard benchmarks while offering dramatically reduced latency, parameter count, and GPU memory usage, making it suitable for deployment on edge devices (Hannan et al., 14 Nov 2025).

1. Model Architecture

DocSLM comprises approximately 2 billion parameters and follows a modular workflow optimized for memory efficiency. At inference, a document $\mathcal{D} = \{d^1,\dots,d^N\}$ is processed sequentially page by page. Each page $d^n$ is encoded via the Hierarchical Multimodal Compressor (HMC), yielding a unified, fixed-length embedding of 576 tokens per page. The resulting page embeddings are aggregated into segments $\{s_t\}$, each covering several pages. The document is then processed segment by segment by a small language model (SLM) that produces a segment-level answer distribution along with an uncertainty score. Segment predictions are filtered and combined into a single answer by a Streaming Abstention mechanism, ensuring constant memory utilization irrespective of document length.
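
A minimal sketch of this streaming loop is shown below; `encode_page`, `answer_segment`, the segment size, and the entropy threshold are illustrative placeholders rather than the paper's actual interfaces.

```python
# Minimal sketch of DocSLM's page-by-page streaming inference
# (hypothetical helper names; not the paper's actual interfaces).
from typing import Callable, List, Tuple

def docslm_infer(
    pages: List[object],                      # raw page images plus OCR
    encode_page: Callable[[object], object],  # HMC: page -> 576-token embedding
    answer_segment: Callable[[List[object], str], Tuple[str, float]],  # SLM: (answer, entropy)
    query: str,
    pages_per_segment: int = 4,
    entropy_threshold: float = 1.0,
) -> str:
    """Stream a long document through the SLM with constant memory."""
    best_answer, best_entropy = "Not Answerable", float("inf")
    segment: List[object] = []
    for page in pages:
        segment.append(encode_page(page))     # fixed 576 tokens per page
        if len(segment) == pages_per_segment:
            answer, entropy = answer_segment(segment, query)
            if entropy < entropy_threshold and entropy < best_entropy:
                best_answer, best_entropy = answer, entropy  # keep lowest-entropy candidate
            segment = []                      # drop activations: memory stays constant
    if segment:                               # trailing partial segment
        answer, entropy = answer_segment(segment, query)
        if entropy < entropy_threshold and entropy < best_entropy:
            best_answer, best_entropy = answer, entropy
    return best_answer
```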

2. Hierarchical Multimodal Compressor (HMC)

HMC employs a two-stage cross-modal fusion to integrate visual, textual, and layout signals:

  • Spatial Preprocessing: Each page image $d^n$ is partitioned into an $R \times C$ grid of patches $\{d^n_{i,j}\}$, augmented with a global downsampled image $d^n_g$.
  • Feature Extraction:
    • A vision encoder $E_v$ extracts patch and global features: $\mathbf{V}^n_{i,j} = E_v(d^n_{i,j})$, $\mathbf{V}^n_g = E_v(d^n_g)$.
    • OCR yields word–box pairs $\{(s^n_k, b^n_k)\}$, and word tokens are embedded by a text encoder $E_t$: $\mathbf{t}^n_k = E_t(s^n_k)$.
  • Local Text–Vision Fusion: For each patch, OCR tokens spatially aligned via intersection-over-union thresholding ($\mathrm{IoU}(b^n_k, \mathrm{bbox}(d^n_{i,j})) > \tau$) are aggregated:

$$\tilde{\mathbf{V}}^n_{i,j} = \mathbf{V}^n_{i,j} + \mathrm{CA}(\mathbf{V}^n_{i,j}, \mathbf{T}^n_{i,j})$$

where $\mathrm{CA}$ is a cross-attention module and $\mathbf{T}^n_{i,j}$ stacks the embedded OCR tokens assigned to patch $(i,j)$.

  • Global Compression: Cross-attention is applied between the global visual feature and compressed local patches:

$$\hat{\mathbf{V}}^n_{g,i,j} = \mathbf{V}^n_{g,i,j} + \mathrm{CA}(\mathbf{V}^n_{g,i,j}, \tilde{\mathbf{V}}^n_{i,j})$$

The page embedding is obtained by concatenating the compressed patch features:

$$\hat{\mathbf{V}}^n_{g} = \mathsf{Concat}\left(\{\hat{\mathbf{V}}^n_{g,i,j}\}_{i=1..R,\; j=1..C}\right)$$

yielding a fixed length of $R \times C = 576$ tokens per page.

This scheme jointly preserves fine-grained local semantics and global document context while enforcing a fixed per-page token budget.
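
The fusion stages can be illustrated with standard PyTorch modules; the grid size (24 × 24 = 576), feature dimension, and `nn.MultiheadAttention` stand-ins below are assumptions for illustration, and the batched global attention is a simplification of the per-patch cross-attention in the equations above.

```python
# Illustrative HMC fusion for one page with PyTorch stand-ins
# (grid size, dimensions, and attention modules are assumptions).
import torch
import torch.nn as nn

R, C, D = 24, 24, 256          # 24 x 24 = 576 patch positions, feature dim 256
attn_local = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
attn_global = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def fuse_page(patch_feats, global_feats, word_feats, word_boxes, patch_boxes, tau=0.5):
    """patch_feats, global_feats: (R*C, D); word_feats: (K, D); boxes: (x1, y1, x2, y2)."""
    fused = patch_feats.clone()
    for idx in range(R * C):
        # Local text-vision fusion: attend over OCR tokens whose boxes overlap this patch.
        keep = [k for k, b in enumerate(word_boxes) if iou(b, patch_boxes[idx]) > tau]
        if keep:
            q = patch_feats[idx:idx + 1].unsqueeze(0)   # (1, 1, D)
            kv = word_feats[keep].unsqueeze(0)          # (1, |keep|, D)
            fused[idx] = patch_feats[idx] + attn_local(q, kv, kv)[0].squeeze()
    # Global compression: global features attend over the fused local patches
    # (batched here for brevity; the paper's formulation is per patch).
    q, kv = global_feats.unsqueeze(0), fused.unsqueeze(0)
    page_tokens = global_feats + attn_global(q, kv, kv)[0].squeeze(0)
    return page_tokens                                   # (576, D): fixed per-page budget
```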

3. Streaming Abstention Mechanism

DocSLM enables scaling to arbitrary document length $N$ using sequential streaming and uncertainty-based answer selection:

  • Segmentation: The set of compressed page embeddings $\{\hat{\mathbf{V}}^n_g\}$ is partitioned into $T$ contiguous segments $\{s_t\}$.
  • Segmentwise Processing: Each segment is independently scored by the SLM, producing a distribution $p_t(w_{1:K} \mid s_t, S)$ for the query $S$.
  • Uncertainty Estimation: Average token-level entropy per segment,

$$u_t = -\frac{1}{K}\sum_{k=1}^{K}\sum_{w} p_t(w \mid w_{<k}, s_t, S)\,\log p_t(w \mid w_{<k}, s_t, S),$$

serves as the abstention score.

  • Supervision of Abstention: Segments lacking an answer during fine-tuning receive "Not Answerable" labels, calibrating $u_t$ for reliable abstention.
  • Aggregation: Segment predictions that do not abstain (low $u_t$) are pooled, and the final answer $\hat{y}$ is chosen as the lowest-entropy candidate (see the sketch after this list):

$$\hat{y} = \arg\min_{p_t \in \mathcal{P}_{\mathrm{valid}}} u_t.$$

  • Memory Efficiency: After each segment's forward pass, activations are discarded, preventing accumulation of key–value caches and guaranteeing constant memory regardless of $N$.
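
A minimal Python sketch of the entropy score and abstention rule follows; it assumes the SLM exposes per-step token probability distributions, and all names (`segment_entropy`, `aggregate`, the threshold) are illustrative rather than the paper's interfaces.

```python
# Sketch of the segment-level entropy score u_t and the abstention rule,
# assuming access to the SLM's per-step token distributions (illustrative names).
import math
from typing import Dict, List, Optional, Tuple

def segment_entropy(step_distributions: List[Dict[str, float]]) -> float:
    """Average token-level entropy over the K decoded steps of one segment."""
    total = 0.0
    for dist in step_distributions:            # dist approximates p_t(w | w_<k, s_t, S)
        total += -sum(p * math.log(p) for p in dist.values() if p > 0)
    return total / max(len(step_distributions), 1)

def aggregate(segments: List[Tuple[str, List[Dict[str, float]]]],
              abstain_threshold: float) -> str:
    """Discard abstaining segments and return the lowest-entropy answer."""
    best: Optional[Tuple[float, str]] = None
    for answer, dists in segments:
        u_t = segment_entropy(dists)
        if answer == "Not Answerable" or u_t > abstain_threshold:
            continue                            # segment abstains
        if best is None or u_t < best[0]:
            best = (u_t, answer)
    return best[1] if best else "Not Answerable"

# Toy example: the second segment is more confident (lower entropy) and wins.
seg_a = ("2019", [{"2019": 0.6, "2020": 0.4}])
seg_b = ("2021", [{"2021": 0.95, "2020": 0.05}])
print(aggregate([seg_a, seg_b], abstain_threshold=0.7))   # -> 2021
```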

4. Efficiency Metrics

DocSLM is quantitatively benchmarked against state-of-the-art LVLMs, with substantial resource reductions:

| Metric | DocSLM | Docopilot-2B | InternVL2-8B / DocVLM-7B |
|---|---|---|---|
| Visual tokens per page | 576 | 3,133 | 1,088–3,133 |
| Parameter count | ≈2B | 2B | 8B |
| Latency (ms/sample) | 32.1 | – | 113.4 (InternVL2+RAG) |
| Peak GPU memory vs. document length | ~14 GB plateau (beyond ~10 pages) | Linear growth | Linear growth |

  • DocSLM uses 82% fewer visual tokens per page than comparably sized models.
  • The parameter count is reduced by 75% relative to large LVLMs (2B vs. 8B).
  • Latency is reduced by 71.7% compared to InternVL2+RAG (32.1 ms vs. 113.4 ms).
  • Peak GPU memory remains nearly constant (≈14 GB) as document length increases, unlike large models whose requirements scale linearly with input size (Hannan et al., 14 Nov 2025).
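
The quoted reductions follow directly from the table values:

$$\frac{3133 - 576}{3133} \approx 81.6\%, \qquad \frac{8\,\mathrm{B} - 2\,\mathrm{B}}{8\,\mathrm{B}} = 75\%, \qquad \frac{113.4 - 32.1}{113.4} \approx 71.7\%$$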

5. Benchmark Performance

Performance is evaluated on multiple long-document and multimodal benchmarks, summarized below:

| Dataset | Metric | DocSLM | Docopilot-2B | InternVL2-8B | DocVLM-7B | DocOwl2-8B | LayTokenLLM-8B |
|---|---|---|---|---|---|---|---|
| MMLongDoc | Accuracy (%) | 22.7 | 21.8 | 17.4 | – | 13.4 | – |
| MP-DocVQA | ANLS | 70.0 | 76.2* | 79.3 | – | – | – |
| DUDE | ANLS | 47.6 | 47.4 | 52.0 | – | – | – |
| NewsVideoQA | ANLS | 66.2 | – | – | – | – | – |

*Note: Docopilot-2B requires 3,133 tokens per image.

DocSLM matches or exceeds the accuracy of models with severalfold larger token or parameter budgets on key tasks: it outperforms Docopilot-2B by +0.9 percentage points (pp) on MMLongDoc, beats DocOwl2-8B by +9.3 pp under comparable token constraints, and achieves state-of-the-art results among small models on NewsVideoQA. Processing is also 3.5× faster than retrieval-augmented generation (RAG) pipelines under identical inference conditions.

6. Deployment Considerations

DocSLM's design supports practical deployment in constrained compute environments:

  • The fixed 576-token-per-page encoding and streaming segmentation constrain GPU memory usage to ≈14 GB for documents 50–120 pages long, supporting edge-GPU deployment.
  • Releasing activations after segment inference avoids key–value cache growth, enabling constant memory use irrespective of document length.
  • The SLM backbone (2B parameters) can run on most edge or high-end mobile GPUs due to the aggressive reduction in page and model size.
  • These characteristics collectively facilitate robust long-document multimodal question answering and reasoning capabilities on resource-constrained platforms without significant degradation in performance compared to substantially larger models (Hannan et al., 14 Nov 2025).

7. Technical Significance and Implications

DocSLM demonstrates that with hierarchical multimodal token compression and entropy-based segmental abstention, it is possible to achieve long-document vision–language understanding with large-model-level accuracy at a fraction of standard token, parameter, and memory budgets. This suggests a shift in the feasible deployment scenarios for multimodal models, notably enabling real-time and on-device processing of extended complex documents for practical applications, such as autonomous document analysis and mobile knowledge retrieval, in environments previously limited by hardware resources (Hannan et al., 14 Nov 2025).

References

  1. Hannan et al., "DocSLM," 14 Nov 2025.
