
Doc2Token Technique: Token-Level Optimization

Updated 5 January 2026
  • Doc2Token names two distinct methods for token-level optimization: one for multimodal document compression and one for e-commerce retrieval.
  • In multimodal compression, it uses correlation-guided sampling to retain key image patch tokens while cutting the token count by over 60%.
  • For e-commerce, it predicts missing yet relevant tokens with a T5 model, increasing the share of novel expansion tokens and improving retrieval metrics.

Doc2Token refers to two distinct methodologies developed for token-level optimization at different stages in document understanding and retrieval. In the multimodal vision-language modeling context, Doc2Token (formally "Token-level Correlation-guided Compression") is a parameter-free adaptive compressor that identifies and preserves the most informative image patch tokens for efficient downstream processing. In the information retrieval domain, Doc2Token is a document expansion framework that directly predicts missing yet relevant tokens to bridge the query-document vocabulary gap, particularly in large-scale e-commerce search. Both lines of work focus on selective token inclusion to improve efficiency and utility, but their technical realizations and objectives differ significantly.

1. Token-Level Correlation-Guided Compression for Multimodal Document Understanding

In multimodal LLMs (MLLMs) for document understanding, images are typically cropped into sub-images and encoded into dense patch tokens using vision transformers (e.g., CLIP-ViT). Conventional approaches treat all patch tokens equivalently, ignoring variations in information content and leading to costly redundancies. The Doc2Token technique implements a two-stage, plug-and-play compressor between the vision encoder and the text adapter to enable adaptive, fine-grained token selection.

Redundancy Quantification with Patch–Patch Correlation

For a sub-image, let $K \in \mathbb{R}^{N \times D}$ be the matrix of CLIP keys for its $N$ patch tokens, and let $\widehat{K}$ denote its row-wise $\ell_2$-normalized version. The cosine similarity matrix $R \in \mathbb{R}^{N \times N}$ is

$$R = \widehat{K} \widehat{K}^\top, \qquad R_{ij} = \frac{K_i \cdot K_j}{\|K_i\|_2 \|K_j\|_2}$$

A token $i$ is redundant if it has $n_i > k$ neighbors with $R_{ij} > \alpha$, for thresholds $\alpha$ (e.g., $0.7$) and $k$ (e.g., $50$). The redundancy ratio is $r = N_R/N$, where $N_R$ is the number of redundant tokens; information density is defined as $d = 1 - r$.
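The redundancy test above can be sketched in a few lines. This is a minimal, dependency-free illustration using plain Python lists, not the authors' implementation (which the article describes as PyTorch matrix operations). The function names, the toy keys, and the choice to exclude a token from its own neighbor count are assumptions made for the example.

```python
import math

def cosine_similarity_matrix(keys):
    """Return R with R[i][j] = cosine similarity of keys[i] and keys[j]."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in keys]
    unit = [[x / n for x in row] for row, n in zip(keys, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def information_density(keys, alpha=0.7, k=50):
    """Return (redundant_flags, r, d) for one sub-image."""
    R = cosine_similarity_matrix(keys)
    n = len(keys)
    flags = []
    for i in range(n):
        # Assumption: a token does not count as its own neighbor.
        n_i = sum(1 for j in range(n) if j != i and R[i][j] > alpha)
        flags.append(n_i > k)
    r = sum(flags) / n          # redundancy ratio r = N_R / N
    return flags, r, 1.0 - r    # information density d = 1 - r

# Toy data: three near-duplicate patch keys and one distinct key.
keys = [[1.0, 0.0], [0.99, 0.05], [0.98, -0.04], [0.0, 1.0]]
flags, r, d = information_density(keys, alpha=0.7, k=1)
print(flags, r, d)
```

With the toy keys and a lowered `k=1`, the three near-duplicates are flagged as redundant and the distinct key is not, so three quarters of the tokens count as redundant.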

CLS–Patch Correlation-Based Sampling

Token selection integrates:

  • Global branch: Outlier mining on deep-layer [CLS]–patch attention scores using an IQR-based outlier threshold $T = Q_3 + 1.5 \cdot (Q_3 - Q_1)$. Tokens $i$ with attention $a_d(i) > T$ are global outliers.
  • Local branch: The attention vector $a_l$ from a lower CLIP-ViT layer is normalized into weights $\alpha_i$ (not to be confused with the similarity threshold $\alpha$), and $\lceil d N \rceil$ tokens are sampled without replacement according to $\{\alpha_i\}$.
  • Final set: $L = I \cup J$ (global outliers $\cup$ local samples). A kNN aggregation step merges unselected tokens into the selected ones.
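The two branches and their union can be sketched as follows. This is an illustrative approximation under stated assumptions: the quartiles come from `statistics.quantiles` with its default method (the source does not specify a quartile convention), sampling without replacement is done as sequential weighted draws, and all function names and attention values are hypothetical.

```python
import math
import random
import statistics

def global_outliers(deep_attn):
    """Indices whose deep-layer [CLS] attention exceeds Q3 + 1.5 * IQR."""
    q1, _median, q3 = statistics.quantiles(deep_attn, n=4)
    threshold = q3 + 1.5 * (q3 - q1)
    return {i for i, a in enumerate(deep_attn) if a > threshold}

def local_samples(shallow_attn, density, rng):
    """Draw ceil(density * N) distinct indices, weighted by attention."""
    total = sum(shallow_attn)
    weights = [a / total for a in shallow_attn]   # normalized weights
    budget = min(len(weights), math.ceil(density * len(weights)))
    candidates = list(range(len(weights)))
    chosen = set()
    for _ in range(budget):
        # Weighted draw over the remaining candidates, then remove the pick.
        pos = rng.choices(range(len(candidates)),
                          weights=[weights[i] for i in candidates], k=1)[0]
        chosen.add(candidates.pop(pos))
    return chosen

def select_tokens(deep_attn, shallow_attn, density, rng):
    """Union of global outliers and local samples."""
    return global_outliers(deep_attn) | local_samples(shallow_attn, density, rng)

# Hypothetical attention scores for 8 patch tokens.
deep_attn = [0.02, 0.03, 0.025, 0.035, 0.03, 0.02, 0.04, 0.8]
shallow_attn = [0.05, 0.10, 0.20, 0.05, 0.25, 0.15, 0.10, 0.10]
selected = select_tokens(deep_attn, shallow_attn, density=0.25,
                         rng=random.Random(0))
print(sorted(selected))
```

In this toy input only the last token clears the IQR threshold, so it is always kept; the local branch then contributes two sampled tokens, which may or may not overlap with it.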

Algorithmic details are implemented with matrix multiplications and sampling in PyTorch, preserving full gradient flow except for nondifferentiable index selection.
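The article does not spell out the kNN aggregation rule, so the following is only one plausible reading, offered as a sketch: each kept token is averaged with its `k` most similar dropped tokens. The function names, the unweighted mean, and the value of `k` are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_aggregate(tokens, kept, k=2):
    """Merge each kept token with its k nearest dropped tokens (assumed rule)."""
    kept_set = set(kept)
    dropped = [i for i in range(len(tokens)) if i not in kept_set]
    merged = []
    for s in sorted(kept_set):
        # Rank dropped tokens by similarity to the kept token s.
        nearest = sorted(dropped,
                         key=lambda j: cosine(tokens[s], tokens[j]),
                         reverse=True)[:k]
        group = [tokens[s]] + [tokens[j] for j in nearest]
        # Unweighted mean over the group, one output token per kept token.
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged

# Toy embeddings: tokens 0 and 2 are kept, tokens 1 and 3 are dropped.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = knn_aggregate(tokens, kept={0, 2}, k=1)
print(merged)
```

The output has one embedding per kept token, so the sequence length equals the size of the selected set while some information from dropped tokens is retained.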

2. Plug-and-Play Design and Integration

This compressor is inserted post-vision encoder, prior to text adaptation. For a batch of $B$ samples, each with $S$ sub-images, incoming tokens of shape $\mathbb{R}^{B \times (S \cdot N) \times D}$ are compressed to $\mathbb{R}^{B \times (\sum_s n_s) \times D}$ with $n_s < N$ per sub-image, determined adaptively. Key attributes:

  • Parameter-free: All operational thresholds are fixed; no gradient-based updates.
  • Differentiable aggregation: The retained token embeddings keep their backpropagation pathways.
  • Parallelism: Sub-image and batch processing are parallelizable; kNN aggregation is localized.

This parameter-free, adaptive system supports any cropping-based MLLM pipeline without retraining.

3. Experimental Results and Evaluation

Benchmarks were conducted on mPLUG-DocOwl 1.5 (CLIP-ViT/L-14 + LLaMA-2-7B) across ten major document-understanding datasets covering text-rich forms, QA, charts, and web screenshots. Doc2Token was evaluated against DocPedia, Monkey, UReader, PruMerge, and PruMerge+:

| Model | DocVQA | InfoVQA | DeepForm | KLC | WTQ | TabFact | ChartQA | TextVQA | TextCaps | VisualMRC |
|---|---|---|---|---|---|---|---|---|---|---|
| mPLUG-DocOwl1.5 (no comp.) | 81.6 | 50.4 | 68.8 | 37.9 | 39.8 | 80.4 | 70.5 | 68.8 | 132.0 | 239.5 |
| PruMerge (plug-play) | 53.6 | 29.6 | 9.3 | 28.3 | 23.7 | 71.4 | 55.8 | 60.0 | 120.7 | 125.9 |
| Ours (~66% of tokens removed, plug-play) | 72.6 | 49.6 | 63.2 | 34.6 | 35.2 | 75.2 | 64.0 | 68.0 | 132.1 | 254.3 |

  • The average token count falls to ~34% of the original, with an accuracy drop of approximately 5% in plug-and-play mode.
  • Ablations confirm that both the global and the local correlation-based branches are critical for maximal information retention.
  • Adaptive sampling outperforms all fixed-ratio baselines.
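The "approximately 5%" figure can be sanity-checked against the table rows. The snippet below is a rough check, not the authors' evaluation protocol: it takes the unweighted mean of per-dataset score ratios between the compressed and uncompressed rows, which is an assumption about how the summary number might be derived. The datasets use different metrics and scales, so the mean is only indicative.

```python
# Scores copied from the results table above.
baseline = {
    "DocVQA": 81.6, "InfoVQA": 50.4, "DeepForm": 68.8, "KLC": 37.9,
    "WTQ": 39.8, "TabFact": 80.4, "ChartQA": 70.5, "TextVQA": 68.8,
    "TextCaps": 132.0, "VisualMRC": 239.5,
}
compressed = {
    "DocVQA": 72.6, "InfoVQA": 49.6, "DeepForm": 63.2, "KLC": 34.6,
    "WTQ": 35.2, "TabFact": 75.2, "ChartQA": 64.0, "TextVQA": 68.0,
    "TextCaps": 132.1, "VisualMRC": 254.3,
}

# Per-dataset fraction of the uncompressed score that is retained.
retention = {name: compressed[name] / baseline[name] for name in baseline}
mean_retention = sum(retention.values()) / len(retention)

for name, ratio in retention.items():
    print(f"{name:10s} {ratio:.3f}")
print(f"mean retention: {mean_retention:.3f}")
```

Under this reading, the mean retention comes out near 0.95, which is consistent with the stated drop of roughly 5%.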

4. Document Expansion by Missing-Token Prediction in E-commerce Search

In the context of e-commerce search, Doc2Token addresses the lexical gap between customer queries and product metadata. Whereas Doc2Query (Nogueira et al., 2019) generates full synthetic queries for document expansion (often duplicating tokens already present), Doc2Token directly predicts only those relevant tokens that are missing from the document.

Token Prediction Task

Given document tokens $D$ and vocabulary $V$, Doc2Token learns $M(D) \subset V \setminus D$, the set of missing, semantically pertinent tokens. A T5 encoder-decoder model is fine-tuned with a cross-entropy objective, frequency-weighted by historical query token frequency ($w_k = f_k^\alpha$, $\alpha = 0.5$). For each input document, the decoder autoregressively predicts the top-$K$ novel tokens via beam search.
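To make the target construction concrete, the sketch below builds the missing-token set and its weights $w_k = f_k^{0.5}$ for one product, then evaluates a weighted negative log-likelihood. It is a simplified illustration: the function names, the toy frequencies, the default frequency of 1 for unseen tokens, and the normalization of the loss by the total weight are assumptions, and the actual model is a T5 decoder rather than a fixed probability table.

```python
import math

def missing_token_targets(doc_tokens, query_tokens, query_token_freq, alpha=0.5):
    """Map each query token absent from the document to its weight f_k ** alpha."""
    doc = set(doc_tokens)
    targets = {}
    for tok in query_tokens:
        if tok not in doc:
            # Assumption: tokens without a recorded frequency default to 1.
            targets[tok] = query_token_freq.get(tok, 1) ** alpha
    return targets

def weighted_nll(targets, token_probs):
    """Frequency-weighted cross-entropy over the target tokens.

    Assumption: the loss is normalized by the total weight.
    """
    total_weight = sum(targets.values())
    loss = sum(w * -math.log(token_probs[tok]) for tok, w in targets.items())
    return loss / total_weight

# Hypothetical product and query-log data.
doc_tokens = ["apple", "iphone", "case"]
query_tokens = ["phone", "cover", "iphone", "protector"]
query_token_freq = {"phone": 400, "cover": 100, "protector": 25}

targets = missing_token_targets(doc_tokens, query_tokens, query_token_freq)
loss = weighted_nll(targets, {"phone": 0.5, "cover": 0.25, "protector": 0.1})
print(targets, round(loss, 4))
```

The square root dampens the influence of very frequent query tokens: a token seen 16 times more often receives only 4 times the weight.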

Metric: Novel ROUGE Score

Standard ROUGE-1 is adapted to focus on truly novel expansions:

$$\text{nROUGE}_{P} = \frac{1}{I} \sum_i \frac{|\hat{y}_i \cap y_i^*|}{|\hat{y}_i|}, \quad \text{nROUGE}_{R} = \frac{1}{I} \sum_i \frac{|\hat{y}_i \cap y_i^*|}{|y_i^*|}, \quad \text{nROUGE}_{F1} = \frac{2\,\text{nROUGE}_P\,\text{nROUGE}_R}{\text{nROUGE}_P + \text{nROUGE}_R}$$

where $\hat{y}_i$ are the predicted novel tokens and $y_i^*$ the reference novel tokens for product $i$, and $I$ is the number of products.
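The metric translates directly into code. The sketch below treats predictions and references as token sets, averages precision and recall over products, and combines the two averages into F1, as in the formulas above. The function name, the zero score for empty sets, and the example tokens are assumptions for illustration.

```python
def novel_rouge(predictions, references):
    """Return (nROUGE_P, nROUGE_R, nROUGE_F1) averaged over products."""
    precisions, recalls = [], []
    for pred, ref in zip(predictions, references):
        pred, ref = set(pred), set(ref)
        overlap = len(pred & ref)
        # Assumption: an empty prediction or reference scores zero.
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(ref) if ref else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Hypothetical predicted and reference novel tokens for two products.
predictions = [["phone", "cover"], ["red", "shoe", "sneaker"]]
references = [["phone", "protector"], ["sneaker"]]

p, r, f1 = novel_rouge(predictions, references)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that F1 is computed from the averaged precision and recall, not averaged per product, which follows the formula as written.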

Empirical Comparison

Nearly 100% of the tokens Doc2Token adds are novel (absent from the document), compared with roughly 20% for Doc2Query at a comparable ROUGE score:

  • nROUGE F1 improves by up to 3.9 points for the same budget of novel tokens.
  • Statistically significant improvements (95% confidence via bootstrap) in nROUGE F1.
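The novelty comparison can be illustrated with a small helper that measures the share of expansion tokens not already in the document. The product text and both expansions below are made up for the example and merely mimic the reported pattern; `novelty_rate` is a hypothetical name.

```python
def novelty_rate(doc_tokens, expansion_tokens):
    """Fraction of expansion tokens that do not already occur in the document."""
    if not expansion_tokens:
        return 0.0
    doc = set(doc_tokens)
    novel = sum(1 for tok in expansion_tokens if tok not in doc)
    return novel / len(expansion_tokens)

# Made-up product title and two styles of expansion.
doc_tokens = "apple iphone 15 silicone case".split()
query_style = "iphone 15 case cover apple".split()   # full synthetic query
token_style = ["cover", "protector", "phone"]        # missing tokens only

print(novelty_rate(doc_tokens, query_style))
print(novelty_rate(doc_tokens, token_style))
```

A full synthetic query tends to repeat words from the product title, so most of its tokens add nothing to the index, whereas a missing-token prediction spends every slot on new vocabulary.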

Training and inference runtimes with full-match filtering:

  • Doc2Token: training 108 min/epoch, inference 76 min/100k products.
  • Doc2Query: training 166 min/epoch, inference 141 min/100k products.

5. Deployment Results and Practical Guidelines

Doc2Token for e-commerce retrieval was deployed to Walmart.com’s Solr stack. Predicted token expansions are indexed in a dedicated field. Online A/B tests yielded:

  • NDCG@10 increased from 0.485 to 0.487 (+0.49%, $p = 0.066$)
  • Revenue per session increased by +0.28% ($p = 0.013$, statistically significant)
  • Doc2Token was launched to all Walmart.com traffic after positive results.

For multitask document understanding, recommended hyperparameters ($\alpha = 0.7$, $k = 50$) showed robust behavior across ten datasets. Optimal CLIP-ViT layers for local/global correlation are 8 and 24, respectively.

6. Theoretical and Practical Implications

Both instances of Doc2Token demonstrate the advantage of explicit, token-level reasoning about information redundancy and retrieval effectiveness. In multimodal compression, quantifying redundancy with the patch correlation matrix and sampling by attention affords large reductions in token count with modest accuracy loss; because self-attention cost grows quadratically with sequence length, the compute savings in attention are larger still. In document expansion, directing model capacity to missing, relevant tokens rather than full synthetic queries maximizes utility per expansion slot and has measurable business impact.
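As a rough worked estimate of the quadratic effect, assume the pairwise attention-score computation scales with the square of the number of visual tokens $n$ and that compression keeps about 34% of them, as reported above. Ignoring text tokens and the linear-cost layers (both simplifying assumptions), the remaining attention cost relative to the uncompressed case is

$$\frac{(0.34\,n)^2}{n^2} = 0.34^2 \approx 0.116$$

so the pairwise attention term over visual tokens would shrink to roughly 12% of its original cost under these assumptions. End-to-end savings will be smaller, since the rest of the model scales linearly with sequence length.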

A plausible implication is that explicit modeling of token informativeness, whether via correlation or retrieval history, is broadly valuable in both language and vision settings. The modularity and parameter-free nature of these approaches could facilitate broad adoption as preprocessing or mid-pipeline optimizations. However, the scope and robustness of correlation-based redundancy detection in highly structured or noisy documents remain an area for further study.

While sharing the Doc2Token nomenclature, the methods are fundamentally distinct:

  • The MLLM compressor leverages spatial pattern redundancy and self-attention structure in vision transformer representations, yielding downstream efficiency gains at a modest accuracy cost, without learned parameters or retraining.
  • The retrieval-oriented expansion model applies token-level generative modeling with cross-entropy loss and bespoke novelty metrics, directly optimized for novel token recall and business metrics.

Both contrast with global fixed-threshold pruning or query expansion with full-sequence generation, illustrating the benefit of targeted and interpretable token-level interventions in both domains.

References:

  • "Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding" (Zhang et al., 2024)
  • "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search" (Li et al., 2024)
