Doc2Token Technique: Token-Level Optimization

Updated 5 January 2026

Doc2Token is a dual-method approach for token-level optimization in both multimodal document compression and e-commerce retrieval.
In multimodal compression, it uses correlation-guided sampling to retain key image patch tokens while reducing redundancy by over 60%.
For e-commerce, it predicts missing yet relevant tokens with a T5 model, significantly boosting novel token expansion and retrieval metrics.

Doc2Token refers to two distinct methodologies developed for token-level optimization at different stages in document understanding and retrieval. In the multimodal vision-language modeling context, Doc2Token (formally "Token-level Correlation-guided Compression") is a parameter-free adaptive compressor that identifies and preserves the most informative image patch tokens for efficient downstream processing. In the information retrieval domain, Doc2Token is a document expansion framework that directly predicts missing yet relevant tokens to bridge the query-document vocabulary gap, particularly in large-scale e-commerce search. Both lines of work focus on selective token inclusion to improve efficiency and utility, but their technical realizations and objectives differ significantly.

1. Multimodal Document Compression via Token-level Correlation (Zhang et al., 2024)

In multimodal LLMs (MLLMs) for document understanding, images are typically cropped into sub-images and encoded into dense patch tokens using vision transformers (e.g., CLIP-ViT). Conventional approaches treat all patch tokens equivalently, ignoring variations in information content and leading to costly redundancies. The Doc2Token technique implements a two-stage, plug-and-play compressor between the vision encoder and the text adapter to enable adaptive, fine-grained token selection.

Redundancy Quantification with Patch–Patch Correlation

For a sub-image with $N$ patch tokens, let $K \in \mathbb{R}^{N \times D}$ be the matrix of CLIP keys and $\widehat{K}$ its row-wise $\ell_2$-normalized version. The cosine similarity matrix $R \in \mathbb{R}^{N \times N}$ is

$R = \widehat{K} \widehat{K}^\top, \qquad R_{ij} = \frac{K_i \cdot K_j}{\|K_i\|_2 \|K_j\|_2}$

A token $i$ is redundant if its neighbor count $n_i$, the number of neighbors $j$ with $R_{ij} > \alpha$, exceeds $k$. Here $\alpha$ (e.g., $0.7$) and $k$ (e.g., $50$) are fixed thresholds. The redundancy ratio is $r = N_R/N$, where $N_R$ is the number of redundant tokens; the information density is defined as $d = 1-r$.
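The redundancy test above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `redundancy_stats` is hypothetical, a token is not counted as its own neighbor (an assumption, since the text does not say), and a real pipeline would use batched matrix products rather than Python loops.

```python
import math


def redundancy_stats(keys, alpha=0.7, k=50):
    """Return (redundant_flags, r, d) for the key vectors of one sub-image."""
    # Row-wise L2 normalization, so dot products become cosine similarities.
    normed = []
    for row in keys:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard zero vectors
        normed.append([x / norm for x in row])

    n = len(normed)
    flags = []
    for i in range(n):
        # n_i: neighbors j (assumed j != i) with cosine similarity above alpha.
        n_i = sum(
            1
            for j in range(n)
            if j != i and sum(a * b for a, b in zip(normed[i], normed[j])) > alpha
        )
        flags.append(n_i > k)

    r = sum(flags) / n  # redundancy ratio N_R / N
    return flags, r, 1.0 - r  # information density d = 1 - r
```

With the thresholds quoted in the text one would call `redundancy_stats(keys, alpha=0.7, k=50)`; the returned density `d` is what sets the local sampling budget in the next step.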

CLS–Patch Correlation-Based Sampling

Token selection integrates:

Global branch: Outlier mining on deep-layer [CLS]–patch attention scores $a_d$ using an IQR-based outlier threshold $T = Q_3 + 1.5 \cdot (Q_3 - Q_1)$. Tokens $i$ with attention $a_d(i)>T$ form the global outlier set $I$.
Local branch: The attention vector $a_l$ from a lower CLIP-ViT layer is normalized into sampling weights $\alpha_i$ (unrelated to the similarity threshold $\alpha$ above), and $\lceil d N \rceil$ tokens are sampled without replacement with probabilities $\{\alpha_i\}$, giving the local sample set $J$.
Final set: $L = I \cup J$ (global outliers $\cup$ local samples). A kNN aggregation step then merges the unselected tokens into the selected ones.

Algorithmic details are implemented with matrix multiplications and sampling in PyTorch, preserving full gradient flow except for nondifferentiable index selection.
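The two selection branches can be sketched as follows, in plain Python for readability rather than with PyTorch tensors. Treat it as an illustration under stated assumptions: the function names are hypothetical, quartiles follow the default convention of `statistics.quantiles` (the source does not specify one), attention scores are assumed strictly positive, and the kNN aggregation step is omitted.

```python
import math
import random
import statistics


def global_outliers(deep_attn):
    """Indices whose deep-layer attention exceeds Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(deep_attn, n=4)  # quartile convention assumed
    threshold = q3 + 1.5 * (q3 - q1)
    return {i for i, a in enumerate(deep_attn) if a > threshold}


def local_samples(shallow_attn, density, rng):
    """Draw ceil(d * N) distinct indices, weighted by shallow-layer attention."""
    n = len(shallow_attn)
    budget = min(n, math.ceil(density * n))
    remaining = list(range(n))
    weights = list(shallow_attn)
    chosen = set()
    for _ in range(budget):
        # random.choices normalizes the weights internally; removing each
        # pick afterwards makes this sampling without replacement.
        pos = rng.choices(range(len(remaining)), weights=weights)[0]
        chosen.add(remaining.pop(pos))
        weights.pop(pos)
    return chosen


def select_tokens(deep_attn, shallow_attn, density, seed=0):
    """Union of global outliers and local samples, as sorted indices."""
    rng = random.Random(seed)
    return sorted(global_outliers(deep_attn) | local_samples(shallow_attn, density, rng))
```

Because the final set is a union, the number of kept tokens can exceed $\lceil d N \rceil$ by up to the number of global outliers that the local branch did not already draw.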

2. Plug-and-Play Design and Integration

This compressor is inserted post-vision encoder, prior to text adaptation. For a batch of $B$ samples, each with $S$ sub-images, incoming tokens of shape $\mathbb{R}^{B \times (S \cdot N) \times D}$ are compressed to $\mathbb{R}^{B \times (\sum_s n_s) \times D}$ with $n_s < N$ per sub-image, determined adaptively. Key attributes:

Parameter-free: All operational thresholds are fixed; no gradient-based updates.
Differentiable aggregation: Retained token embeddings keep their backpropagation pathways.
Parallelism: Sub-image and batch processing are parallelizable; kNN aggregation is localized.

This parameter-free, adaptive system supports any cropping-based MLLM pipeline without retraining.

3. Experimental Results and Evaluation

Benchmarks were conducted on mPLUG-DocOwl 1.5 (CLIP-ViT/L-14 + LLaMA-2-7B) across ten major document-understanding datasets covering text-rich forms, QA, charts, and web screenshots. Doc2Token was evaluated against DocPedia, Monkey, UReader, PruMerge, and PruMerge+:

Model	DocVQA	InfoVQA	DeepForm	KLC	WTQ	TabFact	ChartQA	TextVQA	TextCaps	VisualMRC
mPLUG-DocOwl1.5 (no comp.)	81.6	50.4	68.8	37.9	39.8	80.4	70.5	68.8	132.0	239.5
PruMerge (plug-play)	53.6	29.6	9.3	28.3	23.7	71.4	55.8	60.0	120.7	125.9
Ours (~66% tokens, plug-play)	72.6	49.6	63.2	34.6	35.2	75.2	64.0	68.0	132.1	254.3

Average token count reduced to ~34% of the original, with approximately 5% accuracy drop in plug-and-play mode.
Ablations confirm both global and local correlation-based branches are critical for maximal information retention.
Adaptive sampling outperforms all fixed-ratio baselines.
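One way to read the "approximately 5%" figure is as the mean per-benchmark ratio of compressed to uncompressed score. The sketch below computes that ratio from the table rows above. The interpretation is an assumption, the helper name `mean_retention` is hypothetical, and averaging ratios across benchmarks whose scores sit on different scales is only a rough summary.

```python
from statistics import mean

# Scores copied from the table above, in column order.
BENCHMARKS = ["DocVQA", "InfoVQA", "DeepForm", "KLC", "WTQ",
              "TabFact", "ChartQA", "TextVQA", "TextCaps", "VisualMRC"]
SCORES = {
    "uncompressed": [81.6, 50.4, 68.8, 37.9, 39.8, 80.4, 70.5, 68.8, 132.0, 239.5],
    "PruMerge": [53.6, 29.6, 9.3, 28.3, 23.7, 71.4, 55.8, 60.0, 120.7, 125.9],
    "Doc2Token": [72.6, 49.6, 63.2, 34.6, 35.2, 75.2, 64.0, 68.0, 132.1, 254.3],
}


def mean_retention(method):
    """Average ratio of a method's score to the uncompressed score."""
    base = SCORES["uncompressed"]
    return mean(score / ref for score, ref in zip(SCORES[method], base))


for name in ("PruMerge", "Doc2Token"):
    print(f"{name}: {mean_retention(name):.3f}")
```

Under this reading Doc2Token retains roughly 95% of the uncompressed score on average, against roughly 67% for PruMerge in plug-and-play mode.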

4. E-commerce Search Expansion with Doc2Token (Li et al., 2024)

In the context of e-commerce search, Doc2Token addresses the lexical gap between customer queries and product metadata. Whereas Doc2Query (Nogueira et al., 2019) generates full synthetic queries for document expansion (often duplicating tokens already present), Doc2Token directly predicts only those relevant tokens that are missing from the document.

Token Prediction Task

Given document tokens $D$ and vocabulary $V$, Doc2Token learns to predict $M(D) \subset V \setminus D$, the set of tokens that are missing from the document yet semantically pertinent to it. A T5 encoder-decoder model is fine-tuned with a cross-entropy objective in which each target token $k$ is weighted by its historical query frequency $f_k$ ($w_k = f_k^\alpha$, $\alpha=0.5$; this exponent is unrelated to the similarity threshold $\alpha$ used by the compressor). For each input document, the decoder autoregressively predicts the top-$K$ novel tokens via beam search.
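The target set and the frequency weights can be illustrated with a small sketch. Several things here are assumptions rather than details from the source: whitespace tokenization, the idea that targets come from queries associated with a product, and the helper names `novel_targets` and `token_weights`.

```python
from collections import Counter


def tokenize(text):
    """Lowercased whitespace tokenization (an assumption for illustration)."""
    return text.lower().split()


def novel_targets(doc_text, query_texts):
    """Tokens seen in associated queries but absent from the document."""
    doc_tokens = set(tokenize(doc_text))
    targets = []
    for query in query_texts:
        for tok in tokenize(query):
            if tok not in doc_tokens and tok not in targets:
                targets.append(tok)  # keep first-seen order, no duplicates
    return targets


def token_weights(query_log, alpha=0.5):
    """Loss weight w_k = f_k ** alpha from historical query token counts."""
    freq = Counter(tok for query in query_log for tok in tokenize(query))
    return {tok: count ** alpha for tok, count in freq.items()}


queries = ["cordless drill", "power tool drill", "battery power drill"]
print(novel_targets("Acme cordless drill 20V", queries))
print(token_weights(queries)["power"])
```

With $\alpha = 0.5$ the weight grows as the square root of the frequency, so frequent query tokens count for more in the loss without dominating it.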

Metric: Novel ROUGE Score

Standard ROUGE-1 is adapted to focus on truly novel expansions:

$\text{nROUGE}_{P} = \frac{1}{I} \sum_i \frac{| \hat{y}_i \cap y_i^* |}{|\hat{y}_i|}, \qquad \text{nROUGE}_{R} = \frac{1}{I}\sum_i \frac{| \hat{y}_i \cap y_i^* |}{|y_i^*|}$

$\text{nROUGE}_{F1} = \frac{2\,\text{nROUGE}_P\,\text{nROUGE}_R}{\text{nROUGE}_P+\text{nROUGE}_R}$

where $\hat{y}_i$ are the predicted novel tokens and $y_i^*$ the reference novel tokens for product $i$, and $I$ is the number of products.
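A direct transcription of these formulas into Python, treating each $\hat{y}_i$ and $y_i^*$ as a set of tokens, might look like the sketch below. The function name `n_rouge` is hypothetical, and scoring an empty prediction or reference as zero is an assumption the formulas leave open.

```python
def n_rouge(predicted, reference):
    """Return (precision, recall, F1) averaged over products.

    predicted, reference: equal-length lists of token sets, one per product.
    """
    pairs = list(zip(predicted, reference))
    precisions, recalls = [], []
    for y_hat, y_star in pairs:
        overlap = len(y_hat & y_star)
        # Empty sets contribute 0.0 instead of dividing by zero (assumption).
        precisions.append(overlap / len(y_hat) if y_hat else 0.0)
        recalls.append(overlap / len(y_star) if y_star else 0.0)

    p = sum(precisions) / len(pairs)
    r = sum(recalls) / len(pairs)
    # F1 combines the averaged precision and recall, as in the formula.
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Note that F1 is computed from the corpus-level averages rather than averaged per product, which matches the formula as written.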

Empirical Comparison

Nearly all (~100%) of the tokens Doc2Token adds are novel, compared with ~20% for Doc2Query at a comparable ROUGE:

nROUGE F1 improves by up to 3.9 points for the same budget of novel tokens.
Statistically significant improvements (95% confidence via bootstrap) in nROUGE F1.

Training and inference runtimes with full-match filtering:

Doc2Token: training 108 min/epoch, inference 76 min per 100k products.
Doc2Query: training 166 min/epoch, inference 141 min per 100k products.

5. Deployment Results and Practical Guidelines

Doc2Token for e-commerce retrieval was deployed to Walmart.com’s Solr stack. Predicted token expansions are indexed in a dedicated field. Online A/B tests yielded:

NDCG@10 increased from 0.485 to 0.487 (+0.49%, $p=0.066$ )
Revenue per session increased by +0.28% ( $p=0.013$ , statistically significant)
Doc2Token was launched to all Walmart.com traffic after positive results.
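Indexing the predictions in a dedicated field keeps them separate from the original product text. The sketch below builds a payload in Solr's JSON atomic-update syntax (the `set` modifier). The field name `doc2token_expansion`, the product id, and the helper `expansion_update` are hypothetical; the source does not describe the actual schema or indexing job.

```python
import json


def expansion_update(product_id, predicted_tokens, field="doc2token_expansion"):
    """Build one atomic-update document that overwrites the expansion field."""
    unique = list(dict.fromkeys(predicted_tokens))  # de-duplicate, keep order
    return {"id": product_id, field: {"set": " ".join(unique)}}


# A JSON body of this shape could be posted to a collection's update handler.
payload = json.dumps([expansion_update("sku-123", ["power", "tool", "power"])])
print(payload)
```

Keeping expansions in their own field also allows them to be weighted or disabled at query time independently of the original fields, which is convenient for A/B testing.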

For the multimodal compressor, the recommended thresholds ($\alpha=0.7$, $k=50$) behaved robustly across the ten document-understanding datasets. The best-performing CLIP-ViT layers for the local and global correlation branches are 8 and 24, respectively.

6. Theoretical and Practical Implications

Both instances of Doc2Token demonstrate the advantage of explicit, token-level reasoning about information redundancy and retrieval effectiveness. In multimodal compression, quantifying and sampling by correlation matrix and attention afford large reductions in token count with minimal accuracy loss, yielding quadratic compute savings in transformer attention. In document expansion, directing model capacity to missing, relevant tokens rather than full synthetic queries maximizes utility per expansion slot and has measurable business impact.
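As a rough worked example of the quadratic saving, assume self-attention cost scales as $n^2$ in the number of tokens $n$ and that a fraction $\rho$ of the visual tokens is retained (both are simplifications, and $\rho$ is notation introduced here). Then

$\frac{\text{cost}(\rho n)}{\text{cost}(n)} = \frac{(\rho n)^2}{n^2} = \rho^2, \qquad \rho = 0.34 \;\Rightarrow\; \rho^2 \approx 0.12$

so retaining about 34% of the tokens would cut the attention cost over those tokens to roughly 12% of the original. Text tokens and the linear-cost layers are not reduced by this factor, so the end-to-end saving is smaller.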

A plausible implication is that explicit modeling of token informativeness, whether via correlation or retrieval history, is broadly valuable in both language and vision settings. The modularity and parameter-free nature of these approaches could facilitate broad adoption as preprocessing or mid-pipeline optimizations. However, the scope and robustness of correlation-based redundancy detection in highly structured or noisy documents remains an area for further study.

7. Related Methods and Distinctions

While sharing the Doc2Token nomenclature, the methods are fundamentally distinct:

The MLLM compressor leverages spatial pattern redundancy and self-attention structure in vision transformer representations, yielding downstream efficiency gains with limited accuracy loss, without learned parameters or retraining.
The retrieval-oriented expansion model applies token-level generative modeling with cross-entropy loss and bespoke novelty metrics, directly optimized for novel token recall and business metrics.

Both contrast with global fixed-threshold pruning or query expansion with full-sequence generation, illustrating the benefit of targeted and interpretable token-level interventions in both domains.

References:

"Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding" (Zhang et al., 2024)
"Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search" (Li et al., 2024)
"Document Expansion by Query Prediction" (Nogueira et al., 2019)
