Doc2Token Technique: Token-Level Optimization
- Doc2Token names two distinct methods for token-level optimization: one for multimodal document compression and one for e-commerce retrieval.
- In multimodal compression, it uses correlation-guided sampling to retain key image patch tokens while cutting the token count by over 60%.
- For e-commerce, it predicts missing yet relevant tokens with a T5 model, significantly boosting novel token expansion and retrieval metrics.
Doc2Token refers to two distinct methodologies developed for token-level optimization at different stages in document understanding and retrieval. In the multimodal vision-language modeling context, Doc2Token (formally "Token-level Correlation-guided Compression") is a parameter-free adaptive compressor that identifies and preserves the most informative image patch tokens for efficient downstream processing. In the information retrieval domain, Doc2Token is a document expansion framework that directly predicts missing yet relevant tokens to bridge the query-document vocabulary gap, particularly in large-scale e-commerce search. Both lines of work focus on selective token inclusion to improve efficiency and utility, but their technical realizations and objectives differ significantly.
1. Multimodal Document Compression via Token-level Correlation (Zhang et al., 2024)
In multimodal LLMs (MLLMs) for document understanding, images are typically cropped into sub-images and encoded into dense patch tokens using vision transformers (e.g., CLIP-ViT). Conventional approaches treat all patch tokens equivalently, ignoring variations in information content and leading to costly redundancies. The Doc2Token technique implements a two-stage, plug-and-play compressor between the vision encoder and the text adapter to enable adaptive, fine-grained token selection.
Redundancy Quantification with Patch–Patch Correlation
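For a sub-image, let $K \in \mathbb{R}^{N \times C}$ be the matrix of $\ell_2$-normalized CLIP keys, with $N$ patch tokens. The cosine similarity matrix is $S = K K^\top \in \mathbb{R}^{N \times N}$.
A token $i$ is redundant if it exhibits at least $\kappa$ neighbors $j$ with cosine similarity $S_{ij} > \tau$, introducing thresholds $\tau$ (e.g., $0.7$) and $\kappa$ (e.g., $50$). The redundancy ratio is $r = N_r / N$, where $N_r$ is the number of redundant tokens; information density is defined as $d = 1 - r$.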
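The redundancy computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `information_density`, the parameter names `tau` and `kappa`, the "at least `kappa` neighbors" reading, and the exclusion of self-similarity are all assumptions made here.

```python
import numpy as np

def information_density(keys, tau=0.7, kappa=50):
    """Return (redundant_mask, redundancy_ratio, density) for one sub-image.

    keys: (N, C) array of patch key vectors, not yet normalized.
    tau, kappa: similarity and neighbor-count thresholds (example values
    taken from the text; the names are hypothetical).
    """
    # L2-normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    k = keys / np.clip(norms, 1e-12, None)
    sim = k @ k.T
    # Assumption: a token does not count as its own neighbor.
    np.fill_diagonal(sim, -np.inf)
    neighbor_counts = (sim > tau).sum(axis=1)
    # Assumption: "redundant" means at least kappa highly similar neighbors.
    redundant = neighbor_counts >= kappa
    r = float(redundant.mean())
    return redundant, r, 1.0 - r

# Toy input: 60 identical "background" patches and 40 mutually
# orthogonal "content" patches.
keys = np.zeros((100, 64))
keys[:60, 0] = 1.0
keys[60:, 1:41] = np.eye(40)
redundant, r, density = information_density(keys)
print(r, density)
```

On this toy input the 60 identical patches are flagged as redundant and the 40 orthogonal ones are not, giving a redundancy ratio of about 0.6 and a density of about 0.4.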
CLS–Patch Correlation-Based Sampling
Token selection integrates:
- Global branch: Outlier mining on deep-layer [CLS]–patch attention scores $a_i$ using an IQR-based outlier threshold $T$. Tokens with attention $a_i > T$ are global outliers.
- Local branch: The attention vector from a lower CLIP-ViT layer is normalized into weights $p_i$, and tokens are sampled without replacement according to $p$.
- Final set: $\mathcal{S} = $ (global outliers $\cup$ local samples). A kNN aggregation step merges unselected tokens into the selected ones.
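The three steps above might look roughly like the following NumPy sketch. Several details are assumptions rather than facts from the source: the upper-fence rule `Q3 + 1.5 * IQR` (Tukey's convention), the fixed `n_local` sample size, and the choice to merge each dropped token into its single most similar kept token by averaging. The function names `select_tokens` and `merge_unselected` are hypothetical.

```python
import numpy as np

def select_tokens(attn_deep, attn_shallow, n_local, rng, iqr_factor=1.5):
    """Union of global IQR outliers and locally sampled patch indices."""
    # Global branch: flag patches whose deep-layer [CLS] attention is an
    # upper outlier. The 1.5 factor is Tukey's convention (an assumption).
    q1, q3 = np.percentile(attn_deep, [25, 75])
    threshold = q3 + iqr_factor * (q3 - q1)
    global_idx = np.flatnonzero(attn_deep > threshold)

    # Local branch: normalize shallow-layer attention into probabilities
    # and draw n_local distinct patches (sampling without replacement).
    p = attn_shallow / attn_shallow.sum()
    local_idx = rng.choice(p.size, size=n_local, replace=False, p=p)
    return np.union1d(global_idx, local_idx)

def merge_unselected(tokens, keys, selected):
    """Average each dropped token into its most similar kept token (1-NN)."""
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    dropped = np.setdiff1d(np.arange(len(tokens)), selected)
    merged = tokens[selected].astype(float)
    counts = np.ones(len(selected))
    if dropped.size:
        # Index (into `selected`) of the nearest kept token per dropped token.
        nearest = (kn[dropped] @ kn[selected].T).argmax(axis=1)
        np.add.at(merged, nearest, tokens[dropped])
        np.add.at(counts, nearest, 1.0)
    return merged / counts[:, None]

# Toy data: 16 patches, two of which receive unusually high deep attention.
rng = np.random.default_rng(0)
n_patches, dim = 16, 8
tokens = rng.normal(size=(n_patches, dim))
attn_deep = np.linspace(0.01, 0.02, n_patches)
attn_deep[[3, 7]] = [0.5, 0.4]
attn_shallow = rng.uniform(0.1, 1.0, size=n_patches)

selected = select_tokens(attn_deep, attn_shallow, n_local=4, rng=rng)
compressed = merge_unselected(tokens, tokens, selected)
print(selected, compressed.shape)
```

Patches 3 and 7 are always kept by the global branch, while the local branch contributes up to four more indices, so the number of retained tokens varies with the input rather than being fixed in advance.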
Algorithmic details are implemented with matrix multiplications and sampling in PyTorch, preserving full gradient flow except for nondifferentiable index selection.
2. Plug-and-Play Design and Integration
This compressor is inserted post-vision encoder, prior to text adaptation. For a batch of $B$ samples, each with $M$ sub-images, incoming tokens of shape $B \times M \times N \times C$ are compressed to $B \times M \times N' \times C$ with $N' \le N$ per sub-image, determined adaptively. Key attributes:
- Parameter-free: All operational thresholds are fixed; no gradient-based updates.
- Differentiable aggregation: Retained token embeddings keep their backpropagation pathways.
- Parallelism: Sub-image and batch processing are parallelizable; kNN aggregation is localized.
This parameter-free, adaptive system supports any cropping-based MLLM pipeline without retraining.
3. Experimental Results and Evaluation
Benchmarks were conducted on mPLUG-DocOwl 1.5 (CLIP-ViT/L-14 + LLaMA-2-7B) across ten major document-understanding datasets covering text-rich forms, QA, charts, and web screenshots. Doc2Token was evaluated against DocPedia, Monkey, UReader, PruMerge, and PruMerge+:
| Model | DocVQA | InfoVQA | DeepForm | KLC | WTQ | TabFact | ChartQA | TextVQA | TextCaps | VisualMRC |
|---|---|---|---|---|---|---|---|---|---|---|
| mPLUG-DocOwl1.5 (no comp.) | 81.6 | 50.4 | 68.8 | 37.9 | 39.8 | 80.4 | 70.5 | 68.8 | 132.0 | 239.5 |
| PruMerge (plug-play) | 53.6 | 29.6 | 9.3 | 28.3 | 23.7 | 71.4 | 55.8 | 60.0 | 120.7 | 125.9 |
| Ours (~66% tokens, plug-play) | 72.6 | 49.6 | 63.2 | 34.6 | 35.2 | 75.2 | 64.0 | 68.0 | 132.1 | 254.3 |
- Average token count reduced to ~34% of the original, with approximately 5% accuracy drop in plug-and-play mode.
- Ablations confirm both global and local correlation-based branches are critical for maximal information retention.
- Adaptive sampling outperforms all fixed-ratio baselines.
4. E-commerce Search Expansion with Doc2Token (Li et al., 2024)
In the context of e-commerce search, Doc2Token addresses the lexical gap between customer queries and product metadata. Whereas Doc2Query (Nogueira et al., 2019) generates full synthetic queries for document expansion (often duplicating tokens already present), Doc2Token directly predicts only those relevant tokens that are missing from the document.
Token Prediction Task
Given document tokens $D$ and vocabulary $V$, Doc2Token learns to predict $T \subseteq V \setminus D$, the missing, semantically pertinent tokens. A T5 encoder-decoder model is fine-tuned with a cross-entropy objective weighted by historical query token frequency. For each input document, the decoder autoregressively predicts the top-$k$ novel tokens via beam search.
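To make the prediction target concrete, the sketch below derives candidate novel tokens for one product from queries associated with it. It is an illustration only: whitespace tokenization, the way queries are linked to a product, and the use of raw counts as frequency weights are assumptions, and `novel_targets` is a hypothetical helper rather than part of the published system.

```python
from collections import Counter

def novel_targets(doc_text, query_texts, top_k=5):
    """Tokens seen in associated queries but absent from the document,
    ranked by how often they occur in those queries."""
    doc_tokens = set(doc_text.lower().split())
    counts = Counter(tok for q in query_texts for tok in q.lower().split())
    novel = [(tok, c) for tok, c in counts.items() if tok not in doc_tokens]
    # Most frequent first; ties broken alphabetically for determinism.
    novel.sort(key=lambda item: (-item[1], item[0]))
    return novel[:top_k]

# Made-up product title and queries, for illustration only.
doc = "Apple iPhone 13 128GB blue"
queries = ["iphone 13 smartphone", "apple cell phone", "blue smartphone unlocked"]
targets = novel_targets(doc, queries)
print(targets)
```

Tokens such as `iphone` or `blue` are dropped because the document already contains them, which is the distinction the text draws against generating full synthetic queries.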
Metric: Novel ROUGE Score
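Standard ROUGE-1 is adapted to focus on truly novel expansions: unigram overlap is scored only on predicted tokens that are absent from the document, against $R_p$, where $R_p$ are the reference novel tokens for product $p$.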
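A minimal sketch of such a novelty-restricted ROUGE-1 follows. It assumes set-based unigram matching and per-product scoring; the source does not spell out the exact formula, so `novel_rouge` should be read as one plausible instantiation rather than the paper's definition.

```python
def novel_rouge(predicted, reference, doc_tokens):
    """Precision, recall and F1 over tokens that are not in the document."""
    doc = set(doc_tokens)
    pred = set(predicted) - doc   # predicted tokens that are actually novel
    ref = set(reference) - doc    # reference novel tokens for the product
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Made-up example: "apple" is predicted but already in the document,
# so it earns no credit and does not count against precision.
doc_tokens = ["apple", "iphone", "13", "blue"]
predicted = ["smartphone", "phone", "apple", "case"]
reference = ["smartphone", "cell", "phone", "unlocked"]
precision, recall, f1 = novel_rouge(predicted, reference, doc_tokens)
print(precision, recall, f1)
```

Here two of the three novel predictions match the four reference tokens, so precision is 2/3, recall is 1/2, and F1 is 4/7.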
Empirical Comparison
Doc2Token yields ~100% novel token expansions, outperforming Doc2Query's ~20% at a comparable ROUGE:
- nROUGE F1 improves by up to 3.9 points for the same budget of novel tokens.
- Statistically significant improvements (95% confidence via bootstrap) in nROUGE F1.
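A bootstrap check of this kind can be sketched as a percentile interval over per-product score differences. The resampling scheme, the number of resamples, and the data below are assumptions for illustration; the source states only that a 95% bootstrap confidence level was used.

```python
import random

def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample products with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[round(alpha / 2 * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-product nROUGE F1 differences (method A minus method B);
# these numbers are invented for the example, not taken from the paper.
diffs = [0.05, 0.02, 0.04, 0.00, 0.03, 0.06, 0.01, 0.04, 0.03, 0.05]
lo, hi = bootstrap_ci(diffs)
print(lo, hi)
```

If the whole interval lies above zero, the improvement would be called significant at the chosen level.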
Training and inference runtimes with full-match filtering:
- Doc2Token: training 108 min/epoch, inference 76 min/100k products.
- Doc2Query: training 166 min/epoch, inference 141 min/100k products.
5. Deployment Results and Practical Guidelines
Doc2Token for e-commerce retrieval was deployed to Walmart.com’s Solr stack. Predicted token expansions are indexed in a dedicated field. Online A/B tests yielded:
- NDCG@10 increased from 0.485 to 0.487 (+0.49%)
- Revenue per session increased by +0.28% (statistically significant)
- Doc2Token was launched to all Walmart.com traffic after positive results.
For multitask document understanding, the recommended hyperparameters showed robust behavior across ten datasets. Optimal CLIP-ViT layers for local/global correlation are 8 and 24, respectively.
6. Theoretical and Practical Implications
Both instances of Doc2Token demonstrate the advantage of explicit, token-level reasoning about information redundancy and retrieval effectiveness. In multimodal compression, quantifying and sampling by correlation matrix and attention afford large reductions in token count with minimal accuracy loss, yielding quadratic compute savings in transformer attention. In document expansion, directing model capacity to missing, relevant tokens rather than full synthetic queries maximizes utility per expansion slot and has measurable business impact.
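As a rough worked example of the quadratic saving, suppose the image-token count drops from $N$ to about $0.34N$, the average retention reported in the experimental results. Considering only the pairwise self-attention term over those tokens, and ignoring text tokens and the layers whose cost grows linearly with sequence length, the relative cost is approximately:

```latex
\frac{(0.34\,N)^2}{N^2} = 0.34^2 \approx 0.116
```

That is, the attention term alone would shrink to roughly 12% of its original cost under these simplifying assumptions; end-to-end savings are smaller because the remaining components scale linearly.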
A plausible implication is that explicit modeling of token informativeness, whether via correlation or retrieval history, is broadly valuable in both language and vision settings. The modularity and parameter-free nature of these approaches could facilitate broad adoption as preprocessing or mid-pipeline optimizations. However, the scope and robustness of correlation-based redundancy detection in highly structured or noisy documents remains an area for further study.
7. Related Methods and Distinctions
While sharing the Doc2Token nomenclature, the methods are fundamentally distinct:
- The MLLM compressor leverages spatial pattern redundancy and self-attention structure in vision transformer representations, yielding downstream efficiency gains with limited accuracy loss and without learned parameters or retraining.
- The retrieval-oriented expansion model applies token-level generative modeling with cross-entropy loss and bespoke novelty metrics, directly optimized for novel token recall and business metrics.
Both contrast with global fixed-threshold pruning or query expansion with full-sequence generation, illustrating the benefit of targeted and interpretable token-level interventions in both domains.
References:
- "Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding" (Zhang et al., 2024)
- "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search" (Li et al., 2024)