MaxSim Scoring: BiMax for Document Alignment
- MaxSim Scoring is a document similarity metric that segments documents and uses cosine similarity to measure cross-lingual matches.
- BiMax averages maximum segment similarities in both directions, yielding recall rates up to 96.1% with a 100× speedup over optimal transport methods.
- The method supports various segmentation and embedding strategies, and real-world applications such as neural machine translation and web mining, through the open-source EmbDA toolkit.
The Bidirectional MaxSim (BiMax) score is a cross-lingual document-to-document similarity metric introduced to improve both the efficiency and accuracy of large-scale document alignment, particularly in web mining scenarios. Unlike prior approaches such as Optimal Transport (OT) and TK-PERT, BiMax is designed to deliver near state-of-the-art recall while enabling substantially faster computation on massive multilingual corpora. Its open-source implementation, EmbDA, provides researchers with tooling to reproduce and extend alignment experiments across various language pairs and embedding backbones (Wang et al., 17 Oct 2025).
1. Mathematical Definition of MaxSim and BiMax
Given a source document $D_s$ and a target document $D_t$, these are decomposed into sets of non-overlapping segments $S = \{s_1, \dots, s_m\}$ and $T = \{t_1, \dots, t_n\}$, respectively. Each segment is mapped to a dense vector via a pretrained multilingual embedding model $\phi$, such as LaBSE or LASER-2. Segment-to-segment similarity is computed as cosine similarity:

$$M_{ij} = \cos\big(\phi(s_i), \phi(t_j)\big) = \frac{\phi(s_i) \cdot \phi(t_j)}{\|\phi(s_i)\|\,\|\phi(t_j)\|}.$$

The unidirectional MaxSim score aggregates, for each source segment $s_i$, the maximum similarity to any segment in $T$, averaged over all source segments:

$$\mathrm{MaxSim}(D_s \to D_t) = \frac{1}{m} \sum_{i=1}^{m} \max_{1 \le j \le n} M_{ij}.$$

To address the intrinsic asymmetry, BiMax averages the two directions:

$$\mathrm{BiMax}(D_s, D_t) = \frac{1}{2}\Big(\mathrm{MaxSim}(D_s \to D_t) + \mathrm{MaxSim}(D_t \to D_s)\Big).$$
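As a small numeric illustration of the asymmetry (the similarity values below are hypothetical, as if produced by a multilingual encoder), the two directions differ when one document has unmatched segments:

```python
import numpy as np

# Toy cosine-similarity matrix between 2 source and 3 target segments
# (hypothetical values; the third target segment has no good match).
M = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
])

maxsim_s_to_t = M.max(axis=1).mean()   # (0.9 + 0.8) / 2
maxsim_t_to_s = M.max(axis=0).mean()   # (0.9 + 0.8 + 0.2) / 3
bimax = 0.5 * (maxsim_s_to_t + maxsim_t_to_s)
```

Here the source-to-target direction scores higher than the reverse, and BiMax splits the difference.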
2. Computational Efficiency and Complexity
BiMax is quantitatively distinguished from OT-based methods by its computational profile. The primary cost for BiMax is the construction of the segment similarity matrix $M \in \mathbb{R}^{m \times n}$, which involves $O(mnd)$ operations, with $d$ being the embedding dimension. Subsequent row-wise and column-wise maximizations require $O(mn)$. Thus, total time complexity is $O(mnd)$, strictly quadratic in the segment counts.
In contrast, OT-based metrics require solving a transportation or Sinkhorn optimization over the same $m \times n$ cost matrix, leading to practical costs on the order of $O(n^3 \log n)$ for exact solvers or $O(mn)$ per Sinkhorn iteration, often with many iterations needed. Empirically, BiMax achieves up to a 100-fold speedup over OT in document alignment experiments (Wang et al., 17 Oct 2025).
3. Segmentation, Embedding, and Implementation Choices
BiMax is compatible with multiple segmentation and embedding strategies:
- Segmentation: Sentence-based (SBS), Blob (grouped fixed-length chunks, e.g., 64 tokens), and Overlapping Fixed-Length Spans (OFLS; e.g., 30-token windows with 50% overlap).
- Embedding Backbones: LaBSE, LASER-2, LEALLA (a lightweight LaBSE variant), and multilingual Sentence-Transformers (distiluse-base-multilingual-cased-v2, paraphrase-MiniLM-L12-v2, paraphrase-mpnet-base-v2), as well as multi-task models (BGE M3, jina-embeddings-v3).
- Candidate Generation: Mean-pooling and TK-PERT summarization, with fast retrieval via Faiss.IndexFlatIP on GPU to fetch top-$k$ candidate targets by embedding proximity.
- Re-ranking: BiMax is applied to each candidate pair for final decision-making, optionally comparing with OT or Greedy Movers’ Distance as ablation baselines.
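A minimal sketch of the candidate-generation stage under these choices, assuming documents arrive as token lists and segment embeddings are precomputed; brute-force inner-product search stands in here for Faiss.IndexFlatIP, and the 30-token/50%-overlap OFLS settings mirror the configuration above:

```python
import numpy as np

def ofls_segments(tokens, span=30, overlap=0.5):
    """Overlapping fixed-length spans: windows of `span` tokens with
    stride span * (1 - overlap), as in the OFLS configuration."""
    stride = max(1, int(span * (1 - overlap)))
    return [tokens[i:i + span]
            for i in range(0, max(1, len(tokens) - span + 1), stride)]

def mean_pool(seg_embs):
    """Mean-pooled, L2-normalized document vector from segment embeddings."""
    v = np.asarray(seg_embs, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)

def top_k_candidates(src_vecs, tgt_vecs, k=5):
    """Brute-force inner-product retrieval; Faiss.IndexFlatIP plays this
    role at scale. Returns, per source doc, indices of the top-k targets."""
    scores = np.asarray(src_vecs) @ np.asarray(tgt_vecs).T
    return np.argsort(-scores, axis=1)[:, :k]
```

The retrieved top-$k$ targets per source document are then re-ranked with BiMax, as described above.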
Hyperparameters for TK-PERT, such as the number of windows and the peakedness of the PERT weighting, are tuned per document alignment task (with separate settings for WMT16 and MnRN). Experiments are typically run on A6000 or H100 GPUs.
4. Empirical Evaluation and Comparative Results
On the WMT16 bilingual document alignment benchmark (682K English, 522K French documents, 2.4K gold pairs), BiMax achieves soft-recall rates comparable to or slightly exceeding OT across various segmentation-embedding configurations. For instance, with LaBSE+OFLS, BiMax yields up to 96.1% recall, closely aligned with OT’s 96.8%, but at an approximately 100× greater throughput (13.2k pairs/sec for BiMax with mean pooling, versus 97–99 pairs/sec for OT) (Wang et al., 17 Oct 2025).
Ablation studies on the MnRN (Ja–En) corpus and several low-resource datasets (En–Si, En–Ta, Si–Ta) further highlight that BiMax, in conjunction with state-of-the-art embeddings, consistently delivers the highest F1 scores and recall: e.g., LaBSE+OFLS+BiMax on MnRN attains F1 ≈ 0.918 versus OT’s ≈ 0.837, with preprocessing times of 100s (BiMax) versus 640s (OT). BiMax shows superior performance on short document pairs (≤256 tokens), while OT/TK-PERT may better handle very long documents; this suggests hybrid pipelines adapting the scorer to document length.
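The length-dependent behavior suggests a simple dispatcher; a minimal sketch, where the 256-token threshold follows the short-vs-long observation reported above and the scorer callables are hypothetical stand-ins:

```python
def choose_scorer(doc_tokens, bimax_scorer, ot_scorer, threshold=256):
    """Route short documents to BiMax and long ones to an OT-based scorer.

    `bimax_scorer` and `ot_scorer` are hypothetical callables; the 256-token
    cutoff mirrors the short-document regime where BiMax performs best.
    """
    return bimax_scorer if len(doc_tokens) <= threshold else ot_scorer
```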
5. Downstream Applications and Impact
Cross-lingual document alignment via BiMax supports end-to-end mining for neural machine translation (NMT) and web-scale parallel data extraction. In WMT23 downstream NMT experiments, all major aligners (OT, TK-PERT, BiMax) yield nearly identical translation quality in terms of BLEU and chrF (±0.1 BLEU) on four held-out domains (EMEA, EUB, EP, JRC), but BiMax enables twice the alignment speed (13k vs. 6–7k pairs/sec). Augmenting translation training with 400k BiMax-aligned pairs improves BLEU by 1–2 points in some domains (Wang et al., 17 Oct 2025).
6. Tools and Reproducibility
The EmbDA toolkit (https://github.com/EternalEdenn/EmbDA) comprises modular implementations for segmentation (SBS, Blob, OFLS), candidate vector indexing (Mean-Pool, TK-PERT with Faiss), and re-ranking via BiMax and OT (with both greedy and Sinkhorn solvers). EmbDA supports two-stage mining pipelines, logging, and robust evaluation.
A pseudocode sketch for scoring a single document pair:

```
S   = segment(D_s)              # source segments
T   = segment(D_t)              # target segments
E_s = [φ(s) for s in S]         # segment embeddings
E_t = [φ(t) for t in T]
M   = cosine_matrix(E_s, E_t)   # |S| x |T| similarity matrix
f_s = mean(max(M, axis=1))      # MaxSim(D_s -> D_t)
f_t = mean(max(M, axis=0))      # MaxSim(D_t -> D_s)
BiMax = 0.5 * (f_s + f_t)
```
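The sketch maps directly to a runnable form; in this sketch, `emb_s` and `emb_t` are assumed to be precomputed segment-embedding matrices (the embedding model itself is outside the snippet):

```python
import numpy as np

def bimax(emb_s, emb_t):
    """BiMax score for precomputed segment embeddings (m x d and n x d)."""
    # Cosine similarities via L2-normalized dot products
    A = emb_s / np.linalg.norm(emb_s, axis=1, keepdims=True)
    B = emb_t / np.linalg.norm(emb_t, axis=1, keepdims=True)
    M = A @ B.T
    f_s = M.max(axis=1).mean()   # MaxSim(D_s -> D_t)
    f_t = M.max(axis=0).mean()   # MaxSim(D_t -> D_s)
    return 0.5 * (f_s + f_t)
```

Identical embedding sets score 1.0, and fully orthogonal sets score 0.0, matching the cosine-similarity bounds of the definition.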
7. Limitations, Ablations, and Future Directions
While BiMax offers significant computational advantages and strong recall, it is not universally optimal across all document lengths—OT and TK-PERT can outperform on very long or highly diverse documents. Ablation results indicate the necessity of carefully selecting segmentation strategies and embedding backbones tailored to the corpus properties. Hybrid pipelines that dynamically select the alignment metric based on input characteristics are suggested as a promising next step. The comprehensive ablation and benchmarking framework of (Wang et al., 17 Oct 2025) provides a blueprint for such future research.