MBE2.0: Multimodal E-commerce Benchmark
- MBE2.0 is a multimodal benchmark that provides a comprehensive testbed for realistic e-commerce product understanding, addressing issues like modality imbalance and low data quality.
- It employs an MLLM-based co-augmentation pipeline that enriches both textual and visual data, ensuring precise intra-product alignment and diversity in samples.
- The benchmark supports robust evaluation across retrieval, classification, and attribute prediction tasks, driving advancements in e-commerce representation learning.
MBE2.0 (Multimodal Benchmark for E-commerce 2.0) is a large-scale, co-augmented multimodal representation benchmark specifically constructed to advance e-commerce product understanding. Released with the MOON2.0 framework, it serves as both a training and evaluation suite for multimodal models, focusing on rigorous real-world retrieval, classification, and attribute prediction tasks. MBE2.0 directly addresses limitations of prior benchmarks, specifically modality imbalance, neglected intra-product alignment, and insufficient data quality and diversity, through its scale, construction, and MLLM-based co-augmentation pipeline (Nie et al., 16 Nov 2025).
1. Motivation and Scope
MBE2.0 is motivated by the observation that existing benchmarks for e-commerce multimodal representation learning suffer from three main deficits: modality imbalance (artificially fixed image/text ratios in pre-training versus the diverse requirements of downstream retrieval and recognition tasks), under-utilization of intra-product alignment (favoring inter-product contrastive scenarios), and low data quality (noisy, short titles and unvaried, synthetic product imagery). MBE2.0 is designed to provide a realistic and comprehensive testbed that improves robustness and alignment for models exposed to authentic e-commerce scenarios, including noisy metadata and visually diverse samples (Nie et al., 16 Nov 2025).
2. Dataset Structure and Construction
Modalities and Composition
Every data sample in MBE2.0 is a triplet of (query, positive item, negative item), each annotated with:
- Text: An original, terse product title (~6–12 tokens) and, for training, an enriched version (15–25 tokens) generated via MLLM-guided textual expansion.
- Images: The main-subject product image, plus multiple co-augmented visual variants per item.
The dataset consists of 5,751,594 training triplets and a held-out test set of 966,241 triplets, totaling approximately 6.7 million co-augmented multimodal samples. Each element in a triplet (query, positive, negative) provides one text modality and several image modalities (the main image plus its augmented variants) at training time.
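A minimal sketch of how a single MBE2.0 sample could be represented in code is shown below; the class and field names (e.g. `ProductItem`, `enriched_title`, `image_variants`) are illustrative assumptions rather than the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProductItem:
    """One element of an MBE2.0 triplet: a product with text and image modalities."""
    original_title: str                    # terse title, ~6-12 tokens
    enriched_title: Optional[str] = None   # MLLM-expanded title (~15-25 tokens), training only
    main_image: str = ""                   # path/URL of the main-subject product image
    image_variants: List[str] = field(default_factory=list)  # co-augmented visual variants


@dataclass
class MBETriplet:
    """A (query, positive item, negative item) sample as described above."""
    query: ProductItem
    positive: ProductItem
    negative: ProductItem
```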
Collection and Pre-processing
Sampling is performed from e-commerce platform logs (January 2023 to June 2025):
- Positive triplets: Obtained from image- or text-query sessions that led to purchases.
- Negative triplets: Drawn from non-clicked exposures with low predicted relevance.
- Samples are joined on the positive item to form multimodal queries (title + image).
- User identifiers are stripped to retain only product images and text.
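The construction logic can be illustrated with a hedged sketch. The log record fields (`purchased_item_id`, `query_text`, `query_image`, `clicked`, `predicted_relevance`), the join semantics, and the relevance cutoff are hypothetical placeholders, not the platform's actual schema; the paper's exact pairing rules may differ.

```python
def build_triplets(text_sessions, image_sessions, exposure_log, relevance_cutoff=0.2):
    """Illustrative construction: text- and image-query purchase sessions are joined
    on their shared positive (purchased) item, yielding a multimodal (title + image)
    query; negatives come from non-clicked, low-relevance exposures."""
    image_by_item = {s["purchased_item_id"]: s for s in image_sessions}
    triplets = []
    for ts in text_sessions:
        item_id = ts["purchased_item_id"]
        if item_id not in image_by_item:
            continue                                            # join on the positive item
        query = {"text": ts["query_text"],
                 "image": image_by_item[item_id]["query_image"]}
        negatives = [
            e["item_id"] for e in exposure_log.get(item_id, []) # hypothetical keying by positive item
            if not e["clicked"] and e["predicted_relevance"] < relevance_cutoff
        ]
        for neg_id in negatives:
            triplets.append({"query": query, "positive": item_id, "negative": neg_id})
    return triplets
```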
Co-augmentation Pipeline
- Textual Enrichment (MLLM-based): Given the original title, the product description, and the main image, salient entities are extracted; a fine-tuned MLLM then expands the terse title into an enriched title (15–25 tokens) that incorporates these entities.
- Visual Expansion (MLLM-based):
- Clean/Crop: The main subject is isolated and cleaned to produce a canonical subject image.
- Diversification: Multiple variants are generated per item (varying background, angle, and details).
- Quality Filtering: Each generated variant is assigned a quality score; images scoring below a fixed threshold are discarded.
- Dynamic Sample Filtering: During contrastive training, each triplet's query, positive, and negative representations are scored for agreement; that score is passed through a sigmoid to produce a reliability weight, where a slope parameter controls the sigmoid's sharpness and the filtering threshold decays through training. Triplets with low reliability weights receive reduced loss weighting (a minimal sketch follows this list).
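A minimal PyTorch-style sketch of the two filtering steps is given below. The functional form of the reliability weight (a sigmoid over a similarity margin with slope `beta` and threshold `tau`), the hinge loss, and all hyperparameter values are assumptions for illustration; the paper's exact formula and implementation may differ.

```python
import torch
import torch.nn.functional as F


def filter_variants(variant_images, quality_scorer, threshold=0.5):
    """Quality filtering: keep only generated image variants whose score
    (e.g. from an MLLM-based assessor) reaches the threshold."""
    return [img for img in variant_images if quality_scorer(img) >= threshold]


def reliability_weights(q, p, n, beta=10.0, tau=0.1):
    """Dynamic sample filtering (schematic): pass a similarity margin through a
    sigmoid to obtain a per-triplet weight in (0, 1); tau is assumed to decay
    over training so that unreliable triplets are down-weighted.

    q, p, n: (batch, dim) L2-normalized embeddings of query, positive, negative.
    """
    margin = (q * p).sum(-1) - (q * n).sum(-1)       # agreement score per triplet
    return torch.sigmoid(beta * (margin - tau))      # reliability weight per triplet


def weighted_triplet_loss(q, p, n, **kw):
    """Triplet-style contrastive loss scaled by the reliability weights,
    so low-reliability triplets contribute less to the gradient."""
    w = reliability_weights(q, p, n, **kw).detach()  # weights are not back-propagated
    hinge = F.relu(0.2 + (q * n).sum(-1) - (q * p).sum(-1))  # assumed margin of 0.2
    return (w * hinge).mean()
```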
3. Benchmark Tasks and Evaluation Metrics
MBE2.0 defines three primary evaluation axes:
- A. Multimodal Retrieval
Five zero-shot settings:
- t→mm (text → multimodal)
- i→mm (image → multimodal)
- mm→mm (multimodal → multimodal)
- t→i (text → image)
- i→t (image → text)
For each query, candidates are ranked by embedding similarity and the top-k items are retrieved.
Metric: Recall@k, the fraction of queries whose ground-truth item appears among the top-k retrieved candidates (a minimal computation sketch follows the task list).
- B. Product Classification
Assigns each test item to one of K coarse categories. Metrics: accuracy, precision, recall, and F₁-score.
- C. Attribute Prediction
Multi-label attribute tagging, evaluated with the same metrics, macro-averaged per example.
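The retrieval metric can be made concrete with a short sketch. It assumes pre-computed, L2-normalized query and candidate embeddings plus a ground-truth index per query; this is a standard Recall@k formulation, not the benchmark's released evaluation code.

```python
import numpy as np


def recall_at_k(query_emb, cand_emb, gt_index, k=10):
    """Fraction of queries whose ground-truth candidate appears among the top-k
    candidates ranked by dot-product (cosine, if L2-normalized) similarity.

    query_emb: (num_queries, dim) query embeddings
    cand_emb:  (num_candidates, dim) candidate embeddings
    gt_index:  (num_queries,) index of each query's ground-truth candidate
    """
    sims = query_emb @ cand_emb.T                         # (num_queries, num_candidates)
    topk = np.argsort(-sims, axis=1)[:, :k]               # indices of the k most similar candidates
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())

# The same routine covers all five settings (t→mm, i→mm, mm→mm, t→i, i→t);
# only the encoders producing query_emb and cand_emb change.
```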
4. Baseline Performance and Ablations
The main zero-shot results for MOON2.0 and nine baselines are summarized below.
| Method | t→mm R@10 | i→mm R@10 | mm→mm R@10 | t→i R@10 | i→t R@10 | Class Acc | Attr Acc |
|---|---|---|---|---|---|---|---|
| SigLIP2 | 16.23 | 65.86 | 52.17 | 28.10 | 28.64 | 11.18 | 38.91 |
| BGE-VL-Large | 18.92 | 64.84 | 63.33 | 32.22 | 30.24 | 59.36 | 69.61 |
| FashionCLIP | 12.95 | 69.20 | 69.13 | 28.89 | 25.09 | 45.27 | 60.65 |
| InternVL3-2B | 0.61 | 8.63 | 13.45 | 0.64 | 0.76 | 20.23 | 37.31 |
| Qwen2.5-VL-3B | 7.58 | 36.55 | 48.77 | 4.35 | 5.42 | 35.67 | 44.83 |
| GME | 64.41 | 64.98 | 73.90 | 41.77 | 38.26 | 64.92 | 70.76 |
| MM-Embed | 52.83 | 44.48 | 66.67 | 30.11 | 31.56 | 58.37 | 63.98 |
| CASLIE-S | 13.56 | 67.02 | 65.59 | 28.16 | 27.15 | 23.25 | 34.99 |
| MOON | 43.24 | 78.11 | 80.78 | 44.02 | 36.65 | 59.70 | 63.55 |
| MOON2.0 | 63.09 | 91.08 | 94.21 | 73.12 | 64.91 | 68.08 | 84.29 |
Ablation results indicate that removing any single MOON2.0 component (modality-driven Mixture-of-Experts (MoE), dual-level alignment, co-augmentation, or dynamic filtering) incurs a 5–15 point drop in R@10 and a 2–9 point drop in accuracy, suggesting each module is critical for optimal performance. No formal statistical significance tests are reported.
5. Qualitative Analyses and Error Modes
Attention heatmaps in the MOON2.0 paper indicate that, relative to mixed-training baselines, the model suppresses non-informative tokens and background regions while focusing sharply on product-specific textual and visual attributes (e.g., "knitted cardigan", brand names, garment details). The most challenging cases in MBE2.0 involve very fine-grained attribute distinctions (such as "v-neck" vs. "polo-neck"), images with cluttered or creative backgrounds, and product titles where key entities are omitted.
6. Significance and Future Directions
MBE2.0 establishes a new standard for large-scale, co-augmented multimodal benchmarks in e-commerce by providing over 6.7 million samples, task diversity (retrieval, classification, attributes), and standardized metrics (Recall@k, accuracy, precision, recall, F₁-score). Its data construction and augmentation pipeline are specifically targeted at the major shortcomings of prior work. A plausible implication is that similarly scaled, co-augmented benchmarks could extend this approach to other domains where modality balance, alignment, and real-world noise are critical (Nie et al., 16 Nov 2025).