
MBE2.0: Multimodal E-commerce Benchmark

Updated 23 November 2025
  • MBE2.0 is a multimodal benchmark that provides a comprehensive testbed for realistic e-commerce product understanding, addressing issues like modality imbalance and low data quality.
  • It employs an MLLM-based co-augmentation pipeline that enriches both textual and visual data, ensuring precise intra-product alignment and diversity in samples.
  • The benchmark supports robust evaluation across retrieval, classification, and attribute prediction tasks, driving advancements in e-commerce representation learning.

MBE2.0 (Multimodal Benchmark for E-commerce 2.0) is a large-scale, co-augmented multimodal representation benchmark specifically constructed to advance e-commerce product understanding. Released with the MOON2.0 framework, it serves as both a training and evaluation suite for multimodal models, focusing on rigorous real-world retrieval, classification, and attribute prediction tasks. MBE2.0 directly addresses limitations of prior benchmarks, specifically modality imbalance, neglected intra-product alignment, and insufficient data quality and diversity, through its scale, construction, and MLLM-based co-augmentation pipeline (Nie et al., 16 Nov 2025).

1. Motivation and Scope

MBE2.0 is motivated by the observation that existing benchmarks for e-commerce multimodal representation learning suffer from three main deficits: modality imbalance (artificially fixed image/text ratios in pre-training versus the diverse requirements of downstream retrieval and recognition tasks), under-utilization of intra-product alignment (favoring inter-product contrastive scenarios), and low data quality (noisy, short titles and unvaried, synthetic product imagery). MBE2.0 is designed to provide a realistic and comprehensive testbed that improves robustness and alignment for models exposed to authentic e-commerce scenarios, including noisy metadata and visually diverse samples (Nie et al., 16 Nov 2025).

2. Dataset Structure and Construction

Modalities and Composition

Every data sample in MBE2.0 is a triplet of (query, positive item, negative item), each annotated with:

  • Text: An original, terse product title (~6–12 tokens) plus, for training, an enriched version (15–25 tokens) generated via MLLM-guided textual expansion.
  • Images: The main-subject product image for each item, plus multiple co-augmented visual variants ($n \approx 4\text{--}8$ per item in practice).

The dataset consists of 5,751,594 training triplets and a held-out test set of 966,241 triplets, totaling approximately 6.7 million co-augmented multimodal samples. Each element in a triplet (query, positive, negative) provides one text modality and $1+n$ image modalities at training time.
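A minimal sketch of how a single MBE2.0 triplet could be represented in Python; the class and field names below are illustrative assumptions rather than the released data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    """One product in a triplet: original/enriched text plus main image and variants."""
    title: str                     # terse original title (~6-12 tokens)
    enriched_title: str            # MLLM-expanded title (15-25 tokens, training only)
    main_image: str                # path/URI of the main-subject image
    image_variants: List[str] = field(default_factory=list)  # n ~ 4-8 co-augmented views

@dataclass
class Triplet:
    """One MBE2.0 sample: (query, positive item, negative item)."""
    query: Item
    positive: Item
    negative: Item
```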

Collection and Pre-processing

Sampling is performed from e-commerce platform logs (January 2023 to June 2025):

  • Positive triplets: Obtained from image- or text-query sessions that led to purchases.
  • Negative triplets: Drawn from non-clicked exposures with low predicted relevance.
  • Samples are joined on the positive item to form multimodal queries (title + image).
  • User identifiers are stripped to retain only product images and text.
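The sampling rules above can be condensed into a short, hedged sketch; the session fields (`purchased`, `clicked`, `predicted_relevance`) and the relevance threshold are hypothetical stand-ins, since the paper does not specify the log schema.

```python
def strip_user_ids(item):
    """Drop user identifiers, keeping only product text and image fields."""
    return {k: v for k, v in item.items() if k not in {"user_id", "session_user"}}

def build_triplets(sessions, relevance_threshold=0.1):
    """Assemble (query, positive, negative) triplets from session logs.

    Each session is assumed to be a dict with keys 'item' (holding 'title' and
    'image'), 'purchased', 'clicked', and 'predicted_relevance' -- an
    illustrative schema, not the released one.
    """
    positives = [s for s in sessions if s["purchased"]]        # purchase-backed positives
    negatives = [s for s in sessions
                 if not s["clicked"] and s["predicted_relevance"] < relevance_threshold]

    triplets = []
    for pos, neg in zip(positives, negatives):
        pos_item = strip_user_ids(pos["item"])
        neg_item = strip_user_ids(neg["item"])
        triplets.append({
            # multimodal query formed by joining on the positive item (title + image)
            "query": {"text": pos_item["title"], "image": pos_item["image"]},
            "positive": pos_item,
            "negative": neg_item,
        })
    return triplets
```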

Co-augmentation Pipeline

  • Textual Enrichment (MLLM-based): Given the original title $T$, description $D$, and image $I$, entities $E=\{e_1,\ldots,e_m\}$ are extracted; a fine-tuned MLLM expands $T$ into $T^+$:

$$T^+ = \text{MLLM}_{\text{text}}(T, I, E).$$

  • Visual Expansion (MLLM-based):
    • Clean/Crop: Cleaned subject image $I^m=\text{MLLM}_{\text{clean}}(I)$.
    • Diversification: $n$ variants $I_k^c = \text{MLLM}_{\text{edit}}(I^m, T, \text{prompt}_k)$ are generated, varying background, angle, and details.
    • Quality Filtering: Each $I_k^c$ is scored $s_k = \text{CLIP}(I_k^c, T)$; images with $s_k < \tau_{\text{clip}}$ are discarded.
  • Dynamic Sample Filtering: During contrastive training, each triplet (with representations $r_q, r_p, r_n$) receives a reliability weight

$$\varphi = \sigma\left[\alpha\left((r_q \cdot r_p)-(r_q \cdot r_n)-\overline{\Delta}\right)\right],$$

where $\sigma$ is the sigmoid, $\alpha$ controls the slope, and $\overline{\Delta}$ is a margin that decays through training. Triplets with $\varphi<\delta$ ($\delta=0.6$) receive reduced loss weighting.
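A minimal PyTorch-style sketch of the dynamic sample filtering step, assuming L2-normalized embeddings; the values of `alpha`, the margin `delta_bar`, and the down-weighting factor applied below the threshold $\delta = 0.6$ are assumptions, as only the functional form and the threshold are given.

```python
import torch

def reliability_weight(r_q, r_p, r_n, alpha=5.0, delta_bar=0.2):
    """phi = sigmoid(alpha * ((r_q . r_p) - (r_q . r_n) - delta_bar)).

    r_q, r_p, r_n: (batch, dim) embedding tensors, assumed L2-normalized.
    alpha controls the slope; delta_bar is the margin that decays during training.
    """
    pos_sim = (r_q * r_p).sum(dim=-1)
    neg_sim = (r_q * r_n).sum(dim=-1)
    return torch.sigmoid(alpha * (pos_sim - neg_sim - delta_bar))

def weighted_triplet_loss(per_triplet_loss, phi, delta=0.6, down_weight=0.1):
    """Down-weight the loss of triplets whose reliability phi falls below delta.

    The 0.1 down-weighting factor is an illustrative assumption; the source only
    states that low-phi triplets receive reduced loss weighting.
    """
    weights = torch.where(phi < delta,
                          torch.full_like(phi, down_weight),
                          torch.ones_like(phi))
    return (weights * per_triplet_loss).mean()
```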

3. Benchmark Tasks and Evaluation Metrics

MBE2.0 defines three primary evaluation axes:

  • A. Multimodal Retrieval: Five zero-shot settings:
    1. $q^t \rightarrow c^{mm}$ (text → multimodal)
    2. $q^i \rightarrow c^{mm}$ (image → multimodal)
    3. $q^{mm} \rightarrow c^{mm}$ (multimodal → multimodal)
    4. $q^t \rightarrow c^i$ (text → image)
    5. $q^i \rightarrow c^t$ (image → text)

For query $q \in \{t, i, mm\}$ and candidate set $C$, retrieve

$$\hat{c} = \arg\max_{c \in C} \text{sim}(F(q), F(c)).$$

Metric: Recall@k (a computation sketch follows this list):

$$R@k = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\left[\text{gt}_i \in \text{top-}k(\text{pred}_i)\right]$$

  • B. Product Classification: Assigns each test item to one of $K$ coarse categories. Metrics:

$$\text{Precision} = \frac{TP}{TP+FP},\quad \text{Recall} = \frac{TP}{TP+FN},\quad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

  • C. Attribute Prediction: Multi-label tagging using the above metrics, macro-averaged per example.
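A small NumPy sketch of the Recall@k and precision/recall/F1 computations defined above; the cosine-similarity ranking and the raw-count interface are generic assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def recall_at_k(query_emb, cand_emb, gt_index, k=10):
    """Recall@k: fraction of queries whose ground-truth item appears in the top-k candidates.

    query_emb: (N, d), cand_emb: (M, d), gt_index: (N,) ground-truth candidate index per query.
    Embeddings are assumed L2-normalized so the dot product equals cosine similarity.
    """
    sims = query_emb @ cand_emb.T                    # (N, M) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k most similar candidates
    hits = (topk == gt_index[:, None]).any(axis=1)
    return hits.mean()

def precision_recall_f1(tp, fp, fn):
    """Classification / attribute metrics from raw true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```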

4. Baseline Performance and Ablations

The main zero-shot results for MOON2.0 and nine baselines are summarized below.

| Method | t→mm R@10 | i→mm R@10 | mm→mm R@10 | t→i R@10 | i→t R@10 | Class Acc | Attr Acc |
|---|---|---|---|---|---|---|---|
| SigLIP2 | 16.23 | 65.86 | 52.17 | 28.10 | 28.64 | 11.18 | 38.91 |
| BGE-VL-Large | 18.92 | 64.84 | 63.33 | 32.22 | 30.24 | 59.36 | 69.61 |
| FashionCLIP | 12.95 | 69.20 | 69.13 | 28.89 | 25.09 | 45.27 | 60.65 |
| InternVL3-2B | 0.61 | 8.63 | 13.45 | 0.64 | 0.76 | 20.23 | 37.31 |
| Qwen2.5-VL-3B | 7.58 | 36.55 | 48.77 | 4.35 | 5.42 | 35.67 | 44.83 |
| GME | 64.41 | 64.98 | 73.90 | 41.77 | 38.26 | 64.92 | 70.76 |
| MM-Embed | 52.83 | 44.48 | 66.67 | 30.11 | 31.56 | 58.37 | 63.98 |
| CASLIE-S | 13.56 | 67.02 | 65.59 | 28.16 | 27.15 | 23.25 | 34.99 |
| MOON | 43.24 | 78.11 | 80.78 | 44.02 | 36.65 | 59.70 | 63.55 |
| MOON2.0 | 63.09 | 91.08 | 94.21 | 73.12 | 64.91 | 68.08 | 84.29 |

Ablation results indicate that removing any single MOON2.0 component (modality-driven Mixture-of-Experts, dual-level alignment, co-augmentation, or dynamic filtering) incurs a 5–15 point drop in R@10 and a 2–9 point drop in accuracy, suggesting each module is critical for optimal performance. No formal statistical significance tests are reported.

5. Qualitative Analyses and Error Modes

Attention heatmaps in the MOON2.0 paper indicate that, relative to mixed-training baselines, the model suppresses non-informative tokens and background regions while focusing sharply on product-specific textual and visual attributes (e.g., "knitted cardigan", brand names, garment details). The most challenging cases in MBE2.0 involve very fine-grained attribute distinctions (such as "v-neck" vs. "polo-neck"), images with cluttered or creative backgrounds, and product titles where key entities are omitted.

6. Significance and Future Directions

MBE2.0 establishes a new standard for large-scale, co-augmented multimodal benchmarks in e-commerce by providing over 6.7 million samples, task diversity (retrieval, classification, attributes), and standardized metrics (Recall@k, accuracy, precision, recall, F₁-score). Its data construction and augmentation pipeline are specifically targeted at the major shortcomings of prior work. A plausible implication is that benchmarks with similar scale and augmentation strategies can generalize this approach to other domains where modality balance, alignment, and real-world noise are critical (Nie et al., 16 Nov 2025).
