
MBE2.0: Multimodal E-commerce Benchmark

Updated 23 November 2025
  • MBE2.0 is a multimodal benchmark that provides a comprehensive testbed for realistic e-commerce product understanding, addressing issues like modality imbalance and low data quality.
  • It employs an MLLM-based co-augmentation pipeline that enriches both textual and visual data, ensuring precise intra-product alignment and diversity in samples.
  • The benchmark supports robust evaluation across retrieval, classification, and attribute prediction tasks, driving advancements in e-commerce representation learning.

MBE2.0 (Multimodal Benchmark for E-commerce 2.0) is a large-scale, co-augmented multimodal representation benchmark specifically constructed to advance e-commerce product understanding. Released with the MOON2.0 framework, it serves as both a training and evaluation suite for multimodal models, focusing on rigorous real-world retrieval, classification, and attribute prediction tasks. MBE2.0 directly addresses limitations of prior benchmarks, specifically modality imbalance, neglected intra-product alignment, and insufficient data quality and diversity, through its scale, construction, and MLLM-based co-augmentation pipeline (Nie et al., 16 Nov 2025).

1. Motivation and Scope

MBE2.0 is motivated by the observation that existing benchmarks for e-commerce multimodal representation learning suffer from three main deficits: modality imbalance (artificially fixed image/text ratios in pre-training versus the diverse requirements of downstream retrieval and recognition tasks), under-utilization of intra-product alignment (favoring inter-product contrastive scenarios), and low data quality (noisy, short titles and unvaried, synthetic product imagery). MBE2.0 is designed to provide a realistic and comprehensive testbed that improves robustness and alignment for models exposed to authentic e-commerce scenarios, including noisy metadata and visually diverse samples (Nie et al., 16 Nov 2025).

2. Dataset Structure and Construction

Modalities and Composition

Every data sample in MBE2.0 is a triplet of (query, positive item, negative item), each annotated with:

  • Text: An original, terse product title (~6–12 tokens) plus, for training, an enriched version (15–25 tokens) generated via MLLM-guided textual expansion.
  • Images: The main-subject product image for each item, plus multiple co-augmented visual variants ($n \approx 4\text{--}8$ per item in practice).

The dataset consists of 5,751,594 training triplets and a held-out test set of 966,241 triplets, totaling approximately 6.7 million co-augmented multimodal samples. Each element in a triplet (query, positive, negative) provides one text modality and $1+n$ image modalities at training time.
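A minimal sketch of how a single MBE2.0 triplet could be represented in Python; the class and field names below are illustrative assumptions rather than the released data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:
    """One product in a triplet: original/enriched text plus main image and variants."""
    title: str                     # terse original title (~6-12 tokens)
    enriched_title: str            # MLLM-expanded title (15-25 tokens, training only)
    main_image: str                # path/URI of the main-subject image
    image_variants: List[str] = field(default_factory=list)  # n ~ 4-8 co-augmented views

@dataclass
class Triplet:
    """One MBE2.0 sample: (query, positive item, negative item)."""
    query: Item
    positive: Item
    negative: Item
```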

Collection and Pre-processing

Sampling is performed from e-commerce platform logs (January 2023 to June 2025):

  • Positive triplets: Obtained from image- or text-query sessions that led to purchases.
  • Negative triplets: Drawn from non-clicked exposures with low predicted relevance.
  • Samples are joined on the positive item to form multimodal queries (title + image).
  • User identifiers are stripped to retain only product images and text.
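The sampling rules above can be condensed into a short, hedged sketch; the session fields (`purchased`, `clicked`, `predicted_relevance`) and the relevance threshold are hypothetical stand-ins, since the paper does not specify the log schema.

```python
def strip_user_ids(item):
    """Drop user identifiers, keeping only product text and image fields."""
    return {k: v for k, v in item.items() if k not in {"user_id", "session_user"}}

def build_triplets(sessions, relevance_threshold=0.1):
    """Assemble (query, positive, negative) triplets from session logs.

    Each session is assumed to be a dict with keys 'item' (holding 'title' and
    'image'), 'purchased', 'clicked', and 'predicted_relevance' -- an
    illustrative schema, not the released one.
    """
    positives = [s for s in sessions if s["purchased"]]        # purchase-backed positives
    negatives = [s for s in sessions
                 if not s["clicked"] and s["predicted_relevance"] < relevance_threshold]

    triplets = []
    for pos, neg in zip(positives, negatives):
        pos_item = strip_user_ids(pos["item"])
        neg_item = strip_user_ids(neg["item"])
        triplets.append({
            # multimodal query formed by joining on the positive item (title + image)
            "query": {"text": pos_item["title"], "image": pos_item["image"]},
            "positive": pos_item,
            "negative": neg_item,
        })
    return triplets
```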

Co-augmentation Pipeline

  • Textual Enrichment (MLLM-based): Given the original title $T$, description $D$, and image $I$, entities $E=\{e_1,\ldots,e_m\}$ are extracted; a fine-tuned MLLM expands $T$ into $T^+$:

$$T^+ = \text{MLLM}_{\text{text}}(T, I, E).$$

  • Visual Expansion (MLLM-based):
    • Clean/Crop: Cleaned subject image $I^m=\text{MLLM}_{\text{clean}}(I)$.
    • Diversification: $n$ variants $I_k^c = \text{MLLM}_{\text{edit}}(I^m, T, \text{prompt}_k)$ are generated, varying background, angle, and details.
    • Quality Filtering: Each $I_k^c$ is scored $s_k = \text{CLIP}(I_k^c, T)$; images with $s_k < \tau_{\text{clip}}$ are discarded.
  • Dynamic Sample Filtering: During contrastive training, each triplet (with representations $r_q, r_p, r_n$) receives a reliability weight

$$\varphi = \sigma\left[\alpha\left((r_q \cdot r_p)-(r_q \cdot r_n)-\overline{\Delta}\right)\right],$$

where $\sigma$ is the sigmoid, $\alpha$ controls the slope, and $\overline{\Delta}$ is a margin that decays through training. Triplets with $\varphi<\delta$ ($\delta=0.6$) receive reduced loss weighting.
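A minimal PyTorch-style sketch of the dynamic sample filtering step, assuming L2-normalized embeddings; the values of `alpha`, the margin `delta_bar`, and the down-weighting factor applied below the threshold $\delta = 0.6$ are assumptions, as only the functional form and the threshold are given.

```python
import torch

def reliability_weight(r_q, r_p, r_n, alpha=5.0, delta_bar=0.2):
    """phi = sigmoid(alpha * ((r_q . r_p) - (r_q . r_n) - delta_bar)).

    r_q, r_p, r_n: (batch, dim) embedding tensors, assumed L2-normalized.
    alpha controls the slope; delta_bar is the margin that decays during training.
    """
    pos_sim = (r_q * r_p).sum(dim=-1)
    neg_sim = (r_q * r_n).sum(dim=-1)
    return torch.sigmoid(alpha * (pos_sim - neg_sim - delta_bar))

def weighted_triplet_loss(per_triplet_loss, phi, delta=0.6, down_weight=0.1):
    """Down-weight the loss of triplets whose reliability phi falls below delta.

    The 0.1 down-weighting factor is an illustrative assumption; the source only
    states that low-phi triplets receive reduced loss weighting.
    """
    weights = torch.where(phi < delta,
                          torch.full_like(phi, down_weight),
                          torch.ones_like(phi))
    return (weights * per_triplet_loss).mean()
```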

3. Benchmark Tasks and Evaluation Metrics

MBE2.0 defines three primary evaluation axes:

  • A. Multimodal Retrieval: Five zero-shot settings:
    1. $q^t \rightarrow c^{mm}$ (text → multimodal)
    2. $q^i \rightarrow c^{mm}$ (image → multimodal)
    3. $q^{mm} \rightarrow c^{mm}$ (multimodal → multimodal)
    4. $q^t \rightarrow c^i$ (text → image)
    5. $q^i \rightarrow c^t$ (image → text)

For query $q \in \{t, i, mm\}$ and candidate set $C$, retrieve

$$\hat{c} = \arg\max_{c \in C} \text{sim}(F(q), F(c)).$$

Metric: Recall@k (a computation sketch follows this list):

$$R@k = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\left[\text{gt}_i \in \text{top-}k(\text{pred}_i)\right]$$

  • B. Product Classification: Assigns each test item to one of $K$ coarse categories. Metrics:

$$\text{Precision} = \frac{TP}{TP+FP},\quad \text{Recall} = \frac{TP}{TP+FN},\quad F_1 = \frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$$

  • C. Attribute Prediction: Multi-label tagging using the above metrics, macro-averaged per example.
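A small NumPy sketch of the Recall@k and precision/recall/F1 computations defined above; the cosine-similarity ranking and the raw-count interface are generic assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def recall_at_k(query_emb, cand_emb, gt_index, k=10):
    """Recall@k: fraction of queries whose ground-truth item appears in the top-k candidates.

    query_emb: (N, d), cand_emb: (M, d), gt_index: (N,) ground-truth candidate index per query.
    Embeddings are assumed L2-normalized so the dot product equals cosine similarity.
    """
    sims = query_emb @ cand_emb.T                    # (N, M) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]          # indices of the k most similar candidates
    hits = (topk == gt_index[:, None]).any(axis=1)
    return hits.mean()

def precision_recall_f1(tp, fp, fn):
    """Classification / attribute metrics from raw true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```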

4. Baseline Performance and Ablations

The main zero-shot results for MOON2.0 and nine baselines are summarized below.

| Method | t→mm R@10 | i→mm R@10 | mm→mm R@10 | t→i R@10 | i→t R@10 | Class Acc | Attr Acc |
|---|---|---|---|---|---|---|---|
| SigLIP2 | 16.23 | 65.86 | 52.17 | 28.10 | 28.64 | 11.18 | 38.91 |
| BGE-VL-Large | 18.92 | 64.84 | 63.33 | 32.22 | 30.24 | 59.36 | 69.61 |
| FashionCLIP | 12.95 | 69.20 | 69.13 | 28.89 | 25.09 | 45.27 | 60.65 |
| InternVL3-2B | 0.61 | 8.63 | 13.45 | 0.64 | 0.76 | 20.23 | 37.31 |
| Qwen2.5-VL-3B | 7.58 | 36.55 | 48.77 | 4.35 | 5.42 | 35.67 | 44.83 |
| GME | 64.41 | 64.98 | 73.90 | 41.77 | 38.26 | 64.92 | 70.76 |
| MM-Embed | 52.83 | 44.48 | 66.67 | 30.11 | 31.56 | 58.37 | 63.98 |
| CASLIE-S | 13.56 | 67.02 | 65.59 | 28.16 | 27.15 | 23.25 | 34.99 |
| MOON | 43.24 | 78.11 | 80.78 | 44.02 | 36.65 | 59.70 | 63.55 |
| MOON2.0 | 63.09 | 91.08 | 94.21 | 73.12 | 64.91 | 68.08 | 84.29 |

Ablation results indicate that removing any single MOON2.0 component (modality-driven Mixture-of-Experts, dual-level alignment, co-augmentation, or dynamic filtering) incurs a 5–15 point drop in R@10 and a 2–9 point drop in accuracy, suggesting each module is critical for optimal performance. No formal statistical significance tests are reported.

5. Qualitative Analyses and Error Modes

Attention heatmaps in the MOON2.0 paper indicate that, relative to mixed-training baselines, the model suppresses non-informative tokens and background regions while focusing sharply on product-specific textual and visual attributes (e.g., "knitted cardigan", brand names, garment details). The most challenging cases in MBE2.0 involve very fine-grained attribute distinctions (such as "v-neck" vs. "polo-neck"), images with cluttered or creative backgrounds, and product titles where key entities are omitted.

6. Significance and Future Directions

MBE2.0 establishes a new standard for large-scale, co-augmented multimodal benchmarks in e-commerce by providing over 6.7 million samples, task diversity (retrieval, classification, attributes), and standardized metrics (Recall@k, accuracy, precision, recall, F₁-score). Its data construction and augmentation pipeline are specifically targeted at the major shortcomings of prior work. A plausible implication is that benchmarks with similar scale and augmentation strategies can generalize this approach to other domains where modality balance, alignment, and real-world noise are critical (Nie et al., 16 Nov 2025).
