Amazon Co-purchase Multimodal Graph Dataset
- The Amazon Product Co-purchase Multimodal Graph Dataset is a collection of benchmark graphs where each node represents an Amazon product with complete text and high-resolution image attributes.
- It supports tasks like link prediction, node classification, and cross-modal retrieval by encoding co-purchase and co-view relationships from curated Amazon metadata.
- The dataset employs rigorous filtering and preprocessing methods to ensure high-quality, multimodal features derived from product titles, descriptions, and images.
The Amazon Product Co-purchase Multimodal Graph Dataset defines a suite of benchmark graphs in which each node represents an Amazon product, edges encode co-purchase or co-view relations, and each node is annotated with rich multimodal attributes—most prominently, text (titles, descriptions) and high-resolution images. This family of datasets underpins recent advances in multimodal graph learning, supporting studies in domains such as recommendation, product retrieval, link prediction, and category classification. Notably, all contemporary instantiations of these datasets ensure that every retained node is associated with a complete set of modalities, permitting rigorous evaluation and development of graph neural networks, multimodal representation learning algorithms, and structure-aware multimodal pretraining frameworks.
1. Construction Principles and Data Sources
The foundational construction of the Amazon Product Co-purchase Multimodal Graph Dataset involves extracting and filtering product nodes, processing edge relations, and ensuring modality completeness (a minimal code sketch of this pipeline follows the list):
- Graph Origin: Most variants rely on Amazon's publicly released metadata, review, and purchase-event corpora (notably the 2018 and 2024 versions curated by McAuley et al. and Hou et al.).
- Node Set: Each node represents a unique Amazon product within a defined domain or category. Only products with valid English titles, descriptions, and at least one high-resolution image are retained; products lacking any modality are discarded prior to graph induction (Lu, 4 Nov 2025).
- Edge Definition: Edges represent co-purchase relationships, derived either from explicit “also_buy”/“also_view” fields (Yan et al., 11 Oct 2024) or by identifying pairs of products purchased together by at least one user (commonly ≥3) (Liu et al., 2020, Lu, 4 Nov 2025). Edge filtering includes:
- Removal of self-loops and isolated nodes (Yan et al., 11 Oct 2024).
- Minimum edge frequency thresholds (e.g., only retaining pairs co-purchased by ≥3 users) (Liu et al., 2020, Lu, 4 Nov 2025).
- Optional k-core decomposition to remove low-degree nodes (k=5 in (Lu, 4 Nov 2025)).
- Category-Specific Construction: Datasets are often organized by Amazon’s taxonomy (e.g., “Electronics,” “Movies & TV”), permitting both within-category and cross-category analysis (Lu, 4 Nov 2025).
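As a concrete illustration, the following is a minimal sketch of the filtering pipeline using networkx; the metadata field names (`title`, `description`, `image_url`) and the function name are illustrative, not taken from any cited release.

```python
import networkx as nx

def build_product_graph(products, copurchase_pairs, k_core=5):
    """Sketch of the node/edge filtering pipeline described above.

    products: dict mapping product ID -> metadata dict (field names are
    illustrative); copurchase_pairs: iterable of (u, v) pairs, e.g.
    harvested from "also_buy"/"also_view" fields.
    """
    # Retain only modality-complete nodes: valid title, description, image.
    nodes = {pid for pid, meta in products.items()
             if meta.get("title") and meta.get("description")
             and meta.get("image_url")}

    G = nx.Graph()
    G.add_nodes_from(nodes)
    # Keep edges between retained nodes; drop self-loops.
    G.add_edges_from((u, v) for u, v in copurchase_pairs
                     if u in nodes and v in nodes and u != v)

    # Drop isolated nodes, then optionally prune low-degree nodes (k-core).
    G.remove_nodes_from(list(nx.isolates(G)))
    if k_core:
        G = nx.k_core(G, k=k_core)
    return G
```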
2. Dataset Statistics and Coverage
Published datasets exhibit substantial diversity in scale, coverage, and modality completeness:
| Dataset Source | Nodes (|V|) | Edges (|E|) | Categories | Modalities | Tasks |
|---|---|---|---|---|---|
| MM-GRAPH Sports (Zhu et al., 24 Jun 2024) | 50,250 | 356,202 | None | Title, Image | LP only |
| MM-GRAPH Cloth (Zhu et al., 24 Jun 2024) | 125,839 | 951,271 | None | Title, Image | LP only |
| MAGB Movies (Yan et al., 11 Oct 2024) | 16,672 | 218,390 | 20 | Title+Desc, Image | Node Cls |
| MAGB Toys (Yan et al., 11 Oct 2024) | 20,695 | 126,886 | 18 | Title+Desc, Image | Node Cls |
| MAGB Grocery (Yan et al., 11 Oct 2024) | 84,379 | 693,154 | 20 | Title+Desc, Image | Node Cls |
| SLIP Electronics (Lu, 4 Nov 2025) | 98,000 | 2,015,000 | 1 | Title, Image | Retrieval/Cls |
| PMGT VG (Liu et al., 2020) | 5,032 | 83,981 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Toys (Liu et al., 2020) | 17,388 | 232,720 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Tools (Liu et al., 2020) | 15,619 | 178,834 | N/A | Description, Image | Rec, Cls, CTR |
All major releases report 100% multimodal node coverage after filtering, with no missing text or images in final node sets (Yan et al., 11 Oct 2024, Lu, 4 Nov 2025).
3. Modality Extraction and Feature Representation
Amazon Product Co-purchase Multimodal Graphs consistently provide two primary node modalities:
- Textual Features:
- Extraction: Product titles, descriptions, or concatenated segments are lowercased, tokenized, and truncated (usually to 128–256 tokens) (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024, Liu et al., 2020).
- Encoding: Text is embedded via pretrained language models such as CLIP-text (d=512), T5-Base (d=768), all-MiniLM-L12-v2 (d=384), BERT-base (d=768), or Llama-2-family LMs (Lu, 4 Nov 2025, Zhu et al., 24 Jun 2024).
- Pooling: For long-form inputs, mean pooling is performed over the last-layer token embeddings (Yan et al., 11 Oct 2024).
- Normalization: Extracted feature vectors are typically ℓ₂-normalized (Zhu et al., 24 Jun 2024, Lu, 4 Nov 2025).
- Visual Features:
- Extraction: Primary item images (preferentially high-resolution) are center-cropped/resized (224×224 for ViT/CLIP, 299×299 for Inception-v4), normalized to model-specific channel means/stds (Lu, 4 Nov 2025, Liu et al., 2020).
- Encoding: Visual embeddings are generated using CLIP-ViT (d=512), ViT-Base (d=768), Inception-v4 (d=1536), or other PVMs (Yan et al., 11 Oct 2024, Liu et al., 2020).
- Pooling: Features from multiple images are averaged if present (Liu et al., 2020).
- Normalization: Outputs are ℓ₂-normalized.
- Combined Feature Construction:
- Frequently, text and image features are concatenated row-wise, yielding $x_v = [\, x_v^{\mathrm{text}} \,\|\, x_v^{\mathrm{img}} \,] \in \mathbb{R}^{d_t + d_v}$ (Zhu et al., 24 Jun 2024).
- Some methods apply further fusion via attention or shallow MLPs before input to downstream GNNs (Liu et al., 2020).
While the majority of implementations extract features via frozen encoders, several frameworks (e.g., PMGT, SLIP) employ linear projections or graph neural network layers for further feature transformation and fusion (Liu et al., 2020, Lu, 4 Nov 2025).
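For instance, a minimal frozen-encoder sketch using the Hugging Face transformers CLIP API (the checkpoint name is one common public choice; the helper function is illustrative):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # d=512 per modality
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def encode_product(title: str, image_path: str) -> torch.Tensor:
    """Return a concatenated, ℓ2-normalized [text ‖ image] node feature."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[title], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # ℓ2-normalize each modality, then concatenate row-wise (Section 3).
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    return torch.cat([text_emb, img_emb], dim=-1).squeeze(0)  # shape (1024,)
```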
4. Graph Construction and Edge Semantics
Edge construction and weighting strategies closely track the co-purchase semantics characteristic of e-commerce graphs:
- Edge Inclusion:
- Edges are included between items co-purchased or co-viewed by at least one (commonly ≥3) distinct user, or between items that appear in each other’s “also_buy”/“also_view” lists (Yan et al., 11 Oct 2024, Liu et al., 2020).
- In a more restrictive regime, k-core pruning (with k=5) removes sparsely connected nodes (Lu, 4 Nov 2025).
- Edge Weights:
- Most datasets use unweighted, undirected edges, i.e., $A_{uv} = A_{vu} \in \{0, 1\}$ (Zhu et al., 24 Jun 2024).
- Weighted graphs are constructed in PMGT using $w_{ij} = \frac{c_{ij}}{\sqrt{d_i\, d_j}}$, where $c_{ij}$ is the number of shared purchasers and $d_i$, $d_j$ the node degrees (Liu et al., 2020); a code sketch of this weighting follows this section.
- Adjacency Matrix:
- Provided as sparse binary or real-valued matrices, with normalization (e.g., symmetric normalization $D^{-1/2} A D^{-1/2}$) deferred to downstream message-passing architectures (Zhu et al., 24 Jun 2024).
- Train/Validation/Test Splits:
- Edge-based splits (e.g., 80/10/10 LP splits with negatives (Zhu et al., 24 Jun 2024)) or node-based splits for node classification (e.g., 60/20/20 in MAGB) (Yan et al., 11 Oct 2024).
A plausible implication is that these graph construction and edge semantics conventions, especially edge weighting and hard negative sampling, are tailored to encourage strong signal propagation and realistic evaluation of link and neighbor prediction tasks.
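A minimal sketch of the thresholding, PMGT-style weighting, and symmetric normalization described above, assuming basket-level purchase lists are available (all function names are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from collections import defaultdict
from itertools import combinations

def count_copurchases(baskets, min_count=3):
    """Count co-purchase pairs over per-user baskets and keep pairs bought
    together by at least `min_count` users (the common >=3 threshold)."""
    counts = defaultdict(int)
    for items in baskets:
        for i, j in combinations(sorted(set(items)), 2):
            counts[(i, j)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

def pmgt_style_weights(pair_counts):
    """Degree-normalized weights w_ij = c_ij / sqrt(d_i * d_j), matching
    the PMGT-style weighting reconstructed above."""
    degree = defaultdict(int)
    for i, j in pair_counts:
        degree[i] += 1
        degree[j] += 1
    return {(i, j): c / np.sqrt(degree[i] * degree[j])
            for (i, j), c in pair_counts.items()}

def sym_normalize(adj: sp.csr_matrix) -> sp.csr_matrix:
    """Symmetric normalization D^{-1/2} A D^{-1/2} for message passing."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    return sp.diags(d_inv_sqrt) @ adj @ sp.diags(d_inv_sqrt)
```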
5. Downstream Tasks, Evaluation, and Benchmarks
Depending on graph instance and benchmark protocol, the following tasks are supported:
- Link Prediction (LP):
- Setup: Predict the existence of edges among held-out pairs. MM-GRAPH splits edges 80/10/10 (train/val/test), with 150 hard negatives per positive edge sampled via the HeaRT algorithm (Zhu et al., 24 Jun 2024).
- Metrics: Mean Reciprocal Rank (MRR), Hits@1, Hits@10.
- Node Classification:
- Many variants (notably the MAGB graphs) provide category labels over a moderate number of classes (C=18–20) (Yan et al., 11 Oct 2024).
- Metrics: Standard accuracy and macro-F1 across classes.
- Cross-Modal Retrieval:
- Employed for frameworks like SLIP, evaluating Retrieval@K and MRR in both directions (Image→Text, Text→Image) (Lu, 4 Nov 2025).
- SLIP’s Electronics subset, for example, yields I2T MRR=0.585 (CLIP baseline 0.520) and T2I MRR=0.582 (CLIP baseline 0.517), with R@1 improvement from 0.403 (CLIP) to 0.478 (SLIP).
- Recommendation and CTR Prediction:
- PMGT supports fine-grained click-through-rate prediction, recasting the graph into positive and negative item pairs for supervised learning (Liu et al., 2020).
- Feature Reconstruction and Contrastive Pretraining:
- PMGT leverages masked node feature reconstruction across multimodal neighborhoods; SLIP employs a structural contrastive loss to encourage both modality and relational alignment, with the total loss $\mathcal{L} = \mathcal{L}_{\mathrm{CLIP}} + \lambda\, \mathcal{L}_{\mathrm{struct}}$, where $\mathcal{L}_{\mathrm{struct}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(z_i^{\top} z_j / \tau)}{\sum_{k \neq i} \exp(z_i^{\top} z_k / \tau)}$, and $\mathcal{P}$ is the positive mask for pairs within $h$ hops in the batch subgraph (Lu, 4 Nov 2025).
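A PyTorch sketch of one such structural term, under the InfoNCE-style reading of the loss above (the exact SLIP formulation may differ in detail; `pos_mask` must exclude the diagonal):

```python
import torch

def structural_contrastive_loss(z: torch.Tensor, pos_mask: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style structural term over a batch subgraph.

    z: (B, d) ℓ2-normalized node embeddings.
    pos_mask: (B, B) boolean, True where nodes i, j lie within h hops;
              must be False on the diagonal.
    """
    sim = z @ z.t() / tau                          # (B, B) similarity logits
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    has_pos = pos_mask.any(dim=1)
    return per_anchor[has_pos].mean()

# Total objective as sketched above: loss = clip_loss + lam * structural term.
```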
6. Preprocessing, Negative Sampling, and Data Augmentation
Standardized preprocessing steps and negative sampling methodologies enhance reproducibility and strict test isolation:
- Text: Lowercased, tokenized; often truncated to fixed length; processed with model-specific tokenization pipelines (Zhu et al., 24 Jun 2024, Liu et al., 2020, Lu, 4 Nov 2025).
- Images: Resized to fixed spatial resolution; normalized per encoder’s requirements (ImageNet, CLIP stats) (Lu, 4 Nov 2025).
- Feature Normalization: ℓ₂-normalization post-encoding is standard (Zhu et al., 24 Jun 2024, Lu, 4 Nov 2025).
- Negative Sampling: For LP and retrieval, hard negative edges/pairs are generated via specialized samplers (e.g., HeaRT for MM-GRAPH’s LP task, 150 negatives per positive edge) (Zhu et al., 24 Jun 2024); a ranking sketch using such negatives follows this list.
- Graph Augmentation: No edge dropping or node masking at data-load time; augmentation is left to modeling frameworks if desired (Zhu et al., 24 Jun 2024, Liu et al., 2020).
- Masking for Pretraining: In PMGT, 20% of sampled contextual neighbors are masked as part of the masked node feature reconstruction objective, with replacements and masking decisions inspired by BERT’s pretraining protocol (Liu et al., 2020).
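As an illustration of how these sampled negatives are consumed at evaluation time, a minimal ranking sketch (the function name is illustrative; 150 negatives per positive matches the MM-GRAPH protocol):

```python
import torch

def mrr_and_hits(pos_score: torch.Tensor, neg_scores: torch.Tensor,
                 ks=(1, 10)):
    """Rank each positive edge against its sampled hard negatives.

    pos_score: (N,) scores of true edges; neg_scores: (N, num_neg) scores
    of the negatives sampled for each positive (150 per positive under
    the MM-GRAPH HeaRT protocol).
    """
    # rank = 1 + number of negatives scoring at least as high as the positive
    rank = 1 + (neg_scores >= pos_score.unsqueeze(1)).sum(dim=1)
    mrr = (1.0 / rank.float()).mean().item()
    hits = {f"Hits@{k}": (rank <= k).float().mean().item() for k in ks}
    return mrr, hits
```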
7. Applications, Impact, and Methodological Insights
The Amazon Product Co-purchase Multimodal Graph Dataset is central to the development and benchmarking of multimodal graph representation learning:
- Benchmarking GNN-as-Predictor and VLM-as-Predictor Paradigms: Empirical studies reveal that integrating text, images, and topology in GNN pipelines outperforms unimodal baselines for node classification and product categorization (e.g., TV-GNN achieves 56.45% accuracy, 52.28% F1 on MAGB-Movies) (Yan et al., 11 Oct 2024).
- Structure-aware Multimodal Pretraining: SLIP introduces a structural contrastive loss, grouping co-purchased products closer in embedding space while improving cross-modal retrieval metrics by 12.5% MRR over the CLIP baseline (Lu, 4 Nov 2025).
- Recommendation and CTR Prediction: PMGT demonstrates that incorporating multimodal side information enhances downstream performance in recommendation and click-through-rate tasks (Liu et al., 2020).
- Reproducibility and Open Tools: All scripts for downloading raw Amazon data, constructing graphs, extracting features (CLIP, T5, ViT, Inception), splitting edges/nodes, and generating negatives are publicly available via associated repositories, enabling comprehensive reproducibility (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024).
- Modal Bias and Data Sparsity: Studies within these benchmarks report that modality importance may shift with domain, and intrinsic biases or missing information among modalities can hamper model performance in low-data settings (Yan et al., 11 Oct 2024). This suggests a continued need for robust fusion and modality-completion strategies.
The dataset's rigorous construction, scale, and modality completeness have made it a de facto reference for the evaluation and development of structure-aware, multimodal graph learning systems in large-scale recommendation and retrieval applications.