Amazon Co-purchase Multimodal Graph Dataset
- The Amazon Product Co-purchase Multimodal Graph Dataset is a collection of benchmark graphs where each node represents an Amazon product with complete text and high-resolution image attributes.
- It supports tasks like link prediction, node classification, and cross-modal retrieval by encoding co-purchase and co-view relationships from curated Amazon metadata.
- The dataset employs rigorous filtering and preprocessing methods to ensure high-quality, multimodal features derived from product titles, descriptions, and images.
The Amazon Product Co-purchase Multimodal Graph Dataset defines a suite of benchmark graphs in which each node represents an Amazon product, edges encode co-purchase or co-view relations, and each node is annotated with rich multimodal attributes—most prominently, text (titles, descriptions) and high-resolution images. This family of datasets underpins recent advances in multimodal graph learning, supporting studies in domains such as recommendation, product retrieval, link prediction, and category classification. Notably, all contemporary instantiations of these datasets ensure that every retained node is associated with a complete set of modalities, permitting rigorous evaluation and development of graph neural networks, multimodal representation learning algorithms, and structure-aware multimodal pretraining frameworks.
1. Construction Principles and Data Sources
The foundational construction of the Amazon Product Co-purchase Multimodal Graph Dataset involves extracting and filtering product nodes, processing edge relations, and ensuring modality completeness (a minimal code sketch of this pipeline follows the list):
- Graph Origin: Most variants rely on Amazon's publicly released metadata, review, and purchase-event corpora (notably the 2018 and 2024 versions curated by McAuley et al. and Hou et al.).
- Node Set: Each node represents a unique Amazon product within a defined domain or category. Only products with valid English titles, descriptions, and at least one high-resolution image are retained; products lacking any modality are discarded prior to graph induction (Lu, 4 Nov 2025).
- Edge Definition: Edges represent co-purchase relationships, derived either from explicit “also_buy”/“also_view” fields (Yan et al., 11 Oct 2024) or by identifying pairs of products purchased together by at least one user (commonly ≥3) (Liu et al., 2020, Lu, 4 Nov 2025). Edge filtering includes:
- Removal of self-loops and isolated nodes (Yan et al., 11 Oct 2024).
- Minimum edge frequency thresholds (e.g., only retaining pairs co-purchased by ≥3 users) (Liu et al., 2020, Lu, 4 Nov 2025).
- Optional k-core decomposition to remove low-degree nodes (k=5 in (Lu, 4 Nov 2025)).
- Category-Specific Construction: Datasets are often organized by Amazon’s taxonomy (e.g., “Electronics,” “Movies & TV”), permitting both within-category and cross-category analysis (Lu, 4 Nov 2025).
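As a concrete illustration, the following is a minimal sketch of the filtering pipeline using networkx; the metadata field names (`title`, `description`, `image_url`) and the function name are illustrative, not taken from any cited release.

```python
import networkx as nx

def build_product_graph(products, copurchase_pairs, k_core=5):
    """Sketch of the node/edge filtering pipeline described above.

    products: dict mapping product ID -> metadata dict (field names are
    illustrative); copurchase_pairs: iterable of (u, v) pairs, e.g.
    harvested from "also_buy"/"also_view" fields.
    """
    # Retain only modality-complete nodes: valid title, description, image.
    nodes = {pid for pid, meta in products.items()
             if meta.get("title") and meta.get("description")
             and meta.get("image_url")}

    G = nx.Graph()
    G.add_nodes_from(nodes)
    # Keep edges between retained nodes; drop self-loops.
    G.add_edges_from((u, v) for u, v in copurchase_pairs
                     if u in nodes and v in nodes and u != v)

    # Drop isolated nodes, then optionally prune low-degree nodes (k-core).
    G.remove_nodes_from(list(nx.isolates(G)))
    if k_core:
        G = nx.k_core(G, k=k_core)
    return G
```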
2. Dataset Statistics and Coverage
Published datasets exhibit substantial diversity in scale, coverage, and modality completeness:
| Dataset Source | Nodes (|V|) | Edges (|E|) | Categories | Modalities | Tasks |
|---|---|---|---|---|---|
| MM-GRAPH Sports (Zhu et al., 24 Jun 2024) | 50,250 | 356,202 | None | Title, Image | LP only |
| MM-GRAPH Cloth (Zhu et al., 24 Jun 2024) | 125,839 | 951,271 | None | Title, Image | LP only |
| MAGB Movies (Yan et al., 11 Oct 2024) | 16,672 | 218,390 | 20 | Title+Desc, Image | Node Cls |
| MAGB Toys (Yan et al., 11 Oct 2024) | 20,695 | 126,886 | 18 | Title+Desc, Image | Node Cls |
| MAGB Grocery (Yan et al., 11 Oct 2024) | 84,379 | 693,154 | 20 | Title+Desc, Image | Node Cls |
| SLIP Electronics (Lu, 4 Nov 2025) | 98,000 | 2,015,000 | 1 | Title, Image | Retrieval/Cls |
| PMGT VG (Liu et al., 2020) | 5,032 | 83,981 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Toys (Liu et al., 2020) | 17,388 | 232,720 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Tools (Liu et al., 2020) | 15,619 | 178,834 | N/A | Description, Image | Rec, Cls, CTR |
All major releases report 100% multimodal node coverage after filtering, with no missing text or images in final node sets (Yan et al., 11 Oct 2024, Lu, 4 Nov 2025).
3. Modality Extraction and Feature Representation
Amazon Product Co-purchase Multimodal Graphs consistently provide two primary node modalities:
- Textual Features:
- Extraction: Product titles, descriptions, or concatenated segments are lowercased, tokenized, and truncated (usually to 128–256 tokens) (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024, Liu et al., 2020).
- Encoding: Text is embedded via pretrained language models such as CLIP-text (d=512), T5-Base (d=768), all-MiniLM-L12-v2 (d=384), BERT-base (d=768), or Llama-2-family LMs (Lu, 4 Nov 2025, Zhu et al., 24 Jun 2024).
- Pooling: For long-form inputs, mean pooling is performed over the last-layer token embeddings (Yan et al., 11 Oct 2024).
- Normalization: Extracted feature vectors are typically ℓ₂-normalized (Zhu et al., 24 Jun 2024, Lu, 4 Nov 2025).
- Visual Features:
- Extraction: Primary item images (preferentially high-resolution) are center-cropped/resized (224×224 for ViT/CLIP, 299×299 for Inception-v4), normalized to model-specific channel means/stds (Lu, 4 Nov 2025, Liu et al., 2020).
- Encoding: Visual embeddings are generated using CLIP-ViT (d=512), ViT-Base (d=768), Inception-v4 (d=1536), or other PVMs (Yan et al., 11 Oct 2024, Liu et al., 2020).
- Pooling: Features from multiple images are averaged if present (Liu et al., 2020).
- Normalization: Outputs are ℓ₂-normalized.
- Combined Feature Construction:
- Frequently, text and image features are concatenated row-wise, yielding $x_v = [\, x_v^{\mathrm{text}} \,\|\, x_v^{\mathrm{img}} \,] \in \mathbb{R}^{d_t + d_v}$ (Zhu et al., 24 Jun 2024).
- Some methods apply further fusion via attention or shallow MLPs before input to downstream GNNs (Liu et al., 2020).
While the majority of implementations extract features via frozen encoders, several frameworks (e.g., PMGT, SLIP) employ linear projections or graph neural network layers for further feature transformation and fusion (Liu et al., 2020, Lu, 4 Nov 2025).
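For instance, a minimal frozen-encoder sketch using the Hugging Face transformers CLIP API (the checkpoint name is one common public choice; the helper function is illustrative):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"  # d=512 per modality
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def encode_product(title: str, image_path: str) -> torch.Tensor:
    """Return a concatenated, ℓ2-normalized [text ‖ image] node feature."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[title], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # ℓ2-normalize each modality, then concatenate row-wise (Section 3).
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    return torch.cat([text_emb, img_emb], dim=-1).squeeze(0)  # shape (1024,)
```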
4. Graph Construction and Edge Semantics
Edge construction and weighting strategies closely track the co-purchase semantics characteristic of e-commerce graphs:
- Edge Inclusion:
- Edges are included between items co-purchased or co-viewed by at least one (commonly ≥3) distinct user, or between items that appear in each other’s “also_buy”/“also_view” lists (Yan et al., 11 Oct 2024, Liu et al., 2020).
- In a more restrictive regime, k-core pruning (with k=5) removes sparsely connected nodes (Lu, 4 Nov 2025).
- Edge Weights:
- Most datasets use unweighted, undirected edges, i.e., $A_{uv} = A_{vu} \in \{0, 1\}$ (Zhu et al., 24 Jun 2024).
- Weighted graphs are constructed in PMGT using $w_{ij} = \frac{c_{ij}}{\sqrt{d_i\, d_j}}$, where $c_{ij}$ is the number of shared purchasers and $d_i$, $d_j$ the node degrees (Liu et al., 2020); a code sketch of this weighting follows this section.
- Adjacency Matrix:
- Provided as sparse binary or real-valued matrices, with normalization (e.g., symmetric normalization $D^{-1/2} A D^{-1/2}$) deferred to downstream message-passing architectures (Zhu et al., 24 Jun 2024).
- Train/Validation/Test Splits:
- Edge-based splits (e.g., 80/10/10 LP splits with negatives (Zhu et al., 24 Jun 2024)) or node-based splits for node classification (e.g., 60/20/20 in MAGB) (Yan et al., 11 Oct 2024).
A plausible implication is that these graph construction and edge semantics conventions, especially edge weighting and hard negative sampling, are tailored to encourage strong signal propagation and realistic evaluation of link and neighbor prediction tasks.
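A minimal sketch of the thresholding, PMGT-style weighting, and symmetric normalization described above, assuming basket-level purchase lists are available (all function names are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from collections import defaultdict
from itertools import combinations

def count_copurchases(baskets, min_count=3):
    """Count co-purchase pairs over per-user baskets and keep pairs bought
    together by at least `min_count` users (the common >=3 threshold)."""
    counts = defaultdict(int)
    for items in baskets:
        for i, j in combinations(sorted(set(items)), 2):
            counts[(i, j)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

def pmgt_style_weights(pair_counts):
    """Degree-normalized weights w_ij = c_ij / sqrt(d_i * d_j), matching
    the PMGT-style weighting reconstructed above."""
    degree = defaultdict(int)
    for i, j in pair_counts:
        degree[i] += 1
        degree[j] += 1
    return {(i, j): c / np.sqrt(degree[i] * degree[j])
            for (i, j), c in pair_counts.items()}

def sym_normalize(adj: sp.csr_matrix) -> sp.csr_matrix:
    """Symmetric normalization D^{-1/2} A D^{-1/2} for message passing."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(deg, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    return sp.diags(d_inv_sqrt) @ adj @ sp.diags(d_inv_sqrt)
```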
5. Downstream Tasks, Evaluation, and Benchmarks
Depending on graph instance and benchmark protocol, the following tasks are supported:
- Link Prediction (LP):
- Setup: Predict the existence of edges among held-out pairs. MM-GRAPH splits edges 80/10/10 (train/val/test), with 150 hard negatives per positive edge sampled via the HeaRT algorithm (Zhu et al., 24 Jun 2024).
- Metrics: Mean Reciprocal Rank (MRR), Hits@1, Hits@10.
- Node Classification:
- Many variants (notably the MAGB graphs) provide category labels over a moderate number of classes (C=18–20) (Yan et al., 11 Oct 2024).
- Metrics: Standard accuracy and macro-F1 across classes.
- Cross-Modal Retrieval:
- Employed for frameworks like SLIP, evaluating Retrieval@K and MRR in both directions (Image→Text, Text→Image) (Lu, 4 Nov 2025).
- SLIP’s Electronics subset, for example, yields I2T MRR=0.585 (CLIP baseline 0.520) and T2I MRR=0.582 (CLIP baseline 0.517), with R@1 improvement from 0.403 (CLIP) to 0.478 (SLIP).
- Recommendation and CTR Prediction:
- PMGT supports fine-grained click-through-rate prediction, recasting the graph into positive and negative item pairs for supervised learning (Liu et al., 2020).
- Feature Reconstruction and Contrastive Pretraining:
- PMGT leverages masked node feature reconstruction across multimodal neighborhoods; SLIP employs a structural contrastive loss to encourage both modality and relational alignment, with the total loss $\mathcal{L} = \mathcal{L}_{\mathrm{CLIP}} + \lambda\, \mathcal{L}_{\mathrm{struct}}$, where $\mathcal{L}_{\mathrm{struct}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(z_i^{\top} z_j / \tau)}{\sum_{k \neq i} \exp(z_i^{\top} z_k / \tau)}$, and $\mathcal{P}$ is the positive mask for pairs within $h$ hops in the batch subgraph (Lu, 4 Nov 2025).
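A PyTorch sketch of one such structural term, under the InfoNCE-style reading of the loss above (the exact SLIP formulation may differ in detail; `pos_mask` must exclude the diagonal):

```python
import torch

def structural_contrastive_loss(z: torch.Tensor, pos_mask: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style structural term over a batch subgraph.

    z: (B, d) ℓ2-normalized node embeddings.
    pos_mask: (B, B) boolean, True where nodes i, j lie within h hops;
              must be False on the diagonal.
    """
    sim = z @ z.t() / tau                          # (B, B) similarity logits
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    has_pos = pos_mask.any(dim=1)
    return per_anchor[has_pos].mean()

# Total objective as sketched above: loss = clip_loss + lam * structural term.
```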
6. Preprocessing, Negative Sampling, and Data Augmentation
Standardized preprocessing steps and negative sampling methodologies enhance reproducibility and strict test isolation:
- Text: Lowercased, tokenized; often truncated to fixed length; processed with model-specific tokenization pipelines (Zhu et al., 24 Jun 2024, Liu et al., 2020, Lu, 4 Nov 2025).
- Images: Resized to fixed spatial resolution; normalized per encoder’s requirements (ImageNet, CLIP stats) (Lu, 4 Nov 2025).
- Feature Normalization: ℓ₂-normalization post-encoding is standard (Zhu et al., 24 Jun 2024, Lu, 4 Nov 2025).
- Negative Sampling: For LP and retrieval, hard negative edges/pairs are generated via specialized samplers (e.g., HeaRT for MM-GRAPH’s LP task, 150 negatives per positive edge) (Zhu et al., 24 Jun 2024); a ranking sketch using such negatives follows this list.
- Graph Augmentation: No edge dropping or node masking at data-load time; augmentation is left to modeling frameworks if desired (Zhu et al., 24 Jun 2024, Liu et al., 2020).
- Masking for Pretraining: In PMGT, 20% of sampled contextual neighbors are masked as part of the masked node feature reconstruction objective, with replacements and masking decisions inspired by BERT’s pretraining protocol (Liu et al., 2020).
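As an illustration of how these sampled negatives are consumed at evaluation time, a minimal ranking sketch (the function name is illustrative; 150 negatives per positive matches the MM-GRAPH protocol):

```python
import torch

def mrr_and_hits(pos_score: torch.Tensor, neg_scores: torch.Tensor,
                 ks=(1, 10)):
    """Rank each positive edge against its sampled hard negatives.

    pos_score: (N,) scores of true edges; neg_scores: (N, num_neg) scores
    of the negatives sampled for each positive (150 per positive under
    the MM-GRAPH HeaRT protocol).
    """
    # rank = 1 + number of negatives scoring at least as high as the positive
    rank = 1 + (neg_scores >= pos_score.unsqueeze(1)).sum(dim=1)
    mrr = (1.0 / rank.float()).mean().item()
    hits = {f"Hits@{k}": (rank <= k).float().mean().item() for k in ks}
    return mrr, hits
```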
7. Applications, Impact, and Methodological Insights
The Amazon Product Co-purchase Multimodal Graph Dataset is central to the development and benchmarking of multimodal graph representation learning:
- Benchmarking GNN-as-Predictor and VLM-as-Predictor Paradigms: Empirical studies reveal that integrating text, images, and topology in GNN pipelines outperforms unimodal baselines for node classification and product categorization (e.g., TV-GNN achieves 56.45% accuracy, 52.28% F1 on MAGB-Movies) (Yan et al., 11 Oct 2024).
- Structure-aware Multimodal Pretraining: SLIP introduces a structural contrastive loss, grouping co-purchased products closer in embedding space while improving cross-modal retrieval metrics by 12.5% MRR over the CLIP baseline (Lu, 4 Nov 2025).
- Recommendation and CTR Prediction: PMGT demonstrates that incorporating multimodal side information enhances downstream performance in recommendation and click-through-rate tasks (Liu et al., 2020).
- Reproducibility and Open Tools: All scripts for downloading raw Amazon data, constructing graphs, extracting features (CLIP, T5, ViT, Inception), splitting edges/nodes, and generating negatives are publicly available via associated repositories, enabling comprehensive reproducibility (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024).
- Modal Bias and Data Sparsity: Studies within these benchmarks report that modality importance may shift with domain, and intrinsic biases or missing information among modalities can hamper model performance in low-data settings (Yan et al., 11 Oct 2024). This suggests a continued need for robust fusion and modality-completion strategies.
The dataset's rigorous construction, scale, and modality completeness have made it a de facto reference for the evaluation and development of structure-aware, multimodal graph learning systems in large-scale recommendation and retrieval applications.