Amazon Co-purchase Multimodal Graph Dataset

Updated 10 November 2025
  • The Amazon Product Co-purchase Multimodal Graph Dataset is a collection of benchmark graphs where each node represents an Amazon product with complete text and high-resolution image attributes.
  • It supports tasks like link prediction, node classification, and cross-modal retrieval by encoding co-purchase and co-view relationships from curated Amazon metadata.
  • The dataset employs rigorous filtering and preprocessing methods to ensure high-quality, multimodal features derived from product titles, descriptions, and images.

The Amazon Product Co-purchase Multimodal Graph Dataset defines a suite of benchmark graphs in which each node represents an Amazon product, edges encode co-purchase or co-view relations, and each node is annotated with rich multimodal attributes, most prominently text (titles, descriptions) and high-resolution images. This family of datasets underpins recent advances in multimodal graph learning, supporting studies in domains such as recommendation, product retrieval, link prediction, and category classification. Notably, all contemporary instantiations ensure that every retained node carries a complete set of modalities, permitting rigorous evaluation and development of graph neural networks, multimodal representation learning algorithms, and structure-aware multimodal pretraining frameworks.

1. Construction Principles and Data Sources

The foundational construction of the Amazon Product Co-purchase Multimodal Graph Dataset involves extracting and filtering product nodes, processing edge relations, and ensuring modality completeness:

  • Graph Origin: Most variants rely on Amazon's publicly released metadata, review, and purchase-event corpora (notably the 2018 and 2024 versions curated by McAuley et al. and Hou et al.).
  • Node Set $V$: Each node represents a unique Amazon product within a defined domain or category. Only products with valid English titles, descriptions, and at least one high-resolution image are retained; products lacking any modality are discarded prior to graph induction (Lu, 4 Nov 2025).
  • Edge Definition $E$: Edges represent co-purchase relationships, either from explicit “also_buy”/“also_view” fields (Yan et al., 11 Oct 2024) or by identifying pairs of products purchased by at least one (or, commonly, ≥3) users (Liu et al., 2020, Lu, 4 Nov 2025). Edge filtering (minimum co-purchaser thresholds and, in some releases, k-core pruning) removes sparse or spurious connections; a minimal parsing sketch follows this list.
  • Category-Specific Construction: Datasets are often organized by Amazon’s taxonomy (e.g., “Electronics,” “Movies & TV”), permitting both within-category and cross-category analysis (Lu, 4 Nov 2025).
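As a concrete illustration of the filtering and edge-induction steps above, the following Python sketch parses a 2018-style Amazon metadata dump, keeps only modality-complete products, and induces undirected edges from the explicit lists. The field names (asin, title, description, imageURLHighRes, also_buy, also_view) are assumed to match the commonly used 2018 release and may differ in other dumps.

```python
import gzip
import json

def load_products(path):
    """Stream a 2018-style Amazon metadata dump (one JSON object per line)."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield json.loads(line)

def build_copurchase_graph(path):
    """Retain only nodes with full text + image coverage, then induce
    undirected edges from the also_buy / also_view lists."""
    nodes = {}
    for item in load_products(path):
        # Modality-completeness filter: title, description, and image required.
        if item.get("title") and item.get("description") and item.get("imageURLHighRes"):
            nodes[item["asin"]] = item

    edges = set()
    for asin, item in nodes.items():
        for other in item.get("also_buy", []) + item.get("also_view", []):
            if other in nodes and other != asin:
                edges.add(tuple(sorted((asin, other))))  # undirected, unweighted
    return nodes, edges
```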

2. Dataset Statistics and Coverage

Published datasets exhibit substantial diversity in scale, coverage, and modality completeness:

| Dataset | Source | Nodes ($n$) | Edges ($m$) | Categories | Modalities | Tasks |
|---|---|---|---|---|---|---|
| MM-GRAPH Sports | Zhu et al., 24 Jun 2024 | 50,250 | 356,202 | None | Title, Image | LP only |
| MM-GRAPH Cloth | Zhu et al., 24 Jun 2024 | 125,839 | 951,271 | None | Title, Image | LP only |
| MAGB Movies | Yan et al., 11 Oct 2024 | 16,672 | 218,390 | 20 | Title+Desc, Image | Node Cls |
| MAGB Toys | Yan et al., 11 Oct 2024 | 20,695 | 126,886 | 18 | Title+Desc, Image | Node Cls |
| MAGB Grocery | Yan et al., 11 Oct 2024 | 84,379 | 693,154 | 20 | Title+Desc, Image | Node Cls |
| SLIP Electronics | Lu, 4 Nov 2025 | 98,000 | 2,015,000 | 1 | Title, Image | Retrieval/Cls |
| PMGT VG | Liu et al., 2020 | 5,032 | 83,981 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Toys | Liu et al., 2020 | 17,388 | 232,720 | N/A | Description, Image | Rec, Cls, CTR |
| PMGT Tools | Liu et al., 2020 | 15,619 | 178,834 | N/A | Description, Image | Rec, Cls, CTR |

All major releases report 100% multimodal node coverage after filtering, with no missing text or images in final node sets (Yan et al., 11 Oct 2024, Lu, 4 Nov 2025).

3. Modality Extraction and Feature Representation

Amazon Product Co-purchase Multimodal Graphs consistently provide two primary node modalities:

  • Textual Features:
    • Extraction: Product titles and, where available, descriptions are lowercased, tokenized, and truncated with model-specific pipelines (Zhu et al., 24 Jun 2024, Liu et al., 2020).
    • Encoding: Text embeddings are produced with frozen language encoders such as CLIP's text tower or T5 (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024).
    • Normalization: Outputs are ℓ₂-normalized.
  • Visual Features:
    • Extraction: Primary item images (preferentially high-resolution) are center-cropped/resized (224×224 for ViT/CLIP, 299×299 for Inception-v4), normalized to model-specific channel means/stds (Lu, 4 Nov 2025, Liu et al., 2020).
    • Encoding: Visual embeddings are generated using CLIP-ViT (d=512), ViT-Base (d=768), Inception-v4 (d=1536), or other PVMs (Yan et al., 11 Oct 2024, Liu et al., 2020).
    • Pooling: Features from multiple images are averaged if present (Liu et al., 2020).
    • Normalization: Outputs are ℓ₂-normalized.
  • Combined Feature Construction:
    • Frequently, text and image features are concatenated per node along the feature dimension, yielding $X = [T \,\|\, V] \in \mathbb{R}^{n \times (d_t + d_v)}$ (Zhu et al., 24 Jun 2024).
    • Some methods apply further fusion via attention or shallow MLPs before input to downstream GNNs (Liu et al., 2020).

While the majority of implementations extract features via frozen encoders, several frameworks (e.g., PMGT, SLIP) employ linear projections or graph neural network layers for further feature transformation and fusion (Liu et al., 2020, Lu, 4 Nov 2025).
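To make the feature pipeline concrete, here is a minimal sketch of per-node feature construction with a frozen CLIP encoder via Hugging Face transformers. The checkpoint name and the simple title-plus-description concatenation are illustrative choices, not a prescribed recipe from the benchmarks above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP encoders (ViT-B/32, d_t = d_v = 512); other PVMs/LMs plug in analogously.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def node_features(title: str, description: str, image_path: str) -> torch.Tensor:
    """Return the concatenated, l2-normalized [text || image] feature for one product."""
    text = f"{title}. {description}"
    text_in = processor(text=[text], return_tensors="pt", truncation=True, padding=True)
    img_in = processor(images=Image.open(image_path).convert("RGB"), return_tensors="pt")

    t = model.get_text_features(**text_in)       # (1, 512)
    v = model.get_image_features(**img_in)       # (1, 512)
    t = t / t.norm(dim=-1, keepdim=True)         # l2-normalize each modality
    v = v / v.norm(dim=-1, keepdim=True)
    return torch.cat([t, v], dim=-1).squeeze(0)  # X_i = [T_i || V_i], shape (1024,)
```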

4. Graph Construction and Edge Semantics

Edge construction and weighting strategies closely track the co-purchase semantics endemic to e-commerce graphs:

  • Edge Inclusion:
    • Edges are included between items co-purchased or co-viewed by at least one (commonly at least three) distinct user, or when they appear in each other’s “also_buy”/“also_view” lists (Yan et al., 11 Oct 2024, Liu et al., 2020).
    • In a more restrictive regime, k-core pruning (with k=5) removes sparsely connected nodes (Lu, 4 Nov 2025).
  • Edge Weights:
    • Most datasets use unweighted, undirected edges: $A_{ij} \in \{0,1\}$ (Zhu et al., 24 Jun 2024).
    • Weighted graphs are constructed in PMGT using

    $$\omega_{ht} = \frac{\log r_{ht} + 1}{\log \sqrt{d_h d_t} + 1}$$

    where $r_{ht}$ is the number of shared purchasers and $(d_h, d_t)$ are the endpoint degrees (Liu et al., 2020).

  • Adjacency Matrix:
    • Provided as sparse binary or real-valued matrices, with normalization (e.g., $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$) deferred to downstream message-passing architectures (Zhu et al., 24 Jun 2024); see the sketch at the end of this section.
  • Train/Validation/Test Splits:
    • Fixed splits accompany each benchmark; MM-GRAPH, for example, partitions edges 80/10/10 into train/validation/test for link prediction (Zhu et al., 24 Jun 2024).

These graph construction and edge semantics conventions, especially edge weighting and hard negative sampling, appear tailored to encourage strong signal propagation and realistic evaluation of link and neighbor prediction tasks.
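The edge weighting and adjacency normalization described above can be expressed compactly as follows; this is a minimal NumPy/SciPy illustration of the stated formulas, not code from the cited releases.

```python
import numpy as np
import scipy.sparse as sp

def pmgt_edge_weight(r_ht: int, d_h: int, d_t: int) -> float:
    """PMGT-style weight: more shared purchasers up-weight an edge,
    while high-degree endpoints down-weight it (Liu et al., 2020)."""
    return (np.log(r_ht) + 1.0) / (np.log(np.sqrt(d_h * d_t)) + 1.0)

def normalize_adjacency(A: sp.spmatrix) -> sp.spmatrix:
    """Symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2},
    as deferred to downstream message-passing architectures."""
    A_tilde = A + sp.eye(A.shape[0], format="csr")
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt
```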

5. Downstream Tasks, Evaluation, and Benchmarks

Depending on graph instance and benchmark protocol, the following tasks are supported:

  • Link Prediction (LP):
    • Setup: Predict the existence of edges among held-out pairs. MM-GRAPH splits edges 80/10/10 (train/val/test), with 150 hard negatives per positive edge sampled via the HeaRT algorithm (Zhu et al., 24 Jun 2024).
    • Metrics: Mean Reciprocal Rank (MRR), Hits@1, Hits@10.
  • Node Classification:
    • Many variants (notably the MAGB graphs) provide category labels over a moderate number of classes (C=18–20) (Yan et al., 11 Oct 2024).
    • Metrics: Standard accuracy and macro-F1 across classes.
  • Cross-Modal Retrieval:
    • Employed for frameworks like SLIP, evaluating Retrieval@K and MRR in both directions (Image→Text, Text→Image) (Lu, 4 Nov 2025).
    • SLIP’s Electronics subset, for example, yields I2T MRR=0.585 (CLIP baseline 0.520) and T2I MRR=0.582 (CLIP baseline 0.517), with R@1 improvement from 0.403 (CLIP) to 0.478 (SLIP).
  • Recommendation and CTR Prediction:
    • PMGT supports fine-grained click-through-rate prediction, recasting the graph as sample-positive and sample-negative item pairs for supervised learning (Liu et al., 2020).
  • Feature Reconstruction and Contrastive Pretraining:

    • PMGT leverages masked node feature reconstruction across multimodal neighborhoods; SLIP employs a structural contrastive loss to encourage both modality and relational alignment, with the total loss:

    $$L_\text{total} = L_\text{CLIP} + \lambda_\text{graph} L_\text{graph} + \lambda_\text{aux} L_\text{aux}$$

    where

    $$L_\text{graph} = -\frac{1}{\|\mathbf{M}^+\|_1 + \epsilon} \sum_{i,j} \mathbf{M}^+_{ij} \log P_{i,j}$$

    and $\mathbf{M}^+$ is the positive mask for pairs within $h$ hops in the batch subgraph (Lu, 4 Nov 2025); a minimal sketch of this loss follows below.
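The PyTorch sketch below implements the graph term above; it assumes $P_{i,j}$ is a temperature-scaled softmax over cross-modal similarities within the batch, which is an assumption about SLIP's exact definition rather than a confirmed detail of the paper.

```python
import torch
import torch.nn.functional as F

def structural_contrastive_loss(text_emb, img_emb, pos_mask, tau=0.07, eps=1e-8):
    """Sketch of L_graph: average negative log-likelihood over positive pairs.

    text_emb, img_emb : (B, d) l2-normalized batch embeddings
    pos_mask          : (B, B) binary M+ marking pairs within h hops
    tau               : assumed softmax temperature
    """
    logits = text_emb @ img_emb.t() / tau   # (B, B) cross-modal similarity scores
    log_p = F.log_softmax(logits, dim=-1)   # log P_{i,j} over in-batch candidates j
    return -(pos_mask * log_p).sum() / (pos_mask.sum() + eps)
```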

6. Preprocessing, Negative Sampling, and Data Augmentation

Standardized preprocessing steps and negative sampling methodologies enhance reproducibility and strict test isolation:

  • Text: Lowercased, tokenized; often truncated to fixed length; processed with model-specific tokenization pipelines (Zhu et al., 24 Jun 2024, Liu et al., 2020, Lu, 4 Nov 2025).
  • Images: Resized to fixed spatial resolution; normalized per encoder’s requirements (ImageNet, CLIP stats) (Lu, 4 Nov 2025).
  • Feature Normalization: ℓ₂-normalization post-encoding is standard (Zhu et al., 24 Jun 2024, Lu, 4 Nov 2025).
  • Negative Sampling: For LP and retrieval, hard negative edges/pairs are generated via specialized samplers (e.g., HeaRT for MM-GRAPH’s LP task, 150 negatives per positive edge) (Zhu et al., 24 Jun 2024); a simplified sampling sketch appears after this list.
  • Graph Augmentation: No edge dropping or node masking at data-load time; augmentation is left to modeling frameworks if desired (Zhu et al., 24 Jun 2024, Liu et al., 2020).
  • Masking for Pretraining: In PMGT, 20% of sampled contextual neighbors are masked as part of the masked node feature reconstruction objective, with replacements and masking decisions inspired by BERT’s pretraining protocol (Liu et al., 2020).
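For reference, a simplified negative sampler in the spirit of the LP protocol is shown below. It draws uniform corruptions and merely rejects true edges and self-loops, whereas HeaRT additionally selects hard candidates, so treat it as a placeholder rather than a reimplementation (Zhu et al., 24 Jun 2024).

```python
import random

def sample_negatives(pos_edge, num_nodes, edge_set, k=150, seed=0):
    """Draw k corrupted (head, tail') pairs for one positive edge,
    rejecting existing edges (either orientation) and self-loops."""
    rng = random.Random(seed)
    h, t = pos_edge
    negatives = []
    while len(negatives) < k:
        t_neg = rng.randrange(num_nodes)
        if (t_neg != h and t_neg != t
                and (h, t_neg) not in edge_set
                and (t_neg, h) not in edge_set):
            negatives.append((h, t_neg))
    return negatives
```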

7. Applications, Impact, and Methodological Insights

The Amazon Product Co-purchase Multimodal Graph Dataset is central to the development and benchmarking of multimodal graph representation learning:

  • Benchmarking GNN-as-Predictor and VLM-as-Predictor Paradigms: Empirical studies reveal that integrating text, images, and topology in GNN pipelines outperforms unimodal baselines for node classification and product categorization (e.g., TV-GNN achieves 56.45% accuracy, 52.28% F1 on MAGB-Movies) (Yan et al., 11 Oct 2024).
  • Structure-aware Multimodal Pretraining: SLIP introduces a structural contrastive loss, grouping co-purchased products closer in embedding space while improving cross-modal retrieval metrics by 12.5% MRR over the CLIP baseline (Lu, 4 Nov 2025).
  • Recommendation and CTR Prediction: PMGT demonstrates that incorporating multimodal side information enhances downstream performance in recommendation and click-through-rate tasks (Liu et al., 2020).
  • Reproducibility and Open Tools: All scripts for downloading raw Amazon data, constructing graphs, extracting features (CLIP, T5, ViT, Inception), splitting edges/nodes, and generating negatives are publicly available via associated repositories, enabling comprehensive reproducibility (Zhu et al., 24 Jun 2024, Yan et al., 11 Oct 2024).
  • Modal Bias and Data Sparsity: Studies within these benchmarks report that modality importance may shift with domain, and intrinsic biases or missing information among modalities can hamper model performance in low-data settings (Yan et al., 11 Oct 2024). This suggests a continued need for robust fusion and modality-completion strategies.

The dataset's rigorous construction, scale, and modality completeness have made it a de facto reference for the evaluation and development of structure-aware, multimodal graph learning systems in large-scale recommendation and retrieval applications.
