Amazon Computers Co-Purchase Data

Updated 22 December 2025
  • Amazon Computers Co-Purchase Data is a focused subgraph from the Amazon Product Co-Purchasing Network that captures co-purchase patterns among computer products.
  • It enables analysis of e-commerce network structure through motif enumeration, community detection, and exploration of product bundling patterns.
  • The dataset supports developing inductive recommender systems using graph neural networks with rich node features and contextual metadata.

The Amazon Computers Co-Purchase Data refers to a subgraph extracted from the broader Amazon Product Co-Purchasing Network, focusing specifically on products categorized as “Computers.” This subgraph is derived from the SNAP Amazon meta-data datasets, which capture product-level co-purchasing relationships inferred from Amazon's “Customers Who Bought This Item Also Bought” feature. The Computers subgraph, once constructed, serves as a test bed for analyzing structural properties of e-commerce networks, investigating motif distributions, executing community detection, and developing inductive recommender systems using graph neural networks.

1. Source Data and Extraction of the Computers Subgraph

The Amazon Product Co-Purchasing Network, available from the Stanford Network Analysis Project (SNAP), consists of either undirected or directed edges between product nodes, with edges indicating that two products are frequently purchased together. The key datasets include:

  • Edge List: Plain text list of product pairs. In (Srivastava, 2010), this is a directed edge $(i, j)$ where node $j$ appears in the “also bought” list for node $i$.
  • Metadata File: Contains for each product: integer node ID, ASIN, Title, Group (e.g., Electronics), SalesRank, Similar Products (list of ASINs), Categories (category tree), Reviews.

To generate the Computers subgraph:

  1. Parse metadata, filter nodes where either the “Group” field is “Electronics” or the “Categories” list contains “Computers.”
  2. Induce the subgraph $G_c$ on these product IDs: $G_c = G[\{v \in V : \text{“Computers”} \in \text{Categories}(v)\}]$.
  3. Optionally restrict to the largest connected component for further experiments (Liu et al., 3 Jun 2025).

Node and edge counts for the filtered subgraph are dataset-dependent; typical extractions yield $10^4$–$2 \times 10^4$ nodes and $8 \times 10^4$–$1.5 \times 10^5$ edges, but exact statistics require a fresh local pass through the metadata. The undirected variant (as in (Liu et al., 3 Jun 2025)) treats edges as bidirectional if either product lists the other in its “Similar” field.
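The three extraction steps above can be sketched with pandas and NetworkX. The parsing of the raw SNAP files is elided here; a toy, pre-parsed metadata table stands in for amazon-meta.txt, and its column names are illustrative assumptions, not SNAP's:

```python
import pandas as pd
import networkx as nx

# Toy stand-in for the parsed SNAP metadata; in practice this table is
# built by parsing amazon-meta.txt (column names here are assumptions).
meta = pd.DataFrame({
    "node_id":    [0, 1, 2, 3, 4],
    "group":      ["Electronics", "Book", "Electronics", "Electronics", "Book"],
    "categories": [["Computers", "Laptops"], ["Fiction"],
                   ["Computers", "Keyboards"], ["Cameras"], ["Computers"]],
})
edges = [(0, 2), (2, 3), (0, 4), (1, 4)]  # stand-in for the SNAP edge list

# Step 1: keep nodes whose group is "Electronics" OR whose category
# tree mentions "Computers".
mask = (meta["group"] == "Electronics") | meta["categories"].apply(
    lambda cats: "Computers" in cats)
keep = set(meta.loc[mask, "node_id"])

# Step 2: induce the subgraph G_c on the retained product IDs.
G = nx.Graph(edges)
G_c = G.subgraph(keep).copy()

# Step 3 (optional): restrict to the largest connected component.
lcc = max(nx.connected_components(G_c), key=len)
G_lcc = G_c.subgraph(lcc).copy()
```

On the real data the same induction runs unchanged; only the toy `meta` and `edges` objects are replaced by the parsed SNAP files.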

2. Structural Representation and Statistical Profile

The subgraph is represented by an adjacency matrix $A \in \{0,1\}^{n \times n}$, where

$$A_{ij} = \begin{cases} 1 & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$$

(for the directed network), or as $G = (V, E)$ for the undirected case. Node features include:

  • PCA-reduced textual embeddings of product titles
  • One-hot group/category vectors
  • Degree and clustering coefficient as scalar features.
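A minimal feature-assembly sketch of the list above, using NetworkX's karate-club graph as a stand-in for the Computers subgraph and random vectors in place of real PCA-reduced title embeddings:

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()           # stand-in for the Computers subgraph
n = G.number_of_nodes()
rng = np.random.default_rng(0)

# Stand-in for PCA-reduced title embeddings (real ones come from text).
title_emb = rng.normal(size=(n, 32))

# One-hot group vector; the toy graph has a single "group".
group_onehot = np.ones((n, 1))

# Structural scalar features: degree and local clustering coefficient.
deg = np.array([G.degree(v) for v in G.nodes])[:, None]
clus = np.array([nx.clustering(G, v) for v in G.nodes])[:, None]

X = np.hstack([title_emb, group_onehot, deg, clus])  # one row per node
```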

Basic descriptive statistics computed on $G_c$ include:

  • $n = |V_{\text{computers}}|$ (number of nodes)
  • $m = |E_{\text{computers}}|$ (number of edges)
  • Density: $2m / (n(n-1))$
  • Degree distribution, often exhibiting power-law decay $P(\deg \geq k) \approx C k^{-\alpha}$ with $\alpha \approx 3.5$ (Liu et al., 3 Jun 2025).

A typical extraction pipeline leverages libraries such as NetworkX and pandas for preprocessing and filtering.
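These statistics can be computed directly with NetworkX; here a Barabási–Albert graph stands in for the real Computers subgraph:

```python
import numpy as np
import networkx as nx

G = nx.barabasi_albert_graph(2000, 3, seed=1)   # power-law-ish stand-in

n, m = G.number_of_nodes(), G.number_of_edges()
density = 2 * m / (n * (n - 1))

# Empirical complementary CDF P(deg >= k) of the degree distribution;
# on the real subgraph this is the curve approximated by C * k**(-alpha).
degrees = np.array([d for _, d in G.degree()])
ks = np.arange(1, degrees.max() + 1)
ccdf = np.array([(degrees >= k).mean() for k in ks])
```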

3. Motif Analysis and Subgraph Patterns

Srivastava (Srivastava, 2010) introduced systematic motif enumeration on the full Amazon co-purchase network, counting occurrences of all 3-node and 4-node directed subgraphs (motifs). The count $M_k$ for a motif type $k$ is

$$M_k = \sum_{\{i, j, \dots\} \subseteq V} \mathbf{1}\!\left[\text{subgraph}(i, j, \dots) \text{ matches motif } k\right]$$

Motif identification in the Computers subgraph is directly analogous, using standard enumeration tools such as FANMOD.

Patterns of interest:

  • Converging motifs: several nodes pointing to a common node, interpretable as multiple peripherals linked to a CPU.
  • Reciprocated pairs with “spoke” nodes: e.g., two products that appear in each other’s co-purchase lists and are both linked to a third product.
  • Fully connected triangles: indicate tightly associated product bundles.

No explicit motif frequencies for the Computers subgraph are reported in primary sources; these must be computed by the researcher. Relative frequencies and motif distributions support inference on purchase associations and bundle structure.
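For the 3-node directed case, NetworkX's triadic census enumerates all 16 triad isomorphism classes without external tools such as FANMOD. The toy graph below plants the converging pattern described above:

```python
import networkx as nx

# Toy directed graph: nodes 0, 1, and 3 all point at node 2 (converging
# motifs), and 4 <-> 5 is a reciprocated pair with "spoke" node 6.
D = nx.DiGraph([(0, 2), (1, 2), (3, 2), (4, 5), (5, 4), (4, 6), (5, 6)])

census = nx.triadic_census(D)
# "021U" is the triad class with two arcs converging on a common node.
converging = census["021U"]
```

The census counts every 3-node subset exactly once, so the values sum to $\binom{n}{3}$; 4-node motifs still require a dedicated enumerator.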

4. Community Detection and Clustering

Community structure in the Computers subgraph is probed using:

  • Girvan–Newman edge-betweenness removal ($O(m^2 n)$ naïve complexity)
  • Clauset–Newman–Moore greedy modularity maximization ($O(n \log^2 n)$)

The process involves:

  1. Running the detection algorithm on GcG_c
  2. Recording optimal modularity $Q$, number of communities $C$, and mean community size $|V_c|/C$
  3. Interpreting the resulting communities by the dominant product types of their member ASINs (e.g., “Laptops,” “Keyboards,” “Graphics Cards”) (Srivastava, 2010)

No published community statistics exist for the Computers subgraph; for graphs of this size, modularity $Q \approx 0.4$–$0.6$ and $C = 30$–$80$ communities are plausible, but recomputation is necessary.
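A sketch of this workflow with NetworkX's Clauset–Newman–Moore implementation, run on a planted-partition graph standing in for $G_c$:

```python
import networkx as nx
from networkx.algorithms import community

# Planted-partition stand-in: 4 groups of 25 nodes, dense inside groups.
G = nx.planted_partition_graph(4, 25, p_in=0.3, p_out=0.01, seed=7)

# Step 1: greedy modularity maximization (Clauset-Newman-Moore).
comms = community.greedy_modularity_communities(G)

# Step 2: record Q, number of communities C, and mean community size.
Q = community.modularity(G, comms)
C = len(comms)
mean_size = G.number_of_nodes() / C
```

Step 3 (interpreting communities by dominant product types) requires joining each community's node IDs back to the metadata table.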

5. Recommender System Evaluation Using GraphSAGE

Recent work (Liu et al., 3 Jun 2025) applies a modified GraphSAGE framework for inductive link prediction on the Computers co-purchase subgraph, targeting the recommendation of new products.

  • Feature Construction: Input vectors $x_u$ concatenate a text embedding in $\mathbb{R}^{32}$, a categorical group representation, a category vector, degree, and clustering coefficient.
  • Layer-wise Embedding Propagation: For $k$ layers,
    • Sample $S$ 1-hop neighbors per node
    • Aggregate neighbor features (mean, GCN, or pooling)
    • Update $h_u^{(k)} = \sigma\!\left( W^{(k)} \left[ h_u^{(k-1)} \,\|\, h_{N(u)}^{(k)} \right] \right)$
  • Link Scoring and Loss: Candidate links $(u, v)$ are scored by $s(u,v) = \sigma(\mathbf{w}_s^\mathsf{T} [h_u \,\|\, h_v])$; binary cross-entropy is minimized.

Positive examples are observed edges; negatives are constructed by pairing the same $u$ with a random $v'$ such that $(u, v') \notin E$. A stratified train/validation/test split is performed with an 80/10/10 ratio (Liu et al., 3 Jun 2025).

  • Online Inductive Inference: For new products, $x_{\text{new}}$ is computed from available metadata. “Proxy neighbor” nodes are sampled for propagation in the absence of direct edges, allowing inference on cold-start items.

Empirical results indicate that the modified GraphSAGE approach achieves superior ROC-AUC and Precision@K versus a random forest baseline; e.g., AUC(GraphSAGE) $= 0.957$ versus AUC(RF) $= 0.938$, and P@20(GraphSAGE) $= 0.212$ versus P@20(RF) $= 0.174$.
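The propagation and scoring equations above can be illustrated with a minimal NumPy forward pass on a toy graph. This is a schematic of the update rule with untrained random weights, not the paper's PyTorch Geometric implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-purchase graph as an adjacency list, plus random input features x_u.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
X = rng.normal(size=(4, 8))

def sage_layer(H, adj, W, S=2):
    """One GraphSAGE step: sample up to S neighbors, mean-aggregate their
    states, concatenate with the node's own state, apply W and a ReLU."""
    out = []
    for u in sorted(adj):
        nbrs = rng.choice(adj[u], size=min(S, len(adj[u])), replace=False)
        h_N = H[nbrs].mean(axis=0)                  # h_{N(u)}
        out.append(np.concatenate([H[u], h_N]))     # [h_u || h_{N(u)}]
    return np.maximum(np.stack(out) @ W, 0.0)       # sigma = ReLU

W1 = rng.normal(size=(16, 8))                       # maps concat -> 8 dims
H1 = sage_layer(X, adj, W1)

# Link scoring: s(u, v) = sigmoid(w_s^T [h_u || h_v]).
w_s = rng.normal(size=16)
def score(u, v):
    return 1.0 / (1.0 + np.exp(-w_s @ np.concatenate([H1[u], H1[v]])))
```

Training would fit $W^{(k)}$ and $\mathbf{w}_s$ by minimizing binary cross-entropy over positive edges and sampled negatives; inductive inference applies the same layers to a new node's features and proxy neighbors.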

6. Data Availability and Reproducibility

The SNAP Amazon meta-data—both the raw edge and metadata files—are available at https://snap.stanford.edu/data/amazon-meta.html. Standard academic-use (CC BY) terms apply (Srivastava, 2010).

A full reproducibility pipeline is provided by (Liu et al., 3 Jun 2025) and published at https://github.com/cse416a-fl24/final-project-l-minghao_z-catherine_z-nathan.git, including scripts for subgraph extraction, feature computation, and training of the link prediction model.

Characteristic pipeline steps include:

  • Metadata parsing and Computers-filtering via pandas
  • Subgraph induction and largest connected component extraction via NetworkX
  • Embedding and prediction using PyTorch Geometric (SAGEConv)
  • Experimentation scripts for model comparison, hyperparameter tuning, and metric computation (ROC-AUC, Precision@K).

7. Significance and Applications in Network Science

The Amazon Computers Co-Purchase Data exemplifies a style of empirical network analysis used to study both the structure of consumer-product relationships and the performance of modern graph-based recommender systems:

  • For motif studies (Srivastava, 2010), the induced Computers subgraph enables targeted analysis of functional modules (bundling, peripheral–core relations) within a focused product vertical.
  • For machine learning methods (Liu et al., 3 Jun 2025), the abundance of rich node features and evolving graph structure provides a realistic setting for inductive link prediction and evaluation of online learning algorithms.
  • For community detection, the subgraph offers a benchmark for algorithmic approaches to product taxonomy recovery and unsupervised categorical clustering.

A plausible implication is that the results obtained on this subgraph may generalize to other high-density, feature-rich retail product categories and inform the design of deployable, adaptive recommender systems for new item cold-start settings.
