Amazon Computers Co-Purchase Data

Updated 22 December 2025

Amazon Computers Co-Purchase Data is a focused subgraph from the Amazon Product Co-Purchasing Network that captures co-purchase patterns among computer products.
It enables analysis of e-commerce network structure through motif enumeration, community detection, and exploration of product bundling patterns.
The dataset supports developing inductive recommender systems using graph neural networks with rich node features and contextual metadata.

The Amazon Computers Co-Purchase Data refers to a subgraph extracted from the broader Amazon Product Co-Purchasing Network, focusing specifically on products categorized as “Computers.” This subgraph is derived from the SNAP Amazon meta-data datasets, which capture product-level co-purchasing relationships inferred from Amazon's “Customers Who Bought This Item Also Bought” feature. The Computers subgraph, once constructed, serves as a test bed for analyzing structural properties of e-commerce networks, investigating motif distributions, executing community detection, and developing inductive recommender systems using graph neural networks.

1. Source Data and Extraction of the Computers Subgraph

The Amazon Product Co-Purchasing Network, available from the Stanford Network Analysis Project (SNAP), consists of either undirected or directed edges between product nodes, with edges indicating that two products are frequently purchased together. The key datasets include:

Edge List: Plain text list of product pairs. In (Srivastava, 2010), this is a directed edge $(i, j)$ where node $j$ appears in the “also bought” list for node $i$ .
Metadata File: Contains for each product: integer node ID, ASIN, Title, Group (e.g., Electronics), SalesRank, Similar Products (list of ASINs), Categories (category tree), Reviews.

To generate the Computers subgraph:

Parse the metadata and select the nodes whose “Categories” list contains “Computers”; the “Group” field (e.g., “Electronics”) can serve as a coarse pre-filter.
Induce the subgraph $G_c$ on these product IDs: $G_c = G[\{v \in V : \text{“Computers”} \in \text{Categories}(v)\}]$ .
Optionally restrict to the largest connected component for further experiments (Liu et al., 3 Jun 2025).
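The three steps above can be sketched with NetworkX as follows. This is a minimal illustration, not the pipeline used in the cited work: the `products` records, the `edges` list, and the function name `extract_computers_subgraph` are hypothetical stand-ins for whatever the metadata parser produces, and weak connectivity is assumed as the notion of "connected" for the directed case.

```python
import networkx as nx

# Hypothetical toy records standing in for parsed metadata rows.
products = {
    0: {"title": "Laptop", "categories": ["Electronics", "Computers", "Laptops"]},
    1: {"title": "USB Mouse", "categories": ["Electronics", "Computers", "Mice"]},
    2: {"title": "Headphones", "categories": ["Electronics", "Audio"]},
    3: {"title": "Keyboard", "categories": ["Electronics", "Computers", "Keyboards"]},
    4: {"title": "GPU", "categories": ["Electronics", "Computers", "Graphics Cards"]},
}
# Directed "also bought" pairs (i, j): j appears in the list for i.
edges = [(0, 1), (1, 0), (0, 2), (3, 1), (2, 4)]


def extract_computers_subgraph(products, edges, keep_largest_component=False):
    """Induce the subgraph on products whose category list contains 'Computers'."""
    G = nx.DiGraph()
    G.add_nodes_from(products)
    G.add_edges_from(edges)

    # Step 1: filter node IDs by category membership.
    computer_ids = [pid for pid, rec in products.items()
                    if "Computers" in rec["categories"]]

    # Step 2: induced subgraph keeps only edges with both endpoints selected.
    G_c = G.subgraph(computer_ids).copy()

    # Step 3 (optional): keep the largest weakly connected component.
    if keep_largest_component and G_c.number_of_nodes() > 0:
        largest = max(nx.weakly_connected_components(G_c), key=len)
        G_c = G_c.subgraph(largest).copy()
    return G_c


G_c = extract_computers_subgraph(products, edges)
G_lcc = extract_computers_subgraph(products, edges, keep_largest_component=True)
print(sorted(G_c.nodes()), G_c.number_of_edges())
print(sorted(G_lcc.nodes()))
```

In the toy data, the edges through the non-Computers node 2 are dropped by the induction step, which leaves node 4 isolated and therefore outside the largest component.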

Node and edge counts for the filtered subgraph are dataset-dependent; typical extractions yield $10^4$ – $2 \times 10^4$ nodes and $8 \times 10^4$ – $1.5 \times 10^5$ edges, but exact statistics require a fresh local pass through the metadata. The undirected variant used by Liu et al. (3 Jun 2025) treats an edge as present if either product lists the other in its “Similar” field.

2. Structural Representation and Statistical Profile

The subgraph is represented by an adjacency matrix $A \in \{0,1\}^{n \times n}$ , where

$A_{ij} = \begin{cases} 1 & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$

(for the directed network), or as $G = (V,E)$ for the undirected case. Node features include:

PCA-reduced textual embeddings of product titles
One-hot group/category vectors
Degree and clustering coefficient as scalar features.

Basic descriptive statistics computed on $G_c$ include:

$n = |V_{computers}|$ (number of nodes)
$m = |E_{computers}|$ (number of edges)
Density: $2m / (n(n-1))$
Degree distribution, often exhibiting power law decay $P(\deg \geq k) \approx C k^{-\alpha}$ with $\alpha \approx 3.5$ (Liu et al., 3 Jun 2025).
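As a small worked example of these statistics, the sketch below computes the density $2m / (n(n-1))$ and the empirical complementary cumulative degree distribution $P(\deg \geq k)$ for an undirected toy edge list. The helper names `density` and `degree_ccdf` and the toy edges are illustrative assumptions; no exponent is fitted here, since estimating $\alpha$ reliably needs the real degree sequence.

```python
from collections import Counter


def density(n, m):
    """Undirected graph density 2m / (n(n-1))."""
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def degree_ccdf(degrees):
    """Return {k: P(deg >= k)} for every observed degree k."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        ccdf[k] = remaining / n
        remaining -= counts[k]
    return ccdf


# Hypothetical undirected toy edge list.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

degrees = [deg[v] for v in sorted(deg)]
n, m = len(degrees), len(edges)
print(density(n, m))
print(degree_ccdf(degrees))
```

Plotting the returned pairs on log-log axes is the usual visual check for power-law decay; an approximately straight tail has slope $-\alpha$.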

A typical extraction pipeline uses libraries such as NetworkX and pandas for preprocessing and filtering.

3. Motif Analysis and Subgraph Patterns

Srivastava (Srivastava, 2010) introduced systematic motif enumeration on the full Amazon co-purchase network, counting occurrences of all 3-node and 4-node directed subgraphs (motifs). Motif counts $M_k$ for a motif type $k$ are

$M_k = \sum_{\{i,j,\dots\} \subseteq V} \mathbf{1}\big[\text{subgraph}(i,j,\dots) \text{ matches motif } k\big]$

Motif identification in the Computers subgraph is directly analogous, using standard enumeration tools such as FANMOD.

Patterns of interest:

Converging motifs: several nodes pointing to a common node, interpretable as multiple peripherals linked to a CPU.
Reciprocated pairs with “spoke” nodes: e.g., two products that co-purchase each other and are both linked to another.
Fully connected triangles: indicate tightly associated product bundles.

No explicit motif frequencies for the Computers subgraph are reported in primary sources; these must be computed by the researcher. Relative frequencies and motif distributions support inference on purchase associations and bundle structure.
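For the 3-node case, one way to obtain such counts is NetworkX's `triadic_census`, which classifies every node triple of a directed graph into the 16 standard triad types. This is swapped in here as a convenient stand-in for FANMOD, not the tool used in the cited study, and it does not cover 4-node motifs. The toy graph is hypothetical: three peripherals pointing at one hub.

```python
import networkx as nx

# Hypothetical toy graph: nodes 1, 2, 3 each point to hub node 0.
G = nx.DiGraph()
G.add_edges_from([(1, 0), (2, 0), (3, 0)])

# Maps each triad type code (e.g. "021U", "300") to its count.
census = nx.triadic_census(G)

# "021U" is the converging pattern a -> b <- c;
# "300" is the fully reciprocated triangle;
# "003" is the empty triad (no edges among the three nodes).
converging = census["021U"]
full_triangles = census["300"]

# Relative frequencies over all node triples.
total = sum(census.values())
rel_freq = {code: count / total for code, count in census.items() if count}
print(converging, full_triangles, rel_freq)
```

Every triple containing the hub is a converging triad, and the single triple of peripherals has no edges, so the census splits 3 to 1 between `021U` and `003`.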

4. Community Detection and Clustering

Community structure in the Computers subgraph is probed using:

Girvan–Newman edge-betweenness removal ( $O(m^2 n)$ naïve complexity)
Clauset–Newman–Moore greedy modularity maximization ( $O(n \log^2 n)$ on sparse graphs)

The process involves:

Running the detection algorithm on $G_c$
Recording optimal modularity $Q$ , number of communities $C$ , and mean community size $n/C$
Interpreting resulting communities by the dominant categories of their member ASINs (e.g., “Laptops,” “Keyboards,” “Graphics Cards”) (Srivastava, 2010)

No published community statistics exist for the Computers subgraph. Values of $Q \approx 0.4$ –$0.6$ and $C = 30$ –$80$ would be plausible for graphs of this size, but these are estimates only and must be recomputed on the extracted graph.
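The recording step can be sketched with NetworkX's community module, which provides both a Clauset–Newman–Moore implementation (`greedy_modularity_communities`) and Girvan–Newman (`girvan_newman`). The toy graph below, two 4-cliques joined by a single bridge, is a hypothetical stand-in for $G_c$.

```python
import networkx as nx
from networkx.algorithms import community

# Hypothetical toy graph: two 4-cliques connected by one bridge edge (3, 4).
G = nx.Graph()
G.add_edges_from((u, v) for u in range(4) for v in range(u + 1, 4))
G.add_edges_from((u, v) for u in range(4, 8) for v in range(u + 1, 8))
G.add_edge(3, 4)

# Clauset-Newman-Moore greedy modularity maximization.
parts = community.greedy_modularity_communities(G)
Q = community.modularity(G, parts)      # modularity of the found partition
C = len(parts)                          # number of communities
mean_size = G.number_of_nodes() / C     # mean community size n / C

# Girvan-Newman: the first item of the iterator is the first split
# obtained by removing highest-betweenness edges.
gn_first = next(community.girvan_newman(G))

print(C, round(Q, 4), mean_size)
print(sorted(sorted(c) for c in gn_first))
```

On this graph both methods separate the two cliques, giving $Q = 12/13 - 1/2 \approx 0.423$. On a graph with $10^4$ or more nodes, Girvan–Newman is likely to be impractically slow, so the greedy method is the more realistic default.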

5. Recommender System Evaluation Using GraphSAGE

Recent work (Liu et al., 3 Jun 2025) applies a modified GraphSAGE framework for inductive link prediction on the Computers co-purchase subgraph, targeting the recommendation of new products.

Feature Construction: Input vectors $x_u$ concatenate text embedding $\in \mathbb{R}^{32}$ , categorical group representation, category vector, degree, and clustering coefficient.
Layer-wise Embedding Propagation: For each layer $k$ ,
- Sample $S$ 1-hop neighbors per node
- Aggregate neighbor features (mean, GCN, or pooling)
- Update $h_u^{(k)} = \sigma( W^{(k)} [h_u^{(k-1)} \| h_{N(u)}^{(k)}])$
Link Scoring and Loss: Candidate links $(u,v)$ are scored by $s(u,v) = \sigma(\mathbf{w}_s^\mathsf{T}[h_u \| h_v])$ ; binary cross-entropy is minimized.
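To make the update and scoring formulas concrete, here is a minimal NumPy forward pass for one layer with the mean aggregator, followed by link scoring and the binary cross-entropy loss. It is a sketch under stated assumptions rather than the cited model: ReLU is assumed for the layer nonlinearity, weights are random instead of trained, bias terms are omitted, and the names `sage_layer`, `link_score`, and `bce` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sage_layer(H, neighbors, W, num_samples, rng):
    """h_u <- relu(W [h_u || mean of sampled neighbor embeddings])."""
    n, d = H.shape
    H_new = np.zeros((n, W.shape[0]))
    for u in range(n):
        nbrs = neighbors[u]
        if len(nbrs) > num_samples:  # sample S neighbors without replacement
            nbrs = rng.choice(nbrs, size=num_samples, replace=False)
        # Isolated nodes get a zero neighbor vector (an assumption).
        h_nbr = H[nbrs].mean(axis=0) if len(nbrs) > 0 else np.zeros(d)
        H_new[u] = np.maximum(0.0, W @ np.concatenate([H[u], h_nbr]))
    return H_new


def link_score(h_u, h_v, w_s):
    """s(u, v) = sigmoid(w_s^T [h_u || h_v])."""
    return sigmoid(w_s @ np.concatenate([h_u, h_v]))


def bce(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and scores p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Hypothetical toy inputs: 4 nodes, 3 input features, 2 hidden units.
neighbors = {0: [1, 2, 3], 1: [0], 2: [0], 3: []}
X = rng.normal(size=(4, 3))
W1 = rng.normal(size=(2, 6))   # maps [h_u || h_N(u)] (3 + 3) to 2 dims
w_s = rng.normal(size=4)       # scores [h_u || h_v] (2 + 2)

H1 = sage_layer(X, neighbors, W1, num_samples=2, rng=rng)
scores = np.array([link_score(H1[0], H1[1], w_s),   # treated as a positive
                   link_score(H1[1], H1[3], w_s)])  # treated as a negative
loss = bce(np.array([1.0, 0.0]), scores)
print(H1.shape, scores, loss)
```

A trainable version would typically use PyTorch Geometric's `SAGEConv` and minimize the same loss by gradient descent.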

Positive examples are observed edges, with negatives constructed by pairing the same $u$ with a random $v'$ such that $(u, v') \notin E$ . The examples are split into train/validation/test sets with a stratified 80/10/10 ratio (Liu et al., 3 Jun 2025).
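A minimal sketch of this sampling and splitting scheme is shown below, using only the standard library. One negative per positive and stratification by label are assumptions, since the source does not state the ratio or the stratification variable; the function names and the toy edge list are hypothetical.

```python
import random


def sample_negatives(edges, nodes, rng):
    """For each positive (u, v), draw one v' with (u, v') not in E and v' != u."""
    edge_set = set(edges)
    negatives = []
    for u, _ in edges:
        # Rejection sampling; assumes u is not linked to every other node.
        while True:
            v_neg = rng.choice(nodes)
            if v_neg != u and (u, v_neg) not in edge_set:
                negatives.append((u, v_neg))
                break
    return negatives


def stratified_split(positives, negatives, rng):
    """80/10/10 split applied separately to each class, then merged."""
    train, val, test = [], [], []
    for pairs, label in ((positives, 1), (negatives, 0)):
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * 8 // 10
        n_val = len(shuffled) // 10
        labeled = [(u, v, label) for u, v in shuffled]
        train += labeled[:n_train]
        val += labeled[n_train:n_train + n_val]
        test += labeled[n_train + n_val:]
    return train, val, test


rng = random.Random(0)
nodes = list(range(6))
# Hypothetical directed toy edges.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),
         (2, 4), (3, 4), (3, 5), (4, 5), (5, 0)]

negatives = sample_negatives(edges, nodes, rng)
train, val, test = stratified_split(edges, negatives, rng)
print(len(train), len(val), len(test))
```

Splitting each class separately keeps the positive-to-negative ratio identical across the three sets, which is what makes the split stratified.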

Online Inductive Inference: For new products, $x_{new}$ is computed from available metadata. “Proxy neighbor” nodes are sampled for propagation in the absence of direct edges, allowing inference on cold-start items.

Empirical results indicate that the modified GraphSAGE approach achieves superior ROC-AUC and Precision@K versus a random forest baseline; e.g., AUC(GraphSAGE) $=0.957$ versus AUC(RF) $=0.938$ , and $\text{P}@20$ (GraphSAGE) $=0.212$ versus $\text{P}@20$ (RF) $=0.174$ .
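For reference, the two metrics can be computed as in the sketch below. The source does not say whether Precision@K is computed per query node or over one global ranking; a single global ranking is assumed here, and the labels and scores are hypothetical toy values unrelated to the reported results.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half. Quadratic time; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def precision_at_k(labels, scores, k):
    """Fraction of true links among the k highest-scored candidates."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Hypothetical toy predictions.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

auc = roc_auc(labels, scores)
p_at_2 = precision_at_k(labels, scores, 2)
print(auc, p_at_2)
```

Here 6 of the 9 positive-negative pairs are ordered correctly, so the AUC is $2/3$, and one of the top two candidates is a true link, so $\text{P}@2 = 0.5$.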

6. Data Availability and Reproducibility

The SNAP Amazon meta-data, comprising both the raw edge and metadata files, is available at https://snap.stanford.edu/data/amazon-meta.html. Standard academic-use (CC BY) terms apply (Srivastava, 2010).

A full reproducibility pipeline is provided by Liu et al. (3 Jun 2025) and published at https://github.com/cse416a-fl24/final-project-l-minghao_z-catherine_z-nathan.git, including scripts for subgraph extraction, feature computation, and training of the link prediction model.

Characteristic pipeline steps include:

Metadata parsing and Computers-filtering via pandas
Subgraph induction and largest connected component extraction via NetworkX
Embedding and prediction using PyTorch Geometric (SAGEConv)
Experimentation scripts for model comparison, hyperparameter tuning, and metric computation (ROC-AUC, Precision@K).

7. Significance and Applications in Network Science

The Amazon Computers Co-Purchase Data exemplifies a style of empirical network analysis used to study both the structure of consumer-product relationships and the performance of modern graph-based recommender systems:

For motif studies (Srivastava, 2010), the induced Computers subgraph enables targeted analysis of functional modules (bundling, peripheral–core relations) within a focused product vertical.
For machine learning methods (Liu et al., 3 Jun 2025), the abundance of rich node features and evolving graph structure provides a realistic setting for inductive link prediction and evaluation of online learning algorithms.
For community detection, the subgraph offers a benchmark for algorithmic approaches to product taxonomy recovery and unsupervised categorical clustering.

A plausible implication is that the results obtained on this subgraph may generalize to other high-density, feature-rich retail product categories and inform the design of deployable, adaptive recommender systems for new item cold-start settings.