Complementary Item Retrieval

Updated 17 November 2025

Complementary item retrieval is a process that identifies and ranks items which functionally, stylistically, or contextually complete a query item, emphasizing asymmetric compatibility over mere similarity.
Approaches include dual embedding models, graph-based techniques, and generative methods that leverage multimodal data and metrics like Hit Rate@K and Recall@K to drive performance improvements.
Practical implementations address challenges such as cold-start, data noise, and explainability, integrating neural architectures, quadruplet networks, and LLM-enhanced rerankers for robust recommendations.

Complementary item retrieval refers to the task of identifying and ranking items that can functionally, stylistically, or contextually “complete” an initial item or set—such that the recommended items are compatible with, but not substitutes for, the query. This problem is central in e-commerce for applications such as “shop the look,” outfit completion, and “frequently bought together” recommendations. Current approaches leverage transaction graphs, behavioral signals, multimodal content, specialized neural architectures, and generative models, with increasing attention to modeling subjectivity, data noise, and explainable relationships. The sections below organize the field by core methodological and theoretical advances.

1. Problem Formulation and Taxonomy of Complementarity

Complementary item retrieval is defined as retrieving items $v_j$ that maximize an implicit or explicit compatibility score with a query item $v_i$, where compatibility is asymmetric and distinct from similarity. Two items are similar if they are mutual substitutes (e.g., “black sofa” vs. “brown sofa”), while they are complementary if they are distinct but functionally or contextually compatible (e.g., “mattress” and “mattress pad”) (Kvernadze et al., 2022). In practical implementations, this distinction is realized by deploying different metrics: intra-space metrics for similarity, and cross-space (dual-embedding) metrics or compatibility classifiers for complementarity.
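The contrast between intra-space and cross-space scoring can be illustrated with a minimal sketch. The item names, the two-dimensional vectors, and the helper functions below are hypothetical and chosen only to make the asymmetry visible; they are not taken from any cited system.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy embeddings: every item has an input (query-side) vector
# and a separate output (candidate-side) vector.
INPUT_EMB = {
    "mattress": [1.0, 0.0],
    "mattress_pad": [0.0, 0.2],
    "sofa": [0.0, 1.0],
}
OUTPUT_EMB = {
    "mattress": [0.1, 0.0],
    "mattress_pad": [0.9, 0.1],
    "sofa": [0.0, 0.3],
}

def similarity_score(i, j):
    """Intra-space score: symmetric in i and j."""
    return dot(INPUT_EMB[i], INPUT_EMB[j])

def complement_score(i, j):
    """Cross-space score u_i^T v_j: need not equal complement_score(j, i)."""
    return dot(INPUT_EMB[i], OUTPUT_EMB[j])

def rank_complements(query, k):
    """Return the k candidates with the highest cross-space score."""
    scored = [(complement_score(query, j), j) for j in OUTPUT_EMB if j != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for _, item in scored[:k]]
```

With these toy vectors, the score from "mattress" to "mattress_pad" is high while the reverse direction is zero, which is the kind of directionality a single shared embedding space cannot express.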

Multiple operationalizations exist:

Bipartite or item-item graphs based on co-purchase, co-view, or co-cart event logs (Anghinoni et al., 10 Jun 2025, Luo et al., 2024).
Multimodal embeddings that fuse item content (image, text, structured metadata) for cold-start scenarios (Wang et al., 29 Jul 2025, Bibas et al., 2023).
Scene-aware or set-to-item completion, where a partial collection needs to be “completed” with a compatible complementary item (Lin et al., 2019, Wang et al., 2023).
Conditional generative modeling, where one seeks to sample or retrieve items from an estimated $P(\mathrm{item}_2 \mid \mathrm{item}_1)$ without collapsing to mere similarity (Huynh et al., 2018, Attimonelli et al., 2024, Kumar et al., 2019).

The evaluation of complementary item retrieval usually relies on metrics such as Hit Rate@K, Recall@K, NDCG@K, Mean Reciprocal Rank, diversity indices (when diversity is a goal), as well as downstream measures (e.g., A/B test uplifts in purchase rate) (Kvernadze et al., 2022, Anghinoni et al., 10 Jun 2025, Xu et al., 22 Jul 2025).
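For reference, the ranking metrics named above can be computed per query as in the following sketch. It assumes binary relevance (an item either is or is not a held-out complement); the function names are illustrative rather than drawn from a specific library, and the cited papers may differ in details such as candidate pool construction.

```python
import math

def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these per-query values over a test set gives Hit Rate@K, Recall@K, NDCG@K, and Mean Reciprocal Rank respectively.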

2. Embedding, Graph, and Dual-Space Methods

A principal axis of methodological development contrasts single-embedding (similarity-oriented), dual-space, and graph-based models.

Dual Embedding (SGNS) Models: Embedding approaches extending word2vec/SGNS concepts to items have established that learning both input and output embeddings captures complementarity via cross-space dot-products, i.e., $u_i^\top v_j$ for $i$’s input embedding and $j$’s output embedding. This decouples symmetry (for substitutes) from asymmetry (for complements) and allows directionality in retrieval. SGNS losses with co-purchase positive pairs and hollowed-out negative sampling can be directly tuned for complementarity by including synthetic positives (constructed from click, text, or image similarity) to combat cold-start (Kvernadze et al., 2022). Large-scale implementations employ HNSW or IVFPQ indexes for real-time response.
Quadruplet Networks: Quadruplet architectures explicitly distinguish similar, complementary, and negative item pairs, enforcing ordering in latent space: similar items are closest, complementary are close but outside the similarity margin, and negatives are furthest (Mane et al., 2019). This resolves the degeneracy in which purely dyadic or triplet objectives cannot separate similarity and functional complementarity.
Graph-Based Retrieval: Graph projection methods generate directed item-item graphs from user-item bipartite graphs via random-walk or temporal weighting, with outgoing edge weights encoding both statistical co-occurrence and temporal ordering to enforce complementarity, not just similarity (Anghinoni et al., 10 Jun 2025). Such nonparametric approaches consistently outperform sequence-based and generic GNN-based recommenders (average +43% recall gain over sequential baselines). Spectral-based GNNs further distinguish complementary from similar relationships by explicitly separating low-frequency (relevance) and mid-frequency (dissimilarity) bands in spectral graph convolution, with additional attention mechanisms for adaptive fusion (Luo et al., 2024).
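The latent-space ordering that quadruplet networks enforce (similar closest, complementary next, negatives furthest) can be written as a pair of hinge terms. The sketch below is one plausible formulation under assumed margins `m_sim` and `m_comp` and Euclidean distance; it is not claimed to be the exact loss of Mane et al. (2019).

```python
import math

def quadruplet_loss(anchor, similar, complement, negative,
                    m_sim=0.5, m_comp=0.5):
    """Hinge loss encouraging d(a, s) + m_sim <= d(a, c) and
    d(a, c) + m_comp <= d(a, n). Margins are illustrative assumptions."""
    d_sim = math.dist(anchor, similar)
    d_comp = math.dist(anchor, complement)
    d_neg = math.dist(anchor, negative)
    # Similar items must sit closer to the anchor than complements.
    sim_term = max(0.0, d_sim - d_comp + m_sim)
    # Complements must sit closer to the anchor than negatives.
    comp_term = max(0.0, d_comp - d_neg + m_comp)
    return sim_term + comp_term
```

The loss is zero only when all three distance bands are separated by at least the margins, which is what lets a single space keep substitutes and complements apart.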

Approach	Complementarity Modeling	Scalability
SGNS dual space	Asymmetric (cross-embedding)	ANN index, O(log
Quadruplet	Similar vs. complement vs. negative	O(embedding size)
Bipartite Projection	Explicit item-item/graph edge	Sparse matrix ops
Spectral GNN	Laplacian frequency, GNN	Mini-batch training
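To make the bipartite projection row concrete, the sketch below builds a directed item-item graph from timestamped user histories: an edge i → j gains weight whenever some user acquired j strictly after i, and outgoing weights are normalized per source item. This is a simplified stand-in for illustration; the random-walk and temporal weighting schemes of Anghinoni et al. are not reproduced here, and the data layout is an assumption.

```python
from collections import defaultdict

def project_directed(user_histories):
    """Project {user: [(timestamp, item), ...]} onto a directed item graph.

    Returns {source_item: {target_item: normalized_weight}}.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for events in user_histories.values():
        ordered = sorted(events, key=lambda event: event[0])
        for idx, (t_i, item_i) in enumerate(ordered):
            for t_j, item_j in ordered[idx + 1:]:
                # Only strictly later, distinct items create an edge i -> j.
                if item_j != item_i and t_j > t_i:
                    counts[item_i][item_j] += 1.0
    graph = {}
    for src, neighbours in counts.items():
        total = sum(neighbours.values())
        graph[src] = {dst: w / total for dst, w in neighbours.items()}
    return graph

# Hypothetical toy data: most users buy the case after the phone.
histories = {
    "u1": [(1, "phone"), (2, "case")],
    "u2": [(1, "phone"), (3, "case"), (4, "charger")],
    "u3": [(5, "case"), (6, "phone")],
}
graph = project_directed(histories)
```

Because edges follow purchase order, the weight from "phone" to "case" differs from the weight in the opposite direction, so retrieval can respect which item is the natural follow-up.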

3. Multimodal and Set-to-Item Retrieval

Complementary retrieval must operate in situations where behavioral data are sparse or unavailable (cold start). Multimodal models address this by fusing visual (CNN backbone), textual (BERT or taxonomy), and structured side information.

Multi-Modal Hierarchical Aggregation: MMSC fuses frozen foundational image-text encodings (e.g., BLIP-2 or CLIP) with behavior-derived embeddings denoised via meta-path contrastive learning, integrating the two by gating at both semantic and task levels (Wang et al., 29 Jul 2025). LLMs further refine noisy behavior-induced relationships. This approach yields +39% improvement over best baselines in complementary retrieval, with substantial gains maintained under cold-start when items have no behavioral edges.
Set-to-Item Subspace Attention: Methods for outfit completion and scene-based retrieval circumscribe the retrieval space by set-to-item compatibility, using subspace attention networks to compute attuned embeddings for each source-target category pair, weighting style- or function-relevant subspaces, and employing outfit-level margin ranking losses (Lin et al., 2019). For scene-aware retrieval, a Flexible Bidirectional Transformer generates complementary visual embeddings auto-regressively from a random masked scene object set, with both similarity and complementarity terms controlled by explicitly learned parameters (Wang et al., 2023).
Sequence-to-Sequence Style Encoding: Sequence-to-sequence frameworks with attention leverage large corpora of “outfit of the day” online posts annotated for color/pattern/type triplets, learning to map a sequence of input items to a sequence describing the complement, enabling rich style-aware recommendation that outperforms pattern-mining baselines by ~28% in Top-1 accuracy and ~24% in reciprocal rank (Dalmia et al., 2018).
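The idea of gating between content-derived and behavior-derived embeddings, with a content-only fallback for cold items, can be sketched as follows. This is a deliberately simplified, hypothetical single-gate version: the function name, the scalar gate, and its parameters are assumptions, and it does not reproduce the semantic-level and task-level gating of MMSC.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(content, behavior, gate_weights, gate_bias=0.0):
    """Blend a content embedding with a behavior embedding.

    A scalar gate g in (0, 1) is computed from the concatenated vectors;
    the result is g * content + (1 - g) * behavior. When `behavior` is
    None (cold-start item), the content embedding is returned unchanged.
    """
    if behavior is None:
        return list(content)
    features = list(content) + list(behavior)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, features)) + gate_bias)
    return [gate * c + (1.0 - gate) * b for c, b in zip(content, behavior)]
```

In a trained model the gate parameters would be learned; here they are passed in explicitly so the cold-start fallback path is easy to see.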

4. Generative and Adversarial Approaches

Generative techniques capture the conditional distribution of plausible complementary items, crucial for out-of-distribution fashion and novel design tasks:

Conditional GANs (cGANs): A cGAN is trained to model $P_{\text{data}}(\mathrm{bottom} \mid \mathrm{top})$, generating visually realistic candidate items for a given item query (Kumar et al., 2019). Novel regularization schemes such as pixel-wise MSE, DCT-based perceptual losses, and randomized label-flip supplement standard adversarial objectives, improving robustness to unaligned web data. GAN-generated feature vectors can be matched to catalog items by nearest-neighbor search in feature space (e.g., Inception-v4+PCA), delivering both relevant and diverse recommendations verified in human expert studies.
Feature-Space Transformers: CRAFT trains a conditional feature transformer in a (frozen) feature embedding space, learning to sample diverse, plausible embeddings of complementary items via an adversarial discriminator (Huynh et al., 2018). This method generalizes beyond memorized nearest neighbors and especially improves retrieval quality and diversity for long-tail (rare) queries.
Paired Image-to-Image Translation for Composed Retrieval: GeCo adopts a two-stage paired translation pipeline: a cGAN synthesizes a “template” image compatible with the query, and a downstream retrieval model embeds the query/template pair for nearest-neighbor search, showing superior ranking and AUC compared to item-only or regularized generative baselines (Attimonelli et al., 2024). Injecting noise at the generator mid-layer augments template diversity and retrieval performance, critical in low-data regimes.
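The step shared by these generative approaches, matching a generated feature vector against real catalog items, reduces to a nearest-neighbor lookup. The sketch below uses brute-force cosine similarity over hypothetical two-dimensional features; the item names and vectors are invented for illustration, and a production system would more plausibly use an approximate index over higher-dimensional features.

```python
import math

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def nearest_catalog_items(generated, catalog, k):
    """Return the k catalog item ids whose features best match `generated`."""
    ranked = sorted(
        catalog.items(),
        key=lambda entry: (-cosine(generated, entry[1]), entry[0]),
    )
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical generator output and catalog features.
generated_feature = [1.0, 0.1]
catalog = {
    "jeans": [0.9, 0.0],
    "skirt": [0.0, 1.0],
    "chinos": [0.7, 0.7],
}
top_matches = nearest_catalog_items(generated_feature, catalog, 2)
```

Sampling several generated vectors per query and merging their neighbor lists is one way such pipelines can trade relevance against diversity.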

5. Label Quality, Denoising, and LLMs

Noisy or subjective labels pose challenges for reliable complementary item retrieval. Several frameworks have recently advanced label cleaning, expansion, and explanation:

Gaussian Embedding and Independence Testing: NEAT proposes representing each item as a Gaussian distribution over the latent embedding, decomposing co-purchases into mean (complementary signal) and covariance (label noise). Retrieval is performed by scoring mean vector cosine similarity, while evaluation uses a $\chi^2$ independence test on transaction data to filter truly dependent (complementary) pairs, yielding consistent 20–25% improvements over Item2Vec and BPR baselines (Ma et al., 2022).
Knowledge-Augmented Relation Learning (KARL): KARL fuses active learning batch selection with LLM-based function-label annotation. The classifier ensemble selects query pairs by uncertainty (either margin or Query-by-Committee), LLMs (GPT-4o-mini) label challenging pairs by expert-curated function categories, and the classifiers are iteratively retrained (Yamasaki et al., 6 Sep 2025). Large performance gains are observed especially in out-of-distribution (OOD) settings (+37% macro-F1), where LLM-driven diversity brings new knowledge into the training pool; in-distribution (ID) gains saturate quickly, and over-exploration leads to a decline. This suggests adopting dynamic diversity strategies: aggressive in OOD, conservative in ID regimes.
Scenario-Based Generation: Item usage scenarios generated by LLMs provide a textual grounding for why items are complementary and can be used for future pipeline construction and explainable recommendations. A high (83%) fraction of scenarios generated for 300 leaf categories passed human plausibility checks (Hagiri et al., 9 Oct 2025).
LLM-Enhanced Reranking Pipelines: Two-stage LLM rerankers applied to otherwise off-the-shelf graph neural network recommenders can substantially increase both accuracy (Hit@1 up to +200%) and diversity (+5% vocabulary in top 1) in candidate lists (Xu et al., 22 Jul 2025). This configuration uses one agent for diversity-aware selection and a second for accuracy-conscious reranking, with domain-optimized prompt designs. LLMs particularly help in surfacing long-tail, semantically complementary items not evident in the observed graph.
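The margin-based uncertainty selection used in active-learning loops such as KARL can be sketched in a few lines: pairs whose top two predicted class probabilities are closest are sent for labeling first. The function names, the three-class probability layout, and the example pairs are assumptions for illustration; the LLM annotation and retraining steps are outside the scope of this sketch.

```python
def margin_uncertainty(probs):
    """Gap between the two largest class probabilities (small = uncertain)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

def select_for_labeling(pair_probs, batch_size):
    """Pick the `batch_size` item pairs with the smallest margin."""
    ranked = sorted(
        pair_probs.items(),
        key=lambda entry: (margin_uncertainty(entry[1]), entry[0]),
    )
    return [pair for pair, _ in ranked[:batch_size]]

# Hypothetical classifier outputs over three relation classes per pair.
pair_probs = {
    ("desk", "lamp"): [0.6, 0.3, 0.1],
    ("desk", "chair"): [0.9, 0.05, 0.05],
    ("pen", "ink"): [0.4, 0.35, 0.25],
}
batch = select_for_labeling(pair_probs, 2)
```

A Query-by-Committee variant would replace the margin with a disagreement measure across ensemble members while keeping the same selection loop.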

6. Practical Considerations: Cold-Start, Diversity, and Evaluation

Robustness to item cold-start, model scalability, and objective diversity are active concerns:

Cold-Start Handling: Multi-modal and category-aware methods (MMSC, ALCIR) utilize fixed or relationally fine-tuned visual/text encoders for items unseen in behavioral graphs, mean-pooling pretrained representations or translating embeddings into the target category’s subspace (Wang et al., 29 Jul 2025, Bibas et al., 2023). Cold items are matched via content similarity and then embedded for retrieval, with observed 24–41% gains under ablation.
Diversity: True user and expert preference for “interesting variety” is addressed by noise-injected generative models (CRAFT, c⁺GAN), attention to mid-frequency spectral components (SComGNN), and explicit selection or diversity-biased LLM rerankers (Xu et al., 22 Jul 2025).
Human-in-the-Loop and Label Curation: Both the NEAT independence test and LLM-based scenario or label generation reduce the impact of spurious behavioral signals—commonly encountered in loosely connected, high-volume transaction data—and produce more trustworthy evaluation/candidate sets (Ma et al., 2022, Hagiri et al., 9 Oct 2025, Yamasaki et al., 6 Sep 2025).
Scalability: Embedding and graph-projection approaches emphasize sparse matrix computations, precomputed nearest-neighbor indexes, and decoupling feature construction from retrieval, thus supporting million-scale catalogs and real-time serving (Kvernadze et al., 2022, Anghinoni et al., 10 Jun 2025).
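The independence-test filtering mentioned under label curation can be illustrated with a standard Pearson $\chi^2$ test on a 2×2 co-purchase contingency table. This is a generic sketch, not NEAT's exact procedure: the count-based interface, the 0.05 significance level (critical value 3.841 at one degree of freedom), and the extra requirement that co-purchases exceed their expected count are assumptions made for the example.

```python
def chi_square_2x2(n_both, n_a, n_b, n_total):
    """Pearson chi-square statistic for co-occurrence of items A and B.

    n_both: baskets containing A and B; n_a / n_b: baskets containing
    A / B at all; n_total: all baskets.
    """
    a = n_both                        # A and B
    b = n_a - n_both                  # A without B
    c = n_b - n_both                  # B without A
    d = n_total - n_a - n_b + n_both  # neither
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom

def is_dependent_pair(n_both, n_a, n_b, n_total, critical=3.841):
    """True if A and B co-occur significantly more often than chance."""
    expected = n_a * n_b / n_total
    stat = chi_square_2x2(n_both, n_a, n_b, n_total)
    return stat > critical and n_both > expected
```

For example, 30 joint baskets out of 1000, with A in 100 baskets and B in 120, is well above the 12 joint baskets expected under independence, so the pair would be retained as a candidate complement.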

Challenge	Effective Approaches	Gains Reported
Cold-start	Multimodal, category/subspace translation	+24–41% M@10 (Wang et al., 29 Jul 2025)
Diversity	cGAN/CRAFT, spectral mid-frequency, LLM rerankers	+3–5% NDCG/diversity (Xu et al., 22 Jul 2025)
Label reliability	NEAT eval, LLM/KARL labels, scenario generation	+20–37% F1 (Ma et al., 2022, Yamasaki et al., 6 Sep 2025)
Scalability	ANN indexes, sparse graphs, decoupled stages	Real-time serving

7. Current Directions and Limitations

Recent work has highlighted several persistent challenges and directions:

Subjectivity and Contextuality: Complementarity is inherently subjective; user segments may disagree on what “goes well with” a given item, and explainable scenarios or function-based label hierarchies are needed (Hagiri et al., 9 Oct 2025, Yamasaki et al., 6 Sep 2025).
Noisy and Biased Behavior Data: Reliance on co-purchase or graph-based signals can propagate spurious relations (seasonal, trend-driven, or simply popular), which motivates the integration of generative, self-supervised, and knowledge-augmented relabeling approaches.
Integration of LLMs: LLMs have demonstrated substantial value in refining candidate pools, denoising label sets, and providing textual explanations, but require cost-aware design due to inference step latency and reliance on domain-specific prompts. Augmenting graph and embedding approaches rather than replacing them remains standard (Xu et al., 22 Jul 2025, Yamasaki et al., 6 Sep 2025).
Generalization Across Domains: Cold-start and OOD settings reveal model weaknesses when label or training data are too narrowly distributed, suggesting a need for dynamic diversity management in label selection and active learning rounds (Yamasaki et al., 6 Sep 2025).
Evaluation and Benchmarking: Direct comparison across studies is complicated by differences in dataset size, noise, preprocessing, label confidence, and candidate pool selection (random negatives, filtered hard negatives, semi-synthetic scenarios). Trustworthy evaluation protocols (e.g., NEAT’s independence tests or LLM-augmented positive selection) address this gap.

The landscape of complementary item retrieval therefore combines advances in semantic embedding, generative modeling, spectral graph analysis, active and self-supervised learning, and LLM-augmented denoising. Methodological selection depends on catalog scale, data regime (behavior-rich vs. cold start), explainability needs, and the operational trade-offs between accuracy, diversity, and subjectivity.