Multi-Embedding Retrieval Framework

Updated 1 July 2025
  • Multi-Embedding Retrieval Framework is a retrieval architecture that uses multiple conditioned embeddings to represent diverse aspects of user interests.
  • It employs deep differentiable clustering and explicit topic association to generate distinct embeddings for improved long-tail coverage and balanced engagement.
  • The framework has been validated in production, boosting recommendation recall and diversifying content for less-active and long-tail user segments.

A multi-embedding retrieval framework is a retrieval architecture that represents each user or item with multiple embeddings (distinct vector representations), each conditioned on an independent aspect such as user interests, semantic clusters, or explicit labels. Retrieving with several conditioned representations in parallel captures the diversity of user needs and content in large-scale recommender systems more effectively than a single vector can. At Pinterest, such a framework has been deployed in production to address limitations of traditional two-tower models, particularly coverage of long-tail interests and the trade-off between engagement and feed diversity (2506.23060).

1. Framework Foundation and Architectural Modules

The Pinterest multi-embedding retrieval framework is designed around the principle of conditioned user representation learning. Rather than representing a user with a single embedding, the system constructs multiple user embeddings, each conditioned either on implicit interests (learned from behavioral data) or explicit interests (user-declared topics or follows). Retrieval is performed in parallel for all embeddings, and the candidate sets are merged before passing to subsequent personalization stages.

  • Implicit Interest Conditioned Model: Employs a Differentiable Clustering Module (DCM) to discover $K_{im}$ latent interest clusters from historical user engagements (e.g., saves, repins).
    • DCM applies an end-to-end, attention-like clustering mechanism based on Capsule Networks, with architectural enhancements such as Validity-Aware Farthest Point Initialization and Single-Assignment Routing, ensuring each user’s diverse interests are represented with minimal redundancy.
  • Explicit Interest Conditioned Model: Implements Conditional Retrieval (CR), where explicit interests (e.g., followed topics) guide the construction of $K_{ex}$ dedicated user embeddings, each encoding affinity for a discrete, user-specified interest.
  • The system maintains separate embedding towers for users and items, a crossover network for feature interactions, and routes retrieval through both DCM and CR, providing diverse candidate sets for subsequent ranking and blending; a retrieval-flow sketch follows this list.
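
To make the flow concrete, the following is a minimal Python sketch of the per-condition retrieval step. It assumes a brute-force dot-product search in place of the production ANN index, and the function and variable names (retrieve_for_user, item_matrix) are illustrative rather than Pinterest's actual API.

```python
import numpy as np

def retrieve_for_user(user_embeddings: list[np.ndarray],
                      item_matrix: np.ndarray,
                      k: int) -> list[list[int]]:
    """Run one top-k lookup per conditioned user embedding.

    user_embeddings: the K_im implicit plus K_ex explicit embeddings
    item_matrix: (num_items, d) array of item embeddings; brute-force
        dot-product search stands in for the production ANN index
    Returns one ranked candidate list per condition; the lists are
    merged downstream (see the round-robin sketch in Section 3).
    """
    results = []
    for emb in user_embeddings:
        scores = item_matrix @ emb  # affinity of every item to this condition
        results.append(list(np.argsort(-scores)[:k]))
    return results
```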

2. Modeling Implicit and Explicit Interests

Implicit Interests

  • DCM clusters features derived from user-engaged items using dynamic routing; inputs are processed as follows (a code sketch of the routing update appears after this list):
    • Each engagement (item) is mapped to a high-dimensional feature vector.
    • Routing weights $b_{ij}$ are computed as $\mathrm{softmax}(c_j^\top S e_i)$, with $c_j$ the cluster centroids, $e_i$ the item features, and $S$ a shared projection.
    • The resulting clusters yield condition-specific user embeddings capturing granular, data-driven user preferences.
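
The following numpy sketch implements the routing update just described, using the squash nonlinearity defined in Section 6. The number of routing iterations and all dimensions are illustrative, and production refinements such as Validity-Aware Farthest Point Initialization and Single-Assignment Routing are omitted.

```python
import numpy as np

def squash(v: np.ndarray) -> np.ndarray:
    """squash(v) = (||v||^2 / (1 + ||v||^2)) * v / ||v||."""
    norm = np.linalg.norm(v)
    return (norm**2 / (1.0 + norm**2)) * v / (norm + 1e-9)

def dcm_routing(E: np.ndarray, S: np.ndarray, C: np.ndarray,
                num_iters: int = 3) -> np.ndarray:
    """Dynamic routing of engaged-item features into interest clusters.

    E: (n_items, d_e) item feature vectors e_i
    S: (d_c, d_e) shared projection
    C: (K_im, d_c) initial cluster centroids c_j
    Returns the updated centroids, i.e. the implicit conditioned embeddings.
    """
    projected = E @ S.T                    # S e_i for every item: (n_items, d_c)
    for _ in range(num_iters):
        logits = projected @ C.T           # c_j^T S e_i: (n_items, K_im)
        b = np.exp(logits - logits.max(axis=1, keepdims=True))
        b /= b.sum(axis=1, keepdims=True)  # b_ij = softmax over clusters j
        # c_j <- squash(sum_i b_ij * S e_i)
        C = np.stack([squash(b[:, j] @ projected) for j in range(C.shape[0])])
    return C
```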

Explicit Interests

  • Each explicit user interest (e.g., a topic from Pinterest’s taxonomy) is mapped to an embedding via a learnable topic embedding table.
  • During training and inference, item embeddings are associated with the explicit condition that produced them: the system logs which topic led to each engagement, ensuring precise assignment of training targets for each explicit embedding (a conditioning sketch follows).
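
As a rough illustration of explicit conditioning, the following PyTorch sketch pairs a learnable topic embedding table with a hypothetical user tower; the module structure and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ExplicitConditionedUserTower(nn.Module):
    """Produces one user embedding per explicitly declared topic."""

    def __init__(self, num_topics: int, topic_dim: int,
                 user_feat_dim: int, out_dim: int):
        super().__init__()
        # Learnable topic embedding table, one row per taxonomy topic.
        self.topic_table = nn.Embedding(num_topics, topic_dim)
        self.mlp = nn.Sequential(
            nn.Linear(user_feat_dim + topic_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, user_feats: torch.Tensor,
                topic_ids: torch.Tensor) -> torch.Tensor:
        """user_feats: (B, user_feat_dim); topic_ids: (B, K_ex).
        Returns (B, K_ex, out_dim): one conditioned embedding per topic."""
        topics = self.topic_table(topic_ids)               # (B, K_ex, topic_dim)
        feats = user_feats.unsqueeze(1).expand(-1, topics.size(1), -1)
        return self.mlp(torch.cat([feats, topics], dim=-1))
```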

The framework thus unites implicit discovery (from inference over behavioral logs) and explicit signals (from user self-declared interests), creating a richer and more diverse user representation space.

3. Conditioned User Representation Learning and Association

The retrieval function is reframed as $f(i \mid u, c)$: the probability of user $u$ engaging with item $i$ under condition $c$ (either an implicit cluster or an explicit topic). This function is realized using a two-tower model with feature crossing, then specialized as follows:

  • For each candidate item, the affinity score is computed with each conditioned embedding.
    • Implicit: For a user’s $K_{im}$ DCM interests, retrieve separately from each embedding.
    • During training, the loss is applied only to the "best-matching" embedding for each engagement (winner-takes-all), determined by maximizing the affinity with the target item's embedding.
  • For explicit interests, retrieval and supervision are directly paired to the corresponding explicit topic embeddings.

The final candidate pool is formed by round-robin selection and deduplication across all embeddings, ensuring both coverage and diversity; a minimal sketch of this merge follows.
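
A minimal sketch of this merge, assuming each conditioned retrieval returns a ranked candidate list; the function name and signature are illustrative.

```python
def round_robin_merge(candidate_lists: list[list[int]],
                      pool_size: int) -> list[int]:
    """Interleave ranked candidate lists one rank at a time,
    skipping duplicates, until the pool is full or lists are exhausted."""
    merged: list[int] = []
    seen: set[int] = set()
    max_rank = max((len(lst) for lst in candidate_lists), default=0)
    for rank in range(max_rank):
        for lst in candidate_lists:        # one list per conditioned embedding
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
                if len(merged) == pool_size:
                    return merged
    return merged
```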

4. Synergy and Complementarity of Implicit and Explicit Interests

Implicit and explicit conditioned retrieval are complementary:

  • Implicit model (DCM): Captures fine-grained, recent, and active user interests through data-driven latent clusters; excels with active or core users and ongoing engagement.
  • Explicit model (CR): Recovers omitted, abandoned, or less active interests; excels for users who do not exhibit these preferences in recent engagement (e.g., new users, returning or infrequent users), and is robust to the sparsity or drift of behavioral signals.

The overlap in candidates between the two conditioned models is reported as 3.2%, demonstrating empirical complementarity. Joint deployment brings measurable improvements in both engagement and diversity, as evidenced by metric gains in both "repins" and adopted Pincepts (a fine-grained diversity metric on Pinterest).

5. Empirical Results and Practical Impact

Offline Results

  • DCM achieves the highest HR@100 and HR@1000 among implicit modeling strategies; CR with source interest association notably outperforms item-attribute-based explicit modeling [see Table 1 in the paper].
  • Case studies reveal that explicit-conditioned retrieval can recover interests missed by implicit clustering (e.g., an interest in "education" not included by DCM clusters).

Online A/B Testing

  • Adding the explicit CR module increased Home Feed repins by 0.98% overall and 3.04% for non-core users.
  • DCM surpassed self-attention, MIND, and token-based approaches for both overall and core user segments.
  • The combined multi-embedding framework achieved a +1.09% overall lift in Home Feed repins and +0.81% in diversity (A-Pincepts), with the greatest gains in less active and long-tail user segments.

Scalability and Deployment

  • The framework is in full production on the Pinterest home feed.
  • For each recommendation request, all conditioned embeddings are used in parallel to retrieve candidates from ANN-indexed item vectors, then merged with round-robin and deduplication.
  • Serving latency remains within acceptable production thresholds despite the increased modeling complexity, with p90 request time rising from 150 ms to 205 ms.

6. Technical Specification and Mathematical Details

  • Retrieval Probability: $f(i \mid u, c) \propto \exp(\phi(u, c)^\top \psi(i))$, where $\phi(\cdot)$ and $\psi(\cdot)$ are the user and item encoders.
  • Routing weights (DCM): $b_{ij} = \mathrm{softmax}(c_j^\top S e_i)$; centroids are updated via $c_j \leftarrow \mathrm{squash}\bigl(\sum_i b_{ij} S e_i\bigr)$, with $\mathrm{squash}(v) = \frac{\|v\|_2^2}{1+\|v\|_2^2} \frac{v}{\|v\|_2}$.
  • Loss: A sampled softmax loss is applied per embedding, updating only the embedding given by $\arg\max_j (o_u^j)^\top o_{y_i}$ for each engaged item (a training-loss sketch appears after the table below).
  • Feature Crossing: Implemented via a Deep Hierarchical Ensemble Network (DHEN), layering a Transformer, MLP, and MaskNet for flexible feature interactions.
Summary of the online A/B results:

| Model Component | HF Repins (All) | HF Repins (Non-core) | Diversity (A-Pincepts, All) | Diversity (A-Pincepts, Non-core) |
|---|---|---|---|---|
| Baseline | 0.00% | 0.00% | 0.00% | 0.00% |
| CR (Explicit, +filter) | +0.98% | +3.04% | +0.32% | +1.03% |
| DCM (Implicit) | +0.86% | — (core: +1.23%) | +0.46% | — (core: +0.87%) |
| Combined Framework | +1.09% | — | +0.81% | — |
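
To make the winner-takes-all supervision from Section 3 concrete, here is a minimal PyTorch sketch of the per-engagement loss. It scores explicitly provided negative embeddings as a stand-in for the paper's sampled softmax, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def winner_takes_all_loss(user_embs: torch.Tensor,
                          pos_item: torch.Tensor,
                          neg_items: torch.Tensor) -> torch.Tensor:
    """user_embs: (K, d) conditioned embeddings o_u^j for one user
    pos_item:  (d,)   embedding o_{y_i} of the engaged item
    neg_items: (N, d) sampled negative item embeddings
    """
    # Select the embedding with the highest affinity to the target item:
    # argmax_j (o_u^j)^T o_{y_i}.
    affinities = user_embs @ pos_item
    winner = user_embs[affinities.argmax()]

    # Softmax over the positive plus negatives, scored with the winner only,
    # so gradients update just that conditioned embedding.
    logits = torch.cat([(winner @ pos_item).unsqueeze(0),
                        neg_items @ winner])
    target = torch.zeros(1, dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```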

7. Significance and Outlook

This multi-embedding retrieval framework demonstrates that modeling user interests with multiple, independently conditioned embeddings—derived both from latent behavioral clusters and explicit user signals—substantially enhances the recall, diversity, and personalization of large-scale recommender systems. Explicit and implicit conditionings capture distinct and complementary aspects of user preferences, with measured synergy benefiting various user cohorts. The architecture and techniques, including deep end-to-end differentiable clustering and supervised condition association, set a scalable blueprint for other retrieval-dominated recommendation platforms seeking to balance engagement with content diversity and long-tail coverage (2506.23060).

References

  • arXiv:2506.23060