
MISS: Multi-modal Indexing & Lifelong Search

Updated 21 August 2025
  • The paper presents a novel multi-modal index tree that uses recursive k-means clustering to hierarchically encode item embeddings for efficient candidate retrieval.
  • It integrates lifelong sequential behavior modeling with attention mechanisms through Co-GSU and MM-GSU to capture diverse user interests.
  • Performance evaluations demonstrate up to 59.38% hierarchical recall gains and significant improvements in large-scale industrial recommendation systems.

Multi-modal Indexing and Searching with Lifelong Sequence (MISS) refers to a unified framework that integrates multi-modal information (e.g., textual, visual, and behavioral signals) and lifelong user behavioral sequences into the retrieval stage of large-scale recommendation systems. MISS is engineered to address the dual challenges of exploiting extensive user history and leveraging diverse content modalities for precise and efficient retrieval, particularly in industrial-scale environments such as video and product recommendation platforms (Guo et al., 20 Aug 2025).

1. Tree-based Multi-modal Index Structures

The core indexing mechanism in MISS is the multi-modal index tree—a hierarchical binary tree constructed over item multi-modal embeddings. The construction process is as follows:

  • At each node, k-means clustering is recursively applied to the set of items represented by their multi-modal embeddings (pre-trained to maximize semantic alignment via objectives such as InfoNCE).
  • Each split divides items into two clusters, forming left/right subtrees, until leaf nodes correspond to single items.
  • Each non-leaf node stores a mean-pooled embedding of all leaf items in its subtree:

    $$z_i = \frac{1}{|L(i)|} \sum_{j \in L(i)} c_{\pi(j)}$$

    where $L(i)$ denotes the leaf nodes under node $i$ and $c_{\pi(j)}$ is the multi-modal embedding of item $j$.

  • Leaf node embeddings are set directly as their item embeddings.

This index tree reflects a hierarchy of item similarity that is sensitive to both content and behavioral interaction cues.
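
To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of recursive 2-means partitioning with mean-pooled node embeddings. It assumes item embeddings are available as a NumPy array, uses scikit-learn's KMeans for the binary splits, and the `TreeNode` class and function names are hypothetical helpers.

```python
import numpy as np
from sklearn.cluster import KMeans


class TreeNode:
    """Node of the multi-modal index tree (hypothetical helper class)."""

    def __init__(self, item_ids, embedding, left=None, right=None):
        self.item_ids = item_ids    # leaf items covered by this node
        self.embedding = embedding  # mean-pooled multi-modal embedding
        self.left = left
        self.right = right


def build_index_tree(item_ids, item_embs):
    """Recursively split items with 2-means until each leaf holds one item."""
    node_emb = item_embs.mean(axis=0)        # mean-pool embeddings of covered items
    if len(item_ids) == 1:
        return TreeNode(item_ids, node_emb)  # leaf embedding = the item embedding itself
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(item_embs)
    if labels.min() == labels.max():         # guard against a degenerate split
        labels = np.arange(len(item_ids)) % 2
    left_mask = labels == 0
    left = build_index_tree(
        [i for i, m in zip(item_ids, left_mask) if m], item_embs[left_mask])
    right = build_index_tree(
        [i for i, m in zip(item_ids, left_mask) if not m], item_embs[~left_mask])
    return TreeNode(item_ids, node_emb, left, right)


# Example: 500 items with 64-dimensional pre-trained multi-modal embeddings.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(500, 64))
root = build_index_tree(list(range(500)), item_embeddings)
```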

During candidate retrieval, beam search traverses the tree, rapidly pruning irrelevant branches based on the multi-modal similarity between user query (or interests) and node embeddings. The tree is trained with a binary cross-entropy objective at each node, using pseudo labels to align the tree with retrieval objectives:

$$\mathrm{BCE}(\hat{y}_n, \hat{y}_n^t) = -\hat{y}_n \log \hat{y}_n^t - (1 - \hat{y}_n) \log (1 - \hat{y}_n^t)$$

with $\hat{y}_n = I\left(\sum_{i \in L(n)} y_{\pi(i)} \geq 1\right)$ representing the presence of relevant items under node $n$.
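
Continuing the sketch above, the traversal and the node-level training signal can be illustrated as follows. The dot-product scoring against a single user interest vector is a simplification of the learned scoring model, and `beam_search`, `node_label`, and `bce` are hypothetical helper names.

```python
import numpy as np


def beam_search(root, user_vec, beam_width=16, top_k=8):
    """Level-wise beam search: keep the beam_width best-scoring nodes per level.

    Returns the item ids of the highest-scoring leaves reached
    (beam_width >= top_k is assumed).
    """
    beam = [root]
    while any(n.left is not None for n in beam):
        # Expand internal nodes; leaves already reached stay in the candidate pool.
        candidates = []
        for n in beam:
            candidates.extend([n.left, n.right] if n.left is not None else [n])
        candidates.sort(key=lambda n: float(user_vec @ n.embedding), reverse=True)
        beam = candidates[:beam_width]  # prune low-similarity branches
    return [n.item_ids[0] for n in beam[:top_k]]


def node_label(node, positive_items):
    """Pseudo label: 1 if any relevant item lies in the node's subtree, else 0."""
    return 1.0 if set(node.item_ids) & set(positive_items) else 0.0


def bce(label, pred, eps=1e-7):
    """Binary cross-entropy between the pseudo label and the node's predicted score."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -label * np.log(pred) - (1.0 - label) * np.log(1.0 - pred)


# Example: retrieve candidates for a random user interest vector.
candidates = beam_search(root, rng.normal(size=64))
```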

2. Lifelong Sequential Behavior Modeling

MISS directly models user lifelong behavioral sequences to extract diverse and fine-grained user interests, even across extremely long histories (e.g., thousands of behaviors). The framework introduces two principal search units, both illustrated in the sketch that follows this list:

  • Collaborative General Search Unit (Co-GSU): Operates over one-hot ID embeddings of candidate nodes and user behaviors. It applies target attention via projection matrices, extracting the Top-K most relevant historical behaviors for the candidate:

    $$r_i = (W_q e_n)^\top (W_k e_i) / \sqrt{d}$$

    where $e_n$ and $e_i$ are the ID embeddings of the candidate node and historical behavior $i$, $W_q$ and $W_k$ are learnable projection matrices, and $d$ is the embedding dimension.

  • Multi-modal General Search Unit (MM-GSU): Utilizes fixed pre-trained multi-modal embeddings directly. Relevance between candidate and historical behaviors is measured as:

    $$r_i^{mm} = z_n^\top z_i$$

    producing a multi-modal sub-behavior sequence for further attention-based aggregation.
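
To make the two search units concrete, the sketch below scores a lifelong behavior sequence against a single candidate node (represented here by an item id for simplicity) and selects the Top-K behaviors under each view. It is an illustrative approximation: the projection matrices are random stand-ins for learned parameters, and the variable names (`id_emb_table`, `mm_emb_table`, `behavior_ids`, and so on) are assumptions, not names from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, vocab = 64, 4000, 10_000  # embedding dim, lifelong sequence length, item vocabulary

# Hypothetical inputs: learned ID embeddings and frozen pre-trained multi-modal embeddings.
id_emb_table = rng.normal(size=(vocab, d))
mm_emb_table = rng.normal(size=(vocab, d))
behavior_ids = rng.integers(0, vocab, size=seq_len)  # user's lifelong behavior sequence
candidate_id = 42                                    # item id standing in for the candidate node

W_q = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-ins for the learnable projections
W_k = rng.normal(size=(d, d)) / np.sqrt(d)


def co_gsu_scores(candidate_id, behavior_ids):
    """Co-GSU: target attention over ID embeddings, r_i = (W_q e_n)^T (W_k e_i) / sqrt(d)."""
    q = W_q @ id_emb_table[candidate_id]
    keys = id_emb_table[behavior_ids] @ W_k.T
    return keys @ q / np.sqrt(d)


def mm_gsu_scores(candidate_id, behavior_ids):
    """MM-GSU: dot products of frozen multi-modal embeddings, r_i^mm = z_n^T z_i."""
    return mm_emb_table[behavior_ids] @ mm_emb_table[candidate_id]


def top_k(scores, k=100):
    """Indices of the k most relevant historical behaviors for the candidate."""
    return np.argsort(scores)[::-1][:k]


co_scores = co_gsu_scores(candidate_id, behavior_ids)
mm_scores = mm_gsu_scores(candidate_id, behavior_ids)
co_idx, mm_idx = top_k(co_scores), top_k(mm_scores)
```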

An Exact Search Unit (ESU) refines the candidate set by applying softmax-based attention over the selected behaviors and aggregating their feature representations:

$$a_i = \mathrm{softmax}(r_i)$$

$$x^{co/mm} = \sum_{i} a_i W_v e_i$$

where $W_v$ is an additional projection layer.
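
Continuing that sketch, a minimal ESU step applies softmax attention over the retained scores and pools the value-projected behavior embeddings into one interest vector per view; `W_v` is again a random stand-in for the learned projection.

```python
W_v = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for the ESU value projection


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def esu_aggregate(scores, behavior_embs):
    """x = sum_i softmax(r)_i * (W_v e_i): attention-weighted pooling of selected behaviors."""
    weights = softmax(scores)
    return (weights[:, None] * (behavior_embs @ W_v.T)).sum(axis=0)


# Collaborative and multi-modal interest vectors for the candidate node.
x_co = esu_aggregate(co_scores[co_idx], id_emb_table[behavior_ids[co_idx]])
x_mm = esu_aggregate(mm_scores[mm_idx], mm_emb_table[behavior_ids[mm_idx]])
```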

3. Integration of Multi-modal Information

MISS integrates multi-modal content at two levels:

  • Index tree formation: Multi-modal embeddings are engineered to capture both semantic content (e.g., visual, textual features) and historical interaction signals. These embeddings serve as the basis for item clustering and hierarchical partitioning in the index tree, ensuring that similar items (under both content and behavioral metrics) are close in tree structure.
  • Behavioral searching: MM-GSU leverages multi-modal embeddings for direct similarity computation between candidate items and historical behaviors, allowing the system to extract semantically consistent interests beyond what ID-based methods provide.

This dual integration produces candidate sets that are both relevant and diverse, particularly in environments with noisy or highly variable user histories.

4. System Design and Algorithms

The MISS system applies the following technical principles:

  • Recursive k-means index tree construction over multi-modal embeddings.
  • Hierarchical embedding propagation: Node embeddings are computed by pooling child nodes to maintain the semantic hierarchy.
  • Beam search for candidate retrieval: Efficient tree traversal based on computed multi-modal similarity scores, supporting scalable sub-linear search complexity.
  • Attention-based sequence modeling: Co-GSU and MM-GSU units extract multi-perspective user interests from lifelong sequences.

This design allows MISS to function with extremely large candidate pools and behavior logs, addressing longstanding limitations in traditional retrieval systems that struggle to incorporate both multi-modal content and extensive behavioral history.

5. Performance Evaluation

Offline evaluation of MISS demonstrates marked improvements over strong baselines (e.g., SASRec, TDM, NANN, Kuaiformer):

  • Recall@800: Using a 4k behavioral sequence, MISS achieved a 37.93% improvement over the second-best approach.
  • Hierarchical recall: At intermediate tree levels (13, 16, 19), the recall gains reached up to 59.38%, reflecting superior candidate filtering and recall quality.
  • Online deployment: In Kuaishou’s production system (400 million DAU), MISS yielded measurable gains in Total App Usage Time and Per User Usage Time, confirming real-world impact.

These results indicate that the simultaneous exploitation of multi-modal signals and lifelong user data is significantly beneficial for both offline accuracy and online user engagement.

6. Industrial Applications and Implications

MISS is architected for real-world deployment in large-scale recommendation systems where:

  • Millions of items and hundreds of millions of users generate extensive and noisy multi-modal behavioral data;
  • Tree-based retrieval with multi-modal embeddings supports rapid candidate narrowing despite massive pools;
  • Multi-perspective behavior searching (via Co-GSU and MM-GSU) ensures comprehensive modeling of user interests;
  • Scalability is achieved through efficient data structures and modular multi-task learning blocks such as Multi-Gate Mixture-of-Experts (a generic sketch follows this list).
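
Multi-Gate Mixture-of-Experts (MMoE) is a standard multi-task learning block rather than anything MISS-specific; a generic sketch with assumed sizes and single-layer experts (none taken from the paper) looks like the following, where shared experts are mixed by a separate softmax gate for each task.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_experts, n_tasks = 128, 64, 4, 2  # assumed sizes, not from the paper

# Shared single-layer experts and one gating network per task.
expert_w = rng.normal(size=(n_experts, d_in, d_hidden)) / np.sqrt(d_in)
gate_w = rng.normal(size=(n_tasks, d_in, n_experts)) / np.sqrt(d_in)


def mmoe(x):
    """Return one expert mixture per task for a single input feature vector x."""
    expert_out = np.maximum(x @ expert_w, 0.0)  # (n_experts, d_hidden), ReLU experts
    task_reprs = []
    for t in range(n_tasks):
        logits = x @ gate_w[t]                  # (n_experts,) gate logits for task t
        gates = np.exp(logits - logits.max())
        gates = gates / gates.sum()             # softmax mixture weights over experts
        task_reprs.append(gates @ expert_out)   # (d_hidden,) task-specific representation
    return task_reprs


# Example: feed a concatenated user/candidate feature vector through the block.
per_task = mmoe(rng.normal(size=d_in))
```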

The framework’s modularity also allows for incremental improvements and continual learning, consistent with the demands of lifelong indexing and retrieval systems.

7. Connections to Broader Research and Future Directions

The design principles underlying MISS—multi-modal index tree construction, dual-path lifelong behavior modeling, and attention-driven candidate refinement—reflect current trends in efficient cross-modal retrieval (Markchit et al., 2019), unsupervised hashing (Hansen et al., 2021), lifelong continual learning for retrieval (Wang et al., 2021), and scalable multi-modal recommendation (Wang et al., 2023, Shen et al., 15 Jul 2024). Future work may address further optimization of tree balancing, integration with advanced product quantization or proximity graph-based indexing (Wang et al., 2023), and extension to open-world or weakly supervised retrieval settings (Solaiman et al., 25 Jun 2025). The convergence of efficient index structures and continual multi-modal sequence modeling represents a promising direction for high-performing, industrial-scale information retrieval and recommendation.

MISS thereby establishes an integrated benchmark for multi-modal indexing and lifelong sequential search at scale, with demonstrated advances in accuracy, diversity, and operational efficiency in live industrial deployments (Guo et al., 20 Aug 2025).