DSI++: Advances in Continual Neural Indexing
- DSI++ is a framework that extends the Differentiable Search Index by enabling continual learning for dynamic document corpora while ensuring memory plasticity and stability.
- It employs sharpness-aware minimization and generative replay to mitigate both implicit and explicit forgetting, thereby preserving retrieval accuracy during incremental updates.
- Empirical evaluations on benchmarks constructed from NQ and MS MARCO show significant improvements, including a +21.1% gain in average Hits@10 and roughly six times fewer model updates than full retraining.
DSI++ is a continual-learning extension of the Differentiable Search Index (DSI) architecture, a paradigm in neural information retrieval where a Transformer model “memorizes” the mapping from document content and natural language queries directly to document identifiers (docids) within its parameters. The DSI++ framework, as described in (Mehta et al., 2022), formalizes the challenge of incrementally updating such models as the document corpus evolves, characterizing catastrophic forgetting and proposing foundational solutions for ongoing, efficient adaptation.
1. Paradigm Shift: DSI++ in Information Retrieval
DSI++ generalizes DSI by moving from static corpus indexing toward continual learning with dynamic corpora. In classical DSI (Tay et al., 2022), all corpus content is encoded into model weights: queries are mapped to docids via generative decoding, eliminating the need for external retrieval indices. However, real-world corpora evolve—new documents must be indexed, and queries concerning old documents must remain answerable. The DSI++ problem is the incremental addition of new documents while retaining accurate retrieval across both old and newly indexed material.
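To make the generative-decoding step concrete, the sketch below issues a beam-search query against an off-the-shelf T5 checkpoint standing in as a placeholder for a fine-tuned DSI model. The checkpoint, query, and decoding settings are illustrative assumptions; a real DSI/DSI++ system is trained on indexing and retrieval examples and constrains decoding (e.g., with a prefix trie) so that only valid docid strings can be emitted.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Stand-in checkpoint; a real DSI model would be fine-tuned to map
# queries and document text to docid strings.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

query = "who proposed the differentiable search index"
inputs = tokenizer(query, return_tensors="pt")

# Beam search over output strings; DSI restricts this search to the docid space.
outputs = model.generate(
    **inputs, max_new_tokens=16, num_beams=10, num_return_sequences=10
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(candidates)  # top-10 candidate "docids" (meaningless without DSI fine-tuning)
```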
DSI++ thus tackles two core requirements:
- Memory Plasticity: The ability to efficiently assimilate new documents and index them for queries without retraining from scratch.
- Memory Stability: Preventing forgetting events, which manifest as decreased accuracy for previously indexed documents and queries.
2. Catastrophic Forgetting in Continual Indexing
The DSI++ paper identifies two classes of forgetting in Transformer-based retrieval:
- Implicit Forgetting: Even during initial corpus memorization, instability in the optimization landscape causes documents to transition from correctly indexed to forgotten and sometimes back to re-learned; analysis shows that approximately 88% of corpus items undergo at least one forgetting event during training (see the counting sketch after this list).
- Explicit Forgetting: When new data batches are incrementally indexed, accuracy on previously indexed subsets degrades, with drops of roughly 25 points in indexing accuracy and roughly 20 points in retrieval Hits@1 for the original documents after new corpus updates.
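The forgetting-event statistic can be computed with a small counting routine over periodic evaluation snapshots. The sketch below assumes a boolean record of whether each document is retrieved correctly at each checkpoint; the data layout is illustrative, not taken from the paper.

```python
def count_forgetting_events(correct_per_step):
    """correct_per_step[t][d] is True if document d is retrieved correctly
    (e.g., Hits@1 for its indexing example) at evaluation step t.
    A forgetting event is a correct -> incorrect transition."""
    num_docs = len(correct_per_step[0])
    events = [0] * num_docs
    for prev, curr in zip(correct_per_step, correct_per_step[1:]):
        for d in range(num_docs):
            if prev[d] and not curr[d]:
                events[d] += 1
    frac_forgotten = sum(e > 0 for e in events) / num_docs
    return events, frac_forgotten  # per-document counts, fraction with >= 1 event
```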
This phenomenon is particularly acute in large-scale deployments (e.g., Natural Questions, MS MARCO) where retraining the whole model is computationally prohibitive.
3. Mitigating Forgetting: Sharpness-Aware Minimization and Generative Replay
DSI++ advances two principal strategies:
a) Sharpness-Aware Minimization (SAM)
To ensure model parameters reside in flatter regions of the loss landscape, DSI++ leverages SAM:
$$\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} \mathcal{L}(\theta + \epsilon)$$

Here, $\mathcal{L}$ is the original loss, and $\rho$ bounds the perturbation. The adversarial update

$$\hat{\epsilon}(\theta) = \rho \, \frac{\nabla_{\theta}\mathcal{L}(\theta)}{\|\nabla_{\theta}\mathcal{L}(\theta)\|_{2}}$$

leads to an update at $\theta + \hat{\epsilon}(\theta)$, promoting flatter minima. Empirical results demonstrate a +12% absolute increase in documents with zero forgetting events compared to Adafactor.
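The two-pass update implied by this objective can be sketched in a few lines of PyTorch. The function below is a minimal illustration: the `loss_fn` and `base_optimizer` interfaces and the `rho` value are assumptions, not the paper's implementation (which applies SAM within its Adafactor-based training setup).

```python
import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM-style two-pass update (interfaces and rho are illustrative)."""
    # First pass: gradients at the current parameters theta.
    loss = loss_fn(model, batch)
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([param.grad.norm(2) for param in params]), 2)

    # Ascend to the adversarial point theta + eps_hat,
    # where eps_hat = rho * grad / ||grad||_2.
    eps = []
    with torch.no_grad():
        for param in params:
            e = rho * param.grad / (grad_norm + 1e-12)
            param.add_(e)
            eps.append(e)
    model.zero_grad()

    # Second pass: gradients evaluated at the perturbed point.
    loss_fn(model, batch).backward()

    # Return to theta and step with the sharpness-aware gradient.
    with torch.no_grad():
        for param, e in zip(params, eps):
            param.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```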
b) Generative Experience Replay (Pseudo-Query Generation)
For continued retrieval efficacy, DSI++ employs a generative memory module. In realistic scenarios, ground-truth queries for every new document are unavailable, so the generative memory synthesizes pseudo-queries for both old and new documents, which are replayed alongside indexing updates. This dual optimization acts as continual supervised/semi-supervised training, preserving previous query-to-docid associations and reducing retrieval-metric degradation. Even sparse replay, in which only a small fraction of each update consists of replayed pseudo-queries, remains effective.
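The sketch below shows one way replayed pseudo-queries might be mixed with indexing examples when constructing an update batch. The `query_generator` interface (e.g., a doc2query-style model), the field names, and the mixing fraction are illustrative assumptions; a fuller version would also generate pseudo-queries for the newly added documents.

```python
import random

def build_update_batch(new_docs, old_docs, query_generator,
                       batch_size=32, replay_fraction=0.25):
    """Mix indexing examples for newly added documents with replayed
    pseudo-query examples for previously indexed documents."""
    n_replay = int(batch_size * replay_fraction)
    n_index = batch_size - n_replay

    # Indexing task: document text -> docid.
    index_examples = [(doc["text"], doc["docid"])
                      for doc in random.sample(new_docs, min(n_index, len(new_docs)))]

    # Generative replay: pseudo-query -> docid, with queries synthesized by a
    # doc2query-style generator, so no ground-truth queries are required.
    replay_examples = []
    for doc in random.sample(old_docs, min(n_replay, len(old_docs))):
        replay_examples.append((query_generator(doc["text"]), doc["docid"]))

    batch = index_examples + replay_examples
    random.shuffle(batch)
    return batch  # list of (input_text, target_docid) pairs for seq2seq training
```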
4. Experimental Validation and Metrics
Benchmarks constructed from NQ and MS MARCO datasets demonstrate:
- Baseline Fine-Tuning (new data only): Severe forgetting in old corpora.
- Cumulative Fine-Tuning (union of all data): Partial mitigation, but performance degrades after multiple corpus expansions.
- DSI++ (SAM + Generative Memory): Substantially improved average performance and retrieval consistency.
On NQ, DSI++ yields a +21.1% improvement in average Hits@10 over competitive methods and requires six times fewer model updates than full retraining for five sequential corpus expansions.
Key metrics for model evaluation include (a computation sketch follows this list):
- $A_n$: mean retrieval accuracy across all corpora after the $n$-th update.
- $F_n$: forgetting, i.e., the total performance loss on old corpora relative to their best earlier accuracy.
- $LA_n$: learning accuracy, capturing forward transfer efficiency for new documents.
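Under the common continual-learning formulation, these quantities can be computed from a matrix of per-corpus accuracies recorded after each update. The sketch below is an assumed formalization; the paper's exact notation and averaging may differ.

```python
import numpy as np

def continual_metrics(acc):
    """acc[i][j]: retrieval accuracy on corpus j measured after update i,
    for i, j in {0, ..., n}; update 0 indexes the initial corpus."""
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0] - 1  # index of the latest update

    # A_n: mean accuracy over all corpora seen so far, after update n.
    A_n = float(acc[n, : n + 1].mean())

    # F_n: average forgetting on old corpora = best earlier accuracy minus final accuracy.
    F_n = float(np.mean([acc[j:n, j].max() - acc[n, j] for j in range(n)])) if n > 0 else 0.0

    # LA_n: learning accuracy = accuracy on each corpus right after it was indexed.
    LA_n = float(np.mean([acc[j, j] for j in range(n + 1)]))

    return A_n, F_n, LA_n
```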
5. Docid Representation and Interference
DSI++ investigates unstructured atomic tokens and various structured string-based docid representations. Atomic docids reduce interference-induced forgetting in continual indexing scenarios, while semantic structuring provides potential resilience where identifier collisions and docid prefix sharing are expected.
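To make the contrast concrete, the toy functions below compare an atomic docid (one dedicated output token per document) with a naively structured string docid decoded digit by digit, where prefix sharing across documents arises; the vocabulary offset and formats are illustrative, not the paper's.

```python
def atomic_docid(doc_index, base_vocab_size=32000):
    """Unstructured atomic docid: one dedicated output token per document,
    allocated beyond the base vocabulary (offset is illustrative)."""
    return [base_vocab_size + doc_index]

def naive_string_docid(doc_index):
    """Naively structured docid: the integer identifier emitted digit by
    digit, so documents with shared prefixes share decoding steps."""
    return list(str(doc_index))

print(atomic_docid(17))        # [32017] -- a single new token id
print(naive_string_docid(17))  # ['1', '7'] -- two decoding steps
```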
A plausible implication is that further exploration of docid representations, mixture-of-experts memory architectures, and adaptive ratio balancing between indexing and retrieval samples could yield even greater robustness for evolving corpora.
6. Implications for Continual Learning and Information Retrieval
DSI++ situates itself at the intersection of continual learning and generative information retrieval, offering methodological innovations, notably SAM-driven memory stability and generative replay techniques. These findings hold broader significance for neural search systems, document classification, and any domain requiring incremental content integration with maintained inference accuracy.
Potential future research directions include:
- Filtering pseudo-queries for noise and adapting generation to out-of-domain queries.
- Combining parameter isolation approaches with generative replay for selective memory updates.
- Reducing environmental cost in large-scale retraining contexts.
7. Summary and Outlook
DSI++ provides a principled framework for the ongoing update and maintenance of generative retrieval models, addressing catastrophic forgetting and computational efficiency in dynamic environments. Integrating sharpness-aware optimization with generative replay sets a new standard for continual memory management in Transformer-based information retrieval. The demonstrated effectiveness across large benchmarks and the significant reduction in costly model updates suggest DSI++ is a foundational advance toward real-world, adaptive IR systems.
This synthesis integrates empirical findings and methodological advances from (Mehta et al., 2022), establishing the scope and significance of DSI++ for the research community.