Semantic Recommendation of Subhāṣitas

Updated 17 January 2026
  • Semantic recommendation of subhāṣitas is a method that leverages AI-driven sentence embeddings to match Sanskrit aphorisms with user queries based on meaning rather than keywords.
  • The approach employs methodologies such as IndicBERT, FAISS indexing, and Mixture-of-Codes to achieve high precision in retrieving contextually relevant verses.
  • It enhances cross-lingual accessibility and cultural preservation by integrating expert-annotated corpora, transliterations, and LLM-generated explanations.

Semantic recommendation of subhāṣitas refers to the application of semantic retrieval and recommendation techniques, leveraging sentence embeddings, vector search, and LLMs, to match Sanskrit subhāṣita verses (concise, philosophical aphorisms) to user queries not by keyword overlap but by underlying meaning, intent, or contextual relevance. This approach addresses linguistic, cultural, and accessibility barriers, enabling both scholarly study and practical use of the classical corpus through modern AI and information retrieval methodologies (Raorane et al., 10 Jan 2026, Zhang et al., 2024).

1. Dataset Construction and Annotation

Semantic recommendation systems for subhāṣitas require high-quality, thematically annotated corpora. In the Pragya framework, a dataset of 200 verses was curated from canonical classical Indian sources (e.g., Hitopadeśa, Pañcatantra) (Raorane et al., 10 Jan 2026). Each verse was:

  • Transcribed in Devanagari script.
  • Accompanied by a Marathi translation and a simplified English translation, enhancing cross-lingual accessibility.
  • Expert-annotated with mood or theme tags (e.g., motivation, friendship, compassion, courage), allowing theme-aware retrieval.
  • Stored in a structured CSV schema: Sanskrit verse, Marathi translation, English translation, theme tags.

This structured, annotated resource is foundational for any system seeking to retrieve not just textually but semantically similar or contextually relevant verses.
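As a minimal sketch, the four-column schema above can be handled with the standard `csv` module; the column names, separator for multi-valued tags, and the sample row here are illustrative assumptions, not the actual Pragya corpus.

```python
import csv
import io

# Hypothetical row following the four-column schema described above;
# the verse and translations are illustrative, not from the Pragya dataset.
rows = [
    {
        "sanskrit_verse": "उद्यमेन हि सिध्यन्ति कार्याणि न मनोरथैः",
        "marathi_translation": "(Marathi rendering here)",
        "english_translation": "Tasks succeed through effort, not through mere wishes.",
        "theme_tags": "motivation;effort",  # assumed ';'-separated multi-tag format
    },
]

fieldnames = ["sanskrit_verse", "marathi_translation",
              "english_translation", "theme_tags"]

# Write the corpus to an in-memory CSV, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
corpus = list(csv.DictReader(buf))
themes = corpus[0]["theme_tags"].split(";")
```

A real deployment would read the file from disk instead of an in-memory buffer; the round-trip above only demonstrates the schema.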

2. Embedding Computation and Semantic Indexing

Semantic representations were generated using the IndicBERT transformer, a multilingual model trained for Indian languages and adopted in "sentence embedding" mode (Raorane et al., 10 Jan 2026). The workflow involves:

  • Text normalization and tokenization of the Sanskrit verse or query.
  • Embedding extraction: The [CLS] token embedding (768 dimensions) serves as the fixed-length semantic representation.
  • Similarity measurement: Cosine similarity between query and verse embeddings quantifies semantic alignment:

$$\operatorname{cosine\_sim}(e_q, e_v) = \frac{e_q \cdot e_v}{\|e_q\| \, \|e_v\|}$$

  • Indexing: All corpus embeddings are ingested into a FAISS FlatL2 index for efficient nearest-neighbor retrieval.

An alternative approach under development leverages LLM-based embeddings, further quantized and structured via multi-codebook vector quantization à la Mixture-of-Codes (MoC) (Zhang et al., 2024). This enhances discriminability and embedding robustness for large-scale recommendations.

3. Retrieval and Recommendation Pipeline

The recommendation system architecture comprises distinct retrieval and generation modules (Raorane et al., 10 Jan 2026):

  • Retrieval: On query receipt, its embedding is compared (cosine metric) against the indexed corpus, retrieving the top-k nearest verses (typically k = 3 in Pragya).
  • Post-processing: No hard similarity threshold is applied; the k best matches are always returned. Coverage is computed as the fraction of queries for which at least one result shares the intended theme.
  • Scalability (MoC): With Mixture-of-Codes, each verse's semantics are distributed across N parallel codebooks (VQ-VAE), yielding a high-dimensional semantic ID while avoiding spectrum collapse and supporting fast approximate nearest-neighbor search (Zhang et al., 2024).
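The retrieval and coverage steps can be sketched in plain NumPy, assuming unit-normalised embeddings so cosine similarity reduces to a dot product; the corpus size, theme labels, and query construction are illustrative.

```python
import numpy as np

# Toy unit-normalised corpus embeddings with hypothetical theme tags;
# in Pragya these would be IndicBERT embeddings and expert annotations.
rng = np.random.default_rng(1)
E = rng.standard_normal((50, 768))
E /= np.linalg.norm(E, axis=1, keepdims=True)
themes = [f"theme_{i % 5}" for i in range(50)]

def top_k(query_vec, k=3):
    """Return indices of the k most cosine-similar verses (no threshold)."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = E @ q                     # cosine similarity (all unit-norm)
    return np.argsort(-sims)[:k]

# Coverage@3: fraction of queries with at least one retrieved verse
# sharing the intended theme.
queries = [(E[i] + 0.1 * rng.standard_normal(768), themes[i])
           for i in range(10)]
hits = sum(any(themes[j] == t for j in top_k(q)) for q, t in queries)
coverage = hits / len(queries)
```

The no-threshold policy shows up directly: `top_k` always returns exactly k indices, however weak the best match is.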

The table below summarizes retrieval performance reported in Pragya:

| Metric                          | Keyword Search | Pragya (RAG) |
|---------------------------------|----------------|--------------|
| Top-3 Precision                 | 45 %           | 72 %         |
| Coverage (≥1 relevant in top 3) | 60 %           | 82 %         |
| Latency per Query (seconds)     | 0.5            | 1.2          |
| User Satisfaction (1–5 scale)   | 2.8            | 4.3          |

Semantic retrieval notably increases both precision and semantic coverage relative to traditional keyword search.

4. Generation and Explanation Module

A distinguishing feature is retrieval-augmented generation (RAG): after semantic retrieval, a local Mistral LLM (via Ollama) is invoked to enhance user accessibility (Raorane et al., 10 Jan 2026):

  • Inputs: The target Sanskrit verse, theme metadata, provided translations, and user query.
  • Outputs:
  1. Transliteration (IAST or Latin script).
  2. Fluent English translation.
  3. Contextualized explanation: A motivational, modern paraphrase linking the verse's theme to the user's context.
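The input assembly for the generation step can be sketched as below. The prompt wording is an assumption, not Pragya's actual prompt, and the payload targets Ollama's local `/api/generate` endpoint (sending it requires a running Ollama server with the `mistral` model pulled, so only the payload is built here).

```python
import json

def build_rag_prompt(verse, theme, marathi, english, user_query):
    """Assemble a generation prompt from the retrieved verse and metadata;
    the wording is illustrative, not the system's actual template."""
    return (
        "You are explaining a Sanskrit subhashita.\n"
        f"Verse (Devanagari): {verse}\n"
        f"Theme: {theme}\n"
        f"Marathi translation: {marathi}\n"
        f"English translation: {english}\n"
        f"User query: {user_query}\n"
        "Produce: (1) an IAST transliteration, (2) a fluent English "
        "translation, (3) a short explanation linking the verse's theme "
        "to the user's situation."
    )

# JSON payload for Ollama's generate API (POST http://localhost:11434/api/generate).
payload = json.dumps({
    "model": "mistral",
    "prompt": build_rag_prompt(
        "उद्यमेन हि सिध्यन्ति कार्याणि न मनोरथैः",
        "motivation",
        "(Marathi rendering here)",
        "Tasks succeed through effort, not through mere wishes.",
        "I feel stuck preparing for exams.",
    ),
    "stream": False,
})
```

Posting `payload` with any HTTP client and parsing the `response` field of the reply would yield the transliteration, translation, and contextual explanation in one generation pass.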

Comparison of generated explanations:

| Metric                          | Dictionary Translation | Pragya (Mistral) |
|---------------------------------|------------------------|------------------|
| Clarity (1–5)                   | 2.5                    | 3.4              |
| Cultural Appropriateness (1–5)  | 3.0                    | 4.2              |
| Engagement (1–5)                | 2.2                    | 4.1              |
| Overall Score (avg)             | 2.6                    | 4.4              |

The LLM-based generation module significantly improves clarity, cultural resonance, and engagement compared to dictionary-style outputs.

5. Advanced Embedding Techniques: Mixture-of-Codes Approach

Scaling classical embedding approaches to large corpora while maintaining semantic granularity and computational tractability is addressed by the Mixture-of-Codes (MoC) paradigm (Zhang et al., 2024):

  • Multi-codebook quantization: High-dimensional LLM embeddings ($z \in \mathbb{R}^{n_z}$) are projected into N low-dimensional codebooks, each with K codewords.
  • ID construction: Each verse is tagged by N discrete code indices, preserving a distributed semantic representation.
  • Fusion module: Embeddings from all codebooks are concatenated and processed by a residual MLP bottleneck ("fusion") prior to downstream recommendation (e.g., within DeepFM or AutoInt+).
  • Loss functions: A VQ-VAE-style loss ($L_{rec} + L_{commit} + L_{codebook}$) ensures faithful quantization and diversity of semantic subspaces; task-specific losses supervise downstream recommendation.
  • Retrieval: Precomputed, fused verse embeddings allow scalable nearest-neighbor and ranking operations for user- or context-specific queries.
  • Storage and computation: Embedding size grows linearly with N; quantization and retrieval remain efficient for large K and corpus sizes; periodic retraining maintains adaptation to corpus or model drift.

This approach supports high-dimensional, information-rich semantic matching without the discriminability loss inherent in brute-force projection or single-codebook compression.
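A minimal NumPy sketch of the quantization and fusion steps, under assumed toy sizes (N = 4 codebooks, K = 16 codewords, 32-dimensional sub-spaces) and with random projections standing in for the learned VQ-VAE parameters:

```python
import numpy as np

# Illustrative sizes; the real MoC parameters are learned, not random.
rng = np.random.default_rng(2)
n_z, N, K, d = 768, 4, 16, 32

proj = rng.standard_normal((N, n_z, d)) / np.sqrt(n_z)  # per-codebook projections
codebooks = rng.standard_normal((N, K, d))              # K codewords per codebook

def semantic_id(z):
    """Quantize an LLM embedding z into N discrete code indices."""
    ids = []
    for i in range(N):
        sub = z @ proj[i]                               # project into sub-space i
        dists = np.linalg.norm(codebooks[i] - sub, axis=1)
        ids.append(int(np.argmin(dists)))               # nearest codeword index
    return ids

def fused_embedding(ids):
    """Concatenate the selected codewords (the input to the fusion MLP)."""
    return np.concatenate([codebooks[i][ids[i]] for i in range(N)])

z = rng.standard_normal(n_z)       # stand-in for an LLM verse embedding
ids = semantic_id(z)               # N-part semantic ID
fused = fused_embedding(ids)       # shape (N * d,)
```

The distributed representation is visible in the output shape: the semantic ID is N small integers rather than one code from a single giant codebook, and the fused vector grows linearly with N, matching the storage note above.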

6. Evaluation, Limitations, and User Studies

Evaluation of semantic recommendation frameworks encompasses both traditional IR metrics and user-centered measures (Raorane et al., 10 Jan 2026):

  • Retrieval: Precision@3, Coverage@3, Recall@k, MRR.
  • Explanation Quality: Scored by users on clarity, cultural appropriateness, engagement, and overall.
  • User studies: Encompass a demographically mixed cohort (Sanskrit, Marathi, English speakers); tasks include thematic querying and utility rating.
  • Key findings: Semantic retrieval via IndicBERT + FAISS outperformed keyword search in both system-level metrics and user satisfaction (2.8 → 4.3/5 Likert scale).
  • Generation: LLM-generated explanations rated ~4.0+ versus <2.6 for literal translations.
  • Usability: Users noted bridging of the gap between classical content and contemporary motivational or ethical advice.

Limitations include the small dataset (200 verses), absence of domain-specific model fine-tuning, and constrained user study scope. No recall@k or large-scale A/B testing has yet been performed; current deployment relies on local hardware with moderate latency and basic TTS.
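The retrieval metrics above have straightforward definitions; a sketch with illustrative ranked lists and relevance sets (not Pragya's evaluation data):

```python
def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def mrr(all_ranked, all_relevant):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        rr = 0.0
        for pos, r in enumerate(ranked, start=1):
            if r in relevant:
                rr = 1.0 / pos
                break
        total += rr
    return total / len(all_ranked)

# Two toy queries: ranked verse IDs and the IDs judged relevant.
ranked = [[3, 1, 7], [5, 2, 9]]
relevant = [{1, 7}, {9}]
p3 = precision_at_k(ranked[0], relevant[0])  # 2 of 3 relevant -> 2/3
score = mrr(ranked, relevant)                # (1/2 + 1/3) / 2
```

Coverage@3, by contrast, is binary per query (did any top-3 result match the intended theme?), which is why it exceeds Precision@3 in the reported numbers.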

7. Future Directions and Research Opportunities

The field of semantic recommendation for subhāṣitas is at a nascent stage, with several promising research directions (Raorane et al., 10 Jan 2026, Zhang et al., 2024):

  • Dataset expansion: Scaling verse annotations to thousands, incorporating additional Indian and global languages, and enriching theme tagging.
  • Adaptive embeddings: Fine-tuning LLMs and embedding models on subhāṣita-specific corpora, or integrating multilingual encoders (e.g., LaBSE, mBERT) for cross-lingual capability.
  • System scalability: Leveraging Mixture-of-Codes and advanced vector quantization to support web-scale retrieval and robust discriminability.
  • Multimodal integration: Adding automatic speech recognition (ASR), high-quality TTS, and multimodal (image/text/audio) recommendation.
  • Comprehensive evaluation: Systematic user studies, recall@k and MRR benchmarking, and A/B tests over diverse populations.
  • Deployment: Development of mobile, web, and edge-computing applications for public engagement, education, and heritage preservation.

A plausible implication is that as annotated corpora and semantic embedding models mature, domain-specific fine-tuning and scalable MoC-based representations will further enhance the accessibility, relevance, and impact of subhāṣita recommendation systems, contributing to the digital preservation and revitalization of classical Sanskrit literature.
