E-CARE: Efficient Commonsense Recommendation
- The paper introduces E-CARE, which shifts LLM-based commonsense reasoning offline to require just one forward pass per query.
- It employs offline factor graph construction and lightweight adapters to augment queries and products, ensuring low latency and scalability.
- Empirical results show gains of up to +12.1 points Precision@5 (app recall) and +2 points Macro-F1 (search relevance), highlighting its efficiency and effectiveness.
Efficient Commonsense-Augmented Recommendation Enhancer (E-CARE) is a framework for integrating LLM-derived commonsense reasoning into e-commerce recommender systems with maximal inference efficiency. It achieves comparable or superior reasoning performance to prior LLM-augmented approaches while requiring only a single forward pass through the LLM per user query. E-CARE leverages offline factor graph construction and lightweight trainable adapters to enable efficient query and product augmentation, ensuring scalability across large catalogs and low-latency deployment.
1. Motivation and Design Rationale
Recent advances in LLM-based commonsense augmentation for product search and recommendation—such as FolkScope or COSMO—depend on frequent, real-time LLM inference on each query–product pair. This results in prohibitively high serving costs when ranking millions of candidates, due to:
- Multiple LLM calls per (Q,P) pair (for intention and utility extraction)
- Heavy reliance on human annotation (for “ground-truth” reasoning triples or supervised fine-tuning)
- Poor scalability in updating knowledge graphs and recommender fine-tuning
E-CARE is designed to shift almost all LLM reasoning and annotation offline. Its core guiding principles are:
- Efficiency: invoke the LLM exactly once per query at inference time.
- Scalability: offload graph and factor construction to the offline pipeline.
- Retention of reasoning power: use factor graphs distilled from LLM chain-of-thought reasoning, accessed or approximated via adapters and graph lookup at runtime.
2. System Architecture and Workflow
E-CARE is modular and interfaces with any embedding-based or cross-encoder recommender. The system comprises multiple interdependent stages:
Offline (batch, per catalog update):
- LLM reasoning: for historical positive query–product pairs, the LLM is prompted to produce commonsense factors (user needs, utilities, features).
- Factor graph construction: A bipartite/multiplex graph is built with nodes (queries, products, factors), and edges representing explanatory relationships.
- Clustering and filtering: Redundant factors are merged; low-confidence edges removed.
- Adapter training: Train lightweight adapters (“query adapters,” sometimes “product adapters”) to predict factor nodes relevant for new queries and products based on their LLM embeddings.
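As a rough sketch of this offline stage, the code below extracts factors for historical positive pairs and attaches them to a graph. It uses networkx for the graph and a generic llm_complete callable as a stand-in for the actual LLM client; the prompt wording, parse_factor_lines helper, and normalization are illustrative assumptions, and clustering/filtering is omitted.

import networkx as nx

FACTOR_PROMPT = (
    "Given query '{q}' and product '{p}' with features {feats}, list: "
    "(a) the need behind the query, (b) the utility provided by the product, "
    "under each of the scopes {{who, why, where_when}}."
)

def parse_factor_lines(raw_output):
    # Naive parser: keep lines that look like "(a) ..." / "(b) ..." items.
    return [ln.split(")", 1)[-1].strip() for ln in raw_output.splitlines() if ")" in ln]

def build_factor_graph(positive_pairs, llm_complete):
    """Build a query/product/factor graph from historical positive (Q, P) pairs."""
    G = nx.Graph()
    for q, p, feats in positive_pairs:
        G.add_node(("query", q))
        G.add_node(("product", p))
        raw = llm_complete(FACTOR_PROMPT.format(q=q, p=p, feats=feats))
        for factor_text in parse_factor_lines(raw):
            f = ("factor", factor_text.lower())   # light normalization before clustering
            G.add_node(f)
            G.add_edge(("query", q), f)           # factor explains the query's need
            G.add_edge(("product", p), f)         # factor is a utility of the product
    return G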
Online (inference per (Q, P)):
- Query embedding: Single call to a frozen LLM encoder computes the query embedding v_Q.
- Factor retrieval: Adapters (MLPs) project v_Q into scores for each factor node, selecting the top-k factors F_Q for the query Q.
- Augmentation: Combine the query and the selected factor texts (Q augmented with F_Q); for a candidate product P, retrieve its offline-linked factors F_P.
- Recommender input: Feed the augmented query and product representations to a bi-encoder or cross-encoder for ranking or classification.
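A minimal sketch of the online factor-retrieval step, assuming one small adapter MLP per factor subset and a pre-computed, L2-normalized factor embedding matrix; names such as adapters, factor_embs, and factor_ids are hypothetical.

import numpy as np

def retrieve_factors(v_q, adapters, factor_embs, factor_ids, k=10):
    """Project the frozen query embedding with each subset adapter and take the
    top-k factors by cosine similarity (factor_embs assumed L2-normalized)."""
    selected = set()
    for adapter in adapters.values():      # one lightweight MLP per factor subset
        v_proj = adapter(v_q)
        v_proj = v_proj / np.linalg.norm(v_proj)
        scores = factor_embs @ v_proj      # cosine similarity against all factor nodes
        top = np.argsort(-scores)[:k]
        selected.update(factor_ids[i] for i in top)
    return selected                        # F_Q, the query's predicted factors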
3. Commonsense Reasoning via Factor Graphs
The factor graph formalism underpins latent reasoning in E-CARE. Its structure is:
- Nodes: queries, products, and F, the set of extracted factor nodes (“needs,” “utilities,” features).
- Edges connect queries and products to their explanatory factors.
A compatibility model for binary relevance (or multiclass) between Q and P is specified as:

p(y | Q, P) ∝ ∏_{f ∈ F_Q ∩ F_P} ψ_f(Q, P)

Factor potentials are parameterized by:

ψ_f(Q, P) = exp( wᵀ φ(Q, P, f) )

with φ(Q, P, f) typically comprising cosine similarity between LLM embeddings of Q and f alongside structural indicators (e.g., f linking to P in the graph).
While the factor graph allows for (approximate) inference methods such as belief propagation or mean-field variational inference, the deployed E-CARE performs a single-pass approximation: summing the log-potentials over shared factors between F_Q and the neighborhood of P.
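Under the parameterization sketched above, the single-pass approximation reduces to a sum of similarities over shared factors. A compact illustration follows; the helper names and the use of cosine similarity as the log-potential are assumptions consistent with the scoring described later.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_pass_score(v_q, F_Q, F_P, factor_vec):
    """Approximate factor-graph inference: sum log-potentials (here, cosine
    similarities) over factors shared by the query and the product."""
    shared = F_Q & F_P                     # factors explaining both Q and P
    return sum(cosine(v_q, factor_vec[f]) for f in shared)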
4. Single LLM Forward Pass and Offline Extraction
E-CARE constrains real-time LLM usage to one forward pass per query. The offline pipeline is specified by:
- A template prompt for historical positive (Q, P) pairs:
“Given Q and P+its extracted product features, list: (a) the need n behind Q, (b) the utility u provided by P, under each of the scopes {who, why, where_when}.”
LLM output segments are parsed into factor nodes using chunking tools (e.g., DSPy pipeline).
- These factors are persistently encoded (embeddings), clustered, and indexed in the graph.
At inference, all factor definitions and relationships have been pre-extracted and adapters trained; no online annotation or chaining of LLM calls occurs.
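Since factors are encoded, clustered, and indexed offline, a simple greedy deduplication by embedding similarity could look like the following; the threshold and greedy strategy are illustrative assumptions, not the paper's exact clustering method.

import numpy as np

def cluster_factors(texts, embed, sim_threshold=0.9):
    """Greedy merge of near-duplicate factor texts.
    `embed` maps a string to an L2-normalized vector."""
    reps, rep_vecs = [], []
    for t in texts:
        v = embed(t)
        if rep_vecs and max(float(v @ r) for r in rep_vecs) >= sim_threshold:
            continue                  # redundant factor, folded into an existing cluster
        reps.append(t)                # keep as a new cluster representative
        rep_vecs.append(v)
    return reps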
5. Inference, Scoring, and Integration with Base Models
Approximate graph inference at runtime is described by:

s_cs(Q, P) = Σ_{f ∈ F_Q ∩ F_P} cos(v_Q, v_f)

where F_P is the set of factors connected to P in the graph. This is fused with a base recommender score s_base(Q, P) as:

s(Q, P) = s_base(Q, P) + λ · s_cs(Q, P)

In most cases, λ is a fixed scalar in the 0.5–2.0 range (see the hyperparameters in Section 7). Thus, commonsense augmentation is realized through a weighted sum of LLM embedding similarities over shared explanatory factors.
Adapter Training: For each query Q and factor subset S (e.g., who, why, where_when):
- Positives: factor nodes connected to Q in the offline factor graph.
- Negatives: factor nodes not connected to Q, sampled.
- InfoNCE loss employed:

L(Q, S) = −log [ exp(sim(v_Q^S, v_{f⁺}) / τ) / Σ_{f′ ∈ {f⁺} ∪ F⁻} exp(sim(v_Q^S, v_{f′}) / τ) ]

where v_Q^S is the adapter-projected query embedding for subset S, f⁺ a positive factor, F⁻ the sampled negatives, and τ a temperature.
The base recommender continues standard cross-entropy or ranking loss training, with frozen factor augmentation.
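A minimal PyTorch-style sketch of the adapter InfoNCE objective described above, assuming pre-computed, L2-normalized factor embeddings and a temperature tau; the exact positive/negative construction and batching may differ from E-CARE's.

import torch
import torch.nn.functional as F

def adapter_info_nce(adapter, v_q, pos_factor_vecs, neg_factor_vecs, tau=0.07):
    """InfoNCE loss for one query and one factor subset.
    v_q: (d,) frozen LLM query embedding; factor vecs: (n, d), L2-normalized."""
    z = F.normalize(adapter(v_q), dim=-1)        # adapter-projected query embedding
    pos = pos_factor_vecs @ z                    # (n_pos,) cosine scores
    neg = neg_factor_vecs @ z                    # (n_neg,)
    logits = torch.cat([pos, neg]) / tau
    log_denom = torch.logsumexp(logits, dim=0)   # shared normalizer over pos + neg
    # each positive factor should outrank all sampled negatives
    return -(pos / tau - log_denom).mean()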
6. Empirical Evaluation
E-CARE has been evaluated on two downstream tasks:
a) Search Relevance (ESCI, WANDs)
- Task: classify query–product pairs into relevance classes (e.g., Exact/Substitute/Complement/Irrelevant).
- Baselines: bi-encoder (BERT-large, DeBERTa-v3, gte-Qwen2), cross-encoder, few-shot LLM inference (Llama-3.1-8B), KDD’22 ensemble.
- Metrics: Macro-F1, Micro-F1.
- Observed gains: up to +2.02 Macro-F1 on DeBERTa-v3 cross-encoder (59.01→61.03) and +1.71 on gte-Qwen2 bi-encoder (42.95→44.66) for ESCI; smaller gains on WANDs.
b) App Recall (private dataset)
- Task: recall top-K apps from 66K candidates for each query.
- Baseline: production recall systems using keyword, semantic, and popularity heuristics.
- Metric: Recall@5, Precision@5.
- Gains: Recall@5 improved by +11.1 (51.3→62.4), Precision@5 by +12.1 (41.0→53.1).
Additional ablations include:
- Factor graph statistics (node/edge reduction after clustering and filtering).
- Adapter retrieval quality (cosine similarity up to 0.89 on top-1).
- Case studies recovering implicit needs/utilities.
7. Efficiency, Cost, and Practical Considerations
E-CARE’s efficiency profile is a distinctive feature:
- LLM Calls: Exactly one per query; prior methods require multiple calls per (Q, P) pair, i.e., a number of calls per query that grows with the candidate set.
- Latency: Reduced to the time of a single LLM embedding plus adapter inference (≈10 ms) and standard scoring.
- Serving cost: Lowered roughly in proportion to the number of candidates that would otherwise be scored with per-pair LLM prompting.
- Scalability: Factor graph size is determined offline and grows with the historical catalog; online cost scales with the number of top retrieved factors and the cost of the base recommender.
Guidelines and Best Practices:
- Rebuild factor graphs when historical interaction distributions shift.
- Prefer lightweight, frozen LLMs (e.g., gte-Qwen2) for both embedding extraction and adapter inference.
- Monitor adapter retrieval accuracy for drift detection.
Representative pseudocode:
def ECARE_score(Q, P):
    v_Q = LLM_encode(Q)                   # one LLM forward pass per query
    F_Q = set()                           # predicted factors
    for S in factor_subsets:              # e.g., who, why, where_when, ...
        v_Q_S = MLP_S(v_Q)                # adapter projection for subset S
        F_Q |= top_k_factors_by_cos(v_Q_S)
    F_P = offline_graph_neighbors(P)      # pre-stored factor links
    commonsense_score = sum(cos(v_Q, enc(f)) for f in F_Q & F_P)
    base_score = base_model(Q, P)         # bi- or cross-encoder
    return base_score + λ * commonsense_score
Hyperparameters include: the number of factor subsets (e.g., {who, why, where_when, category, style, usage}); top-k factors per subset (k ≈ 5–20); regularization weight λ (0.5–2.0); and edge filtering thresholds (0.2–0.5 for contrastive scoring).
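For concreteness, a hypothetical configuration within the stated ranges (illustrative defaults, not the paper's tuned values):

ECARE_CONFIG = {
    "factor_subsets": ["who", "why", "where_when", "category", "style", "usage"],
    "top_k_per_subset": 10,        # k, typically 5-20
    "fusion_weight": 1.0,          # λ, typically 0.5-2.0
    "edge_filter_threshold": 0.3,  # contrastive-score cutoff, typically 0.2-0.5
}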
E-CARE preserves the multi-hop, interpretive commonsense reasoning of LLMs with single-pass augmentation, yielding up to +12.1 points in Precision@5 (app recall), +2 points Macro-F1 (search), and drastic reductions in latency and cost compared to real-time per-candidate LLM prompting.