NEMO-4-PAYPAL: LLM-Optimized Commerce System
- NEMO-4-PAYPAL is a multi-agent e-commerce system that leverages NVIDIA's NeMo Framework to optimize LLM workflows and transform search and recommendation pipelines.
- It applies LoRA-based fine-tuning and layered agent orchestration to reduce query latency by over 50% while lowering GPU costs significantly.
- Its modular design, hyperparameter optimization, and scalable architecture provide a replicable blueprint for efficient, production-scale commerce applications.
NEMO-4-PAYPAL is a production-scale, multi-agent commerce system developed by PayPal in collaboration with NVIDIA, leveraging LLM optimization via NVIDIA's NeMo Framework to orchestrate agentic workflows for e-commerce application scenarios. At its core, the system replaces traditional search and recommendation pipelines with a layered architecture of specialized LLM-powered agents, with the Search and Discovery agent as a primary performance target. NEMO-4-PAYPAL applies low-rank adaptation (LoRA) for efficient model fine-tuning, achieving substantial reductions in query latency and cost without sacrificing quality, thereby establishing a scalable paradigm for multi-agent LLM optimization in commercial environments (Sahami et al., 25 Dec 2025).
1. Architectural Overview
NEMO-4-PAYPAL integrates four principal architectural layers, each contributing distinct capabilities to the overall commerce agent system:
- Commerce Platform: Hosts user interface, authentication, payment, and cart management services.
- Agent Orchestration Framework: A Java-based API server utilizing LangChain for session management, multi-agent orchestration, and standardized domain-specific tool invocation.
- LLM Integration (“Model Garden”): Supports selection among general-purpose or commerce-tuned LLMs (such as the llama3.1-nemotron-nano-8B-v1), deployed via NVIDIA NIM microservices.
- Personalization Engine: Constructs and maintains per-user behavioral profiles to provide conditional recommendations.
The central orchestrator routes user interactions to specialized sub-agents, each responsible for a distinct workflow: Search and Discovery (free-form product queries), Recommendation (personalized suggestions), Cart-Management (basket operations), and Post-Purchase Support. Each sub-agent operates as an LLM workflow encompassing query understanding, task-specific planning, retrieval/ranking (drawing on dense embedding searches and LLM-powered re-ranking), external API calls, and response formulation.
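The routing layer described above can be sketched in a few lines. This is a hypothetical illustration, not the production Java/LangChain implementation: the sub-agent functions are stubs, and `classify_intent` stands in for the LLM-based router with simple keyword rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class AgentResponse:
    agent: str
    text: str

# Stub sub-agents mirroring the four workflows named in the text.
def search_and_discovery(query: str) -> AgentResponse:
    return AgentResponse("search_discovery", f"results for: {query}")

def recommendation(query: str) -> AgentResponse:
    return AgentResponse("recommendation", f"suggestions for: {query}")

def cart_management(query: str) -> AgentResponse:
    return AgentResponse("cart", f"cart update: {query}")

def post_purchase(query: str) -> AgentResponse:
    return AgentResponse("support", f"support reply: {query}")

SUB_AGENTS: Dict[str, Callable[[str], AgentResponse]] = {
    "search": search_and_discovery,
    "recommend": recommendation,
    "cart": cart_management,
    "support": post_purchase,
}

def classify_intent(query: str) -> str:
    """Stand-in for the LLM intent router; keyword rules for the sketch."""
    q = query.lower()
    if "add" in q or "remove" in q:
        return "cart"
    if "refund" in q or "return" in q:
        return "support"
    if "recommend" in q or "suggest" in q:
        return "recommend"
    return "search"

def orchestrate(query: str) -> AgentResponse:
    # Central orchestrator: classify, then dispatch to the matching sub-agent.
    return SUB_AGENTS[classify_intent(query)](query)
```

In the real system each sub-agent is itself a full LLM workflow (planning, retrieval, tool calls, response formulation) rather than a single function.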
2. Search and Discovery Agent: Pipeline and Optimization
Within the multi-agent composition, the Search and Discovery agent addresses free-form product queries using a four-stage process:
- Query Understanding & Expansion: The LLM extracts attribute–value pairs and applies HyDE (Hypothetical Document Embeddings) to generate a hypothetical product description, enriching the semantic context of the query.
- Retrieval: Dense embedding search operates over PayPal’s product catalog to identify candidate results.
- Ranking: Top-K results are re-scored by a dedicated LLM module to optimize relevance.
- LLM Evaluator: Final results are filtered and narrative responses generated for the end user.
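The four stages above can be sketched end to end. All components here are toy stand-ins under stated assumptions: `embed` is a deterministic hash-seeded vector in place of a dense encoder, the HyDE expansion appends a template instead of an LLM-drafted product description, and the re-ranker uses token overlap in place of an LLM scorer.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic embedding; a dense encoder in the real system."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def expand_query(query: str) -> str:
    # Stage 1: HyDE-style expansion (an LLM would draft the description).
    return f"{query}. A product matching these attributes."

def retrieve(query_vec: np.ndarray, catalog: list, k: int = 3) -> list:
    # Stage 2: dense nearest-neighbour search by cosine similarity.
    scored = [(float(query_vec @ embed(doc)), doc) for doc in catalog]
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]

def rerank(query: str, candidates: list) -> list:
    # Stage 3: LLM re-ranking, approximated here by token overlap.
    terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)

def respond(query: str, ranked: list) -> str:
    # Stage 4: evaluator/formatter producing the narrative answer.
    return f"Top match for '{query}': {ranked[0]}" if ranked else "No results."

catalog = ["red running shoes", "blue hiking boots", "wireless earbuds"]
q = "running shoes"
hits = retrieve(embed(expand_query(q)), catalog)
print(respond(q, rerank(q, hits)))  # → Top match for 'running shoes': red running shoes
```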
Empirical profiling identified that retrieval accounted for over 50% of the system’s end-to-end (E2E) response time. Consequently, the original query-formulation LLM was replaced with a fine-tuned Nemotron small language model (SLM), yielding greater than 50% savings in the retrieval stage and substantial improvements in overall latency and cost.
3. LLM Model Choice and Fine-Tuning Methodology
Model Selection
PayPal adopted the 8B-parameter variant from the NVIDIA llama3.1-nemotron-nano family due to its optimal balance between inference speed (sub-2s targets) and capability to capture commerce semantics. NeMo-provided recipes and checkpoint sharding facilitated distributed training and deployment.
LoRA-based Fine-Tuning
Fine-tuning employed Low-Rank Adaptation (LoRA), which augments each attention or feed-forward layer weight $W \in \mathbb{R}^{d \times k}$ with low-rank, trainable matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$:

$$W' = W + BA, \qquad r \ll \min(d, k)$$

This technique reduces GPU memory requirements by storing only the low-rank factors $B$ and $A$ of $\Delta W = BA$, relieving the need to fully replicate the weight tensor $W$.
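A minimal numpy sketch makes the memory argument concrete: the frozen base weight is combined with trainable low-rank factors, so the effective weight is the sum of the two, and only the factors need gradients. The dimensions below are illustrative, not taken from the source.

```python
import numpy as np

d, k, r = 512, 512, 8              # output dim, input dim, LoRA rank (r << min(d, k))
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))        # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01 # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, zero-init so the
                                   # effective weight equals W at the start

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Equivalent to (W + B @ A) @ x, without materialising the full delta.
    return W @ x + B @ (A @ x)

x = rng.normal(size=k)
assert np.allclose(lora_forward(x), (W + B @ A) @ x)

# Trainable parameters: r*(d + k) for LoRA vs d*k for full fine-tuning.
print(r * (d + k), "trainable vs", d * k, "full")  # 8192 vs 262144
```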
Hyperparameter Optimization
Twenty model variants were generated, sweeping:
- Learning rates
- LoRA ranks
- Optimizers: Adam and AdamW (with decoupled weight decay)
- Learning rate scheduling: cosine annealing
- Regularization: L2 weight decay and adapter dropout
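A sweep over the axes listed above can be generated as a Cartesian product. The grid values below are hypothetical placeholders (the source elides the exact ranges that produced its twenty variants); only the axis names follow the text.

```python
from itertools import product

grid = {
    "learning_rate": [1e-4, 5e-5],   # placeholder values
    "lora_rank": [8, 16],            # placeholder values
    "optimizer": ["adam", "adamw"],
    "adapter_dropout": [0.0, 0.1],   # placeholder values
}

# One dict per model variant, keyed by hyperparameter name.
variants = [dict(zip(grid, combo)) for combo in product(*grid.values())]
print(len(variants))  # 16 configurations from this toy grid
```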
Optimization routines followed standard update formulas for Adam, AdamW, and cosine annealing schedules as documented in the primary reference.
4. Datasets, Evaluation Protocols, and Quantitative Results
Datasets
- Supervised Fine-Tuning (SFT): 10,000 prompt–response pairs from historical Shopping Assistance logs (70% train, 30% validation split).
- Direct Preference Optimization (DPO): 5,594 training and 1,863 validation preference pairs.
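The SFT split above is a plain 70/30 partition; a seeded shuffle keeps it reproducible. Integer stand-ins replace the actual prompt–response pairs here.

```python
import random

pairs = list(range(10_000))          # stand-ins for 10,000 prompt-response pairs
random.Random(42).shuffle(pairs)     # seeded for a reproducible split
cut = int(0.7 * len(pairs))
train, val = pairs[:cut], pairs[cut:]
print(len(train), len(val))  # 7000 3000
```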
Metrics
Model variants were compared using:
- Retrieval latency (ms)
- Full agent response time (ms)
- GPU cost per query ($)
- Quality score (0–5 scale for hypothetical product generation)
- E2E composite score (spanning query formulation, attribute extraction, and recommendation)
Results Summary
| Model/Method | Retrieval Latency | E₂E Time | GPU Cost | Quality Score Change |
|---|---|---|---|---|
| Baseline Llama3-8B | 3,800 ms | 4,500 ms | baseline | 2.03 (baseline) |
| NFT (non-fine-tuned) Nemotron-8B | ↓ 57.9% | ↓ 48.9% | ↓ 45.5% | ↓ 36.2% |
| SFT Nemotron-8B (champion) | ↓ 58% | ↓ 49% | ↓ 45% | ↓ 21.7% (+23% vs NFT) |
Interpretation: Retrieval latency decreased from 3.8 s to approximately 1.6 s. The champion SFT model reduced full agent response time to ~2.3 s, roughly halved GPU cost per query, and recovered most of the quality lost under zero-shot (non-fine-tuned) deployment.
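The interpretation above follows directly from the table; a quick arithmetic check using the champion model's reductions:

```python
baseline_retrieval_ms = 3_800
baseline_e2e_ms = 4_500

# Champion SFT reductions from the results table: 58% retrieval, 49% E2E.
retrieval_after = baseline_retrieval_ms * (1 - 0.58)
e2e_after = baseline_e2e_ms * (1 - 0.49)

print(round(retrieval_after))  # 1596 ms, i.e. ~1.6 s
print(round(e2e_after))        # 2295 ms, i.e. ~2.3 s
```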
5. Performance Analysis, Scalability, and Trade-Offs
Bottleneck Resolution
The introduction of domain-specific LoRA adapters led to more compact, semantically dense query embeddings, directly reducing the complexity and duration of dense nearest-neighbor search. This also lowered the number of downstream ranking operations required, resolving the retrieval bottleneck that dominated response time.
Quality/Latency/Cost Trade-offs
- Non-fine-tuned (NFT) models prioritize speed and cost but degrade quality up to 36%.
- Supervised fine-tuning (SFT) offers a near 50% reduction in latency and cost, with quality degraded by ~22%.
- Direct Preference Optimization (DPO) maintains quality (<7% drop) and E2E score (<2% drop), at the cost of smaller latency gains (~18%).
System Scalability
The NeMo framework’s three-dimensional parallelism (tensor, pipeline, data) enables seamless scaling from single- to multi-GPU clusters (H100/B200 GPUs). NIM-driven inference selects between TensorRT-LLM and vLLM backends at runtime. Empirically, Blackwell B200 GPUs with tensor parallelism (TP=4) achieved 35% faster inference than H100. LangChain-based orchestration and modular microservices simplify the addition or modification of agent workflows without disturbing the optimized retrieval core.
6. Key Algorithms and Mathematical Formulations
Key update and scheduling rules employed:
- LoRA Update: $W' = W + BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$
- Cosine Annealing Schedule: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{t\pi}{T}\right)$
- Adam: $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, where $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$
- AdamW: $\theta_t = \theta_{t-1} - \eta \left( \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) + \lambda\, \theta_{t-1} \right)$
These formulations are applied systematically in the model optimization process.
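The schedule and update rules listed above follow the standard formulations; a plain numpy rendering for reference (hyperparameter defaults are the usual conventions, not values from the source):

```python
import numpy as np

def cosine_annealing(t: int, T: int, eta_max: float, eta_min: float = 0.0) -> float:
    """Learning rate at step t of T under cosine annealing."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T))

def adamw_step(theta, grad, m, v, t, eta=1e-3,
               beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    """One AdamW update; setting lam=0 recovers plain Adam."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay (the lam * theta term) distinguishes AdamW from Adam.
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + lam * theta)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.1, -0.2])
theta, m, v = adamw_step(theta, grad, m, v, t=1)
```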
7. Significance and Blueprint for Multi-Agent LLM Commerce Systems
NEMO-4-PAYPAL constitutes the first documented application of the NVIDIA NeMo Framework for commerce-specific agent optimization at production scale. By methodically tuning small LLMs for the retrieval–recommendation axis of e-commerce, the system achieves step-change reductions in latency and GPU operational expenditure while sustaining or enhancing user-level recommendation quality. The modular multi-agent design, robust hyperparameter optimization, and transparent scalability make NEMO-4-PAYPAL a replicable architecture for large-scale, LLM-driven commerce systems, where latency and economic efficiency are primary operational constraints (Sahami et al., 25 Dec 2025).