
NEMO-4-PAYPAL: LLM-Optimized Commerce System

Updated 29 December 2025
  • NEMO-4-PAYPAL is a multi-agent e-commerce system that leverages NVIDIA's NeMo Framework to optimize LLM workflows and transform search and recommendation pipelines.
  • It applies LoRA-based fine-tuning and layered agent orchestration to reduce query latency by over 50% while lowering GPU costs significantly.
  • Its modular design, hyperparameter optimization, and scalable architecture provide a replicable blueprint for efficient, production-scale commerce applications.

NEMO-4-PAYPAL is a production-scale, multi-agent commerce system developed by PayPal in collaboration with NVIDIA, leveraging LLM optimization via NVIDIA's NeMo Framework to orchestrate agentic workflows for e-commerce application scenarios. At its core, the system replaces traditional search and recommendation pipelines with a layered architecture of specialized LLM-powered agents, with the Search and Discovery agent as a primary performance target. NEMO-4-PAYPAL applies low-rank adaptation (LoRA) for efficient model fine-tuning, achieving substantial reductions in query latency and cost without sacrificing quality, thereby establishing a scalable paradigm for multi-agent LLM optimization in commercial environments (Sahami et al., 25 Dec 2025).

1. Architectural Overview

NEMO-4-PAYPAL integrates four principal architectural layers, each contributing distinct capabilities to the overall commerce agent system:

  • Commerce Platform: Hosts user interface, authentication, payment, and cart management services.
  • Agent Orchestration Framework: A Java-based API server utilizing LangChain for session management, multi-agent orchestration, and standardized domain-specific tool invocation.
  • LLM Integration (“Model Garden”): Supports selection among general-purpose or commerce-tuned LLMs (such as the llama3.1-nemotron-nano-8B-v1), deployed via NVIDIA NIM microservices.
  • Personalization Engine: Constructs and maintains per-user behavioral profiles to provide conditional recommendations.

The central orchestrator routes user interactions to specialized sub-agents, each responsible for a distinct workflow: Search and Discovery (free-form product queries), Recommendation (personalized suggestions), Cart-Management (basket operations), and Post-Purchase Support. Each sub-agent operates as an LLM workflow encompassing query understanding, task-specific planning, retrieval/ranking (drawing on dense embedding searches and LLM-powered re-ranking), external API calls, and response formulation.
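As an illustration of this routing layer, the sketch below dispatches classified intents to sub-agent stubs. The agent names, intent keys, and handler bodies are hypothetical; the production system uses LangChain-based session management and orchestration rather than a plain dictionary:

```python
from typing import Callable, Dict

# Hypothetical sub-agent stubs; in the real system each wraps a full LLM
# workflow (query understanding, planning, retrieval/ranking, API calls,
# response formulation).
def search_and_discovery(query: str) -> str:
    return f"[search results for {query!r}]"

def recommendation(query: str) -> str:
    return f"[personalized suggestions for {query!r}]"

def cart_management(query: str) -> str:
    return f"[cart updated per {query!r}]"

def post_purchase_support(query: str) -> str:
    return f"[support response for {query!r}]"

AGENTS: Dict[str, Callable[[str], str]] = {
    "search": search_and_discovery,
    "recommend": recommendation,
    "cart": cart_management,
    "support": post_purchase_support,
}

def route(intent: str, query: str) -> str:
    """Central orchestrator: dispatch a classified intent to its sub-agent."""
    handler = AGENTS.get(intent)
    if handler is None:
        raise ValueError(f"no agent registered for intent {intent!r}")
    return handler(query)

print(route("search", "wireless earbuds under $50"))
```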

2. Search and Discovery Agent: Pipeline and Optimization

Within the multi-agent composition, the Search and Discovery agent addresses free-form product queries using a four-stage process:

  1. Query Understanding & Expansion: The LLM extracts attribute–value pairs and applies HyDE (Hypothetical Document Embeddings) to generate a hypothetical product description, enriching the semantic context.
  2. Retrieval: Dense embedding search operates over PayPal’s product catalog to identify candidate results.
  3. Ranking: Top-K results are re-scored by a dedicated LLM module to optimize relevance.
  4. LLM Evaluator: Final results are filtered and narrative responses generated for the end user.
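A minimal sketch of this four-stage flow, with hypothetical stubs (llm_extract_attributes, llm_hyde, embed, llm_rerank, llm_summarize, CatalogIndex) standing in for the actual model and index calls, which the source does not specify:

```python
# Stub stand-ins for the model and index calls (hypothetical; the production
# system uses NIM-deployed Nemotron models and a dense vector index).
def llm_extract_attributes(query):
    return {"category": "shoes", "color": "white"}             # stub

def llm_hyde(query, attributes):
    return f"A hypothetical product description for: {query}"  # HyDE expansion stub

def embed(text):
    return [0.0] * 768                                         # dense embedding stub

def llm_rerank(query, candidates):
    return candidates                                          # LLM re-scoring stub

def llm_summarize(query, items):
    return f"Top picks for {query!r}: {items}"                 # narrative response stub

class CatalogIndex:
    def search(self, vector, top_k):
        return [f"product_{i}" for i in range(top_k)]          # nearest-neighbor stub

def answer_product_query(query, index, top_k=50):
    # 1. Query understanding & expansion (attribute extraction + HyDE document)
    attributes = llm_extract_attributes(query)
    hypothetical_doc = llm_hyde(query, attributes)
    # 2. Retrieval: dense embedding search over the product catalog
    candidates = index.search(embed(hypothetical_doc), top_k=top_k)
    # 3. Ranking: top-K candidates re-scored by a dedicated LLM module
    ranked = llm_rerank(query, candidates)
    # 4. Evaluation: filter results and generate the narrative response
    return llm_summarize(query, ranked[:10])

print(answer_product_query("white running shoes", CatalogIndex()))
```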

Empirical profiling identified that retrieval accounted for over 50% of the system's end-to-end (E2E) response time. Consequently, the original query-formulation LLM was replaced with a fine-tuned Nemotron small language model (SLM), yielding greater than 50% savings in the retrieval stage and substantial improvements in overall latency and cost.

3. LLM Model Choice and Fine-Tuning Methodology

Model Selection

PayPal adopted the 8B-parameter variant from the NVIDIA llama3.1-nemotron-nano family due to its optimal balance between inference speed (sub-2s targets) and capability to capture commerce semantics. NeMo-provided recipes and checkpoint sharding facilitated distributed training and deployment.

LoRA-based Fine-Tuning

Fine-tuning employed Low-Rank Adaptation (LoRA), which augments each attention or feed-forward layer weight $W \in \mathbb{R}^{d \times d}$ with low-rank, trainable matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$:

$$\Delta W = A \cdot B, \qquad W_{\text{tuned}} = W_{\text{base}} + \Delta W$$

This technique sharply reduces trainable parameters and GPU memory: only the low-rank factors $A$ and $B$ (which together define $\Delta W$) are stored and updated, rather than a full trainable copy of the weight tensor $W$.
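A minimal PyTorch sketch of this adapter pattern. The zero-initialization of B and the alpha/r scaling follow common LoRA practice and are assumptions here, since NeMo's actual adapter implementation is not detailed in the source:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x (W_base + ΔW)^T with ΔW = A·B; W_base is frozen, only A and B train."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                # freeze W_base
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)   # d×r factor
        self.B = nn.Parameter(torch.zeros(r, d_in))           # r×d factor; ΔW = 0 at init
        self.scale = alpha / r                                # common LoRA scaling (assumption)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.A @ self.B                             # ΔW = A·B, shape (d_out, d_in)
        return x @ (self.base.weight + self.scale * delta_w).T

layer = LoRALinear(d_in=4096, d_out=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2·4096·8 = 65,536 trainable parameters
```

For $d = 4096$ and $r = 8$, each adapted layer trains only $2 \cdot 4096 \cdot 8 \approx 65\text{K}$ parameters instead of the $4096^2 \approx 16.8\text{M}$ entries of the full weight matrix.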

Hyperparameter Optimization

Twenty model variants were generated, sweeping:

  • Learning rates $\eta \in \{1\times10^{-5},\ 5\times10^{-5},\ 1\times10^{-4},\ 5\times10^{-4}\}$
  • LoRA ranks $r \in \{4, 8, 16\}$
  • Optimizers: Adam and AdamW (with weight decay $\lambda$)
  • Learning rate scheduling: cosine annealing
  • Regularization: L2 weight decay (up to $1\times10^{-2}$) and adapter dropout

Optimization routines followed standard update formulas for Adam, AdamW, and cosine annealing schedules as documented in the primary reference.
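The sweep can be expressed as a simple grid over the values listed above. Which twenty of these combinations were actually trained is not enumerated in the source, and the adapter-dropout rate shown is an assumed placeholder:

```python
import itertools

learning_rates = [1e-5, 5e-5, 1e-4, 5e-4]
lora_ranks = [4, 8, 16]
optimizers = ["adam", "adamw"]        # AdamW adds decoupled weight decay λ
weight_decays = [0.0, 1e-2]           # L2 regularization, up to 1e-2

grid = [
    {"lr": lr, "rank": r, "optimizer": opt, "weight_decay": wd,
     "lr_schedule": "cosine_annealing", "adapter_dropout": 0.1}  # dropout value assumed
    for lr, r, opt, wd in itertools.product(
        learning_rates, lora_ranks, optimizers, weight_decays)
]
# The full grid has 4 * 3 * 2 * 2 = 48 points; the paper trained 20 variants,
# so some pruning or sampling of this grid is implied.
print(len(grid))  # 48
```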

4. Datasets, Evaluation Protocols, and Quantitative Results

Datasets

Metrics

Model variants were compared using:

  • Retrieval latency (ms)
  • Full agent response time (ms)
  • GPU cost per query ($)
  • Quality score (0–5 scale for hypothetical product generation)
  • E2E composite score (spanning query formulation, attribute extraction, and recommendation)
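For concreteness, a small sketch of a per-variant metrics record and the kind of relative-change comparison implied by the results table below; the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class VariantMetrics:
    """Per-variant measurements used to compare the sweep (units as above)."""
    retrieval_latency_ms: float
    agent_response_ms: float       # full end-to-end response time
    gpu_cost_per_query_usd: float
    quality_score: float           # 0-5, hypothetical product generation
    e2e_composite: float

def relative_change(baseline: VariantMetrics, candidate: VariantMetrics) -> dict:
    """Fractional change vs. baseline per metric; negative means a reduction."""
    return {
        field: (getattr(candidate, field) - getattr(baseline, field))
               / getattr(baseline, field)
        for field in vars(baseline)
    }
```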

Results Summary

| Model/Method | Retrieval Latency | E2E Time | GPU Cost | Quality Score / Change |
|---|---|---|---|---|
| Baseline Llama3-8B | 3,800 ms | 4,500 ms | X | 2.03 (baseline) |
| NFT Nemotron-8B | ↓ 57.9% | ↓ 48.9% | ↓ 45.5% | ↓ 36.2% |
| SFT Nemotron-8B (champion) | ↓ 58% | ↓ 49% | ↓ 45% | ↓ 21.7% (+23% vs NFT) |

Interpretation: Retrieval latency decreased from 3.8 s to approximately 1.6 s. The champion SFT model reduced full agent response time to ~2.3 s, roughly halved GPU cost per query, and recovered most of the quality lost under non-fine-tuned (zero-shot) deployment.

5. Performance Analysis, Scalability, and Trade-Offs

Bottleneck Resolution

The introduction of domain-specific LoRA adapters led to more compact, semantically dense query embeddings, directly reducing the complexity and duration of dense nearest-neighbor search. This also lowered the number of downstream ranking operations required, resolving the retrieval bottleneck that dominated response time.

Quality/Latency/Cost Trade-offs

  • Non-fine-tuned (NFT) models prioritize speed and cost but degrade quality by up to 36%.
  • Supervised fine-tuning (SFT) offers a near-50% reduction in latency and cost, with quality degraded by ~22%.
  • Direct Preference Optimization (DPO) largely preserves quality (<7% drop) and E2E score (<2% drop), at the cost of smaller latency gains (18%).

System Scalability

The NeMo framework’s three-dimensional parallelism (tensor, pipeline, data) enables seamless scaling from single- to multi-GPU clusters (H100/B200 GPUs). NIM-driven inference selects between TensorRT-LLM and vLLM backends at runtime. Empirically, Blackwell B200 GPUs with tensor parallelism (TP=4) achieved 35% faster inference than H100. LangChain-based orchestration and modular microservices simplify the addition or modification of agent workflows without disturbing the optimized retrieval core.
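A hedged sketch of what such a deployment configuration might look like; the actual NeMo/NIM configuration schema is not shown in the source, so every key here is illustrative:

```python
# Hypothetical deployment knobs reflecting the setup described above; the
# actual NeMo/NIM configuration schema is not given in the source.
training_parallelism = {
    "tensor_model_parallel_size": 4,    # TP=4, the B200 setting benchmarked above
    "pipeline_model_parallel_size": 1,  # pipeline parallelism available if needed
    "data_parallel_size": 8,            # assumption: remaining GPUs run data parallel
}
inference_deployment = {
    "backend": "auto",   # NIM selects TensorRT-LLM or vLLM at runtime
    "gpu": "B200",       # ~35% faster than H100 at TP=4, per the text
}
print(training_parallelism, inference_deployment)
```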

6. Key Algorithms and Mathematical Formulations

Key update and scheduling rules employed:

  • LoRA Update: $\Delta W = A \cdot B$, $W_{\text{tuned}} = W_{\text{base}} + \Delta W$
  • Cosine Annealing Schedule:

$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left[1 + \cos(\pi t / T)\right]$$

  • Adam:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L_t)^2$$
$$\theta_t = \theta_{t-1} - \eta\,\frac{m_t}{\sqrt{v_t}+\epsilon}$$

  • AdamW:

$$\theta_t = \theta_{t-1} - \eta\left(\frac{m_t}{\sqrt{v_t}+\epsilon} + \lambda\,\theta_{t-1}\right)$$

These formulations are applied systematically in the model optimization process.
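These rules translate directly into code. The sketch below implements the cosine schedule and a scalar AdamW step (setting λ = 0 recovers plain Adam); note that production optimizers also apply bias correction, which the formulas above omit:

```python
import math

def cosine_annealing(t: int, T: int, eta_min: float, eta_max: float) -> float:
    """Learning rate at step t of T, per the schedule above."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def adamw_step(theta, grad, m, v, eta, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.0):
    """One AdamW update in scalar form (lam=0 recovers plain Adam).

    Real optimizers apply this elementwise over parameter tensors and
    usually add bias correction, which the source's formulas omit.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate v_t
    theta = theta - eta * (m / (math.sqrt(v) + eps) + lam * theta)  # decoupled decay λθ
    return theta, m, v

# Toy usage: anneal η over T=100 steps while minimizing f(θ) = θ² (grad = 2θ).
theta, m, v = 1.0, 0.0, 0.0
for t in range(100):
    eta = cosine_annealing(t, T=100, eta_min=1e-6, eta_max=5e-5)
    theta, m, v = adamw_step(theta, grad=2 * theta, m=m, v=v, eta=eta, lam=1e-2)
```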

7. Significance and Blueprint for Multi-Agent LLM Commerce Systems

NEMO-4-PAYPAL constitutes the first documented application of the NVIDIA NeMo Framework for commerce-specific agent optimization at production scale. By methodically tuning small LLMs for the retrieval–recommendation axis of e-commerce, the system achieves step-change reductions in latency and GPU operational expenditure while sustaining or enhancing user-level recommendation quality. The modular multi-agent design, robust hyperparameter optimization, and transparent scalability make NEMO-4-PAYPAL a replicable architecture for large-scale, LLM-driven commerce systems, where latency and economic efficiency are primary operational constraints (Sahami et al., 25 Dec 2025).

References

  • Sahami et al., 25 Dec 2025.
