Papers
Topics
Authors
Recent
Search
2000 character limit reached

Magentic-Marketplace: LLM Agent Economic Markets

Updated 2 July 2026
  • Magentic-Marketplace is an open-source simulation environment for two-sided markets where LLM-based Assistant and Service agents execute economic decisions.
  • It provides a formal framework with REST endpoints and modular components for search, dialogue, and transaction management in agent-mediated commerce.
  • Quantitative analyses reveal key insights on market efficiency, biases such as first-proposal bias, and vulnerabilities to manipulation affecting agent performance.

Magentic-Marketplace is an open-source simulation environment designed for rigorous study of two-sided agentic markets where LLM agents mediate economic decisions, encompassing consumer-side "Assistant" agents representing users and business-side "Service" agents representing sellers. The environment provides a formal framework for modeling, evaluating, and benchmarking agent behavior, market efficiency, biases, and vulnerabilities—particularly as agent-mediated commerce becomes more prevalent with the maturation of autonomous LLM-based systems (Bansal et al., 27 Oct 2025).

1. Formal Environment Specification

Magentic-Marketplace instantiates a two-sided market A=AcAsA = A_c \cup A_s comprising consumer-side Assistant agents AcA_c and Service agents AsA_s. The universe of goods or tasks I\mathbb{I} is finite; each consumer request iIi \in \mathbb{I} is a bundle of 1–3 items and 1–2 required amenities. The market state MtM_t at time tt maintains:

  • Registered agents and their identities,
  • Service catalogs (mapping each sAss \in A_s to its menu/items, amenities, and prices),
  • Message queues recording asynchronous market communications (types: “text,” “order_proposal,” “payment”),
  • Transaction ledger of completed exchanges.

For any transaction jj between consumer aca_c and service AcA_c0, let AcA_c1 indicate exact fit to bundle and amenity constraints, and AcA_c2 the price. Consumer utility:

AcA_c3

with AcA_c4 (intrinsic value) typically set to AcA_c5 times average item price, AcA_c6. Service agent utility:

AcA_c7

Social welfare over all transactions AcA_c8:

AcA_c9

Agents interact over three REST endpoints: /register (market entry), /protocol (action schema discovery), and /action (action invocation). A prototypical protocol: Assistants perform “search” (receiving a paginated candidate set), “send text” to request prices, receive structured “order_proposal” offers, and finalize with “send payment.”

2. Simulation Architecture and Extensibility

Architecturally, the Magentic-Marketplace consists of independent agent client processes communicating with a central server via HTTP/REST. Core server modules include:

  • Catalog and Search: indexes and retrieves ranked service candidates,
  • Dialogue Manager: message routing,
  • Transaction Ledger: atomicity and recording of trade.

Agents embed an “action router” polling /protocol and funneling messages via /action. The minimal API (three endpoints) allows protocol elements (e.g., new transaction types like “refund” or “review”) to be introduced centrally without agent code modifications, facilitating backward-compatible experimentation. Researchers can substitute custom modules (e.g., alternative search, matching, or pricing mechanisms) within the same experimental ecosystem (Bansal et al., 27 Oct 2025).

3. Evaluation Metrics and Experimental Methodology

The environment evaluates agentic market performance according to:

  • Consumer welfare: AsA_s0,
  • Service revenue: AsA_s1,
  • Social welfare: AsA_s2,
  • Response quality: fraction of exact fits (AsA_s3),
  • Latency and message count per transaction.

Controlled variables include model family (e.g., GPT-4o, Gemini-2.5-Flash, Sonnet-4/4.5 for proprietary; GPT-OSS-20b, Qwen3-14b-YARN for open-source), search budget (candidate set size 3–100), market scale (AsA_s4 up to AsA_s5 consumers/businesses), and distractor items ensuring non-overlapping consumer baskets. Baselines include random selection, “cheapest” among items and amenities, and global optimal. This facilitates fine-grained comparison between LLM-agent performance, algorithmic search mechanisms, and classical welfare-theoretical lower bounds (Bansal et al., 27 Oct 2025).

4. Quantitative Results and Market Dynamics

Key findings include:

  • Frontier model near-optimal welfare under ideal search: GPT-4.1, Gemini-2.5-Flash reach 95–98% of the theoretical optimum with perfect search; Sonnet-4.5 is within 2%.
  • Search degradation: Lexical (realistic) search causes significant welfare loss: e.g., GPT-4.1 drops to ≈85%, open-source GPT-OSS-20b to ≈70%, Qwen3 to ≈50% of optimal.
  • Performance scales poorly: Welfare decreases by 5–10 points (proprietary) or up to 20 points (open-source) when moving to larger markets.
  • Severe “first-proposal bias”: First proposals are accepted 60–100% of the time with up to 30× speed advantage over higher-quality but slower responses. GPT-4o and Sonnet-4.5 accept the first proposal in 100% of interactions, regardless of subsequent alternatives.
  • Consideration set paradox: Expanding the search set from 3 to 100 candidates often reduces welfare; e.g., Sonnet-4 drops by 65.4%, GPT-5 by 44%. Despite more options, most models contact only 2–4 businesses on average.
  • Position bias: Most frontier models distribute selections uniformly among the top ranks, but Qwen3-14b is highly biased toward lower-ranked results.

These results highlight interactional pathologies unique to autonomous agentic markets, not evident in classical, two-party negotiation tests (Bansal et al., 27 Oct 2025).

5. Behavioral Biases and Manipulation Vulnerabilities

Empirical analysis reveals the emergence of strongly suboptimal behaviors:

  • First-proposal and position bias: Quantified as AsA_s6–AsA_s7, with theoretical upper bounds for choice steeply favoring the first arrival.
  • Manipulation susceptibility: Open-source models (GPT-OSS-20b, Qwen3-14b) are vulnerable to authority/social-proof tactics and prompt injection, with manipulated payments rising to ≈1.5–2.0x normal; prompt injections can redirect nearly all payments. Proprietary models exhibit relative robustness except against strong prompt injection attacks.
  • Coordination failures are catalyzed by the paradox of choice and first-proposal anchoring: increasing candidate sets lead to early, low-quality matches being accepted, impeding optimal allocation (Bansal et al., 27 Oct 2025).

6. Design Principles and Future Research Directions

The empirical regularities unveiled by the Magentic-Marketplace inform several actionable guidelines for system and protocol design:

  • Search and discovery: Cap search result sets (3–5 results) to avoid cognitive overload and the paradox of choice; combine lexical with semantic filtering.
  • Negotiation protocol: Enforce minimum “voting” (requiring assessment of AsA_s8 proposals before payment), randomize proposal order, or mandate time windows to suppress first-proposal bias.
  • Reputation and trust: Employ cryptographically signed credentials to mitigate fake-authority attacks; require human-in-the-loop for high-value transactions.
  • Adversarial robustness: Strict prompt sanitization, instruction masking, and adversarial training are necessary to contain manipulation risk.
  • Extension: The framework supports dynamic markets with online learning, combined AI–human agent populations, expanded transaction types (refund/review/rating), and supply-chain scenarios with higher-order agent roles as buyers/sellers (Bansal et al., 27 Oct 2025).

A worked example of Magentic-Marketplace within the broader Marketplace Evaluation paradigm specifies user segmentation (novices/experts), generator and retriever sets, softmax-based choice and utility updates, and expected market dynamics (market share, dominance, retention metrics). Market-based evaluation exposes properties such as early adoption advantage, sensitivity to path-dependence, and divergence from static benchmark predictions (Kim et al., 15 Apr 2026).

7. Comparative Context and Applications

Agentic-market environments such as Magentic-Marketplace represent a methodological advance over both conventional static benchmarks and constrained bilateral agent simulations. In comparison to LLM agent-assistants in C2C markets (e.g., FaMA (Yan et al., 4 Sep 2025)), which focus on GUI replacement and workflow automation, Magentic-Marketplace exposes emergent, multi-agent dynamics—market-level welfare, competitive fairness, and system-level biases. Marketplace evaluation under repeated, competitive scenarios (market share, HHI, dominance gap, retention) can reveal operational phenomena—e.g., market concentration, adoption trajectories, and systemic vulnerabilities—not accessible via single-agent accuracy metrics (Kim et al., 15 Apr 2026).

A plausible implication is that market design principles validated in multi-agent simulation (e.g., enforced diversity of considered offers, bounded search sets, authenticated seller identities) will become integral to the engineering of robust agent-mediated marketplaces at internet scale.

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Magentic-Marketplace.