LLM-Based Proxy Overview

Updated 8 July 2025
  • LLM-based proxies are systems where large language models mediate interactions among users, software components, and auxiliary models to optimize costs, security, and decision orchestration.
  • They employ diverse architectural patterns—from RESTful gateways to control planes—to route queries, compress context, and facilitate effective expertise transfer with measurable performance gains.
  • LLM-based proxies play crucial roles in preference elicitation, adversarial resilience, and emergent ability forecasting, offering empirical benchmarks that underscore their impact on advanced AI deployments.

An LLM-based proxy is a system, architectural pattern, or algorithm in which an LLM, often in concert with plugins or auxiliary models, mediates or reframes an interaction, task, or control flow between different software components, users, or agents. LLM-based proxies are deployed to accomplish goals including cost optimization, security evasion or defense, preference elicitation, model adaptation, alignment, robust decision orchestration, and efficient information management. These proxies operate at varying levels of abstraction: from middleware interfaces that mask underlying LLM complexity or orchestrate access to different models, to lightweight mechanisms for expertise transfer or efficient context handling.

1. Architectural Patterns and Orchestration

LLM-based proxies can take the form of standalone services, embedded modules, RESTful gateways, or auxiliary models. For example, LLMProxy is a system-level proxy that routes user queries to different LLMs based on dynamic cost–quality tradeoffs, managing model selection, semantic caching, and context reduction (2410.11857). MCP Bridge exemplifies a RESTful proxy that abstracts the MCP server communication protocol for LLM tool use, incorporating client risk assessment and multi-tiered security to expose tool capabilities across resource-constrained environments (2504.08999). Sentinel functions as a lightweight, sentence-level compression proxy that uses attention signals from a small LLM to filter and compress context for downstream models (2505.23277).

Multi-model orchestration via proxies also appears in LLM control planes, in which routers classify and dispatch queries based on predicted complexity or quality needs (2501.01818). Here, a proxy may act as a classifier, scheduler, or intermediary, weighing signals such as predicted sequence length (2404.08509) or heuristic answers from slim models (2402.12052).
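To make the routing pattern concrete, the sketch below shows a minimal cost–quality router. All names here (ModelEndpoint, estimate_complexity, the example numbers) are illustrative assumptions, not the APIs of LLMProxy or any cited control plane.

```python
# Minimal cost-quality router sketch (illustrative only; names and numbers are placeholders).
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float   # relative serving cost
    quality_score: float        # offline benchmark quality in [0, 1]

def route(query: str,
          endpoints: list[ModelEndpoint],
          estimate_complexity: Callable[[str], float],
          budget: float) -> ModelEndpoint:
    """Pick the cheapest endpoint whose quality matches the query's estimated
    complexity; otherwise fall back to the best model we can afford."""
    complexity = estimate_complexity(query)   # e.g., from a slim classifier, in [0, 1]
    eligible = [e for e in endpoints
                if e.quality_score >= complexity and e.cost_per_1k_tokens <= budget]
    if eligible:
        return min(eligible, key=lambda e: e.cost_per_1k_tokens)
    affordable = [e for e in endpoints if e.cost_per_1k_tokens <= budget]
    return max(affordable or endpoints, key=lambda e: e.quality_score)

# Example usage with made-up numbers.
endpoints = [
    ModelEndpoint("small-llm", cost_per_1k_tokens=0.1, quality_score=0.6),
    ModelEndpoint("large-llm", cost_per_1k_tokens=1.0, quality_score=0.9),
]
choice = route("Summarize this paragraph.", endpoints,
               estimate_complexity=lambda q: 0.4, budget=0.5)
print(choice.name)  # -> small-llm
```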

2. Proxy-Tuning and Expertise Transfer

Proxy-tuning is an algorithmic method by which a small tuned LLM serves as an inference-time auxiliary to steer the decoding of a much larger, base (possibly black-box) LLM without accessing its internal parameters (2401.08565). At each token generation step, the proxy (small expert) and its untuned variant (anti-expert) provide logit differences that shift the large model’s output distribution:

$$P_M(x \mid x_{<t}) = \mathrm{softmax}\big(S_M(x \mid x_{<t}) + S_{M^+}(x \mid x_{<t}) - S_{M^-}(x \mid x_{<t})\big)$$

where $S_M$, $S_{M^+}$, and $S_{M^-}$ are the logits of the large base model, the tuned proxy (expert), and its untuned variant (anti-expert), respectively.

This enables domain-, instruction-, or task-specific adaptation of proprietary LMs at dramatically lower compute cost. Empirically, this approach closed 88% of the gap between tuned and untuned Llama2-70B on knowledge and reasoning benchmarks and enabled effective temporal adaptation of GPT-3.5.
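The decoding rule above can be sketched directly on logits. The snippet below is a minimal illustration of the proxy-tuning update, assuming you already have per-step logit tensors over a shared vocabulary from the base, expert, and anti-expert models; the random tensors stand in for real model outputs and the loading details are omitted.

```python
import torch
import torch.nn.functional as F

def proxy_tuned_distribution(base_logits: torch.Tensor,
                             expert_logits: torch.Tensor,
                             anti_expert_logits: torch.Tensor) -> torch.Tensor:
    """Shift the base model's next-token distribution by the logit difference
    between the tuned proxy (expert) and its untuned variant (anti-expert)."""
    shifted = base_logits + (expert_logits - anti_expert_logits)
    return F.softmax(shifted, dim=-1)

# Toy example with a shared 32k vocabulary; in practice these come from three
# models evaluated on the same prefix x_{<t}.
vocab = 32_000
base = torch.randn(vocab)
expert = torch.randn(vocab)
anti_expert = torch.randn(vocab)
probs = proxy_tuned_distribution(base, expert, anti_expert)
next_token = torch.multinomial(probs, num_samples=1)
```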

Proxy-based expertise transfer is also leveraged in proxy attacks aimed at misalignment or evasion. For instance, a small model is RL-tuned for "human-likeness" (e.g., via DPO) and then used to shift the output distribution of a larger LLM during decoding, allowing the blended text to bypass LLM-generated-content detectors (2410.19230). This approach retained utility scores close to those of the unattacked model while reducing detector AUROC by over 70%.

3. Proxy Models for Efficiency, Robustness, and Scheduling

LLM-based proxies are widely deployed as efficiency mechanisms, reducing the resource footprint or latency of complex pipelines:

  • Knowledge-Aware Retrieval: Slim proxy models (e.g., Llama2-7B) are used in SlimPLM to pre-assess whether a larger model’s knowledge suffices to answer a QA query, selectively triggering document retrieval only for missing information. This decreases inference cost by requiring only one LLM call per query, yet achieves or surpasses state-of-the-art performance (2402.12052).
  • Interactive Serving and Scheduling: Lightweight proxy predictors (e.g., BERT-base) estimate LLM answer lengths, enabling speculative shortest-job-first (SSJF) scheduling (see the sketch after this list). This reduces job completion time by up to 39.6% and increases throughput by up to 3.6× compared to FCFS scheduling, without changes to memory management or batching (2404.08509).
  • Proxy Metrics for Robustness Evaluation: Rather than running computationally expensive adversarial red-teaming ensembles, fast proxy metrics (direct prompting, embedding-space attacks) yield attack success rate (ASR) estimates highly correlated with full attack ensembles (Pearson $r_p = 0.87$, Spearman $r_s = 0.94$), but at three orders of magnitude less computational cost (2502.10487).
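As referenced in the scheduling item above, here is a toy speculative shortest-job-first queue built around a stand-in length predictor. The predictor heuristic, the Request fields, and the queue interface are illustrative assumptions, not the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    predicted_len: int                 # proxy-predicted output length (sort key)
    prompt: str = field(compare=False)

def predict_output_length(prompt: str) -> int:
    """Stand-in for a small proxy predictor (e.g., a BERT-base regressor)."""
    return max(8, len(prompt) // 4)    # crude heuristic placeholder

class SSJFQueue:
    """Speculative shortest-job-first: serve the request whose *predicted*
    output length is smallest, instead of first-come-first-served."""
    def __init__(self):
        self._heap: list[Request] = []

    def submit(self, prompt: str) -> None:
        heapq.heappush(self._heap, Request(predict_output_length(prompt), prompt))

    def next_request(self) -> Request | None:
        return heapq.heappop(self._heap) if self._heap else None

q = SSJFQueue()
q.submit("Write a 2,000-word essay on proxy models.")
q.submit("What is 2 + 2?")
print(q.next_request().prompt)   # the short job is served first
```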

4. LLM-Based Proxies in Security and Adversarial Contexts

The LLM-based proxy paradigm enables both attack and defense in cybersecurity:

  • Command and Control Mediation: Proof-of-concept malware attacks have leveraged public LLMs (such as ChatGPT) with vulnerable plugins to act as command-and-control (C2) proxies. Here, malware on a victim machine issues prompts to the LLM, which then fetches attacker commands via a plugin (e.g., web browsing), exfiltrates data, and avoids detection by appearing as legitimate traffic (2308.09183).
  • Alignment Enforcement: Proxy-RLHF decouples generation and alignment by training a lightweight external proxy model (e.g., a small MLP) with RL to monitor LLM token outputs for human-value alignment (a toy gating sketch appears after this list). This method reduces parameter and GPU memory requirements by 99%, aligning outputs at quality comparable to full RLHF (2403.04283).
  • Orchestration Plane Integrity: Adversarial confounder gadgets, short token sequences optimized to manipulate routers, can induce upgrades (weak-to-strong LLM switches) or downgrades in model routing. Such gadgets maintain low perplexity, circumventing naturalness-based filtering and highlighting architectural vulnerabilities in LLM control planes (2501.01818).
  • Autonomous Multi-Step Attack Coordination: The Incalmo abstraction layer, serving as a high-level LLM proxy, bridges LLM-generated intentions (e.g., "scan network") with concrete command sequences, empowering smaller LLMs to achieve multi-host, multi-stage network attacks when direct command generation fails (2501.16466).
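To illustrate the alignment-enforcement pattern from the Proxy-RLHF item above, the toy guard below masks next-token logits with a small external scorer. The untrained MLP, the feature shapes, and the masking rule are purely illustrative assumptions about the general gating idea, not the paper's architecture or training procedure.

```python
import torch
import torch.nn as nn

class ProxyGuard(nn.Module):
    """Tiny external scorer that decides, per candidate token, whether the
    continuation should be allowed (stand-in for an RL-trained proxy)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def allowed(self, states: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
        # states: (vocab, hidden_dim) candidate-continuation features
        return self.mlp(states).squeeze(-1) > threshold

def guarded_sample(base_logits: torch.Tensor, guard: ProxyGuard,
                   states: torch.Tensor) -> int:
    """Mask tokens the guard rejects, then sample from the remainder."""
    mask = guard.allowed(states)
    logits = base_logits.masked_fill(~mask, float("-inf"))
    if torch.isinf(logits).all():            # guard rejected everything: fall back
        logits = base_logits
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, 1))

vocab, hidden = 1000, 32
guard = ProxyGuard(hidden)
token = guarded_sample(torch.randn(vocab), guard, torch.randn(vocab, hidden))
```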

5. Preference Elicitation, Fairness, and Alignment via Proxy

LLM-based proxies enable efficient, user-centric solutions in complex decision and social systems:

  • Accelerated Preference Elicitation: In combinatorial auctions, LLM proxies maintain natural language transcripts with bidders and incrementally approximate preference functions (XOR bids) using LLM-generated inference. This reduces required queries by up to a factor of five compared to DNF-based proper learning, lessening the cognitive burden on participants (2501.14625).
  • Fairness-Aware Recommendations: FACTER introduces a fairness-aware proxy layer combining conformal prediction thresholds (adaptive semantic variance) and adversarial prompt updates. This proxy detects bias by measuring embedding deviation and, upon violation, injects specific "avoid" instructions into the prompt to mitigate demographic bias iteratively—reducing fairness violations by up to 95.5% with no retraining (2502.02966).
  • White-Box Reward Baselines for Alignment: Reverse reward engineering constructs a white-box reward proxy from interpretable features (length incentive, repetition penalty, query relevance) for RL-based alignment; a minimal feature-based sketch follows this list. Such engineered rewards show strong monotonic correlation with open-source RM signals and match or surpass black-box RMs on standard alignment benchmarks (2402.03469).
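The sketch below combines the three interpretable features named in the last item into a single score. The weights, the saturation target, and the bag-of-words relevance measure are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def reward_proxy(query: str, response: str,
                 w_len: float = 0.3, w_rep: float = 0.4, w_rel: float = 0.3,
                 target_len: int = 200) -> float:
    """Interpretable white-box reward: longer (up to a target), low repetition, on-topic."""
    tokens = response.split()
    # Length incentive: saturates at target_len tokens.
    length_score = min(len(tokens), target_len) / target_len
    # Repetition penalty: fraction of distinct tokens (higher is better).
    distinct_ratio = len(set(tokens)) / max(len(tokens), 1)
    # Query relevance: crude lexical overlap between query and response.
    q_counts = Counter(query.lower().split())
    r_counts = Counter(t.lower() for t in tokens)
    overlap = sum((q_counts & r_counts).values()) / max(sum(q_counts.values()), 1)
    return w_len * length_score + w_rep * distinct_ratio + w_rel * overlap

print(reward_proxy("explain proxy tuning", "Proxy tuning steers a large model at decoding time."))
```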

6. Proxy Tasks and Emergent Ability Forecasting

LLM-based proxies are also cast as proxy tasks—tasks used for efficient early-stage prediction of downstream, emergent abilities:

  • The method involves collecting normalized model performance vectors across a suite of tasks and using rank correlation (Kendall's tau, Spearman's rho) to select proxies strongly linked to the target emergent ability (e.g., tool utilization). Candidate proxies are further filtered for task robustness (variance ratio across model ensembles) before being weighted and integrated as early predictors; a simplified selection sketch appears after this list. This technique demonstrates high correlation between proxy-task predictions and realized emergent capability rankings (e.g., tool use) (2412.07111).
  • Such approaches facilitate anticipatory model evaluation, enabling efficient hyperparameter optimization and data curation prior to full-scale training.
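A simplified version of the proxy-task selection step is sketched below, assuming a matrix of normalized per-task scores for a set of models and using SciPy's rank correlations. The thresholds and the variance-ratio robustness filter are arbitrary illustrations of the filtering idea, not the paper's settings.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def select_proxy_tasks(scores: np.ndarray, target: np.ndarray,
                       task_names: list[str],
                       min_corr: float = 0.7, max_var_ratio: float = 1.5) -> list[str]:
    """scores: (n_models, n_tasks) normalized performance matrix.
    target: (n_models,) scores on the target emergent ability.
    Keep tasks whose rankings correlate strongly with the target and whose
    variance across models is not unstable relative to the target's."""
    selected = []
    for j, name in enumerate(task_names):
        tau, _ = kendalltau(scores[:, j], target)
        rho, _ = spearmanr(scores[:, j], target)
        var_ratio = scores[:, j].var() / (target.var() + 1e-8)
        if min(tau, rho) >= min_corr and var_ratio <= max_var_ratio:
            selected.append(name)
    return selected

rng = np.random.default_rng(0)
target = rng.random(8)                                              # e.g., tool-use scores for 8 models
scores = np.column_stack([target + 0.01 * rng.standard_normal(8),   # closely related task
                          rng.random(8)])                            # unrelated task
print(select_proxy_tasks(scores, target, ["reasoning_probe", "random_task"]))
```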

7. Compression, Context Management, and System Service Proxies

LLM-based proxies play critical roles in managing context, memory, and overall system service integration:

  • Sentence-Level Context Compression: Sentinel reframes context selection as an attention-probing task, using attention distributions from a small proxy LM (e.g., 0.5B parameters) and a lightweight classifier to identify and retain only the sentences most relevant to a downstream task. This yields up to 5× compression while matching large-scale QA performance, and its attention signals transfer robustly across model scales (2505.23277); a simplified scoring sketch appears after this list.
  • On-Device LLM as System Proxy: The LLMs framework implements a system-level LLM proxy (LLMaaS) on resource-constrained devices, using chunk-wise KV-cache compression, pipelined loading, and managed context lifespan to minimize latency and memory footprint. Proxying inference for multiple client apps through a shared, stateful LLM service improves privacy and efficiency as all client requests remain local (2403.11805).
  • RESTful Proxy for External Tool Access: MCP Bridge enables client applications to interact with MCP servers (for LLM tool augmentation) via a vendor-agnostic API, with features such as risk-based execution and dynamic capability discovery. This allows LLM-powered toolchains to operate seamlessly in diverse or constrained environments (2504.08999).
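To make the attention-probing idea concrete, the sketch below scores context sentences by how much attention the query tokens place on them in a small proxy LM's attention matrix, then keeps the top-k. The attention extraction and aggregation choices are simplified assumptions, not Sentinel's exact method (which also uses a trained classifier).

```python
import torch

def select_sentences(attn: torch.Tensor,
                     sentence_spans: list[tuple[int, int]],
                     query_span: tuple[int, int],
                     keep: int) -> list[int]:
    """attn: (seq_len, seq_len) attention matrix from a small proxy LM
    (e.g., averaged over heads of one layer; extraction details omitted).
    Score each context sentence by the total attention that query tokens
    place on its token span, then keep the `keep` highest-scoring sentences."""
    q_start, q_end = query_span
    scores = [attn[q_start:q_end, s_start:s_end].sum().item()
              for s_start, s_end in sentence_spans]
    ranked = sorted(range(len(sentence_spans)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])     # kept sentence indices, in document order

# Toy example: 3 context sentences (token spans) followed by a query span.
attn = torch.rand(40, 40)
kept = select_sentences(attn,
                        sentence_spans=[(0, 10), (10, 22), (22, 30)],
                        query_span=(30, 40),
                        keep=2)
print(kept)   # indices of the two sentences to retain for the downstream LLM
```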

In summary, LLM-based proxy systems span a wide variety of forms and functions, including middleware, auxiliary models, control planes, fairness and alignment layers, compression interfaces, and strategic abstraction modules. These proxies are critical for enhancing efficiency, flexibility, robustness, and safety in the deployment and operation of large generative AI systems across domains such as security, recommendation, orchestration, preference elicitation, and retrieval-augmented generation. Their deployment surfaces novel challenges in security (both attack and defense), model evaluation, and safe system orchestration, and their practical impact is supported by robust empirical results across contemporary benchmarks and production use cases.