PRISM: Privacy Routing for AI Inference
- The paper introduces PRISM, a framework that leverages semantic modulation to balance privacy leakage against output utility in AI inference.
- It uses client- and edge-local pipelines in which sensitive tokens are perturbed under formal differential privacy guarantees, preserving machine translation and LLM output quality.
- PRISM’s adaptive routing and semantic transforms achieve improved latency, energy efficiency, and translation accuracy compared to traditional uniform privacy approaches.
Privacy-aware Routing for Inference with Semantic Modulation (PRISM) encompasses a set of client- and edge-centric privacy-preserving algorithms for AI inference, primarily designed for machine translation and cloud-edge LLM workflows. The PRISM family includes solutions for user-side privacy protection in machine translation (Sato, 2023) and context-sensitive, adaptive privacy routing for cloud-edge LLM services (Zhan et al., 27 Nov 2025). PRISM leverages semantic modulation, the perturbation of input tokens or entities based on contextual risk and differential privacy budgets, to trade off privacy leakage against output utility in a controlled way. The following sections document the formal approach, architectural patterns, semantic transforms, empirical trade-offs, and integration strategies for PRISM.
1. Threat Model and Privacy Objectives
PRISM is built for environments where remote inference servers, network intermediaries, or cloud-based translators are treated as untrusted, "honest-but-curious" adversaries. These parties may observe all input queries and outputs, without access to privileged client-side state such as substitution histories or local dictionaries (Sato, 2023). The overriding privacy objective is to prevent reconstruction or inference of sensitive source text or entities from any data observable to these adversaries. In the PRISM Machine Translation variant ("PRISM-R"), privacy is formally guaranteed via ε-differential privacy at the encoding mechanism M:

Two texts x, x' differing in one token satisfy, for every output z:

Pr[M(x) = z] ≤ e^ε · Pr[M(x') = z]

where PRISM-R achieves a finite ε, governed by the substitution ratio and vocabulary size, using independent random token substitutions. In PRISM* (utility-optimized variant), formal differential privacy is relaxed for empirical leakage minimization (Sato, 2023). In cloud-edge PRISM, entity-level local differential privacy (LDP) budgets are assigned per extracted entity, and noise is injected adaptively based on the contextual sensitivity score (Zhan et al., 27 Nov 2025).
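The guarantee above can be made concrete for uniform random substitution. A minimal sketch, assuming the substitute is drawn uniformly from a vocabulary V (and may return the original token); the exact mechanism in Sato (2023) may differ in detail:

```python
import math

def dp_epsilon_uniform_substitution(r, vocab_size):
    """Epsilon for a mechanism that independently replaces each token
    with probability r by a uniform draw from a vocabulary of size
    vocab_size (the draw may return the original token).

    For two texts differing in one token, the worst-case likelihood
    ratio at any output is ((1 - r) + r/|V|) / (r/|V|).
    """
    if r <= 0:
        return math.inf  # never substituted: no formal DP guarantee
    p_keep = (1 - r) + r / vocab_size  # prob. the original token survives
    p_other = r / vocab_size           # prob. of any specific other token
    return math.log(p_keep / p_other)
```

At r = 1 every token is replaced uniformly, the output is independent of the input, and ε = 0; lowering r trades a larger ε (weaker privacy) for higher utility.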
2. System Architecture and Data Flow
PRISM is architected as a client-local (for translation) or edge-local (for LLMs) pipeline, exposing only obfuscated data to external or cloud services. Across both major variants (Sato, 2023, Zhan et al., 27 Nov 2025), the common pattern involves:
- Entity/Token Profiling: Via NER or token-level analysis, entities or terms requiring privacy protection are flagged, and categorical sensitivity scores are assigned.
- Semantic Modulation: Sensitive entities/tokens are perturbed via context-aware transforms—random substitutions (PRISM-R), confidence-scored substitutions (PRISM*), or category/value randomized response (cloud-edge PRISM).
- Routing Mode Selection: In cloud-edge scenarios, a soft gating function over the risk features [R(P), d_1, …, d_m] produces a tripartite distribution π over modes ("cloud," "local," "collab"). The mode is selected as argmax_j π_j.
- Remote Inference: Only the modulated, privacy-obscured prompt is transmitted to the external inference engine (translator or cloud LLM).
- Recovery/Postprocessing: On the client/edge, local dictionary and substitution history or the original context are used to restore semantically accurate results from the obfuscated output.
This data flow ensures that no sensitive content is directly revealed to the remote computation endpoint, with semantic fidelity maintained via reversible transforms and client-side dictionaries.
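The routing-mode step above can be sketched as a linear-softmax gate over the risk features; the weight shapes and feature layout here are illustrative assumptions, not the paper's exact parameterization:

```python
import math

MODES = ("cloud", "local", "collab")

def soft_gating(features, W, b):
    """Tripartite softmax over routing modes given risk features
    [R(P), d1, ..., dm].  W (3 x len(features)) and b (length 3)
    are hypothetical learned parameters."""
    logits = [sum(w * f for w, f in zip(row, features)) + bj
              for row, bj in zip(W, b)]
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]  # numerically stable softmax
    total = sum(exps)
    pi = [e / total for e in exps]
    mode = MODES[max(range(len(MODES)), key=lambda j: pi[j])]
    return pi, mode
```

A high overall risk score pushes probability mass toward the "local" or "collab" modes, while low-risk prompts route to the cloud unmodified.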
3. Semantic Modulation Transforms
PRISM employs several mechanisms for input perturbation according to privacy and utility constraints:
- PRISM-R (Random Substitution, Machine Translation):
Each token t_i is substituted independently with probability r by a random vocabulary token u_i ~ Uniform(V), and the mapping is stored in the substitution history. This transform guarantees formal ε-DP (Sato, 2023).
- PRISM* (Smart Substitution):
Confidence scores computed from translation probabilities (based on part-of-speech and corpus statistics) prioritize substitution of unambiguous tokens. Substitutes are chosen to maximize confidence and POS agreement, improving utility over pure randomization (Sato, 2023).
- Adaptive Two-layer LDP (Cloud-Edge PRISM):
For each flagged entity e_i with category c_i, the total budget ε_i is split into a category-layer budget α·ε_i and a value-layer budget (1−α)·ε_i, and each layer executes randomized response over its domain size (number of categories, number of values), producing the perturbed entity. Sequential composition yields ε_i-LDP protection (Zhan et al., 27 Nov 2025).
- Semantic Sketch Fusion: In collaborative modes, the cloud LLM returns a high-level semantic sketch on perturbed inputs, with the edge SLM recoupling to the original prompt to yield refined, privacy-preserved output (Zhan et al., 27 Nov 2025).
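The two-layer mechanism above can be sketched with k-ary randomized response per layer. The α-split and the handling of a flipped category (drawing a fresh uniform value so the true value cannot leak through a mismatched category) are assumptions for illustration:

```python
import math
import random

def randomized_response(value, domain, eps, rng):
    """k-ary randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise report a uniform other value."""
    k = len(domain)
    p_true = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_true:
        return value
    return rng.choice([v for v in domain if v != value])

def perturb_entity(category, value, categories, values_by_cat,
                   eps_total, alpha, rng):
    """Two-layer LDP for one entity: spend alpha*eps_total on the
    category layer and (1 - alpha)*eps_total on the value layer;
    sequential composition bounds total leakage by eps_total."""
    cat_out = randomized_response(category, categories,
                                  alpha * eps_total, rng)
    if cat_out == category:
        val_out = randomized_response(value, values_by_cat[cat_out],
                                      (1 - alpha) * eps_total, rng)
    else:
        # category flipped: draw a plausible value uniformly
        val_out = rng.choice(values_by_cat[cat_out])
    return cat_out, val_out
```

With a large total budget the entity passes through almost unchanged; as ε_i shrinks, both layers flip more often and the reported entity carries less information about the original.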
4. Algorithmic Pipeline and Pseudocode
The PRISM pipeline integrates token/entity selection, perturbation, remote call, and local recovery. In client-side machine translation (Sato, 2023), this is realized as:
```
Input:  x_pri = w1...wn, dictionary L, substitution ratio r, mode ∈ {R, *}
Output: y_pri

Tokenize x_pri into t1...tn; initialize history H
If mode == 'R':                          # PRISM-R
    For i = 1..n:
        If Uniform(0,1) < r:
            pick ui ~ Uniform(V)
            H ← H ∪ {(wi = ti, ui)}
            ti ← ui
Else:                                    # PRISM*
    Compute POS tags si ← POS(ti)
    Compute confidences c(wi, si)
    Sort positions by descending c(wi, si)
    For top k = ⌊r·n⌋ positions i:
        pick ui of same POS maximizing c(ui, si), not used before
        H ← H ∪ {(wi = ti, ui, si)}
        ti ← ui

x_pub ← Detokenize(t1...tn)
y_pub ← T(x_pub)                         # remote inference
y_pri ← y_pub
For each (w, u, (opt s)) in H:
    For v in L(u) (or L(u, s)):
        If v occurs in y_pri:
            replace v with L(w, 1) (or L(w, s, 1))
            break
Return y_pri
```
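A toy, runnable rendering of the protect-and-recover round trip in PRISM-R mode. The bilingual dictionary stands in for the paper's dictionary L, and the "remote translator" is simulated by word-for-word lookup; both are illustrative assumptions:

```python
import random

def prism_r_protect(tokens, vocab, r, rng):
    """Forward pass: substitute each token with probability r by a
    random vocabulary word, recording the mapping in a history H."""
    out, history = [], []
    for t in tokens:
        if rng.random() < r:
            sub = rng.choice(vocab)
            history.append((t, sub))
            out.append(sub)
        else:
            out.append(t)
    return out, history

def prism_r_recover(translated, history, bilingual):
    """Recovery: locate each substitute's translation in the remote
    output and swap in the translation of the original token."""
    out = list(translated)
    for orig, sub in history:
        sub_tr = bilingual[sub]
        if sub_tr in out:
            out[out.index(sub_tr)] = bilingual[orig]
    return out

# Round trip with r = 1 (every token substituted) and a simulated
# word-for-word remote translator that never sees the originals.
bilingual = {w: w + "_fr" for w in
             ["alice", "pays", "bob", "carol", "dave", "eve"]}
tokens = ["alice", "pays", "bob"]
protected, H = prism_r_protect(tokens, ["carol", "dave", "eve"],
                               1.0, random.Random(1))
remote_out = [bilingual[t] for t in protected]  # untrusted translator
recovered = prism_r_recover(remote_out, H, bilingual)
```

The remote side only ever observes the decoy tokens and their translations; the client restores the intended output from its local history and dictionary.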
For cloud-edge collaborative inference (Zhan et al., 27 Nov 2025), the routing proceeds:
```
Input: Prompt P

(R(P), {di}) ← SensitivityProfiling(P)
π ← SoftGating([R(P), d1, …, dm])
mode ← argmax_j π_j
if mode == "cloud":
    return G_cloud(P)
if mode == "local":
    return G_edge(P)
if mode == "collab":
    apply adaptive two-layer LDP and proceed to semantic sketch fusion
```
5. Empirical Trade-offs and Evaluation
PRISM's efficacy is empirically established via multi-dimensional trade-off curves. In machine translation (Sato, 2023), the Privacy-Preserving Score (PPS) and Quality Score (QS) benchmark adversarial leakage versus translation accuracy. Tabulated metrics for English→French and English→German translations using T5 and ChatGPT engines reveal:
| Method | AUPQC | QS@0.5 (En→Fr, T5) | QS@0.5 (En→De, T5) | QS@0.5 (En→Fr, ChatGPT) | QS@0.5 (En→De, ChatGPT) |
|---|---|---|---|---|---|
| NoDecode | 0.355 | 0.493 | 0.524 | 0.495 | 0.480 |
| PUP | 0.363 | 0.439 | 0.505 | 0.487 | 0.511 |
| PRISM-R | 0.431 | 0.613 | 0.557 | 0.611 | 0.629 |
| PRISM* | 0.454 | 0.803 | 0.789 | 0.799 | 0.769 |
As the substitution ratio r increases, privacy leakage falls, and PRISM* attains quality scores near 0.80 even at the zero-practical-leakage operating point (PPS = 0.5), outperforming prior DP-text baselines (below 0.60) (Sato, 2023).
In cloud-edge inference (Zhan et al., 27 Nov 2025), PRISM reduces completion time and energy consumption by roughly 60% compared to uniform and selective LDP baselines, while preserving or improving inference quality (IQ):
| Method | Completion time (s) | Energy (J) | IQ |
|---|---|---|---|
| PRISM | 7.92 | 687.2 | 6.88 |
| Uniform LDP | 20.56 | 1707.6 | 5.72 |
| Selective LDP | 21.22 | 1770.8 | 5.94 |
| Edge-Only | 17.84 | 1573.9 | 5.09 |
| Cloud-Only | 5.13 | 296.3 | 8.14 |
Across hardware and multiple cloud/edge model pairings, PRISM maintains robust quality, demonstrating empirical dominance within the achievable Pareto-frontier for privacy-utility-latency-energy (Zhan et al., 27 Nov 2025).
6. Integration with Off-the-Shelf Engines and Future Directions
PRISM is agnostic to the underlying remote inference engine, requiring only a black-box API interface for translation (T5, ChatGPT) (Sato, 2023) or LLM service (GPT-4o, Qwen3-235B, StableLM-Zephyr, Phi-3.5, TinyLLaMA) (Zhan et al., 27 Nov 2025). Dictionary construction (in translation) is performed once, leveraging any cheap MT engine, and all subsequent privacy logic remains purely client/edge-local.
Key tunables are the substitution ratio r (dictating the privacy-utility balance in translation) and, for cloud-edge PRISM, the per-entity LDP budget ε_i, the category-value allocation hyperparameter α, and the linear weights of the mode-selection gating function. A plausible implication is that more granular, context-dependent budget allocation (possibly via reinforcement learning) or multi-edge collaboration may further optimize utility under strict privacy.
Limitations include single-device edge restriction, text-only inputs (extension to multimodal inputs suggested), and static allocation of privacy budget. Ongoing directions involve end-to-end differentiable fusion training, dynamic routing, and federated/multimodal privacy-aware orchestration.
7. Comparative Context and Related Paradigms
PRISM operates in direct contrast to server-side privacy claims and undifferentiated perturbative approaches. Unlike traditional uniform LDP or selective masking—which inject identical privacy budget independent of contextual risk—PRISM dynamically modulates noise per entity or token, guided by context-aware soft gating, which minimizes unnecessary perturbation and utility loss for non-sensitive content.
This general approach is mirrored in related orchestration frameworks such as IslandRun (Malepati, 29 Nov 2025), where agent-based multi-objective privacy routing and reversible semantic anonymization are used across distributed computation resources ("islands"). Both demonstrate the shift from coarse platform-centric privacy guarantees towards fine-grained, request-level semantic modulation, client-side control, and adaptive privacy budgeting.
The underlying principle is that privacy-aware inference must be context-sensitive, formally auditable, and locally controllable, with privacy leakage and output utility jointly optimized using semantic modulation and adaptive routing. This paradigm is increasingly foundational for trustworthy, high-utility AI services in heterogeneous, distributed ecosystems.