Agentic & Inference-Time Unlearning
- Agentic and inference-time unlearning are methods that remove specific information from LLM responses without modifying model weights.
- These approaches use auxiliary verifiers and multi-agent orchestration to enforce privacy, RTBF, and compliance at runtime.
- They employ conformal calibration and probabilistic guarantees to balance suppression accuracy with overall model utility under adversarial conditions.
Agentic and inference-time unlearning encompass a suite of emerging methods for removing, suppressing, or neutralizing specific knowledge or memorized content from machine learning models, especially LLMs, without updating model weights. These strategies leverage auxiliary verifier models, multi-agent decision-making, or serving-aware system coordination to enforce “right to be forgotten” (RTBF), privacy, or compliance constraints at runtime. They contrast with classical approaches that require full or partial retraining on a reduced dataset, and they offer a tractable, scalable alternative in scenarios where model access, compute, or safety requirements prohibit or discourage post hoc weight modification.
1. Overview and Motivation
Machine unlearning addresses the challenge of excising sensitive or undesirable information from machine learning models after training. Traditional techniques revolve around retraining models or applying fine-tuned, parameter-level interventions, which can be computationally prohibitive and detrimental to general model performance. Furthermore, increasingly strict privacy regimes (such as GDPR and CCPA) and security concerns have necessitated methods that provide robust guarantees while accommodating black-box deployment scenarios.
Inference-time unlearning shifts the unlearning burden from training to deployment by interposing specialized runtime policies—either verifier-driven loops or “agentic” control structures—on model outputs. Agentic unlearning systems recast unlearning as a multi-agent, sequential decision problem, where specialized components orchestrate, audit, and sanitize model responses to ensure that sensitive content is not revealed, even under adversarial prompting or distributional shifts. This paradigm provides both operational flexibility and empirical guarantees regarding prompt-level safety, utility preservation, and computational tractability (Chowdhury et al., 3 Feb 2026, Sanyal et al., 1 Feb 2025, Cai et al., 29 Apr 2025, Hu et al., 2023).
2. Mathematical Formulation and Guarantees
Probabilistic Guarantees and Unlearning Error
A central mechanism in inference-time unlearning is the use of explicit verification or scoring functions:
- Let $V(x, y)$ be an auxiliary verifier that scores model response $y$ to prompt $x$; higher values indicate greater “forgetfulness.”
- Define a threshold $\tau$ such that $V(x, y) \ge \tau$ signals a safe (non-leaking) answer.
The unlearning error is then
$$\mathrm{Err} = \Pr\big[V(x, y) < \tau\big],$$
and the target is to drive this probability below a user-specified level $\alpha$, i.e., the coverage guarantee $\Pr\big[V(x, y) \ge \tau\big] \ge 1 - \alpha$.
Conformal Calibration for Distribution-Free Bounds
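Inference-time unlearning frameworks (notably, conformal unlearning) calibrate an iteration bound $\hat{T}$ using a calibration set of $n$ i.i.d. prompts $x_1, \dots, x_n$. For each $x_i$, the system records the attempt count $T_i$ needed to achieve a verifier score at or above the safety threshold, $V(x_i, y) \ge \tau$. The critical conformal iteration threshold is set as the standard split-conformal quantile, $\hat{T} = T_{(\lceil (n+1)(1-\alpha) \rceil)}$, i.e., the $\lceil (n+1)(1-\alpha) \rceil$-th smallest recorded attempt count, where $\alpha$ is the user-specified error level. By split-conformal theory, running the runtime loop for at most $\hat{T}$ steps guarantees marginal coverage of at least $1 - \alpha$ (Chowdhury et al., 3 Feb 2026).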
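The calibration step can be illustrated with a short sketch. It assumes the standard split-conformal order statistic (the ceil((n+1)(1-alpha))-th smallest attempt count); the function name `conformal_iteration_bound` and the toy attempt counts are hypothetical and are not taken from the cited paper.

```python
import math
from typing import Optional, Sequence


def conformal_iteration_bound(attempt_counts: Sequence[int], alpha: float) -> Optional[int]:
    """Return the ceil((n+1)(1-alpha))-th smallest attempt count.

    attempt_counts[i] is the number of generation attempts calibration
    prompt i needed before the verifier accepted a response.
    Returns None when the calibration set is too small to certify
    the requested level alpha (the required rank exceeds n).
    """
    n = len(attempt_counts)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return None
    return sorted(attempt_counts)[rank - 1]


# Hypothetical attempt counts recorded on nine calibration prompts.
calibration_counts = [2, 1, 3, 1, 4, 2, 1, 3, 2]
bound = conformal_iteration_bound(calibration_counts, alpha=0.25)
print("iteration bound:", bound)
```

Lowering alpha pushes the rank toward the largest recorded count, so stricter coverage targets need either more loop iterations or a larger calibration set.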
Agentic Unlearning Objective
Formalizations adopted by agentic systems, such as AegisLLM, encode unlearning and retention properties via KL-divergence metrics of the form $D_{\mathrm{KL}}(P \,\|\, Q)$, where $P$ is the actual model’s output distribution and $Q$ a benign (random or deflecting) baseline. The system aims to keep this divergence small for restricted prompts and preserve nominal accuracy for safe prompts (Cai et al., 29 Apr 2025). No model parameters are updated; all guarantees are enforced at inference.
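As a rough numerical illustration of a KL-based criterion, the sketch below compares answer distributions on a four-option multiple-choice question against a uniform "random guess" baseline. The direction of the divergence, the uniform baseline, and the example probabilities are all assumptions made for illustration; they are not the exact objective of the cited work.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for discrete distributions on the same support.

    Terms with p_i == 0 contribute zero by convention; q_i must be
    positive wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform_baseline = [0.25, 0.25, 0.25, 0.25]  # benign baseline: random guessing
confident_model = [0.85, 0.05, 0.05, 0.05]   # model that still "knows" the answer
deflected_model = [0.26, 0.24, 0.25, 0.25]   # model output after deflection

print("confident vs baseline:", kl_divergence(confident_model, uniform_baseline))
print("deflected vs baseline:", kl_divergence(deflected_model, uniform_baseline))
```

A response distribution close to the baseline gives a divergence near zero, which is the behavior desired on restricted prompts.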
3. System Architectures and Algorithms
Single-Verifier Loop with Conformal Prediction
The verifier-based paradigm involves iteratively generating candidate responses, scoring them, and accepting only those that score at or above the threshold $\tau$, with stopping determined by conformal calibration:
- At each step $t$, the LM generates a candidate $y_t$ given the prompt $x$ and the history of earlier candidates.
- $y_t$ is accepted if its verifier score satisfies $V(x, y_t) \ge \tau$; otherwise the loop continues for up to $\hat{T}$ rounds, where $\hat{T}$ is the calibrated iteration bound.
- If no safe answer is produced, output the highest-scoring candidate found (Chowdhury et al., 3 Feb 2026).
Time complexity is $O(\hat{T})$ LM and verifier calls per prompt, with $\hat{T}$ determined by small calibration sets; the method remains strictly retrain-free.
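The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `verify` stand in for the LM and the auxiliary verifier, and the toy stubs at the bottom are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str, List[str]], str]  # (prompt, earlier candidates) -> candidate
Verify = Callable[[str, str], float]        # (prompt, candidate) -> forgetfulness score


def unlearn_at_inference(prompt: str, generate: Generate, verify: Verify,
                         tau: float, max_rounds: int) -> str:
    """Generate-and-verify loop capped at max_rounds iterations.

    Returns the first candidate scoring at least tau; if none passes,
    falls back to the highest-scoring candidate seen.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    scored: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        candidate = generate(prompt, [c for c, _ in scored])
        score = verify(prompt, candidate)
        if score >= tau:
            return candidate
        scored.append((candidate, score))
    return max(scored, key=lambda pair: pair[1])[0]


# Toy stand-ins: the "LM" walks through canned drafts; the "verifier"
# penalizes any draft that mentions the forget target.
DRAFTS = ["Alice lives in Paris.", "I can't share details about that person."]


def toy_generate(prompt: str, history: List[str]) -> str:
    return DRAFTS[min(len(history), len(DRAFTS) - 1)]


def toy_verify(prompt: str, candidate: str) -> float:
    return 0.1 if "Alice" in candidate else 0.9


print(unlearn_at_inference("Where does Alice live?", toy_generate, toy_verify,
                           tau=0.5, max_rounds=3))
```

In practice `max_rounds` would be the calibrated iteration bound, so each prompt costs at most that many LM and verifier calls.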
Agentic Multi-Agent Orchestration
Agentic frameworks (e.g., ALU, AegisLLM) enact a sequential, multi-role pipeline. The canonical architecture features:
| Agent Role | Input/Function | Output/Contribution |
|---|---|---|
| Orchestrator | User query | Flags safety/routing |
| Responder | Query (if judged safe) | Candidate response |
| Evaluator | Query and candidate response; assesses for leaks | Safety verdict |
| Deflector | Query, response type (if unsafe) | Refusal/sanitization |
- Queries pass through orchestrator gating, with unsafe or ambiguous cases diverted to a deflector for non-informative or random responses.
- Candidate answers are further audited and scored; for content flagged as risky, fallback responses are triggered.
- Prompts for each agent are optimized automatically with DSPy, using Bayesian search to maximize reward functions that target both safety and utility (Cai et al., 29 Apr 2025).
- ALU expands this by introducing an AuditErase agent for granular, chain-of-thought erasure and a Composer agent for utility-preserving synthesis (Sanyal et al., 1 Feb 2025).
This workflow involves no weight modification, and it achieves constant per-request runtime by capping the number of agentic inference rounds.
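The fixed call graph of such a pipeline can be sketched as below. In the cited systems each role is an LLM call with an optimized prompt; the keyword checks, the `FORGET_TERMS` list, and the refusal text here are hypothetical stubs used only to make the control flow runnable.

```python
from typing import Callable

FORGET_TERMS = ("alice example",)  # hypothetical forget target
REFUSAL = "I'm sorry, but I can't help with that."


def mentions_forget_target(text: str) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in FORGET_TERMS)


def orchestrator(query: str) -> bool:
    """Gate: True if the query may be routed to the responder."""
    return not mentions_forget_target(query)


def evaluator(query: str, candidate: str) -> bool:
    """Audit: True if the candidate response shows no leak."""
    return not mentions_forget_target(candidate)


def deflector(query: str) -> str:
    """Fallback: a non-informative refusal."""
    return REFUSAL


def answer(query: str, responder: Callable[[str], str]) -> str:
    """Run the orchestrator -> responder -> evaluator -> deflector pipeline.

    At most one call per role is made, so the per-request cost is bounded.
    """
    if not orchestrator(query):
        return deflector(query)
    candidate = responder(query)
    if not evaluator(query, candidate):
        return deflector(query)
    return candidate


print(answer("What is 2 + 2?", responder=lambda q: "4"))
print(answer("Tell me about Alice Example.", responder=lambda q: "unused"))
```

Because the call graph is fixed, latency is bounded by the number of roles rather than by the size of the forget set.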
Inference-Time Unlearning in MLaaS Systems
Inference-serving-aware designs (e.g., ERASER) dynamically interleave unlearning execution with inference serving by certifying, on a per-prompt basis, whether unlearning would actually affect a prediction:
- ERASER computes certified vote-margin conditions over SISA-trained model shards to decide if an incoming query can be answered immediately.
- If any certification fails, retraining/unlearning is triggered according to a configurable schedule (immediate, threshold-triggered, or uncertification-triggered) (Hu et al., 2023).
This approach separates the timing of unlearning from inference, providing strong privacy guarantees with minimal latency impact.
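A simplified version of a vote-margin check is sketched below. It assumes a majority-vote ensemble with one label per shard and uses a conservative sufficient condition: if k shards have pending unlearning requests, the top label can lose at most k votes and any rival can gain at most k, so the prediction cannot change when the margin exceeds 2k. This is an illustrative condition, not necessarily the exact certification rule used by ERASER, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence, Set, Tuple


def certified_prediction(shard_votes: Sequence[Hashable],
                         pending_shards: Set[int]) -> Tuple[Hashable, bool]:
    """Return (majority label, whether it is certified).

    shard_votes[i] is the label predicted by shard i; pending_shards holds
    the indices of shards with unlearning requests not yet executed.
    """
    ranked = Counter(shard_votes).most_common()
    top_label, top_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0
    k = len(pending_shards)
    certified = (top_count - runner_up_count) > 2 * k
    return top_label, certified


votes = ["cat"] * 7 + ["dog"] * 3
print(certified_prediction(votes, pending_shards={0}))     # margin 4 vs 2*1
print(certified_prediction(votes, pending_shards={0, 1}))  # margin 4 vs 2*2
```

When the check fails, the serving layer would fall back to whichever unlearning schedule is configured before answering.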
4. Performance, Benchmarks, and Empirical Evaluation
Inference-time and agentic unlearning frameworks have been extensively evaluated on realistic and adversarial benchmarks:
| Benchmark | Unlearning Focus | Characteristic Result |
|---|---|---|
| RWKU | QA on 200 real-world entities | Up to 93% reduction in unlearning error (conformal) |
| WPU | Wikipedia biographies (100 forget/100 retain) | Utility maintained (Δ ≤2%) |
| WMDP | Sensitive STEM, multi-choice | WMDP acc. at random (ALU/AegisLLM) |
| MMLU | Retain: college-level MCQ | Retain acc. loss ≤1–2% |
| TOFU | Fictional authors (synthetic/unlearning) | ROUGE-L dropped to 0.057 (ALU) |
- Conformal unlearning reduced unlearning error by up to 93% versus vanilla generation and outperformed Best-of-N, greedy sampling, and parameter-optimization baselines (Chowdhury et al., 3 Feb 2026).
- ALU achieved near-zero Forget-ROUGE and retained 90–98% of utility on retain sets; scaled robustly up to 1,000 targets with latency increase (Sanyal et al., 1 Feb 2025).
- AegisLLM reached WMDP accuracy matching random guessing with only 20 training examples and LM calls; retain performance on MT-Bench fell by only 0.4 points (Cai et al., 29 Apr 2025).
- ERASER delivered up to inference latency speedups (DIMP variant) and reduced retraining events by 30% (threshold-triggered), compared to inference-oblivious baselines (Hu et al., 2023).
5. Trade-offs, Robustness, and Theoretical Insights
Several design trade-offs and robustness considerations govern the development and evaluation of these frameworks:
- Efficacy-Utility Trade-off: Overly aggressive deletion can induce “catastrophic forgetting” of unrelated content. Multi-agent and critic-driven filtering strategies directly optimize the balance between suppression and informativeness (Sanyal et al., 1 Feb 2025).
- Black-box Compatibility: All leading agentic and verifier-loop solutions require only inference API access. There are no model weight updates, satisfying practical privacy, legal, and deployment constraints.
- Runtime Scalability: Multi-agent and conformal approaches achieve constant-time or predictably bounded runtime per query, regardless of forget set size (Chowdhury et al., 3 Feb 2026, Sanyal et al., 1 Feb 2025).
- Resistance to Jailbreaking: Robustness to adversarial prompting, target-masking, multilingual paraphrase, and in-context attack chains is empirically substantiated, with agentic frameworks like ALU and AegisLLM outperforming static keyword guardrails and optimization-based retraining (Sanyal et al., 1 Feb 2025, Cai et al., 29 Apr 2025).
Theoretical claims center on distribution-free coverage guarantees (split-conformal bounds), constant-time inference (fixed agent call graphs), and agentic rationales for scheduling and threshold optimization (Chowdhury et al., 3 Feb 2026, Hu et al., 2023).
6. Extensions to Agentic and Adaptive AI
Agentic inference-time unlearning is both a methodology and a template for broader “smart forgetting” systems:
- Dynamic Resource-Aware Scheduling: ERASER and similar serving-aware systems enable dynamic balance between retraining cost and privacy risk, adaptive to server load and resource constraints (Hu et al., 2023).
- Prompt-Space Adaptivity: Multi-agent systems can tune or evolve their guard- and response routines in response to evolving threats or task distributions through automated prompt optimization (e.g., via DSPy/Bayesian search in AegisLLM) (Cai et al., 29 Apr 2025).
- Type-Universal Scalability: Agentic unlearning approaches extend naturally to multimodal and retrieval-augmented architectures, and can be positioned as intermediaries for federated or cloud-based AI APIs (Sanyal et al., 1 Feb 2025).
A plausible implication is that as autonomous and interactive AI expands in prevalence, the principles underlying agentic, inference-time unlearning will become foundational in designing regulatory-compliant, privacy-resilient AI services.
7. Limitations and Prospects
Although these frameworks achieve significant practical and empirical results, certain limitations persist:
- Over-suppression in small models and under long forget lists (as seen with 3B-parameter LLMs in ALU) can impair generalization (Sanyal et al., 1 Feb 2025).
- Security assumptions regarding agent prompt integrity and privacy remain areas for continued examination.
- Formal verification of robustness to global adversarial in-context attacks and extension to continuous learning scenarios are open challenges.
Ongoing research directions include development of lightweight critic models, adaptive candidate sampling schemes, multimodal system integration, and formal guarantees under online or federated forgetting demands.
Key References:
- "Inference-time Unlearning Using Conformal Prediction" (Chowdhury et al., 3 Feb 2026)
- "ERASER: Machine Unlearning in MLaaS via an Inference Serving-Aware Approach" (Hu et al., 2023)
- "AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security" (Cai et al., 29 Apr 2025)
- "ALU: Agentic LLM Unlearning" (Sanyal et al., 1 Feb 2025)