Avoiding Negative Side Effects in AI
- Avoiding negative side effects refers to mitigating unintended adverse impacts arising from misspecified AI objectives and incomplete environmental models.
- Mitigation strategies include model updates, impact regularization, constrained optimization, and human–agent collaboration to prevent unmodeled harm.
- Applications span AI safety, reinforcement learning, medicine, and multi-agent systems, emphasizing formal taxonomies and algorithmic interventions.
Negative side effects (NSEs) are unintended, undesirable consequences produced by autonomous agents in pursuit of their formal objectives. NSEs can arise even when an agent is correctly optimizing its stated model of the environment and mission, if that model omits factors critical to stakeholders or the broader environment. Avoiding negative side effects is a central challenge in AI safety, reinforcement learning, operations research, medicine (e.g., drug safety), and multi-agent systems, as well as in the deployment of LLMs. Rigorous mitigation of NSEs requires both formal taxonomies and a spectrum of algorithmic interventions that respect the inherent incompleteness of all practical system models.
1. Formal Definitions and Taxonomy of Negative Side Effects
A negative side effect is any unintended, undesirable impact that an agent’s action has on the environment, beyond the intended effect encoded in the agent’s reward or objective function. Mathematically, in a Markov Decision Process (MDP) ⟨S, A, T, R⟩ with a factored state s = (s_task, s_side), if the reward R depends only on s_task, then any change to s_side that reduces real-world utility is an NSE (Saisubramanian et al., 2020). NSEs are not design failures or adversarial attacks; rather, they result from reward misspecification and unmodeled value.
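A minimal sketch of this factored view, assuming illustrative state fields and reward shapes (none of which come from the cited paper): the proxy reward inspects only the task features, so an agent that optimizes it is indifferent to damage recorded in the side-effect features.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    # Factored state: the proxy reward reads only `task`, never `side`.
    task: tuple   # e.g., robot position, goal flags
    side: tuple   # e.g., rug condition, vase intact -- unmodeled by the designer

def reward(state: State, action: str) -> float:
    """Proxy reward: depends only on the task features."""
    return 1.0 if state.task == ("at_goal",) else 0.0

def true_utility(state: State) -> float:
    """Stakeholder utility: also cares about the unmodeled side features."""
    side_damage = 0.0 if state.side == ("vase_intact",) else 5.0
    return reward(state, "noop") - side_damage

# An agent maximizing `reward` may reach the goal by breaking the vase:
s = State(task=("at_goal",), side=("vase_broken",))
print(reward(s, "noop"))   # 1.0 -- optimal under the proxy objective
print(true_utility(s))     # -4.0 -- a negative side effect
```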
A canonically used taxonomy identifies seven orthogonal dimensions for characterizing NSEs (Saisubramanian et al., 2020):
| Property | Values |
|---|---|
| Severity | mild … safety-critical |
| Reversibility | reversible / irreversible |
| Avoidability | avoidable / unavoidable |
| Frequency | common / rare |
| Stochasticity | deterministic / probabilistic |
| Observability | full / partial / unobserved |
| Exclusivity | interferes / non-interfering |
The taxonomy guides both modeling and mitigation decisions, as the degree of reversibility, avoidability, and observability significantly influences feasible interventions.
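One way to operationalize the taxonomy is as a small record attached to each anticipated or observed NSE, plus a simple routing rule to candidate mitigation families; the enum values and the routing logic below are illustrative assumptions, not prescriptions from the source.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    MILD = 1
    SIGNIFICANT = 2       # assumed intermediate level; the table only names the endpoints
    SAFETY_CRITICAL = 3

class Observability(Enum):
    FULL = "full"
    PARTIAL = "partial"
    UNOBSERVED = "unobserved"

@dataclass
class NSEProfile:
    """Characterization of one negative side effect along the seven dimensions."""
    severity: Severity
    reversible: bool
    avoidable: bool
    common: bool                 # frequency: common vs. rare
    deterministic: bool          # stochasticity
    observability: Observability
    interferes_with_task: bool   # exclusivity

def feasible_mitigations(nse: NSEProfile) -> list[str]:
    """Illustrative routing of an NSE profile to mitigation families (assumed mapping)."""
    options = []
    if nse.avoidable:
        options.append("impact regularization / constrained planning")
    if not nse.reversible or nse.severity is Severity.SAFETY_CRITICAL:
        options.append("conservative querying of a human overseer")
    if nse.observability is not Observability.FULL:
        options.append("environment shaping by the designer")
    return options

profile = NSEProfile(Severity.SAFETY_CRITICAL, reversible=False, avoidable=True,
                     common=False, deterministic=True,
                     observability=Observability.PARTIAL, interferes_with_task=False)
print(feasible_mitigations(profile))
```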
2. Core Mitigation Strategies
Mitigation methods for negative side effects broadly fall into five families (Saisubramanian et al., 2020, Lindner et al., 2021, Saisubramanian et al., 2021):
- Model and Policy Updates
- Inverse Reward Design: Treat the agent’s proxy reward as an uncertain observation of true value; infer and plan under this posterior to avoid missing penalties for side effects.
- Lexicographic Multi-Objective Planning: Enforce a hierarchical constraint: first secure near-optimal task reward, then minimize an explicit penalty for known or discovered side effects. Formally, among policies π whose task value satisfies V_task(π) ≥ V_task(π*) − δ for a given slack δ, select the policy that minimizes the expected side-effect penalty (see the sketch after this list).
- Constrained MDPs: Integrate explicit side-effect cost functions as hard constraints.
- Impact Regularization and Penalty Terms
- Relative Reachability (RR): Penalize the agent for stepwise, irreversible loss of reachability to other states, using a carefully chosen baseline such as the stepwise inaction baseline (Krakovna et al., 2018).
- Attainable Utility Preservation (AUP): Penalize the agent for making large shifts in its ability to optimize auxiliary reward functions across a random or structured set, thus discouraging actions that "burn bridges" even to unknown goals (Turner et al., 2020).
- Future-Task-Based Penalties: Penalize the agent for reducing the value that could be obtained on hypothetical future tasks, filtered by the baseline policy to prevent interference incentives (Krakovna et al., 2020).
- Constraint-Based and Query-Augmented Approaches
- Minimax-Regret Active Querying: In high-stakes or uncertain settings, conservatively assume all unmodeled features are locked unless a human confirms otherwise. Query design is optimized to minimize worst-case regret (Saisubramanian et al., 2020).
- Human–Agent Collaboration and Environment Shaping
- Operate in a two-player mode where a human designer can reconfigure the environment (e.g., move fragile objects, add barriers) to mitigate side effects without altering the agent’s core policy (Saisubramanian et al., 2021). These interventions are evaluated according to their ability to reduce NSE without exceeding a tolerance for drop in primary task value (slack δ_A).
- Reward Shaping and Safe Exploration
- Intrinsic penalties or bonus rewards encourage the agent to avoid high-risk actions during exploration. Safe-exploration frameworks prioritize policies with low expected side-effect cost under Bayesian or statistical confidence bounds (Saisubramanian et al., 2020).
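As referenced in the Model and Policy Updates family above, the lexicographic scheme with slack can be sketched as a two-stage filter over candidate policies; the value and penalty estimators here are placeholders for whatever policy-evaluation procedure is available.

```python
def lexicographic_select(policies, task_value, nse_penalty, slack):
    """Two-stage lexicographic selection:
    1. keep policies within `slack` of the best achievable task value,
    2. among those, minimize the expected side-effect penalty.
    `task_value` and `nse_penalty` map a policy to a scalar estimate.
    """
    best_task = max(task_value(pi) for pi in policies)
    near_optimal = [pi for pi in policies if task_value(pi) >= best_task - slack]
    return min(near_optimal, key=nse_penalty)

# Toy usage with tabulated estimates (purely illustrative numbers):
estimates = {
    "aggressive": (10.0, 4.0),   # (task value, NSE penalty)
    "careful":    (9.5, 0.5),
    "timid":      (2.0, 0.0),
}
pi_star = lexicographic_select(
    policies=list(estimates),
    task_value=lambda pi: estimates[pi][0],
    nse_penalty=lambda pi: estimates[pi][1],
    slack=1.0,
)
print(pi_star)  # "careful": near-optimal on the task, far fewer side effects
```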
3. Impact Regularizers: Design Choices and Challenges
Impact regularizers (IRs) encapsulate a family of algorithms where the reward is penalized by a term that measures “impact”—the deviation from a reference state according to a chosen distance metric (Lindner et al., 2021). Their effectiveness depends on three orthogonal design decisions:
- Baseline selection: Common choices include the initial state ("inaction" baseline), a stepwise-inaction baseline, or an expert-policy baseline. Each has tradeoffs: starting-state baselines incentivize interference, whereas stepwise-inaction avoids interference and offsetting (Krakovna et al., 2018).
- Deviation measure: Options include Hamming distance, RR, value-differences (AUP), or future-task-based returns. Pure magnitude metrics are sign-agnostic and can penalize even beneficial changes, while RR and AUP can be crafted to respect sign and attain magnitude sensitivity (Lindner et al., 2021, Turner et al., 2020).
- Regularizer magnitude λ: an insufficient λ fails to prevent side effects, while an excessive λ blocks all task progress. Theory interprets λ as a Lagrange multiplier in a constrained RL framework (see the sketch at the end of this section).
The principal technical challenge is to simultaneously avoid interference (preventing changes not caused by the agent), offsetting (undoing progress just to match a baseline), and reward hacking (subverting the intent of the penalty). The combination of stepwise inaction and RR has been shown to uniquely satisfy these desiderata (Krakovna et al., 2018).
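The three design choices can be seen together in a schematic AUP-style penalty: the baseline is stepwise inaction (a no-op from the current state), the deviation measure is the mean absolute change in attainable auxiliary value, and λ scales the penalty. The auxiliary Q-functions are assumed to be supplied (e.g., learned offline); this is a sketch of the general pattern, not the exact formulation of the cited papers.

```python
def aup_penalty(state, action, aux_q_fns, noop="noop", lam=0.1):
    """Schematic AUP-style impact penalty.

    - Baseline: stepwise inaction (compare `action` to a no-op from `state`).
    - Deviation: mean absolute change in attainable auxiliary value.
    - Magnitude: `lam` plays the role of a Lagrange-multiplier-like weight.
    `aux_q_fns` is a list of callables q(state, action) -> float for auxiliary goals.
    """
    deviations = [abs(q(state, action) - q(state, noop)) for q in aux_q_fns]
    return lam * sum(deviations) / max(len(deviations), 1)

def shaped_reward(task_reward, state, action, aux_q_fns, lam=0.1):
    """Task reward minus the impact penalty, as used by an impact regularizer."""
    return task_reward(state, action) - aup_penalty(state, action, aux_q_fns, lam=lam)

# Toy usage: one auxiliary Q-function that values keeping a door usable.
aux = [lambda s, a: 0.0 if a == "smash_door" else 1.0]
print(aup_penalty("hallway", "smash_door", aux, lam=0.5))  # 0.5: penalized
print(aup_penalty("hallway", "noop", aux, lam=0.5))        # 0.0
```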
4. Environment Shaping: Human–Agent Team Models
Environment shaping introduces an explicit human-in-the-loop model in which a designer (distinct from the agent/actor) applies minor reconfiguration actions to the environment, mitigating side effects without reducing task performance (Saisubramanian et al., 2021). The formalism defines two MDPs:
- Actor’s MDP: a standard MDP ⟨S, A, T, R⟩, optimizing for task reward.
- Designer’s Model: a tuple ⟨Δ, f, C, N⟩, with Δ the set of modifications; f(E, m) mapping environment E and modification m ∈ Δ to a new environment; C(m) a cost function on modifications; and N(π_A, E′) giving the NSE penalty under the agent’s policy π_A in the new environment E′.
The designer selects a modification m ∈ Δ to maximize the reduction in the NSE penalty net of modification cost, [N(π_A, E) − N(π_A, f(E, m))] − C(m), subject to maintaining the actor’s task value in the modified environment f(E, m) within the allowed slack δ_A of its value in E.
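For a small, discrete set of modifications, this selection problem reduces to an exhaustive search; the helper below is a sketch under that assumption, with `task_value` and `nse_penalty` standing in for simulating the actor’s policy in each candidate environment.

```python
def shape_environment(env, modifications, apply_mod, nse_penalty, task_value,
                      mod_cost, slack):
    """Pick the modification that most reduces the NSE penalty, net of its cost,
    while keeping the actor's task value within `slack` of the unmodified value.
    Returns the original environment if no admissible modification helps."""
    base_value = task_value(env)
    base_nse = nse_penalty(env)
    best, best_score = env, 0.0
    for m in modifications:
        new_env = apply_mod(env, m)
        if task_value(new_env) < base_value - slack:
            continue  # violates the slack constraint on task performance
        score = (base_nse - nse_penalty(new_env)) - mod_cost(m)
        if score > best_score:
            best, best_score = new_env, score
    return best
```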
Empirical studies show high user willingness to perform modest environment shaping, with mechanisms such as feature-based clustering reducing the number of modifications evaluated (Saisubramanian et al., 2021).
5. Avoiding NSEs in Multi-Agent and Causal Settings
In multi-agent and causal-treatment contexts, NSEs emerge from unintended coordinated effects or from indirect (mediated) harm channels.
- Multi-agent systems: Side effects are minimized by lexicographic decentralized MDPs, with joint penalties decomposed via credit/blame assignment. Counterfactual neighbors are used to assign the causal share of joint NSE penalties to each agent, facilitating decentralized planning (Rustagi et al., 7 May 2024); see the sketch after this list.
- Causal inference and medical interventions: Evaluations target sharp (worst-case) bounds on the fraction of individuals harmed by a new treatment, using influence-function–based robust estimation of marginal and covariate-conditional bounds (Kallus, 2022). When side effects occur through mediators, optimal treatment rules can be learned to assign intervention only to those not predicted to exhibit harmful indirect (mediated) effects, using multiply robust pseudo-outcome regression (Rudolph et al., 2021).
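A simplified reading of the counterfactual decomposition in the multi-agent case (a generic marginal-contribution attribution, not the exact scheme of the cited work): each agent’s share of the joint NSE penalty is the amount of penalty that disappears when that agent alone is counterfactually replaced by a no-op policy.

```python
def blame_shares(joint_policy, joint_nse, noop_policy):
    """Attribute the joint NSE penalty across agents via single-agent counterfactuals.

    `joint_policy` maps agent id -> policy; `joint_nse` scores a full joint policy;
    `noop_policy` is the counterfactual replacement for one agent at a time.
    """
    full_penalty = joint_nse(joint_policy)
    shares = {}
    for agent in joint_policy:
        counterfactual = dict(joint_policy)
        counterfactual[agent] = noop_policy
        # Blame = how much penalty disappears if this agent alone stood down.
        shares[agent] = full_penalty - joint_nse(counterfactual)
    return shares
```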
6. Practical Methods in Real-World Domains
Several algorithmic approaches demonstrate effective avoidance of negative side effects in high-consequence domains:
- Reduced-model planning: Portfolios of reduced models allow agents to locally increase planning fidelity in risky states, nearly eliminating side effects while retaining planning speedups (Saisubramanian et al., 2019).
- Trajectory-based non-Markovian safety: Training an RNN classifier to score entire trajectories as undesirable or safe, and enforcing probabilistic constraints over these scores via Lagrangian relaxation, achieves strong NSE control beyond Markovian cost-based approaches (Low et al., 2023); see the sketch after this list.
- Drug safety: Bidirectional matrix factorization uses both side effects and indications to reduce false-positive side-effect predictions in drug-effect recommender systems (Azuma et al., 2022); graph co-attention networks detect combinatorial polypharmacy risks with high predictive performance (Deac et al., 2019).
- LLM safety: Non-pairwise, negative-sample–only distributional dispreference optimization (D²O) robustly reduces LLM harmfulness while keeping quality high by maximizing divergence from negative-only human labels (Duan et al., 6 Mar 2024). Risk-averse RLHF with CVaR tail-focused optimization outperforms risk-neutral RLHF at suppressing dangerous completions (Chaudhary et al., 12 Jan 2025).
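As referenced above, the trajectory-based approach can be pictured as a recurrent classifier that scores whole rollouts plus a Lagrange multiplier that prices violations of a probabilistic safety constraint; the PyTorch module and dual-ascent update below are a generic sketch under those assumptions, not the cited paper’s architecture.

```python
import torch
import torch.nn as nn

class TrajectoryClassifier(nn.Module):
    """Scores a whole trajectory as undesirable (1) or safe (0), so the
    judgement can depend on history rather than on Markovian per-step costs."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, traj: torch.Tensor) -> torch.Tensor:
        # traj: (batch, time, feat_dim) -> probability of being undesirable
        _, h_last = self.rnn(traj)
        return torch.sigmoid(self.head(h_last[-1])).squeeze(-1)

def lagrangian_policy_loss(task_return, p_unsafe, lam, threshold=0.05):
    """Penalize the expected probability of an undesirable trajectory above `threshold`."""
    violation = p_unsafe.mean() - threshold
    return -task_return.mean() + lam * violation, violation

# Dual ascent on the multiplier: raise lam while the constraint is violated.
lam, lam_lr = 0.0, 0.1
clf = TrajectoryClassifier(feat_dim=8)
trajs = torch.randn(16, 20, 8)   # dummy batch of rollouts
returns = torch.randn(16)        # dummy task returns
loss, violation = lagrangian_policy_loss(returns, clf(trajs), lam)
# In a full training loop, loss.backward() would update the policy parameters here.
lam = max(0.0, lam + lam_lr * violation.item())
```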
7. Limitations, Open Problems, and Design Recommendations
Despite considerable progress, challenges remain (Lindner et al., 2021, Saisubramanian et al., 2020, Saisubramanian et al., 2021):
- All approaches rely on principled baseline selection and deviation measures; poor choices (e.g., static starting-state baselines, unreachability metrics) produce perverse incentives.
- Human-in-the-loop strategies require well-defined, tractable sets of modifications and user understanding of NSE tradeoffs.
- High-dimensional or continuous environments challenge the scalability of explicit reachability or auxiliary-value-based penalties.
- Partial observability and multi-agent causal entanglement complicate both credit assignment and credible mitigation guarantees.
- Empirical benchmarks for real-world NSE are still immature relative to the diversity of side-effect types encountered in practice.
Design recommendations include explicit recognition of model incompleteness, adoption of multi-objective or constrained MDPs with slack, systematic impact regularization (RR/AUP/FT), integration of human oversight especially for critical configuration or policy steps, and the development of scenario-rich, real-world evaluation testbeds (Saisubramanian et al., 2020, Saisubramanian et al., 2021).
Open problems include developing baselines and penalties that remain effective in large-scale, partially observed, or highly stochastic settings, and extending avoidance/mitigation guarantees across dynamically evolving and causally complex environments.