Learning to Defer with Human Feedback

Updated 7 July 2025
  • Learning to Defer with Human Feedback (L2DHF) is an adaptive framework that empowers AI systems to decide when to operate autonomously or defer to human experts.
  • It utilizes deep reinforcement learning to iteratively update the deferral policy based on human corrections and real-world feedback.
  • Practical applications in security operations have demonstrated enhanced accuracy and reduced analyst workload through smarter, feedback-driven triaging.

Learning to Defer with Human Feedback (L2DHF) defines a paradigm in which machine learning systems are explicitly designed to decide—per instance—whether to act autonomously, defer the decision to a human, or request/adapt feedback from a human expert. By combining uncertainty-aware automated decision-making with human-in-the-loop supervision, L2DHF seeks to optimize both overall system accuracy and resource allocation, especially in high-stakes or rapidly evolving environments where model confidence is often insufficient or miscalibrated. The L2DHF approach has been notably applied in domains such as alert prioritisation in security operations centres (SOCs), where the adaptive involvement of expert analysts is paramount (2506.18462).

1. Defining L2DHF and Its Motivation

Learning to Defer with Human Feedback (L2DHF) expands on classical Learning to Defer (L2D) by allowing the deferral policy to be adaptive and responsive to streaming human feedback. In standard L2D, an AI model is trained to make confident predictions when it expects to be accurate, and otherwise “defers” to a human expert, treating the deferral option as an additional action with its corresponding cost. However, traditional L2D policies are generally fixed after training, lacking mechanisms to improve dynamically as more human input and corrections are received.
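
To make the classical L2D baseline concrete, the sketch below shows a minimal cost-based deferral rule under a 0/1-loss assumption; the function name, probabilities, and deferral cost are illustrative and not taken from the paper.

```python
import numpy as np

def l2d_decision(class_probs: np.ndarray, defer_cost: float):
    """Classical L2D rule: act autonomously when the expected error cost
    of the model's best guess is below the cost of deferring; otherwise
    hand the case to a human expert."""
    best_class = int(np.argmax(class_probs))
    expected_error_cost = 1.0 - class_probs[best_class]  # expected 0/1 loss
    if expected_error_cost <= defer_cost:
        return ("predict", best_class)
    return ("defer", None)

# A confident prediction is kept; an uncertain one is deferred.
print(l2d_decision(np.array([0.92, 0.05, 0.03]), defer_cost=0.2))  # ('predict', 0)
print(l2d_decision(np.array([0.40, 0.35, 0.25]), defer_cost=0.2))  # ('defer', None)
```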

L2DHF addresses this limitation by:

  • Embedding a continual learning process where human feedback on deferred cases propagates back to update the deferral strategy.
  • Using reinforcement signals derived from human corrections or validations to iteratively adjust the deferral policy.
  • Explicitly aiming to minimize not only the cost of mistaken autonomous actions but also the resource burden on human experts through smarter, experience-driven triaging.

The motivation is to operationalize human-AI teaming such that critical, ambiguous, or novel cases are escalated to human decision-makers, while routine cases are efficiently handled by automated models. This reduces both misclassification risk and unnecessary overload for human analysts (2506.18462).

2. Framework Architecture and Learning Mechanism

L2DHF integrates three primary system components (see the structural sketch after this list):

  1. Predictive AI Model: Assigns preliminary scores, such as risk or priority, based on alert or input features using supervised learning or ensemble methods.
  2. Adaptive Deferral Agent: Implements policy decisions on whether to accept the AI’s output or defer to human review. This agent is realized as a Deep Reinforcement Learning (DRL) unit, parameterized and updated through human feedback signals.
  3. Human Feedback Loop: Completes the triad by capturing the actions and corrections made by human experts. Outcomes from human-reviewed alerts are stored in repositories and used both directly for operational decisions and indirectly as rewards for the DRL agent.
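
These components can be read as the following interfaces; this is a structural sketch only, and the class and method names are invented for illustration rather than drawn from the paper.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    features: list[float]
    ai_priority: int | None = None      # set by the predictive model
    final_priority: int | None = None   # set by the agent or an analyst

class PredictiveModel:
    """Component 1: assigns a preliminary priority from alert features."""
    def score(self, alert: Alert) -> int:
        raise NotImplementedError  # e.g. a supervised classifier or ensemble

class DeferralAgent:
    """Component 2: DRL policy that accepts the AI output or defers."""
    def decide(self, state: list[float]) -> str:
        raise NotImplementedError  # returns "accept" or "defer"
    def update(self, state: list[float], action: str, reward: float) -> None:
        raise NotImplementedError  # learns from feedback-derived rewards

class AnalystRepository:
    """Component 3: stores analyst-validated alerts (the AVAR) for reuse."""
    def __init__(self) -> None:
        self.validated: list[Alert] = []
    def add(self, alert: Alert) -> None:
        self.validated.append(alert)
```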

The deferral process operates as follows (2506.18462); a sketch of a single decision step appears after the list:

  • The AI assigns a candidate priority to each alert.
  • The DRL agent, observing a state vector (comprising the AI’s score, alert features, similarity to previously reviewed alerts, and related statistics), chooses to accept or defer.
  • When an alert is deferred, a human analyst reviews and corrects the output if necessary; their decision is fed back to refine future deferral policies.
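
A single pass through this loop might look like the sketch below. The state layout, reward magnitudes, and helper names are assumptions made for illustration, not the paper's exact formulation.

```python
def deferral_step(alert_features, ai_priority, similarity, agent, analyst_review):
    """One L2DHF decision step. `agent` exposes decide(state) and
    update(state, action, reward); `analyst_review` returns the
    analyst-corrected priority for a deferred alert."""
    # State observed by the DRL agent: AI score, similarity to previously
    # reviewed alerts, and the raw alert features.
    state = [ai_priority, similarity, *alert_features]

    action = agent.decide(state)
    if action == "accept":
        return ai_priority  # AI output stands; no analyst time is spent

    corrected = analyst_review(alert_features, ai_priority)
    # Deferring a misprioritized alert earns a positive reward; deferring a
    # correctly prioritized one is penalized, scaled by its criticality.
    reward = 1.0 if corrected != ai_priority else -0.5 * (1 + ai_priority)
    agent.update(state, action, reward)
    return corrected
```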

The adaptive deferral agent is often implemented as a Dueling Double Deep Q Network (D3QN), which exploits both value and advantage functions for robust learning. The reward structure is critical: it is tuned so that deferring a misprioritized alert yields high positive reward, incentivizing correction in future interactions, while deferring correctly prioritized alerts is penalized (with penalties scaled according to the criticality of the alert).
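
The dueling and double-DQN mechanics referenced here can be sketched as follows (PyTorch); the layer sizes, the two-action output, and the omission of terminal-state masking are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling Q-network over the two actions (accept / defer)."""
    def __init__(self, state_dim: int, n_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v, a = self.value(h), self.advantage(h)
        # Dueling combination: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=-1, keepdim=True)

def double_dqn_target(reward, next_state, online, target, gamma=0.99):
    """Double-DQN target: the online net picks the next action, the target
    net evaluates it, reducing Q-value overestimation."""
    best_action = online(next_state).argmax(dim=-1, keepdim=True)
    return reward + gamma * target(next_state).gather(-1, best_action).squeeze(-1)
```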

3. Human Feedback: Roles and Integration

Human feedback in L2DHF serves two interconnected roles:

  • Supervisory Correction: Each time the system defers, the human expert’s corrected prioritization provides a direct signal to the adaptive agent about the appropriateness of the original automated decision.
  • State Augmentation for Future Alerts: Analyst-validated cases are logged in an Analyst-Validated Alert Repository (AVAR). Future alerts are compared via similarity metrics (e.g., distance in feature space) to the AVAR; high similarity to prior, validated alerts provides additional context for the deferral agent (2506.18462).

Feedback is thus not only a reward (for DRL) but also a mechanism for dynamically constructing states, implicitly capturing evolving threat or operational patterns. This intertwined use of feedback ensures that the deferral mechanism continually learns, adapts to analyst expertise distribution, and can leverage previous human knowledge to inform present decisions.
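
The state-augmentation role can be sketched as a nearest-neighbor lookup against the AVAR; the Euclidean metric and the two derived features below are illustrative choices, since the summary only states that a feature-space similarity to validated alerts is used.

```python
import numpy as np

def avar_similarity_features(alert_features: np.ndarray,
                             avar_features: np.ndarray,
                             avar_priorities: np.ndarray) -> np.ndarray:
    """Return extra state features describing how closely an incoming alert
    resembles analyst-validated alerts stored in the AVAR."""
    if len(avar_features) == 0:
        return np.array([0.0, -1.0])  # no history yet: zero similarity, no prior label

    dists = np.linalg.norm(avar_features - alert_features, axis=1)
    nearest = int(np.argmin(dists))
    similarity = 1.0 / (1.0 + dists[nearest])   # in (0, 1], higher means closer
    return np.array([similarity, float(avar_priorities[nearest])])

# The agent's state can then concatenate the AI score, the raw alert features,
# and these AVAR-derived features before a deferral decision is made.
```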

4. Experimental Validation and Practical Outcomes

Empirical evaluation of L2DHF has centered on large-scale security alert datasets, such as UNSW-NB15 and CICIDS2017. Key findings include:

  • Enhanced Alert Prioritization (AP) Accuracy: L2DHF achieved accuracy improvements of 13–16% for critical alerts on UNSW-NB15 and 60–67% on CICIDS2017, compared to both standalone predictive AI and static L2D baselines.
  • Reduction in Misprioritization and False Positives: Misclassifications of critical or high-severity alerts (e.g., incorrectly labeling a severe attack as a low-priority alert) dropped significantly, in some cases by 98%.
  • Decreased Deferrals and Analyst Workload: L2DHF reduced the proportion of alerts requiring escalation by 37% on UNSW-NB15. A more selective, experience-driven deferral policy means fewer routine cases interrupt the analyst, targeting their attention where it is most impactful.
  • Execution Efficiency: The framework was implemented with per-time-step execution times ranging from 10 to 40 seconds, making it compatible with the operational tempo of modern SOCs (2506.18462).

These results demonstrate that adaptive feedback-driven deferral both improves system accuracy and directly reduces human workload, addressing scalability bottlenecks in cyber defense and similar domains.

5. Reward Formulation and Theoretical Principles

The L2DHF reward scheme is specifically structured to align the agent’s objectives with both accuracy and resource optimization. The paper provides, for different alert categories, parameterized rewards:

  • Let q, z, f, g, h, w encode category-specific penalties or bonuses.
  • Deferring a misprioritized critical alert yields a high positive reward, whereas deferring a correctly prioritized alert yields a negative reward, proportional to alert severity.

This reward shaping, coupled with reinforcement learning, guides the agent toward minimizing both misprioritization cost and unnecessary human interventions. The structure balances exploration (occasionally deferring borderline cases to learn from human corrections) and exploitation (accepting correctly classified cases autonomously as confidence grows), analogous in spirit to drift-plus-penalty methods from online learning.
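
A hedged sketch of such a reward is given below. The parameter names stand in for the paper's q, z, f, g, h, w, and the treatment of accepted (non-deferred) alerts is an additional assumption, since this summary states only the qualitative shape of the scheme.

```python
def deferral_reward(deferred: bool, ai_priority: int, true_priority: int,
                    severity_weight: float, bonus: float, penalty: float) -> float:
    """Illustrative reward in the spirit of the scheme described above."""
    misprioritized = ai_priority != true_priority
    if deferred and misprioritized:
        # Catching a misprioritized alert is strongly rewarded, more so
        # for severe (critical) alerts.
        return bonus * severity_weight
    if deferred and not misprioritized:
        # Unnecessary deferral wastes analyst time; the penalty scales
        # with how critical the already-correct alert was.
        return -penalty * severity_weight
    # Accepted alerts (assumption): reward correct autonomous decisions,
    # penalize mistakes in proportion to severity.
    return bonus if not misprioritized else -bonus * severity_weight
```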

6. Practical Implications and Deployment Considerations

L2DHF establishes a framework for real-time, adaptive human-AI teaming in settings where automated models are inherently imperfect:

  • Continuous Learning: The deferral mechanism is continually updated rather than frozen after initial training. As new threat vectors or types of alerts emerge, the system remains responsive.
  • Analyst-Facing Efficiency: Reduction of unnecessary deferrals combats alert fatigue. The system’s smart triaging lets analysts focus on complex, high-risk, or novel alerts.
  • Data Efficiency and Personalization: By leveraging the AVAR, the system learns from the actual operational context and historical analyst decisions, including evolving best practices or institution-specific guidelines.

Deployment of L2DHF systems requires robust data pipelines for storing and matching analyst-validated alerts, efficient DRL agents (with hardware acceleration for time-sensitive applications), and careful tuning of the reward function to align with organizational priorities and risk tolerance.

7. Future Directions

Several ongoing and open research avenues are indicated:

  • Scaling to Multiple Analysts: Future systems may extend L2DHF to multi-analyst settings, allocating deferred cases based on individual expertise or load, and incorporating consensus or arbitration mechanisms.
  • Enhanced State Representation: Further integration of richer alert context, perhaps using natural language content or outputs from LLMs to augment alert summaries or explanations.
  • Personalization and Fairness: Tailoring deferral and triaging strategies to analyst preferences or institutional guidelines, and ensuring equitable distribution of workload and review exposure.
  • Longitudinal Feedback Loops: Incorporating multi-stage or repeated feedback, with mechanisms for post-hoc correction or escalation, further strengthening the adaptive learning cycle.

L2DHF’s advances suggest broader potential for adaptive deferral frameworks in other complex, high-stakes environments such as healthcare triage, financial fraud detection, and emergency management, wherever dynamic collaboration between AI and expert humans can yield synergistic gains (2506.18462).

References

  • arXiv:2506.18462