- The paper introduces a risk-aware framework for contextual bandits that accommodates constraints and continuous action spaces using a multi-critic neural architecture.
- The proposed method employs a deterministic actor alongside multiple critics to effectively address aleatoric uncertainty in constraint metrics.
- Experimental results demonstrate robust constraint satisfaction and energy efficiency in both synthetic environments and a 5G mobile network resource allocation task.
Introduction
Recent progress in learning-based decision making applies to a wide range of real-world scenarios. These scenarios often involve critical operational constraints that must be respected. Standard learning solutions, however, may overlook the risk of violating these constraints, which hinders their practical deployment. To address this gap, the paper proposes a framework for contextual bandit problems that accommodates constraints and continuous action spaces in a risk-aware decision-making setting.
Problem Formulation
The paper tackles a sequential decision-making problem known as the contextual bandit problem, with constraints that must be satisfied at every decision step. In each round, the learner observes a context from the environment, selects an action, and then observes the outcome: a reward and several metrics representing the constraints. The observed metrics are random, a form of aleatoric uncertainty, so an action that satisfies the constraints on average can still violate them in a given round. This matters because constraint satisfaction is vital in most applications.
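The interaction protocol described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the `ToyConstrainedBandit` environment, its linear mean metrics, the Gaussian noise, the thresholds, and the uniform random policy are all hypothetical and are not taken from the paper.

```python
import numpy as np


class ToyConstrainedBandit:
    """Hypothetical environment: one reward plus several noisy constraint metrics."""

    def __init__(self, context_dim=3, action_dim=2, num_constraints=2,
                 noise_std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.context_dim = context_dim
        self.action_dim = action_dim
        self.num_constraints = num_constraints
        self.noise_std = noise_std
        # Fixed weights defining the mean metrics (unknown to the learner).
        self.w_ctx = self.rng.normal(size=(num_constraints, context_dim))
        self.w_act = self.rng.normal(size=(num_constraints, action_dim))

    def observe_context(self):
        return self.rng.uniform(-1.0, 1.0, size=self.context_dim)

    def step(self, context, action):
        # Illustrative reward: larger actions pay more, plus observation noise.
        reward = float(action.sum() + self.rng.normal(scale=self.noise_std))
        mean_metrics = self.w_ctx @ context + self.w_act @ action
        # Aleatoric uncertainty: the metrics are noisy even for a fixed (context, action).
        noise = self.rng.normal(scale=self.noise_std, size=self.num_constraints)
        return reward, mean_metrics + noise


def run_rounds(env, policy, thresholds, num_rounds=1000):
    """Play num_rounds rounds; return mean reward and per-constraint violation rate."""
    rewards = np.empty(num_rounds)
    violations = np.zeros((num_rounds, env.num_constraints), dtype=bool)
    for t in range(num_rounds):
        context = env.observe_context()          # 1. observe context
        action = policy(context)                 # 2. select action
        reward, metrics = env.step(context, action)  # 3. observe reward and metrics
        rewards[t] = reward
        violations[t] = metrics > thresholds     # a constraint holds when metric <= threshold
    return rewards.mean(), violations.mean(axis=0)


env = ToyConstrainedBandit()
policy_rng = np.random.default_rng(1)


def uniform_policy(context):
    """Baseline that ignores the context and samples an action in [0, 1]^action_dim."""
    return policy_rng.uniform(0.0, 1.0, size=env.action_dim)


mean_reward, violation_rate = run_rounds(env, uniform_policy,
                                         thresholds=np.array([0.5, 0.5]))
print(f"mean reward: {mean_reward:.3f}, violation rate per constraint: {violation_rate}")
```

The per-constraint violation rate is the quantity a risk-aware learner would try to keep below a chosen tolerance while maximizing reward.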
Proposed Method
The proposed method uses a neural architecture with a deterministic actor and multiple critics, each estimating the distribution of a distinct performance metric, so that decisions can be adjusted to different risk levels. Because the critics capture the inherent stochasticity of the metrics, the system can balance constraint satisfaction against performance. The framework operates in high-dimensional continuous action spaces, unlike prior approaches that assume linear constraints or ignore the randomness in performance metrics.
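A minimal sketch of how an actor with several distributional critics might fit together is given below. It rests on assumptions the summary does not confirm: each critic is taken to output a fixed set of quantiles (trained with the standard pinball loss), the risk level is taken to select which constraint quantile is compared with its threshold, and violations are handled with a simple penalty. The networks are untrained, and every name (`act`, `risk_adjusted_objective`, `penalty`, and so on) is hypothetical.

```python
import numpy as np


def pinball_loss(predicted_quantiles, target, levels):
    """Quantile-regression (pinball) loss of one observed target, averaged over levels."""
    diff = target - predicted_quantiles
    return float(np.mean(np.maximum(levels * diff, (levels - 1.0) * diff)))


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a two-layer perceptron."""
    return (rng.normal(scale=0.3, size=(in_dim, hidden_dim)), np.zeros(hidden_dim),
            rng.normal(scale=0.3, size=(hidden_dim, out_dim)), np.zeros(out_dim))


def mlp(params, inputs):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    w1, b1, w2, b2 = params
    return np.tanh(inputs @ w1 + b1) @ w2 + b2


rng = np.random.default_rng(0)
context_dim, action_dim, hidden_dim, num_constraints = 3, 2, 16, 2
levels = np.array([0.1, 0.25, 0.5, 0.75, 0.9])  # assumed quantile levels per critic

actor = init_mlp(rng, context_dim, hidden_dim, action_dim)
reward_critic = init_mlp(rng, context_dim + action_dim, hidden_dim, len(levels))
constraint_critics = [
    init_mlp(rng, context_dim + action_dim, hidden_dim, len(levels))
    for _ in range(num_constraints)
]


def act(context):
    """Deterministic actor: maps a context to one action in [-1, 1]^action_dim."""
    return np.tanh(mlp(actor, context))


def risk_adjusted_objective(context, action, thresholds, risk_level, penalty=10.0):
    """Median reward estimate minus a penalty on constraint quantiles above threshold."""
    critic_input = np.concatenate([context, action])
    k = int(np.argmin(np.abs(levels - risk_level)))  # nearest available quantile level
    reward_estimate = mlp(reward_critic, critic_input)[len(levels) // 2]
    constraint_quantiles = np.array(
        [mlp(critic, critic_input)[k] for critic in constraint_critics]
    )
    excess = np.maximum(constraint_quantiles - thresholds, 0.0)
    return float(reward_estimate - penalty * excess.sum()), constraint_quantiles


context = rng.uniform(-1.0, 1.0, size=context_dim)
action = act(context)
objective, quantiles = risk_adjusted_objective(
    context, action, thresholds=np.zeros(num_constraints), risk_level=0.9
)
```

In this sketch, a higher `risk_level` checks a higher quantile of each constraint metric and is therefore more conservative once the critics are trained; the untrained heads here give no such guarantee, and nothing forces their quantile outputs to be ordered. In a full training loop the actor would be updated to increase this objective, which is again an assumption rather than the paper's stated procedure.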
Evaluation
The method was compared against state-of-the-art baselines in two environments. In a synthetic environment, it showed robust constraint satisfaction under varying risk levels while maintaining a reasonable reward. In a real-world 5G mobile network resource allocation problem, it met system constraints reliably with minimal energy consumption. These results suggest that the model can balance performance against constraint adherence and adapt to different application requirements and risk profiles.