Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services

Published 19 Apr 2026 in cs.DC, cs.ET, and cs.PF | (2604.17373v1)

Abstract: Edge computing enables AI inference closer to data sources, reducing latency and bandwidth costs. However, orchestrating AI services across the cloud-edge continuum remains challenging due to dynamic workloads and infrastructure variability. We present AIF-Router, an Active Inference-based routing framework that autonomously learns to balance latency, throughput, and resource utilization across multi-tier AI services without offline training. AIF-Router performs Bayesian state inference and expected free energy minimization to guide routing decisions based on observability-driven real-time metrics. Despite device instability on edge nodes, AIF-Router exhibits stable online learning behavior and demonstrates the feasibility of applying Active Inference for adaptive AI service orchestration in unreliable edge environments. Our findings highlight both the promise and practical challenges of deploying self-adaptive decision-making frameworks for real-world edge AI systems.

Authors (3)

Summary

The paper presents AIF-Router, a novel adaptive routing framework using active inference to orchestrate AI workloads with zero-shot adaptation across edge and cloud tiers.
The paper leverages Bayesian inference and expected free energy minimization to balance latency reduction and reliability in dynamic, heterogeneous systems.
The paper validates AIF-Router on a 3-tier testbed, achieving a 34.7% reduction in median latency while experiencing higher failure rates on unstable edge devices.

Introduction

The paper "Active Inference-Based Adaptive Routing for Heterogeneous Edge AI Services" (2604.17373) introduces AIF-Router, an online adaptive routing framework leveraging active inference (AIF) for orchestrating AI inference workloads across heterogeneous edge and cloud infrastructure. The work addresses the challenge of efficiently handling dynamic workloads on devices exhibiting variable resource availability and operational stability. Unlike reinforcement learning (RL) and control-theoretic approaches, which require precise system modeling or expensive offline training, AIF-Router provides zero-shot adaptation by continuously learning from streaming runtime observations and updating its generative model in an online fashion.

AIF-Router’s design applies a probabilistic, Bayesian approach to online adaptation in edge AI, a setting characterized by partial observability, device heterogeneity, and infrastructure instability. The key insight is that active inference, through expected free energy minimization, unifies goal-driven decision making with epistemic exploration, thus providing a scalable mechanism for balancing latency, reliability, and resource utilization under non-stationary and uncertain operating conditions.

AIF-Router Architecture and Control Flow

AIF-Router operates as an autonomous mediator between incoming inference requests and a set of back-end AI service tiers: light (low-power edge), medium (mid-range edge), and heavy (cloud/server). The control flow consists of a fast Bayesian inference and action selection loop operating at 1 Hz, and a slower online model learning and consolidation step at 0.1 Hz.
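The two-rate control flow can be illustrated with a minimal sketch. It assumes a simulated clock with one tick per second; `run_control_loop`, `infer_and_act`, and `learn_models` are hypothetical names, not taken from the paper.

```python
def run_control_loop(total_seconds, infer_and_act, learn_models,
                     fast_period_s=1, slow_period_s=10):
    """Simulate the timescale separation: a fast loop (1 Hz) and a slow loop (0.1 Hz).

    infer_and_act and learn_models are placeholder callbacks (assumptions);
    each receives the current simulated time in seconds.
    Returns how many times each loop fired.
    """
    fast_calls = 0
    slow_calls = 0
    for t in range(1, total_seconds + 1):
        if t % fast_period_s == 0:
            infer_and_act(t)   # Bayesian state inference + action selection
            fast_calls += 1
        if t % slow_period_s == 0:
            learn_models(t)    # online model learning and consolidation
            slow_calls += 1
    return fast_calls, slow_calls


# One simulated minute: the fast loop fires every second, the slow loop every ten.
fast, slow = run_control_loop(60, lambda t: None, lambda t: None)
```

A real deployment would presumably drive both loops from wall-clock timers or incoming telemetry rather than a counter; the counter only makes the 10:1 ratio explicit.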

Figure 1: AIF-Router control flow with Bayesian state inference, action selection, and multi-tier request dispatching.

Upon receipt of aggregated telemetry (e.g., latency, throughput, error rates), AIF-Router updates its posterior distribution over a discretized five-dimensional system state space, encompassing both global load and per-tier resource utilization. Routing policies are selected by minimizing the expected free energy of candidate actions, computed using a learned observation model ($A$), transition model ($B$), and an adaptive preference model ($C$). By leveraging timescale separation, the agent maintains inference stability while continuously adapting its generative models based on recent history.
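The posterior update can be sketched as a standard discrete Bayes filter step. This is a minimal illustration, assuming a single observation modality and the matrix layouts `A[o][s]` (probability of observation `o` in state `s`) and `B_a[s_next][s_prev]` (transition under the chosen action); the function name and layouts are assumptions, not the paper's implementation.

```python
def update_belief(prior, B_a, A, obs):
    """One predict-then-correct step over a discrete state space.

    prior : list of state probabilities before the step
    B_a   : transition matrix for the last action, B_a[s_next][s_prev] (assumed layout)
    A     : observation matrix, A[o][s] (assumed layout)
    obs   : index of the observation just received
    """
    n = len(prior)
    # Predict: push the prior through the action-conditioned transition model.
    predicted = [sum(B_a[s2][s1] * prior[s1] for s1 in range(n)) for s2 in range(n)]
    # Correct: weight each state by the likelihood of the observation.
    unnorm = [A[obs][s] * predicted[s] for s in range(n)]
    z = sum(unnorm)
    if z == 0:
        # Observation impossible under the model: fall back to the prediction.
        return predicted
    return [u / z for u in unnorm]


# Two-state toy example: observation 0 is more likely in state 0.
posterior = update_belief(
    prior=[0.5, 0.5],
    B_a=[[1.0, 0.0], [0.0, 1.0]],
    A=[[0.9, 0.2], [0.1, 0.8]],
    obs=0,
)
```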

State, Action, and Observation Modelling

The system state space discretizes five variables (latency, request rate, and the CPU utilization of each of the three tiers) into three levels each (low/medium/high), yielding $3^5 = 243$ possible states. The action space consists of 20 routing policies, spanning the simplex of tier weights but coarsely discretized to ensure computational tractability without sacrificing coverage of plausible strategies. Observations are high-level performance metrics, which the router inverts into latent state beliefs using its learned $A$ matrix.
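To make the state count concrete, the sketch below enumerates a five-factor, three-level state space and maps each combination to a flat index. The factor names, their order, and the cut points passed to `discretize` are assumptions chosen for illustration; the summary does not give the paper's thresholds.

```python
from itertools import product

# Factor order is an assumption; the paper only names the quantities.
FACTORS = ("latency", "request_rate", "cpu_light", "cpu_medium", "cpu_heavy")
N_LEVELS = 3  # 0 = low, 1 = medium, 2 = high


def discretize(value, low_cut, high_cut):
    """Map a raw metric to a level index using two hypothetical cut points."""
    if value < low_cut:
        return 0
    if value < high_cut:
        return 1
    return 2


def state_index(levels):
    """Encode a tuple of per-factor levels as a single base-3 integer."""
    idx = 0
    for level in levels:
        idx = idx * N_LEVELS + level
    return idx


ALL_STATES = list(product(range(N_LEVELS), repeat=len(FACTORS)))
# len(ALL_STATES) is 3**5, matching the 243 states described above.
```

The 20 routing policies are not reconstructed here, since the summary does not list them.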

This design embodies important trade-offs: the discretized representation improves sample efficiency and noise tolerance while limiting granularity. The reliance on non-parametric, count-based updating over the state and action spaces allows AIF-Router to operate in realistic environments, where model errors and runtime drifts dominate.

Generative Model Learning and Adaptive Preferences

Key to AIF-Router’s performance is online adaptation of the observation ($A$) and transition ($B$) models. The observation model factorizes the likelihood over latency, requests per second (RPS), queue, and error observations, while the transition model encodes action-conditioned Markovian dynamics over the state space. Importantly, the framework introduces a preference model ($C$) that can be adaptively reshaped based on recent error trends, allowing the system to re-weight the softmax objective between latency minimization and reliability, which is critical when edge servers undergo frequent instability.
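The summary does not specify how the preference model is reshaped. As a purely hypothetical illustration, the sketch below interpolates between a latency-oriented and a reliability-oriented utility vector according to the recent error rate and then applies a softmax. The function name, the linear weighting, and `max_error_rate` are all assumptions.

```python
import math


def adaptive_preferences(latency_utils, reliability_utils,
                         recent_error_rate, max_error_rate=0.5):
    """Hypothetical preference reshaping (not the paper's rule).

    latency_utils / reliability_utils : per-outcome utilities under each objective
    recent_error_rate                 : observed error fraction over a recent window
    Returns a probability vector over outcomes (softmax of the blended utilities).
    """
    # Weight on reliability grows linearly with the error rate, clipped to [0, 1].
    w = min(max(recent_error_rate / max_error_rate, 0.0), 1.0)
    blended = [(1.0 - w) * l + w * r for l, r in zip(latency_utils, reliability_utils)]
    m = max(blended)  # subtract the max for numerical stability
    exps = [math.exp(u - m) for u in blended]
    z = sum(exps)
    return [e / z for e in exps]


# With no recent errors the latency objective dominates;
# at or above max_error_rate the reliability objective does.
calm = adaptive_preferences([0.0, -1.0], [-1.0, 0.0], recent_error_rate=0.0)
stressed = adaptive_preferences([0.0, -1.0], [-1.0, 0.0], recent_error_rate=0.5)
```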

The Bayesian pseudo-count approach to model updates, combined with experience replay and delayed credit weighting of observations, yields robustness against transient behaviors and non-i.i.d. dynamics.
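A pseudo-count update of this kind can be sketched as Dirichlet-style counting followed by column normalization. The example below covers a single action's transition matrix; the `weight` argument stands in for replay or delayed-credit weighting, whose exact form is not given in the summary, and the helper names are hypothetical.

```python
def make_counts(n_rows, n_cols, prior=1.0):
    """Initialise a count matrix with a uniform pseudo-count prior."""
    return [[prior] * n_cols for _ in range(n_rows)]


def observe_transition(counts, s_prev, s_next, weight=1.0):
    """Add (possibly down-weighted) evidence for the transition s_prev -> s_next."""
    counts[s_next][s_prev] += weight


def normalize_columns(counts):
    """Turn counts[s_next][s_prev] into probabilities P(s_next | s_prev)."""
    n_rows, n_cols = len(counts), len(counts[0])
    col_sums = [sum(counts[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [[counts[r][c] / col_sums[c] for c in range(n_cols)] for r in range(n_rows)]


# Toy example with three states: state 0 is seen moving to state 1 four times.
counts = make_counts(3, 3)
for _ in range(4):
    observe_transition(counts, s_prev=0, s_next=1)
B_a = normalize_columns(counts)
```

Because the prior counts never vanish, unseen transitions keep a small nonzero probability, which is one reason count-based models tolerate sparse or noisy data.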

Expected Free Energy Minimization

Action selection is operationalized via minimization of expected free energy:

$G(a) = \text{Risk}(a) + \text{Ambiguity}(a) + \text{Cost}(a)$

Risk quantifies divergence from preferred outcomes ($C$), ambiguity captures the epistemic value (encouraging exploration), and cost penalizes extreme routing decisions for regularization. Softmax sampling over the negative expected free energy with a pre-specified temperature ensures stochastic policy selection and robust exploration.
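The three-term objective and the softmax selection can be sketched as follows. The sketch assumes the forms commonly used in discrete active inference: risk as the KL divergence between predicted outcomes and the preference distribution, and ambiguity as the expected entropy of the observation likelihood. The cost term is passed in as a number because its form is not given here; matrix layouts and function names are assumptions.

```python
import math


def expected_free_energy(belief, B_a, A, pref, cost=0.0):
    """G(a) = Risk + Ambiguity + Cost for one action (assumed standard forms).

    belief : current state probabilities
    B_a    : transition matrix for the action, B_a[s_next][s_prev]
    A      : observation matrix, A[o][s]
    pref   : preferred outcome distribution (must be > 0 wherever outcomes are predicted)
    """
    n_s, n_o = len(belief), len(A)
    qs = [sum(B_a[s2][s1] * belief[s1] for s1 in range(n_s)) for s2 in range(n_s)]
    qo = [sum(A[o][s] * qs[s] for s in range(n_s)) for o in range(n_o)]
    # Risk: KL divergence between predicted and preferred outcomes.
    risk = sum(q * math.log(q / p) for q, p in zip(qo, pref) if q > 0)
    # Ambiguity: expected entropy of P(o | s) under the predicted states.
    ambiguity = sum(
        qs[s] * -sum(A[o][s] * math.log(A[o][s]) for o in range(n_o) if A[o][s] > 0)
        for s in range(n_s)
    )
    return risk + ambiguity + cost


def softmax_policy(G, temperature=1.0):
    """Action probabilities proportional to exp(-G / temperature)."""
    logits = [-g / temperature for g in G]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy example: action 0 keeps the system in the preferred state, action 1 leaves it.
A = [[0.9, 0.1], [0.1, 0.9]]
stay = [[1.0, 0.0], [0.0, 1.0]]
leave = [[0.0, 1.0], [1.0, 0.0]]
G = [expected_free_energy([1.0, 0.0], B, A, pref=[0.9, 0.1]) for B in (stay, leave)]
probs = softmax_policy(G, temperature=1.0)
```

An action could then be drawn with `random.choices(range(len(probs)), weights=probs)`. Lower temperatures concentrate probability on the lowest-$G$ action, while higher temperatures spread it out.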

Evaluation: Latency-Reliability Trade-off

Comprehensive experiments were conducted on a 3-tier physical testbed with jointly deployed edge (NVIDIA Jetson Orin) and cloud (Ryzen) nodes, using a high-intensity Tiny-ImageNet workload.

AIF-Router demonstrates a statistically significant 34.7% reduction in median (P50) latency (2003 ms) compared to a fixed-weight baseline (3067 ms), with a $p$-value much less than 0.0001.

Figure 2: P50 latency comparison. AIF-Router achieves 34.7% lower median latency (2003 ms vs. 3067 ms, $p \ll 0.0001$).

However, this comes at the expense of an 11.5 percentage point lower overall success rate (77.9% vs. 89.4%). The elevated failure rate is directly associated with frequent edge device instabilities (pod restarts and transient unavailability), particularly on the Jetson platforms. Adaptive routing also shifts tier allocation: a larger share of requests goes to the heavy tier (46% vs. 38%), reflecting the system’s learned preference for high-capacity servers when available.
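The headline figures can be checked directly from the reported values. The short calculation below uses only the numbers quoted above; the relative success-rate drop is a derived quantity, not one stated in the paper.

```python
baseline_p50_ms = 3067   # fixed-weight baseline, median latency
aif_p50_ms = 2003        # AIF-Router, median latency

# Relative latency reduction with respect to the baseline.
latency_reduction_pct = 100 * (baseline_p50_ms - aif_p50_ms) / baseline_p50_ms

baseline_success_pct = 89.4
aif_success_pct = 77.9

# Absolute gap in percentage points, and the same gap relative to the baseline.
success_gap_pp = baseline_success_pct - aif_success_pct
success_drop_relative_pct = 100 * success_gap_pp / baseline_success_pct

print(round(latency_reduction_pct, 1))      # 34.7
print(round(success_gap_pp, 1))             # 11.5
print(round(success_drop_relative_pct, 1))  # 12.9
```

The distinction matters when reading the trade-off: the latency gain is a relative change, whereas the 11.5-point success gap is an absolute difference between two percentages.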

Figure 3: Tier allocation comparison. AIF-Router learns to allocate more requests to the heavy tier (46% vs 38%) after observing performance feedback, while experiencing higher failure rates on unstable edge devices.

The experimental methodology is robust, with low variance across replicates, demonstrating stable adaptation and statistically sound findings.

Implications and Theoretical Discussion

This work positions active inference as a strong candidate for self-adaptive edge AI orchestration. While RL-based methods (e.g., Decima, Gandiva) achieve similar adaptability, they generally require extensive offline training and environmental specificity. In contrast, AIF-Router’s zero-shot adaptability and online model learning support transferability across deployments, at the cost of increased sensitivity to catastrophic edge failures. The research provides evidence that minimizing expected free energy, which balances goal-directed action and epistemic exploration, can yield performant, sample-efficient online control in the partially observable Markov decision processes (POMDPs) typical of real-world edge computing.

The paper identifies several key limitations:

The coarse state discretization improves stability but may miss fine-grained effects;
Preference modeling, while adaptive, is still hand-designed and may not automatically adapt to organizational or workload-specific SLAs;
Real-world deployment demands incorporation of health signals and predictive failure diagnostics to preempt device instability.

Opportunities for future research include (1) integrating inverse reward or preference learning for automatic $C$ model adaptation; (2) extending the observation model to ingest predictive telemetry (thermal, power, reboot history); and (3) contrasting AIF performance directly against pure RL and control-theoretic agents under identical online adaptation constraints.

Conclusion

AIF-Router demonstrates that active inference provides a robust, training-free mechanism for autonomous, online adaptive routing across heterogeneous, unreliable edge AI deployments. The 34.7% latency reduction at the cost of increased failures highlights a core trade-off in such systems: aggressive latency minimization can expose the system to instability, especially under edge device unreliability. Nevertheless, the capacity to discover and exploit resource heterogeneity in a zero-shot manner without prior training or workload-specific tuning addresses a principal bottleneck in dynamic edge AI orchestration. Further developments integrating predictive state estimation, multi-objective preference adaptation, and direct health-awareness will likely extend the applicability of active inference-driven orchestration to broader, more unstable production environments.
