
Performance-Feedback Autoscaler (PFA)

Updated 10 March 2026
  • Performance-Feedback Autoscaler (PFA) is a closed-loop resource management system that directly adjusts allocations using real-time performance data.
  • It employs techniques such as token-based demand prediction and budget-aware scaling to reduce job slowdown and improve resource utilization.
  • By integrating proactive forecasting with reactive performance estimation, PFA realizations such as FLAS improve SLA compliance in distributed services, and workflow-oriented variants have been deployed with engines such as Apache Airflow.

A Performance-Feedback Autoscaler (PFA) is an online control strategy for resource allocation in cloud-based computing systems that adjusts provisioning decisions directly based on observed system performance metrics, such as application throughput or response time, rather than relying on precomputed plans or static resource demand estimates. Unlike plan-based autoscalers, which require accurate task runtime or workload forecasts, PFAs use closed-loop feedback control to adapt resource allocation in response to workload variability and system dynamics, frequently under constraints such as fixed budgets or service-level agreements (SLAs) (Ilyushkin et al., 2019, Rampérez et al., 23 Oct 2025).

1. Design Principles and Problem Formulation

PFA design is grounded in closed-loop feedback, in which scaling decisions directly reflect recent system performance. In the context of cloud-based workflow workloads, time is discretized into autoscaling intervals, and the key control variables are the number of resources of each type allocated to each user or application. The formulation incorporates both per-user budget constraints and system-wide resource limits: $\sum_{i} q_i\, n_{i,j}(t) \leq b_j$, where $q_i$ is the unit cost of resource type $i$, $n_{i,j}(t)$ is the number of resources of type $i$ allocated to user $j$ at time $t$, and $b_j$ is the budget of user $j$ (Ilyushkin et al., 2019).
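The budget constraint above can be checked mechanically. The following is a minimal sketch, not the paper's implementation; the names `within_budget`, `unit_cost`, and `allocation` are illustrative.

```python
# Check the per-user budget constraint  sum_i q_i * n_ij(t) <= b_j
# for a proposed allocation. Names here are illustrative, not from the paper.

def within_budget(unit_cost, allocation, budget):
    """unit_cost[i]: price q_i of resource type i per interval;
    allocation[i]: instances n_ij of type i given to this user;
    budget: the user's per-interval budget b_j."""
    spend = sum(unit_cost[i] * n for i, n in allocation.items())
    return spend <= budget

unit_cost = {"small": 1.0, "large": 4.0}
print(within_budget(unit_cost, {"small": 3, "large": 2}, budget=12.0))  # True:  3 + 8  = 11
print(within_budget(unit_cost, {"small": 3, "large": 3}, budget=12.0))  # False: 3 + 12 = 15
```

A real autoscaler would evaluate this predicate before every provisioning action (step 6 of the control loop below is exactly such a check).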

The main control objective in typical PFA deployments is to minimize average job slowdown: $S_{\mathrm{avg}} = \frac{1}{|\mathcal{W}|} \sum_{w \in \mathcal{W}} T^{\mathrm{resp}}_w / T^{\mathrm{ideal}}_w$, where $T^{\mathrm{resp}}_w$ is the actual response time of workflow $w$ and $T^{\mathrm{ideal}}_w$ is its ideal makespan.
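As a worked example of the objective, the metric is a simple ratio average (the workflow timings below are made up):

```python
# Average-slowdown objective S_avg = (1/|W|) * sum_w T_resp(w) / T_ideal(w).

def avg_slowdown(workflows):
    """workflows: list of (response_time, ideal_makespan) pairs, in seconds."""
    return sum(resp / ideal for resp, ideal in workflows) / len(workflows)

jobs = [(120.0, 60.0), (45.0, 45.0), (300.0, 100.0)]  # slowdowns 2.0, 1.0, 3.0
print(avg_slowdown(jobs))  # 2.0
```

A value of 1.0 means every workflow finished in its ideal makespan; the autoscaler tries to drive $S_{\mathrm{avg}}$ toward 1.0 within budget.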

2. PFAs in Distributed Service and Workflow Contexts

Two contemporary realizations of PFA are notable:

  • In dynamic, stateful distributed services, the Forecasted Load Auto-Scaling (FLAS) architecture combines proactive SLA trend forecasting with reactive performance estimation to trigger scaling events responsive to both forecasted and immediate demand cues. The PFA model constructs input/output relationships between observed low-level metrics (e.g., CPU, memory, network I/O) and high-level metrics (e.g., response time, throughput), using time-series forecasters and linear regression estimators for closed-loop scaling decisions (Rampérez et al., 23 Oct 2025).
  • For cloud-based workloads of workflows (e.g., DAGs submitted to workflow engines like Apache Airflow), PFA estimates realized average throughput per resource type and user. It applies throughput smoothing, token-based demand prediction over workflow DAGs, and budget-aware profile scaling to adapt resource supply proportionally to observed demand and budget constraints, without relying on explicit task runtime estimations (Ilyushkin et al., 2019).
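The token-based demand prediction mentioned in the second bullet can be illustrated on a toy DAG. Ilyushkin et al. (2019) describe the general idea; the concrete propagation routine and look-ahead rule below are illustrative, not the paper's exact algorithm.

```python
# Toy token-based demand estimation over a workflow DAG: tokens advance
# through the task graph, and predicted demand is the widest front of
# simultaneously enabled tasks within the look-ahead horizon.

def enabled(deps, done):
    """Tasks whose predecessors are all complete and that are not yet done."""
    return {t for t, pre in deps.items() if t not in done and set(pre) <= done}

def predict_demand(deps, done, lookahead):
    """Max parallelism seen while optimistically completing all enabled
    tasks for `lookahead` propagation steps."""
    done, demand = set(done), 0
    for _ in range(lookahead):
        ready = enabled(deps, done)
        if not ready:
            break
        demand = max(demand, len(ready))
        done |= ready  # move tokens: assume the whole front completes
    return demand

# Diamond DAG: a -> (b, c) -> d
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(predict_demand(deps, done=set(), lookahead=2))  # 2 (b and c can run in parallel)
```

The adaptive look-ahead depth in the paper would set `lookahead` from historical throughput rather than fixing it by hand.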

3. Core Algorithm and Control Loop

A canonical PFA control loop comprises the following steps, typically executed at each control interval:

  1. Performance Measurement: Measure completion throughput for each resource type and user. Smooth instantaneous throughput using a moving average (MA) or exponentially weighted moving average (EWMA).
  2. Profile-Based Supply Calculation: Distribute per-user budget among resource types proportionally based on recent throughput, computing an initial supply profile.
  3. Demand Estimation: Predict near-term demand by analyzing parallelism (level of concurrently enabled tasks), often via token propagation in the workflow's task graph, with adaptive look-ahead depth set by historical throughput.
  4. Resource Reconciliation: Scale supply profile up or down to match demand prediction. Handle over-provisioning by proportional down-scaling and under-provisioning by inflating instances within budget and type constraints.
  5. Provisioning Actions: Invoke resource allocation or deallocation via cloud provider APIs, accounting for current allocations and billing granularity.
  6. Budget Compliance: Ensure all allocation actions strictly obey per-user budgets and system-wide capacity constraints.
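The six steps above can be condensed into a single control-loop sketch. All helper names are illustrative; a real deployment would invoke cloud-provider APIs in step 5, and the EWMA weight is the value reported for 1-minute intervals.

```python
# Condensed sketch of one PFA control interval (steps 1-6 above).

ALPHA = 0.7  # EWMA smoothing weight (the paper's 1-min setting)

def pfa_step(state, observed_tput, budget, unit_cost, predicted_demand):
    # 1. Measurement: smooth per-type throughput with an EWMA.
    for r, x in observed_tput.items():
        prev = state.setdefault("tput", {}).get(r, x)
        state["tput"][r] = ALPHA * x + (1 - ALPHA) * prev
    # 2. Supply: split the budget across types proportionally to throughput.
    total = sum(state["tput"].values()) or 1.0
    supply = {r: int(budget * t / total // unit_cost[r])
              for r, t in state["tput"].items()}
    # 3-4. Reconcile with demand: scale the profile down if it overshoots.
    while sum(supply.values()) > predicted_demand and any(supply.values()):
        r = max(supply, key=supply.get)
        supply[r] -= 1
    # 5-6. Provisioning and clean-up would act on `supply` here,
    # re-checking the budget constraint before each API call.
    assert sum(unit_cost[r] * n for r, n in supply.items()) <= budget
    return supply

print(pfa_step({}, {"small": 4.0, "large": 2.0},
               budget=10.0, unit_cost={"small": 1.0, "large": 2.0},
               predicted_demand=5))  # {'small': 4, 'large': 1}
```

The under-provisioned case (inflating the profile toward unmet demand, within budget) is omitted for brevity but follows the mirror-image logic of the down-scaling loop.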

The following table summarizes the PFA loop as implemented for cloud workflows (Ilyushkin et al., 2019):

Step | Description | Central Formula or Action
1. Measurement | Collect task-completion rates, smooth ratios | $\hat\tau_{i,j}(t)$, $\rho_{i,j}(t) = \text{MA/EWMA}$
2. Supply | Budget/profile allocation across resource types | $\hat\mu_{i,j}(t) = \lfloor b_j \nu_{i,j}(t) / q_i \rfloor$
3. Demand Est. | Token-based DAG demand analysis | $\sigma_j(t) = \lceil \text{token movements} \rceil$
4. Scale | Down/up profile to match predicted demand | Scaling/inflation logic within budget constraint
5. Provision | Call cloud API to allocate/deallocate | Asynchronous (batched) resource management
6. Clean-up | Check budget/capacity, deallocate idle | Immediate deallocation at billing boundary

4. Proactive and Reactive Coordination in FLAS

In FLAS, the autoscaler associates proactive control with future-trend forecasting of SLA metrics and reactive control with instantaneous estimation of service performance from resource metrics. Key components include:

  • Scaling-Time Forecaster: Predicts duration for scaling actions based on notification/subscription rates.
  • Workload-Trend Forecaster: Applies smoothing (e.g., Savitzky–Golay) and time-series modeling (ARIMA, harmonic regression) to forecast future increments in SLA metrics, e.g., $\Delta RT(t)$.
  • Performance Forecaster: Multivariate regression estimator for response time and throughput as functions of CPU, memory, etc., obviating the need for intrusive metric collection at runtime.
  • Decision Logic: Adaptive policy that triggers scale-out/-in based on either persistent forecasted metric trends (majority of forecasted slopes above threshold) or a threshold crossing by performance estimator. A cool-down interval, based on actual scaling time, inhibits premature re-scaling.

Pseudocode from (Rampérez et al., 23 Oct 2025) details the precise sequence and compound logic for combining proactive and reactive triggers.
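A hedged sketch of how such compound trigger logic might look follows; the thresholds, window sizes, and function names are illustrative, not the paper's exact values.

```python
# FLAS-style trigger coordination: scale out when a majority of forecasted
# SLA-metric slopes exceed a threshold (proactive) OR the estimated response
# time crosses its limit (reactive), unless a cool-down is active.

def should_scale_out(forecast_slopes, estimated_rt, *, slope_thr=0.0,
                     rt_limit=200.0, cooling=False):
    if cooling:                                     # cool-down inhibits re-scaling
        return False
    rising = sum(s > slope_thr for s in forecast_slopes)
    proactive = rising > len(forecast_slopes) / 2   # majority of slopes rising
    reactive = estimated_rt > rt_limit              # threshold crossing
    return proactive or reactive

print(should_scale_out([0.5, 0.2, -0.1], estimated_rt=150.0))   # True  (proactive)
print(should_scale_out([-0.5, -0.2, 0.1], estimated_rt=250.0))  # True  (reactive)
print(should_scale_out([-0.5, -0.2, 0.1], estimated_rt=150.0))  # False
```

In FLAS the cool-down duration itself comes from the Scaling-Time Forecaster rather than a fixed constant.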

5. Empirical Evaluation and Quantitative Performance

Experimental evaluations demonstrate PFAs' ability to optimize workload performance under constrained resources and variable demand. Key findings:

  • In content-based publish/subscribe systems, FLAS maintained SLA compliance above 99%, kept over- and under-provisioning to 8–10% and 2–3% of total runtime respectively, and demonstrated average scale-out/in times of 3 s/2 s. Compared to pure proactive or reactive baselines, the coordinated FLAS approach achieved fewer SLA violations and lower over-provisioning (Rampérez et al., 23 Oct 2025).
  • In cloud workflow workloads managed by Apache Airflow, the PFA autoscaler reduced mean job slowdown by up to 47% versus state-of-the-art plan-based and scaling-first autoscalers, and exhibited up to 76% lower average runtime per scaling invocation. Elasticity metrics indicated PFAs achieved more accurate supply-demand matching and higher overall resource utilization (Ilyushkin et al., 2019).

6. Implementation Techniques and Practical Considerations

PFA implementations leverage system-specific features:

  • For cloud workflows, integration with Airflow's scheduler invokes PFA as a minute-granularity background autoscaler process; resource abstractions are maintained per worker, and cloud provisioning APIs (e.g., boto3) are used for resource management (Ilyushkin et al., 2019).
  • Asynchronous API batching, resource-type abstraction, mapping workflow DAG structure to parallelism predictors, and budget-aware supply calculation are integral.
  • Sensitivity analysis reveals trade-offs in smoothing-parameter choice ($m$, $\alpha$) and the impact of billing granularity on reactivity and overhead; suitable settings (e.g., $m = 10$ or $\alpha = 0.7$ for 1-min intervals) were determined empirically.
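The two smoothers referenced by these parameters behave as follows; the noisy throughput series here is synthetic.

```python
# m-sample moving average vs. EWMA with weight alpha, using the settings
# mentioned in the text (m = 10, alpha = 0.7) as defaults.

from collections import deque

def moving_average(xs, m=10):
    window, out = deque(maxlen=m), []
    for x in xs:
        window.append(x)
        out.append(sum(window) / len(window))
    return out

def ewma(xs, alpha=0.7):
    out, s = [], None
    for x in xs:
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out

series = [10, 12, 8, 11, 30, 9, 10]  # one throughput spike at index 4
print(ewma(series)[-1])              # spike mostly forgotten two steps later
```

A larger $m$ or smaller $\alpha$ damps spikes more aggressively but delays the autoscaler's reaction to genuine load shifts, which is the trade-off the sensitivity analysis quantifies.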

Potential extensions include auto-tuning of smoothing parameters, richer resource dimension modeling (I/O, memory), incorporation of fairness via throughput-weighted SLO budgets, and application to serverless/containerized executors at sub-minute granularity (Ilyushkin et al., 2019).

7. Significance and Comparative Analysis

PFAs offer several advantages relative to conventional (plan-based) autoscaling:

  • Require no a priori workload or task runtime estimation, thus naturally accommodating unpredictable, non-stationary, or bursty workloads.
  • Achieve compliance with strict budget or SLA constraints via direct feedback with low overhead.
  • Outperform both planning-first and reactive-only baselines in average job slowdown, elasticity, and resource utilization metrics.
  • Exhibit low compute complexity, $O(R + |\text{pending tasks}|)$, enhancing scalability and practical deployability (Ilyushkin et al., 2019, Rampérez et al., 23 Oct 2025).

A plausible implication is that PFAs, by aligning resource management to observed system behavior rather than brittle predictions, present a robust autoscaling paradigm adaptable to emerging workload patterns and programmable cloud environments.
