
Cost-Aware PoQ Framework

Updated 25 December 2025
  • The paper demonstrates a cost-aware optimization that minimizes intervention expenses while meeting a specified probability threshold for effective system recovery.
  • It integrates counterfactual reasoning with a surrogate Structural Causal Model, using pattern clustering and a structured VAE to address hidden noise.
  • Empirical results on synthetic and real-world datasets report higher identification accuracy and lower intervention cost than the compared baselines.

A Cost-Aware Proof-of-Quality (PoQ) Framework is a principled system for selecting optimal interventions or actions under uncertainty, with the explicit objective of minimizing cost while ensuring sufficiently high probability of success or quality. In causal decision-making under abnormal or anomalous conditions, this framework integrates counterfactual reasoning on a surrogate structural causal model (SCM), cost-constrained intervention optimization, and guarantees of identifiability from mixed observational data. The approach is distinguished by its ability to operate in continuous intervention spaces and to achieve identifiability for counterfactual queries even in the presence of hidden noise, via integration of abnormal pattern clustering and structured variational autoencoders (VAE).

1. Causal Problem Formulation with Cost Constraints

Let $X=(X_1,\ldots,X_d)\in\mathbb{R}^d$ be the vector of endogenous system variables and $Y\in\mathbb{R}$ a real-valued target (e.g., an anomaly score). The system is governed by a known causal DAG $\mathcal{G}$ over vertices $V=\{X_1,\dots,X_d,Y\}$, each with a structural equation and independent exogenous noise. A typical scenario is anomaly detection, where anomalous exogenous noise in component $i$ ($Z_i \sim \mathcal{N}(\mu'_i,\sigma_i'^2)$) perturbs the system, resulting in an observation $(x,y)$ with $y>t$ (anomalous regime).

The core goal is to find a minimal-cost action (intervention) $do(X=x^*)$ such that the system is restored, i.e., $Y \leq t$ post-intervention. Formally,

$$x^* = \arg\min_{x^*} C(x^*,x) ~~\text{such that}~~ \mathbb{P}\bigl(Y^{\text{cf}}(x;x^*)\leq t \mid X=x, Y > t\bigr) \geq \iota,$$

where $C(\cdot,\cdot)$ is a convex, user-specified cost and $\iota\in(0,1]$ is a desired probability of recovery (Cai et al., 13 May 2025).
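As a worked illustration of the constrained problem above, consider a one-dimensional toy case that is not taken from the paper: a hypothetical linear mechanism $Y = wX + Z$ with $w > 0$, a Gaussian posterior over the noise $Z$ given the anomalous observation, and a quadratic cost. Under those assumptions the probability constraint reduces to an upper bound on $x^*$, so the minimum-cost intervention has a closed form. All names and numbers below are illustrative, and the sketch requires $\iota < 1$ because the Gaussian quantile is unbounded at 1.

```python
from statistics import NormalDist


def min_cost_intervention(x_obs, w, z_mean, z_std, t, iota):
    """Closed-form solution of a 1-D toy problem (assumed model, not the paper's).

    Assumptions: Y = w * X + Z with w > 0, Z | evidence ~ N(z_mean, z_std^2),
    and cost C(x*, x) = (x* - x)^2.
    P(w * x* + Z <= t) >= iota  <=>  x* <= (t - z_mean - z_std * q) / w,
    where q is the standard normal quantile at level iota.
    """
    if not 0.0 < iota < 1.0:
        raise ValueError("this sketch needs 0 < iota < 1")
    if w <= 0:
        raise ValueError("this sketch assumes w > 0")
    q = NormalDist().inv_cdf(iota)
    bound = (t - z_mean - z_std * q) / w
    # If the observed value already satisfies the bound, no action is needed.
    x_star = min(x_obs, bound)
    cost = (x_star - x_obs) ** 2
    prob = NormalDist(z_mean, z_std).cdf(t - w * x_star)
    return x_star, cost, prob


# Illustrative numbers only.
x_star, cost, prob = min_cost_intervention(
    x_obs=2.0, w=1.5, z_mean=0.5, z_std=0.2, t=2.0, iota=0.95
)
print(f"x* = {x_star:.4f}, cost = {cost:.4f}, recovery probability = {prob:.4f}")
```

Because the cost grows with the distance from the observed value, the solution sits exactly on the constraint boundary whenever the observed state is infeasible.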

2. Surrogate Structural Causal Model via Abnormal Pattern Clustering

Direct observation of exogenous noise is infeasible; instead, a surrogate SCM is learned. First, anomalies are clustered using a Gaussian Mixture Model (GMM) on the augmented space $(x,y)$, yielding cluster labels $u\in\{1,\ldots,K\}$ encapsulating abnormal modes. The model then employs a VAE whose encoder/decoder architecture respects the causal ordering in the DAG:

  • Each node $V_j$ in topological order has posterior $q_\phi(z_j \mid v_j, v_{\mathrm{pa}_j},u)$ and prior $p_\theta(z_j \mid v_{\mathrm{pa}_j},u)$, with reconstruction $p_\theta(v_j \mid z_j, v_{\mathrm{pa}_j}, u)$.
  • The evidence lower bound (ELBO) for variational inference factorizes nodewise as

$$\log p_{\theta}(x\mid u) \geq \sum_{j=1}^{d} \Bigl\{ \mathbb{E}_{q_{\phi}}[\log p_{\theta}] - D_{\mathrm{KL}}(q_{\phi}\| p_{\theta}) \Bigr\},$$

enabling efficient and structure-respecting learning of the latent noise structure.
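The nodewise structure of the bound can be sketched with univariate Gaussians, for which the KL term has a closed form. This is a minimal illustration, not the paper's implementation: the expectation is replaced by a single plug-in evaluation of the reconstruction likelihood, and the per-node means and standard deviations (which a real model would obtain from encoder and decoder networks) are hypothetical numbers.

```python
import math


def gaussian_log_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (v - mean) ** 2 / (2 * std ** 2)


def gaussian_kl(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2 * s_p ** 2) - 0.5


def nodewise_elbo(nodes):
    """Sum of (reconstruction log-likelihood - KL) over nodes.

    Each node dict holds the observed value v, the decoder's reconstruction
    mean/std, the posterior q mean/std, and the prior p mean/std.
    """
    total = 0.0
    for n in nodes:
        recon = gaussian_log_pdf(n["v"], n["recon_mean"], n["recon_std"])
        kl = gaussian_kl(n["q_mean"], n["q_std"], n["p_mean"], n["p_std"])
        total += recon - kl
    return total


# Hypothetical per-node quantities for a two-node graph.
nodes = [
    {"v": 1.0, "recon_mean": 0.9, "recon_std": 0.5,
     "q_mean": 0.2, "q_std": 0.8, "p_mean": 0.0, "p_std": 1.0},
    {"v": 2.5, "recon_mean": 2.4, "recon_std": 0.5,
     "q_mean": -0.1, "q_std": 0.9, "p_mean": 0.0, "p_std": 1.0},
]
elbo = nodewise_elbo(nodes)
print(f"plug-in ELBO estimate: {elbo:.4f}")
```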

Pattern clustering labels $u$ serve as auxiliary supervision, supporting identifiability of latent variables and causal mechanisms in the presence of multiple, overlapping anomaly types.
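To make the role of the label $u$ concrete, the sketch below assigns an anomalous point to its most likely mixture component, given GMM parameters that are assumed to be already fitted. The diagonal covariances, the parameter values, and the zero-based labels are simplifying assumptions for illustration; fitting the mixture itself (e.g., by EM) is not shown.

```python
import math


def log_gauss_diag(point, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (p - m) ** 2 / v)
        for p, m, v in zip(point, mean, var)
    )


def assign_pattern(point, weights, means, variances):
    """Return (label, responsibilities) for one point in the (x, y) space."""
    log_scores = [
        math.log(w) + log_gauss_diag(point, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Normalize in log space for numerical stability.
    top = max(log_scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in log_scores))
    resp = [math.exp(s - log_norm) for s in log_scores]
    label = max(range(len(resp)), key=resp.__getitem__)
    return label, resp


# Hypothetical fitted parameters for K = 2 abnormal patterns in 2-D.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

u, resp = assign_pattern([4.8, 5.1], weights, means, variances)
print(u, [round(r, 4) for r in resp])
```

The resulting label would then be passed as the conditioning variable $u$ to the encoder, prior, and decoder.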

3. Identifiable Counterfactual Reasoning and Optimization

Counterfactual estimation proceeds in three steps:

  1. Abduction: inference of latent noise $\hat{z}$ from $(x,u)$,
  2. Intervention: replacing $x_{\mathcal{R}}$ by $x_{\mathcal{R}}^*$ in selected intervenable coordinates $\mathcal{R}$ (those capable of reaching $Y$),
  3. Prediction: forward propagation yields counterfactual $\hat{y}^* = f_{\theta}(x_{\mathrm{pa}_y},\hat z_y)$.
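The three steps can be sketched on a toy chain $X_1 \to X_2 \to Y$ with linear mechanisms and additive noise. This is an assumed stand-in for the learned model: here abduction is an exact inversion of the structural equations, whereas the surrogate SCM would infer the noise with the VAE encoder. The coefficients `a` and `b` and the observation are hypothetical.

```python
def abduct(obs, coef):
    """Step 1: recover additive noise terms from an observation."""
    z1 = obs["x1"]
    z2 = obs["x2"] - coef["a"] * obs["x1"]
    zy = obs["y"] - coef["b"] * obs["x2"]
    return {"z1": z1, "z2": z2, "zy": zy}


def counterfactual_y(obs, coef, intervention):
    """Steps 2 and 3: fix intervened nodes, then propagate forward."""
    z = abduct(obs, coef)
    # An intervened node ignores its own mechanism and noise.
    x1 = intervention.get("x1", z["z1"])
    x2 = intervention.get("x2", coef["a"] * x1 + z["z2"])
    return coef["b"] * x2 + z["zy"]


coef = {"a": 2.0, "b": 1.5}               # hypothetical mechanisms
obs = {"x1": 1.0, "x2": 2.5, "y": 5.0}    # anomalous if the threshold is t = 2.0

y_factual = counterfactual_y(obs, coef, {})
y_cf = counterfactual_y(obs, coef, {"x2": 0.5})
print(y_factual, y_cf)
```

With an empty intervention the function reproduces the observed $y$, which is a useful sanity check on the abduction step.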

The probability of necessity (PN) is defined as

$$\mathrm{PN}(x; x^*_{\mathcal{R}}) = \mathbb{P}\left( Y^{\mathrm{cf}}(x; x^*_{\mathcal{R}}) \leq t \mid X=x, Y>t \right),$$

quantifying the likelihood that the intervention transitions $Y$ to the safe regime.
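When the abducted noise is uncertain, PN can be approximated by Monte Carlo: draw noise samples from the posterior, push each through the predictor under the candidate intervention, and count how often the outcome falls at or below $t$. The sketch below assumes a hypothetical one-dimensional predictor and a Gaussian noise posterior; neither is specified by the source.

```python
import random


def estimate_pn(x_star, noise_samples, predict, t):
    """Fraction of posterior noise samples for which the counterfactual is safe."""
    if not noise_samples:
        raise ValueError("need at least one noise sample")
    hits = sum(1 for z in noise_samples if predict(x_star, z) <= t)
    return hits / len(noise_samples)


def predict(x_star, z):
    """Hypothetical outcome mechanism Y = 1.5 * x + z."""
    return 1.5 * x_star + z


rng = random.Random(0)
# Assumed posterior over the outcome noise given the anomalous observation.
samples = [rng.gauss(1.25, 0.2) for _ in range(5000)]

pn_near = estimate_pn(0.2, samples, predict, t=2.0)
pn_far = estimate_pn(2.5, samples, predict, t=2.0)
print(pn_near, pn_far)
```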

The cost-aware optimization admits either a constrained form or a penalized relaxation,

$$\min_{x^*_{\mathcal{R}}} C(x^*_{\mathcal{R}}, x) + \lambda\,[\iota - \mathrm{PN}(x; x^*_{\mathcal{R}})]_{+},$$

where $C(\cdot,\cdot)$ is convex (e.g., quadratic), $[\cdot]_{+}$ denotes the positive part, and a large penalty weight $\lambda \gg 0$ pushes the solution toward satisfying the constraint. Sequential Least Squares Programming (SLSQP), a sequential quadratic programming method with quasi-Newton Hessian updates, is employed to find local optima satisfying the KKT conditions, using gradients from automatic differentiation through the VAE decoder (Cai et al., 13 May 2025).
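The effect of the penalty weight can be seen on a one-dimensional toy problem. The sketch below deliberately swaps SLSQP for a plain grid search so that it runs with the standard library alone; the linear-Gaussian form of `pn`, its default parameters, and the search interval are all assumptions made for illustration.

```python
from statistics import NormalDist


def pn(x_star, w=1.5, z_mean=0.5, z_std=0.2, t=2.0):
    """Hypothetical PN for Y = w * x + Z with Z ~ N(z_mean, z_std^2)."""
    return NormalDist(z_mean, z_std).cdf(t - w * x_star)


def penalized_objective(x_star, x_obs, iota, lam):
    """Quadratic cost plus hinge penalty on the PN shortfall."""
    cost = (x_star - x_obs) ** 2
    shortfall = max(0.0, iota - pn(x_star))
    return cost + lam * shortfall


def grid_minimize(x_obs, iota, lam, lo, hi, steps=2001):
    """Exhaustive 1-D search (a stand-in for SLSQP, not SLSQP itself)."""
    best_x, best_val = lo, float("inf")
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        val = penalized_objective(x, x_obs, iota, lam)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


x_large, _ = grid_minimize(x_obs=2.0, iota=0.95, lam=100.0, lo=-1.0, hi=3.0)
x_small, _ = grid_minimize(x_obs=2.0, iota=0.95, lam=1.0, lo=-1.0, hi=3.0)
print(f"lambda=100: x*={x_large:.3f}, PN={pn(x_large):.3f}")
print(f"lambda=1:   x*={x_small:.3f}, PN={pn(x_small):.3f}")
```

In this toy setting the large penalty drives the minimizer to the constraint boundary, while the small penalty leaves it near the observed value with the constraint violated, which is why the relaxation is only a faithful surrogate when $\lambda$ is large relative to the cost gradient.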

4. Identifiability Guarantees and Theoretical Foundations

Surrogate counterfactuals are guaranteed to be identifiable under two complementary results:

  • Pattern-Clustering Identifiability: Satisfying weak separability ($2d$ dimensions across $d+1$ variables) and mixture-of-Gaussian conditions ensures that GMM clusters correspond to meaningful abnormal patterns, per results of [Tahmasebi et al., 2018].
  • Noise-Variable Identifiability: Structured VAE parameter identifiability follows under assumptions of injective mixing, smooth, linearly-independent sufficient statistics, and enough cluster-conditioning points, generalizing the results of [Khemakhem et al., 2020].

Together, these results ensure the approximated SCM recovers sufficient information about the true noise to provide closed-form, counterfactually valid predictions for $Y^{\text{cf}}$.

5. Practical Implementation and Empirical Results

Model and Training:

  • GMM is fit on anomalies for clustering.
  • VAE encoder/decoder: 3-layer MLP, hidden size 30–50, LeakyReLU, Gaussian noise models, Adam optimizer (learning rate $10^{-3}$), batch size 64, trained for 20 epochs.

Intervention Optimization:

  • Intervene on continuous action spaces $x_{\mathcal{R}}^*$ using SLSQP, with regularization ($\ell_2$ penalty on the $x_{\mathcal{R}}^*$ step size).

Benchmark Datasets:

  • Synthetic: random DAGs (chain, Erdős–Rényi), variable node count and edge sparsity, with injected anomalies.
  • Real-world: AIOps (5G metrics, curated DAG, partial labels), Lemma-RCA (IT incident logs), Air-Pollutants (PM2.5, PM10, SO2, NO2, Beijing APEC period).

Metrics and Results:

  • F1 score (identification accuracy), normalized cost (N-Cost), nDCG@k, and r-MSE (counterfactual accuracy).
  • MiCCD, the method evaluated in the paper, reports the best results on the listed benchmarks, e.g., AIOps F1 = 0.95 (vs. 0.88 for BIGEN), Air-Pollutants F1 = 1.0 at the lowest cost, and Lemma-RCA F1 = 0.94 (vs. 0.63 for the next-best method) (Cai et al., 13 May 2025).
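Two of the metrics named above have standard textbook definitions, sketched here for reference. The source does not state how the paper computes them in detail (for example, the relevance grades used for nDCG@k), so the set-based F1 and the log2-discounted nDCG below are assumptions; N-Cost and r-MSE are omitted because their definitions are not given.

```python
import math


def f1_score(predicted, actual):
    """Set-based F1 between predicted and true intervention targets."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked, relevance, k):
    """nDCG@k with the usual 1 / log2(rank + 1) discount (ranks start at 1)."""
    dcg = sum(
        relevance.get(item, 0.0) / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: the true target is X3, the method ranks X1 first.
f1 = f1_score(["X1", "X3"], ["X3"])
ndcg = ndcg_at_k(["X1", "X3"], {"X3": 1.0}, k=2)
print(round(f1, 4), round(ndcg, 4))
```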

6. Illustrative Case and Broader Applications

A representative example in a data-center power recovery scenario demonstrates the utility of cost-aware decision-making: rather than defaulting to root-cause repair at a high cost ($c_1 = 10$ for $X_1$), the framework identifies a less direct but far cheaper intervention (an $X_3$ boost at $c_3 = 0.1$) that suffices for system recovery, as quantified by the probability of necessity.
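The selection logic in this example amounts to picking the cheapest candidate whose probability of necessity clears the threshold. In the sketch below the costs come from the example above, while the PN values and the threshold $\iota = 0.95$ are assumed purely for illustration.

```python
def cheapest_feasible(candidates, iota):
    """Return the lowest-cost candidate with PN >= iota, or None if none qualifies."""
    feasible = [c for c in candidates if c["pn"] >= iota]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["cost"])


candidates = [
    {"target": "X1", "cost": 10.0, "pn": 0.99},  # root-cause repair; PN assumed
    {"target": "X3", "cost": 0.1, "pn": 0.96},   # downstream boost; PN assumed
]

choice = cheapest_feasible(candidates, iota=0.95)
print(choice["target"], choice["cost"])
```

If the assumed PN of the cheaper action fell below the threshold, the same rule would fall back to the root-cause repair.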

Applications extend to root-cause interventions under abnormal system operation, automated recovery planning, cost-sensitive anomaly detection, and interpretable counterfactual-based control in systems with complex, overlapping anomaly patterns.

7. Framework Generalization and Future Directions

The formal recipe encompasses:

  • Structural causal modeling with pattern clustering for abnormal data regimes,
  • Surrogate SCMs with identifiable counterfactual reasoning,
  • Convex cost-constrained optimization of interventions via differentiable programming,
  • Statistical identifiability and practical training pipelines,
  • Empirical validation against state-of-the-art root cause analysis and RL-based baselines.

Generalizations include adaptation to domains with richer causal structure, variable intervention costs, integration of active learning for intervention selection, and the extension to partially observed or dynamic anomaly regimes.


References:

  • "An Identifiable Cost-Aware Causal Decision-Making Framework Using Counterfactual Reasoning" (Cai et al., 13 May 2025)