Counterfactual Curiosity Reward for Robust Learning
- Counterfactual curiosity reward is a learning-theoretic construct that incentivizes models to maintain prediction stability under counterfactual interventions, thereby emphasizing causal signals over spurious correlations.
- It integrates with neural structural equation models, invariance penalties, and adaptive loss reweighting within a cooperative game framework to enhance performance on rare labels.
- Empirical evaluations demonstrate improved mean average precision and rare-label F1 scores, validating its role in robust multi-label classification under distribution shifts.
Counterfactual curiosity reward is a learning-theoretic construct introduced within the Causal Cooperative Game (CCG) framework for multi-label classification, designed to improve robustness and rare-label performance by incentivizing models to seek consistent predictions under counterfactual interventions. The central idea is to operationalize "curiosity" as the pursuit of prediction stability when exposed to counterfactual or non-causal changes in input features, thus guiding the model to prioritize truly causal signals over spurious correlations. This is achieved via explicitly formulated reward terms that penalize unstable or fragile outputs under such interventions. Counterfactual curiosity reward integrates with the other CCG mechanisms: explicit neural structural equation models (SEMs) for causal discovery, invariance penalties, and adaptive loss reweighting for rare labels, collectively improving generalization across label distributions and environments (Fan et al., 30 Nov 2025).
1. Conceptual Origins and Motivation
The counterfactual curiosity reward emerges from the intersection of cooperative game theory and causal inference. In the context of CCG for multi-label learning, the traditional focus on average predictive accuracy is replaced by a nuanced objective: maximizing rare-label accuracy, output diversity, and, critically, reliable performance under hypothetical (counterfactual) scenarios. The motivation is to mitigate vulnerabilities to label imbalance, spurious correlations, and distribution shifts—especially for under-represented (rare) labels—by explicitly operationalizing curiosity as the model’s capacity to sustain accurate prediction under controlled interventions targeting non-causal features. This design addresses a key failure mode of standard discriminative objectives, which can inadvertently encode spurious dependencies when trained on imbalanced or shifted data (Fan et al., 30 Nov 2025).
2. Formal Definition in the Causal Cooperative Game Framework
Within CCG, multi-label prediction is modeled as a cooperative game in which the set of labels is partitioned into disjoint causal subgraphs $\{\mathcal{G}_k\}_{k=1}^{K}$, each managed by a player $P_k$. Each player's action is to predict probabilities $\hat{y}_k$ for its labels based on the feature vector $x$. The global objective includes several cooperative payoffs: a causal-graph learning loss ($\mathcal{L}_{\text{graph}}$), per-player classification and invariance losses, and the per-player counterfactual curiosity reward $R^{\text{cf}}_k$:

$$R^{\text{cf}}_k(x) = -\,\mathrm{JS}\!\left(\hat{y}_k(x) \,\big\Vert\, \hat{y}_k(x^{\text{cf}})\right),$$

where $x^{\text{cf}}$ is a "counterfactual" input with interventions on non-causal features, and $\mathrm{JS}(\cdot \Vert \cdot)$ is the Jensen–Shannon divergence. The reward is maximized when the player's prediction is both stable to such interventions and accurate, encouraging robustness to distributional perturbations. The full curiosity reward is

$$R^{\text{cur}}_k = \alpha\, R^{\text{rare}}_k + \beta\, R^{\text{div}}_k + \gamma\, R^{\text{cf}}_k,$$

with terms for rare-label accuracy ($R^{\text{rare}}_k$), inter-player diversity ($R^{\text{div}}_k$, a KL-divergence term), and counterfactual consistency ($R^{\text{cf}}_k$). During training, $-\lambda \sum_k R^{\text{cur}}_k$ is added to the loss, so that minimizing the loss maximizes the reward.
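A minimal sketch of the per-player consistency term, treating each label's predicted probability as an independent Bernoulli distribution; the helper names `js_divergence` and `counterfactual_consistency` are illustrative, not from the paper:

```python
import torch

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence per label, treating each label as a Bernoulli."""
    # Stack (p, 1-p) so each label carries a full two-outcome distribution.
    p2 = torch.stack([p, 1.0 - p], dim=-1).clamp_min(eps)
    q2 = torch.stack([q, 1.0 - q], dim=-1).clamp_min(eps)
    m = 0.5 * (p2 + q2)
    kl_pm = (p2 * (p2 / m).log()).sum(dim=-1)
    kl_qm = (q2 * (q2 / m).log()).sum(dim=-1)
    return 0.5 * (kl_pm + kl_qm)  # shape: (batch, n_labels)

def counterfactual_consistency(y_hat: torch.Tensor, y_hat_cf: torch.Tensor) -> torch.Tensor:
    """R_cf = -JS(y_hat || y_hat_cf), averaged over the player's labels."""
    return -js_divergence(y_hat, y_hat_cf).mean()
```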
3. Methodology for Computing and Integrating the Reward
Counterfactual curiosity reward computation involves three steps:
- Counterfactual Generation: For each input $x$, generate a counterfactual view $x^{\text{cf}}$ via intervention on predetermined non-causal attributes (e.g., synonym replacement, light paraphrasing).
- Prediction Consistency Assessment: For a selected sub-label $\ell$, compute the Jensen–Shannon divergence $\mathrm{JS}\big(\hat{y}_\ell(x) \,\Vert\, \hat{y}_\ell(x^{\text{cf}})\big)$ between the predicted probabilities before and after intervention.
- Reward Aggregation and Loss Integration: Aggregate the stability reward across all labels in $\mathcal{G}_k$ (with up-weighting for rare labels), combine with diversity and accuracy terms, and include the result as a negative loss term.
This mechanism is embedded in a joint optimization procedure alongside the other objectives. The reward term pushes the model's predictions to be insensitive to perturbations of features outside the causal graph, aligning with the goal of learning invariant, generalizable representations (Fan et al., 30 Nov 2025).
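A hedged sketch of the aggregation step, reusing `js_divergence` from the sketch above; the inverse-frequency rare-label weighting is an assumption, since the paper only states that rare labels are up-weighted:

```python
import torch

def curiosity_loss(y_hat: torch.Tensor, y_hat_cf: torch.Tensor,
                   label_freq: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative stability reward, up-weighted on rare labels.

    y_hat, y_hat_cf: (batch, n_labels) probabilities for the original and
        counterfactual inputs; label_freq: (n_labels,) empirical frequencies.
    """
    per_label_js = js_divergence(y_hat, y_hat_cf)   # (batch, n_labels); defined earlier
    rare_weight = 1.0 / (label_freq + eps)          # assumed inverse-frequency scheme
    rare_weight = rare_weight / rare_weight.sum()   # normalize weights to sum to 1
    stability_reward = -(per_label_js * rare_weight).sum(dim=-1).mean()
    return -stability_reward  # added to the loss, so minimizing maximizes the reward
```

In the full objective this term would be combined with the diversity and accuracy components of $R^{\text{cur}}_k$ before entering the loss.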
4. Interplay with Rare Label Enhancement and Invariance Penalties
Counterfactual curiosity reward is most effective in synergy with the causal invariance loss and rare-label enhancement. The invariance loss, implemented as a combination of a feature-contrastive term ($L_2$ distance between representations under environment augmentations) and a prediction-consistency term (cross-entropy across augmented inputs), further restricts the model class to functions robust to non-causal variation. Simultaneously, rare-label enhancement amplifies gradients on rare-label causal edges in the SEM and dynamically reweights the cross-entropy loss by per-label frequency, with parallel upweighting in the curiosity reward's accuracy term. Empirical results confirm that omitting the curiosity reward, invariance penalty, or rare-label enhancement leads to significant declines in mean average precision (mAP) and rare-label F1, underscoring the necessity of each component (Fan et al., 30 Nov 2025).
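A minimal sketch of the invariance penalty as described, combining a feature-contrastive $L_2$ term with a prediction-consistency cross-entropy; whether the consistency target is the clean-view prediction or the ground-truth label is an assumption here, as are the weighting coefficients:

```python
import torch
import torch.nn.functional as F

def invariance_loss(feat: torch.Tensor, feat_aug: torch.Tensor,
                    y_hat: torch.Tensor, y_hat_aug: torch.Tensor,
                    w_feat: float = 1.0, w_pred: float = 1.0) -> torch.Tensor:
    """Feature-contrastive L2 term plus prediction-consistency cross-entropy."""
    # L2 distance between representations of the original and augmented views.
    feat_term = F.mse_loss(feat, feat_aug)
    # Soft-target cross-entropy ties the augmented-view prediction to the
    # clean-view prediction (assumed target; detached to stabilize training).
    pred_term = F.binary_cross_entropy(y_hat_aug, y_hat.detach())
    return w_feat * feat_term + w_pred * pred_term
```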
5. Empirical Impact and Theoretical Properties
Experimental evidence on standard multi-label benchmarks demonstrates that incorporating the counterfactual curiosity reward into the CCG yields superior rare-label and mAP scores relative to baselines including RoBERTa and GNN-based methods. Ablations validate that the counterfactual reward term is indispensable for generalizability, particularly under distribution and label-frequency shifts. Theoretically, this construct enforces a form of counterfactual invariance—output stability under hypothetical, non-causal interventions—representing a principled extension of invariance regularization to the cooperative multi-label prediction setting (Fan et al., 30 Nov 2025).
6. Key Equations and Training Workflow
The core equations related to the counterfactual curiosity reward are as follows:
- Counterfactual consistency for player $P_k$:

$$R^{\text{cf}}_k(x) = -\,\mathrm{JS}\!\left(\hat{y}_k(x) \,\big\Vert\, \hat{y}_k(x^{\text{cf}})\right)$$

- Aggregated curiosity reward:

$$R^{\text{cur}}_k = \alpha\, R^{\text{rare}}_k + \beta\, R^{\text{div}}_k + \gamma\, R^{\text{cf}}_k$$

- The total loss incorporates this as:

$$\mathcal{L} = \mathcal{L}_{\text{graph}} + \sum_k \left(\mathcal{L}^{\text{cls}}_k + \mathcal{L}^{\text{inv}}_k\right) - \lambda \sum_k R^{\text{cur}}_k$$
Training proceeds by initializing model and SEM parameters, updating the causal graph and label partitions, generating augmented and counterfactual inputs, evaluating loss and curiosity, and optimizing via standard first-order methods with early stopping on validation mAP or rare-F1 (Fan et al., 30 Nov 2025).
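A schematic training loop consistent with this workflow; `sem.update_graph_and_partitions`, `make_augmented`, `make_counterfactual`, and `evaluate_map` are placeholders rather than APIs from the paper, and `invariance_loss`/`curiosity_loss` refer to the sketches above:

```python
import torch
import torch.nn.functional as F

def train_ccg(model, sem, loader, val_loader, optimizer, label_freq,
              epochs: int = 50, lam: float = 1.0, patience: int = 5):
    """Joint optimization sketch: classification + invariance - curiosity reward."""
    best_score, stale = 0.0, 0
    for epoch in range(epochs):
        sem.update_graph_and_partitions()        # refresh causal graph / label partition
        model.train()
        for x, y in loader:
            x_aug = make_augmented(x)            # environment augmentation
            x_cf = make_counterfactual(x)        # intervene on non-causal features
            feat, y_hat = model(x)
            feat_aug, y_hat_aug = model(x_aug)
            _, y_hat_cf = model(x_cf)
            loss = (F.binary_cross_entropy(y_hat, y.float())             # classification
                    + invariance_loss(feat, feat_aug, y_hat, y_hat_aug)  # invariance penalty
                    + lam * curiosity_loss(y_hat, y_hat_cf, label_freq))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        score = evaluate_map(model, val_loader)  # validation mAP or rare-F1
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
            if stale >= patience:                # early stopping
                break
```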
7. Significance, Limitations, and Outlook
The introduction of counterfactual curiosity reward within the multi-label CCG framework directly addresses several long-standing challenges in robust classification, particularly in the presence of rare labels, spurious feature-label associations, and shifting input distributions. By formalizing curiosity as a stability-driven consistency reward under counterfactual interventions—and integrating it with explicit causal graph modeling and invariance penalties—this approach yields empirically validated gains in both prediction quality and interpretability. Future work may investigate adaptive or learned intervention strategies for counterfactual generation and explore the transferability of this reward structure to other structured prediction domains such as sequence labeling or hierarchical classification. A plausible implication is broader applicability to any machine learning setting where robust generalization under interventions is a primary desideratum (Fan et al., 30 Nov 2025).