Data-Driven Action Selection Approach

Updated 4 July 2026

Data-driven action selection is a framework that uses observed data, like cumulative rewards and transition histories, to dynamically determine the optimal action under task-specific criteria.
The approach integrates methodologies such as ranking, pruning, and candidate generation, demonstrated by up to 80% action space reduction and significant performance gains in RL and robotics.
Practical implementations span reinforcement learning, safety-critical control, and planning via imitation, employing techniques like lower-confidence bounds and knockoff sampling to convert data into actionable insights.

A data-driven action selection approach can be understood as a family of methods that uses observed data—such as cumulative reward, transition histories, behavior likelihoods, confidence bounds, weak labels, or historical assessments—to decide which action, action subset, action dimension, or generated action should be used next. Across recent work, the selected object ranges from training action spaces in reinforcement learning, to leaf-node actions in graph-guided robotics, to multimodal candidate actions in offline RL, to minimal sufficient action coordinates in deep RL, and even to reusable generated actions in expanding action libraries (Ghosh et al., 2022, Hoffmeister et al., 2024, Wang et al., 24 Mar 2026, Zhang et al., 5 Jul 2025, Xu et al., 30 Sep 2025).

1. Scope and conceptual variants

The literature uses the phrase in a broader sense than a single algorithmic template. In some works, the problem is to choose among existing actions under uncertainty; in others, it is to rank or prune an action space before policy learning; in still others, it is to decide whether a new action should be created and then reused. Several papers also extend “action selection” beyond control in the narrow sense, for example to frame selection in weakly supervised video recognition, to project funding decisions, or to site recommendation (Nguyen et al., 2024, Liu et al., 2018, Baumbach et al., 2016).

This suggests that the unifying property is not the presence of a particular optimizer, but the use of empirical evidence to determine which candidate intervention is most useful under a task-specific criterion. Those criteria differ substantially: cumulative reward in RL, success under blocking conditions in robotics, lower-confidence value in offline RL, conditional independence for action dimensions, or estimated downstream prediction loss in supervised subset selection.

Decision object	Representative mechanism	Example papers
Training action subsets	Categorize actions as dispensable or indispensable and rank by cumulative reward	(Ghosh et al., 2022)
Action suggestions	Treat suggested actions as observations for belief updating in a POMDP	(Asmar et al., 2022)
Sequential robot actions	Choose leaf-node actions that resolve current blocking conditions	(Hoffmeister et al., 2024)
Offline RL candidates	Rerank sampled actions by conservative value and behavior-normalized support	(Wang et al., 24 Mar 2026)
Action dimensions	Estimate a minimal sufficient action set and mask redundant coordinates	(Zhang et al., 5 Jul 2025)
Generated actions	Reuse an existing action or pay to create a new one	(Xu et al., 30 Sep 2025)

2. Core mathematical formulations

A common pattern is that action selection is defined relative to an explicit predictive or inferential object. In training-action-space evaluation, the underlying RL problem is written as the tuple $(S, A, P_a, R_a)$, and the value of an action $a_i$ is tied to its marginal contribution to cumulative reward when added to a coalition $S \subseteq A$ (here $S$ denotes a subset of actions rather than the state set), namely $\sum R_a(S \cup \{i\}) - \sum R_a(S)$. The same work operationalizes action importance in two stages: an indispensability test based on goal failure after removal, and a ranking of dispensable actions by cumulative reward (Ghosh et al., 2022).
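The two-stage idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Ghosh et al.: `reward_of` and `reaches_goal` are hypothetical callables standing in for "train an agent on this action subset and measure it", the toy reward table is invented, and ranking dispensable actions by their marginal contribution to the indispensable core is an assumption about how the ranking could be done.

```python
def marginal_contribution(reward_of, coalition, action):
    """Reward gained by adding `action` to `coalition` (a set of actions)."""
    base = frozenset(coalition)
    return reward_of(base | {action}) - reward_of(base)


def categorize_and_rank(actions, reward_of, reaches_goal):
    """Stage 1: an action is indispensable if the goal fails once it is removed.
    Stage 2: rank the remaining (dispensable) actions by the reward they add
    to the indispensable core (an assumed ranking criterion)."""
    full = frozenset(actions)
    indispensable = {a for a in actions if not reaches_goal(full - {a})}
    core = frozenset(indispensable)
    dispensable = [a for a in actions if a not in indispensable]
    ranked = sorted(
        dispensable,
        key=lambda a: marginal_contribution(reward_of, core, a),
        reverse=True,
    )
    return indispensable, ranked


# Hypothetical stand-ins for "train on this subset and evaluate".
WEIGHTS = {"a": 5.0, "b": -1.0, "c": 3.0, "d": 0.5}


def toy_reward(subset):
    return sum(WEIGHTS[x] for x in subset)


def toy_goal(subset):
    return "a" in subset and "c" in subset


indispensable, ranked = categorize_and_rank(list(WEIGHTS), toy_reward, toy_goal)
```

In practice each call to `reward_of` is a full training run, which is why limiting the number of evaluated subsets matters.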

In collaborative partially observable decision making, the action suggestion itself becomes an observation channel. The belief update is modified to

$b_t(s_t) \propto p(o_t^s \mid s_t)\, p(o_t \mid s_t, a_{t-1}) \sum_{s_{t-1}} p(s_t \mid s_{t-1}, a_{t-1})\, b_{t-1}(s_{t-1}),$

so the selected action is not the suggestion itself but the autonomous policy output under the updated belief. This is a distinct data-driven pattern: suggestions alter state estimation, and action choice changes indirectly through that inference step (Asmar et al., 2022).
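For a finite state space, the modified belief update above reduces to a few array operations. The sketch below assumes tabular models; the array names (`T_a`, `obs_lik`, `sugg_lik`) are illustrative and do not come from the paper.

```python
import numpy as np


def belief_update(b_prev, T_a, obs_lik, sugg_lik):
    """One step of the suggestion-aware belief update for a discrete POMDP.

    b_prev   : (S,)   previous belief b_{t-1}
    T_a      : (S, S) T_a[i, j] = p(s_t = j | s_{t-1} = i, a_{t-1})
    obs_lik  : (S,)   p(o_t | s_t, a_{t-1}) for the received observation
    sugg_lik : (S,)   p(o_t^s | s_t) for the received suggestion
    """
    predicted = b_prev @ T_a                  # sum over s_{t-1}
    unnormalized = sugg_lik * obs_lik * predicted
    total = unnormalized.sum()
    if total <= 0.0:
        raise ValueError("evidence has zero probability under the model")
    return unnormalized / total


b = belief_update(
    b_prev=np.array([0.5, 0.5]),
    T_a=np.eye(2),
    obs_lik=np.array([0.5, 0.5]),   # uninformative environment observation
    sugg_lik=np.array([0.9, 0.1]),  # suggestion is far more likely in state 0
)
```

With an uninformative environment observation, the suggestion alone shifts the belief toward state 0; the agent then acts on that belief rather than copying the suggestion.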

Offline RL work makes the action-selection interface explicit at inference time. GEM defines a candidate set, computes a conservative ensemble lower-confidence bound

$\operatorname{LCB}_\lambda(s,a) = \bar Q(s,a)-\lambda\,\operatorname{Std}(Q_i(s,a)),$

then adds a within-state standardized support term derived from an independent behavior model:

$\operatorname{Score}(s,a) = \operatorname{LCB}_\lambda(s,a) + w_p\, z_{\text{score},s}\!\big(\log \mu_\phi(a\mid s)\big).$

The final deployed action is the top-ranked candidate under this score, not the raw actor mean (Wang et al., 24 Mar 2026).
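The scoring rule can be written directly from the two equations. This is a sketch under stated assumptions rather than the GEM implementation: the ensemble values and behavior log-likelihoods are passed in as plain arrays, the population standard deviation is used, and the default values of `lam` and `w_p` are arbitrary.

```python
import numpy as np


def rerank(q_ensemble, log_mu, lam=1.0, w_p=0.5, eps=1e-8):
    """Pick the best candidate for one state.

    q_ensemble : (E, N) critic values Q_i(s, a_j) for N candidate actions
    log_mu     : (N,)   behavior-model log-likelihoods log mu_phi(a_j | s)
    """
    q_ensemble = np.asarray(q_ensemble, dtype=float)
    log_mu = np.asarray(log_mu, dtype=float)
    # Conservative value: ensemble mean minus lam times ensemble spread.
    lcb = q_ensemble.mean(axis=0) - lam * q_ensemble.std(axis=0)
    # Standardize the support term across this state's candidates only.
    z = (log_mu - log_mu.mean()) / (log_mu.std() + eps)
    score = lcb + w_p * z
    return int(np.argmax(score)), score


q = [[1.0, 2.0, 1.5],
     [1.0, 0.0, 1.5]]
best_flat, _ = rerank(q, log_mu=[-1.0, -1.0, -1.0])
best_supported, _ = rerank(q, log_mu=[0.0, -1.0, -5.0])
```

With equal support, the candidate with the highest lower-confidence bound wins; when the behavior model strongly disfavors it, a better-supported candidate can overtake it.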

In deep RL variable selection, the formal target is a minimal sufficient action set $G$, defined by

$R_t \perp \mathbf{A}_{t,G^c} \mid \mathbf{S}_t,\mathbf{A}_{t,G}, \qquad \mathbf{S}_{t+1} \perp \mathbf{A}_{t,G^c} \mid \mathbf{S}_t,\mathbf{A}_{t,G}.$

This moves action selection from “which action vector to execute” to “which coordinates of the action vector are genuinely necessary.” The resulting mask is then integrated into the actor and, when relevant, the critic (Zhang et al., 5 Jul 2025).

Finally, online learning with generative action sets introduces a two-level rule. The best reusable action is chosen by an LCB,

$f_t := \arg\min_{f\in S_t} \check d_t(x_t,f),$

and generation is triggered by a UCB-based Bernoulli decision.

Here the selected object is not only an action from the current library, but also the decision to expand that library itself (Xu et al., 30 Sep 2025).

3. Reinforcement-learning uses: ranking, pruning, and selecting action spaces

One major line of work treats action selection as a problem of determining which actions should even be available to an RL agent during training. A Shapley-inspired methodology evaluates action subsets by training the RL agent on those subsets, measuring cumulative reward, and identifying a cut-off cardinality by Monte Carlo simulation. In the cloud resource-tuning case study, the full power set of five EC2 resource pairs has $2^5 = 32$ subsets, and the cut-off cardinality leaves only a small fraction of them to analyze; the reported search-space reduction is rounded to 80% in the abstract. The resulting categorization marks small t3a and medium t3a as dispensable, and large t3a, xlarge t3a, and 2xlarge t3a as indispensable; the best-performing subset is $\{$medium, large, xlarge, 2xlarge$\}$, whose reward is better than that of the full set (Ghosh et al., 2022).
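The effect of a cardinality cut-off on the size of the search space is easy to check by enumeration. In the sketch below, the cut-off value of 4 and the choice to keep subsets at or above the cut-off are assumptions made purely for illustration; they are not figures taken from the source.

```python
from itertools import combinations


def subsets_at_or_above(actions, cutoff):
    """All subsets of `actions` whose size is at least `cutoff`."""
    return [
        frozenset(combo)
        for k in range(cutoff, len(actions) + 1)
        for combo in combinations(actions, k)
    ]


actions = ["small", "medium", "large", "xlarge", "2xlarge"]
total = 2 ** len(actions)                            # size of the full power set
kept = subsets_at_or_above(actions, cutoff=4)        # hypothetical cut-off
reduction = 1 - len(kept) / total                    # fraction of subsets skipped
```

Under these assumed settings, 6 of the 32 subsets remain, a reduction of 81.25%. Each skipped subset corresponds to one RL training run that does not have to be performed.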

A related but distinct RL perspective appears in online model selection with bandit feedback. There, the meta-learner does not choose a primitive action directly; instead it selects which base learner to trust at each round, and then executes that learner’s recommended policy. The selected index is determined by a balancing potential that is updated from empirical cumulative rewards and confidence corrections. This yields model selection guarantees in terms of realized regret coefficients rather than candidate worst-case bounds, and experiments show that data-driven regret balancing can outperform Corral, EXP3, greedy selection, and regret-balancing grids in multi-armed bandits, linear bandits, and contextual linear bandits (Pacchiano et al., 2023).

Deep RL work on action-coordinate selection addresses another version of the same problem: many action variables may be redundant even when the nominal action space is fixed. The proposed method uses knockoff sampling, where a knockoff copy of the action is drawn from the same policy that produced the original action, to compare original action dimensions against null copies. In semi-synthetic MuJoCo tasks with redundant coordinates added to Ant, HalfCheetah, and Hopper, the method reports a consistently high true positive rate in essentially all tabulated settings, near-zero false discovery rate, and substantial return improvements over training on all action dimensions. For example, in Ant with PPO, the reported reward for knockoff sampling exceeds that for using all actions (Zhang et al., 5 Jul 2025).

These results directly contradict a common simplification that more available actions are always beneficial. Across these RL formulations, a larger or denser action space can produce worse cumulative reward, worse exploration, or poorer sample efficiency than a carefully selected subset. That conclusion is explicit in the training-action-space study and strongly supported by the action-dimension masking results (Ghosh et al., 2022, Zhang et al., 5 Jul 2025).

4. Sequential robotics, active perception, and planning

In robotics, data-driven action selection often appears as an online, context-sensitive choice among discrete candidates whose value depends on current execution state. A prominent example is sequential discrete action selection via blocking conditions and resolutions. The system represents goals as literals, tracks blocked actions and their blocking conditions in a state-transition graph, and defines the current candidate set as all leaf nodes. When an action is blocked, the graph is expanded by adding its blocking condition and the corresponding resolution actions; when an action succeeds, it and its siblings are removed. The graph state is converted into an eight-component prompt for a zero-shot LLM, including previous actions, completed sub-goals, candidate actions, remaining goal predicates, and prior errors. In AI2Thor experiments with 50 trials per task and a 100-action cap, the method achieves 50/50 on Coffee, 39/50 on Apple, 39/50 on Mug, and 49/50 on Toast, outperforming several LLM and classical planning baselines in overall success rate (Hoffmeister et al., 2024).
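The graph bookkeeping described above can be sketched with a small parent-pointer structure. This is a simplified illustration, not the authors' system: the class and method names are hypothetical, blocking conditions are stored as labels rather than as separate nodes, and the LLM prompt construction is omitted.

```python
class BlockingGraph:
    """Tracks blocked actions; the current candidates are the leaf nodes."""

    def __init__(self, goal_actions):
        self.parent = {a: None for a in goal_actions}  # node -> blocked parent
        self.blocked_by = {}                           # action -> condition label

    def candidates(self):
        """Leaf nodes: actions that no other node is trying to unblock."""
        parents = set(self.parent.values())
        return sorted(n for n in self.parent if n not in parents)

    def block(self, action, condition, resolutions):
        """Record why `action` failed and add the actions that could resolve it."""
        self.blocked_by[action] = condition
        for r in resolutions:
            self.parent[r] = action

    def succeed(self, action):
        """Remove a successful action together with its sibling resolutions."""
        p = self.parent[action]
        if p is None:
            doomed = [action]
        else:
            doomed = [n for n, q in self.parent.items() if q == p]
            self.blocked_by.pop(p, None)
        for n in doomed:
            self._remove_subtree(n)

    def _remove_subtree(self, node):
        for child in [n for n, q in self.parent.items() if q == node]:
            self._remove_subtree(child)
        self.parent.pop(node, None)
        self.blocked_by.pop(node, None)


g = BlockingGraph(["pick_mug"])
g.block("pick_mug", "cabinet_closed", ["open_cabinet", "ask_human"])
after_block = g.candidates()       # the two resolution actions
g.succeed("open_cabinet")
after_success = g.candidates()     # the original action is a leaf again
```

Restricting the choice to leaf nodes is what keeps the candidate list short enough to present to a zero-shot model at each step.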

Active tactile localization offers a different template. There, the action is a touch ray, generated by sampling along the faces of a bounding box around the current pose estimate. For each candidate touch, the system simulates a future measurement by ray-mesh intersection, runs a one-step posterior update under the translation-invariant Quaternion filter, and scores the action by a divergence between hypothetical posterior and current belief. The paper compares KL divergence, Rényi divergence, Fisher information metric, Bhattacharyya distance, and squared Wasserstein distance, and reports similar pose-accuracy performance for sparse measurements across all selected criteria. The action-selection mechanism is therefore belief-driven and model-based, but still data-driven in the sense that each choice is conditioned on the current posterior inferred from visual and tactile observations (Murali et al., 2021).
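The simulate-update-score loop can be illustrated on a discrete belief. The sketch below substitutes a finite set of pose hypotheses and a plain Bayes update for the translation-invariant Quaternion filter, and uses KL divergence as the criterion; the function names and the likelihood table are assumptions for illustration.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with a small floor for stability."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def select_touch(belief, simulated_likelihoods):
    """Score each candidate touch by how far it would move the belief.

    belief                : (H,)   current belief over pose hypotheses
    simulated_likelihoods : (A, H) likelihood of the simulated measurement
                            for candidate a under hypothesis h
    """
    belief = np.asarray(belief, dtype=float)
    scores = []
    for lik in np.asarray(simulated_likelihoods, dtype=float):
        posterior = belief * lik
        posterior = posterior / posterior.sum()   # one-step hypothetical update
        scores.append(kl_divergence(posterior, belief))
    return int(np.argmax(scores)), scores


best, scores = select_touch(
    belief=[0.5, 0.5],
    simulated_likelihoods=[[0.5, 0.5],    # touch 0 cannot tell the poses apart
                           [0.9, 0.1]],   # touch 1 discriminates strongly
)
```

Swapping `kl_divergence` for another divergence changes only the scoring line.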

A broader planning formulation appears in data-driven planning via imitation learning. The key move is to cast planning as a POMDP over hidden world maps and learn a policy that maps the current search or sensing history to the next planning action. Training uses a clairvoyant oracle that has access to the hidden world during training and can compute optimal or near-optimal action values. The paper proves that offline imitation of the clairvoyant oracle is equivalent to online imitation of a hallucinating oracle that averages oracle values over the posterior induced by the learner’s history. In informative path planning, the learned policy outperforms the best heuristic on 8 of 10 datasets and yields more reward; in search-based motion planning, the paper reports a speedup over A* (Choudhury et al., 2017).

Taken together, these works show that “data-driven” in robotics does not imply a monolithic learned policy. It can instead mean graph-guided candidate generation, posterior-driven information gain, or imitation of a training-time oracle under partial observability. This suggests that the crucial design choice is often the action-selection interface rather than the specific learning backbone.

5. Feasibility, safety, and generated action sets

Another large cluster of methods treats action selection as a constrained construction problem: the system must first determine which actions are feasible or certifiable before deciding which one is desirable. In “action mapping,” the goal is to learn a feasibility policy that generates all feasible actions for a given state. The target distribution is the uniform density over the feasible set, written here with $\mathcal{F}(s)$ denoting the feasible actions in state $s$,

$p^{*}(a \mid s) = \frac{\mathbb{1}[a \in \mathcal{F}(s)]}{\int \mathbb{1}[a' \in \mathcal{F}(s)]\,da'},$

and the learned generator-induced distribution is trained to match it by minimizing an $f$-divergence. The paper develops KDE-based estimators and importance-resampled gradients for JS, forward KL, and reverse KL, and shows in 2D examples, spline path planning, and robotic grasping that JS and forward KL cover disconnected feasible sets better than reverse KL. This reframes action selection as a two-stage process: first learn to generate feasible actions, then optimize the task objective over that feasible latent space (Theile et al., 2023).

In safety-critical control, the selected action is the control input closest to a reference action while satisfying a learned certificate constraint. For an uncertain control-affine system $\dot{x} = f(x) + g(x)u$, the paper models mismatch in the certificate derivative with a Gaussian process using an affine dot product kernel, then solves a chance-constrained SOCP. The distinctive action-selection contribution is an online data-selection algorithm that chooses only the most informative training samples for the current certificate direction. This reduces runtime on a real cart-pole swing-up task and on simulated RABBIT locomotion, while preserving the practical feasibility of the certifying filter (Choi et al., 2023).
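The "closest input to a reference, subject to a certificate constraint" idea has a closed form in the simplest case. The sketch below is a deliberately reduced stand-in: it assumes a single deterministic linear constraint `a @ u + b >= 0` and ignores the Gaussian-process uncertainty, so it is a Euclidean projection rather than the chance-constrained SOCP used in the paper. The names `a`, `b`, and `u_ref` are illustrative.

```python
import numpy as np


def safety_filter(u_ref, a, b):
    """Return the input closest to u_ref (Euclidean norm) with a @ u + b >= 0."""
    u_ref = np.asarray(u_ref, dtype=float)
    a = np.asarray(a, dtype=float)
    margin = float(a @ u_ref + b)
    if margin >= 0.0:
        return u_ref                      # reference already satisfies the constraint
    norm_sq = float(a @ a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any input")
    # Project onto the boundary of the half-space {u : a @ u + b >= 0}.
    return u_ref - (margin / norm_sq) * a


u_unsafe = safety_filter(u_ref=[0.0, 2.0], a=[1.0, 0.0], b=-1.0)  # needs u[0] >= 1
u_safe = safety_filter(u_ref=[2.0, 2.0], a=[1.0, 0.0], b=-1.0)
```

The filter changes the reference only as much as the constraint requires, and leaves it untouched when it is already admissible.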

Expanding-action-space learning makes the selection problem even more explicit. The agent observes a context $x_t$, selects an existing action key $f_t$ by minimizing a lower confidence bound $\check d_t(x_t, f)$, and then decides whether to generate a new action by paying a creation cost. The paper’s doubly optimistic rule uses an LCB for reuse and a UCB for creation, with the generated action becoming permanently available for future rounds. The abstract states that this yields the first sublinear regret bound for online learning with expanding action spaces and reports favorable generation-quality tradeoffs on healthcare question-answering datasets (Xu et al., 30 Sep 2025).

A closely related offline-RL formulation appears in GEM. Rather than extract a single unimodal actor output, GEM trains a Gaussian-mixture actor by critic-guided, advantage-weighted EM-style updates, learns an independent Gaussian-mixture behavior model, and then performs candidate-based inference. The number of sampled candidates is an inference-time budget knob, and reranking uses a conservative ensemble lower-confidence bound together with behavior-normalized support. The paper reports competitiveness across D4RL, with suite-average improvements over IQL on Locomotion, AntMaze, and Maze2D (Wang et al., 24 Mar 2026).

These papers support a recurring conclusion: data-driven action selection is often inseparable from feasibility estimation, uncertainty control, or support quantification. A plausible implication is that, in high-dimensional or safety-critical settings, the main difficulty is less “which action has the highest nominal value” than “which high-value action is also feasible, supported, or certifiable.”

6. Weak supervision, collaborative evidence, and broader decision systems

Data-driven action selection also appears in settings where the selected object is not a control action in the classical sense. In collaborative decision making with action suggestions, a suggested action is treated as an observation of the hidden state rather than as a command to be obeyed. The paper proposes two approximate suggestion models: a Scaled Rational model using a trust-like parameter, and a Noisy Rational model using a Boltzmann distribution over action values with a rationality coefficient. Across Tag and RockSample, the proposed Scaled and Noisy agents achieve near-naive reward with substantially fewer suggestions, and degrade much more gracefully than naive suggestion-following when suggestions become random (Asmar et al., 2022).
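A Boltzmann-style suggestion model can be sketched as a softmax over per-state action values. The sketch assumes the distribution is taken over a tabular value function `Q[s, a]`, which is an assumption about the garbled source notation; the function and parameter names are illustrative.

```python
import numpy as np


def noisy_rational_likelihood(Q, suggested_action, rationality):
    """p(suggestion = a | s) for every state s under a Boltzmann suggester.

    Q                : (S, A) assumed state-action values
    suggested_action : index of the action that was suggested
    rationality      : >= 0; 0 means random suggestions, large means near-optimal
    """
    Q = np.asarray(Q, dtype=float)
    logits = rationality * Q
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, suggested_action]


Q = [[1.0, 0.0],    # in state 0, action 0 is better
     [0.0, 1.0]]    # in state 1, action 1 is better
random_suggester = noisy_rational_likelihood(Q, suggested_action=0, rationality=0.0)
sharp_suggester = noisy_rational_likelihood(Q, suggested_action=0, rationality=10.0)
```

The returned vector plays the role of the suggestion likelihood $p(o_t^s \mid s_t)$ in the belief update: a random suggester carries no information about the state, while a near-rational one concentrates the belief on states where the suggested action is good.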

Weakly supervised multi-view action recognition uses the term in yet another way. MultiASL defines frame-level action selection by fusing multi-view spatial and temporal features, computing per-frame class probabilities and actionness, and then selecting the top-$k$ scoring frames for each class. Pseudo-positive actionness labels are formed by the union of top-$k$ frames over video-level positive classes, and the actionness head is trained with Generalized Cross Entropy. On MM-Office, removing the Actionness loss lowers both reported metrics relative to the full model (Nguyen et al., 2024).

Outside machine learning control, historical data can drive evidential action selection in administrative or strategic decisions. In R&D project selection, historical expert grades are converted into belief distributions over final outcomes using the likelihood of each grade under each outcome. These pieces of evidence are then combined with weights and reliabilities using an evidential reasoning rule. In the NSFC case study, an “Excellent” grade on comprehensive evaluation is mapped to belief degrees for the funded and unfunded outcomes, and the final ranking uses the combined belief assigned to “Funded” (Liu et al., 2018).

Site recommendation exhibits the same logic at a different scale. Candidate sites are treated as alternatives in a constraint satisfaction problem, where company requirements are expressed through a User Requirements Profile composed of Decision Criteria and Qualitative Ratings. In the supermarket study over Germany, the method reports the overlap between existing supermarket locations and recommended sites, and recommends additional municipalities where a store should be opened (Baumbach et al., 2016).

These examples make clear that data-driven action selection need not mean policy optimization alone. It can equally mean using data to identify action-relevant frames, update a hidden-state belief from suggestions, or rank interventions under historical evidence and source reliability.

7. Recurrent themes, misconceptions, and open issues

Several misconceptions recur across this literature. One is that data-driven action selection always means end-to-end policy learning from a static dataset. The surveyed papers show otherwise: some are wrappers around black-box RL algorithms (Ghosh et al., 2022), some are online graph-guided decision procedures with zero-shot LLMs (Hoffmeister et al., 2024), some are belief-driven filters for active perception (Murali et al., 2021), and some are supervised subset-selection methods tied to a ridge-regression estimator (Sasaki et al., 2024). Another misconception is that suggestions, heuristic actions, or actor outputs should be followed directly. Collaborative POMDP work instead uses suggestions as evidence; GEM reranks actor samples rather than trusting the actor alone; and planning-via-imitation learns to rank search actions using oracle-derived values rather than fixed heuristics (Asmar et al., 2022, Wang et al., 24 Mar 2026, Choudhury et al., 2017).

The main limitations are equally consistent. Many methods rely on strong structural assumptions: finite or discretized action sets and threshold conditions in RL action-space valuation (Ghosh et al., 2022); manually specified blocking conditions and resolution links in graph-guided robotics (Hoffmeister et al., 2024); availability of a feasibility oracle or critic in feasible-action generation (Theile et al., 2023); valid certificate functions and calibrated GP confidence bounds in certifying filters (Choi et al., 2023); and stationarity and exponential mixing conditions in knockoff-based deep RL selection (Zhang et al., 5 Jul 2025). Offline RL support-aware methods face their own tradeoff: stronger behavior support reduces off-distribution risk, but can also become overly conservative when the highest-value actions are rare under the behavior distribution (Wang et al., 24 Mar 2026). In broad decision systems, historical outcome mappings and reviewer reliabilities can inherit the biases of prior institutional processes, which the evidential reasoning paper treats structurally but does not formalize with a separate fairness theorem (Liu et al., 2018).

Future directions are also explicit in the source material. The blocking-condition framework suggests integration with VLMs for detecting previously unknown blocking modes at runtime (Hoffmeister et al., 2024). The feasible-action generator leaves the objective-learning stage of action mapping to future work and identifies adaptive bandwidth selection and higher-dimensional density estimation as open problems (Theile et al., 2023). The deep-RL knockoff method notes that action selection is currently applied only once during training and suggests repeated adaptive selection stages (Zhang et al., 5 Jul 2025). Collaborative suggestion modeling states that the same ideas should extend to continuous settings and online solvers, though the experiments remain discrete (Asmar et al., 2022).

Taken together, these works indicate that a data-driven action selection approach is best viewed as an interface problem: how empirical evidence is converted into a ranked, filtered, or generated action set. The most successful formulations make that interface explicit—through Shapley-style valuation, posterior updates, graph-structured candidates, feasibility generators, support-aware reranking, or confidence-bound creation rules—and thereby separate the question of “what data imply is worth trying” from the broader question of policy representation.