Submodular Rewards in RL and Optimization
- Submodular rewards are functions exhibiting diminishing returns where the marginal gain decreases as the set of actions, states, or items grows.
- They enable modeling non-additive objectives in domains like reinforcement learning, planning, and multi-agent systems, improving exploration and coverage.
- Advanced optimization techniques such as continuous greedy and semi-gradient methods provide strong approximation guarantees and enhanced performance.
Submodular rewards arise when the value of a set—of actions, states, or visited elements—exhibits the diminishing returns property: the marginal increase from adding an element to a set decreases as the context set grows. In sequential decision-making, planning, reinforcement learning (RL), bandit optimization, multi-agent systems, and subset recommendation, submodular rewards define a rich class of objectives extending beyond additive (modular) rewards. Modeling rewards as monotone submodular functions captures diverse application domains such as exploration, experimental design, influence maximization, sensor placement, and coverage tasks.
1. Formal Definition and Properties
Let $V$ be a finite ground set, commonly realized as the set of possible state–action pairs, edges in a graph, or items in a recommendation domain. A reward function $F: 2^V \to \mathbb{R}$ is termed submodular if for all $A \subseteq B \subseteq V$ and $e \in V \setminus B$,
$$F(A \cup \{e\}) - F(A) \;\ge\; F(B \cup \{e\}) - F(B),$$
expressing diminishing marginal returns. Equivalently, for all $A, B \subseteq V$,
$$F(A) + F(B) \;\ge\; F(A \cup B) + F(A \cap B).$$
Frequently, rewards are assumed monotone: $F(A) \le F(B)$ whenever $A \subseteq B$. These properties yield powerful structural guarantees for optimization and are central to approximation algorithms in both offline combinatorial settings and sequential decision-making domains such as MDPs, RL, and multi-agent planning (Wang et al., 2020, Santi et al., 2024, Prajapat et al., 2023).
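As a concrete instance, coverage functions satisfy both properties. The sketch below (the ground set and coverage map are illustrative assumptions, not drawn from any cited work) brute-force-verifies diminishing returns on a small example:

```python
from itertools import combinations

# Illustrative monotone submodular reward: set coverage.
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def F(S):
    """Number of distinct targets covered by the selected elements."""
    return len(set().union(*(coverage[e] for e in S))) if S else 0

def marginal(e, S):
    return F(S | {e}) - F(S)

# Brute-force check of diminishing returns: for every element e and every
# nested pair A ⊆ B with e outside B, the gain w.r.t. A dominates the gain
# w.r.t. B.
V = set(coverage)
subsets = [set(c) for r in range(len(V) + 1) for c in combinations(sorted(V), r)]
for e in V:
    for A in subsets:
        for B in subsets:
            if e not in B and A <= B:
                assert marginal(e, A) >= marginal(e, B)
```

Monotonicity holds as well, since covering sets only grow as elements are added.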
2. Submodular Rewards in Sequential Decision Processes
In planning and RL, submodular rewards assign value not to individual states or actions, but to the set of state–action pairs, states, or visited indices along an agent's trajectory. The general model consists of:
- A finite-horizon MDP tuple $(\mathcal{S}, \mathcal{A}, P, H)$.
- Trajectories $\tau = (s_1, a_1, \dots, s_H, a_H)$ inducing a set $X(\tau) \subseteq V$ (e.g., state–action pairs or visited states).
- Cumulative reward $F(X(\tau))$, with $F$ monotone submodular.
The agent's objective is to maximize $\mathbb{E}_{\tau \sim \pi}[F(X(\tau))]$ over policies $\pi$, which is non-additive and generally history-dependent. This framework includes as special cases the classical additive reward ($F$ modular) and subset maximization under cardinality constraints ($|X(\tau)| \le k$) (Wang et al., 2020, Prajapat et al., 2023).
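A minimal sketch of this model, using a hypothetical 1-D chain MDP and a distinct-state coverage reward (all names below are illustrative assumptions), shows how the return depends on the visited set rather than on a per-step sum:

```python
import random

def F(visited):
    """Monotone submodular coverage reward: count of distinct states."""
    return len(set(visited))

def rollout(policy, start=0, horizon=6, n_states=5):
    """Roll out a policy in a clipped 1-D chain and return F(X(tau))."""
    s, visited = start, [start]
    for _ in range(horizon):
        a = policy(s)                         # action in {-1, +1}
        s = min(max(s + a, 0), n_states - 1)  # clipped chain dynamics
        visited.append(s)
    return F(visited)                         # non-additive return

random.seed(0)
stay = lambda s: -1                       # keeps bouncing against state 0
explore = lambda s: random.choice([-1, 1])
returns = (rollout(stay), rollout(explore))
```

Under an additive reward each revisit would pay again; here, repeated visits to the same state contribute nothing, so coverage-seeking behavior is rewarded.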
3. Optimization Methodologies and Algorithmic Frameworks
3.1 Multilinear Extension and Continuous Greedy
To address non-additive, non-Markovian submodular objectives, (Wang et al., 2020) employs the multilinear extension of $F$:
$$F_M(x) = \mathbb{E}[F(R_x)] = \sum_{S \subseteq V} F(S) \prod_{e \in S} x_e \prod_{e \notin S} (1 - x_e), \qquad x \in [0, 1]^V,$$
where $R_x$ is a random set including each $e \in V$ independently with probability $x_e$.
A continuous-greedy ascent constructs solutions in the space of marginals $x \in [0, 1]^V$, iteratively updating $x$ along the feasible direction $v$ maximizing $\langle v, \nabla F_M(x) \rangle$. Discretized updates rely on approximated gradients via sampling and repeated dynamic-programming MDP solves using the estimated partial derivatives $\partial F_M / \partial x_e$ as surrogate immediate rewards. The final randomized policy (or a pipage-rounded deterministic policy) yields strong approximation ratios (Wang et al., 2020).
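A toy sketch of this scheme, with sampled multilinear-extension gradients and a cardinality constraint standing in for the MDP solve (all names and parameters are illustrative assumptions, not the implementation of Wang et al.):

```python
import random

def sample_set(x, rng):
    """Draw R_x: include each element independently with probability x[e]."""
    return {e for e, p in x.items() if rng.random() < p}

def grad_estimate(F, x, n_samples, rng):
    """dF_M/dx_e = E[F(R ∪ {e}) - F(R - {e})], estimated by sampling R ~ x."""
    g = {e: 0.0 for e in x}
    for _ in range(n_samples):
        R = sample_set(x, rng)
        for e in x:
            g[e] += (F(R | {e}) - F(R - {e})) / n_samples
    return g

def continuous_greedy(F, V, k, T=50, n_samples=30, seed=0):
    """Move x a 1/T step toward the best feasible direction each round."""
    rng = random.Random(seed)
    x = {e: 0.0 for e in V}
    for _ in range(T):
        g = grad_estimate(F, x, n_samples, rng)
        direction = sorted(V, key=lambda e: -g[e])[:k]  # top-k solves the LP
        for e in direction:                             # over the polytope
            x[e] = min(1.0, x[e] + 1.0 / T)
    return x  # fractional marginals; round (e.g., pipage) to a set/policy

# Illustrative coverage instance (assumed, not from the cited work).
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}, "d": {1}}
F = lambda S: len(set().union(*(coverage[e] for e in S))) if S else 0
x = continuous_greedy(F, list(coverage), k=2)
```

In the RL setting the top-k step is replaced by a dynamic-programming solve of the surrogate MDP, but the ascent structure is the same.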
3.2 Semi-gradient Surrogate Approaches
Global RL methods (Santi et al., 2024) adopt an iterative submodular semi-gradient scheme. At each iteration, a tight modular (additive) lower bound (subgradient) of $F$ at the current set $X_t$ is constructed:
$$h_{X_t}(e) = F(\{\sigma(1), \dots, \sigma(i)\}) - F(\{\sigma(1), \dots, \sigma(i-1)\}), \qquad e = \sigma(i),$$
with $\sigma$ a permutation of $V$ starting with the elements of $X_t$. Maximizing this modular surrogate $h_{X_t}(A) = \sum_{e \in A} h_{X_t}(e)$ via standard RL or DP yields new policies. Ascent in $F$ is monotonic, and curvature-dependent approximation ratios are available.
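The permutation-based modular lower bound can be sketched as follows (a generic construction under the assumption $F(\emptyset)$ is well-defined; this is not the exact procedure of Santi et al.):

```python
def modular_lower_bound(F, V, X_t):
    """Tight modular lower bound of submodular F at X_t: returns weights h
    with sum_{e in A} h[e] <= F(A) for all A, and equality at A = X_t."""
    # Any permutation of V whose prefix enumerates X_t works.
    order = [e for e in V if e in X_t] + [e for e in V if e not in X_t]
    h, prefix, prev = {}, set(), F(set())
    for e in order:
        prefix.add(e)
        val = F(prefix)
        h[e] = val - prev   # marginal gain of e along the permutation chain
        prev = val
    return h

# Illustrative coverage instance (assumed).
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}
F = lambda S: len(set().union(*(coverage[e] for e in S))) if S else 0
h = modular_lower_bound(F, list(coverage), {"a"})
```

The weights of elements inside $X_t$ telescope to $F(X_t)$ exactly, which is what makes the surrogate tight at the current set.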
3.3 Greedy, Augmented-Greedy, and Bandit Algorithms
In stochastic and bandit submodular optimization, extensions of greedy, double-greedy, or "randomized greedy learning" approaches (Fourati et al., 2023, Zhou et al., 2024, Tajdini et al., 2023) are employed, frequently achieving $1/2$ or $1-1/e$ approximation ratios (the latter under monotonicity). Regret bounds in the stochastic bandit and combinatorial bandit paradigms explicitly leverage submodularity to reduce exploration complexity.
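For reference, the classical greedy rule that these bandit variants extend can be sketched as follows (illustrative coverage instance assumed):

```python
def greedy(F, V, k):
    """Classical greedy for monotone submodular maximization under a
    cardinality constraint; achieves a (1 - 1/e) approximation."""
    S = set()
    for _ in range(k):
        gains = {e: F(S | {e}) - F(S) for e in V if e not in S}
        if not gains:
            break
        e_star = max(gains, key=gains.get)  # element of largest marginal gain
        if gains[e_star] <= 0:
            break                           # no positive gain remains
        S.add(e_star)
    return S

# Illustrative coverage instance (assumed).
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}
F = lambda S: len(set().union(*(coverage[e] for e in S))) if S else 0
S = greedy(F, list(coverage), k=2)
```

Bandit versions replace the exact marginal gains with noisy estimates, and the regret analyses quantify how much extra exploration that substitution costs.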
3.4 Policy Gradient and Pruned Submodularity Graphs
Policy gradient algorithms adapted to submodular reward domains exploit marginal gains in trajectory elements. For example, in SubPO (Prajapat et al., 2023), the policy is optimized via stochastic/natural gradients of . Pruned submodularity graphs (as in (Anand et al., 18 Jul 2025)) sparsify the search over state–action pairs by identifying high-divergence (high marginal-gain) states, improving training stability and sample efficiency.
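The marginal-gain decomposition that such policy-gradient methods exploit can be sketched as follows (a generic telescoping construction with an illustrative trajectory and reward, not the specific SubPO implementation):

```python
def marginal_gain_rewards(F, trajectory):
    """Per-step rewards r_t = F(X_t ∪ {s_t}) - F(X_t); they telescope,
    so their sum equals F of the full visited set."""
    X, rewards = set(), []
    for s in trajectory:
        rewards.append(F(X | {s}) - F(X))
        X.add(s)
    return rewards

F = lambda S: len(S)   # simplest monotone submodular reward: distinct count
r = marginal_gain_rewards(F, [0, 1, 1, 2, 0])
```

Because the per-step rewards sum exactly to $F(X(\tau))$, standard policy-gradient estimators can be applied to the non-additive objective without modification.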
4. Theoretical Guarantees and Curvature-Dependent Bounds
Performance guarantees for submodular reward maximization are often parametrized by the submodular curvature $c \in [0, 1]$. For a monotone submodular function $F$,
$$c = 1 - \min_{e \in V} \frac{F(V) - F(V \setminus \{e\})}{F(\{e\})},$$
where $c = 0$ indicates a modular function. Approximation ratios degrade gracefully with increasing curvature:
- Greedy or continuous-greedy achieves $\frac{1}{c}(1 - e^{-c})$ under cardinality constraints or $\frac{1}{p + c}$ under $p$-system constraints (Jawaid et al., 2012, Wang et al., 2020).
- For global RL, a curvature-dependent fraction of the optimal value is already guaranteed after a single semi-gradient step (Santi et al., 2024).
- In the general setting, robust regret bounds interpolate between the limited-horizon and long-horizon regimes, with submodularity enabling significant exploration-complexity reduction compared to naïve arms-as-subsets approaches (Tajdini et al., 2023, Zhou et al., 2024).
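The total curvature described above can be computed directly for small ground sets (the example functions below are illustrative assumptions):

```python
def curvature(F, V):
    """Total curvature c = 1 - min_e [F(V) - F(V - {e})] / F({e});
    c = 0 for modular F, c = 1 when some element is fully redundant."""
    V = set(V)
    return 1 - min((F(V) - F(V - {e})) / F({e}) for e in V if F({e}) > 0)

# Illustrative examples (assumed): a modular count vs. redundant coverage.
modular = lambda S: len(S)                 # every element always adds 1
overlap = {"a": {1}, "b": {1}}             # "a" and "b" cover the same target
F_cov = lambda S: len(set().union(*(overlap[e] for e in S))) if S else 0
```

Low-curvature functions behave nearly additively, which is why the guarantees above tighten as $c \to 0$.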
5. Applications and Empirical Evaluation
Submodular rewards underpin a wide spectrum of sequential decision domains:
- Exploration and coverage: Maximizing state-visit diversity (Santi et al., 2024, Prajapat et al., 2023, Anand et al., 18 Jul 2025) and area coverage (robot exploration, vision coverage).
- Experimental design/informative path planning: D-optimality, mutual information objectives (Wang et al., 2020, Santi et al., 2024, Prajapat et al., 2023).
- Multi-agent coordination: Decentralized task assignment under capacity and exclusivity via submodular partition matroids (Liu et al., 2022).
- Bandit subset selection and combinatorial optimization: Influence maximization, recommendation, contextual and full-bandit feedback (Fourati et al., 2023, Foster et al., 2021, Tajdini et al., 2023, Zhou et al., 2024).
- Adaptive and nonmonotone regimes: Revenue maximization with costs, stochastic influence/truncation constraints (Tang et al., 2021, Tang et al., 2021).
Empirical results across RL and planning benchmarks, graph bandits, recommendation, and multi-agent VRP demonstrate that submodular-aware methods consistently outperform greedy modular baselines, directly exploiting diminishing returns to enhance sample efficiency and policy performance (Wang et al., 2020, Santi et al., 2024, Prajapat et al., 2023, Anand et al., 18 Jul 2025, Tang et al., 2021, Liu et al., 2022, Mehrotra et al., 2023).
6. Limitations, Hardness, and Open Problems
Despite their favorable structure, most submodular reward maximization problems remain NP-hard, and strong inapproximability results apply even in deterministic tabular domains (Prajapat et al., 2023, Anand et al., 18 Jul 2025, Santi et al., 2024). Stochastic and bandit settings amplify hardness due to noisy feedback and partial information. No polynomial-time algorithm can, in general, achieve a constant-factor approximation better than $\frac{1}{c}(1 - e^{-c})$ for curvature $c$ (unless $\mathrm{P} = \mathrm{NP}$). In the adaptive regularization and partition-constrained settings, nonmonotonicity and negative values require specialized algorithmic reductions and surrogate objective formulations for effective optimization (Tang et al., 2021, Tang et al., 2021).
A plausible implication is that further algorithmic progress will depend on leveraging additional domain structure, exploiting low-curvature functions, or developing scalable sampling and surrogate estimation techniques.
7. Connections and Broader Impact
Submodular reward modeling bridges the gap between additive-reward RL and combinatorial submodular maximization, supporting complex objectives involving coverage, diversity, information, risk, and interaction effects. This modeling paradigm is foundational in modern RL, multi-agent cooperation, adaptive experiment design, and recommendation under fairness/bias constraints (Mehrotra et al., 2023). Future research directions include improved practical algorithms for non-monotone/non-submodular objectives, tighter curvature-dependent analysis, and domain-adaptive neural surrogates for large-scale, high-dimensional deployments.