GraphAllocBench Benchmark Suite

Updated 4 February 2026
  • GraphAllocBench is a benchmarking suite that operationalizes Preference-Conditioned Policy Learning (PCPL) in complex multi-objective reinforcement learning environments using graph-structured resource allocation.
  • It features the CityPlannerEnv, a flexible bipartite graph environment that simulates city-scale resource allocation with diverse and non-convex objective landscapes.
  • The suite integrates MLP and graph-aware (HGNN) architectures along with novel evaluation metrics (HV, PNDS, OS) to diagnose scalability, generalization, and preference alignment.

GraphAllocBench is a configurable benchmarking suite for Preference-Conditioned Policy Learning (PCPL) in multi-objective reinforcement learning (MORL). The platform is built around a graph-structured resource allocation environment (CityPlannerEnv) that enables the evaluation of MORL algorithms across high-dimensional, diverse, and complex decision-making tasks with dynamic preference conditioning. The design and evaluation protocol is specifically tailored to expose the scalability, generalization, and preference-alignment properties of policy learning approaches in settings where objectives are non-trivially interdependent and combinatorial in nature (Jiang et al., 28 Jan 2026).

1. Formalization of Preference-Conditioned Policy Learning in GraphAllocBench

GraphAllocBench operationalizes PCPL in the context of MORL by considering agents that learn a single policy $\pi_\theta(a \mid s, w)$, where $w \in \Delta^{N-1}$ denotes the user-specified preference vector over $N$ objectives, satisfying the simplex constraints $\sum_{i=1}^N w_i = 1$, $w_i \geq 0$. The agent interacts with an environment characterized by a vector-valued reward function $r: S \times A \to \mathbb{R}^N$, with state space $S$ and (discrete) action space $A$.

Given a trajectory $\tau = (s_0, a_0, \ldots, s_{T-1}, a_{T-1})$, the cumulative objective vector is $r(\tau) = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$, with $0 \leq \gamma < 1$ the discount factor. The objective is to maximize the expected scalarized return:

$$J(\theta) = \mathbb{E}_{w \sim p(w)} \, \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid \cdot, w)} \big[ S_\epsilon(r(\tau), w) \big],$$

where $S_\epsilon$ is a smooth scalarization (e.g., Smooth Tchebycheff) parametrized by $w$. At inference, the policy can adapt instantly to any preference weights $w$, recovering or approximating the associated Pareto-optimal solution (Jiang et al., 28 Jan 2026).
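A minimal sketch of such a smoothed scalarization, assuming the standard log-sum-exp relaxation of the weighted Tchebycheff term (the function name and the `z_ideal` ideal-point argument are illustrative, not taken from the suite):

```python
import math

def smooth_tchebycheff(r, w, z_ideal, eps=0.1):
    """Smooth Tchebycheff scalarization S_eps(r, w).

    Replaces the max in the weighted Tchebycheff distance
    max_i w_i * (z_i* - r_i) with a log-sum-exp soft maximum,
    then negates it so that larger scalarized values are better.
    `z_ideal` is an estimate of the ideal point z*.
    """
    terms = [w_i * (z_i - r_i) / eps for w_i, z_i, r_i in zip(w, z_ideal, r)]
    m = max(terms)  # shift for numerical stability
    soft_max = eps * (m + math.log(sum(math.exp(t - m) for t in terms)))
    return -soft_max  # scalarized return to maximize

# As eps -> 0 this approaches -max_i w_i (z_i* - r_i) = -0.2 here.
print(smooth_tchebycheff([0.6, 0.9], [0.5, 0.5], [1.0, 1.0], eps=1e-4))
```

The log-sum-exp form keeps the objective differentiable in the returns, which is the reason a smooth scalarization is preferred over the raw Tchebycheff max inside policy-gradient training.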

2. CityPlannerEnv: Flexible Graph-Based Resource Allocation Environment

The core of GraphAllocBench is CityPlannerEnv, a highly parameterizable bipartite graph environment designed to emulate city-scale resource allocation problems. Nodes are partitioned into resources $R$ ($|R| = m$) and demands $D$ ($|D| = d$). The allocation matrix $A \in \mathbb{R}^{m \times d}$ encodes current resource-to-demand assignments, normalized by supply. The observation at each step is $o = (\mathrm{vec}(A), w)$.

The action space consists of $2d + 1$ discrete choices, combining a modification mode $t \in \{-1, 0, +1\}$ (decrement, no-op, increment) with a demand index $j \in \{1, \ldots, d\}$. Environment objectives $J_i(P)$ are defined as real-valued functions over the vector of demand productions $P$, with support for arbitrary combinations of polynomial, logarithmic, sinusoidal, and thresholding functions. This enables the construction of convex, non-convex, discontinuous, and sparse multi-objective landscapes, producing environments with varying Pareto front geometries and optimization challenges (Jiang et al., 28 Jan 2026).
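One plausible flat encoding of this $(2d + 1)$-way action space is a single no-op plus an increment/decrement pair per demand; the `decode_action` helper below and its 0-based index convention are assumptions for illustration, and the suite's actual encoding may differ:

```python
def decode_action(a, d):
    """Decode a flat discrete action a in {0, ..., 2d} into
    (mode, demand_index), where mode is -1 (decrement),
    0 (no-op), or +1 (increment). Demand indices are 0-based
    here, an implementation choice for this sketch.
    """
    if a == 0:
        return 0, None            # no-op touches no demand
    j, parity = divmod(a - 1, 2)  # two actions per demand node
    return (+1 if parity == 0 else -1), j

d = 4  # number of demand nodes in this toy example
# Every flat action maps to a distinct (mode, demand) pair.
assert len({decode_action(a, d) for a in range(2 * d + 1)}) == 2 * d + 1
print(decode_action(0, d))  # (0, None)
print(decode_action(3, d))  # (1, 1): increment demand 1
```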

3. Policy Architectures and Training Procedures

GraphAllocBench benchmarks policy architectures along two axes:

  • MLP-based architectures: The observation vector (flattened $A$ concatenated with $w$) feeds into two hidden layers (128 units, SiLU activations), used for both actor and critic heads in PPO updates.
  • Graph-aware (HGNN) architectures: Nodes representing resources, demands, and an unallocated-resource node are embedded and processed by two layers of heterogeneous graph attention, each node type with its own attention head and residual connection. The preference vector $w$ is concatenated at the node-embedding stage and again at the global pooling/output layer, using either mean+max pooling (best for Pareto coverage) or preference-conditioned multi-head attention pooling (slightly higher hypervolume, lower monotonicity).

Training employs PPO, sampling $w \sim \mathrm{Dirichlet}(\mathbf{1}_N)$ for each rollout and normalizing objectives using a moving estimate of the ideal point for robust reward scalarization. The architectures are evaluated for generalization to unseen preferences and scalability on large graphs (e.g., 100 demands/resources) (Jiang et al., 28 Jan 2026).
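The preference-sampling and ideal-point bookkeeping can be sketched in a few lines. `sample_preference` and `IdealPointTracker` are illustrative names, and the running component-wise maximum is one plausible form of the moving ideal-point estimate, not necessarily the suite's exact update rule:

```python
import random

def sample_preference(n, alpha=1.0, rng=random):
    """Draw w ~ Dirichlet(alpha * 1_n) via normalized Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    s = sum(g)
    return [x / s for x in g]

class IdealPointTracker:
    """Running estimate of the ideal point z* used to normalize
    objectives before scalarization (a sketch: component-wise
    maximum over all returns seen so far)."""
    def __init__(self, n):
        self.z = [float("-inf")] * n

    def update(self, r):
        self.z = [max(z_i, r_i) for z_i, r_i in zip(self.z, r)]
        return self.z

random.seed(0)
w = sample_preference(3)          # lies on the probability simplex
tracker = IdealPointTracker(2)
tracker.update([1.0, -2.0])
print(tracker.update([0.5, 3.0]))  # [1.0, 3.0]
```

Sampling with a Gamma-normalization trick avoids any external dependency: normalized i.i.d. Gamma(alpha, 1) draws are exactly Dirichlet(alpha * 1_n) distributed, and alpha = 1 gives the uniform simplex distribution used during training.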

4. Evaluation Metrics: Hypervolume, PNDS, and Ordering Score

GraphAllocBench introduces new evaluation metrics specific to the preference-conditioned context:

| Metric | Brief description | Range |
| --- | --- | --- |
| Hypervolume (HV) | Volume in objective space covered by the non-dominated solutions (requires true front) | $\geq 0$ |
| PNDS | Proportion of non-dominated solutions, given $K$ preference samples | $[0, 1]$ |
| Ordering Score (OS) | Average normalized Spearman correlation between preference weight and achieved objective | $[0, 1]$ |
  • PNDS: For $K$ sampled preferences $w_k$, compute the resulting objective vectors $x_k$. Then $\mathrm{PNDS} = |\{ x_k : x_k \text{ not dominated by any other} \}| / K$. PNDS identifies traps where HV is high but policies are consistently dominated locally.
  • Ordering Score (OS): For each objective $i$, sweep $w_i$ across $[0, 1]$ (others random), and measure the Spearman rank correlation of the $J_i$ values with $w_i$. The final OS aggregates across objectives and sweeps, capturing monotonicity: increasing a preference weight must not lower the corresponding objective's achievement (Jiang et al., 28 Jan 2026).
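The two bespoke metrics can be sketched directly from their definitions, assuming maximization and the tie-free Spearman formula (`dominates`, `pnds`, and `spearman` are illustrative helpers; OS additionally normalizes the correlation into $[0, 1]$, e.g. via $(\rho + 1)/2$):

```python
def dominates(x, y):
    """x Pareto-dominates y (maximization): >= everywhere, > somewhere."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def pnds(points):
    """Proportion of non-dominated solutions among K evaluated points."""
    nd = [x for x in points
          if not any(dominates(y, x) for y in points if y is not x)]
    return len(nd) / len(points)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction), as used in
    the Ordering Score sweep of J_i values against w_i."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

pts = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4), (0.2, 0.2)]
print(pnds(pts))  # 0.75: (0.2, 0.2) is dominated by (0.4, 0.4)
print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0: perfectly monotone
```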

This triad provides granular diagnosis of Pareto coverage, preference-alignment, and solution diversity.

5. Empirical Results and Analysis of Failure Modes

Empirical evaluation reveals the following:

  • Generalization: PPO+MLP PCPL significantly outperforms DDQN-based PD-MORL on 13/15 problems in normalized HV, PNDS, and OS, indicating strong alignment with user-specified preferences.
  • Failure Modes: Sharp reward discontinuities and threshold objectives degrade both HV and PNDS. Non-convex fronts (e.g., multi-modal or highly oscillatory objectives) challenge conventional scalarization; local optima result in low PNDS despite acceptable HV.
  • Scalability: On high-dimensional graphs (e.g., 100+ nodes), HGNN policies consistently achieve 2–4× higher HV than MLPs with similar or smaller parameter counts. Mean+max node aggregation produces best PNDS, whereas attention pooling is associated with stronger global HV but slightly reduced ordering consistency.
  • Flexibility: Varying graph topology, objective function composition, and resource budgets systematically alters the trade-off landscape, promoting a diverse and extensible benchmarking framework (Jiang et al., 28 Jan 2026).

6. Design Insights and Extensions

Key technical insights include:

  • Multi-stage preference conditioning: In graph-aware architectures, injecting $w$ at the input, intermediate, and pooling levels is necessary for robust global preference alignment.
  • Graph-based representations: Modeling resource allocation as a bipartite graph, and applying heterogeneous graph attention, proves essential for effective policy learning in high-dimensional, combinatorial domains. MLPs lose sample efficiency and generalization performance in these settings.
  • Failure diagnosis: The combination of HV, PNDS, and OS exposes policy limitations (e.g., local optima, sub-par generalization) that are invisible to traditional Pareto coverage metrics alone.
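The multi-stage conditioning insight can be illustrated with a toy forward pass that injects $w$ at the node-embedding stage and again after global pooling. Everything here is a simplifying assumption (plain affine layers, mean pooling, arbitrary dimensions); the actual architecture uses heterogeneous graph attention:

```python
import random

def linear(x, w_mat, b):
    """Plain affine layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w_mat, b)]

def make_layer(n_in, n_out, rng):
    """Randomly initialized affine layer parameters."""
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(node_feats, w_pref, rng):
    """Toy policy head conditioning on w at two stages:
    (1) w concatenated to every node feature before embedding,
    (2) w concatenated again to the pooled graph representation.
    """
    n_pref = len(w_pref)
    emb_params = make_layer(len(node_feats[0]) + n_pref, 8, rng)
    out_params = make_layer(8 + n_pref, 4, rng)
    # Stage 1: per-node preference injection.
    embs = [linear(f + w_pref, *emb_params) for f in node_feats]
    # Global pooling: element-wise mean over node embeddings.
    pooled = [sum(col) / len(embs) for col in zip(*embs)]
    # Stage 2: preference injection at the pooled/output level.
    return linear(pooled + w_pref, *out_params)

rng = random.Random(0)
nodes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = forward(nodes, [0.7, 0.3], rng)
print(len(out))  # 4 output logits
```

The point of the repeated injection is that pooling discards per-node detail: without the second concatenation, the output head would see the preference only through whatever survives aggregation.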

Potential avenues for extension identified in the benchmark ecosystem include the integration of:

  • risk-sensitive optimization by incorporating stochastic events (e.g., disasters)
  • meta-learning for transfer across structurally similar city graphs
  • offline PCPL for logged datasets using preference-conditioned critics
  • multi-agent PCPL with decentralized assignment and communication (Jiang et al., 28 Jan 2026).

7. Significance and Future Impact

GraphAllocBench addresses a notable gap in MORL benchmarking by supporting dynamic, preference-conditioned evaluation at scale, with flexible, complex problem generation and rigorous preference alignment diagnostics. It establishes a foundation for advancing PCPL algorithms to tackle real-world combinatorial resource allocation under arbitrary, user- or stakeholder-driven objectives, where scalarization is insufficient or dynamic adaptation is critical (Jiang et al., 28 Jan 2026).

The formalism aligns closely with emerging theoretical frameworks for preference-based policy learning, directly connecting to broader developments in ordinal RL, preference-conditioned treatment policy learning, and direct preference optimization approaches documented in contemporary literature (Carr et al., 2023, Janmohamed et al., 2024, Parnas et al., 3 Feb 2026, An et al., 2023).
