GraphAllocBench Benchmark Suite

Updated 4 February 2026
  • GraphAllocBench is a benchmarking suite that operationalizes Preference-Conditioned Policy Learning (PCPL) in complex multi-objective reinforcement learning environments using graph-structured resource allocation.
  • It features the CityPlannerEnv, a flexible bipartite graph environment that simulates city-scale resource allocation with diverse and non-convex objective landscapes.
  • The suite integrates MLP and graph-aware (HGNN) architectures along with novel evaluation metrics (HV, PNDS, OS) to diagnose scalability, generalization, and preference alignment.

GraphAllocBench is a configurable benchmarking suite for Preference-Conditioned Policy Learning (PCPL) in multi-objective reinforcement learning (MORL). The platform is built around a graph-structured resource allocation environment (CityPlannerEnv) that enables the evaluation of MORL algorithms across high-dimensional, diverse, and complex decision-making tasks with dynamic preference conditioning. The design and evaluation protocol is specifically tailored to expose the scalability, generalization, and preference-alignment properties of policy learning approaches in settings where objectives are non-trivially interdependent and combinatorial in nature (Jiang et al., 28 Jan 2026).

1. Formalization of Preference-Conditioned Policy Learning in GraphAllocBench

GraphAllocBench operationalizes PCPL in the context of MORL by considering agents that learn a single policy $\pi_\theta(a \mid s, w)$, where $w \in \Delta^{N-1}$ denotes the user-specified preference vector over $N$ objectives, satisfying the simplex constraints $\sum_{i=1}^N w_i = 1$, $w_i \geq 0$. The agent interacts with an environment characterized by a vector-valued reward function $r: S \times A \to \mathbb{R}^N$, with state space $S$ and (discrete) action space $A$.

Given a trajectory $\tau = (s_0, a_0, \ldots, s_{T-1}, a_{T-1})$, the cumulative objective vector is $r(\tau) = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$, with $0 \leq \gamma < 1$ the discount factor. The objective is to maximize the expected scalarized return:

$$J(\theta) = \mathbb{E}_{w \sim p(w)} \, \mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid \cdot, w)} \big[ S_\epsilon(r(\tau), w) \big],$$

where $S_\epsilon$ is a smooth scalarization (e.g., Smooth Tchebycheff) parametrized by $w$. At inference, the policy can adapt instantly to any preference weights $w$, recovering or approximating the associated Pareto-optimal solution (Jiang et al., 28 Jan 2026).
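A minimal sketch of such a smoothed scalarization, assuming the standard log-sum-exp relaxation of the weighted Tchebycheff term (the function name and the `z_ideal` ideal-point argument are illustrative, not taken from the suite):

```python
import math

def smooth_tchebycheff(r, w, z_ideal, eps=0.1):
    """Smooth Tchebycheff scalarization S_eps(r, w).

    Replaces the max in the weighted Tchebycheff distance
    max_i w_i * (z_i* - r_i) with a log-sum-exp soft maximum,
    then negates it so that larger scalarized values are better.
    `z_ideal` is an estimate of the ideal point z*.
    """
    terms = [w_i * (z_i - r_i) / eps for w_i, z_i, r_i in zip(w, z_ideal, r)]
    m = max(terms)  # shift for numerical stability
    soft_max = eps * (m + math.log(sum(math.exp(t - m) for t in terms)))
    return -soft_max  # scalarized return to maximize

# As eps -> 0 this approaches -max_i w_i (z_i* - r_i) = -0.2 here.
print(smooth_tchebycheff([0.6, 0.9], [0.5, 0.5], [1.0, 1.0], eps=1e-4))
```

The log-sum-exp form keeps the objective differentiable in the returns, which is the reason a smooth scalarization is preferred over the raw Tchebycheff max inside policy-gradient training.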

2. CityPlannerEnv: Flexible Graph-Based Resource Allocation Environment

The core of GraphAllocBench is CityPlannerEnv, a highly parameterizable bipartite graph environment designed to emulate city-scale resource allocation problems. Nodes are partitioned into resources $R$ ($|R| = m$) and demands $D$ ($|D| = d$). The allocation matrix $A \in \mathbb{R}^{m \times d}$ encodes current resource-to-demand assignments, normalized by supply. The observation at each step is $o = (\mathrm{vec}(A), w)$.

The action space consists of $2d + 1$ discrete choices, combining a modification mode $t \in \{-1, 0, +1\}$ (decrement, no-op, increment) with a demand index $j \in \{1, \ldots, d\}$. Environment objectives $J_i(P)$ are defined as real-valued functions over the vector of demand productions $P$, with support for arbitrary combinations of polynomial, logarithmic, sinusoidal, and thresholding functions. This enables the construction of convex, non-convex, discontinuous, and sparse multi-objective landscapes, producing environments with varying Pareto front geometries and optimization challenges (Jiang et al., 28 Jan 2026).
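One plausible flat encoding of this $(2d + 1)$-way action space is a single no-op plus an increment/decrement pair per demand; the `decode_action` helper below and its 0-based index convention are assumptions for illustration, and the suite's actual encoding may differ:

```python
def decode_action(a, d):
    """Decode a flat discrete action a in {0, ..., 2d} into
    (mode, demand_index), where mode is -1 (decrement),
    0 (no-op), or +1 (increment). Demand indices are 0-based
    here, an implementation choice for this sketch.
    """
    if a == 0:
        return 0, None            # no-op touches no demand
    j, parity = divmod(a - 1, 2)  # two actions per demand node
    return (+1 if parity == 0 else -1), j

d = 4  # number of demand nodes in this toy example
# Every flat action maps to a distinct (mode, demand) pair.
assert len({decode_action(a, d) for a in range(2 * d + 1)}) == 2 * d + 1
print(decode_action(0, d))  # (0, None)
print(decode_action(3, d))  # (1, 1): increment demand 1
```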

3. Policy Architectures and Training Procedures

GraphAllocBench benchmarks policy architectures along two axes:

  • MLP-based architectures: The observation vector (flattened $A$ concatenated with $w$) feeds into two hidden layers (128 units, SiLU activations), used for both actor and critic heads in PPO updates.
  • Graph-aware (HGNN) architectures: Nodes representing resources, demands, and an unallocated-resource node are embedded and processed by two layers of heterogeneous graph attention, each node type with its own attention head and residual connection. The preference vector $w$ is concatenated at the node-embedding stage and again at the global pooling/output layer, using either mean+max pooling (best for Pareto coverage) or preference-conditioned multi-head attention pooling (slightly higher hypervolume, lower monotonicity).

Training employs PPO, sampling $w \sim \mathrm{Dirichlet}(\mathbf{1}_N)$ for each rollout and normalizing objectives using a moving estimate of the ideal point for robust reward scalarization. The architectures are evaluated for generalization to unseen preferences and scalability on large graphs (e.g., 100 demands/resources) (Jiang et al., 28 Jan 2026).
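The preference-sampling and ideal-point bookkeeping can be sketched in a few lines. `sample_preference` and `IdealPointTracker` are illustrative names, and the running component-wise maximum is one plausible form of the moving ideal-point estimate, not necessarily the suite's exact update rule:

```python
import random

def sample_preference(n, alpha=1.0, rng=random):
    """Draw w ~ Dirichlet(alpha * 1_n) via normalized Gamma draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    s = sum(g)
    return [x / s for x in g]

class IdealPointTracker:
    """Running estimate of the ideal point z* used to normalize
    objectives before scalarization (a sketch: component-wise
    maximum over all returns seen so far)."""
    def __init__(self, n):
        self.z = [float("-inf")] * n

    def update(self, r):
        self.z = [max(z_i, r_i) for z_i, r_i in zip(self.z, r)]
        return self.z

random.seed(0)
w = sample_preference(3)          # lies on the probability simplex
tracker = IdealPointTracker(2)
tracker.update([1.0, -2.0])
print(tracker.update([0.5, 3.0]))  # [1.0, 3.0]
```

Sampling with a Gamma-normalization trick avoids any external dependency: normalized i.i.d. Gamma(alpha, 1) draws are exactly Dirichlet(alpha * 1_n) distributed, and alpha = 1 gives the uniform simplex distribution used during training.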

4. Evaluation Metrics: Hypervolume, PNDS, and Ordering Score

GraphAllocBench introduces new evaluation metrics specific to the preference-conditioned context:

| Metric | Brief description | Range |
| --- | --- | --- |
| Hypervolume (HV) | Volume in objective space covered by the non-dominated solutions (requires true front) | $\geq 0$ |
| PNDS | Proportion of non-dominated solutions, given $K$ preference samples | $[0, 1]$ |
| Ordering Score (OS) | Average normalized Spearman correlation between preference weight and achieved objective | $[0, 1]$ |
  • PNDS: For $K$ sampled preferences $w_k$, compute the resulting objective vectors $x_k$. Then $\mathrm{PNDS} = |\{ x_k : x_k \text{ not dominated by any other} \}| / K$. PNDS identifies traps where HV is high but policies are consistently dominated locally.
  • Ordering Score (OS): For each objective $i$, sweep $w_i$ across $[0, 1]$ (others random), and measure the Spearman rank correlation of the $J_i$ values with $w_i$. The final OS aggregates across objectives and sweeps, capturing monotonicity: increasing a preference weight must not lower the corresponding objective's achievement (Jiang et al., 28 Jan 2026).
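The two bespoke metrics can be sketched directly from their definitions, assuming maximization and the tie-free Spearman formula (`dominates`, `pnds`, and `spearman` are illustrative helpers; OS additionally normalizes the correlation into $[0, 1]$, e.g. via $(\rho + 1)/2$):

```python
def dominates(x, y):
    """x Pareto-dominates y (maximization): >= everywhere, > somewhere."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def pnds(points):
    """Proportion of non-dominated solutions among K evaluated points."""
    nd = [x for x in points
          if not any(dominates(y, x) for y in points if y is not x)]
    return len(nd) / len(points)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction), as used in
    the Ordering Score sweep of J_i values against w_i."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

pts = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4), (0.2, 0.2)]
print(pnds(pts))  # 0.75: (0.2, 0.2) is dominated by (0.4, 0.4)
print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0: perfectly monotone
```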

This triad provides granular diagnosis of Pareto coverage, preference-alignment, and solution diversity.

5. Empirical Results and Analysis of Failure Modes

Empirical evaluation reveals the following:

  • Generalization: PPO+MLP PCPL significantly outperforms DDQN-based PD-MORL on 13/15 problems in normalized HV, PNDS, and OS, indicating strong alignment with user-specified preferences.
  • Failure Modes: Sharp reward discontinuities and threshold objectives degrade both HV and PNDS. Non-convex fronts (e.g., multi-modal or highly oscillatory objectives) challenge conventional scalarization; local optima result in low PNDS despite acceptable HV.
  • Scalability: On high-dimensional graphs (e.g., 100+ nodes), HGNN policies consistently achieve 2–4× higher HV than MLPs with similar or smaller parameter counts. Mean+max node aggregation produces best PNDS, whereas attention pooling is associated with stronger global HV but slightly reduced ordering consistency.
  • Flexibility: Varying graph topology, objective function composition, and resource budgets systematically alters the trade-off landscape, promoting a diverse and extensible benchmarking framework (Jiang et al., 28 Jan 2026).

6. Design Insights and Extensions

Key technical insights include:

  • Multi-stage preference conditioning: In graph-aware architectures, injecting $w$ at the input, intermediate, and pooling levels is necessary for robust global preference alignment.
  • Graph-based representations: Modeling resource allocation as a bipartite graph, and applying heterogeneous graph attention, proves essential for effective policy learning in high-dimensional, combinatorial domains. MLPs lose sample efficiency and generalization performance in these settings.
  • Failure diagnosis: The combination of HV, PNDS, and OS exposes policy limitations (e.g., local optima, sub-par generalization) that are invisible to traditional Pareto coverage metrics alone.
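The multi-stage conditioning insight can be illustrated with a toy forward pass that injects $w$ at the node-embedding stage and again after global pooling. Everything here is a simplifying assumption (plain affine layers, mean pooling, arbitrary dimensions); the actual architecture uses heterogeneous graph attention:

```python
import random

def linear(x, w_mat, b):
    """Plain affine layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w_mat, b)]

def make_layer(n_in, n_out, rng):
    """Randomly initialized affine layer parameters."""
    return ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(node_feats, w_pref, rng):
    """Toy policy head conditioning on w at two stages:
    (1) w concatenated to every node feature before embedding,
    (2) w concatenated again to the pooled graph representation.
    """
    n_pref = len(w_pref)
    emb_params = make_layer(len(node_feats[0]) + n_pref, 8, rng)
    out_params = make_layer(8 + n_pref, 4, rng)
    # Stage 1: per-node preference injection.
    embs = [linear(f + w_pref, *emb_params) for f in node_feats]
    # Global pooling: element-wise mean over node embeddings.
    pooled = [sum(col) / len(embs) for col in zip(*embs)]
    # Stage 2: preference injection at the pooled/output level.
    return linear(pooled + w_pref, *out_params)

rng = random.Random(0)
nodes = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = forward(nodes, [0.7, 0.3], rng)
print(len(out))  # 4 output logits
```

The point of the repeated injection is that pooling discards per-node detail: without the second concatenation, the output head would see the preference only through whatever survives aggregation.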

Potential avenues for extension identified in the benchmark ecosystem include the integration of:

  • risk-sensitive optimization by incorporating stochastic events (e.g., disasters)
  • meta-learning for transfer across structurally similar city graphs
  • offline PCPL for logged datasets using preference-conditioned critics
  • multi-agent PCPL with decentralized assignment and communication (Jiang et al., 28 Jan 2026).

7. Significance and Future Impact

GraphAllocBench addresses a notable gap in MORL benchmarking by supporting dynamic, preference-conditioned evaluation at scale, with flexible, complex problem generation and rigorous preference alignment diagnostics. It establishes a foundation for advancing PCPL algorithms to tackle real-world combinatorial resource allocation under arbitrary, user- or stakeholder-driven objectives, where scalarization is insufficient or dynamic adaptation is critical (Jiang et al., 28 Jan 2026).

The formalism aligns closely with emerging theoretical frameworks for preference-based policy learning, directly connecting to broader developments in ordinal RL, preference-conditioned treatment policy learning, and direct preference optimization approaches documented in contemporary literature (Carr et al., 2023, Janmohamed et al., 2024, Parnas et al., 3 Feb 2026, An et al., 2023).
