Decomposition of Driving Policy

Updated 13 December 2025
  • Decomposition of driving policy is a strategy to modularize complex autonomous driving tasks using hierarchical and temporal abstractions, simplifying control and improving learning efficiency.
  • It employs techniques like hierarchical reinforcement learning, option graphs, and modular abstraction to break down maneuvers into tailored sub-policies for specific decision dimensions.
  • Empirical results indicate that decomposed policies achieve faster convergence, safer execution, and robust sim-to-real transfer, enhancing adaptability in dynamic driving scenarios.

The decomposition of driving policy refers to the modularization or hierarchical structuring of the policy that governs autonomous vehicle behavior. This concept is central to achieving robust, data-efficient, and transferable decision making in self-driving systems. Decomposition enables separation of complex tasks into functional components—whether by temporal abstraction, semantic segmentation, hierarchical control, or modular design—to address the high-dimensional and multi-agent nature of real-world driving.

1. Hierarchical Reinforcement Learning-Based Decomposition

A prominent decomposition strategy utilizes hierarchical reinforcement learning (HRL) for decision making in autonomous vehicles. Here, the driving policy is split into a high-level master policy responsible for maneuver selection (e.g., lane keeping, left lane change, right lane change) and multiple low-level sub-policies for executing maneuvers in both lateral and longitudinal dimensions (Duan et al., 2020).

  • Master Policy ($\pi_{\text{master}}(m \mid s; \theta_{\text{master}})$): Selects among the maneuvers $\{0: \text{drive-in-lane},\ 1: \text{left-change},\ 2: \text{right-change}\}$ based on an extended 26-dimensional state vector encompassing ego, road, neighboring-vehicle, and destination attributes. Rewards are structured to incentivize lane occupancy and destination reaching and to penalize illegal or collision states.
  • Sub-Policies: Each maneuver $i$ employs separate steering ($\mathrm{SP}_i(a_{\text{lat}} \mid s_{\text{lat}}; \theta_i^{\text{lat}})$) and acceleration ($\mathrm{AP}_i(a_{\text{long}} \mid s_{\text{long}}; \theta_i^{\text{long}})$) networks. Action spaces and feature selections are maneuver-specific, ensuring tailored state-reward mappings (e.g., lane geometry for lane keeping, adjacent-lane heads for lane changes).
  • Objective: Maximize the sum of expected returns over master and sub-policy networks with asynchronous parallel actor-critic learners (APRL).

This structure addresses the overcomplexity of monolithic policies, reduces the reward-engineering burden, and enables sub-policy reuse in new scenarios. Empirically, hierarchical decomposition yields faster convergence (≈50–60 epochs for the sub-policies, ≈10 for the master), smoother and safer policy execution (zero test collisions), and superior episodic rewards compared to monolithic RL (Duan et al., 2020).
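To make the split concrete, the following sketch shows one way the master/sub-policy hierarchy could be wired up in PyTorch. The hidden-layer sizes, the Gaussian control heads, and the feature slices handed to each sub-policy are illustrative assumptions, not the exact architecture of Duan et al. (2020).

```python
# Minimal sketch of a master policy plus maneuver-specific sub-policies.
# Hidden sizes, Gaussian heads, and feature slices are illustrative assumptions.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

MANEUVERS = ("drive-in-lane", "left-change", "right-change")

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(), nn.Linear(hidden, out_dim))

class MasterPolicy(nn.Module):
    """pi_master(m | s): selects one of the three maneuvers from the 26-dim state."""
    def __init__(self, state_dim=26):
        super().__init__()
        self.logits = mlp(state_dim, len(MANEUVERS))

    def forward(self, s):
        return Categorical(logits=self.logits(s))

class SubPolicy(nn.Module):
    """Separate steering (lateral) and acceleration (longitudinal) networks for one maneuver."""
    def __init__(self, lat_dim=8, long_dim=6):
        super().__init__()
        self.steer = mlp(lat_dim, 2)   # outputs mean and log-std of the steering command
        self.accel = mlp(long_dim, 2)  # outputs mean and log-std of the acceleration command

    def act(self, s_lat, s_long):
        mu_s, log_std_s = self.steer(s_lat).chunk(2, dim=-1)
        mu_a, log_std_a = self.accel(s_long).chunk(2, dim=-1)
        return Normal(mu_s, log_std_s.exp()).sample(), Normal(mu_a, log_std_a.exp()).sample()

# One hierarchical decision step: the master picks a maneuver, its sub-policy issues controls.
master = MasterPolicy()
subs = [SubPolicy() for _ in MANEUVERS]
s = torch.randn(26)                              # placeholder 26-dim state
m = master(s).sample().item()                    # maneuver index chosen by the master
steering, accel = subs[m].act(s[:8], s[:6])      # maneuver-specific feature slices (assumed)
```

In training, each network would receive its own maneuver-specific reward, which is what localizes the reward-engineering effort described above.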

2. Hierarchical Temporal Abstraction and Option Graphs

Temporal abstraction techniques introduce explicit hierarchy in time. An option graph (“Editor’s term”) structures the driving policy as a directed acyclic graph (DAG) of temporally extended options (Shalev-Shwartz et al., 2016).

  • Option Graph Structure: The root node selects high-level actions like “Prepare” or “Merge,” which are further decomposed into sub-decisions (lane, execution style, speed choice). Terminal nodes label vehicle-specific semantics (give/take/offset way). Each node encapsulates a local policy with its own features and parameters.
  • Termination and Gating: Sub-options execute over fixed mini-episodes (2–3s), gating transitions up the hierarchy. This shortens effective horizon length (from T ≈ 250 to ≈30), dramatically reducing gradient variance in policy optimization.
  • Learning Efficiency: Simulation studies on multi-agent merges find that option-graph decomposition enables convergence in $O(10^5)$ episodes, whereas flat policies require $O(10^6)$. Empirical variance drops by a factor of ≈8 with the option graph, and ablation experiments reveal increased collision rates and sample complexity when the hierarchy is removed (Shalev-Shwartz et al., 2016). A minimal sketch of such a graph follows this list.
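The sketch referenced above shows a toy option graph as a DAG of nodes, each carrying a local policy, with leaf options held for a fixed mini-episode. The node names, the ~2.5 s option duration, and the toy observation dictionary are assumptions for illustration; the paper's actual graph, features, and gating logic are richer.

```python
# Toy option-graph sketch: internal nodes route decisions downward, leaf nodes run a
# fixed-length mini-episode. Node names and durations are assumed for illustration.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class OptionNode:
    name: str
    children: List["OptionNode"] = field(default_factory=list)
    # Internal nodes: policy maps an observation to a child index.
    # Leaf nodes: policy maps an observation to a low-level action.
    policy: Optional[Callable] = None

    def is_leaf(self) -> bool:
        return not self.children

def run_option_graph(root: OptionNode, obs, steps_per_option: int = 25):
    """Descend the graph once, then hold the chosen leaf policy for a mini-episode.

    At 10 Hz, 25 steps is a ~2.5 s option, so a 250-step episode reduces to a few dozen
    high-level decisions (the observation is held fixed here purely for brevity).
    """
    node = root
    while not node.is_leaf():             # gate the decision down the hierarchy
        node = node.children[node.policy(obs)]
    return node.name, [node.policy(obs) for _ in range(steps_per_option)]

# Toy graph: root chooses Prepare vs. Merge; Merge chooses give-way vs. take-way.
give_way = OptionNode("give-way", policy=lambda o: ("decelerate", -1.0))
take_way = OptionNode("take-way", policy=lambda o: ("accelerate", +0.5))
merge = OptionNode("Merge", [give_way, take_way], policy=lambda o: 0 if o["gap_small"] else 1)
prepare = OptionNode("Prepare", policy=lambda o: ("hold-lane", 0.0))
root = OptionNode("root", [prepare, merge], policy=lambda o: 1 if o["near_merge"] else 0)

leaf, actions = run_option_graph(root, {"near_merge": True, "gap_small": False})
```

The shortened horizon is exactly what the variance numbers above reflect: each gradient propagates over a few dozen option-level decisions instead of hundreds of primitive steps.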

3. Semantic, Modular, and Abstraction-Based Pipelines

Decomposition is also implemented via abstraction boundaries enforced in the system design. Müller et al. sandwich the driving policy between a semantic perception module and a low-level control layer (Müller et al., 2018).

  • Three-Stage Pipeline: A raw image $x_t$ is converted to a semantic segmentation $a_t$ (road vs. not-road), which is consumed by a waypoint-predicting driving policy with output $y_t = (\varphi_1, \varphi_2)$; the waypoints are finally actuated via handcrafted PID controllers (a minimal sketch follows this list).
  • Decoupling and Invariance: Perception is trained on real-world images, while the driving policy learns from the noisy perception output, enforcing invariance to domain shifts (e.g., different illumination, weather, or sensor noise). No adversarial adaptation is required; the abstraction boundaries alone suffice for reliable sim-to-real transfer.
  • Empirical Outcomes: Modularity enables policies trained exclusively in simulation to transfer robustly to real-world hardware, outperforming end-to-end alternatives and achieving a 1.00 success rate in road-following trials (Müller et al., 2018).
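The sketch referenced in the pipeline bullet is given here. The thresholding "segmenter", the centroid-based waypoint predictor, and the PID gains are stand-ins for the learned modules and tuned controllers; only the stage boundaries mirror the paper.

```python
# Sketch of the three-stage pipeline: image -> road mask -> waypoint angles -> PID control.
# The segmenter, waypoint predictor, and PID gains below are illustrative placeholders.
import numpy as np

def segment_road(image: np.ndarray) -> np.ndarray:
    """Perception stage: produce a road / not-road mask (placeholder thresholding)."""
    return (image.mean(axis=-1) > 0.5).astype(np.float32)

def waypoint_policy(mask: np.ndarray) -> tuple:
    """Driving-policy stage: predict two waypoint heading angles (phi1, phi2).

    The road mask's horizontal centroid stands in for a learned network here.
    """
    cols = np.arange(mask.shape[1])
    center = (mask * cols).sum() / max(float(mask.sum()), 1.0)
    offset = center / mask.shape[1] - 0.5          # -0.5 .. 0.5, left/right of image center
    return 0.5 * offset, 1.0 * offset              # near and far waypoint angles (rad)

class PID:
    """Low-level control stage: classic PID on the near-waypoint heading error."""
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, error: float) -> float:
        self.integral += error * self.dt
        deriv = (error - self.prev_err) / self.dt
        self.prev_err = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

steer_pid = PID(kp=1.2, ki=0.0, kd=0.1)

def control_step(image: np.ndarray) -> float:
    mask = segment_road(image)
    phi1, _phi2 = waypoint_policy(mask)
    # Track the near waypoint; the far waypoint could modulate target speed.
    return float(np.clip(steer_pid.step(phi1), -1.0, 1.0))

steering = control_step(np.random.rand(96, 128, 3))
```

Because the policy only ever sees the road mask, swapping a simulated segmenter for a real-world one leaves the policy's input distribution largely unchanged, which is the invariance argument above.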

4. Multi-Agent Contextual and Socially-Adversarial Decomposition

Driving policies in multi-agent environments benefit from a decomposition that explicitly separates context communication from policy execution. Zhang et al. formulate a Contextual Partially-Observable Stochastic Game (CPOSG) and introduce Social Value Orientation (SVO) as a communicated context (Zhang et al., 2023).

  • Hierarchical Policy Structure:
    • Upper-Level Policy ($\beta^u$): Communicates context vectors (SVOs) to all agents.
    • Lower-Level Policy ($\beta^\ell$): Executes control decisions conditioned on observations and the received SVOs.
  • Two-Stage Training:
    • Stage 1: Learn a socially-aware traffic flow by maximizing team-based or individually weighted rewards according to the SVO parameters.
    • Stage 2: Robustify the ego policy through zero-sum adversarial training, whereby adversarial upper-level policies inject context perturbations to minimize the ego's return. Lower-level execution policies for background agents remain frozen to ensure traffic-rule compliance.
  • Experimental Insights: Decomposing the policy into communication and execution enables traffic flows that sustain 80% success rates in merges with 50 agents, outperforms baselines in four scenarios, and yields ego policies with up to 15–16 statistically significant zero-shot generalization improvements over the nearest competitors (Zhang et al., 2023). This suggests modular communication improves transfer and adversarial robustness; a minimal sketch of the context/execution split follows this list.
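The sketch below separates an upper-level policy that broadcasts SVO contexts from a lower-level execution policy. The cosine/sine reward blend is a common SVO formulation; the uniform perturbation standing in for the adversarial upper level and the placeholder controller are assumptions for illustration.

```python
# Sketch of context/execution decomposition with SVO contexts. The reward blend is a
# common SVO formulation; the perturbation and controller below are illustrative stand-ins.
import numpy as np

def svo_reward(r_ego: float, r_others_mean: float, svo_angle: float) -> float:
    """Blend selfish and prosocial reward terms according to an SVO angle."""
    return np.cos(svo_angle) * r_ego + np.sin(svo_angle) * r_others_mean

class UpperLevelPolicy:
    """beta^u: communicates an SVO context to every agent."""
    def __init__(self, n_agents: int, prior: float = np.pi / 4):
        self.n_agents, self.prior = n_agents, prior

    def communicate(self, adversarial: bool = False) -> np.ndarray:
        svos = np.full(self.n_agents, self.prior)
        if adversarial:
            # Stage 2: an adversarial upper level perturbs contexts to stress the ego policy.
            svos += np.random.uniform(-np.pi / 6, np.pi / 6, size=self.n_agents)
        return svos

class LowerLevelPolicy:
    """beta^l: maps (observation, received SVO) to a control action; kept frozen in stage 2."""
    def act(self, obs: np.ndarray, svo: float) -> np.ndarray:
        # Placeholder linear controller; a learned network would sit here.
        return np.tanh(obs[:2] * (1.0 + 0.2 * np.cos(svo)))

upper, lower = UpperLevelPolicy(n_agents=4), LowerLevelPolicy()
svos = upper.communicate(adversarial=True)
actions = [lower.act(np.random.randn(8), s) for s in svos]
```

Freezing the lower level during adversarial training keeps the background traffic rule-compliant while only the communicated contexts are attacked, matching the two-stage scheme above.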

5. Theoretical Motivation and Reward Engineering Implications

Across decomposition paradigms, several common theoretical motivations arise:

  • Complexity Reduction: Decomposing policies into maneuver, temporal, or semantic modules reduces the effective size of state and action spaces, facilitating sample-efficient learning.
  • Reward Shaping Localization: Custom reward functions per module simplify the objective, circumventing the challenge of monolithic, conflicting reward terms.
  • Variance Reduction: Temporal and policy hierarchies (option graphs, HRL, SVO-based contexts) yield shorter effective horizons, lowering gradient variance and stabilizing training.

A plausible implication is that policy decomposition not only enhances learning efficiency and stability but also enables better generalization to diverse driving tasks and environments.
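The variance-reduction point can be made slightly more explicit. The LaTeX sketch below is a heuristic back-of-the-envelope argument under an assumed rough independence of per-step gradient terms; it is an illustration consistent with the numbers in Section 2, not a derivation taken from the cited papers.

```latex
% Heuristic horizon/variance sketch (simplifying assumptions; illustrative only).
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
A REINFORCE-style gradient for a flat policy over horizon $T$ sums $T$ per-step terms,
\[
  \nabla_\theta J
  = \mathbb{E}\!\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right],
\]
where $R_t$ is the return from step $t$. Treating the terms as roughly independent with
per-term variance $\sigma^2$, the estimator variance scales like $T\sigma^2$. Grouping
primitive steps into options spanning $k$ steps replaces $T$ decisions with $T/k$
option-level decisions, giving variance of order $(T/k)\,\sigma^2$; with $T \approx 250$
and $k \approx 8$, the effective horizon shrinks to roughly $30$ high-level decisions,
matching the figures quoted in Section 2.
\end{document}
```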

6. Implications for Transfer, Generalization, and Real-World Deployment

Empirical evaluations across studies consistently show that decomposed driving policies improve sim-to-real transfer, generalize across unseen scenarios, and achieve functional safety.

Decomposition Paradigm | Sample Efficiency | Transfer Robustness
Hierarchical RL (APRL) (Duan et al., 2020) | Converges in 10–60 epochs | Zero test collisions
Option Graph (Shalev-Shwartz et al., 2016) | $10^5$ vs. $10^6$ episodes | ≈8× gradient-variance reduction
Modular abstraction (Müller et al., 2018) | Reliable sim-to-real transfer | 1.00 real-world success rate
SVO Contextual (Zhang et al., 2023) | 80% merge success with 50 agents | 88% zero-shot success

These results indicate that modular and hierarchical decomposition strategies are integral to scalable, safety-certifiable autonomous driving.

7. Controversies and Limitations

While decomposition reduces complexity and accelerates learning, potential limitations include the risk of sub-optimal coordination between modules, increased design burden for selecting appropriate abstraction boundaries, and the need for comprehensive empirical validation in edge-case scenarios. Some studies note that removing hierarchy leads to unstable and unsafe behaviors (Shalev-Shwartz et al., 2016), and that domain randomization alone cannot compensate for lack of semantic constraint (Müller et al., 2018).

This suggests that decomposition must be judiciously applied, with deep integration between theory, architecture, and empirical testing, to ensure robust policy deployment.
