PIC4rl-gym: Modular DRL Navigation

  • PIC4rl-gym is a modular framework for autonomous navigation that unifies DRL, ROS2, and Gazebo for consistent experimentation.
  • It provides a Gym-like API and standardized metrics for training, testing, and benchmarking navigation policies in diverse environments.
  • The framework supports customization through ROS2 YAML configurations, enabling rapid integration of new robot models and sensor modalities.

PIC4rl-gym is a modular framework specifically designed for autonomous robot navigation research, integrating deep reinforcement learning (DRL) with established robotics tools such as ROS2 and the Gazebo simulator. By offering a cohesive system that unifies agent training, testing, and benchmarking across diverse platforms, tasks, and sensory inputs, PIC4rl-gym targets the rapid development and standardized evaluation of DRL navigation policies in both simulated indoor and outdoor environments (Martini et al., 2022).

1. System Architecture and Module Organization

PIC4rl-gym structures its core functionality around a minimal set of ROS2 nodes and Python classes to minimize inter-process communication latency, especially for high-bandwidth data such as images and LiDAR scans. The principal architectural elements include:

  • Gazebo Simulator: Hosts the physical static/dynamic world, robot models (URDF/SDF), physics, and all sensor plugins.
  • pic4rl_environment Class: Implements the RL environment with a Gym-like API (reset(), step()), encapsulating logic for simulation state control, action dispatch, reward calculation, and termination checks.
  • pic4rl_training Node: Inherits from pic4rl_environment and manages the DRL agent’s training loop, policy updates, and replay buffer operations.
  • Sensors Helper: Uses a parameter-driven configuration to subscribe to specific ROS2 sensor topics, preprocesses incoming data, and composes the observation tensor from multiple modalities (e.g., LiDAR, RGB-D).
  • Trainer (TF2RL Wrapper): Provides instantiation and orchestration of DRL algorithms (DDPG, TD3, SAC), corresponding neural architectures, and ROS2-configurable hyperparameters.
  • Testing Package (Tester class): Loads trained policies to run repeated episodes, records detailed trajectory, state, and reward logs, and outputs the results for benchmarking.

Reset and stepping routines coordinate closely with Gazebo services (e.g., /reset_world, /set_entity_state) for deterministic scenario resampling and synchronous MDP progression. Parameters such as episode limits, reward weights, and sensor configurations reside in ROS2 YAML files for rapid reconfiguration.
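A minimal sketch of this reset()/step() pattern as a ROS2 node is shown below; the class and helper names are illustrative rather than the framework's actual code, while the /cmd_vel topic and /reset_world service follow the conventions described in this and the next section.

```python
# Hypothetical minimal sketch of a Gym-like ROS2 environment node,
# mirroring the reset()/step() pattern described above. Class and
# helper names are illustrative, not the actual PIC4rl-gym classes.
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
from std_srvs.srv import Empty


class NavEnvironment(Node):
    def __init__(self):
        super().__init__('nav_environment')
        # Action dispatch: velocity commands on /cmd_vel.
        self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
        # Simulation control: Gazebo's /reset_world service.
        self.reset_cli = self.create_client(Empty, '/reset_world')

    def reset(self):
        """Reset the Gazebo world and return the initial observation."""
        self.reset_cli.wait_for_service()
        future = self.reset_cli.call_async(Empty.Request())
        rclpy.spin_until_future_complete(self, future)
        return self._get_observation()

    def step(self, action):
        """Apply one action, advance the MDP, and score the transition."""
        msg = Twist()
        msg.linear.x, msg.angular.z = float(action[0]), float(action[1])
        self.cmd_pub.publish(msg)
        obs = self._get_observation()
        reward, done = self._compute_reward(obs)
        return obs, reward, done, {}

    def _get_observation(self):
        # Placeholder: in the real framework a Sensors helper composes
        # the observation from subscribed LiDAR/depth topics.
        return []

    def _compute_reward(self, obs):
        # Placeholder: task-specific reward and termination checks.
        return 0.0, False
```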

2. Integration with ROS2 and Gazebo

PIC4rl-gym is deeply integrated with the ROS2 ecosystem and leverages standard message types and Gazebo plugins. The main inter-process interactions include:

  • Velocity Commands: Dispatched as geometry_msgs/Twist on /cmd_vel.
  • Sensor Data: Consumed from /scan (sensor_msgs/LaserScan, for 2D LiDAR) and /camera/depth/image_raw (sensor_msgs/Image, for RGB-D).
  • Simulation Control: Invokes Gazebo services for resetting, pausing/unpausing physics, and setting or querying entity states.
  • Parameterization and Launch: Each experiment is defined by a launch file and corresponding YAML (a minimal example follows this list), typically following the pattern:
    • launch/train_navigation.launch.py for scenario setup and argument resolution,
    • config/params_training.yaml detailing environment variables, algorithm selection, architecture, and task-specific tuning.
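For illustration, a minimal launch file following this pattern could look as below; the package and executable names are assumptions rather than the framework's actual identifiers.

```python
# Hypothetical sketch of a train_navigation.launch.py; the package
# and executable names shown here are assumed, not the real ones.
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument
from launch.substitutions import LaunchConfiguration
from launch_ros.actions import Node


def generate_launch_description():
    # A single launch argument selects the robot platform.
    robot_arg = DeclareLaunchArgument('robot', default_value='turtlebot3')
    training_node = Node(
        package='pic4rl',                  # assumed package name
        executable='pic4rl_training',      # assumed executable name
        parameters=['config/params_training.yaml',
                    {'robot': LaunchConfiguration('robot')}],
    )
    return LaunchDescription([robot_arg, training_node])
```

A training run would then be started with, e.g., `ros2 launch pic4rl train_navigation.launch.py robot:=husky`, the single-argument platform switch discussed next.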

Switching platforms (e.g., swapping robots or sensors) usually requires changing only a single launch argument, allowing seamless testing of different robot models with their respective URDF/SDF specifications and sensor topic assignments.

3. Deep Reinforcement Learning Components

3.1. Markov Decision Process Formalization

Autonomous navigation tasks are modeled as MDPs, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where:

  • $\mathcal{S}$: Observations (e.g., LiDAR vector $\ell \in \mathbb{R}^{N_r}$, depth image $I \in \mathbb{R}^{H \times W}$, relative goal vector $g = (x_g - x_r,\, y_g - y_r)$)
  • $\mathcal{A}$: Continuous command spaces (Twist or 3D velocities)
  • $P(s' \mid s, a)$: Dynamics induced by Gazebo’s physics and collision models
  • $R(s, a)$: Task-specific reward function, e.g., a large positive bonus for reaching the goal, a penalty for collision, and an incremental progress term (a sketch follows this list)
  • $\gamma$: Discount factor controlling future reward weighting
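For concreteness, a minimal reward of this shape might look as follows; the bonus, penalty, and threshold values are illustrative assumptions, not the framework's tuned weights.

```python
# Minimal sketch of the reward structure described above; all
# magnitudes and thresholds are assumed values for illustration.
import math

def compute_reward(robot_xy, goal_xy, prev_dist, min_obstacle_dist,
                   goal_radius=0.3, collision_dist=0.25):
    dist = math.hypot(goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1])
    if dist < goal_radius:
        return 100.0, True, dist       # large positive bonus on goal
    if min_obstacle_dist < collision_dist:
        return -100.0, True, dist      # collision penalty, episode ends
    progress = prev_dist - dist        # incremental progress term
    return progress, False, dist       # (reward, done, new distance)
```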

3.2. Neural Architectures

Two main architecture families are provided:

  • MLP (Dense): For vectorial state spaces (e.g., LiDAR, pose), comprising two hidden layers of size 256 each, with ReLU activations.
  • CNN + FC: For image-based states (RGB-D), three Conv layers [16, 32, 64 filters, 3×3, stride 2] feed into a flattened feature set, concatenated with auxiliary inputs, then processed by dense layers as above.

The TD3 critic outputs two Q-values, the SAC critic one; all networks support optional dropout and weight decay, and the architecture choice is parameterized via YAML. A sketch of both families follows.
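A minimal tf.keras rendering of the two families (TF2RL builds on TensorFlow 2) is sketched below; the layer sizes follow the text, while the input shapes, the auxiliary-input fusion point, and the tanh output squashing are assumptions.

```python
# Sketch of the two architecture families described above, in tf.keras.
import tensorflow as tf
from tensorflow.keras import layers

def build_mlp_actor(state_dim, action_dim):
    # Dense actor for vectorial states: two 256-unit ReLU layers.
    return tf.keras.Sequential([
        layers.Dense(256, activation='relu', input_shape=(state_dim,)),
        layers.Dense(256, activation='relu'),
        layers.Dense(action_dim, activation='tanh'),  # bounded commands
    ])

def build_cnn_actor(height, width, aux_dim, action_dim):
    # Conv extractor for depth images: 16/32/64 filters, 3x3, stride 2,
    # flattened, then concatenated with auxiliary inputs (e.g., goal vector).
    img = layers.Input(shape=(height, width, 1))
    x = img
    for filters in (16, 32, 64):
        x = layers.Conv2D(filters, 3, strides=2, activation='relu')(x)
    x = layers.Flatten()(x)
    aux = layers.Input(shape=(aux_dim,))
    x = layers.Concatenate()([x, aux])
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dense(256, activation='relu')(x)
    out = layers.Dense(action_dim, activation='tanh')(x)
    return tf.keras.Model(inputs=[img, aux], outputs=out)
```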

3.3. DRL Algorithms and Training Paradigm

PIC4rl-gym leverages the TF2RL library for all major continuous-control algorithms:

  • DDPG, TD3, SAC: All with commonly adopted hyperparameters (batch size 256, actor/critic learning rates 3e-4, discount $\gamma = 0.99$, target smoothing coefficient $\tau = 0.005$).
  • Exploration: Initial random warm-up phase (5,000 steps), followed by $\epsilon$-greedy exploration for DDPG/TD3 with exponential decay ($\epsilon \leftarrow 0.995\,\epsilon$ per episode); see the sketch after this list.
  • Replay Buffer: Up to 1,000,000 transitions, supporting standard, prioritized, and $n$-step return variants.
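A schematic of this warm-up-then-decay schedule is given below; the policy and random-action sampler are placeholder callables.

```python
# Sketch of the exploration schedule above: purely random actions
# during warm-up, then epsilon-greedy with per-episode decay.
# `policy` and `sample_random_action` are placeholder callables.
import random

WARMUP_STEPS = 5_000
EPSILON_DECAY = 0.995

def select_action(policy, state, total_steps, epsilon, sample_random_action):
    if total_steps < WARMUP_STEPS or random.random() < epsilon:
        return sample_random_action()   # exploratory action
    return policy(state)                # greedy policy action

def end_of_episode(epsilon):
    return epsilon * EPSILON_DECAY      # epsilon <- 0.995 * epsilon
```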

All training, architecture, and sensor hyperparameters are specified via ROS2 parameter server and YAML, ensuring experiment reproducibility and modularity (Martini et al., 2022).

4. Navigation Tasks and Scenarios

Multiple canonical navigation scenarios are pre-integrated as environment subclasses:

  • Indoor Office (8 × 8 m²): Cluttered layout with procedurally randomized start/goal locations and obstacles.
  • Vineyard Row (Outdoor): Dense, vision-centric task requiring navigation in rows between obstacles, as a proxy for agricultural applications.
  • Person Following: Domestic setting with an omnidirectional robot tracking a dynamic human agent, with both position and orientation shaping the reward structure.

Gym classes enable:

  • Point-to-Point Navigation: Reset randomizes robot and goal, and reward is structured by distance improvement, collision penalty, and goal acquisition.
  • Waypoint Following: State augmented with next waypoints, with composite reward.
  • Obstacle Avoidance and Person Following: Reward function extensions penalize unsafe proximity, with additional penalties based on orientation misalignment during tracking (an illustrative term is sketched below).
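For instance, a person-following shaping term combining position and orientation might look as follows; the target distance and weights are assumptions, not the framework's values.

```python
# Illustrative person-following reward term combining position and
# orientation, per the description above; weights are assumed values.
def following_reward(dist_to_person, heading_error,
                     target_dist=1.0, w_pos=1.0, w_ang=0.5):
    # Penalize deviation from the desired following distance and
    # misalignment between the robot heading and the person bearing.
    return -w_pos * abs(dist_to_person - target_dist) \
           - w_ang * abs(heading_error)
```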

5. Metrics and Benchmarking

PIC4rl-gym outputs evaluation logs compatible with standardized benchmarking. Core metrics include:

  Policy        Success   Time [s]   Path [m]   d/Path   v_mean [m/s]
  TD3 + LiDAR   5/5       24.11      5.84       0.96     0.28
  TD3 + Depth   5/5       19.29      5.77       0.94     0.31
  SAC + LiDAR   5/5       29.36      6.06       0.94     0.25
  SAC + Depth   5/5       16.65      5.74       0.98     0.36

Metrics definitions:

  • Success Rate: $S = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{\{\text{goal reached}\}}$
  • Average Path Length: $L_{\rm avg} = \frac{1}{N} \sum_{i=1}^N L_i$, with $L_i = \sum_{t=1}^{T_i} \|p_t - p_{t-1}\|$
  • Average Time to Goal, Cumulative Reward
  • Clearance Statistics: $d^{(i)}_{\min} = \min_t \operatorname{dist}(p_t, \mathcal{O})$

Additional per-step logging includes heading changes, accelerations, and minimum obstacle distances. Benchmark outputs are produced as CSV/JSON for downstream analysis.
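Such per-episode metrics can be recomputed from logged $(x, y)$ positions; a sketch follows, assuming a fixed control period and interpreting the d/Path column as straight-line distance over traveled path length.

```python
# Sketch of per-episode metric computation from logged positions;
# the fixed control period `dt` and the d/Path interpretation
# (straight-line distance over path length) are assumptions.
import math

def episode_metrics(positions, reached_goal, dt=0.05):
    path = sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))
    duration = dt * (len(positions) - 1)
    straight = math.dist(positions[0], positions[-1])
    return {
        'success': reached_goal,
        'time_s': duration,
        'path_m': path,
        'd_over_path': straight / path if path else 0.0,
        'v_mean': path / duration if duration else 0.0,
    }
```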

6. Customization and Extensibility

PIC4rl-gym’s architecture explicitly supports rapid extensibility:

  • New Robot Platforms: Add URDF/SDF to the gazebo_models directory, define model-specific YAML entries, and set the robot model argument in the launch file.
  • Custom Reward Functions: Subclass the environment and override the reward calculation, then update the launch file to reference the new environment class (see the sketch after this list).
  • Sensor/Policy Swaps: Switch sensor or network configurations via YAML—e.g., changing LiDAR beam count, swapping DDPG for SAC, or moving from MLP to CNN—causing the system to reload and re-instantiate accordingly at launch without recompilation.
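Building on the hypothetical NavEnvironment sketch from Section 1, such a subclass might look like this; the class name and clearance threshold are illustrative.

```python
# Hypothetical subclass overriding the reward calculation; extends
# the NavEnvironment sketch shown in Section 1.
class SafeNavEnvironment(NavEnvironment):
    SAFETY_DIST = 0.5  # assumed clearance threshold [m]

    def _compute_reward(self, obs):
        reward, done = super()._compute_reward(obs)
        # Extra penalty for unsafe proximity, e.g., based on the
        # minimum LiDAR range contained in the observation.
        if obs and min(obs) < self.SAFETY_DIST:
            reward -= 1.0
        return reward, done
```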

This parameterization enables researchers to train, test, and benchmark DRL navigation policies across diverse settings and hardware, all controlled by unified configuration schemas (Martini et al., 2022).

References

  1. Martini, M., Eirale, A., Cerrato, S., & Chiaberge, M. (2022). PIC4rl-gym: A ROS2 Modular Framework for Robots Autonomous Navigation with Deep Reinforcement Learning. arXiv preprint.
