Robot Person Following (RPF)
- Robot Person Following is defined as the autonomous ability of mobile robots to detect, track, and follow designated human targets in shared environments.
- It employs diverse sensor modalities and fusion techniques—from cameras and lasers to sonar and IMUs—to achieve real-time perception and robust navigation.
- Advanced RPF systems integrate sophisticated planning, deep learning, and social interaction models to manage occlusions, multi-human scenarios, and dynamic movements.
Robot Person Following (RPF) denotes the autonomous capability of mobile robots to perceive, track, and maintain appropriate movement with respect to one or more human targets in shared environments. This functionality is foundational for applications ranging from domestic service and healthcare to industrial, underwater, and aerial robotics, and spans a variety of operational contexts and sensing modalities. RPF is characterized by the requirements of real-time perception, reliable target re-identification, robust navigation in dynamic or unstructured scenes, adherence to social norms (proxemics), and the ability to handle long-term interactions and occlusions.
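As a concrete illustration of proxemics-driven following, the sketch below computes a goal pose a fixed distance behind a tracked person. The function name, default distance, and geometry are illustrative assumptions for this article, not a prescribed method from the literature.

```python
import math

def following_goal(px, py, ptheta, distance=1.2, angle_offset=math.pi):
    """Compute a robot goal pose that keeps a proxemic distance from a person.

    The goal is placed `distance` metres from the person, at `angle_offset`
    radians relative to the person's heading (pi = directly behind).
    Defaults are illustrative, not values from the survey.
    """
    gx = px + distance * math.cos(ptheta + angle_offset)
    gy = py + distance * math.sin(ptheta + angle_offset)
    # Orient the robot so it faces the person from the goal position.
    gtheta = math.atan2(py - gy, px - gx)
    return gx, gy, gtheta
```

For a person at the origin heading along +x, this places the robot 1.2 m behind, facing the person; a real planner would additionally check the goal for collisions and social-space violations.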
1. Taxonomy of Person-Following Methods
RPF systems are primarily categorized by context of operation, sensing strategy, interaction mode, autonomy level, and underlying perception models (Islam et al., 2018). Key differentiating criteria include:
- Operational Domain: Ground (UGV), underwater (AUV/ROV), aerial (UAV). Ground robots, such as shopping assistants or domestic robots, leverage structured environments, whereas underwater and aerial robots contend with unique sensing, communication, and mobility constraints.
- Sensor Modality and Fusion: Sensor choices range broadly from monocular/stereo/RGB-D cameras and laser rangefinders (ground robots) to sonar and hydrophones (underwater) and multi-modal suites including IMUs and thermal/infrared sensors (aerial). The choice of primary sensors largely dictates downstream perception and planning architectures.
- Interaction Mode: Systems can incorporate explicit user interactions (voice, hand gestures, AR tags, smartphones) or leverage implicit interaction via robot behaviors. The granularity may target single or group following.
- Autonomy Level: Fully autonomous systems are more feasible on ground due to ease of 2D navigation; semi-autonomous approaches are employed in underwater/aerial domains due to the complexity of 3D environments and communication constraints.
- Perception Pipeline: Includes both model-based (color segmentation, template matching) and model-free (feature tracking, machine learning, deep CNNs: e.g., SSD, YOLO, OpenPose) methods; feature selection (e.g., the mean-shift tracking update $y_{t+1} = \frac{\sum_i x_i\, K(x_i - y_t)}{\sum_i K(x_i - y_t)}$, which moves the estimate to the kernel-weighted mean of nearby samples) and the capability to handle appearance variations and partial occlusions are crucial design choices.
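The mean-shift tracking update mentioned above can be sketched in a few lines: the tracker repeatedly moves its estimate to the mean of nearby feature points until it converges on a mode. The flat-kernel, 2-D formulation below is a simplified illustration, not the exact formulation of any surveyed system.

```python
def mean_shift_step(points, y, bandwidth):
    """One mean-shift update: move estimate y to the mean of all points
    within `bandwidth` of it (a flat/uniform kernel)."""
    near = [(px, py) for (px, py) in points
            if (px - y[0]) ** 2 + (py - y[1]) ** 2 <= bandwidth ** 2]
    if not near:
        return y  # no support under the kernel: keep the previous estimate
    mx = sum(p[0] for p in near) / len(near)
    my = sum(p[1] for p in near) / len(near)
    return (mx, my)

def mean_shift_track(points, y0, bandwidth=1.0, iters=10, tol=1e-6):
    """Iterate mean-shift steps until the estimate stops moving."""
    y = y0
    for _ in range(iters):
        y_new = mean_shift_step(points, y, bandwidth)
        if (y_new[0] - y[0]) ** 2 + (y_new[1] - y[1]) ** 2 < tol ** 2:
            break
        y = y_new
    return y
```

In a real tracker the "points" would be pixels weighted by how well they match the target's appearance model (e.g., a color histogram), and the converged mode gives the target's new image location.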
2. Domain-Specific Operational Challenges
Distinct challenges arise depending on the RPF deployment scenario (Islam et al., 2018):
- Ground Robots: Contend with unreliable sensory input under lighting variation, dynamic multi-human interaction, the complexity of fusing sensors to handle occlusions, and the need for socially acceptable navigation (spacing, approach angles).
- Underwater Robots: Experience poor visibility, color distortion, degraded communication and the absence of GPS, and heightened safety demands. Robust diver-following necessitates fusing vision with acoustic tracking and leveraging the periodicity of human swimming motion.
- Aerial Robots (UAVs): Are limited by battery and weight, require agile 3D planning under dynamically changing perspectives and occlusions, and must perform obstacle avoidance in unstructured outdoor environments, often necessitating stabilization sensors.
Common across all domains are the requirements for robustness to occlusion, re-identification of targets, and reliable navigation in dynamic, cluttered environments.
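One way to exploit the periodicity of human swimming motion, as underwater diver-followers do, is to look for a dominant frequency in a tracked motion signal. The naive DFT sketch below is an illustrative baseline; the signal source and sampling setup are hypothetical.

```python
import cmath
import math

def dominant_frequency(signal, sample_rate):
    """Estimate the dominant oscillation frequency (Hz) of a 1-D signal
    via a naive discrete Fourier transform. Illustrates using motion
    periodicity (e.g., a diver's flipper gait) as a tracking cue."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):  # positive frequency bins only
        coef = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        if abs(coef) > best_mag:
            best_k, best_mag = k, abs(coef)
    return best_k * sample_rate / n
```

A diver-following system could apply this to, say, the vertical image coordinate of the tracked diver over a sliding window: a stable peak near typical gait frequencies helps confirm the target and reject static distractors.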
3. State-of-the-Art Approaches
Recent advances span the RPF perception, planning, control, and interaction pipeline (Islam et al., 2018):
- Perception: Transitioned from color/shape-based segmentation and particle filters to machine learning (e.g., HOG+SVM, AdaBoost) and deep convolutional detectors (YOLO, SSD, OpenPose). CNN-based trackers deliver state-of-the-art accuracy but are computationally intensive, posing challenges for embedded real-time deployment.
- Planning and Control: Classical map-assisted planners (occupancy grids, A*, D*, PRMs) are employed alongside target-centric planners and SLAM-based localization. Sensor fusion using Kalman, EKF, or UKF is standard. Image-based servoing and, increasingly, deep reinforcement learning approaches (including end-to-end variants) are explored.
- Interaction: Explicit modalities use voice/speech, gestures, or AR markers, while implicit behaviors involve dynamic proxemic adaptation and socially meaningful conduct. Recent studies formalize preferred spatial relations and integrate modality fusion (e.g., vision + haptic feedback).
- Learning-Based Advances: Imitation learning and deep RL (e.g., D4PG in LBGP (Nikdel et al., 2020)) allow systems to estimate user intent and predict short-term navigational goals, increasing performance and generalization, particularly when curriculum learning or hybrid goal-planning architectures are adopted.
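The Kalman-style sensor fusion noted above can be illustrated with a minimal scalar filter that smooths noisy measurements of the person's position. The random-walk process model and noise parameters below are illustrative assumptions, not values from the survey.

```python
def kalman_step(x, p, z, q=1e-3, r=0.05):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: prior state estimate and its variance
    z:    new noisy measurement (e.g., range to the followed person)
    q, r: process and measurement noise variances (illustrative values)
    """
    p = p + q                # predict: uncertainty grows under a random walk
    k = p / (p + r)          # Kalman gain: how much to trust the measurement
    x = x + k * (z - x)      # update: blend prediction toward measurement
    p = (1 - k) * p          # posterior uncertainty shrinks after the update
    return x, p

# Fuse a stream of noisy range readings to the followed person.
x, p = 0.0, 1.0  # vague prior: unknown position, high variance
for z in [1.2, 1.1, 1.3, 1.15, 1.25]:
    x, p = kalman_step(x, p, z)
```

Production RPF systems use the same predict/update structure with multi-dimensional state (position, velocity) and EKF/UKF variants to handle nonlinear sensor models.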
4. Comparative Analysis and Trade-offs
RPF solutions are compared across multiple axes (Islam et al., 2018):
| Method Type | Perception Robustness | Planning Optimality | Social/Interactive Suitability |
|---|---|---|---|
| Classical Features | Fast, but sensitive to occlusion | Reactive, sub-optimal in dynamic scenes | Limited social awareness |
| Machine Learning | Accurate, moderate compute requirements | Improved anticipation | Moderate; explicit interaction possible |
| Deep CNN | Highest accuracy, slow on edge devices | Autonomous, can integrate RL | Adaptable; needs further user studies |
Key observations include:
- Feature-based trackers trade off between speed and robustness to occlusion; CNN/detector-based paradigms outperform in accuracy but require more compute.
- Map-based planning excels in static environments; anticipation and local reactive planners fare better under dynamic conditions.
- Socially aware planners/adaptive interaction models are required for real-world deployment, as simplistic geometric following can produce user discomfort.
- Practicality depends not only on empirical tracking/accuracy metrics but also on real-time capability on embedded platforms and support for multi-human tracking and group-following scenarios.
5. Application Domains and Use-Cases
The literature details a diversity of RPF applications (Islam et al., 2018):
- Domestic and Service Robotics: Shopping assistants, personal or hospital helpers, museum/tour guides.
- Underwater Robotics: Diver following for intervention, monitoring, and search-and-rescue.
- Aerial Robotics: UAVs in sports filming, surveillance, or disaster response.
- Convoying/Multi-agent: Multi-robot systems for group or team following, with applications in industrial or exploratory contexts.
Modern RPF systems increasingly demand flexibility across these domains, necessitating modular architectures and adaptable sensor fusion pipelines.
6. Open Problems and Research Frontiers
Several unresolved research challenges remain (Islam et al., 2018):
- Multi-person/Team RPF: Robust systems for simultaneous group following, requiring advances in multi-target perception, large-scale data association, and coordinated motion planning.
- Convoying/Coordination: Coherent behavior in multi-robot leader–follower configurations, especially under dynamic and adversarial conditions.
- Following Position and Intent Prediction: Algorithms that dynamically select following or leading positions (useful, e.g., in filming or assistive scenarios) require predictive models of human motion.
- Learning From Demonstration and Few-Shot Learning: Moving from data-hungry end-to-end and imitation learning approaches toward effective, data-efficient systems remains an open problem.
- Rich Human–Robot Communication: Progress toward flexible, context-aware communication interfaces using a broader array of cues (gaze, gesture, proxemics) is ongoing.
- Social and Spatial Awareness: Incorporating accurate models of social norms, personal space, and cultural behaviors into navigation modules.
- Long-term Adaptation: Systems capable of persistent, longitudinal adaptation to user habits are needed, particularly for assistive and healthcare robots.
- Computing Constraints and Edge Deployment: Enabling real-time, deep learning-based approaches on battery- and resource-constrained mobile robots remains a limiting factor.
- Privacy and Safety Concerns: With robots operating in public spaces and private domains, addressing privacy and safety via both technical and regulatory means is increasingly vital.
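As a minimal baseline for the intent-prediction problem listed above, the sketch below extrapolates a person's future position under a constant-velocity assumption; practical systems would replace this with learned motion models conditioned on context.

```python
def predict_position(track, horizon, dt):
    """Constant-velocity extrapolation of a person's future position.

    track:   list of (x, y) positions, most recent last (hypothetical input)
    horizon: how far into the future to predict, in seconds
    dt:      time step between the last two track entries, in seconds
    """
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt  # finite-difference velocity
    return (x1 + vx * horizon, y1 + vy * horizon)
```

Even this trivial predictor lets a follower pre-position itself (e.g., ahead of the person when leading, or at a filming vantage point) instead of reacting purely to the current pose, which is the core idea behind the goal-prediction architectures discussed above.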
7. Synthesis and Prospective Directions
The field of Robot Person Following has matured significantly, with a spectrum of methods now supporting robust perception, planning, and interaction in a wide variety of static and dynamic settings. Nevertheless, scalable deployment requires ongoing progress in robust occlusion/re-identification handling, adaptive social behavior, long-term autonomy, and efficiency on embedded platforms. Integration of hybrid planning (model-based with learned components), context-sensitive interaction, and modular software architectures is anticipated to accelerate real-world adoption. Continued empirically grounded user studies, especially with sensitive populations (e.g., older adults) and in diverse operational contexts, will drive advances in both the effectiveness and acceptance of RPF systems.