- The paper introduces PRISM, a framework for distilling large language models (LLMs) into small language models (SLMs) to enable on-device robot planning with minimal human intervention.
- PRISM-distilled SLMs achieve over 93% of GPT-4o's planning success rate across diverse robot domains and environments while running efficiently (under 5GB memory, real-time latency) on standard robot hardware.
- The method automates the entire process via synthetic data generation from an LLM, offering scalability and robustness to network failures compared to cloud-dependent LLMs.
Distilling On-device LLMs for Robot Planning with Minimal Human Intervention
This paper introduces PRISM, a framework for distilling small language models (SLMs) from LLM-enabled planners, enabling on-device robot planning with minimal human supervision. The motivation stems from the computational and infrastructural limitations of deploying LLMs such as GPT-4o on physical robots, particularly in environments with unreliable or unavailable network connectivity. PRISM addresses this by automating both the synthesis of training data and the distillation of SLMs that can serve as drop-in replacements for the LLM in existing robot planning pipelines.
Methodology
PRISM operates in three stages: scenario generation, plan elicitation, and planner distillation.
- Scenario Generation: Given the action and observation spaces of an LLM-enabled planner, PRISM uses an LLM to synthesize diverse tasks and textual environment representations (e.g., scene graphs, object sets). The generator is prompted to ensure semantic coherence and adherence to the planner’s input format. This process is fully automated and does not require manual dataset curation or simulators.
- Plan Elicitation: For each synthesized scenario, PRISM interacts with the source LLM-enabled planner to elicit plans. It masks parts of the environment to simulate partial observability, then iteratively queries the planner in a closed-loop fashion, updating observations based on the planner’s actions. This yields a dataset of task-observation-action sequences, with plan validation filtering out invalid or incomplete rollouts (stages one and two are sketched in code after this list).
- Planner Distillation: The collected dataset is used to fine-tune a target SLM via supervised fine-tuning (SFT), minimizing token-level cross-entropy over action predictions (a standard form of this objective is written out below). The process leverages parameter-efficient techniques such as LoRA to reduce training cost, yielding distilled models with a memory footprint under 5GB.
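To make the data-synthesis loop concrete, the sketch below illustrates stages one and two under stated assumptions: the prompts, the JSON scenario schema, the masking and validation helpers, and the scenario count are all invented here for illustration, and only the overall structure (generate a scenario, mask the environment, query the source planner in a closed loop, keep validated rollouts) follows the description above.

```python
# Minimal sketch of PRISM-style data synthesis (stages 1 and 2).
# Prompts, the scenario schema, and the helper functions are illustrative
# assumptions, not the paper's implementation.
import json
import random
from openai import OpenAI

client = OpenAI()  # the source LLM acts as both scenario generator and planner
MODEL = "gpt-4o"

ACTIONS = "goto(region), inspect(object), map_region(region), done()"       # assumed action space
OBSERVATIONS = "JSON scene graph with lists of 'objects' and 'regions'"     # assumed observation format


def generate_scenario() -> dict:
    """Stage 1: synthesize a task plus a textual environment representation."""
    prompt = (
        "You create training scenarios for a robot planner.\n"
        f"Action space: {ACTIONS}\nObservation format: {OBSERVATIONS}\n"
        "Return a JSON object with keys 'task', 'objects', and 'regions' "
        "that are semantically coherent."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)


def mask(scenario: dict, keep: float = 0.5) -> dict:
    """Hide part of the scene to simulate partial observability (placeholder logic)."""
    objs = scenario["objects"]
    visible = random.sample(objs, max(1, int(len(objs) * keep))) if objs else []
    return {"objects": visible, "regions": scenario["regions"]}


def elicit_rollout(scenario: dict, max_steps: int = 10) -> list[dict]:
    """Stage 2: closed-loop plan elicitation from the source planner."""
    observation = mask(scenario)
    hidden = [o for o in scenario["objects"] if o not in observation["objects"]]
    rollout = []
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": f"Task: {scenario['task']}\n"
                           f"Observation: {json.dumps(observation)}\n"
                           "Respond with the single next action.",
            }],
        )
        action = resp.choices[0].message.content.strip()
        rollout.append({"task": scenario["task"],
                        "observation": json.dumps(observation),
                        "action": action})
        if action.startswith("done"):
            break
        if hidden:  # reveal one hidden object per step as a stand-in for real feedback
            observation["objects"].append(hidden.pop())
    return rollout


def validate(rollout: list[dict]) -> bool:
    """Keep only rollouts that terminate cleanly (placeholder check)."""
    return bool(rollout) and rollout[-1]["action"].startswith("done")


dataset = []
for _ in range(1000):  # the number of scenarios is illustrative
    rollout = elicit_rollout(generate_scenario())
    if validate(rollout):
        dataset.extend(rollout)
```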
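The distillation objective in the third stage is ordinary supervised fine-tuning; one standard way to write the token-level cross-entropy loss over the elicited rollouts (the notation here is generic, not necessarily the paper's) is

$$
\mathcal{L}(\theta) \;=\; -\sum_{(\tau,\, o_{1:T},\, a_{1:T}) \in \mathcal{D}} \; \sum_{t=1}^{T} \log p_\theta\!\left(a_t \mid \tau,\, o_{1:t},\, a_{1:t-1}\right),
$$

where $\tau$ is the task description, $o_{1:t}$ the observations seen so far, $a_t$ the planner's next action (itself scored autoregressively token by token), and $\mathcal{D}$ the synthesized dataset.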
Experimental Evaluation
PRISM is evaluated on three LLM-enabled planners across distinct domains:
- SPINE: Language-driven navigation, mapping, and exploration on both ground (UGV) and aerial (UAV) robots in indoor and outdoor environments.
- SayCan: Hierarchical manipulation tasks in a simulated tabletop environment.
- LLM-Planner: Household assistance tasks in the ALFRED simulator, requiring multi-step object manipulation and navigation.
The evaluation compares three configurations: the original LLM-enabled planner (GPT-4o), the same planner with an undistilled SLM (Llama-3.2-3B), and the planner with a PRISM-distilled SLM. The primary metric is planning success rate.
Key Results
- Performance: PRISM-distilled SLMs achieve over 93% of the planning success rate of GPT-4o across all three domains, a substantial improvement over undistilled SLMs, which achieve only 10–20% of LLM performance.
- Efficiency: The distilled SLMs run in real-time on robot hardware, with latency within 200ms of GPT-4o under ideal network conditions and over 1s faster under realistic (high-latency) conditions. This enables deterministic, network-independent planning.
- Generalization: The distilled planners generalize across heterogeneous robotic platforms and diverse environments, including both indoor and outdoor settings.
- Ablation: Removing environment masking or plan validation from the data synthesis pipeline significantly degrades performance, highlighting the importance of interactive, validated data for effective distillation.
Implementation Details
- Data Synthesis: All training data is generated synthetically via LLM prompting, requiring only a high-level configuration of the planner’s action and observation spaces.
- Fine-tuning: SLMs are fine-tuned using the Unsloth library (built on Hugging Face Transformers and PyTorch) with LoRA for parameter-efficient adaptation. Hyperparameters are tuned per domain, with typical settings of 5 epochs, learning rates in the range 1e-4 to 2e-4, and LoRA ranks between 16 and 32 (a representative configuration is sketched after this list).
- Deployment: The resulting SLMs are deployed on standard robot compute platforms (e.g., Nvidia Jetson Orin NX, RTX 4000) with memory footprints under 5GB, enabling on-device inference without reliance on cloud infrastructure (see the inference sketch below).
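For concreteness, the following is a minimal fine-tuning sketch using Hugging Face Transformers with PEFT/LoRA (the paper reports using Unsloth, which wraps these libraries). The epoch count, learning rate, and LoRA rank are taken from the ranges reported above; the checkpoint name, target modules, prompt template, batch size, and placeholder data are assumptions introduced here.

```python
# Hedged sketch of the distillation step: LoRA SFT of a 3B SLM on the synthesized rollouts.
# The paper uses Unsloth; this sketch uses the underlying HF/PEFT APIs directly.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-3B-Instruct"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Wrap the base model with LoRA adapters (rank within the reported 16-32 range).
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))

# Records produced by the synthesis step (one placeholder example shown).
dataset = [{"task": "inspect the loading dock",
            "observation": '{"objects": ["truck"], "regions": ["dock"]}',
            "action": "goto(dock)"}]

def to_text(example):
    """Serialize a task/observation/action record into a training string (assumed template)."""
    return {"text": (f"Task: {example['task']}\n"
                     f"Observation: {example['observation']}\n"
                     f"Action: {example['action']}{tokenizer.eos_token}")}

train = Dataset.from_list(dataset).map(to_text)
train = train.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                  remove_columns=train.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="prism-slm", num_train_epochs=5, learning_rate=2e-4,
        per_device_train_batch_size=4, gradient_accumulation_steps=4,
        bf16=True, logging_steps=20,
    ),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("prism-slm-adapters")  # saves LoRA adapters; merge into the base model before deployment if desired
```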
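The sub-5GB figure is consistent with running a 3B model in reduced precision. The paper's actual runtime stack is not detailed here, so the loading-and-inference sketch below (bitsandbytes 4-bit quantization, an assumed checkpoint path, and an assumed prompt format) should be read as one plausible configuration rather than the deployed system.

```python
# Hedged sketch of on-device inference with a 4-bit quantized distilled SLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

CHECKPOINT = "prism-slm"   # assumed path to the merged, distilled model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(
    CHECKPOINT,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)

def next_action(task: str, observation: str) -> str:
    """One planning step: task and current observation in, next action out."""
    prompt = f"Task: {task}\nObservation: {observation}\nAction:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()

print(next_action("inspect the loading dock", '{"objects": ["truck"], "regions": ["dock"]}'))
```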
Implications and Discussion
PRISM demonstrates that high-quality, on-device SLM planners can be distilled from LLMs using only synthetic data, without manual annotation or simulation. This has several practical implications:
- Scalability: The approach is scalable to new domains and robot platforms, as it requires minimal human input and no real-world data collection.
- Robustness: On-device planners are robust to network failures and can be deployed in unstructured or remote environments.
- Reproducibility: The release of code, models, and datasets facilitates reproducibility and further research.
However, the method inherits certain limitations from the underlying LLMs. The distilled SLMs are constrained by the expressivity of the action and observation spaces and may struggle with tasks requiring complex spatial reasoning or formal action representations (e.g., code generation). Additionally, safety vulnerabilities present in LLM-enabled planners (e.g., unsafe action generation) are likely to persist in the distilled models, necessitating further research into integrated safety mechanisms.
Future Directions
Potential avenues for future work include:
- Improved Data Synthesis: Enhancing the quality and diversity of synthetic scenarios, particularly for tasks requiring advanced spatial or temporal reasoning.
- Safety Integration: Incorporating safety objectives or constraints directly into the distillation process, possibly via reinforcement learning or constitutional AI techniques.
- Broader Action Spaces: Extending the framework to planners with more complex or formal action representations, such as code or temporal logic.
Conclusion
PRISM provides a practical and effective solution for deploying LLM-based robot planners on-device, overcoming the computational and infrastructural barriers of LLMs. By automating data synthesis and distillation, it enables the creation of efficient, high-performing SLMs that can be readily integrated into existing robotic systems, broadening the applicability of language-driven robot planning in real-world settings.