Semantic Task Planning Framework
- Semantic Task Planning Framework is a system that uses hierarchical And-Or Graphs and LSTM networks to generate valid, context-aware sequences of atomic actions.
- It employs an AOG-LSTM to navigate task grammars and an Action-LSTM to sequentially predict actions and objects, achieving up to 93.7% accuracy in complex tasks.
- The framework's two-stage training with data augmentation and curriculum learning effectively overcomes limited annotations, enhancing robustness in real-world environments.
Semantic task planning frameworks are structured systems designed to generate contextually appropriate sequences of atomic actions for intelligent agents, given a particular scene and a specified task. These frameworks combine explicit task knowledge with data-driven learning, enabling agents to generate robust, context-aware action sequences that can handle the complexity and variability of real-world environments. The approach is exemplified by the integration of manually defined knowledge representations (such as And-Or Graphs) with recurrent neural network architectures, specifically Long Short-Term Memory (LSTM) networks. Recent methodologies leverage data augmentation via task grammars and neural sampling to overcome constraints of limited annotations, supporting high-performance, generalizable task planning.
1. Knowledge Representation and And-Or Graphs
A central component in semantic task planning is the explicit modeling of task structure and decomposition. The And-Or Graph (AOG) is adopted as a hierarchical, grammar-like knowledge representation that encodes all valid decompositions of a given task:
- Each task is represented by a root node.
- Non-terminal nodes consist of "and-nodes" (sequential decomposition into subtasks or action primitives) and "or-nodes" (alternative strategies or methods).
- Terminal nodes correspond to atomic action tuples (action, object), e.g., ("move to", "cup").
- Probability distributions are assigned to choices at or-nodes, capturing common-sense likelihoods or frequencies of alternative strategies.
The AOG structure is numerically represented by:
- Numbered nodes for unique identification.
- An adjacency matrix $M$, where entry $M_{ij}$ takes one designated value for "and" edges and a different value for "or" edges.
- One-hot encodings for node types and, at leaves, for primitive actions and associated objects.
This hierarchical model restricts the action sequence generation to only those compatible with commonsense task decomposition, significantly reducing the semantic space and the possibility of generating invalid plans.
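To make the representation concrete, the following is a minimal Python sketch of an AOG with and-nodes, or-nodes (carrying branch priors), and terminal action tuples, together with a depth-first enumeration of the sequences it licenses. The class and function names (`Terminal`, `AndNode`, `OrNode`, `enumerate_sequences`) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple, Union

@dataclass
class Terminal:                      # atomic action tuple, e.g. ("move to", "cup")
    action: str
    obj: str

@dataclass
class AndNode:                       # sequential decomposition: all children, in order
    children: List["Node"] = field(default_factory=list)

@dataclass
class OrNode:                        # alternative strategies: exactly one child is chosen
    children: List["Node"] = field(default_factory=list)
    probs: List[float] = field(default_factory=list)  # common-sense branch priors

Node = Union[Terminal, AndNode, OrNode]

def enumerate_sequences(node: Node) -> List[List[Tuple[str, str]]]:
    """Depth-first expansion of every action sequence the AOG licenses."""
    if isinstance(node, Terminal):
        return [[(node.action, node.obj)]]
    if isinstance(node, AndNode):    # concatenate one expansion per child, in order
        seqs: List[List[Tuple[str, str]]] = [[]]
        for child in node.children:
            seqs = [s + t for s, t in product(seqs, enumerate_sequences(child))]
        return seqs
    # OrNode: the union of the expansions of each alternative branch
    return [s for child in node.children for s in enumerate_sequences(child)]

# Toy task (hypothetical): pouring water, where the container may be a cup or a bottle.
task = AndNode([
    OrNode([Terminal("move to", "cup"), Terminal("move to", "bottle")],
           probs=[0.7, 0.3]),
    Terminal("grasp", "container"),
    Terminal("pour", "water"),
])
print(enumerate_sequences(task))     # two valid sequences, one per or-branch
```

Because every sequence must be a complete derivation of the grammar, invalid orderings simply never appear in the enumeration, which is what shrinks the search space for the downstream networks.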
2. LSTM-Based Action Sequence Generation
The framework utilizes two recurrent neural network modules:
- AOG-LSTM: Processes the parsed AOG for a given task, predicting branch choices at each or-node. Initialized with a feature vector encompassing scene image features (object categories, locations, states) and the task encoding, the AOG-LSTM traverses the graph in depth-first order, recurrently choosing the most probable branch at each or-node. The output is a set of valid action sequences corresponding to concrete methods of task completion.
- Action-LSTM: Given an image and a task, generates the full sequence of atomic actions. The network uses an encoder-decoder schema:
  - The initial hidden state is derived from a concatenated feature vector of the image and task.
  - At each time step, the network ingests the encoding of the previous atomic action.
  - The output comprises two independently predicted distributions, one over primitive actions and one over objects, which improves training stability over joint prediction on sparse datasets.
Loss functions for both networks are negative log-likelihoods of the correct predictions, with Action-LSTM aggregating losses over primitive actions and associated objects separately.
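A minimal PyTorch sketch of this decoder is given below: a single LSTM cell whose initial state is produced from the image-plus-task feature vector, with two independent linear heads over actions and objects, and a loss that aggregates the two negative log-likelihoods separately. Layer names, dimensions, and the one-hot input convention are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionLSTM(nn.Module):
    """Sketch of the encoder-decoder Action-LSTM with two output heads."""

    def __init__(self, feat_dim: int, act_vocab: int, obj_vocab: int, hidden: int = 256):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden)     # image+task features -> h0
        self.init_c = nn.Linear(feat_dim, hidden)     # image+task features -> c0
        # each step ingests the previous atomic action tuple's encoding
        self.cell = nn.LSTMCell(act_vocab + obj_vocab, hidden)
        self.act_head = nn.Linear(hidden, act_vocab)  # distribution over primitive actions
        self.obj_head = nn.Linear(hidden, obj_vocab)  # distribution over objects

    def forward(self, feats: torch.Tensor, prev_tuples: torch.Tensor):
        # feats: (B, feat_dim) concatenated image and task encoding
        # prev_tuples: (B, T, act_vocab + obj_vocab) one-hot previous tuples
        h = torch.tanh(self.init_h(feats))
        c = torch.tanh(self.init_c(feats))
        act_logits, obj_logits = [], []
        for t in range(prev_tuples.size(1)):
            h, c = self.cell(prev_tuples[:, t], (h, c))
            act_logits.append(self.act_head(h))
            obj_logits.append(self.obj_head(h))
        return torch.stack(act_logits, 1), torch.stack(obj_logits, 1)

def sequence_loss(act_logits, obj_logits, act_gt, obj_gt):
    """Negative log-likelihood aggregated separately over actions and objects."""
    return (F.cross_entropy(act_logits.flatten(0, 1), act_gt.flatten())
            + F.cross_entropy(obj_logits.flatten(0, 1), obj_gt.flatten()))
```

Keeping the two heads independent means a correct action prediction still earns full credit when the object is wrong (and vice versa), which is the stated reason this factorization trains more stably on sparse data than a joint head over all (action, object) pairs.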
3. Two-Stage Training Regime and Sample Augmentation
Training deep task planning models suffers from data scarcity due to the combinatorial diversity of action decompositions and orderings. The framework addresses this systematically:
- Stage 1: Train the AOG-LSTM on a modest, hand-annotated corpus, with ground-truth or-node selections as supervision.
  - The loss is the negative log-likelihood $\mathcal{L}_{\text{AOG}} = -\sum_{i}\sum_{j} \log p(b_{i,j})$, where $b_{i,j}$ is the correct branch at the $j$-th or-node for sample $i$.
- Stage 2: Use the trained AOG-LSTM to generate a large, automatically augmented set of valid action sequences from the AOG. This augmented dataset, combined with the original, is used to train the Action-LSTM via curriculum learning:
  - Samples are ranked by uncertainty (mean entropy across or-node predictions); training begins with low-uncertainty ("easy") samples and gradually moves to higher-uncertainty examples.
  - The training objective is $\mathcal{L}_{\text{Action}} = -\sum_{i:\, u_i < \tau} \sum_{t} \big[\log p(a_{i,t}) + \log p(o_{i,t})\big]$, summed over all augmented samples $i$ whose uncertainty $u_i$ lies below a threshold $\tau$, where $a_{i,t}$ and $o_{i,t}$ are the ground-truth primitive action and object at step $t$.
This strategy leverages explicit knowledge to generate diverse training data, substantially improving the coverage and robustness of the learned model.
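The uncertainty ranking itself is simple to implement. The sketch below scores each augmented sample by the mean entropy of the AOG-LSTM's branch distributions along its parse, filters by the threshold $\tau$, and returns an easy-to-hard ordering; the `(sequence, branch_distributions)` pairing is an assumed data layout for illustration.

```python
import numpy as np

def mean_ornode_entropy(branch_probs):
    """Uncertainty of one augmented sample: mean entropy of the AOG-LSTM's
    predicted branch distributions at the or-nodes along its parse."""
    ents = [-(p * np.log(p + 1e-12)).sum() for p in branch_probs]
    return float(np.mean(ents))

def curriculum_order(samples, tau):
    """Rank augmented samples easy-to-hard, dropping those above threshold tau.
    `samples` is a list of (sequence, [or-node branch distribution]) pairs."""
    scored = [(mean_ornode_entropy(probs), seq) for seq, probs in samples]
    kept = [(u, seq) for u, seq in scored if u < tau]
    return [seq for _, seq in sorted(kept, key=lambda x: x[0])]
```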
4. Empirical Validation and Dataset Construction
The framework is evaluated on a purpose-built dataset comprising 1,284 scene images, covering 15 daily activities in varied environments (labs, kitchens, offices, bedrooms). Each image is annotated with:
- Object categories, spatial locations, and states.
- Task decompositions represented via the corresponding AOG.
The action vocabulary comprises 12 primitive actions and 12 object types; a "task fail" action is also included for completeness.
Comparisons against baselines (Nearest Neighbor, Multi-Layer Perceptron, plain RNN) are performed using several accuracy metrics: action prediction, object prediction, atomic action tuple accuracy, and full sequence accuracy.
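As a concrete reading of these four metrics, the hypothetical scoring function below computes per-step action and object accuracy, atomic tuple accuracy, and exact full-sequence accuracy, assuming predictions are step-aligned lists of (action, object) tuples; this framing is an assumption, not the paper's evaluation code.

```python
def evaluate(pred_seqs, gt_seqs):
    """Action, object, atomic-tuple, and full-sequence accuracy over a test set.
    Each sequence is a list of (action, object) tuples, aligned step by step."""
    act = obj = tup = steps = seq_hits = 0
    for pred, gt in zip(pred_seqs, gt_seqs):
        seq_hits += int(pred == gt)              # full sequence must match exactly
        for (pa, po), (ga, go) in zip(pred, gt):
            steps += 1
            act += int(pa == ga)                 # primitive action correct
            obj += int(po == go)                 # object correct
            tup += int((pa, po) == (ga, go))     # both correct: atomic tuple hit
    return {"action": act / steps, "object": obj / steps,
            "tuple": tup / steps, "sequence": seq_hits / len(gt_seqs)}
```

Note that sequence accuracy is the strictest of the four, since a single wrong step fails the whole plan; this is why it is the headline number for end-to-end task planning.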
Key reported findings:
- The framework achieves atomic action sequence accuracy of approximately 93.7% on 12 tasks when AOG augmentation is used.
- For tasks related to those in the annotated set, training on AOG-generated instances can boost sequence accuracy by as much as 63% over models trained without AOG guidance.
- Each component (AOG structure, curriculum learning, separate action/object prediction) is validated via ablation analysis as essential for peak performance.
5. Integration of Symbolic and Sub-Symbolic Methods
This approach exemplifies the integration of symbolic task knowledge (AOGs) with sub-symbolic, data-driven sequence modeling (LSTM networks):
- The explicit task grammar induces structured constraints, ensures only physically/semantically valid action sequences are considered, and supports sample-efficient generalization.
- The neural architecture learns compositional, context-sensitive action ordering and selection, handling the variability in visual scenes and task objectives.
- This hybridization balances interpretability, sample efficiency, and the ability to generalize to unobserved but structurally similar tasks, addressing key challenges in the transition from symbolic AI to modern deep learning.
6. Applications, Limitations, and Future Directions
By generating detailed action sequences for both familiar and novel visual scenes and task queries, the framework supports real-world robotic planning in settings such as household service robotics, experimental manipulation, and complex human-robot interaction.
Identified limitations include:
- Reliance on hand-crafted scene features rather than end-to-end learned representations.
- Manual specification of AOG task grammars, which restricts scalability and adaptability to previously unseen task types.
Future work is oriented towards:
- Automated learning of task grammars from large-scale behavioral data.
- End-to-end training and integration with raw perception (scene understanding, object recognition).
- Deployment in more dynamic, interactive, or real-world simulation settings (e.g., AI2-THOR), potentially in closed-loop operation with sensor feedback.
Such advancements would further solidify semantic task planning as a foundational technology for general-purpose, intelligent autonomous systems.