
Can Large Language Models Generalize Procedures Across Representations?

Published 3 Feb 2026 in cs.CL and cs.LG | (2602.03542v1)

Abstract: LLMs are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.

Summary

  • The paper shows that LLMs trained solely on symbolic representations largely fail to generalize procedures to natural language, despite isomorphic task structures.
  • The proposed two-stage curriculum trains first on symbolic data, then adapts to natural language with RL, yielding substantial cross-representation improvements.
  • Empirical analysis indicates that structural analogy, rather than frequency matching, underlies successful procedural transfer, highlighting fundamental limits in current LLM training paradigms.

Generalization of Procedures Across Representations in LLMs

Introduction

This paper investigates the ability of LLMs to generalize procedural knowledge across distinct representations: natural language (NL), symbolic graphs, and code. The central question is whether LLMs trained on symbolic forms such as code or graphs can reliably transfer the learned abstract procedures to analogous tasks expressed in NL, which remains the dominant interface for real-world users. Through a controlled isomorphic task setting, the paper reveals systematic failures in cross-representation generalization and introduces a two-stage data curriculum as a remedy. Analysis further suggests that cross-representation success in LLMs is best understood as generative analogy rather than frequency-based pattern matching.

Experimental Framework

The study employs asynchronous planning tasks, where procedural steps, time durations, and constraints are presented in NL, graph (adjacency list), and code representations. Each representation is structurally isomorphic, sharing the same underlying algorithmic requirements (e.g., finding the critical path in a directed acyclic graph). The paper tests several model families (Qwen, Llama-3, OLMo-2) using supervised fine-tuning (SFT), knowledge distillation, Self-Taught Reasoner (STaR), and Group Relative Policy Optimization (GRPO, RL).
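The planning tasks reduce to computing the critical (longest-duration) path in a DAG of steps. As an illustration of that underlying algorithm only (not the paper's code; the step names and durations below are invented), a minimal sketch:

```python
def critical_path(durations, deps):
    """Minimum total time for a schedule where independent steps
    may run in parallel, i.e. the length of the DAG's critical path.

    durations: {step: time}; deps: {step: [prerequisite steps]}.
    """
    finish = {}  # memoized earliest finish time per step

    def ef(step):
        if step not in finish:
            prereqs = deps.get(step, [])
            # a step can start only after all its prerequisites finish
            finish[step] = durations[step] + max((ef(p) for p in prereqs), default=0)
        return finish[step]

    return max(ef(s) for s in durations)

# Toy schedule: boiling (5) and chopping (3) overlap; cooking (10) needs both.
durations = {"boil": 5, "chop": 3, "cook": 10}
deps = {"cook": ["boil", "chop"]}
print(critical_path(durations, deps))  # 15
```

The three representations in the paper encode exactly this structure: the NL version describes the steps in prose, the graph version gives the adjacency list, and the code version expresses the same dependencies programmatically.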

Models are trained on a single representation and then evaluated on all three, so that performance differences directly reflect limitations in cross-representation transfer. Performance is measured primarily by accuracy on unseen representations, with statistical significance assessed via McNemar's test.
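The paired-model comparisons can be run with a standard McNemar computation over per-item correctness; a minimal sketch (the exact variant used in the paper, e.g. exact vs. continuity-corrected, is an assumption):

```python
import math

def mcnemar(correct_a, correct_b):
    """McNemar's test on paired per-item correctness of two models.

    correct_a, correct_b: booleans over the same test items.
    Returns (chi2, p) using the continuity-corrected statistic,
    referred to a chi-squared distribution with 1 degree of freedom.
    """
    b = sum(a and not o for a, o in zip(correct_a, correct_b))   # A right, B wrong
    c = sum(not a and o for a, o in zip(correct_a, correct_b))   # A wrong, B right
    if b + c == 0:
        return 0.0, 1.0  # models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2))  # survival fn. of chi^2, 1 dof
    return chi2, p
```

Only the discordant pairs (items where exactly one model is correct) enter the statistic, which is what makes the test appropriate for paired accuracy comparisons on a shared test set.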

Empirical Findings

Limitations of Cross-Representation Generalization

LLMs trained exclusively on symbolic representations (code or graph) perform well within those domains but fail to generalize procedures to NL formulations, despite identical underlying structure. This failure persists across architectures, scales, and post-training methods. Notably, GRPO (RL) achieves the highest in-domain accuracy but shows pronounced degradation when tested on an unseen representation, suggesting reliance on shallow schema exploitation rather than robust structural abstraction.

Training solely on NL yields the best NL test performance but scales inefficiently: small models struggle to reach high accuracy even when trained directly on the target representation. Scaling model size does not close the generalization gap; larger models often exhibit sharper performance drops under distributional shift.

Effectiveness of the Two-Stage Curriculum

To address these generalization failures, the paper proposes a curriculum: initial training on symbolic data (graph or code) for procedural induction, followed by NL adaptation. This regimen yields significant improvements in cross-representation performance. A 1.5B Qwen model trained with the curriculum closely matches zero-shot GPT-4o and outperforms both its larger 7B variant and a 3B model trained only on NL, under an identical training budget.

Order is critical: reversing the curriculum (NL → symbolic) eliminates the gains. RL also proves superior to SFT for the curriculum's adaptation phase.

Robustness and Domain Extension

Curriculum-trained models also generalize more robustly to dialectal variants (NL-AAVE), outperforming baselines and zero-shot frontier models without any explicit exposure to the dialect. This implies the curriculum equips LLMs with structural rather than purely surface-level knowledge.

Applying the same approach to higher-complexity math and physics tasks shows similar benefits: even with comparable or slightly lower in-representation accuracy, curriculum-trained models generalize better to novel domains.

Analytic Interpretation

The paper tests two competing explanations of generalization: an analogy-based and a frequency-based hypothesis. Using structure-mapping metrics, analogical strength (structural similarity between task DAGs) consistently predicts generalization success better than frequency effects (the sheer number of similar training instances). Frequency-based explanations fail to account for transfer between symbolic and NL tasks, while analogical similarity correlates with curriculum-enabled transfer.
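To make the analogy-based hypothesis concrete, a label-free structural comparison of two DAGs might look like the toy proxy below (the paper's actual structure-mapping metric is more sophisticated; this sketch only illustrates the idea that analogical strength ignores surface labels and compares structure):

```python
from collections import Counter

def degree_profile(edges):
    """Multiset of (in-degree, out-degree) pairs over a DAG's nodes."""
    indeg, outdeg = Counter(), Counter()
    nodes = set()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        nodes.update((u, v))
    return Counter((indeg[n], outdeg[n]) for n in nodes)

def structural_similarity(edges_a, edges_b):
    """Similarity in [0, 1]: overlap of degree profiles.
    Invariant to node names, so two isomorphic DAGs score 1.0
    regardless of whether their labels come from NL, graph, or code."""
    pa, pb = degree_profile(edges_a), degree_profile(edges_b)
    shared = sum((pa & pb).values())  # multiset intersection
    total = max(sum(pa.values()), sum(pb.values()))
    return shared / total if total else 1.0
```

Under such a metric, a cooking plan and a code snippet encoding the same dependency chain score identically, whereas a frequency-based account would treat them as unrelated surface forms.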

Qualitative analysis shows that models trained only on symbolic data often default to naive or additive reasoning in NL, while curriculum-trained models execute systematic path enumeration and selection, reflecting explicit procedural reasoning.

Implications and Future Directions

The findings highlight that LLMs do not innately bridge representational gaps even when tasks are procedurally identical, exposing fundamental limitations in abstract knowledge induction via popular post-training methods. Reliance on symbolic data in pre-training (code, graph, etc.) without targeted adaptation is insufficient for robust naturalistic performance.

Curriculum strategies that induce symbolic abstraction prior to NL adaptation offer scalable solutions. These principles may extend to a wider spectrum of domains requiring compositional reasoning, procedural transfer, or alignment with user-facing NL interfaces.

The observed divergence between LLM and human analogical learning (humans generalize across forms with minimal exposure, while LLMs require extended, structured training) underscores gaps in current architectures and training paradigms. Closing this gap may require advances in explicit relational representations, structural induction, and analogical mapping in neural models.

Conclusion

This paper demonstrates that LLMs trained solely on symbolic data do not reliably generalize procedural knowledge to natural language problems. A two-stage curriculum (symbolic induction followed by RL-based NL adaptation) recovers and significantly enhances cross-representation generalization. Structural analogy, not frequency effects, underpins procedural transfer, suggesting generative analogical reasoning as a central mechanism for robust LLM generalization. Effective cross-representation learning in LLMs thus requires dedicated curricula, in contrast with the efficiency of human analogical cognition. Future work should refine curriculum design, improve architectures for structural induction, and explore broader transfer across diverse reasoning domains.


Reference:

"Can Large Language Models Generalize Procedures Across Representations?" (arXiv:2602.03542)
