LegoGPT: Spatial Assembly & Reasoning
- LegoGPT is a conceptual framework that assesses large language models’ abilities to perform spatial assembly tasks under strict LCL rules.
- The framework reveals that despite reciting rules accurately, current models struggle with generating valid assemblies due to misapplied interlocking constraints.
- Benchmark results show significant performance gaps: GPT-4 produced only 16 valid constructions out of 400 attempts, while GPT-3.5 produced none.
LegoGPT refers to the conceptual framework and empirical assessment of LLMs’ abilities to perform structured spatial reasoning and assembly tasks inspired by LEGO construction, as examined through the lens of the LEGO Connect Language (LCL) benchmark. The critical focus is on the limitations of current GPT models when tasked with geometric planning, spatial logic, and strict rule compliance, as distinct from conventional textual or conversational competencies.
1. Formalization of LEGO Connect Language (LCL)
LEGO Connect Language (LCL) is a formal system for specifying two-dimensional LEGO assemblies under precise constraints. In the LCL₂ setting, each LEGO brick is defined as a tuple

$$b = (l, w, x, y, c, h)$$

where $l$ (length, fixed at 4 units), $w$ (width, fixed at 2 units), $(x, y)$ (position coordinates), $c$ (color), and $h$ (height, fixed at 1 unit) characterize each piece. The construction is valid within LCL₂ only if two principal conditions are satisfied:
- No overlapping of pieces: no two pieces occupy the same $y$-coordinate with any overlapping $x$-span.
- All pieces are interconnected via interlocking pegs, ruling out mere side contact.
Tasks for a system under LCL include (i) validating assembly legality with respect to these rules and (ii) generating sets of brick coordinates that respect the outlined constraints, such as building a specific shape or fulfilling a natural language prompt.
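No reference implementation accompanies the description above, but both LCL₂ conditions are mechanically checkable. The following minimal Python sketch (all names illustrative) reduces a brick to its $(x, y)$ placement plus color; in the 2D side view, the fixed 2-unit width and 1-unit height drop out, leaving only the 4-unit $x$-span and the row index $y$:

```python
from collections import deque
from dataclasses import dataclass

# Fixed LCL2 brick length (per the tuple definition above).
LENGTH = 4  # studs along the x-axis

@dataclass(frozen=True)
class Brick:
    x: int      # left edge of the brick's x-span
    y: int      # row (height level) the brick sits on
    color: str  # ignored by the geometric checks

def x_overlap(a: Brick, b: Brick) -> bool:
    """True if the x-spans [x, x + LENGTH) of the two bricks intersect."""
    return a.x < b.x + LENGTH and b.x < a.x + LENGTH

def interlocked(a: Brick, b: Brick) -> bool:
    """Pegs engage only when bricks sit on adjacent rows with shared x-span.
    Mere side contact (same row, touching edges) does not count."""
    return abs(a.y - b.y) == 1 and x_overlap(a, b)

def is_valid_assembly(bricks: list[Brick]) -> bool:
    # Rule 1: no two bricks on the same row with overlapping x-spans.
    for i, a in enumerate(bricks):
        for b in bricks[i + 1:]:
            if a.y == b.y and x_overlap(a, b):
                return False
    # Rule 2: every brick reachable through interlocking pegs
    # (BFS over the interlock graph must visit all bricks).
    if not bricks:
        return True
    seen, frontier = {0}, deque([0])
    while frontier:
        i = frontier.popleft()
        for j, b in enumerate(bricks):
            if j not in seen and interlocked(bricks[i], b):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(bricks)
```

For example, a three-brick staircase `[Brick(0, 0, "red"), Brick(2, 1, "blue"), Brick(4, 2, "green")]` passes both checks, while two bricks placed edge to edge on the same row fail the connectivity test: touching sides provide no peg engagement.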
2. Benchmarking GPT Models Using LCL
Assessment of GPT-3.5 and GPT-4 on LCL tasks reveals profound challenges in spatial reasoning. Experimental tasks are bifurcated into:
- Validity testing: Models must determine whether a proposed assembly conforms to LCL rules.
- Construct generation: Models must output a coordinate list for bricks that form a valid assembly per the prompt.
Key findings include:
| Model | Validity Testing (Best-Case) | Valid Construct Generation |
|---|---|---|
| GPT-3.5 | Roughly linear improvement with temperature, but performance remains poor | 0/400 valid constructions |
| GPT-4 | Performance peaked at temperature 0.5, then degraded | 16/400 valid constructions |
Failure cases consistently result from misapplication of “interlocking” constraints and frequent overlaps, even as models can recite the rules in natural language. The results indicate that simply describing rules is easier for these models than adhering to them during generative assembly tasks.
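The benchmark's exact prompt and response schema is not reproduced here, but a scoring harness of the kind behind figures such as 16/400 can be sketched as follows, reusing `Brick` and `is_valid_assembly` from above; the `(x, y, color)` reply format is an assumption for illustration:

```python
import re

def parse_bricks(reply: str) -> list[Brick] | None:
    """Extract (x, y, color) triples from a model reply; None if malformed.
    The expected output format is an assumption -- the benchmark's exact
    response schema is not reproduced here."""
    triples = re.findall(r"\(\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(\w+)\s*\)", reply)
    if not triples:
        return None
    return [Brick(int(x), int(y), c) for x, y, c in triples]

def validity_rate(replies: list[str]) -> float:
    """Fraction of replies that parse and satisfy both LCL2 rules,
    mirroring scores such as 16/400."""
    valid = 0
    for reply in replies:
        bricks = parse_bricks(reply)
        if bricks is not None and is_valid_assembly(bricks):
            valid += 1
    return valid / len(replies)
```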
3. Interpretation of Strategic and Spatial Reasoning Requirements
Tasks within the larger ChildPlay benchmark, and the LCL task in particular, necessitate robust spatial cognition and procedural planning. Unlike traditional board games (e.g., Tic-Tac-Toe, Connect Four), where performance can be measured through missed wins and move heatmaps, LCL exposes the inability of LLMs to plan valid configurations under geometric and combinatorial constraints. For instance, even a simple "line" of three connected bricks must meet interlocking criteria not satisfied by mere adjacency: each successive brick must sit on an adjacent row with an overlapping stud span. One such placement rule is the staircase recursion $x_{i+1} = x_i + 2,\; y_{i+1} = y_i + 1$, with closed form $x_i = x_1 + 2(i-1),\; y_i = y_1 + (i-1)$.
These constraints impose requirements exceeding mere token prediction, demanding a combinatorial exploration of valid spatial layouts and the translation of abstract prompts to rule-consistent coordinate outputs.
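To make the combinatorial burden concrete, here is a minimal backtracking sketch (names illustrative) that enumerates valid assemblies by growing one brick at a time from a seed, restricting candidates to placements that could interlock with an existing brick:

```python
def extend(assembly: list[Brick], remaining: int,
           found: list[list[Brick]]) -> None:
    """Depth-first enumeration of valid assemblies, one brick at a time.
    Candidates are limited to offsets that could interlock with an existing
    brick, which keeps the branching factor finite. Different orderings of
    the same brick set are counted separately; deduplicate with frozensets
    if distinct shapes are needed."""
    if remaining == 0:
        found.append(list(assembly))
        return
    candidates = {
        Brick(b.x + dx, b.y + dy, "any")
        for b in assembly
        for dx in range(-(LENGTH - 1), LENGTH)  # any x-offset with span overlap
        for dy in (-1, 1)                       # must land on an adjacent row
    }
    for cand in candidates:
        assembly.append(cand)
        if is_valid_assembly(assembly):
            extend(assembly, remaining - 1, found)
        assembly.pop()

# Count valid three-brick assemblies grown from a seed brick at the origin.
found: list[list[Brick]] = []
extend([Brick(0, 0, "any")], remaining=2, found=found)
print(len(found))  # every result satisfies both LCL2 rules by construction
```

Even with a single seed brick and two additions, the search must test dozens of candidate placements per step and prune every overlap, which is precisely the explicit exploration that autoregressive token prediction does not perform.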
4. Patterns in Model Performance and Temperature Effects
Model behavior under varying temperature hyperparameters illustrates a trade-off between determinism and exploratory variability. At low temperatures, models tend to produce repeated, invalid outputs with little diversity, often systematically violating connectivity constraints. At higher temperatures, diversity of proposals increases but valid constructions remain sparse—generative accuracy does not appreciably improve. This suggests that neither increased randomness nor deterministic decoding suffices to instill the requisite spatial logic for the LCL domain.
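As a sketch of how such a sweep could be run, the helper below (assuming a `sample(prompt, temperature)` stand-in for whatever model API is under test) scores 400 completions per temperature with the harness above:

```python
def temperature_sweep(prompt: str, sample,
                      temps=(0.0, 0.5, 1.0, 1.5), n: int = 400) -> dict:
    """Map each temperature to the validity rate of n sampled completions.
    `sample(prompt, temperature)` is a stand-in for the model API under
    test; it should return one reply string."""
    return {
        t: validity_rate([sample(prompt, temperature=t) for _ in range(n)])
        for t in temps
    }
```

Under the reported results, such a sweep would peak near temperature 0.5 for GPT-4 while remaining near zero for GPT-3.5 at every setting.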
Furthermore, across both validity and construction tasks, the ability to explain or recite constraints in text does not transfer to rule-consistent generative behavior, marking a fundamental boundary in the procedural generalization of current GPT models.
5. Implications for Future “LegoGPT” Models
Empirical evidence suggests that current LLMs lack the internal mechanisms necessary for enforcing geometric constraints and integrating combinatorial spatial reasoning with linguistic synthesis. For a viable “LegoGPT”—a system capable of robust structured assembly or CAD-like design—the following would be required:
- Integrated spatial awareness, either via architectural extensions (e.g., explicit spatial planning modules) or by coupling LLMs with symbolic or visual geometric solvers.
- Dedicated training or finetuning on structured assembly tasks, using abstract rule sets not present in standard textual corpora.
- Potential application of reinforcement learning or hybrid methods, as exemplified in domains such as game-playing and image generation, to bridge the gap between rule articulation and compliance.
Such advancements would enable applications in automated design, educational instruction translation, and rapid prototyping, contingent on models that can consistently map instructions to valid assemblies within tight geometric constraints.
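As one illustration of the hybrid approach listed above, a propose-and-verify loop couples the model's generative output to a symbolic checker and feeds violations back as corrective context. This is a hypothetical coupling sketched on the helpers defined earlier, not a method from the benchmark itself:

```python
def propose_and_verify(prompt: str, sample, max_attempts: int = 5):
    """Hybrid loop: the language model proposes, a symbolic checker disposes.
    Invalid attempts are fed back as corrective context. All names here are
    illustrative; `sample` is the same model stand-in as above."""
    feedback = ""
    for _ in range(max_attempts):
        reply = sample(prompt + feedback, temperature=0.5)
        bricks = parse_bricks(reply)
        if bricks is not None and is_valid_assembly(bricks):
            return bricks  # rule-consistent assembly found
        feedback = (
            "\nYour previous answer violated the LCL rules "
            "(overlap or missing interlock). Try again:\n" + reply
        )
    return None  # no valid assembly within the attempt budget
```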
6. Technical Formalizations and Constraints
The LCL framework enforces constraints not typically present in text-based tasks, including:
- Permissible brick rotations restricted to integer multiples of $90°$ ($\pi/2$ radians).
- Explicit overlap and connectivity tests defined at the level of coordinate algebra and combinatorial geometry.
- Assembly validity defined not only by absence of rule violations but also the capacity for constructive recursion, suitable for algebraic specification and automated theorem-checking.
These aspects necessitate a symbolic or programmatic approach and present a challenge for autoregressive, text-only models, which traditionally excel at pattern completion rather than explicit geometric logic.
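For completeness, the overlap and connectivity tests reduce to interval conditions on the fixed 4-stud $x$-spans; one way to state them algebraically, writing $S(b) = [x_b, x_b + 4)$:

```latex
% Overlap and interlock predicates for fixed-length (4-stud) bricks,
% stated as interval conditions on the x-spans S(b) = [x_b, x_b + 4).
\[
\mathrm{overlap}(a, b) \iff y_a = y_b \;\wedge\; S(a) \cap S(b) \neq \emptyset
\]
\[
\mathrm{interlock}(a, b) \iff |y_a - y_b| = 1 \;\wedge\; S(a) \cap S(b) \neq \emptyset
\]
% An assembly B is valid iff no pair of bricks satisfies overlap and the
% graph with vertex set B and edge relation interlock is connected.
```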
7. Broader Significance and Limitations
The poor performance of GPT-3.5 and GPT-4 on LCL tasks underscores a limitation of current generative technology. While these models display conversational and descriptive proficiency, their weaknesses in structured spatial assembly suggest limited capacity for applications requiring deliberate manipulation of physical or abstract objects under explicit rule regimes. This distinction is central to evaluating claims of emergent general intelligence in LLMs and steers subsequent research toward more multimodal or hybrid architectures for next-generation “LegoGPT” systems.
A plausible implication is that establishing benchmarks such as LCL is essential to systematically probe advances in this area and to clarify the domain boundaries where LLMs’ purported generality does not translate into substantive problem-solving capacities. Further refinement of evaluation protocols, integration of non-linguistic reasoning modules, and development of datasets removed from training distributions will be critical for future progress.