LEGO-Compiler: Modular Code Generation Methods

Updated 10 March 2026

LEGO-Compiler is a modular suite of compiler and code generation methodologies that uses formal compositionality and verifiable transformations to ensure scalable and correct translation.
Its layout algebra decouples logical computation from physical memory layouts, enabling optimized GPU code generation with precise, bijective mappings.
The framework extends to spatial accelerator generation for tensor applications, delivering significant speedups and energy efficiencies through fully automated RTL synthesis.

LEGO-Compiler designates a suite of compiler and code generation methodologies unified by modular decomposition, formal compositionality, and rigorous mapping between logical computation and physical representation. The term covers systems for neural compilation, layout-driven GPU code generation, and spatial architecture synthesis. Representative systems include LEGO-Compiler for neural source-to-assembly translation (Zhang et al., 26 May 2025), the LEGO layout compiler for GPU memory mapping (Tavakkoli et al., 12 May 2025), and the LEGO spatial accelerator generator for tensor applications (Lin et al., 15 Sep 2025). Although the scope of the term varies, the unifying thread is the use of composable, verifiable transformations or layout expressions that modularize complexity, enable correctness reasoning, and facilitate high-performance mapping across architectures.

1. Principles of Compositional Compilation

LEGO-Compiler (Zhang et al., 26 May 2025) treats compilation as the composition of control blocks, relying on the composability of translation. For source programs $P_1$ and $P_2$, there exists a translation function $T$ satisfying

$T(P_1 \circ P_2) \equiv T(P_1) \cdot T(P_2)$

where “$\circ$” denotes syntactic concatenation at the source level and “$\cdot$” denotes the corresponding assembly-level combination. Control blocks (basic blocks, loops, branches) are defined inductively, which supports hierarchical decomposition. Translation can therefore process each block in isolation and then concatenate the results, with correctness guarantees formalized in Theorems 4.1–4.4 of Zhang et al. (26 May 2025). Because program fragments are translated independently, the approach scales to longer programs and allows blocks to be translated in parallel.
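The composability property can be illustrated with a toy translator. This is a minimal sketch and not the paper's LLM-based method: the block types `Basic` and `Loop`, the pseudo-assembly mnemonics, and the convention that every loop carries its own unique `name` for label generation are all assumptions made for illustration. Under those assumptions, translating a concatenated program gives the same result as concatenating the translations of its parts.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Basic:
    """Straight-line code; statements are kept as opaque strings."""
    stmts: List[str]


@dataclass
class Loop:
    """A while-style loop. `name` must be unique so labels never clash."""
    name: str
    cond: str
    body: List["Block"]


Block = Union[Basic, Loop]


def translate_block(block: Block) -> List[str]:
    """Translate one control block in isolation into pseudo-assembly."""
    if isinstance(block, Basic):
        return [f"    {s}" for s in block.stmts]
    head, done = f"{block.name}_head", f"{block.name}_done"
    # Loops recurse on their body, mirroring the inductive block definition.
    return (
        [f"{head}:", f"    jump_unless {block.cond}, {done}"]
        + translate(block.body)
        + [f"    jump {head}", f"{done}:"]
    )


def translate(blocks: List[Block]) -> List[str]:
    """T(P): translate each block independently, then concatenate."""
    out: List[str] = []
    for b in blocks:
        out.extend(translate_block(b))
    return out


# Two program fragments P1 and P2 (hypothetical example input).
p1 = [Basic(["i = 0", "s = 0"])]
p2 = [
    Loop("sum", "i < n", [Basic(["s = s + i", "i = i + 1"])]),
    Basic(["return s"]),
]

whole = translate(p1 + p2)             # T(P1 ∘ P2)
parts = translate(p1) + translate(p2)  # T(P1) · T(P2)
print(whole == parts)
```

The equality holds here only because no block's translation depends on its neighbours. A real translator additionally needs shared context such as symbol tables to keep that independence.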

2. System Architecture and Workflow

LEGO-Compiler implements a modular neural compilation pipeline (Zhang et al., 26 May 2025), comprising:

LEGO Translation Engine: Decomposes input source code (e.g., C functions) into semantically composable control blocks using an LLM-driven Part Split algorithm. Each block is independently translated with appropriate context (symbol tables, type information).
Verifiable LLM Workflow: Sequences domain-specific chain-of-thought (CoT) passes (variable renaming, type/layout analysis, mapping, part split/translate/rebuild), with each step paired with algorithmic or test-based external verification. For example, type/layout analysis results are cross-checked against structural layouts from Clangd or GCC; part splits are control-flow checked for isomorphism to the original.
Feedback Mechanism: Any semantic, runtime, or behavioral error (arising from assembler diagnostics, failed simulation, or failed unit tests) is turned into a feedback prompt for LLM-driven iterative correction, up to a bounded number of rounds.

Together, these components let translation scale while functional correctness is checked by external testing on the ExeBench, AnsiBench, and Csmith benchmarks.
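The bounded verify-and-correct loop described above can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the system's actual interface: the `Translator` and `Verifier` signatures, the `max_rounds` default, and the stub translator that stands in for an LLM call are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: a translator proposes assembly from the source and
# the feedback collected so far; a verifier returns None on success or an
# error message on failure.
Translator = Callable[[str, List[str]], str]
Verifier = Callable[[str], Optional[str]]


def translate_with_feedback(
    source: str,
    translator: Translator,
    verifiers: List[Verifier],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry translation until all verifiers pass or the round budget is spent."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = translator(source, feedback)
        errors = []
        for verify in verifiers:
            message = verify(candidate)
            if message is not None:
                errors.append(message)
        if not errors:
            return candidate, feedback
        feedback.extend(errors)  # errors become the next round's prompt context
    return None, feedback


def stub_translator(source: str, feedback: List[str]) -> str:
    """Stand-in for an LLM call: omits `ret` until feedback mentions it."""
    asm = "mov eax, edi\nadd eax, 1"
    if any("ret" in message for message in feedback):
        asm += "\nret"
    return asm


def requires_ret(asm: str) -> Optional[str]:
    """Toy external check standing in for assembler or unit-test diagnostics."""
    if asm.rstrip().endswith("ret"):
        return None
    return "missing ret at end of function"


result, log = translate_with_feedback(
    "int inc(int x) { return x + 1; }", stub_translator, [requires_ret]
)
print(result is not None, log)
```

Returning `None` once the budget is exhausted keeps failure explicit, so the caller can tell a verified translation from one that never passed.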

3. Layout Algebra and GPU Code Generation

The LEGO memory layout compiler (Tavakkoli et al., 12 May 2025) introduces a layout algebra framework to separate computation over logical index spaces from explicit physical memory layouts. The approach is characterized by:

Layout Independence: All arithmetic kernels are expressed solely in terms of logical indices, with no manual stride or address arithmetic.
One-to-One Mapping: Every logical index is mapped bijectively to a unique physical address, ensuring optimal memory access (coalesced loads, bank conflict avoidance) on GPU architectures.

The core layout language supports constructs:

GroupBy—specifies block/grouping;
OrderBy—specifies sequential permutation (RegP: regular permutation; GenP: general bijection with provided inverse);
Canonical Mapping: For row-major, $f_{\text{row}}(i, j) = i \cdot n_2 + j$; for blocked layouts, $f_{\text{block}}(i, j; B, M) = \lfloor i/B\rfloor B M + (i \bmod B) n_2 + j$, etc.

Indexing expressions are automatically generated via composition of permutation/grouping constructs, with symbolic simplification and code generation targeting backends such as Triton (Python/Sympy/Jinja2) and MLIR (affine map dialect embedding). This enables rapid exploration of new data layouts and integration with upstream compilation workflows.
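The separation between logical indices and physical addresses can be made concrete with a small sketch. The code below is not the LEGO layout language or its generated output: the two-dimensional tiled layout, the function names, and the assumption that tile sizes divide the row length evenly are all chosen for illustration. It checks that each layout is a bijection onto the address range and that a kernel written purely in logical indices returns the same values under either layout.

```python
from itertools import product


def row_major(i, j, n2):
    """f_row(i, j) = i * n2 + j."""
    return i * n2 + j


def tiled(i, j, n2, bi, bj):
    """Store bi x bj tiles contiguously; assumes bj divides n2."""
    tiles_per_row = n2 // bj
    tile = (i // bi) * tiles_per_row + (j // bj)
    return tile * (bi * bj) + (i % bi) * bj + (j % bj)


def tiled_inverse(addr, n2, bi, bj):
    """Recover the logical (i, j) from a physical address."""
    tiles_per_row = n2 // bj
    tile, within = divmod(addr, bi * bj)
    ti, tj = divmod(tile, tiles_per_row)
    return ti * bi + within // bj, tj * bj + within % bj


def is_bijective(layout, n1, n2):
    """True if the layout hits every address in [0, n1 * n2) exactly once."""
    addrs = sorted(layout(i, j) for i, j in product(range(n1), range(n2)))
    return addrs == list(range(n1 * n2))


def read_column(buf, layout, j, n1):
    """Layout-independent kernel: uses logical indices only."""
    return [buf[layout(i, j)] for i in range(n1)]


n1, n2, bi, bj = 4, 6, 2, 3
layouts = {
    "row_major": lambda i, j: row_major(i, j, n2),
    "tiled": lambda i, j: tiled(i, j, n2, bi, bj),
}

columns = {}
for name, layout in layouts.items():
    assert is_bijective(layout, n1, n2)
    buf = [None] * (n1 * n2)
    for i, j in product(range(n1), range(n2)):
        buf[layout(i, j)] = 10 * i + j  # value encodes its logical position
    columns[name] = read_column(buf, layout, 4, n1)

print(columns["row_major"] == columns["tiled"])
```

Swapping the layout changes only where values live in `buf`, while `read_column` stays untouched. This is the property that lets new layouts be explored without rewriting kernels.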

4. Spatial Accelerator Generation for Tensor Applications

The LEGO framework for spatial accelerator generation (Lin et al., 15 Sep 2025) extends compiler-like decomposition into hardware architecture generation, especially for tensor computation workloads. This pipeline comprises:

Affine-Transformation-Based Representation: Compute and dataflow mappings are specified via affine transformations between iteration and data spaces: $\vec{d} = M_{I\to D} \vec{i} + \vec{b}_{I\to D}$

$\vec{i} = \begin{bmatrix} M_T & M_S \end{bmatrix} \begin{pmatrix} \vec{t} \\ \vec{s} \end{pmatrix}$

Architecture Description Graph (ADG): High-level graph of functional units (FUs), buffer/data nodes, and reuse edges.
Detailed Architecture Graph (DAG): Pruned and fused representation, lowered to primitive netlist components.
Graph-Level Optimizations: Linear-programming–based pipeline register insertion, broadcast rewiring, balanced reduction tree extraction, pin reuse scheduling, and bit-width inference.
End-to-End RTL Generation: From loop nests to synthesizable RTL via SpinalHDL, with fully automated controller, data path, and memory generation.

This approach supports dynamic multi-dataflow fusion, automated on-chip memory synthesis, and broadcast/fanout minimization without handwritten templates.
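The affine representation used in this pipeline can be worked through on a matrix multiplication $C[m,n] \mathrel{+}= A[m,k] \cdot B[k,n]$. The following is a sketch under assumptions that do not come from the source: the iteration vector is ordered as $(m, n, k)$, the spatial array is $2 \times 2$ with $m = 2 t_m + s_m$, $n = 2 t_n + s_n$, $k = t_k$, and all offsets $\vec{b}$ are zero. Plain Python lists are used so the example has no dependencies.

```python
def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def affine(M, b, v):
    """d = M v + b."""
    return [x + y for x, y in zip(matvec(M, v), b)]


# Iteration-to-data maps (M_{I->D}, b_{I->D}) for iteration vector (m, n, k).
ACCESS = {
    "A": ([[1, 0, 0], [0, 0, 1]], [0, 0]),  # A[m, k]
    "B": ([[0, 0, 1], [0, 1, 0]], [0, 0]),  # B[k, n]
    "C": ([[1, 0, 0], [0, 1, 0]], [0, 0]),  # C[m, n]
}

# Assumed mapping onto a 2 x 2 array: t = (t_m, t_n, t_k), s = (s_m, s_n).
M_T = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
M_S = [[1, 0], [0, 1], [0, 0]]
M_TS = [row_t + row_s for row_t, row_s in zip(M_T, M_S)]  # [M_T  M_S]


def iteration_point(t, s):
    """i = [M_T  M_S] (t; s)."""
    return matvec(M_TS, list(t) + list(s))


def data_indices(t, s):
    """Data-space index touched in each tensor at time t on unit s."""
    i = iteration_point(t, s)
    return {name: affine(M, b, i) for name, (M, b) in ACCESS.items()}


# Two functional units in the same array row at the same time step.
pe0 = data_indices((1, 0, 3), (1, 0))
pe1 = data_indices((1, 0, 3), (1, 1))
print(pe0["A"] == pe1["A"], pe0["B"] == pe1["B"])
```

In this assumed mapping, units that differ only in $s_n$ read the same element of $A$ but different elements of $B$. This is the kind of sharing that a reuse edge between functional units would capture.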

5. Empirical Results and Quantitative Evaluation

The compositional and modular paradigm yields empirical improvements in accuracy, scalability, and hardware efficiency.

Neural Compilation (LEGO-Compiler) (Zhang et al., 26 May 2025):

On ExeBench (17,121 C functions): LEGO-Compiler obtains a behavioral test-pass rate above 99%, outperforming direct LLM translation by up to 9%.
On AnsiBench: Achieves 97.9% correctness.
Extends the maximum compilable token length by nearly $10\times$ compared to direct LLM inference.
On Csmith (randomized stress programs): Compiles 25 of 40 cases successfully, two to eight times the baseline.

GPU Layout Compilation (LEGO) (Tavakkoli et al., 12 May 2025):

For matrix multiplication ($8192^2$, FP16), LEGO-generated code matches cuBLAS (ratio $T_{\rm cuBLAS}/T_{\rm LEGO} \approx 1.01$).
For FP8 $A^T B$, about 5% faster than Triton.
For Rodinia NW, switching to anti-diagonal layout yields $1.4\times$–$2.1\times$ speedups.
In MLIR, transpose kernels match hand-tuned CUDA to within 2%.
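The anti-diagonal layout mentioned for Rodinia NW can be illustrated with a short sketch. In Needleman-Wunsch style dynamic programming, the cells on one anti-diagonal ($i + j$ constant) are mutually independent, so storing each anti-diagonal contiguously lets a wavefront read consecutive addresses. The mapping below is one possible anti-diagonal bijection for a square $n \times n$ grid, written for illustration; it is not taken from the LEGO implementation, and the function names are hypothetical.

```python
def diag_len(d, n):
    """Number of cells on anti-diagonal d of an n x n grid."""
    return d + 1 if d < n else 2 * n - 1 - d


def antidiag_address(i, j, n):
    """Map logical (i, j) to an address with anti-diagonals stored contiguously."""
    d = i + j
    start = sum(diag_len(k, n) for k in range(d))  # cells on earlier diagonals
    first_i = max(0, d - (n - 1))                  # smallest row on diagonal d
    return start + (i - first_i)


n = 3
table = [[antidiag_address(i, j, n) for j in range(n)] for i in range(n)]
for row in table:
    print(row)

# The mapping is a bijection onto [0, n * n).
all_addrs = sorted(a for row in table for a in row)
print(all_addrs == list(range(n * n)))
```

The prefix sum in `antidiag_address` is recomputed on every call to keep the sketch short. A generated kernel would more plausibly use a closed-form offset.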

Spatial Accelerator Generation (LEGO) (Lin et al., 15 Sep 2025):

On a suite of neural networks (AlexNet, MobileNetV2, BERT, LLaMA-7B), achieves $3.2\times$ speedup and $2.4\times$ energy reduction over Gemmini.
Achieves $1.5\times$ area and $1.4\times$ energy reduction relative to unoptimized designs.
Utilization rates above 90% on diffusion models; competitive with hand-tuned ASICs and HLS generators.

6. Limitations and Future Directions

Limitations of LEGO-based systems reflect both architectural and methodological constraints (Zhang et al., 26 May 2025, Tavakkoli et al., 12 May 2025, Lin et al., 15 Sep 2025):

Compiler-level: Generated code is minimally optimized compared with classic compilers; stochastic outputs introduce reproducibility and security concerns; unstructured features (e.g., unrestricted gotos, exceptions) challenge composability; large symbolic expressions may exceed LLM context length.
Memory layout: Partial tile and masking overhead for irregular extents; dynamic layouts induce complexity in generated address expressions; support for sparse, distributed, or co-iterated layouts (à la TACO, Legion) is ongoing.
Accelerator synthesis: Netlist generation is highly automated and template-free, but may not yet outperform hand-tuned designs on kernels that domain experts have optimized heavily.

Planned extensions include integration of traditional optimization passes (inlining, loop unrolling), hybrid neural-symbolic compilation (LLM plus SMT solver–driven verification), perspective layout generalization, and application to other assembly targets.

7. Significance and Comparative Perspective

LEGO-Compiler and related systems define a new direction in compiler and code generation methodology by emphasizing explicit compositionality and verification, decoupling logical and physical representation, and modularizing translation via data-driven and neural techniques. In neural compilation, this results in performance and correctness levels rivaling traditional toolchains for structured code, with greatly enhanced extensibility and potential for rapid adaptation to new languages or architectures. In layout and accelerator compilation, it enables the systematic exploration of the scheduling–layout–mapping space, yielding both high hardware efficiency and portability across GPU and custom accelerator targets. This modular paradigm positions LEGO-Compiler as a bridge between classical compiler engineering, hardware mapping, and LLM-driven program synthesis (Zhang et al., 26 May 2025, Tavakkoli et al., 12 May 2025, Lin et al., 15 Sep 2025).