Loom: A Scalable Analytical Neural Computer Architecture

Published 9 Apr 2026 in cs.LG | (2604.08816v1)

Abstract: We present Loom, a computer architecture that executes programs compiled from C inside a looped transformer whose weights are derived analytically. The architecture implements a 22-opcode instruction set in 8 transformer layers. Each forward pass executes one instruction; the model is applied iteratively until the program counter reaches zero. The full machine state resides in a single tensor $X \in \mathbb{R}^{d \times n}$ of fixed size, and every step has fixed cost for fixed $d$ and $n$, independent of program length or execution history. The default configuration uses $d = 155$ and $n = 1024$, yielding 4.7 million parameters and 928 instruction slots. A compact configuration at $d = 146$ and $n = 512$ suffices for a 9$\times$9 Sudoku solver (284 instructions). The weights are program-independent: programs live in the state tensor, and the same fixed-weight model executes any compiled program. We make Loom source code publicly available at https://github.com/mkturkcan/Loom.

Authors (1)

Summary

  • The paper presents an 8-layer transformer with analytically fixed weights that supports a complete 22-opcode ISA in a single forward pass.
  • It utilizes innovative opcode-as-operand routing and a single-layer subtraction method to achieve exact, deterministic computation.
  • Empirical tests on FPGA and classic benchmarks confirm its parameter efficiency, scalability, and practical hardware readiness.

Loom: An Analytical Transformer-Based Neural Computer Architecture

Overview

"Loom: A Scalable Analytical Neural Computer Architecture" (2604.08816) introduces a fixed-weight, 8-layer transformer with analytically constructed parameters that implements a complete 22-opcode instruction set architecture (ISA). Loom can execute any program compiled from a practical subset of C, encoding the full machine state in a persistent tensor and executing each instruction in a single transformer forward pass. The design enables exact, deterministic program execution, without any training, by mapping all instructions to operand preparation for a shared subtract core, using bipolar encoding and carefully constructed analytical routing schemes.

Technical Contributions

Loom substantially extends the programmable-computer transformer paradigm. While prior analytical constructs focused on minimal Turing-complete ISAs (e.g., SUBLEQ), Loom supports 22 opcodes in only 8 layersโ€”compared to previous results of 1 opcode in 10 layers [giannou2023]. The core technical innovations are:

  • Opcode-as-operand-routing: All instructions, including arithmetic and control flow, are mapped to operand transformations for a universal borrow-chain subtract core. This eliminates the need for distinct per-opcode execution layers.
  • Single-layer direct subtraction: A 6-threshold-per-bit ReLU pipeline computes bitwise two's-complement subtraction in one layer, fusing classical bit-flip, increment, and addition stages into a single matrix operation; this is mathematically exact by construction.
  • STORE instruction via address rewriting: Indirect memory writes are implemented by the L2 FFN rewriting the scratchpad's write address with a dereferenced pointer, dramatically reducing compiled code size for pointer-heavy programs.
  • Scale-independent analytic construction: The architecture, ISA, and compiler are parameterized: the same design functions across substrate sizes up to $164 \times 2048$ (1,792 instruction slots).
  • Hardware implementation: A resource-optimized FPGA implementation leverages argmax attention to eliminate the $n \times n$ attention matrix, closely matching analytical softmax behavior and achieving practical execution speeds on silicon.
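The borrow-chain core's semantics can be illustrated with a plain bit-level reference. This is a sketch of what the fused subtraction layer computes, not of the layer itself; the function name and LSB-first bit layout are illustrative choices, not taken from the paper:

```python
def subtract_borrow_chain(a_bits, b_bits):
    """Two's-complement subtraction a - b over LSB-first bit lists,
    propagating a borrow exactly as a classical borrow chain does."""
    borrow, out = 0, []
    for a, b in zip(a_bits, b_bits):
        d = a - b - borrow          # d is in {-2, -1, 0, 1}
        out.append(d & 1)           # difference bit at this position
        borrow = 1 if d < 0 else 0  # borrow into the next position
    return out

# 5 - 3 on 4 bits, LSB first: [1,0,1,0] - [1,1,0,0]
print(subtract_borrow_chain([1, 0, 1, 0], [1, 1, 0, 0]))  # → [0, 1, 0, 0] (= 2)
```

Loom's contribution is collapsing this sequential borrow propagation into a single matrix-plus-ReLU stage, so one transformer layer produces the exact result.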

Model Architecture

State Representation and Execution Model

The core state is a $d \times n$ real matrix $X$, with the $d$ rows partitioned into functional regions (commands, memory, scratchpad, PC, buffers, tags, indicator rows). All machine data uses bipolar encoding, which enables exact gating and thresholding via piecewise ReLU functions. Program and execution state (including data and the PC) reside entirely within this tensor, which has constant size and does not grow with program or execution length.
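To see why bipolar encoding pairs well with ReLU thresholding, note that two ReLUs suffice for an exact clamp of odd-integer pre-activations back to $\pm 1$, from which exact logic gates follow. This is a minimal sketch; the particular gate and thresholds are illustrative, not Loom's actual construction:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def clamp(x):
    # Exact bipolar clamp from two ReLUs: odd integers <= -1 map to -1,
    # odd integers >= +1 map to +1. No approximation is involved.
    return relu(x + 1) - relu(x - 1) - 1

def bipolar_and(a, b):
    # AND over bipolar {-1, +1}: a + b - 1 equals +1 only when both are +1.
    return clamp(a + b - 1)
```

Because every intermediate value is an exact integer, such gates compose over arbitrarily many steps without numeric drift.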

A forward pass through the fixed, 8-layer transformer executes one instruction. Programs are encoded into matrices by a C-to-ISA compiler; instruction addresses reference columns, and instruction set extensions are realized within the analytic FFN and attention layouts.
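The execution model above is a simple run-until-halt loop around the fixed model. In this sketch, `toy_model` stands in for the transformer (one call = one instruction) and the "PC" is just the scalar at `X[0, 0]`; Loom's actual PC encoding and state layout differ:

```python
import numpy as np

def run(model, X, max_steps=1000):
    """Apply the fixed-weight model iteratively until the PC reaches zero."""
    for _ in range(max_steps):
        if int(X[0, 0]) == 0:   # PC == 0 halts execution
            break
        X = model(X)
    return X

def toy_model(X):
    # Stand-in for one forward pass: "executes" one instruction by
    # decrementing the toy program counter.
    Y = X.copy()
    Y[0, 0] -= 1
    return Y
```

Each iteration has identical cost for fixed $d$ and $n$, which is what makes per-instruction latency independent of program length or history.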

Analytical Pipeline

The transformer executes a deterministic, pipelined sequence mapped layerwise:

  • L1: Instruction fetch via attention-based PC-to-instruction mapping.
  • L2: Operand read, opcode routing, and preparatory operand transforms.
  • L3: Indirect memory access and scratchpad error correction.
  • L4: Direct borrow-chain subtraction (shared by all arithmetic and many logic instructions).
  • L5: Memory (and, for SWAP, dual-location) write via analytic attention.
  • L6โ€“L7: Branch flag computation and PC branching, merged for efficiency.
  • L8: Persistent bipolar correction to ensure numeric exactness over steps.
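Structurally, one instruction step is a fixed composition of eight state-to-state maps, applied in the order listed above. The sketch below shows only that composition; the stage names follow the paper's layer roles, and the identity placeholders stand in for Loom's actual attention/FFN layers:

```python
STAGE_NAMES = [
    "fetch",              # L1: PC -> instruction via attention
    "operand_read_route", # L2: operand read and opcode routing
    "indirect_access",    # L3: indirect memory access / correction
    "subtract_core",      # L4: shared borrow-chain subtraction
    "memory_write",       # L5: memory (and SWAP dual) write
    "branch_flags",       # L6: branch flag computation
    "pc_update",          # L7: PC branching (merged with L6 in Loom)
    "bipolar_correct",    # L8: persistent bipolar correction
]

def step(X, stages):
    """One instruction = one pass through all eight stages."""
    for stage in stages:
        X = stage(X)
    return X

identity_stages = [lambda X: X] * len(STAGE_NAMES)
```

The same fixed composition handles every opcode; only the state tensor's contents determine which instruction the pass executes.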

All weights are analytically designed, strictly sparse, and non-adaptive, relying on exact calculation rather than gradient-based learning.

Instruction Set and Compiler Support

The ISA supports 22 instructions, including HALT, MOV, INC/DEC, ADD/SUB, SHL/SHR, AND/OR/XOR, Jumps (JMP/JZ/JNZ/CMP), LOAD/FIND, SWAP, CMOV, MULACC, and the pivotal STORE for indirect writes. Programs are compiled from a C subset (including loops, arrays, function inlining) to this ISA. The compiler backend is implemented in Python and JavaScript; the latter enables fully client-side compilation and browser-based execution.

The addition of the STORE instruction is empirically significant: for a $9 \times 9$ Sudoku solver, STORE reduces the instruction count from 1,085 to 284 and enables execution in the compact $146 \times 512$ configuration rather than requiring the largest matrix size.
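The semantics of STORE is ordinary indirect addressing. This sketch shows the instruction's effect on a flat memory model, not the L2 address-rewriting mechanism that implements it:

```python
def store(mem, p, v):
    """STORE: write v through the pointer held at address p,
    i.e. mem[mem[p]] = v. Without a native indirect write, a compiler
    must emit separate code per target address, which is what inflated
    the Sudoku solver to 1,085 instructions."""
    mem[mem[p]] = v

mem = [0] * 8
mem[3] = 5         # a pointer (value 5) stored at address 3
store(mem, 3, 42)  # writes 42 through that pointer
# mem[5] is now 42
```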

Empirical Validation

Loom is validated extensively:

  • Unit tests: All 22 opcodes and multi-write/cross-head interactions are exhaustively tested, covering all boundary cases and permutations.
  • End-to-end functional tests: Classic algorithms (Fibonacci, GCD, array operations), games (Snake, a raycasting engine), and a full-featured Sudoku solver are compiled and validated for bitwise equivalence across the transformer, a reference interpreter, and the JavaScript/FPGA argmax implementations.
  • Hardware deployment: The design synthesizes successfully on a Xilinx Alveo U200 FPGA, leveraging bit-identical argmax attention to the analytic softmax, and achieves reliable, deterministic execution of C-compiled programs.
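The argmax substitution used on the FPGA can be sketched in a few lines: when attention scores have a unique, well-separated maximum, low-temperature softmax attention converges to a one-hot row selection, so the full attention matrix never needs to be materialized. Names, shapes, and the temperature value here are illustrative:

```python
import numpy as np

def softmax_attention(Q, K, V, temp=1e-2):
    S = (Q @ K.T) / temp
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V

def argmax_attention(Q, K, V):
    # One-hot selection: gathers a row of V per query without ever
    # forming the n x n attention matrix.
    return V[(Q @ K.T).argmax(axis=-1)]
```

With well-separated scores the two outputs agree to floating-point precision, which is the property the bit-identical FPGA deployment relies on.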

Numerically, the model attains high parameter efficiency: in the $155 \times 1024$ configuration, it requires 4.7M parameters and is >99.9% sparse with only 27 distinct nonzero values, substantially reducing hardware and inference footprint.

Theoretical and Practical Implications

Loom provides a concrete instantiation of analytical neural computation architectures with a non-trivial ISA. Unlike works that only prove Turing-completeness or focus on theoretical universality [schuurmans2024, li2025constantbit, jiang2025softmax], Loom offers a complete executable path with an expressive C subset compiler, deterministic behavior, and validated correctness across execution platforms. The scale-independence and weight sparsity make it suitable for both compact embedded contexts and high-throughput FPGA/ASIC arrays.

Practically, Loom demonstrates feasibility for deterministic rule enforcement and embedded algorithmic logic within broader perception systems. By embedding provably correct algorithmic and safety constraints within fixed subnetworks, learned models can guarantee execution of mission-critical logic (e.g., for autonomous vehicles or real-time control), avoiding the unreliability of purely learned approximations.

Loom's success in folding classical control/dataflow constructs into analytic transformer architectures points toward more unified architectures where algorithmic, symbolic, and perception layers share a common substrate. Embedding deterministic algorithms in neural attention layers may provide new avenues to combine transparent program execution with connectionist models for trustworthy, interpretable AI.

Future Directions

Potential directions include extending the analytic construction to richer ISAs (e.g., supporting floating point, concurrency), developing neural/analytic hybrid architectures where learned submodules are gated or augmented by embedded analytic logic, and exploring tighter co-design with hardware for area-optimal FPGA/ASIC instantiations. Given the generality of operand-as-routing, similar constructions may be feasible in other differentiable architectures (e.g., recurrent or MLP-based programmable machines), which could push the limits of combined symbolic and neural computation.

Conclusion

Loom establishes a novel approach for implementing complete, scale-portable neural computers with analytically fixed weights in transformer architectures. By leveraging operand preparation, analytic borrow-chain computation schemes, and advanced address rewriting, Loom can execute a non-trivial ISA efficiently and exactly, with strong formal guarantees and hardware realizability. This model forms a promising blueprint for integrating algorithmic reasoning directly into connectionist systems and demonstrates that practical programmability is achievable with analytically structured transformers (2604.08816).
