
XIP Mechanism in FPGA Design

Updated 7 December 2025
  • XIP is a mechanism that executes instructions directly from non-volatile SPI flash, eliminating the need for dedicated instruction BRAM.
  • It uses a Harvard-style architecture with dual stacks and separated instruction/data paths to enhance efficiency in resource-constrained environments.
  • A twelve-state finite state machine coordinates flash-read timing and control logic to maintain stable operation and predictable throughput.

The Execute-in-Place (XIP) mechanism refers to the direct execution of program instructions from non-volatile external storage—specifically, SPI Flash—without preloading code into on-chip block RAM or SRAM. In the context of the WASM-subset stack architecture for low-cost FPGAs, XIP provides a solution for conserving scarce on-chip memory resources, optimizing for high code density, and enabling transparent, open-source synthesis and routing flows (Chakrabarti, 30 Nov 2025). The Tang Nano-9K FPGA implementation exemplifies this strategy by employing a hardware/software co-design approach that leverages a dual-stack processor core and SPI-based instruction fetching, obviating the need for dedicated instruction BRAM.

1. Hardware Architecture for XIP

The XIP mechanism is enabled by a Harvard-style bus split, wherein the instruction path and data path are physically and logically decoupled:

  • Instruction path: The core generates a 24-bit address bus targeting the SPI-Flash Controller. Instruction bytes are fetched serially over the MISO line and assembled into 8-bit values by the controller. The sequence is initiated by lowering the spi_cs_n signal and driving the address, after which the SPI controller clocks out one instruction byte per fetch cycle and asserts a “data_ready” handshake to the CPU.
  • Data path: A 32-bit wide internal SRAM, realized with 1 KB of on-chip block RAM, is reserved for application data, buffers, and the dual-stack subsystem. The two stacks (Data Stack and Return Stack) are realized via distributed LUT-RAM, each configured as 8×32 cells.
  • Stack subsystem: The stacks are accessed with 3-bit circular stack pointers (SP<2:0>) and are directly interfaced to registers that cache the top two entries for direct operand access. During EXECUTE, ALU operations and control flow (CALL/RET) manipulate these stacks accordingly (a behavioral sketch follows this list).
  • I/O: Peripherals are accessed via a side-bus; in the reference design, UART is supported on an 8-bit bus.
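
A behavioral illustration of this stack subsystem, in Python, is given below. The class and register names (CircularStack, tos, nos) are illustrative assumptions, not identifiers from the reference design; the sketch models one 8×32 stack with a 3-bit circular pointer and the two cached top entries.

```python
# Behavioral model of one 8x32 stack with a 3-bit circular pointer.
# The hardware caches the top two entries in registers (tos/nos here)
# so ALU operands are available without a RAM read.

MASK32 = 0xFFFFFFFF  # 32-bit data width

class CircularStack:
    def __init__(self, depth=8):
        self.mem = [0] * depth      # distributed LUT-RAM cells
        self.depth = depth
        self.sp = 0                 # 3-bit pointer, wraps modulo depth
        self.tos = 0                # cached top of stack
        self.nos = 0                # cached next-on-stack

    def push(self, value):
        # Spill the old next-on-stack into RAM, shift the register cache.
        self.mem[self.sp] = self.nos
        self.sp = (self.sp + 1) % self.depth   # circular: no overflow trap
        self.nos = self.tos
        self.tos = value & MASK32

    def pop(self):
        value = self.tos
        self.tos = self.nos
        self.sp = (self.sp - 1) % self.depth
        self.nos = self.mem[self.sp]
        return value

# An ALU op such as ADD reads both cached registers in a single cycle:
ds = CircularStack()
ds.push(2)
ds.push(3)
ds.push((ds.pop() + ds.pop()) & MASK32)
assert ds.tos == 5
```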

This architecture eliminates the need for any instruction BRAM: only two BRAM blocks are consumed, both by the data SRAM, leaving the remaining blocks free for potential buffering or caching.

2. Finite State Machine and Control Logic

Instruction execution is orchestrated by a twelve-state finite state machine (FSM), tuned to address latency-induced stalls and resource hazards inherent in the XIP paradigm:

  1. FETCH_CMD: Dispatch the SPI read command and address.
  2. FETCH_WAIT_LOW / FETCH_WAIT_HIGH: Enforce SPI Flash timing parameters (tLOW, tHIGH).
  3. DECODE: Parse the opcode and determine the micro-operation (EXECUTE, FETCH_IMM).
  4. FETCH_IMM: Gather multi-byte immediates via iterative read cycles.
  5. EXECUTE: Trigger ALU computation, stack pointer updates, or memory/I/O cycles.
  6. ALU_WAIT: Insert wait cycle to resolve read-modify-write hazards on the stack.
  7. UART_WAIT / KEY_WAIT: Manage peripheral or user input stalling.

Crucial control signals include spi_clk, spi_cs_n, and spi_mosi for upstream flash commands, and spi_miso for downstream instruction bytes. Stack write enables, addresses, and timing are synchronized, particularly to avoid race conditions during stack pointer updates and ALU writes. The ALU output is buffered via temp_alu to mediate combinatorial and sequential interactions.
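
A minimal software-level sketch of this fetch/decode/execute sequencing is shown below. The state names follow the list above; the transition conditions (the flags dictionary) are illustrative assumptions rather than the paper's RTL, and the full design contains twelve states rather than the nine modeled here.

```python
from enum import Enum, auto

class State(Enum):
    # States named in the list above; the full FSM has twelve.
    FETCH_CMD = auto()
    FETCH_WAIT_LOW = auto()
    FETCH_WAIT_HIGH = auto()
    DECODE = auto()
    FETCH_IMM = auto()
    EXECUTE = auto()
    ALU_WAIT = auto()
    UART_WAIT = auto()
    KEY_WAIT = auto()

def next_state(s, flags):
    """One transition per clock; `flags` stands in for decode/hazard signals."""
    if s is State.FETCH_CMD:        # SPI read command + 24-bit address issued
        return State.FETCH_WAIT_LOW
    if s is State.FETCH_WAIT_LOW:   # tLOW satisfied
        return State.FETCH_WAIT_HIGH
    if s is State.FETCH_WAIT_HIGH:  # tHIGH satisfied; byte now valid
        return State.FETCH_IMM if flags["gathering_imm"] else State.DECODE
    if s is State.DECODE:           # parse the opcode
        return State.FETCH_IMM if flags["needs_imm"] else State.EXECUTE
    if s is State.FETCH_IMM:        # loop back to the SPI fetch per byte
        return State.FETCH_CMD if flags["imm_left"] else State.EXECUTE
    if s is State.EXECUTE:
        if flags["stack_hazard"]:   # read-modify-write hazard on the stack
            return State.ALU_WAIT
        if flags["uart_busy"]:
            return State.UART_WAIT
        if flags["key_pending"]:
            return State.KEY_WAIT
        return State.FETCH_CMD      # fetch the next instruction
    return State.FETCH_CMD          # wait states resume the fetch loop
```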

3. SPI Flash Access Timing

Flash-read latency dominates throughput under the XIP mechanism:

  • SPI flash cycle: Each byte fetch involves tLOW ≈ 100 ns, tHIGH ≈ 100 ns; cumulative T_flash ≈ 200 ns per byte.
  • Instruction fetch pipeline: A single-byte opcode fetch comprises three cycles of SPI access, one for DECODE, one for EXECUTE (five cycles total). Fetching instructions with four-byte immediates extends to seventeen cycles due to repeated byte-fetch loops.
  • Clocking and throughput: The internal logic is theoretically capable of F_logic > 40 MHz, yet the effective fetch rate is limited by flash speed: f_max ≈ 1/T_flash = 5 MHz per byte. In practice, the overlapped FSM allows stable operation at a 27 MHz system clock.
  • Latency masking: The twelve-state FSM introduces precise wait states to mask instruction-fetch stalls, maintaining deterministic execution.

A schematic relationship between timing elements is summarized by:

$$f_{\text{max}} = \frac{1}{\max(\text{FSM\_setup} + T_{\text{flash}})}$$

where $T_{\text{flash}} \approx 200$ ns (Chakrabarti, 30 Nov 2025).
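
The cycle counts above reproduce the nominal throughput figures quoted in Section 5 with a short calculation; the helper below is a sketch, with the function name chosen for illustration.

```python
F_CLK = 27e6  # Hz, stable operating frequency

def mips(cycles_per_instr):
    """Nominal MIPS for an instruction needing this many FSM cycles."""
    return F_CLK / cycles_per_instr / 1e6

print(f"1-byte opcode (5 cycles):   {mips(5):.1f} MIPS")   # ~5.4
print(f"4-byte immediate (17 cyc.): {mips(17):.1f} MIPS")  # ~1.6

# Flash-limited byte-fetch ceiling: T_flash ~ 200 ns per byte
T_FLASH = 200e-9
print(f"Byte-fetch ceiling: {1 / T_FLASH / 1e6:.0f} MHz")  # 5 MHz
```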

4. Stack Depth Parametrization and Resource Trade-offs

Stack depth directly impacts look-up table (LUT) allocation, routing complexity, and ultimately logic utilization:

| Stack Depth | LUT Utilization | Routing Characteristics |
|---|---|---|
| 4 entries | ~65% | Safe; limited routine complexity |
| 8 entries | ~80% | Tightest timing, adequate headroom |
| 16 entries | ~99% | Severe congestion, closure fails |

A core observation is that beyond 8 entries, the incremental overhead (fan-out from 3-bit pointers on 32-bit stack data) induces non-linear routing congestion. At 16 entries, place-and-route closure becomes infeasible. The empirical fitting for resource usage follows:

$$\text{LUTs}_{\text{used}}(\text{depth}) \approx \text{LUT}_{\text{base}} + \text{depth} \times 32 \times \text{LUT}_{\text{per\_bit}}$$
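
The fitted model can be sanity-checked against the utilization table above. In the sketch below, LUT_base and LUT_per_bit are back-solved from the 4- and 8-entry rows, assuming the Tang Nano-9K's ~8640 LUT4s; all derived numbers are estimates rather than figures reported in the paper.

```python
# Back-solve the linear LUT model from the 4- and 8-entry table rows.
# TOTAL_LUTS assumes the Tang Nano-9K's ~8640 LUT4s (an estimate).
TOTAL_LUTS = 8640

lut_depth4 = 0.65 * TOTAL_LUTS                        # ~65% at 4 entries
lut_depth8 = 0.80 * TOTAL_LUTS                        # ~80% at 8 entries

lut_per_entry = (lut_depth8 - lut_depth4) / (8 - 4)   # ~324 LUTs per entry
lut_per_bit = lut_per_entry / 32                      # ~10 LUTs per stack bit
lut_base = lut_depth4 - 4 * lut_per_entry             # ~4320 LUTs of core logic

predicted_16 = lut_base + 16 * lut_per_entry
print(f"Predicted utilization at depth 16: {predicted_16 / TOTAL_LUTS:.0%}")
# -> ~110%: extrapolating past 100% is consistent with place-and-route
#    closure failing at 16 entries.
```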

XIP’s elimination of instruction BRAM leaves the device’s 26 BRAM blocks almost entirely free: only the data SRAM occupies BRAM, while the stacks reside in distributed LUT-RAM, maximizing memory headroom for large buffers or potential caching extensions.

5. Performance Characteristics and Practical Application

Timing and throughput in the XIP-enabled design are flash-constrained:

  • Clock frequency: Stable at f_clk = 27 MHz.
  • Instruction throughput: Single-byte instructions execute in ≈ five cycles (~5.4 MIPS nominal); instruction streams with multi-byte immediates require ≈ seventeen cycles (~1.6 MIPS).
  • Benchmarks: Empirical system throughput ranges from 4–6 MIPS depending on workload composition and instruction encoding density.

The infix calculator application, which exercises DUP, SWAP, EQ, BR_IF, arithmetic, and I/O primitives, demonstrates correct stack-pointer operation (including the ALU_WAIT hazard path) and the sufficiency of an 8-entry stack depth. Software division is realized via repeated subtraction (sketched below), validating FSM hazard resolution.
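
As a concrete illustration of that software division, the following Python sketch emulates the repeated-subtraction loop as a zero-address machine would run it, with the dividend and quotient held on a small data stack; the mapping to specific opcodes is an assumption for illustration.

```python
def divide_by_repeated_subtraction(dividend, divisor):
    """Zero-address-style division: the running dividend and the quotient
    live on a data stack, mirroring how SUB and BR_IF primitives would
    drive the same loop on the hardware stacks."""
    assert divisor > 0
    stack = [dividend, 0]          # [dividend, quotient]
    while stack[-2] >= divisor:    # BR_IF-style loop test
        stack[-2] -= divisor       # SUB on the cached stack entries
        stack[-1] += 1             # bump the quotient
    quotient = stack.pop()
    remainder = stack.pop()
    return quotient, remainder

print(divide_by_repeated_subtraction(42, 5))   # -> (8, 2)
```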

Because no instruction cache is present, every byte fetch incurs a 200 ns flash penalty. The system compensates by enabling single-cycle stack and ALU operations (except in ALU_WAIT scenarios) and by employing compact, zero-address WASM-style instructions to preserve cycle efficiency.

6. Enabling Factors and Practical Implications

System viability under the XIP mechanism is secured by several interlocking design strategies:

  • Twelve-state FSM: Inserts target-specific wait states to absorb serial flash-read latency and resolve stack read-write hazards.
  • Dual-stack approach in distributed LUT-RAM: Minimizes operand access logic, circumvents BRAM allocation for instruction memory, and enables single-cycle stack ops.
  • Parametrizable stack depth: 8 entries are shown empirically to offer an efficient trade-off between resource utilization and computational complexity.
  • Open-source EDA flow compatibility: The design uses exclusively open-source synthesis and routing, circumventing the limitations of proprietary toolchains (Chakrabarti, 30 Nov 2025).
  • High code density: Achieved by in-place execution from SPI flash and use of a minimal, zero-address WASM-subset ISA.
  • Stable throughput: 27 MHz operation with predictable MIPS for typical classes of embedded workloads.

A plausible implication is that such XIP-enabled soft-cores are particularly well suited for resource-constrained FPGA deployments in transparent, verifiable hardware experiments, and applications where on-chip memory is at a premium.
