Dual-Stack Microprocessor Architecture
- Dual-Stack Microprocessor Architecture is a design featuring separate data and return stacks that streamline operand management and control-flow handling.
- It employs Execute-in-Place (XIP) from SPI Flash and a centralized FSM to optimize instruction fetch and hazard resolution in resource-constrained FPGAs.
- Empirical evaluations show that an 8-entry configuration balances LUT utilization and timing, achieving up to 5.4 MIPS at a 27 MHz operating frequency.
A dual-stack microprocessor architecture employs two independent stack structures within its datapath, typically a Data Stack and a Return Stack, to decouple operand management from control-flow bookkeeping. Recent designs for low-cost, resource-constrained FPGAs demonstrate that such an approach, when combined with Execute-in-Place (XIP) from SPI Flash and minimal on-chip memory overhead, yields high code density, transparent implementation, and practical throughput for embedded workloads. Notably, the WASM-subset soft-core on the GW1NR-9 FPGA is implemented entirely in open-source EDA flows. It features an 8-entry dual-stack configuration using distributed (LUT) RAM and a deterministic FSM for hazard management, operates at 27 MHz, and successfully executes interactive infix calculators and UART-based I/O (Chakrabarti, 30 Nov 2025).
1. Architectural Overview
The dual-stack microprocessor architecture, as exemplified by the WASM-subset design, implements a Harvard-like split in which instructions are fetched directly ("in place") from off-chip SPI Flash, while data memory resides in a 1 KB on-chip Block RAM. The hardware datapath comprises:
- Data Stack: An 8-entry, 32-bit circular buffer for operand storage, implemented in distributed (LUT) RAM, with the top two entries cached in registers for zero-cycle ALU access.
- Return Stack: An independent 8×32 structure dedicated to return addresses for CALL/RET operations.
- Stack Pointers: 3-bit pointers for both stacks, wrapping automatically, obviating pointer-overflow logic.
- Instruction Fetch Mechanism: The SPI Flash controller is driven with a 24-bit address and streams 8-bit opcodes and multi-byte immediates into the decode stage without intermediate instruction caching.
- Absence of On-Chip Instruction Cache: By excluding on-chip instruction RAM, all Block RAM resources are preserved for application data, optimizing resource allocation on FPGAs with limited embedded memory (Chakrabarti, 30 Nov 2025).
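The stack organization listed above can be sketched as a small behavioral model. This is a minimal illustration, not the design's RTL: the class name `CircularStack` and its methods are hypothetical, the Python list stands in for the distributed (LUT) RAM, and `top_two` only reads the pair that the hardware would keep cached in registers.

```python
DEPTH = 8              # entries per stack, as in the 8-entry configuration
MASK32 = 0xFFFFFFFF    # 32-bit entries


class CircularStack:
    """Fixed-depth stack whose pointer wraps instead of trapping on overflow."""

    def __init__(self, depth=DEPTH):
        assert depth & (depth - 1) == 0, "depth must be a power of two"
        self.depth = depth
        self.mem = [0] * depth   # stands in for the distributed (LUT) RAM
        self.sp = 0              # a 3-bit pointer when depth == 8

    def push(self, value):
        self.mem[self.sp] = value & MASK32
        self.sp = (self.sp + 1) % self.depth   # wraps automatically

    def pop(self):
        self.sp = (self.sp - 1) % self.depth
        return self.mem[self.sp]

    def top_two(self):
        """Top-of-stack and next-on-stack, the pair the hardware caches in registers."""
        tos = self.mem[(self.sp - 1) % self.depth]
        nos = self.mem[(self.sp - 2) % self.depth]
        return tos, nos


data_stack = CircularStack()
return_stack = CircularStack()

# Add-style operation: pop two operands, push the 32-bit sum.
data_stack.push(7)
data_stack.push(35)
b, a = data_stack.pop(), data_stack.pop()
data_stack.push(a + b)

# CALL/RET bookkeeping lives on the separate return stack.
return_stack.push(0x000120)
```

Because the pointer is reduced modulo the depth, a ninth push silently overwrites the oldest entry. That is the behavior implied by dropping pointer-overflow logic, and it is why stack depth has to be matched to the workload.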
2. Execute-in-Place (XIP) and Memory Optimization
XIP is central to this architecture’s efficiency. Instructions are streamed directly from off-chip SPI Flash to the decode unit, bypassing the need for instruction RAM and thus conserving precious BRAM for data functions. The distributed-RAM stacks, augmented with register-cached top-of-stack (TOS) entries, minimize ALU access latency and simplify datapath control.
This configuration leaves all 26 BRAM blocks in the GW1NR-9 device available for general-purpose data, because the stacks are built only from distributed RAM and LUTs. The XIP approach is feasible because the stacks are lightweight and of parameterizable depth and the FSM control path is simple; together these ensure that instruction fetches from Flash are absorbed efficiently into the execution pipeline (Chakrabarti, 30 Nov 2025).
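To make the XIP fetch path concrete, the following sketch streams an opcode and an optional 32-bit immediate directly out of a Flash image. The function names, the little-endian byte order, and the opcode assignments are assumptions made for illustration; the source only states that 8-bit opcodes and multi-byte immediates are fetched over a 24-bit address.

```python
def fetch_byte(flash, pc):
    """Read one byte at a 24-bit Flash address; no instruction RAM is involved."""
    return flash[pc & 0xFFFFFF]


def fetch_instruction(flash, pc, has_imm32):
    """Return (opcode, immediate_or_None, next_pc) by streaming bytes from Flash."""
    opcode = fetch_byte(flash, pc)
    pc += 1
    imm = None
    if has_imm32(opcode):
        raw = bytes(fetch_byte(flash, pc + i) for i in range(4))
        imm = int.from_bytes(raw, "little")   # byte order is an assumption
        pc += 4
    return opcode, imm, pc & 0xFFFFFF


# Hypothetical encoding: 0x41 carries a 4-byte immediate, 0x6A carries none.
image = bytes([0x41, 0x2A, 0x00, 0x00, 0x00, 0x6A])
needs_imm = lambda op: op == 0x41

first = fetch_instruction(image, 0, needs_imm)
second = fetch_instruction(image, first[2], needs_imm)
```

Here `first` decodes to opcode `0x41` with immediate 42 and advances the program counter to 5, and `second` decodes the immediate-free opcode at that address. Nothing is copied into on-chip instruction storage at any point, which is the property that frees Block RAM for data.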
3. FSM-Based Stack Operation and Hazard Resolution
A 12-state centralized FSM orchestrates instruction fetch, decode, stack operation, memory access, and hazard management. Key FSM states include:
- FETCH, SPI_WAIT: Issue SPI requests and manage 200 ns Flash read latency (two explicit wait states).
- DECODE, FETCH_IMM: Classify opcode types and loop over fetches for multi-byte immediates.
- EXECUTE: Activate ALU, update stack pointers, and perform memory I/O.
- ALU_WAIT: Inserted after comparison instructions to resolve read-modify-write (R-M-W) hazards inherent in single-port distributed-RAM stacks, guaranteeing that writes follow pointer decrements.
- Special I/O Wait States: UART_WAIT and KEY_WAIT handle blocking I/O sequences.
The simplified transition diagram is as follows:

    FETCH → SPI_WAIT → DECODE
    DECODE –[is-imm?]→ FETCH_IMM (x4) → EXECUTE
    DECODE –[no-imm]→ EXECUTE
    EXECUTE –[comp-inst]→ ALU_WAIT → FETCH
    EXECUTE –[else]→ FETCH

This FSM structure ensures deterministic execution timing, explicit hazard management, and stability under the constraints of Flash latency (Chakrabarti, 30 Nov 2025).
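The transitions can be traced with a short model that lists the states one instruction passes through. This is a sketch under stated assumptions: the function `state_sequence` and its parameters are hypothetical, the two `SPI_WAIT` visits follow the "two explicit wait states" mentioned above, and one clock per state visit is assumed rather than taken from the source.

```python
def state_sequence(has_imm, is_compare, spi_wait_states=2, imm_bytes=4):
    """List the FSM states visited by one instruction, following the diagram."""
    seq = ["FETCH"] + ["SPI_WAIT"] * spi_wait_states + ["DECODE"]
    if has_imm:
        seq += ["FETCH_IMM"] * imm_bytes     # one visit per immediate byte
    seq.append("EXECUTE")
    if is_compare:
        seq.append("ALU_WAIT")               # stack write follows the pointer update
    return seq


plain = state_sequence(has_imm=False, is_compare=False)
compare = state_sequence(has_imm=False, is_compare=True)
with_imm = state_sequence(has_imm=True, is_compare=False)
```

Under the one-clock-per-state assumption, `plain` has five entries, which is consistent with the 5-cycle figure reported for zero-immediate instructions. The `with_imm` trace has only nine entries, fewer than the reported 12–13 cycles, so each `FETCH_IMM` visit presumably costs more than one clock in hardware; the model does not capture that detail.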
4. Stack Depth Parametrization: Resource and Timing Trade-offs
Resource and timing trade-offs were investigated by synthesizing the design with stack depths of 4, 8, and 16 entries on the GW1NR-9 FPGA. The following table summarizes empirical results:
| Stack Depth | LUT Utilization | FF Utilization | Routing/Timing Outcome |
|---|---|---|---|
| 4 entries | ≈ 65% | ≈ 28% | Meets timing; calculator overflows |
| 8 entries | ≈ 80% | ≈ 32% | Meets timing; optimal balance |
| 16 entries | ≈ 99% | ≈ 40% | Fails timing; excessive congestion |
The 8-entry configuration balances logic utilization and functional requirements, avoiding the program stack overflows seen at smaller depths and the severe routing congestion seen at larger depths. The logic resource model is approximately linear in stack storage: LUT usage grows with the total number of stack bits (two stacks × depth × 32 bits), scaled by the LUT-RAM implementation cost per bit. This indicates that stack sizing must be informed by both device constraints and application workload (Chakrabarti, 30 Nov 2025).
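As a rough check on how stack storage scales, the sketch below computes the total stack bits for each synthesized depth and the marginal LUT utilization per added bit implied by the table. The variable names are illustrative, and the percentages are the table's approximate values, so the derived slopes are indicative only.

```python
WIDTH = 32      # bits per stack entry
STACKS = 2      # data stack + return stack

# Approximate LUT utilization (%) from the synthesis table.
lut_percent = {4: 65, 8: 80, 16: 99}


def stack_bits(depth):
    """Total storage bits across both stacks at a given depth."""
    return STACKS * depth * WIDTH


bits = {d: stack_bits(d) for d in lut_percent}

# Percentage points of LUT utilization per additional stack bit, between adjacent depths.
depths = sorted(lut_percent)
slope = {
    (lo, hi): (lut_percent[hi] - lut_percent[lo]) / (bits[hi] - bits[lo])
    for lo, hi in zip(depths, depths[1:])
}
```

The storage requirement doubles at each step (256, 512, and 1024 bits). The implied marginal cost differs between the 4-to-8 and 8-to-16 steps, so a single per-bit constant should be read as an approximation, particularly near full device utilization.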
5. Performance Characterization and Application Demonstrations
The architecture achieves a stable operating frequency of 27 MHz, limited by Flash latency. Measured instruction throughput is 4–6 MIPS, depending on the instruction mix:
- Zero-immediate instructions: 5 cycles (27 MHz ÷ 5 ≈ 5.4 MIPS).
- Instructions with 32-bit immediates: 12–13 cycles (27 MHz ÷ 13 ≈ 2.1 MIPS; 27 MHz ÷ 12 ≈ 2.25 MIPS).
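The per-class figures follow from dividing the clock rate by the cycle count. The short calculation below reproduces them and adds a blended estimate; the helper `blended_mips` and its mix fraction are hypothetical conveniences, not measurements from the source.

```python
F_CLK_HZ = 27_000_000


def mips(cycles_per_instruction):
    """Millions of instructions per second at the 27 MHz clock."""
    return F_CLK_HZ / cycles_per_instruction / 1e6


zero_imm = mips(5)     # zero-immediate instructions
imm_fast = mips(12)    # 32-bit immediate, 12-cycle case
imm_slow = mips(13)    # 32-bit immediate, 13-cycle case


def blended_mips(fraction_with_imm, imm_cycles=13):
    """Average throughput for a mix; the mix fraction is a free parameter."""
    avg_cycles = (1 - fraction_with_imm) * 5 + fraction_with_imm * imm_cycles
    return mips(avg_cycles)
```

For example, a mix in which a quarter of the instructions carry 32-bit immediates averages 7 cycles per instruction, or roughly 3.9 MIPS, which shows how quickly immediates pull throughput below the 5.4 MIPS peak.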
Example software benchmarks demonstrate real-world capability. For instance, a digit-parsing loop in an infix calculator requires approximately 30 cycles per digit at 27 MHz, covering a MUL, an ADD, two stack push/pop operations, and a branch. Demonstrated applications include single- and multi-digit calculators and UART-based I/O at 115200 baud, with timing validated via oscilloscope traces (Chakrabarti, 30 Nov 2025).
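The benchmark figures can be put on a common time scale with a few lines of arithmetic. The 10-bit UART frame (8N1: one start bit, eight data bits, one stop bit) is an assumption; the source gives only the baud rate.

```python
F_CLK_HZ = 27_000_000
BAUD = 115_200
FRAME_BITS = 10          # assumed 8N1 framing: start + 8 data + stop

digit_time_us = 30 / F_CLK_HZ * 1e6                 # time to parse one digit
bit_time_us = 1e6 / BAUD                            # duration of one UART bit
cycles_per_frame = F_CLK_HZ * FRAME_BITS / BAUD     # clock cycles per transferred byte
parse_share = 30 / cycles_per_frame                 # parsing cost relative to one byte time
```

Parsing a digit takes about 1.1 µs, while one UART byte occupies about 2344 clock cycles under the assumed framing. On these numbers the digit loop uses a little over 1% of a character time, which suggests that serial I/O, not computation, bounds interactive throughput.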
6. Design Principles and Implementation Considerations
Principal design decisions facilitating robust XIP and efficient stack operation include:
- Small, fixed-depth stacks in distributed LUT-RAM to minimize routing and logic overhead.
- Register caching of the top two stack entries for instantaneous ALU access.
- Explicit multi-state FSM that integrates Flash latency waits into the control path.
- Centralized ALU_WAIT state to guarantee correctness in distributed RAM R-M-W sequences.
- Full openness and transparency via synthesis and routing in open-source EDA flows, avoiding proprietary IP blocks.
Together, these factors yield a transparent, portable, and resource-efficient WASM-subset soft-core processor able to comfortably run interactive and computational applications directly from SPI Flash, leaving on-chip RAM solely to user data requirements (Chakrabarti, 30 Nov 2025).