FlashAbacus: Self-Governing Data Accelerator
- FlashAbacus is a self-governing data-processing accelerator that unifies heterogeneous kernel execution with direct NAND flash access for energy-efficient computing in embedded and edge systems.
- It employs both static and dynamic inter-kernel as well as in-order and out-of-order intra-kernel scheduling to optimize core utilization and reduce latency for mixed workloads.
- Direct flash integration bypasses traditional filesystems, using a dedicated on-chip Flashvisor and efficient page mapping to achieve low-latency storage management and energy savings.
FlashAbacus is a self-governing data-processing accelerator that integrates heterogeneous kernel execution and direct NAND-flash storage access within a unified low-power multicore system. It is designed to maximize energy efficiency and computational flexibility for data-intensive workloads, specifically targeting embedded, edge, and near-sensor systems that operate within strict power envelopes. By eliminating the need for host-level filesystems, I/O runtime libraries, and external CPU arbitration, FlashAbacus fundamentally restructures the hardware–software boundary in storage-centric heterogeneous computing architectures (Zhang et al., 2018).
1. Architectural Composition
FlashAbacus consists of three tightly integrated hardware domains:
- Lightweight VLIW multiprocessors (LWPs): Eight 1 GHz VLIW cores, each featuring eight functional units (2 multipliers, 4 general-purpose ALUs, 2 load/store units), private 64 KB L1 and 512 KB L2 caches, collectively sharing 1 GB DDR3L memory (800 MHz, ~6.4 GB/s peak) and a 4 MB SRAM scratchpad (500 MHz, ~16 GB/s).
- FPGA-based Flash Backbone: Four NV-DDR2 open-channel flash channels, each connected to four triple-level-cell packages (totaling 16 NAND dies and 32 GB). Custom Virtex Ultrascale FPGA logic (~2M logic cells) implements per-channel flash controllers that translate high-level requests into flash I/O operations and manage SRIO links to the network.
- On-Chip Networks and I/O: A two-tier crossbar interconnection (tier-1: 256 lanes for cores, DDR3L, and scratchpad; tier-2: multiple 128-lane sub-crossbars for PCIe and FMC ports). PCIe v2.0 ×2 endpoint interface (≈1 GB/s) supports host kernel offloading. Internal management is handled by an embedded controller (“Flashvisor”) and a message-queue accelerator (Navigator), ensuring the host CPU remains uninvolved in both resource arbitration and core scheduling.
A representative high-level block diagram:

```
+--------------------------------------------------+
|           FlashAbacus  (PCIe endpoint)           |
|                                                  |
|  +------------------+      +------------------+  |
|  | Flashvisor (LWP) |      | Storengine (LWP) |  |
|  +------------------+      +------------------+  |
|           |    Tier-1 crossbar    |              |
|  +------+   +------+     ...     +------+        |
|  | LWP0 |   | LWP1 |             | LWP7 |        |
|  +------+   +------+             +------+        |
|           |    Tier-2 crossbars   |              |
|  +--------------------------------------------+  |
|  |        4x FPGA flash controllers           |  |
|  +--------------------------------------------+  |
|  |        16 NAND dies (32 GB total)          |  |
|  +--------------------------------------------+  |
+--------------------------------------------------+
```
2. Kernel Execution and Scheduling Mechanisms
FlashAbacus achieves self-governance through the Flashvisor, a dedicated on-chip controller responsible for all aspects of kernel launch, I/O orchestration, and resource assignment. It employs two complementary scheduling strategies:
A. Inter-Kernel Scheduling:
- Static Inter-Kernel: Each kernel is pinned to a fixed core at launch; assignment is simple, but load can become unbalanced when kernel costs differ.
- Dynamic Inter-Kernel: Kernels are queued globally; free LWPs pull the next available kernel in round-robin, improving load balancing and idle reduction.
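The dynamic policy behaves like a shared work queue that idle cores drain. A minimal Python model of the idea (a toy sketch, not the actual firmware; all names are illustrative):

```python
from collections import deque

def dynamic_inter_kernel(kernels, num_lwps):
    """Toy model of dynamic inter-kernel scheduling: the next free LWP
    pulls the next queued kernel, so load balances even when kernel
    costs differ. `kernels` is a list of (name, cycles) pairs."""
    queue = deque(kernels)
    lwp_busy_until = [0] * num_lwps           # per-core finish time
    schedule = []                             # (lwp_id, kernel, start, end)
    while queue:
        # the core that frees up earliest pulls the next kernel
        lwp = min(range(num_lwps), key=lambda i: lwp_busy_until[i])
        name, cycles = queue.popleft()        # FIFO kernel order
        start = lwp_busy_until[lwp]
        lwp_busy_until[lwp] = start + cycles
        schedule.append((lwp, name, start, start + cycles))
    return schedule, max(lwp_busy_until)

# Four kernels of uneven cost on two cores: dynamic pulling keeps both
# cores busy (makespan 60), whereas a static split that pairs the two
# long kernels on one core would take 90.
sched, makespan = dynamic_inter_kernel(
    [("k0", 50), ("k1", 10), ("k2", 10), ("k3", 40)], num_lwps=2)
```

The contrast with the static policy is visible directly in the makespan: pinning kernels to cores up front cannot react to uneven kernel lengths, while pulling from a shared queue can.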
B. Intra-Kernel Scheduling:
- Kernels are split into "microblocks" (sequential steps) and "screens" (parallelizable inner loops or partitions).
- In-Order Intra-Kernel: Microblocks execute serially; within a block, all screens run concurrently on available LWPs.
- Out-of-Order Intra-Kernel ("IntraO3"): Flashvisor dispatches any ready screen from any kernel or microblock, constrained only by data dependencies. This maximizes utilization by allowing idle cores to execute screens from other kernels.
Simplified inner logic of the out-of-order scheduler, as pseudocode:

```
// global ready-screen list: (kernel_id, microblock_id, screen_id, data_range)
while (not all_kernels_done) {
    wait_for_event(core_freed or new_kernel_arrived);
    for each ready_screen s in global_list:
        if (no data_conflict(s) and there is a free LWP) {
            assign s to some free LWP;
            mark s as in-flight;
        }
    // cores process screens, notify on completion
}
```
Data dependencies are maintained in a "multi-app execution chain," where subsequent microblocks unlock only after all screens in the preceding block complete.
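This unlock rule can be modeled as a per-microblock counter of outstanding screens; a small Python sketch of the chain logic (class and method names are illustrative, not the firmware's):

```python
class ExecutionChain:
    """Toy model of the multi-app execution chain: within one kernel,
    microblock i+1 becomes dispatchable only after every screen of
    microblock i has completed."""

    def __init__(self, screens_per_microblock):
        # e.g. [3, 2] -> microblock 0 has 3 screens, microblock 1 has 2
        self.screens = screens_per_microblock
        self.current = 0                          # active microblock index
        self.remaining = screens_per_microblock[0]

    def ready_screens(self):
        """Number of screens the scheduler may dispatch right now."""
        if self.current >= len(self.screens):
            return 0                              # kernel finished
        return self.remaining

    def complete_screen(self):
        """Called when an LWP finishes a screen; advances the chain
        once the whole microblock has drained."""
        self.remaining -= 1
        if self.remaining == 0:
            self.current += 1
            if self.current < len(self.screens):
                self.remaining = self.screens[self.current]

chain = ExecutionChain([3, 2])
for _ in range(3):          # drain all screens of microblock 0
    chain.complete_screen()
# microblock 1 is now unlocked with 2 dispatchable screens
```

In the real scheduler, the ready screens of many such chains feed the single global list that IntraO3 draws from, which is what lets idle cores pick up work from any kernel.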
3. Direct Flash Integration and Data Management
FlashAbacus presents a byte-addressable flash abstraction, mapped directly into kernel data sections in DDR3L, bypassing host filesystems and DMA engines. Flashvisor intercepts load/store requests, verifies them against a range-lock structure to prevent overlapping writes, and converts logical page groups into physical NAND accesses using an in-scratchpad page table.
- Logical address space: Partitioned into 64 KB "page groups" (distributed as 4 × 16 KB pages across four channels).
- Page mapping table: The entire 32 GB space is tracked by ≈512K entries held in the 4 MB SRAM scratchpad, so address translation avoids DDR3L round trips.
- Write Policy: Log-structured allocation of page groups; background migration and block reclamation handled by Storengine (a secondary LWP) during garbage collection and wear leveling.
- Protection: Range locking utilizes a red–black tree keyed by page group start address. Any request overlapping an in-flight operation is blocked until commit.
Only critical metadata (the first two pages of each block) is persisted on flash, enabling recovery after a power failure.
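Under the stated layout (64 KB page groups striped as 4 × 16 KB pages across four channels), the logical-to-physical decomposition reduces to simple arithmetic. A hedged sketch, with field names that are illustrative rather than the actual firmware layout:

```python
PAGE_SIZE = 16 * 1024               # one NAND page, per the text
CHANNELS = 4
GROUP_SIZE = PAGE_SIZE * CHANNELS   # 64 KB page group

def split_address(logical_addr):
    """Decompose a byte address into (page_group, channel, offset),
    assuming pages of a group are striped round-robin over channels."""
    group = logical_addr // GROUP_SIZE
    within = logical_addr % GROUP_SIZE
    channel = within // PAGE_SIZE
    offset = within % PAGE_SIZE
    return group, channel, offset

# Byte 0x14000 (80 KB) lands in page group 1, channel 1, offset 0.
assert split_address(0x14000) == (1, 1, 0)

# Sanity check: 32 GB / 64 KB per group = 512K mapping entries,
# matching the in-scratchpad page table size quoted above.
assert (32 * 2**30) // GROUP_SIZE == 512 * 1024
```

The range-lock check then operates at page-group granularity: a request is blocked whenever its `[group_start, group_end]` interval overlaps an in-flight entry in the red–black tree.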
4. Implementation Characteristics
The prototype employs:
- TI TCI6678 multicore PCIe card (×2 PCIe lanes)
- Eight 1 GHz VLIW LWPs, each 0.8 W/core
- 1 GB DDR3L @ 800 MHz, 0.7 W, 6.4 GB/s peak bandwidth
- 4 MB SRAM scratchpad, shared, at 500 MHz (~16 GB/s)
- Tier-1 crossbar: 256 lanes @ 500 MHz; Tier-2 crossbars: 128 lanes @ 333 MHz
- 4 NV-DDR2 flash channels @ 200 MHz, totaling 11 W
- Per-channel I/O rates: 16 KB read ≈ 81 μs, 16 KB write ≈ 2.6 ms
- PCIe v2.0 ×2 (5 GHz), 1 GB/s peak, 0.17 W
All Flashvisor metadata resides in SRAM scratchpad; persisted page metadata occupies minimal flash space for durability.
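From these figures, back-of-the-envelope bandwidth checks follow. The values below are derived from the quoted latencies, not independently measured:

```python
# Per-channel flash bandwidth implied by the quoted 16 KB access latencies.
READ_LAT_S = 81e-6                # 16 KB read  ~ 81 us
WRITE_LAT_S = 2.6e-3              # 16 KB write ~ 2.6 ms
XFER_BYTES = 16 * 1024

read_bw_per_ch = XFER_BYTES / READ_LAT_S      # ~202 MB/s per channel
write_bw_per_ch = XFER_BYTES / WRITE_LAT_S    # ~6.3 MB/s per channel
agg_read_bw = 4 * read_bw_per_ch              # ~0.81 GB/s over 4 channels

# Aggregate flash reads (~0.81 GB/s) sit below both the PCIe v2.0 x2
# link (~1 GB/s) and DDR3L (~6.4 GB/s), so for streaming reads the
# flash array itself, not the interconnect, is the limiting resource.
```

The large read/write asymmetry (~32×) also explains why the log-structured write policy and background migration by Storengine matter: they keep slow program operations off the critical path.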
5. Performance and Energy Evaluation
FlashAbacus was evaluated using 14 Polybench kernels and 14 heterogeneous kernel mixes, compared across five system configurations:
| Configuration | Scheduling Type | Hardware Integration |
|---|---|---|
| SIMD | Baseline (host DMA) | Embedded accelerator |
| InterSt | Static inter-kernel | FlashAbacus (no intra) |
| InterDy | Dynamic inter-kernel | FlashAbacus (no intra) |
| IntraIo | In-order intra-kernel | FlashAbacus |
| IntraO3 | Out-of-order intra-kernel | FlashAbacus |
Throughput improvements (normalized to SIMD):
- In data-intensive kernels, IntraO3 yields +144% average throughput vs. SIMD; InterDy +130%; InterSt +80%.
- For mixed workloads, IntraO3 is +127% over SIMD, exceeding InterDy by 15%.
Speedup over the baseline is computed as `Speedup = T_SIMD / T_config`, where each `T` term includes total computation, I/O, and scheduling latency.
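As a worked example of this relation (a sketch; the 2.44 factor is simply the reported +144% gain restated as a ratio):

```python
def speedup(t_simd, t_config):
    """Speedup relative to the SIMD baseline, where each latency term
    bundles total computation, I/O, and scheduling time."""
    return t_simd / t_config

# A +144% average throughput improvement means the same work finishes
# in 1/2.44 of the baseline time:
t_simd = 1.0                      # normalized baseline latency
t_intra_o3 = t_simd / 2.44        # latency implied by +144% throughput
```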
Energy metrics (relative to SIMD):
- SIMD: ≈71% of energy consumed by host–accelerator transfers, with 85% attributed to storage software stack.
- IntraO3 reduces total energy by 78.4%, with major efficiency from diminished data movement and host CPU interaction.
- InterSt may increase energy usage in compute-heavy, poorly balanced kernels due to non-ideal core activity.
Core utilization:
- SIMD: ~50% (due to stalls on I/O)
- InterSt: lower, from static imbalance
- InterDy: up to 98%
- IntraO3: ~95%, highest on mixed workloads
6. Scalability, Limitations, and Research Directions
FlashAbacus's out-of-order intra-kernel scheduling is most effective when workloads can be partitioned into independent screens; strictly serial code limits the parallelism available within a kernel, though its screens can still overlap with parallel regions of other kernels. The system is suited to low-power scenarios (≤20 W) such as embedded analytics and edge inference, where close coupling of storage and compute is essential.
Limitations include:
- The flash controllers are FPGA-based; an ASIC implementation could further reduce I/O latency.
- The design cannot be retrofitted onto commercial accelerators and GPGPUs, which do not expose the required low-level flash and memory interfaces.
- Simple page mapping and block recycling (log-structured, round-robin); future enhancements could incorporate wear-leveling, power-fail atomicity, or multi-objective victim selection.
Future research interests include:
- Extension of Flashvisor for multitenant support and QoS guarantees.
- Hardware acceleration of flash management (dedicated FTL engine) to offload management tasks from the LWPs.
- Application of the self-governing paradigm to additional architectures (e.g., ML ASICs, network processors) and novel non-volatile memories (e.g., 3D XPoint).
FlashAbacus exemplifies the efficacy of collapsing host, accelerator, and storage boundaries into a unified domain governed by on-chip lightweight cores, achieving demonstrable improvements in both throughput (+127%) and energy efficiency (–78.4%) for heterogeneous data-intensive workloads as compared to conventional CPU–SSD–accelerator designs (Zhang et al., 2018).