
VeBPF Many-Core Architecture

Updated 21 December 2025
  • VeBPF many-core architecture is an FPGA-based system enabling parallel, dynamic eBPF processing for efficient line-rate packet handling.
  • It features a custom soft-core with a five-stage pipeline and single-cycle rule switching, optimizing latency and throughput across diverse platforms.
  • Empirical tests show 4–6× lower latency than a traditional RISC-V soft-core, confirming a scalable, resource-efficient design for SmartNICs and IoT FPGAs.

The VeBPF many-core architecture is a resource-optimized, highly configurable FPGA-based system for line-rate network packet processing, designed to efficiently execute extended Berkeley Packet Filter (eBPF) programs in parallel across multiple soft-cores. Each VeBPF core is eBPF ISA compliant and implemented in Verilog HDL, enabling seamless integration with standard FPGA IP and interoperability with Linux eBPF toolchains. The VeBPF architecture supports massive parallelism, dynamic rule updates at runtime without FPGA reconfiguration, and targets both low-end IoT FPGAs and high-end SmartNICs. The architecture, core microarchitecture, system organization, dynamic reconfigurability, performance models, and experimental validation are addressed in detail below (Tahir et al., 14 Dec 2025).

1. Architectural Objectives and Target Platforms

The VeBPF many-core architecture is engineered to provide a scalable, resource-efficient framework for high-throughput network packet processing across a spectrum of platforms.

Goals:

  • Line-rate packet handling on both resource-constrained IoT edge FPGAs and high-performance data-center SmartNICs.
  • Full compliance with the 64-bit eBPF ISA; enables program exchange with standard Linux eBPF toolchains.
  • Arbitrary scaling: any number $N_{\text{VeBPF}}$ of VeBPF soft cores can be instantiated (limited only by FPGA resources), together with any number $R$ of eBPF match/action rules, with dynamic runtime updating.
  • Dynamic rule switching: rules can be modified at runtime in a single clock cycle per core, avoiding hardware resynthesis or bitstream modification.

Supported Platforms:

  • Low-end Artix-7 (e.g., Arty A7-100T) for edge and IoT applications.
  • High-end Xilinx/Intel UltraScale+ FPGAs for cloud DPUs and SmartNICs.

2. VeBPF CPU Core Microarchitecture

Each VeBPF core is a custom, single-issue, in-order, Harvard-architecture soft processor optimized for packet-processing workloads under the eBPF 64-bit instruction set.

Core Features:

  • Separate 64-bit-wide program memory and byte-wide (8-bit) data memory, sized to accelerate packet-header processing.
  • Eleven 64-bit general-purpose registers R0–R10; R1–R5 receive parsed packet fields, and R0 conveys the output/policy decision (a brief register-convention sketch follows this list).
  • Single-cycle, five-stage pipeline: IFetch, Decode, Execute/ALU, Load/Store, WrBack.
  • Custom PL Call Handler block enables hardware acceleration for eBPF “call” instructions.
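
As a concrete illustration of the register convention above, the following minimal Python sketch models the eleven-register file and the R1–R5 input / R0 verdict convention; the numeric verdict encoding is an assumption made for illustration only.

# Minimal software model of the VeBPF register convention described above.
# The numeric verdict encoding (0 = DROP, 1 = STORE, 2 = ERROR) is an
# assumption for illustration; the architecture only names the three outcomes.
DROP, STORE, ERROR = 0, 1, 2

class VeBPFRegisterFile:
    """Eleven 64-bit registers R0..R10; R1..R5 carry parsed header fields."""

    def __init__(self):
        self.r = [0] * 11  # R0..R10, each masked to 64 bits

    def load_header_fields(self, fields):
        """Place up to five parsed packet fields into R1..R5."""
        assert len(fields) <= 5
        for i, value in enumerate(fields, start=1):
            self.r[i] = value & 0xFFFF_FFFF_FFFF_FFFF

    def verdict(self):
        """R0 conveys the rule's output/policy decision."""
        return self.r[0]

# Example: hand a parsed 5-tuple-style header slice to a core model,
# then read the decision a rule would have written into R0.
regs = VeBPFRegisterFile()
regs.load_header_fields([0x0A000001, 0x0A000002, 6, 80, 443])
regs.r[0] = DROP
print(regs.verdict())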

Pipeline and Functional Units:

  • IFetch retrieves eBPF bytecode instructions.
  • Decode unpacks opcode, register indexes, immediates.
  • Execute/ALU supports full 64-bit integer arithmetic, logic, and shifts.
  • Load/Store provides byte-addressable, single-cycle access to packet headers.
  • Branch Unit resolves all branches in a single cycle (no branch prediction).
  • Register writes occur in the same cycle that the result data becomes available.

Branch, Call, and Rule Switching:

  • Calls are handled by a user-customizable hardware block.
  • No out-of-order structures or dynamic scheduling; deterministic per-instruction timing.
  • Single-cycle rule switching leverages an external instruction-pointer register (RESET_IN), permitting prompt re-programming in live packet-processing deployments.
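
A simulation-level check of this behavior could look like the Cocotb-style sketch below (the released test environments use Python/Cocotb, per Section 5); the port names clk, reset_in, new_pc, and pc, as well as the address value, are assumptions rather than the actual top-level signals.

# Hypothetical Cocotb test: a newly assigned rule start address should take
# effect within a single clock cycle. Port names are assumed, not documented.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def single_cycle_rule_switch(dut):
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())

    # Drive the external instruction-pointer assignment for one cycle.
    dut.new_pc.value = 0x40      # assumed start address of the new rule
    dut.reset_in.value = 1
    await RisingEdge(dut.clk)
    dut.reset_in.value = 0

    # One cycle later the core should be fetching from the new rule.
    await RisingEdge(dut.clk)
    assert dut.pc.value == 0x40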

3. Many-Core System Organization, Dataflow, and Scheduling

The system is organized as a many-core accelerator that operates between the network-facing MAC and main packet memory, coordinated by a management-plane RISC-V soft-core processor.

Principal Subsystems:

  • Packet Slicer & DMA: Accepts AXI-Stream Ethernet packets, slices programmable header regions ($H$ bytes), pushes headers into a FIFO, and DMAs full payloads to DDR RAM. A per-packet descriptor tracks metadata.
  • Data Loader: Broadcasts each header word (64 bits) to all VeBPF cores over a shared bus; global ACK reduction ensures broadcast reliability.
  • Program Loader: Accepts eBPF bytecode via UART, parses metadata $\{\text{rule\_start\_addr}, \text{rule\_length}\}$, loads every rule into every core over a shared bus, and synchronizes via per-core ACKs.
  • Multi-Rule Scheduler: Dynamically assigns rules to idle cores by single-cycle instruction-pointer redirection. The scheduler implements round-robin or priority arbitration and tracks which rule/core pairs are active.
  • Result Analyzer: Selects the first conclusive result (STORE, DROP, ERROR), updates the packet’s descriptor, and triggers header loading for the next packet.

Scheduler Pseudocode:

wait All_eBPF_rules_uploaded_flag && VeBPF_data_loading_done_flag
rule_idx ← 0
while rule_idx < R:
    core_id ← arbiter.grant_idle_core()
    program_bus.select(core_id)
    core[core_id].PC ← rule_metadata[rule_idx].start_addr
    tracker.register(core_id, rule_idx)
    rule_idx ← rule_idx + 1
end
wait until tracker.valid_result_received()
signal data_loader.load_next_header()
goto top
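
The same control flow can be expressed as a small Python model, assuming simple round-robin arbitration over idle cores; the class and attribute names below are illustrative and not taken from the released sources.

# Illustrative Python model of the Multi-Rule Scheduler loop above.
# Assumes at least as many idle cores as rules; the hardware arbiter would
# instead block until a core becomes free.
from collections import deque

class MultiRuleScheduler:
    def __init__(self, cores, rule_metadata):
        self.cores = cores                    # objects exposing a .pc field
        self.rules = rule_metadata            # [{"start_addr": ...}, ...]
        self.idle = deque(range(len(cores)))  # round-robin pool of idle cores
        self.active = {}                      # core_id -> rule_idx

    def dispatch_all(self):
        """Assign every rule to an idle core by redirecting its PC."""
        for rule_idx, meta in enumerate(self.rules):
            core_id = self.idle.popleft()                # grant_idle_core()
            self.cores[core_id].pc = meta["start_addr"]  # single-cycle redirect
            self.active[core_id] = rule_idx

    def retire(self, core_id):
        """Return a finished core to the idle pool once its result is taken."""
        del self.active[core_id]
        self.idle.append(core_id)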

4. Dynamic Rule Update and Run-Time Flexibility

VeBPF enables dynamic eBPF rule updates at runtime without FPGA reconfiguration:

  • The host processor asserts VeBPF_rst_new_rules_flag.
  • The Program Loader receives new rules over UART, parses updated metadata/length, and re-programs every VeBPF core.
  • Rule switching is achieved in a single clock cycle using the external program-counter assignment logic.
  • No bitstream or FPGA image reload is required; the system continues processing with the new rule set on the next event loop.
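
A host-side update sequence might resemble the pyserial sketch below; the length-prefixed framing and the helper that asserts VeBPF_rst_new_rules_flag are assumptions, since the exact wire protocol belongs to the Program Loader implementation and is not restated here.

# Hypothetical host-side rule update over UART (pyserial). Framing and the
# flag-assertion callback are assumptions for illustration.
import struct
import serial  # pyserial

def push_new_rules(port, rules, assert_new_rules_flag):
    """rules: list of (start_addr, bytecode) tuples, bytecode as bytes."""
    assert_new_rules_flag()  # e.g. a CSR write issued via the RISC-V host

    with serial.Serial(port, baudrate=115200, timeout=1) as link:
        for start_addr, bytecode in rules:
            rule_length = len(bytecode) // 8  # 64-bit eBPF instructions
            # Assumed metadata framing: {rule_start_addr, rule_length}.
            link.write(struct.pack("<II", start_addr, rule_length))
            link.write(bytecode)
    # The Program Loader re-programs every core; processing resumes with the
    # new rule set on the next event loop, with no bitstream reload.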

5. Hardware–Software Interface and Open-Source Ecosystem

The management plane is implemented with a RISC-V soft-processor, providing both configuration and runtime controls via memory-mapped I/O over a CSR bus. Key routines:

  • configure_packet_buffer(addr, len)
  • upload_rules_via_UART()
  • arm_program_loader()
  • arm_data_loader()
  • read_results()
  • reset_for_new_rules()
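
As an illustration of how such routines map onto memory-mapped CSR accesses, the Python sketch below uses a /dev/mem-style mapping; the base address, register offsets, and 32-bit register width are placeholder assumptions, not documented values.

# Hypothetical memory-mapped CSR wrapper for the routines listed above.
# CSR_BASE and the register offsets are placeholders, not documented values.
import mmap
import os
import struct

CSR_BASE = 0x4000_0000                    # assumed CSR-bus base address
PKT_BUF_ADDR, PKT_BUF_LEN = 0x00, 0x04    # assumed register offsets
PROG_LOADER_ARM, DATA_LOADER_ARM = 0x08, 0x0C
RESULT_REG, NEW_RULES_RST = 0x10, 0x14

class VeBPFCsr:
    def __init__(self):
        fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
        self.regs = mmap.mmap(fd, 0x1000, offset=CSR_BASE)

    def _write(self, off, val):
        self.regs[off:off + 4] = struct.pack("<I", val)

    def _read(self, off):
        return struct.unpack("<I", self.regs[off:off + 4])[0]

    def configure_packet_buffer(self, addr, length):
        self._write(PKT_BUF_ADDR, addr)
        self._write(PKT_BUF_LEN, length)

    def arm_program_loader(self):
        self._write(PROG_LOADER_ARM, 1)

    def arm_data_loader(self):
        self._write(DATA_LOADER_ARM, 1)

    def read_results(self):
        return self._read(RESULT_REG)

    def reset_for_new_rules(self):
        self._write(NEW_RULES_RST, 1)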

The UART link is utilized for both bytecode rule streams and RISC-V firmware uploads. All cores, architectural modules, and simulation frameworks (Python, Cocotb) are released as open source to accelerate further research and adoption in the FPGA and SmartNIC community (Tahir et al., 14 Dec 2025).

6. Resource Utilization and Performance Modeling

Resource Usage

Empirical measurements and synthesis indicate the following per-core resource profile:

Core Type                   LUTs    FFs    BRAMs
VeBPF core                  3500    1600   1.5
RISC-V PE (from [13])       7878    1944   20
SEPHIROT (eBPF via VLIW)    27000   4000   n/a

Total resources scale linearly with the core count $N$:

$\mathrm{LUT}_{\mathrm{total}}(N) = N \times 3500 + \mathrm{LUT}_{\mathrm{overhead}}$

$\mathrm{BRAM}_{\mathrm{total}}(N) = N \times 1.5 + \mathrm{BRAM}_{\mathrm{overhead}}$
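
These relations can be evaluated directly to size a deployment, as in the sketch below; the overhead terms and the Artix-7 A7-100T capacity figures are approximate assumptions, while the 12-core product matches the roughly 42,000 LUTs reported in Section 7.

# Resource-scaling estimate per the formulas above. Per-core figures come
# from the table; the overhead terms and device capacities are approximate.
LUT_PER_CORE, BRAM_PER_CORE = 3500, 1.5

def estimate(n_cores, lut_capacity, bram_capacity,
             lut_overhead=0, bram_overhead=2.0):
    lut_total = n_cores * LUT_PER_CORE + lut_overhead
    bram_total = n_cores * BRAM_PER_CORE + bram_overhead
    fits = lut_total <= lut_capacity and bram_total <= bram_capacity
    return lut_total, bram_total, fits

# 12 cores on an Artix-7 A7-100T (~63,400 LUTs, 135 BRAMs):
# 12 x 3500 = 42,000 LUTs plus overhead, in line with Section 7.
print(estimate(12, 63400, 135))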

Latency and Throughput

Let

  • $H$: header size in bytes
  • $w = 8$: bytes per 64-bit word
  • $I$: number of instructions in the longest rule
  • $f_{\mathrm{clk}}$: core clock frequency
  • $k$: number of parallel cores (up to $R$ rules)

Packet processing time per packet:

$T_{\mathrm{load}} = \dfrac{\lceil H/w \rceil}{f_{\mathrm{clk}}}$

$T_{\mathrm{exec}} = \dfrac{I}{f_{\mathrm{clk}}}$

$T_{\mathrm{total}}(k) = \dfrac{\lceil H/w \rceil + I}{f_{\mathrm{clk}}}$

Aggregate throughput:

$\mathrm{Throughput}(k) = \min(k, R) \cdot \dfrac{1}{T_{\mathrm{total}}(k)}$
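
These expressions evaluate directly, as in the sketch below; the clock frequency, rule length, and header size used in the example are hypothetical values for illustration, not figures taken from the paper.

# Latency/throughput model from the expressions above. The example clock
# frequency, rule length, and header size are hypothetical.
from math import ceil

def per_packet_latency(H, I, f_clk, w=8):
    """T_total = (ceil(H/w) + I) / f_clk, in seconds."""
    t_load = ceil(H / w) / f_clk
    t_exec = I / f_clk
    return t_load + t_exec

def aggregate_throughput(k, R, H, I, f_clk, w=8):
    """min(k, R) packets resolved per T_total interval."""
    return min(k, R) / per_packet_latency(H, I, f_clk, w)

# Hypothetical example: 64 B header, 40-instruction rule, 100 MHz clock,
# 12 cores matched against 12 rules.
t = per_packet_latency(H=64, I=40, f_clk=100e6)
print(f"{t * 1e6:.2f} us/packet, "
      f"{aggregate_throughput(12, 12, 64, 40, 100e6) / 1e6:.1f} Mpkt/s")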

7. Empirical Validation and Experimental Outcomes

A firewall implementation on an Artix-7 (Arty A7-100T) validated VeBPF’s scalability and performance.

Experimental metrics:

  • 12-core VeBPF many-core, processing SANS-style filtering rules.
  • Sustained line-rate (100 Mbps) throughput on all packet sizes, including minimum-size Ethernet frames (64 B).
  • VeBPF-accelerated packet filtering demonstrated 4–6× lower latency than a single RISC-V core, with measured per-packet latency of $\approx 0.25\,\mu\mathrm{s}$ (VeBPF) vs. $\approx 1.5\,\mu\mathrm{s}$ (RISC-V) for 64 B packets.
  • The full system required $\sim 42{,}000$ LUTs and 20 BRAMs for 12 cores, while an equivalent 12-core RISC-V-based system would vastly exceed the fabric’s LUT capacity.

A plausible implication is that the VeBPF architecture substantially increases parallel eBPF processing density per FPGA, enabling the deployment of dense, dynamically reprogrammable network processing fabrics across both resource-limited and high-performance deployment contexts (Tahir et al., 14 Dec 2025).
