
Byte-Offset Indexing Architecture

Updated 2 January 2026
  • Byte-Offset Indexing Architecture is a hardware-accelerated technique that remaps virtual page slots directly to physical data pages, eliminating explicit pointer dereferences.
  • It employs standard Linux APIs and asynchronous remapping to efficiently support lookup, insertion, and bucket splitting in extendible hashing implementations.
  • Empirical evaluations show significant reductions in page-table walks and lookup latency, demonstrating practical improvements in throughput and TLB performance.

The Byte-Offset Indexing Architecture, also referred to as the "page-table shortcut" design, is a hardware-accelerated technique for database indexing that actively incorporates the operating system's virtual memory page table. Instead of materializing pointer-based indirections within software-managed data structures, this approach remaps virtual address slots directly to physical data pages by manipulating page table entries (PTEs) via standard OS interfaces. The result is a reduction in pointer dereference overhead, a decrease in address translation steps per lookup, and efficient exploitation of memory management hardware such as the MMU and TLB. This architecture has been realized and evaluated in the context of extendible hashing, yielding substantial improvements in lookup throughput and latency (Schuhknecht, 2023).

1. Pointer Elision via Page Table Indirection

Traditional database indexes, such as extendible hash tables, utilize directories where each slot contains a pointer to a data or "bucket" page. In operation, a lookup involves traversing several levels of indirection:

  • Reading the directory page (implicit page-table walk)
  • Dereferencing a pointer to reach the bucket page (explicit pointer chase, may cause another page-table walk)
  • Additional walks if indirections span multiple levels

The Byte-Offset Indexing Architecture eliminates explicit pointer semantics. Each directory slot is materialized as a separate 4 KB virtual page. The mapping from directory slot to physical bucket is encoded directly in the page table, making the page table entry itself the de facto "pointer." Concretely, an access to V_dir + i · 4 KB retrieves data from bucket i, with no extra pointer dereference. All buckets and target pages are allocated from a pre-sized Linux memfd file (the "page pool"), and directory slots are remapped via:

mmap(
  addr = V_dir + i*4KB,
  length = 4KB,
  prot = PROT_READ|PROT_WRITE,
  flags = MAP_SHARED|MAP_FIXED,
  fd = p_pool,
  offset = O_i
)

where O_i is the file offset of the target bucket. The PFN field of the PTE points directly to the bucket's physical frame, thus collapsing levels of logical indirection into one hardware-managed translation step. Kernel modifications are unnecessary: standard Linux APIs and the unextended x86_64 4-level page table suffice.

2. Algorithms for Lookup, Insertion, and Maintenance

Lookup

A single-threaded lookup operates as follows:

function lookup(key):
    h = hash(key)
    i = topBits(h, global_depth)
    base = V_dir + i * PAGE_SIZE
    for slot in bucket_at(base):
        if slot.key == key: return slot.value
    return NOT_FOUND

This procedure leverages the fact that each directory slot is already mapped to the correct bucket page. The hardware performs a single page-table walk for the lookup, reducing translation steps and indirections.

Insertion and Bucket Splitting

Bucket splits and directory resizing are managed with minimal blocking. When a split occurs:

  • Allocate two new buckets in the pool (B0, B1).
  • Redistribute records and update the backing pointer array (retained for correctness).
  • Enqueue remap requests describing the necessary PTE rewiring for affected directory entries onto a lock-free FIFO.
  • An asynchronous mapper thread periodically (every 25 ms) drains remap requests and synchronizes the page table via mmap(MAP_FIXED).
  • Upon completion, the shortcut_version is incremented, and lookups can resume via the shortcut.

Deletion and Directory Shrinkage

Analogously, merges and directory shrinkage events enqueue remap requests and update metadata. Directory collapse triggers full remapping of affected virtual regions.

3. Hardware and Operating System Integration

The architecture leverages the underlying memory management unit (MMU), translation lookaside buffer (TLB), and page-table walk logic. A directory slot access translates into a TLB probe; on miss, a hardware page walker steps through PML4→PDPT→PD→PT→PTE, then fills the TLB. By compressing what would be 2–3 software-managed pointer hops into a single (potentially cached) translation, the architecture:

  • Reduces the number of page-table walks per lookup to one
  • Lowers L1-TLB pressure, due to fewer distinct inner-node pages being accessed
  • Allows hardware prefetchers operating on the page table to be more effective

Empirical results for 10⁷ random lookups (over 2^22 slots) show the shortcut variant halves the worst-case page-walk cost relative to pointer-based implementations. However, high "fan-in" scenarios, where k directory slots map to one physical bucket, can lead to TLB thrashing, especially when k ≫ TLB size; the empirical crossover occurred at fan-in ≈ 16, after which pointer-based indexes recovered performance parity.

4. Empirical Performance Characteristics

The performance advantages of Byte-Offset Indexing are quantified both analytically and through microbenchmarks.

Analytical Model

Let h_ptr be the number of pointer hops per lookup. Lookup latency is modeled as:

L_ptr ≈ h_ptr · C_walk + C_mem_scan

With the shortcut, h_sc = 1:

L_sc ≈ 1 · C_walk + C_mem_scan

The speedup is L_ptr / L_sc; it grows with the reduction in indirection and approaches h_ptr as the walk cost C_walk dominates the shared scan term C_mem_scan.

Measured Results

Benchmark                              Pointer-based   Shortcut                                                  Linear Probe Hash Table
Inner-node traversal (μs)              22.6            16.5–16.7                                                 —
Extendible hashing lookups (M/s)       60              100                                                       120
Insert cost (100M ops, 35% resizes)    baseline        ≈8% slower than pointer-based                             —
Hybrid (1% insert, 99% lookup)         —               falls back to pointer on split; shortcut after catch-up   —

Once warmed, shortcut-based inner-node traversals are approximately twice as fast as the conventional approach. In extendible hashing, shortcut mode attains 100 million lookups per second, compared to 60 million for pointer-based extendible hashing. Maintaining shortcuts during inserts was only ≈8% slower than the pointer baseline, because PTE remaps run asynchronously in the background. Under hybrid workloads with interleaved maintenance, shortcut-EH briefly falls back to pointer mode during splits until remappings complete.

5. Correctness, Consistency, and Limitations

Consistency and Recovery

Because the OS page table is ephemeral, process failures result in loss of all custom mappings. To maintain crash-recovery safety, a ground-truth pointer-based directory is retained alongside shortcuts; upon restart, the system can either reconstruct page-table shortcuts from this directory or revert to the pointer path. Lookup operations must check the shortcut_version to ensure index/table consistency with mapping state; version mismatches trigger the safe pointer traversal.

Concurrency Control and TLB Shootdowns

All changes to virtual address mappings require TLB shootdowns, imposing systemic performance costs if performed on the hot path. By batching and offloading remap operations to a background mapper thread, these costs are isolated from lookup and insert operations.

Portability and Security

No kernel modifications, nonstandard PTE fields, or elevated OS privileges are necessary. The approach relies solely on mainstream Linux APIs: memfd_create, mmap with MAP_SHARED|MAP_FIXED, and ftruncate. On alternative UNIX systems, similar memory-backed file APIs can be employed. As all frames are allocated from a private memfd, user-space isolation is preserved. The design ensures that overlapping MAP_FIXED calls do not corrupt mappings.

6. Architectural Significance and Application Scope

By encoding index indirection within the page table and offloading traversal to hardware, the Byte-Offset Indexing Architecture exposes the OS’s virtual memory radix tree as a hardware-accelerated index node. This effectively collapses software-managed indirection levels into 4 KB-granular, pointerless hardware constructs while maintaining data integrity and transactional capability via a parallel pointer-based shadow directory. Applied to extendible hashing, this results in near-halving of directory-lookup latency (100M vs. 60M lookups/sec) relative to traditional pointers, with only modest background maintenance cost for remapping. This approach demonstrates the viability of actively integrating virtual memory management into high-performance database index designs (Schuhknecht, 2023).
