
Relic Parallel Framework: SMT Fine-Grained Parallelism

Updated 7 September 2025
  • Relic Parallel Framework is a minimalist user-space library designed for fine-grained task parallelism on client-class SMT cores using a strict single-producer single-consumer model.
  • It employs a fixed-capacity, lock-free SPSC queue and busy-wait loops to minimize scheduling and synchronization overhead, achieving an average speedup of 42.1% over serial execution.
  • Its design confines execution to the two logical threads of a single physical core, with task submission restricted to the main thread, ensuring low-latency dispatch but limiting scalability to simple, non-recursive task patterns.

The Relic Parallel Framework is a minimalist, software-only parallel programming solution engineered to enable extremely fine-grained task parallelism on simultaneous multithreading (SMT) CPU cores. Distinct from general-purpose task-parallel programming systems designed for multicore or manycore servers, Relic explicitly targets client-class systems with two logical threads per physical core, emphasizing low-overhead, tightly controlled intra-core parallel execution (Los et al., 2 Oct 2024).

1. Design Principles and Operational Model

Relic is architected around the principle of reducing scheduling and synchronization overhead to the minimum necessary for situations where tasks complete in a few microseconds and only two hardware threads (SMT contexts) are available. Unlike conventional frameworks that implement elaborate work-stealing, hierarchical scheduling, or multi-queue task systems to support scalable parallelism across many cores, Relic specializes for a single physical core with SMT, mapping exactly one thread as the "main" (producer) and the other as an "assistant" (consumer). This role partitioning eliminates concurrency within task submission and within task execution, yielding a strictly single-producer single-consumer (SPSC) interaction pattern.

Tasks are communicated between threads via a fixed-capacity, lock-free SPSC queue (using Boost’s SPSC queue, capacity set to 128 in the reference evaluation). The main thread exclusively enqueues new tasks—function pointers and arguments—while the assistant thread exclusively dequeues and executes them. Busy-waiting (with processor pause instructions) is utilized to minimize latency both when the main thread waits for completion and when the assistant polls for available work.

This operational model removes the need for work-stealing, complex synchronization, or load balancing heuristics, targeting the operational niche where the generality and overhead of broader frameworks would otherwise dominate the cost of parallel execution.

2. Performance Characteristics and Comparative Evaluation

To evaluate Relic, a suite of fine-grained application kernels was executed on systems with SMT support, focusing solely on using both logical threads of a single physical core. The kernels included:

  • Graph algorithms: betweenness centrality (BC), breadth-first search (BFS), connected components (CC, Shiloach–Vishkin), PageRank (PR), single-source shortest paths (SSSP), and triangle counting (TC).
  • I/O-bound kernel: JSON parsing via RapidJSON.

The critical metric was speedup relative to serial execution and relative improvement over established frameworks (LLVM OpenMP, GNU OpenMP, Intel OpenMP, X-OpenMP, oneTBB, Taskflow, OpenCilk) (Los et al., 2 Oct 2024).

| Framework | ΔPerformance vs Relic (%) |
| --- | --- |
| LLVM OpenMP | –19.1 |
| GNU OpenMP | –31.0 |
| Intel OpenMP | –20.2 |
| X-OpenMP | –33.2 |
| oneTBB | –30.1 |
| Taskflow | –23.0 |
| OpenCilk | –21.4 |

Relic achieved an average speedup of 42.1% over serial baselines in these fine-grained scenarios. The performance advantage is attributed to Relic’s specialized queue, avoidance of multiproducer contention, and minimal thread management overhead. The framework’s efficiency diminishes when tasks exhibit longer idle periods or are unsuited to SPSC dispatch.

3. Technical Framework and Synchronization Scheme

Relic is implemented as a pure user-space library with the following defining features:

  • SPSC Lock-Free Queue: Task transfer between threads is realized without locking, using a queue with statically bound producer (main) and consumer (assistant) (Los et al., 2 Oct 2024). Each entry encapsulates the function pointer and argument bundle.
  • Strict Thread Role Partitioning: Only the main thread can submit tasks, and only the assistant may execute. Both must be pinned to the same physical core for optimal performance. The framework does not set or manage affinity.
  • Busy-Waiting with Pause: Both threads employ a busy-loop using processor pause instructions to minimize context switch and wake-up latency.
  • Explicit Task Waiting: After submission, the main thread invokes wait(), which spins until the assistant signals task completion.
  • No Recursive Task Generation: Assistant tasks cannot submit further tasks; all scheduling is globally serialized through the main thread. This restriction is a direct trade-off to eliminate multiproducer coordination overhead.
  • Thread Activity Hints: Developers may use wake_up_hint() and sleep_hint() to inform the runtime about periods of assistant thread activity or dormancy, enabling external adjustment for energy use or responsiveness without integrated thread sleeping logic.

A relevant pseudocode outline for the assistant's main loop is:

while true:
    while SPSCQueue.ReadAvailable() == false:
        Pause()
    TaskRoutine, Args = SPSCQueue.Front()
    TaskRoutine(Args)
    SPSCQueue.Pop()

This technical scheme is optimized for the lowest overhead in dispatch and task transition under fine-grained, highly regular workloads.

4. Application Domains and Empirical Use Cases

The Relic framework’s applicability is demonstrated on latency-sensitive, micro-tasked workloads, most common in client-class machine environments:

  • Graph algorithms: All tested kernels (BC, BFS, CC, PR, SSSP, TC) are dispatched as microtasks, each bound to one logical thread, with execution times in the 0.4–6.4 μs range per task. By splitting such compute-bound and lightweight jobs, measurable speedup is obtained even when absolute runtimes are extremely short.
  • JSON parsing: Two parallel JSON parsing tasks, handled by RapidJSON, highlight Relic’s benefits even for I/O-dominated workloads executed in parallel by SMT threads.

These demonstrations confirm Relic’s focus on practical, fine-grained parallelism within strict SMT execution budgets.

5. Limitations and Implementation Constraints

Relic’s architectural optimizations result in inherent functional constraints:

  • Lack of Recursion and General Task Graphs: No support for recursive or nested task submission, restricting usage to flat task lists/batches submitted by the main thread only.
  • Affinity Dependency: Maximal performance requires both threads be explicitly pinned to the same physical core; divergent scheduling by the operating system reduces or negates benefits. CPU pinning must be handled externally.
  • Busy-Waiting Cost: While spin-waiting is optimal for very brief idle times, it is inefficient (wasted cycles, power) for long waits. The framework’s hints permit hybrid strategies, but adaptive sleeping/waking is not natively included.
  • Single-Core Scope: The design does not scale to scenarios with more than two concurrent threads or across multiple physical cores. The SPSC mechanism is unsuitable for task graphs requiring dynamic load balancing or complex dependencies.

This suggests that Relic’s design choices are intentionally niche, prioritizing absolute task dispatch throughput and minimized latency on two-logical-thread systems at the expense of flexibility and scalability.

6. Comparative Context and Research Significance

Relic’s operational focus and scheduling philosophy stand in deliberate contrast to broad-purpose parallel frameworks:

| Framework | Scheduling/Load-Balancing Model | Task Graph Flexibility | Per-Task Overhead |
| --- | --- | --- | --- |
| Relic | SPSC, strict roles | Flat only | Minimal |
| OpenMP (all flavors) | Thread pool / worksharing | Arbitrary (spawn, join) | Moderate–high |
| oneTBB, Taskflow | Dynamic / work-stealing | Arbitrary DAGs | Moderate–high |
| OpenCilk | Work-stealing | Arbitrary DAGs | Moderate–high |

Whereas general frameworks must amortize higher thread-management and dispatch costs to justify their design, Relic achieves gains precisely because such generality is unneeded—and costly—when targeting the fine-grained SMT domain. Performance comparisons across representative algorithms and workloads, with empirical gains of roughly 19–33% over established frameworks, validate this design hypothesis (Los et al., 2 Oct 2024). A plausible implication is that further specialization for hardware-defined threading contexts may yield future advances in minimal-overhead parallel libraries, particularly as client devices demand increasing throughput under stringent resource budgets.

7. Prospective Developments and Directions

Potential future enhancements for Relic include:

  • Automated affinity/pinning integration to guarantee threads reside on the target core.
  • Adaptive waiting strategies to dynamically trade waiting efficiency for power preservation.
  • Exploration of lightweight extensions to enable limited forms of nested tasking or more complex dependency management without negating the low-overhead SPSC dispatch model.
  • Investigation into generalizing some of Relic’s design patterns for environments featuring more than two hardware threads while retaining strict overhead minimization.

These directions may further delineate the boundaries of when hyper-specialized task frameworks are preferable to general-purpose counterparts, especially in edge or client-oriented deployments.


In summary, the Relic Parallel Framework is an SMT-focused, SPSC-parallel, minimalist library for fine-grained task parallelism on client-class systems. Its specialization yields significant performance improvements for workloads decomposable into microtasks dispatched via strictly partitioned threads, as substantiated by empirical benchmarking against state-of-the-art general frameworks (Los et al., 2 Oct 2024). The approach exemplifies a contemporary trend in parallel computing: domain-specific minimization of coordination overhead, targeting tightly constrained, latency-critical environments.
