
PROMPT: A Fast and Extensible Memory Profiling Framework (2311.03263v1)

Published 6 Nov 2023 in cs.PF and cs.PL

Abstract: Memory profiling captures programs' dynamic memory behavior, assisting programmers in debugging, tuning, and enabling advanced compiler optimizations like speculation-based automatic parallelization. As each use case demands its unique program trace summary, various memory profiler types have been developed. Yet, designing practical memory profilers often requires extensive compiler expertise, adeptness in program optimization, and significant implementation efforts. This often results in a void where aspirations for fast and robust profilers remain unfulfilled. To bridge this gap, this paper presents PROMPT, a pioneering framework for streamlined development of fast memory profilers. With it, developers only need to specify profiling events and define the core profiling logic, bypassing the complexities of custom instrumentation and intricate memory profiling components and optimizations. Two state-of-the-art memory profilers were ported with PROMPT while all features preserved. By focusing on the core profiling logic, the code was reduced by more than 65% and the profiling speed was improved by 5.3x and 7.1x respectively. To further underscore PROMPT's impact, a tailored memory profiling workflow was constructed for a sophisticated compiler optimization client. In just 570 lines of code, this redesigned workflow satisfies the client's memory profiling needs while achieving more than 90% reduction in profiling time and improved robustness compared to the original profilers.

Summary

  • The paper presents a factorized approach to memory profiling by splitting instrumentation (frontend) and processing (backend) to simplify development and improve speed.
  • It employs LLVM IR-based event generation, a high-throughput event queue, and efficient shadow memory, achieving speedups of 5.3x and 7.1x over the two ported profilers.
  • The framework reduces code complexity and overhead, enabling robust speculative parallelization and significantly lowering profiling slowdowns in real-world applications like Perspective.

The paper "PROMPT: A Fast and Extensible Memory Profiling Framework" (2311.03263) introduces a novel framework designed to significantly simplify the development and accelerate the execution of memory profilers. The core motivation behind PROMPT is to address the challenges associated with building practical, fast, and robust memory profilers, which typically require deep compiler expertise and extensive implementation effort. These difficulties have historically limited the adoption of profiling-guided optimizations, such as speculative automatic parallelization.

PROMPT proposes a factorization of the memory profiling process into two distinct phases: a frontend responsible for generating profiling events and a backend that processes these events to produce the desired profiles. This separation is key to allowing developers to focus primarily on the core profiling logic rather than the complexities of instrumentation and low-level system interactions. The two phases communicate via an event queue.

Beyond separation, PROMPT generalizes common components and optimizations found across various memory profilers.

Design and Implementation:

  1. Frontend: The frontend instruments the program at the LLVM Intermediate Representation (IR) level. It standardizes and categorizes profiling events, including:
    • Memory Access: Load, Store, Pointer Creation (with instruction ID, address, value, size/type).
    • Allocation: Heap/Stack Allocation/Deallocation, Global Initialization (with instruction ID/object ID, address, size).
    • Context: Function Entry/Exit, Loop Invocation/Iteration/Exit, Program Start/Terminate (with Function ID/Loop ID/Process ID).
    Developers specify which events and associated data they need using a simple configuration, and PROMPT automatically instruments the program to generate these events. Adding new event types is possible but requires understanding LLVM IR.
  2. Event Queue: A high-throughput, Single-Producer-Multiple-Consumer (SPMC) queue connects the frontend and backend. Designed to prioritize throughput over latency, it utilizes a ping-pong buffer and streaming writes (on x86) to minimize the frontend's overhead and avoid polluting the cache. The large buffer size amortizes communication costs and smooths out event bursts.
  3. Backend Components: PROMPT provides a library of generalized components for the backend:
    • Backend Driver: Consumes events from the queue and dispatches them to the appropriate callback functions within the profiling module. It also manages parallel workers if data parallelism is enabled.
    • Generic Shadow Memory: Allows mapping memory addresses to arbitrary metadata. It uses a direct mapping scheme (shift and mask) for efficient address translation and handles allocation/deallocation automatically.
    • Generic Context Manager: Tracks the current program context (call stack, loop nest). It interacts with profiling modules to encode/decode contexts and supports caching. Each backend thread gets its own context manager to avoid synchronization overhead.
    • High-throughput Data Structures: Provides specialized containers (like htmap_count, htmap_constant, htmap_sum, htmap_set) designed for common profiling tasks (counting, checking constancy, summing, collecting sets). These containers implement a parallel reduction strategy: insertions are buffered locally, and reductions to the main map are performed in parallel by a thread pool when buffers fill or results are needed. This leverages the observation that insertions are frequent but result reads are rare during profiling.
  4. Optimizations:
    • Specialization: PROMPT automatically removes instrumentation for events or data not requested by the profiling module. This specialization is handled at link-time by generating empty callback functions or removing unused arguments, which the compiler then optimizes away via Link-Time Optimization (LTO). This avoids complex LLVM pass configurations. Evaluation shows significant event reduction (17% to 72%).
    • Data Parallelism: PROMPT provides a wrapper and infrastructure to easily parallelize the backend logic. It supports address-based parallelism (different workers handle different address ranges) and can be extended to other types. Profiling modules can use the wrapper to mark operations to be executed by the worker responsible for a specific data item (e.g., instruction ID, address) and provide a merge function for results. The framework manages the parallel workers.
  5. Implementing a Profiler: A developer creates a new profiler by defining the required events in a YAML specification and implementing the core logic in a C++ class that utilizes PROMPT's backend components and overrides callback functions (e.g., load, store, finish).

Evaluation and Practical Impact:

The evaluation demonstrates PROMPT's effectiveness in terms of extensibility, speed, and applicability.

  1. Extensibility:
    • Porting two state-of-the-art LLVM-based profilers, LAMP and the Privateer profiler, to PROMPT resulted in a significant reduction in code size (around 65-70%). This highlights that developers only need to implement the core logic (which may increase slightly due to interfacing with the framework) while PROMPT provides instrumentation, event generation, shadow memory, and queuing logic.
    • Adapting a basic memory dependence profiler to variants (tracking count, distance, context, all dependence types) required only a few lines of code delta (1-16 LOC), showcasing the ease of adaptation.
  2. Speed:
    • On SPEC CPU 2017 benchmarks, PROMPT versions of LAMP and Privateer significantly outperformed their originals. The ported LAMP was 5.3x faster (using 16 backend threads), benefiting from both pipeline parallelism (frontend/backend separation) and data parallelism. The ported Privateer was 7.1x faster (even without data parallelism), primarily due to PROMPT's high-throughput event queue resolving a frontend bottleneck in the original profiler.
    • Comparing PROMPT's slowdowns for various memory dependence profiler variants against reported numbers from prior work (implemented with diverse technologies) showed that PROMPT is competitive or superior, achieving geomean slowdowns between 7.5x and 13.1x.
  3. Perspective Case Study: This is a key real-world application demonstrating PROMPT's impact. Perspective, a speculative automatic parallelization system, relies on memory profiling (using LAMP and Privateer). By redesigning the profiling workflow with PROMPT, tailoring the profilers to exactly match Perspective's four specific needs (memory dependence, value pattern, object lifetime, points-to), the total code for the four modules was only 570 lines. This redesigned workflow reduced the critical-path profiling slowdown for Perspective's benchmarks from 217.2x to 5.9x and the summed profiling slowdown across benchmarks from 201.2x to 15.3x (a >90% reduction). Furthermore, the PROMPT-based profilers showed improved robustness, running on more complex benchmarks where the original Privateer profiler failed or timed out, making Perspective more broadly applicable.
  4. Performance Analysis of Optimizations: Evaluation showed the individual contributions of PROMPT's techniques:
    • Specialization: Reduced geomean slowdown by 51% by removing unnecessary instrumentation.
    • High-throughput Queue: Further reduced slowdown by 18%.
    • Data Parallelism Wrapper: Brought a 57% slowdown reduction (peaking around 16 workers on the test machine).
    • High-throughput Data Structures: Added another 8% improvement by accelerating reducible operations.
    In component microbenchmarks, PROMPT's queue was at least 81% faster than Boost lock-free queues and the Liberty queue, and the high-throughput maps outperformed standard C++ maps and even highly optimized third-party hash maps (like phmap::flat_hash_map) when parallelism was used.
  5. Overhead:
    • Memory overhead: Constant size for the frontend queue buffer. Backend memory includes code, data structures for results, and auxiliary structures like shadow memory. Shadow memory is a significant component, proportional to the program's heap size and the metadata ratio. Peak backend memory overhead for the memory dependence profiler ranged from 20% to 9.7x (excluding the fixed queue buffer).
    • Binary size: Instrumented binaries were 17% to 231% larger than the original.

Discussion:

PROMPT is particularly well-suited for speculative optimizations but can support other memory profiling tasks like prefetching or security analysis. It currently focuses on single-threaded programs, a limitation for general multi-threaded profiling, though the decoupled design could facilitate future extension. Profiling programs without source code is partially supported (external library calls are noted), but full precision requires compile-time source availability. While designed for memory events, the framework's factorization might inspire similar approaches for other profiling types, though optimizations are tailored for memory profiling characteristics (high event volume, latency insensitivity).

In conclusion, PROMPT successfully addresses the practical hurdles in memory profiling by separating concerns, generalizing components, and providing optimized implementations of common techniques like specialization, high-throughput queuing, data parallelism, and parallel data structures. Its demonstrated ability to drastically reduce development effort, accelerate profiling, and enhance robustness makes it a valuable framework for developing and applying memory profiling techniques in research and real-world applications.