Hot Entry Profiling for System Optimization
- Hot entry profiling is a systematic method for identifying performance bottlenecks at fine-grained entry points (functions, submodules, loops) in distributed microservices and FPGA systems.
- It utilizes frameworks like Atys and RealProbe to capture execution frequency and cycle counts with high accuracy while minimizing runtime overhead.
- The approach enables actionable optimizations in latency, throughput, and cost efficiency through fine-grained data aggregation and adaptive sampling.
Hot entry profiling, also referred to as hotspot function or hot‐entry point profiling, is a systematic approach for identifying performance bottlenecks at granular entry points—such as functions, submodules, or loops—within distributed cloud microservices or hardware-accelerated systems. By quantifying execution frequency or resource consumption at these entry points, hot entry profiling enables targeted optimization to improve latency, throughput, and cost efficiency at scale. Two representative frameworks for hot entry profiling in modern systems are Atys, targeting large-scale cloud microservices, and RealProbe, designed for high-level synthesis (HLS) FPGA workflows. Both frameworks offer automated, fine-grained profiling methods while minimizing runtime overhead and preserving production fidelity (Sun et al., 18 Jun 2025, Kim et al., 4 Apr 2025).
1. Architectural Approaches for Hot Entry Profiling
Hot entry profiling architectures are tailored to the system type: distributed cloud microservices or FPGA-based HLS designs.
1.1 Cloud Microservices: Atys
Atys employs a distributed, three-tier architecture:
- Local Profiler (per node): Integrates language-specific, PMU-based sampling kernels (e.g., async-profiler for Java, py-spy for Python, Perf for native code) and a coordinating agent. The agent performs local aggregation, pruning, and metric exposure in Prometheus format.
- Controller: Central entity receiving user configuration (targets, rates, thresholds), dispatching commands to local agents, and orchestrating global flamegraph aggregation.
- Prometheus Server: Scrapes metrics from agents, stores them as time series, and supports querying and visualization (via PromQL and dashboards).
The agent's dynamic kernel selection—based on process image inspection—enables fully language-agnostic operation, automatically choosing sampling tools and instrumentation methods without mutating target code or requiring restarts.
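As a rough illustration of dynamic kernel selection, the sketch below maps a process's executable name to one of the sampling tools named above. The function names and the file-name heuristic are assumptions made for this example, not Atys's actual inspection logic; on Linux, the path could for instance be obtained by resolving `/proc/<pid>/exe`.

```python
import os

# Hypothetical mapping from runtime family to the sampling kernels named in the text.
KERNEL_BY_RUNTIME = {
    "java": "async-profiler",
    "python": "py-spy",
    "native": "perf",
}


def detect_runtime(exe_path: str) -> str:
    """Guess the runtime family from the executable's file name (illustrative heuristic)."""
    name = os.path.basename(exe_path).lower()
    if name.startswith("java"):
        return "java"
    if name.startswith("python"):
        return "python"
    # Anything else is treated as native code in this sketch.
    return "native"


def select_kernel(exe_path: str) -> str:
    """Return the name of the sampling tool to attach to the given process image."""
    return KERNEL_BY_RUNTIME[detect_runtime(exe_path)]
```

A real agent would likely inspect more than the file name (loaded libraries, ELF metadata, container labels); the point here is only that selection happens from the outside, without touching the target code.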
1.2 FPGA HLS: RealProbe
RealProbe extends the Vitis HLS compiler workflow:
- Frontend Instrumentation: Recognizes `#pragma HLS RealProbe` at function or loop boundaries via Clang/LLVM AST and IR tagging.
- RTL Augmentation: Associates each annotated entry point with corresponding FSM control signals (e.g., `ap_start`, `ap_done`), exporting them as new top-level ports.
- Profiling Hardware: Instantiates independent RealProbe IP capturing timing for each entry point via a global cycle counter and lightweight event-triggered counters, offloading data only as needed to host DRAM. This non-intrusive block is fully decoupled from the kernel datapath and applies no back-pressure on main memory interfaces.
2. Data Aggregation and Result Construction
Data aggregation techniques for hot entry profiling focus on scalability and minimal communication overhead.
2.1 Two-Level Aggregation in Cloud Microservices
Atys employs a two-level aggregation strategy to construct global flamegraphs:
- Local Aggregation: Each agent deduplicates and counts sampled stack traces, generating a compact local flamegraph mapping stack signatures to counts. This reduces the data volume by approximately two orders of magnitude (e.g., 1 GB raw traces to <10 MB aggregated).
- Cluster Aggregation: The controller merges the local flamegraphs ($F_1, \dots, F_n$) from all $n$ nodes into a global flamegraph $G$, where for each unique stack $s$,
$$G(s) = \sum_{i=1}^{n} F_i(s),$$
employing tree-based merge algorithms. This hierarchical aggregation reduces inter-node traffic by up to 99%, distributing computation and controlling bottlenecks.
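The two levels can be sketched in a few lines, assuming a flamegraph is represented as a mapping from stack signature to sample count and that the cluster merge sums per-node counts. The function names and sample stacks are illustrative, and the flat loop below stands in for the tree-based merge mentioned above.

```python
from collections import Counter
from typing import Iterable, Tuple

Stack = Tuple[str, ...]  # call stack as a tuple of frames, root first


def local_aggregate(samples: Iterable[Stack]) -> Counter:
    """Level 1: deduplicate and count raw stack samples on one node."""
    return Counter(samples)


def cluster_aggregate(local_graphs: Iterable[Counter]) -> Counter:
    """Level 2: the global count of a stack is the sum of its per-node counts."""
    merged = Counter()
    for graph in local_graphs:
        merged.update(graph)  # Counter.update adds counts, it does not overwrite
    return merged


# Two nodes with overlapping stacks (made-up data).
node_a = local_aggregate([
    ("main", "handle", "parse"),
    ("main", "handle", "parse"),
    ("main", "gc"),
])
node_b = local_aggregate([
    ("main", "handle", "parse"),
    ("main", "handle", "db"),
])
global_graph = cluster_aggregate([node_a, node_b])
```

Only the compact per-node counters cross the network, which is where the traffic reduction comes from: repeated stacks are shipped once with a count instead of once per sample.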
2.2 Hot Entry Cycle Counting in HLS Designs
RealProbe, by wiring control signals and sampling a cycle counter on relevant transitions, logs active spans for every function, submodule, and loop annotated by #pragma HLS RealProbe. The cumulative cycle counts per entry over all invocations are extracted post-run, enabling direct ranking of hot entries by total runtime.
3. Profiling Efficiency and Overhead Minimization
Reducing profiling overhead while preserving accuracy is central for large-scale or hardware-constrained deployments.
3.1 Function Selective Pruning (FSP) in Atys
Atys's Function Selective Pruning (FSP) focuses aggregation on high-activity threads:
- Algorithm: Samples are grouped by thread, threads are sorted by activity, and only samples from the most active threads that together cover P% of all samples (e.g., P = 99) are aggregated. This leverages the empirical finding that a small minority of threads typically accounts for nearly all actionable hot traces in microservices.
- Effectiveness: At P=99%, FSP achieves a 6.8% reduction in aggregation time with only 0.58% mean absolute percentage error (MAPE) for the top-50 functions.
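A minimal sketch of the pruning step follows, assuming each sample carries a thread id and that "activity" is simply the number of samples per thread. The function name, the sample layout, and the stopping rule are assumptions for illustration rather than the published algorithm.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Sample = Tuple[int, Tuple[str, ...]]  # (thread id, call stack)


def selective_prune(samples: Iterable[Sample], coverage: float = 0.99) -> Counter:
    """Aggregate stacks only from the busiest threads covering `coverage` of all samples."""
    by_thread = defaultdict(list)
    for tid, stack in samples:
        by_thread[tid].append(stack)

    total = sum(len(stacks) for stacks in by_thread.values())
    # Most active threads first.
    ranked = sorted(by_thread.values(), key=len, reverse=True)

    aggregated = Counter()
    covered = 0
    for stacks in ranked:
        if covered >= coverage * total:
            break  # remaining low-activity threads are pruned
        aggregated.update(stacks)
        covered += len(stacks)
    return aggregated


# Made-up workload: one busy thread, one moderate thread, one nearly idle thread.
samples = (
    [(1, ("main", "work"))] * 90
    + [(2, ("main", "idle"))] * 9
    + [(3, ("main", "log"))] * 1
)
pruned = selective_prune(samples, coverage=0.95)
```

With these numbers, threads 1 and 2 reach the 95% target, so the single sample from thread 3 is never aggregated. The saving grows with the number of low-activity threads that can be skipped.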
3.2 RealProbe Resource Utilization
RealProbe’s design ensures low-impact hot entry profiling in hardware:
- Resource Overheads: Overhead is 16.98% in LUTs, 43.15% in FFs, and negligible (0%) in BRAMs, with cycle count accuracy of 100%. Analytical cost models and design space exploration (DSE) allow pre-implementation tuning of N (number of profiled modules) and Dᵢ (buffer depth per module).
| Framework | LUT Overhead | FF Overhead | BRAM Overhead | Profiling Error | Aggregation Efficiency |
|---|---|---|---|---|---|
| Atys | N/A | N/A | N/A | ~0.58% MAPE | 6.8% faster w/ FSP |
| RealProbe | 16.98% | 43.15% | 0% | 0% | N/A |
3.3 Frequency Dynamic Adjustment (FDA) in Atys
Atys employs adaptive sampling:
- Jensen-Shannon Divergence: Computes divergence of hot entry distributions between sampling windows, increasing sampling rate when hotspots shift (divergence > θ), decreasing when stable. Control follows a “bang-bang” approach with hysteresis for stability.
- Results: FDA yields an 87.6% reduction in CPU overhead compared to persistent high-rate sampling while maintaining comparable MSE.
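The control loop can be sketched as follows, assuming hot entry distributions are given as normalized function-to-share mappings. The base-2 Jensen-Shannon divergence is standard; the two-level rate scheme and the `margin` parameter that defines the hysteresis band are assumptions for this example, not Atys's actual parameters.

```python
import math
from typing import Dict


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence (base 2, in [0, 1]) between two discrete distributions."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of entries.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: Dict[str, float]) -> float:
        # m[k] > 0 whenever a[k] > 0, so the ratio is always defined.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def next_rate(rate: float, divergence: float, theta: float,
              low_hz: float, high_hz: float, margin: float = 0.5) -> float:
    """Bang-bang rate control with a hysteresis band [theta * margin, theta]."""
    if divergence > theta:
        return high_hz  # hotspots shifted: sample aggressively
    if divergence < theta * margin:
        return low_hz   # distribution stable: back off
    return rate         # inside the band: keep the current rate
```

For example, `next_rate(100, 0.08, 0.1, 10, 100)` keeps the high rate because the divergence sits inside the band, which avoids oscillating between rates when the divergence hovers near the threshold.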
4. Practical Implementation: Workflow, Instrumentation, and Data Usage
End-to-end workflows highlight the integration of hot entry profiling in production systems.
4.1 Atys Operational Flow
- Profile Configuration: YAML/TOML config specifies targets, language hints, sampling rate f₀, FSP p, and FDA parameters (λ, θ).
- Controller Launch: Parses config, instructs agents to initialize PMU samplers in containers; no process restart required.
- Stack Sampling: Profiling kernels periodically capture call stacks via hardware interrupts.
- Local Processing: Agents aggregate samples, prune via FSP, and expose Prometheus-formatted metrics.
- Cluster Profiling: Flamegraphs merged for global analysis; dashboard/UI for inspection.
- Deployment: Typically installed via Kubernetes DaemonSet or similar automation; scalable to thousands of instances.
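To make the "expose Prometheus-formatted metrics" step concrete, the sketch below renders per-function sample counts in the Prometheus text exposition format. The metric name `hot_entry_samples_total` and the `function` label are hypothetical, chosen for this example; the source does not specify Atys's metric schema.

```python
from collections import Counter


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(counts: Counter, metric: str = "hot_entry_samples_total") -> str:
    """Render function -> sample count as Prometheus text exposition lines."""
    lines = [f"# TYPE {metric} counter"]
    for function, n in sorted(counts.items()):
        lines.append(f'{metric}{{function="{escape_label(function)}"}} {n}')
    return "\n".join(lines) + "\n"


text = to_prometheus(Counter({"parse": 3, "db_query": 1}))
```

In practice a client library would usually handle formatting and serving; the hand-written version is shown only to make the wire format visible.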
4.2 RealProbe Instrumentation and Profiling
- Annotate Code: Precede functions/loops in C/C++ HLS code with `#pragma HLS RealProbe`.
- HLS & RTL Synthesis: Control signals extracted, RealProbe IP wired in.
- On-Device Profiling: Each module’s active cycles are logged with timestamped counters; host collects profiling data post-run.
- Log Interpretation: Output logs are parsed to sum up cycles per module/submodule/loop, hot entries are ranked, and bottlenecks identified.
Example Code and Output
```cpp
// Forward declaration so kernel() can call compute() before its definition.
int compute(int a, int b);

#pragma HLS RealProbe
void kernel(int *A, int *B, int N) {
#pragma HLS pipeline
    for (int i = 0; i < N; ++i) {
        // Each annotated statement is reported as its own entry in the log.
#pragma HLS RealProbe
        int x = A[i];
#pragma HLS RealProbe
        int y = B[i];
#pragma HLS RealProbe
        int z = compute(x, y);
        A[i] = z;
    }
}

#pragma HLS RealProbe
int compute(int a, int b) {
    return a * b + a - b;
}
```
Resulting log structure:
```
module, iter, start, end
kernel, 0, 1000, 1040
load A, 0, 1000, 1005
load B, 0, 1005, 1010
compute, 0, 1010, 1040
...
```
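The log interpretation step can be sketched as below, assuming the comma-separated layout shown above and taking `end - start` as the active cycles of one invocation. The function name is illustrative, and the embedded log reuses the example rows rather than real measurements.

```python
import csv
from collections import defaultdict
from io import StringIO

# Example rows copied from the log structure above.
LOG = """module, iter, start, end
kernel, 0, 1000, 1040
load A, 0, 1000, 1005
load B, 0, 1005, 1010
compute, 0, 1010, 1040
"""


def rank_hot_entries(log_text: str):
    """Sum active cycles per module over all invocations and rank descending."""
    # skipinitialspace drops the blank after each comma, in the header too.
    reader = csv.DictReader(StringIO(log_text), skipinitialspace=True)
    cycles = defaultdict(int)
    for row in reader:
        cycles[row["module"]] += int(row["end"]) - int(row["start"])
    return sorted(cycles.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_hot_entries(LOG)
```

Totals are inclusive here: the `kernel` span contains its nested loads and the `compute` call, so a parent always ranks at least as high as any of its children. Among the leaf entries, `compute` dominates and is the natural optimization target.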
5. Quantitative Results and Scalability
Empirical evaluation of both frameworks demonstrates accuracy and scalability.
- Atys: At p=99%, FSP reduces aggregation time by 6.8% with 0.58% MAPE for key hotspots. FDA cuts CPU cost by 87.6% at comparable error. The Prometheus server remains efficient, using <250 MB RAM and <6% CPU while scraping 1,000 instances.
- RealProbe: Tracks up to 285 modules with 100% cycle-count accuracy (vs. ILA), 16.98% LUT and 43.15% FF overhead, and <1% DRAM bandwidth even in memory-bound kernels. Maximum-frequency penalty is modest (<5.5% in BRAM mode; <2% in register-only).
6. Deployment Considerations and Limitations
6.1 Atys
- Initialization: An initial sweep calibrates FSP and FDA parameters to meet user-defined error budgets.
- Assumptions: FDA cost amortization presumes workload periodicity; highly non-stationary applications may not yield savings. FSP presumes low-activity threads are non-critical.
- Limitations: Requires appropriate privileges for PMU and agent loading; profile data is subject to approximate sampling error (typically below 1% due to chosen techniques).
- Latency: Prometheus-based metric collection introduces a 15–30 s latency.
6.2 RealProbe
- Intrusiveness: Profiling logic is injected via automation without manual RTL editing.
- Resource Tradeoffs: User-driven DSE is available to balance profiling depth, resource, and DRAM overhead.
- Limitations: Overheads are non-zero, most notably in FFs (43.15%); BRAM resources can be substituted for LUTs/FFs as needed. Profiling remains cycle-accurate by design.
7. Broader Significance and Applications
Hot entry profiling methodologies such as those enabled by Atys (Sun et al., 18 Jun 2025) and RealProbe (Kim et al., 4 Apr 2025) provide actionable insights to both software and hardware system architects. In data-center settings, identifying latent hotspots can yield large cost savings even with sub-1% improvements, as minor inefficiencies are amortized across thousands of microservice instances. In hardware design, cycle-accurate, automated profiling of function or loop execution is essential for targeting bottlenecks and optimizing resource-constrained FPGAs. These frameworks demonstrate that—with automated adaptation, aggregation, and user-transparent instrumentation—large-scale, multidomain systems can be profiled efficiently, enabling informed optimization decisions at minimal operational cost.