An Attempt to Catch Up with JIT Compilers: The False Lead of Optimizing Inline Caches (2502.20547v1)

Published 27 Feb 2025 in cs.PL

Abstract: Context: Just-in-Time (JIT) compilers are able to specialize the code they generate according to a continuous profiling of the running programs. This gives them an advantage over Ahead-of-Time (AoT) compilers, which must choose the code to generate once and for all. Inquiry: Is it possible to improve the performance of AoT compilers by adding Dynamic Binary Modification (DBM) to the executions? Approach: We added to the hopc AoT JavaScript compiler a new DBM-based optimization of inline caches (ICs), a classical optimization dynamic languages use to implement object property accesses efficiently. Knowledge: Reducing the number of memory accesses, as the new optimization does, does not shorten execution times on contemporary architectures. Grounding: The DBM optimization we have implemented is fully operational on x86_64 architectures. We have conducted several experiments to evaluate its impact on performance and to study the reasons for the lack of acceleration. Importance: The (negative) result we present in this paper sheds new light on the best strategy for implementing dynamic languages. It shows that the days when removing instructions or memory reads always yielded a speedup are over. Nowadays, implementing sophisticated compiler optimizations is only worth the effort if the processor cannot accelerate the code by itself. This result applies to AoT compilers as well as JIT compilers.

Summary

  • The paper presents a DBM technique that dynamically modifies inline cache assembly in AoT-compiled JavaScript to reduce memory loads.
  • It details an implementation using hopc’s C code generation, Capstone disassembly, and mprotect to apply two levels of DBM optimization.
  • Experimental evaluation reveals that while memory reads are reduced, the optimization fails to boost overall execution time on modern processors.

This paper explores whether Ahead-of-Time (AoT) compilers for dynamic languages can improve performance by adding dynamic binary modification (DBM) capabilities to their runtime systems, specifically focusing on optimizing Inline Caches (ICs) in JavaScript. The authors use the hopc JavaScript-to-C compiler as their base and implement a DBM-based optimization targeting the assembly code generated by the C compiler (gcc -O3).

Problem:

Just-in-Time (JIT) compilers achieve high performance for dynamic languages like JavaScript partly due to their ability to specialize code based on runtime profiling. Inline Caches (ICs) are a key optimization used by both JIT and AoT compilers to speed up object property access. ICs cache the "shape" (hidden class) of the object accessed at a specific program point and, on subsequent accesses of objects with the same shape, directly compute the property's memory offset without a costly lookup.
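
As a rough sketch (not the paper's actual code; object_t, shape, fields, and cache_miss are hypothetical names), an IC for one property-access site can be pictured in C as follows:

    /* Minimal sketch of an inline cache for a property read "o.x". */
    typedef struct object {
        void *shape;       /* hidden class describing the object layout */
        long  fields[8];   /* property storage                          */
    } object_t;

    typedef struct { void *cached_shape; long cached_offset; } icache_t;

    long cache_miss(object_t *o, icache_t *ic);   /* slow path: full lookup */

    static icache_t ic0;                          /* one cache per access site */

    long get_x(object_t *o) {
        if (o->shape == ic0.cached_shape)         /* IC hit: shape matches     */
            return o->fields[ic0.cached_offset];  /* load at the cached offset */
        return cache_miss(o, &ic0);               /* miss: lookup, fill ic0    */
    }

On a hit, only the shape comparison and one indexed load execute; the miss path performs the full property lookup and fills cached_shape and cached_offset for subsequent accesses.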

While JIT compilers can generate highly optimized assembly for the IC "hit" path, often embedding the property offset directly as an immediate value in the load instruction, AoT compilers like hopc (which compiles to C) generate C code that typically keeps the cached offset in a global variable. When compiled by a standard C compiler, the IC hit path therefore involves several instructions and memory loads: loading the cached hidden class, then loading the cached property offset, then loading the property value at that offset. The authors hypothesize that removing these extra memory reads via DBM could make AoT-compiled code competitive with JITs in this crucial area.

Proposed Approach & Implementation:

The authors implemented a DBM mechanism within the hopc runtime. The goal is to dynamically modify the assembly code sequence corresponding to an IC hit after the first miss determines the property's actual offset.

The implementation involves several steps:

  1. Modifying C Code Generation: hopc's C backend is modified to insert unique C labels (&&IC_LBLxxx, using GCC's labels-as-values extension to take their addresses) before the memory-access code generated for an IC hit (e.g., prop = *(obj + icache.offset);). The address of this label is passed to the cache_miss function when a miss occurs.
  2. Identifying Assembly Sequences: On an IC miss for a specific label, the cache_miss routine receives the label's address. It then uses the Capstone disassembly framework to decode a window of assembly instructions starting from that address and heuristically scans for instruction patterns typical of the IC hit path (a sketch using Capstone's C API appears after this list):
    • Looking for mov instructions that read from memory.
    • Checking if the memory read uses RIP-relative addressing (common for accessing global variables like the IC structure).
    • Verifying that the memory read is accessing the cached property offset value within the icache structure.
  3. Applying Dynamic Binary Modification (DBM): Once the relevant instruction(s) are identified, they are modified in memory. Two optimization levels are explored:
    • -O1: The instruction that loads the property offset from the IC structure is replaced with an instruction that loads the offset as an immediate value.

      ; Original (from C compiled code, simplified)
      mov 0x101c(%rip), %rax    ; load the cached offset into %rax
      mov (%rdi, %rax, 8), %rax ; load the property using the offset in %rax
      ; After -O1 DBM
      mov $0x3, %rax            ; load the actual offset (0x18/8 = 3) as an immediate
      nopl (%rax)               ; NOP padding, since the immediate mov is shorter
      mov (%rdi, %rax, 8), %rax ; load the property using the immediate offset
    • -O2: If the sequence consists of loading the offset into a register followed by loading the property using that register as an index (plus the base object address and a scale), the two instructions are replaced by a single instruction that loads the property directly through a precomputed displacement (offset * scale) from the object base register.

      ; Original (from C compiled code, simplified)
      mov 0x101c(%rip), %rax    ; load the cached offset (e.g., 3)
      mov (%rdi, %rax, 8), %rax ; load property at offset 3 * 8 = 0x18 from %rdi
      ; After -O2 DBM
      mov 0x18(%rdi), %rax      ; load the property directly at offset 0x18 from %rdi
      nopl 0x0(%rax)            ; NOP padding, since the direct mov is shorter
    • The DBM uses NOP instructions for padding if the new instruction(s) are shorter than the original sequence.
  4. Practical Constraints:
    • Memory pages containing executable code are typically read-only. The DBM therefore changes their permissions to read/write/execute using mprotect; this overhead is mitigated by caching which pages have already been unprotected (a minimal patching sketch appears below).
    • Handling variable-size x86_64 instructions is crucial for padding.
    • Immediate values (like offsets) must fit within the instruction's immediate field size (typically 4 bytes). This prevents optimizing loads of 8-byte pointers (like hidden classes).
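
To make step 2 concrete, here is a minimal sketch of the scan; it uses Capstone's actual C API, but it is an illustration under assumptions rather than the paper's code (WINDOW and find_offset_load are hypothetical names, and label is the address taken from an &&IC_LBLxxx label):

    /* Sketch: scan a window of code after an IC label for the RIP-relative
       mov that loads the cached property offset. Link with -lcapstone. */
    #include <capstone/capstone.h>
    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW 64  /* bytes to disassemble past the label (hypothetical) */

    /* Returns the address of the offset-loading instruction, or NULL. */
    static const uint8_t *find_offset_load(const uint8_t *label,
                                           const void *icache_field) {
        csh h;
        cs_insn *insn;
        const uint8_t *found = NULL;

        if (cs_open(CS_ARCH_X86, CS_MODE_64, &h) != CS_ERR_OK)
            return NULL;
        cs_option(h, CS_OPT_DETAIL, CS_OPT_ON);   /* we need operand details */

        size_t n = cs_disasm(h, label, WINDOW,
                             (uint64_t)(uintptr_t)label, 0, &insn);
        for (size_t i = 0; i < n && !found; i++) {
            cs_x86 *x86 = &insn[i].detail->x86;
            for (uint8_t j = 0; j < x86->op_count; j++) {
                cs_x86_op *op = &x86->operands[j];
                /* a mov whose memory operand is RIP-relative and points
                   at the icache offset field? */
                if (insn[i].id == X86_INS_MOV && op->type == X86_OP_MEM
                    && op->mem.base == X86_REG_RIP
                    && insn[i].address + insn[i].size + op->mem.disp
                           == (uint64_t)(uintptr_t)icache_field)
                    found = (const uint8_t *)(uintptr_t)insn[i].address;
            }
        }
        if (n) cs_free(insn, n);
        cs_close(&h);
        return found;
    }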

The analysis and modification happen on the first cache miss for a given IC. Subsequent misses for the same IC using a different hidden class only require updating the immediate offset value in the already modified instruction (if applicable), avoiding the full analysis overhead.
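
A minimal sketch of this patching and update path, again an illustration rather than the paper's implementation (unprotect and patch_ic_load are hypothetical names; the 0xB8 opcode matches the -O1 example's mov of an immediate into %rax via %eax):

    /* Sketch: unprotect the page(s), then overwrite the RIP-relative load
       with "mov $imm32, %eax" plus NOP padding. */
    #include <sys/mman.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Make the page(s) containing [addr, addr+len) writable and executable.
       A real implementation caches which pages are already unprotected. */
    static int unprotect(uint8_t *addr, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(page - 1);
        return mprotect((void *)start, (uintptr_t)addr + len - start,
                        PROT_READ | PROT_WRITE | PROT_EXEC);
    }

    /* Replace an old_len-byte load with "mov $offset, %eax" (5 bytes:
       opcode 0xB8 + imm32, zero-extended into %rax), padding the rest
       with one-byte NOPs (the paper pads with multi-byte nopl). */
    static int patch_ic_load(uint8_t *insn, size_t old_len, int32_t offset) {
        if (old_len < 5 || unprotect(insn, old_len) != 0)
            return -1;
        insn[0] = 0xB8;                       /* mov imm32 -> %eax    */
        memcpy(insn + 1, &offset, sizeof offset);
        memset(insn + 5, 0x90, old_len - 5);  /* NOP padding          */
        /* effectively a no-op on x86_64, but conventional after
           modifying code in place */
        __builtin___clear_cache((char *)insn, (char *)insn + old_len);
        return 0;
    }

A later miss with a different hidden class then only needs the memcpy of the new imm32 into bytes 1-4, which is the cheap update path described above.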

Experimental Evaluation & Results:

The authors evaluated the DBM approach using the jsbench JavaScript benchmark suite compiled with hopc + gcc -O3 on various x86_64 architectures (Intel Golden Cove, Gracemont, and AMD Zen 3).

  • RQ1 (Detection Effectiveness): The DBM successfully identified and could modify over 96% of the IC sequences generated by gcc -O3. Around 17% of them were candidates for the -O2 optimization.
  • RQ2 (Instruction Count): The DBM had a marginal impact on the total executed instruction count (<1% variation for most benchmarks), mainly due to NOP padding.
  • RQ3 (Memory Reads): The DBM effectively reduced the number of L1 data cache loads by an average of 1.5% across benchmarks, with peaks over 10% reduction for some. This confirmed the DBM was successfully removing the targeted memory accesses.
  • RQ4 (Overall Performance): Surprisingly, despite reducing memory reads, the DBM optimization showed no significant improvement in execution time. Performance differences were generally within the ±5% variability observed from adding the DBM code infrastructure itself, averaging a 0.03% slowdown across all benchmarks and architectures.

Explanation for Lack of Speedup:

Further analysis revealed that the lack of speedup wasn't due to the DBM overhead (which is minimal and concentrated during a short warmup phase). The core reason is that modern CPU microarchitectures (specifically the ones tested) are already highly effective at mitigating the latency of the memory reads targeted by the DBM optimization. They likely employ advanced techniques like:

  • Branch Prediction: Correctly predicting the IC hit path.
  • Data Prefetching: Bringing the icache data (hidden class and offset) into the cache before it's explicitly requested by the load instructions.
  • Out-of-Order Execution: Executing the load instructions early and hiding their latency by overlapping them with other operations.
  • Store-to-Load Forwarding: If the IC structure was recently written, the subsequent load might get the data directly from the store buffer.

Essentially, the hardware is already optimizing the critical path, making the software optimization of removing the memory reads redundant in terms of execution time, even though it successfully reduces the number of executed memory load instructions reported by hardware counters.
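
This effect can be made tangible with a toy experiment, which is not from the paper and is only a sketch under stated assumptions (fields, offset, and N are made-up names; rough rdtsc timing; both arrays are volatile so the compiler keeps every load). On cores like those tested, a loop that re-loads its array offset from memory each iteration tends to time close to one with the offset hard-coded, because the hot offset load is an L1 hit whose latency the out-of-order core hides:

    /* Toy sketch (not from the paper): does an extra, cache-hot offset
       load cost wall-clock time? Build e.g. with: gcc -O1 toy.c */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>              /* __rdtsc on GCC/Clang, x86_64 only */

    #define N 100000000L
    static volatile long fields[16];    /* stands in for the object storage  */
    static volatile long offset = 3;    /* stands in for icache.offset       */

    int main(void) {
        long sum = 0;
        uint64_t t0 = __rdtsc();
        for (long i = 0; i < N; i++)
            sum += fields[offset];      /* offset re-loaded from memory      */
        uint64_t t1 = __rdtsc();
        for (long i = 0; i < N; i++)
            sum += fields[3];           /* offset baked in as an immediate   */
        uint64_t t2 = __rdtsc();
        printf("indirect: %llu cycles, immediate: %llu cycles (sum=%ld)\n",
               (unsigned long long)(t1 - t0),
               (unsigned long long)(t2 - t1), sum);
        return 0;
    }

If the paper's RQ4 result carries over, one would expect the two loops to report similar cycle counts even though the first executes an extra load per iteration.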

Conclusion:

The paper concludes that, while their DBM technique is effective at finding and modifying static code to eliminate memory reads in IC sequences, this specific optimization does not translate into performance gains on modern x86_64 processors. This highlights a key challenge in performance optimization: microarchitectural behavior can significantly alter the impact of seemingly beneficial code transformations, and simple metrics like instruction count or memory reads don't always predict real-world performance. The authors suggest that future compiler optimizations, whether AoT or JIT, need to be designed with a better understanding of how modern hardware handles specific code patterns, as naive instruction/memory reduction may not yield expected speedups.

HackerNews

  1. An Attempt to Catch Up with JIT Compilers (200 points, 142 comments)