Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode
(2011.13127v3)
Published 26 Nov 2020 in cs.PL
Abstract: Fast compilation is important when compilation occurs at runtime, such as query compilers in modern database systems and WebAssembly virtual machines in modern browsers. We present copy-and-patch, an extremely fast compilation technique that also produces good quality code. It is capable of lowering both high-level languages and low-level bytecode programs to binary code, by stitching together code from a large library of binary implementation variants. We call these binary implementations stencils because they have holes where missing values must be inserted during code generation. We show how to construct a stencil library and describe the copy-and-patch algorithm that generates optimized binary code. We demonstrate two use cases of copy-and-patch: a compiler for a high-level C-like language intended for metaprogramming and a compiler for WebAssembly. Our high-level language compiler has negligible compilation cost: it produces code from an AST in less time than it takes to construct the AST. We have implemented an SQL database query compiler on top of this metaprogramming system and show that on TPC-H database benchmarks, copy-and-patch generates code two orders of magnitude faster than LLVM -O0 and three orders of magnitude faster than higher optimization levels. The generated code runs an order of magnitude faster than interpretation and 14% faster than LLVM -O0. Our WebAssembly compiler generates code 4.9X-6.5X faster than Liftoff, the WebAssembly baseline compiler in Google Chrome. The generated code also outperforms Liftoff's by 39%-63% on the Coremark and PolyBenchC WebAssembly benchmarks.
The paper "Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode" (Xu et al., 2020) introduces a novel compilation technique called copy-and-patch designed for scenarios where compilation occurs at runtime, such as in WebAssembly virtual machines or database query engines. The core idea is to achieve extremely fast compilation while still producing good quality code by stitching together pre-compiled binary snippets from a large library.
The copy-and-patch system consists of two main components: the MetaVar compiler and the copy-and-patch code generator. The key concept enabling this technique is the binary stencil. A binary stencil is a partial binary implementation of a high-level language AST node or a bytecode opcode. These stencils have "holes" where missing values like immediate literals, stack variable offsets, and branch or call targets need to be inserted (patched) at runtime.
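As a minimal sketch of what a stencil with holes might look like, the following Python models a stencil as a byte string plus a list of hole descriptors. The names Hole, Stencil, and copy_and_patch are hypothetical and do not come from the paper. The byte 0xB8 is the x86 opcode for mov eax, imm32, but the bytes here are only manipulated, never executed.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    offset: int  # byte offset of the hole inside the stencil's code
    fmt: str     # struct format of the missing value ("<i" = 32-bit little-endian int)
    name: str    # label of the runtime value that fills the hole


@dataclass(frozen=True)
class Stencil:
    code: bytes   # pre-compiled bytes, with zero placeholders at each hole
    holes: tuple  # tuple of Hole descriptors


def copy_and_patch(stencil, values):
    """Copy the stencil's bytes, then overwrite each hole with its runtime value."""
    buf = bytearray(stencil.code)      # copy
    for hole in stencil.holes:         # patch
        struct.pack_into(hole.fmt, buf, hole.offset, values[hole.name])
    return bytes(buf)


# Illustrative stencil: opcode 0xB8 followed by a 4-byte hole for the literal.
LOAD_CONST = Stencil(code=b"\xB8\x00\x00\x00\x00",
                     holes=(Hole(offset=1, fmt="<i", name="literal"),))

patched = copy_and_patch(LOAD_CONST, {"literal": 42})
print(patched.hex())
```

The stencil itself is never modified, so the same library entry can be copied and patched any number of times with different values.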
The stencil library contains numerous binary stencil variants for each type of AST node or bytecode. These variants are specialized based on factors such as operand types (e.g., integer, float), the location of values (register, stack, constant), and common patterns of AST subtrees or bytecode sequences (called supernodes). By providing many variants, the copy-and-patch code generator can select the most appropriate stencil at runtime, enabling simple optimizations like efficient constant handling and register allocation.
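Variant selection can be pictured as a table lookup keyed by the properties listed above. The sketch below is an illustration under assumptions: the key fields, the stencil names, and the stack/stack fallback are all hypothetical, and a real library would map keys to binary code rather than to strings.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    op: str            # AST node or bytecode kind, e.g. "add"
    operand_type: str  # e.g. "i32", "f64"
    lhs_loc: str       # "reg", "stack", or "const"
    rhs_loc: str


# Hypothetical library: each key maps to the name of a pre-built stencil.
LIBRARY = {
    VariantKey("add", "i32", "reg", "reg"): "add_i32_reg_reg",
    VariantKey("add", "i32", "reg", "const"): "add_i32_reg_const",
    VariantKey("add", "i32", "stack", "stack"): "add_i32_stack_stack",
}


def select_stencil(op, operand_type, lhs_loc, rhs_loc):
    """Return the most specialized variant available for this operand placement."""
    key = VariantKey(op, operand_type, lhs_loc, rhs_loc)
    if key in LIBRARY:
        return LIBRARY[key]
    # Assumed fallback: a general variant that reads both operands from the
    # stack; the caller would first have to materialize the operands there.
    return LIBRARY[VariantKey(op, operand_type, "stack", "stack")]


print(select_stencil("add", "i32", "reg", "const"))
print(select_stencil("add", "i32", "stack", "reg"))
```

Because selection is a constant-time lookup, providing many variants costs library size rather than compilation time.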
To handle control flow between these discrete binary snippets, the stencils are generated using continuation-passing style (CPS) [steele1977]. Instead of returning control to a caller, a stencil directly jumps (via a tail call compiled to a jump instruction) to the next stencil in the execution sequence. This helps avoid function call overhead between stencils. Register allocation is integrated with this CPS approach and the GHC calling convention [ghcConvention]: temporary values are passed in registers as parameters to the continuation function. Stencil variants with "pass-through" parameters are generated to preserve register values needed by later operations, ensuring that intermediate stencils don't clobber required registers. If a value's lifetime exceeds available registers, stencils that spill to and load from the stack are used.
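The continuation-passing structure with pass-through values can be imitated in Python with closures. This is only a rough analogy: the function names are hypothetical, positional arguments stand in for registers, and Python does not turn tail calls into jumps the way the compiled stencils do.

```python
def make_const(value, cont):
    """Stencil analogue: load a literal into a fresh 'register', then continue."""
    def stencil(*regs):
        # All existing register values are passed through untouched.
        return cont(*regs, value)
    return stencil


def make_add(cont):
    """Stencil analogue: add the two most recent values, preserve the rest."""
    def stencil(*regs):
        *passthrough, a, b = regs
        return cont(*passthrough, a + b)
    return stencil


def halt(*regs):
    """Final continuation: the result is the last live value."""
    return regs[-1]


# 1 + (2 + 3): the inner add must pass the value 1 through unharmed.
program = make_const(1, make_const(2, make_const(3, make_add(make_add(halt)))))
print(program())
```

The inner add sees three live values and forwards the first one unchanged, which mirrors the role of the pass-through parameters described above.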
The MetaVar compiler is responsible for building the stencil library at compiler installation time (not runtime). It takes C++ stencil generators as input. These generators are templated C++ functions using template meta-variables (metavars) to represent the different configurations and variants. Special macros (like DEF_CONSTANT_0, DEF_CONTINUATION_1) are used within the generator code to mark the locations of the "holes" that need patching later. MetaVar uses C++ template metaprogramming to instantiate the generators for valid combinations of metavars, subject to user-defined filter functions. It then leverages the Clang+LLVM compiler infrastructure to compile these instantiated C++ functions into object code. Crucially, MetaVar parses the object files to extract the binary code and the linker relocation records associated with the special macros. This information identifies the locations and types of the missing values, forming the Stencil struct (see figure fig:binary-stencil-struct in the paper). This system abstracts away platform-specific low-level details, making it portable.
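The instantiate-then-filter step can be sketched as a filtered Cartesian product. MetaVar itself does this with C++ templates; the Python below is only an analogue, and the metavar domains, the filter rule, and the naming scheme are assumptions made for illustration.

```python
from itertools import product

# Hypothetical metavar domains for a binary arithmetic node.
TYPES = ("i32", "i64", "f64")
LOCATIONS = ("reg", "stack", "const")


def keep(operand_type, lhs_loc, rhs_loc):
    """Hypothetical user-defined filter: skip const-const combinations,
    assuming such expressions are folded before code generation."""
    return not (lhs_loc == "const" and rhs_loc == "const")


def instantiate(op):
    """Enumerate every metavar combination that passes the filter."""
    return [f"{op}_{t}_{lhs}_{rhs}"
            for t, lhs, rhs in product(TYPES, LOCATIONS, LOCATIONS)
            if keep(t, lhs, rhs)]


variants = instantiate("add")
print(len(variants), variants[:3])
```

Each surviving combination would correspond to one generator instantiation that is compiled to object code and recorded in the library.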
The copy-and-patch code generation algorithm operates at runtime to compile a high-level language AST or a bytecode sequence into executable binary code. For high-level languages, the process involves several stages:
Pre-pass AST traversal: Performs light-weight register allocation and plans the stack frame layout for local variables and spilled temporaries. A simplified Sethi-Ullman algorithm [sethiULLMan] is used for expression temporary allocation, prioritizing low overhead.
AST to CPS Graph Conversion: A second AST traversal selects the most appropriate stencil or supernode stencil for each AST node based on pattern matching and context. This links stencils together into a CPS call graph. Supernodes are used to match common AST subtrees and leverage target-specific instructions, improving code quality.
Binary Code Generation (Copying): The algorithm traverses the CPS call graph, copying the binary code of each selected stencil into a contiguous memory region. By traversing in depth-first order and placing stencils sequentially, it maximizes the number of calls that become fall-throughs, effectively eliding jump instructions.
Patching: The final step iterates through the identified holes in the copied binary code, inserting the actual runtime values (literals from the AST, stack offsets, and computed addresses of other stencils) using the relocation information from the Stencil struct.
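The copying and patching stages above can be sketched on a toy CPS graph. Everything in this example is an assumption made for illustration: the graph, the node names, and the simplification that each stencil is a body followed by one jump slot per continuation. The opcodes are the x86 encodings of je rel32 (0F 84) and jmp rel32 (E9), whose 32-bit operand is relative to the end of the instruction; the bytes are only assembled, never executed.

```python
import struct

JE, JMP = b"\x0F\x84", b"\xE9"  # x86 `je rel32` and `jmp rel32` opcodes

# Hypothetical CPS graph: name -> (body bytes, continuations).
# The last continuation is the tail jump; earlier ones are conditional jumps.
GRAPH = {
    "entry": (b"\x90\x90", ["loop"]),
    "loop": (b"\x90", ["exit", "body"]),
    "body": (b"\x90\x90\x90", ["loop"]),
    "exit": (b"\xC3", []),
}


def layout(graph, entry):
    """Depth-first order that tries to place each tail continuation next."""
    order, seen = [], set()

    def visit(name):
        seen.add(name)
        order.append(name)
        for cont in reversed(graph[name][1]):  # tail continuation first
            if cont not in seen:
                visit(cont)

    visit(entry)
    return order


def generate(graph, entry):
    order = layout(graph, entry)
    buf, addr, holes, elided = bytearray(), {}, [], 0
    for i, name in enumerate(order):           # copying
        body, conts = graph[name]
        addr[name] = len(buf)
        buf += body
        for k, cont in enumerate(conts):
            is_tail = k == len(conts) - 1
            if is_tail and i + 1 < len(order) and order[i + 1] == cont:
                elided += 1                    # fall-through: no jump needed
                continue
            buf += (JMP if is_tail else JE) + b"\x00" * 4
            holes.append((len(buf) - 4, cont))
    for pos, target in holes:                  # patching
        struct.pack_into("<i", buf, pos, addr[target] - (pos + 4))
    return bytes(buf), order, elided


code, order, elided = generate(GRAPH, "entry")
print(order, elided, code.hex())
```

In this toy graph two of the four continuation edges become fall-throughs, so only the conditional exit and the loop back-edge need patched jump targets.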
The copy-and-patch technique also supports complex features like external function calls (to C++ code, for instance) and C++ exception propagation, important for interoperability in metaprogramming systems like database query compilers.
The paper evaluates copy-and-patch in two main use cases:
WebAssembly (Bytecode Compiler): A copy-and-patch based WebAssembly compiler was implemented and compared against industrial compilers like V8 Liftoff, Wasmer SinglePass/Cranelift/LLVM, Wasmtime Lightbeam/Cranelift, and WAVM. On the Coremark and PolyBenchC benchmarks, copy-and-patch generated code 4.9x to 6.5x faster than Liftoff, and the generated code outperformed Liftoff's by 39% to 63% (see figures fig:wasm-codegen-normalized and fig:wasm-execution-polybench-avg in the paper). It effectively displaces existing baseline compilers on the Pareto frontier and narrows the performance gap with some optimizing compilers. The stencil library for WebAssembly is small (35 kB, 1666 stencils), indicating practicality for memory-constrained environments.
High-Level Language (Metaprogramming/Database Query Compiler): A compiler for a C-like DSL embedded in C++ was built using copy-and-patch. This was used to implement a prototype SQL database query compiler for TPC-H benchmarks. Comparing copy-and-patch with an AST interpreter and LLVM -O0, -O1, -O2, and -O3 optimization levels on microbenchmarks (Fibonacci, Euler Sieve, Quicksort) and TPC-H queries revealed a clear Pareto frontier shift (see figures fig:eval-micro, fig:tpch-llvm, and fig:tpch-interp in the paper). Copy-and-patch dominates LLVM -O0: it generates code two orders of magnitude faster, and the generated code runs 14% faster on average for TPC-H. Compared to interpretation, copy-and-patch has a negligible startup cost, similar to AST construction time, but generates code that runs an order of magnitude faster (6x to 27x faster on TPC-H). While higher LLVM optimization levels yield faster execution, their compilation cost is three orders of magnitude higher than that of copy-and-patch, making them viable only for very long-running tasks. The high-level language stencil library is larger (17.5 MB, 98831 stencils) due to extensive supernode generation but is still smaller than the LLVM library. Copy-and-patch exhibits near-linear scaling with increasing program size, unlike LLVM (see figure fig:eval-scaling in the paper). An optimization breakdown showed that specialized stencils with direct branches and constants are the primary source of speedup over interpretation, while jump removal and light-weight register allocation within copy-and-patch provide significant further gains (see figure fig:breakdown in the paper).
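The claim that heavily optimizing compilers pay off only for long-running tasks follows from comparing total time, that is, compilation time plus execution time. The sketch below works through that arithmetic. The function name is hypothetical, and the input numbers are made up purely for illustration; they are not measurements from the paper.

```python
def break_even_run_time(fast_compile_s, slow_compile_s, speedup):
    """Run time t of the fast compiler's code above which the slow,
    optimizing compiler wins overall.

    The optimizing compiler wins when
        slow_compile_s + t / speedup < fast_compile_s + t,
    which rearranges to
        t > (slow_compile_s - fast_compile_s) / (1 - 1 / speedup).
    """
    return (slow_compile_s - fast_compile_s) / (1.0 - 1.0 / speedup)


# Illustrative inputs only: compilation 1000x slower (1 ms vs 1 s) and
# optimized code assumed to run 2x faster.
threshold = break_even_run_time(fast_compile_s=0.001, slow_compile_s=1.0, speedup=2.0)
print(f"optimizing compiler wins only if the task runs longer than {threshold:.3f} s")
```

With these assumed inputs, any task whose copy-and-patch code finishes in under about two seconds completes sooner end to end without the optimizing compiler.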
The paper positions copy-and-patch as a replacement for both interpretation and LLVM -O0 compilation in tiered execution environments, offering a better balance between startup delay and execution performance. The system is designed to be extensible by adding new AST node or supernode stencil generators. Future work could include integrating more advanced general-purpose optimizations or domain-specific techniques like type profiling for dynamic languages.