EVM Bytecode Representation
- EVM Bytecode Representation is the canonical low-level format for Ethereum smart contracts, detailing stack operations, control flow, and gas cost semantics.
- Intermediate representations like rule-based logic translation and context-sensitive CFGs enable precise static analysis and automated vulnerability detection.
- Symbolic and constraint-based models, including SMT encoding and zkEVM formulations, support formal verification and efficient smart contract optimization.
The Ethereum Virtual Machine (EVM) bytecode representation is the canonical low-level format in which Ethereum smart contracts are deployed and executed. EVM bytecode encodes stack-based instructions, explicit control flow, dynamic storage and memory operations, and a fine-grained gas cost model. Its representation forms the foundation for execution, static and dynamic analysis, formal verification, similarity detection, and cryptographic proof systems throughout the Ethereum ecosystem.
1. Formal Models and Core Bytecode Structure
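The EVM is a 256-bit stack machine; its bytecode is a sequence of raw bytes, each interpreted either as an opcode or as part of a PUSH immediate operand. The standard formal representation models code as an array of bytes, with instructions such as ADD, PUSH, JUMP, and SSTORE acting on a state tuple:
$$\sigma = (\mathit{pc}, \mathit{stack}, \mathit{mem}, \mathit{storage}, \mathit{code}, \mathit{gas})$$
where $\mathit{pc}$ is the program counter, $\mathit{stack}$ is a LIFO vector supporting up to 1024 $256$-bit words, $\mathit{mem}$ is a sparse linear address space, $\mathit{storage}$ is a persistent mapping $\{0,1\}^{256} \to \{0,1\}^{256}$, $\mathit{code}$ is the bytearray, and $\mathit{gas}$ accounts for resource consumption (Grishchenko et al., 2018, Cassez et al., 2023).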
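The byte-level structure described above can be illustrated with a minimal disassembler sketch. The opcode table is deliberately tiny and the function name `disassemble` is an illustrative choice, not part of any cited tool; the only encoding facts relied on are that PUSH1 through PUSH32 occupy 0x60 through 0x7F and carry 1 to 32 immediate bytes, and that missing immediate bytes at the end of the code read as zero.

```python
# Minimal opcode table; a real disassembler would cover the full instruction set.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x55: "SSTORE",
    0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST",
}


def disassemble(code: bytes):
    """Return a list of (offset, mnemonic, immediate_or_None) tuples."""
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry 1..32 immediate bytes
            n = op - 0x5F
            # Immediates truncated by the end of the code are zero-padded on the right.
            imm = code[pc + 1 : pc + 1 + n].ljust(n, b"\x00")
            out.append((pc, f"PUSH{n}", int.from_bytes(imm, "big")))
            pc += 1 + n
        else:
            out.append((pc, OPCODES.get(op, f"UNKNOWN_0x{op:02x}"), None))
            pc += 1
    return out


# PUSH1 2, PUSH1 3, ADD, STOP, JUMPDEST
listing = disassemble(bytes.fromhex("6002600301005b"))
for offset, mnemonic, immediate in listing:
    print(offset, mnemonic, "" if immediate is None else hex(immediate))
```

Note that a 0x5B byte inside a PUSH immediate is data, not a JUMPDEST; the linear sweep above skips over immediates for exactly that reason.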
The small-step semantics of EVM bytecode define the operational transitions:
- For arithmetic or stack operations (e.g., ADD, DUP, SWAP), the semantics are encoded as pop/push manipulations on the stack with explicit gas decrement and program counter update.
- Persistent state-altering instructions (e.g., SSTORE) atomically update contract storage.
- Control-flow opcodes, in particular JUMP and JUMPI, are stack-based: jump destinations are computed at runtime, with valid targets signaled by JUMPDEST markers in the bytecode (Grishchenko et al., 2018, Cassez et al., 2023).
Exceptions (out-of-gas, stack underflow, invalid jump, unknown opcode) immediately halt execution and yield error states (Cassez et al., 2023). This formalization enables rigorous, mechanically verified reasoning about execution traces, safety, and correctness (Grishchenko et al., 2018, Cassez et al., 2023).
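As a rough illustration of these small-step rules, the sketch below interprets a handful of opcodes (STOP, ADD, PUSH1, JUMP, JUMPI, JUMPDEST) with gas accounting and exceptional halts. It is not taken from the cited formalizations: the names `State`, `EvmError`, `step`, and `run` are hypothetical, memory and storage are omitted, and only the static gas costs of the modelled opcodes are included.

```python
from dataclasses import dataclass, field

STACK_LIMIT = 1024
WORD = 2**256
# Static gas costs of the few opcodes modelled here:
# STOP, ADD, JUMP, JUMPI, JUMPDEST, PUSH1.
GAS = {0x00: 0, 0x01: 3, 0x56: 8, 0x57: 10, 0x5B: 1, 0x60: 3}


class EvmError(Exception):
    """Exceptional halt (out of gas, stack error, invalid jump, unknown opcode)."""


@dataclass
class State:
    pc: int = 0
    stack: list = field(default_factory=list)
    gas: int = 0
    halted: bool = False


def jumpdests(code):
    """Offsets of JUMPDEST bytes that are not inside PUSH immediates."""
    dests, pc = set(), 0
    while pc < len(code):
        op = code[pc]
        if op == 0x5B:
            dests.add(pc)
        pc += 1 + (op - 0x5F if 0x60 <= op <= 0x7F else 0)
    return dests


def step(code, s, dests):
    """Apply one small-step transition to state s (mutated in place)."""
    op = code[s.pc] if s.pc < len(code) else 0x00  # running off the end acts as STOP
    if op not in GAS:
        raise EvmError(f"unknown opcode 0x{op:02x}")
    if s.gas < GAS[op]:
        raise EvmError("out of gas")
    s.gas -= GAS[op]

    def pop():
        if not s.stack:
            raise EvmError("stack underflow")
        return s.stack.pop()

    def push(value):
        if len(s.stack) >= STACK_LIMIT:
            raise EvmError("stack overflow")
        s.stack.append(value % WORD)

    if op == 0x00:  # STOP
        s.halted = True
    elif op == 0x01:  # ADD, modulo 2**256
        push(pop() + pop())
        s.pc += 1
    elif op == 0x60:  # PUSH1; a missing immediate byte reads as zero
        push(code[s.pc + 1] if s.pc + 1 < len(code) else 0)
        s.pc += 2
    elif op == 0x5B:  # JUMPDEST is a no-op marker
        s.pc += 1
    else:  # JUMP (0x56) or JUMPI (0x57): destination is on top of the stack
        target = pop()
        taken = True if op == 0x56 else pop() != 0
        if not taken:
            s.pc += 1
        elif target in dests:
            s.pc = target
        else:
            raise EvmError("invalid jump")


def run(code, gas):
    s, dests = State(gas=gas), jumpdests(code)
    while not s.halted:
        step(code, s, dests)
    return s


# PUSH1 2, PUSH1 3, ADD, PUSH1 8, JUMP, JUMPDEST (offset 8), STOP
final = run(bytes.fromhex("60026003016008565b00"), gas=100)
print(final.stack, final.gas, final.pc)
```

Raising an exception on every error condition mirrors the "halt and yield an error state" behaviour described above; a fuller model would also revert state changes and consume the remaining gas.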
2. Intermediate Representations and Static Analyses
To make EVM bytecode amenable to automated analysis, multiple intermediate representations have been proposed, each preserving different aspects of bytecode semantics.
Rule-Based Representation (RBR) (EthIR): EVM bytecode is translated into a logic-style program using Horn clauses, where basic-blocks become block predicates with explicit stack, local, storage, and environment parameters. Each transition reflects stack assignments and guarded jumps, flattening the implicit stack into named registers and exposing data/control flow explicitly. This facilitates high-level resource and property analyses with off-the-shelf Horn-clause solvers and supports direct, semantics-preserving simulation (Albert et al., 2018).
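The stack-flattening step can be sketched for a single straight-line block as follows. This is an illustration of the general idea only, not EthIR's actual output syntax: the function `flatten_block`, the tuple encoding of instructions, and the `s0, s1, ...` naming are assumptions made for the example, and jumps, guards, and block predicates are left out.

```python
ARITH = {"ADD": "+", "SUB": "-", "MUL": "*"}


def flatten_block(ops, entry_height=0):
    """Translate straight-line stack code into assignments over s0, s1, ...

    Returns the list of assignments and the stack height at block exit.
    """
    h = entry_height  # current stack height; the top of the stack is s{h-1}
    out = []

    def top(k):
        """Name of the k-th slot from the top (k = 0 is the top)."""
        if h - 1 - k < 0:
            raise ValueError("stack underflow in block")
        return f"s{h - 1 - k}"

    for op, *args in ops:
        if op == "PUSH":
            out.append(f"s{h} = {args[0]}")
            h += 1
        elif op in ARITH:
            # Binary operators pop the top two slots and write the result
            # into the lower one; the first operand is the top of the stack.
            a, b = top(0), top(1)
            out.append(f"{b} = {a} {ARITH[op]} {b}")
            h -= 1
        elif op == "DUP":  # DUPn copies the n-th slot from the top
            out.append(f"s{h} = {top(args[0] - 1)}")
            h += 1
        elif op == "SWAP":  # SWAPn exchanges the top with the (n+1)-th slot
            a, b = top(0), top(args[0])
            out.append(f"{a}, {b} = {b}, {a}")
        elif op == "SSTORE":  # pops the key (top) and then the value
            key, value = top(0), top(1)
            out.append(f"storage[{key}] = {value}")
            h -= 2
        else:
            raise ValueError(f"unsupported opcode {op}")
    return out, h


block = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 0), ("SSTORE",)]
assignments, exit_height = flatten_block(block)
for line in assignments:
    print(line)
```

Because the stack height is known statically at every instruction inside a block, each implicit stack slot can be given a fixed name, which is what makes data flow explicit for downstream solvers.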
Control Flow Graphs (CFGs):
- Stack-sensitive CFGs (Albert et al., 2020): Block cloning is employed, constructing distinct basic-block replicas for each possible entry stack state at a given program location. This preserves precise control transfers even in the presence of indirect jumps, enabling sound path reasoning.
- Reuse-sensitive CFGs (Wang et al., 20 May 2025): Modern EVM compilers (e.g., Solidity/Vyper) aggressively reuse code fragments. Naive CFG constructions conflate semantically distinct paths, leading to infeasible paths and spurious data-flow joins. Reuse-sensitive CFGs distinguish the context (origin of the jump operand) in which each block is entered, applying taint analysis to clone blocks for each unique jump context. This approach eliminates fake loops and joins and supports precise detection of vulnerability patterns, with reported F1-scores above 99%.
Table: Key Representation Approaches
| Representation | Main Abstraction | Salient Feature |
|---|---|---|
| Horn-Clause RBR (EthIR) | Logic program | Flattened stack, explicit control/data flow |
| Stack-sensitive CFG | CFG per stack state | Soundness via per-stack-state block cloning |
| Reuse-sensitive CFG | CFG per reuse context | Block (re)cloning on jump-token taint context |
Both block-based and context-sensitive representations are essential to avoid infeasible paths, over-approximations, or missed vulnerabilities (Albert et al., 2020, Wang et al., 20 May 2025).
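The effect of block cloning can be sketched with a toy worklist algorithm. The block encoding, the function `stack_sensitive_cfg`, and the example program are all hypothetical; this toy tracks the entire constant stack at block entry, which is a simplification of the cited constructions, and it is not guaranteed to terminate on loops that keep growing the stack.

```python
from collections import deque


def stack_sensitive_cfg(blocks, entry):
    """Build CFG edges between (offset, entry_stack) clones.

    blocks maps a block offset to a list of abstract ops:
    ("PUSH", c), ("POP",), and a final ("JUMP",) or ("STOP",).
    """
    start = (entry, ())
    seen = {start}
    work = deque([start])
    edges = set()
    while work:
        node = work.popleft()
        offset, entry_stack = node
        stack = list(entry_stack)
        for op, *args in blocks[offset]:
            if op == "PUSH":
                stack.append(args[0])
            elif op == "POP":
                stack.pop()
            elif op == "JUMP":
                # The jump target is whatever constant is on top of the stack.
                target = stack.pop()
                succ = (target, tuple(stack))
                edges.add((node, succ))
                if succ not in seen:
                    seen.add(succ)
                    work.append(succ)
            # "STOP" ends the block with no successors.
    return edges


# Block 30 is a shared fragment that jumps to a return address left on the stack.
blocks = {
    0: [("PUSH", 10), ("PUSH", 30), ("JUMP",)],   # call the fragment, return to 10
    10: [("PUSH", 20), ("PUSH", 30), ("JUMP",)],  # call it again, return to 20
    20: [("STOP",)],
    30: [("JUMP",)],
}
edges = stack_sensitive_cfg(blocks, entry=0)
for src, dst in sorted(edges):
    print(src, "->", dst)
```

In this example block 30 is cloned once per entry stack, so each clone has exactly one successor. A CFG with a single node for block 30 would instead contain both 30 -> 10 and 30 -> 20, introducing a spurious loop through block 10 and an infeasible path that skips it.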
3. Semantic Feature Extraction and Machine Learning Embeddings
Data-driven approaches leverage bytecode representations for vulnerability detection and code similarity.
Eth2Vec (Ashizawa et al., 2021):
- Disassembles EVM bytecode into a linearized token stream: opcode and operand tokens, hierarchically organized (contract → functions → basic blocks → instructions → tokens).
- Adapts PV-DM (Paragraph Vector–Distributed Memory), embedding opcodes and operands into vectors, with context windows used for token prediction.
- Learns function-level and contract-level embeddings; the contract embedding is a mean of function embeddings.
- Similarity between contracts is computed via cosine similarity $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$ of their embeddings $u$ and $v$, enabling detection of vulnerable templates even under code rewriting.
This architecture bypasses the need for manual feature engineering and achieves robust vulnerability detection, with precision and F1 exceeding those of SVM classifiers trained on AST features.
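The aggregation and comparison steps can be sketched in a few lines. The embedding values below are made up for illustration, and the helper names `mean_vector` and `cosine` are not taken from Eth2Vec; learning the embeddings themselves (the PV-DM part) is out of scope here.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional function embeddings for three contracts.
contract_a = mean_vector([[1.0, 0.0], [0.0, 1.0]])  # mean of two function vectors
contract_b = mean_vector([[2.0, 2.0]])
contract_c = mean_vector([[1.0, -1.0]])

sim_ab = cosine(contract_a, contract_b)  # same direction
sim_ac = cosine(contract_a, contract_c)  # orthogonal
print(round(sim_ab, 6), round(sim_ac, 6))
```

Cosine similarity depends only on the direction of the vectors, so two contracts whose embeddings differ in magnitude but not direction still score as near-identical.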
Stable-Semantic Graph (SSG) (Chen et al., 17 Nov 2025):
- Models each EVM function as a heterogeneous, directed graph $G = (V, E)$, where nodes are stable control instructions (storage, call, log, return) or data-flow variables, and edges encode control-flow and data dependencies.
- Edges are of three types: control-control (SSG-SCFG), data-data (taint/backward data flow), and control-data (uses/definitions).
- SSG offers strong robustness under compiler variations or code reuse by focusing on semantically stable instruction types, cross-version isomorphism, and excluding low-level noise (e.g., transient stack ops).
- Embedding SSGs with a heterogeneous GNN produces function vectors with empirical AUC of 0.963 for true-binary similarity detection, outperforming instruction- and CFG-based baselines.
4. Symbolic and Constraint-Based Representations
EVM bytecode’s precise operational semantics enable its encoding into symbolic and constraint-based representations for formal analysis, equivalence checking, and automated optimization.
SMT-Level Encoding and Superoptimization (Nagele et al., 2020):
- The machine state at step $j$ comprises stack, stack pointer, memory, storage, halt flag, gas, and program counter.
- For each instruction $\iota$, a formula encodes:
  - Gas update: the gas counter at step $j+1$ equals the counter at step $j$ adjusted by the cost of $\iota$.
  - Stack-pointer manipulation: the stack pointer at step $j+1$ equals the pointer at step $j$, minus the number of words $\iota$ pops, plus the number it pushes.
  - Stack preservation, halting condition, PC update, and stack effect.
  - Storage and memory manipulation for SSTORE, SLOAD, etc.
- The combined formula for a program is the conjunction of these per-step constraints plus initial state axioms.
- To verify equivalence or optimize with respect to gas, constraints equating the final states of the source and candidate programs are asserted, with additional constraints ensuring that the candidate bytecode consumes strictly less gas.
- SMT solvers (e.g., Z3) act as equivalence checkers and synthesizers, automatically searching for cheaper but equivalent bytecode sequences.
This symbolic encoding provides the backbone for superoptimizers and translation-validation frameworks in EVM toolchains.
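A per-step encoding of this kind might look as follows. The notation is assumed for illustration rather than quoted from the cited work: $j$ indexes execution steps, $\iota$ is the instruction executed at step $j$, $g_j$ is the gas consumed so far (modelled here as a cumulative counter), $c_j$ is the stack pointer, $\mathit{st}_j(n)$ is the word in stack slot $n$, $C(\iota)$ is the gas cost of $\iota$, and $\delta(\iota)$ and $\alpha(\iota)$ are the numbers of words it pops and pushes.

$$
\begin{aligned}
g_{j+1} &= g_j + C(\iota) \\
c_{j+1} &= c_j - \delta(\iota) + \alpha(\iota) \\
\mathit{st}_{j+1}(n) &= \mathit{st}_j(n) \quad \text{for all } n < c_j - \delta(\iota) \\
\mathit{st}_{j+1}(c_j - 2) &= \bigl(\mathit{st}_j(c_j - 1) + \mathit{st}_j(c_j - 2)\bigr) \bmod 2^{256} \quad \text{if } \iota = \mathrm{ADD}
\end{aligned}
$$

Under the same assumptions, a candidate program $q$ improves on a source program $p$ when their final states agree and $g^{q}_{|q|} < g^{p}_{|p|}$, where $|p|$ and $|q|$ denote the program lengths.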
5. EVM Bytecode in Cryptographic Proof Systems (zkEVM)
Zero-knowledge proofs for EVM execution (zkEVM) require representing EVM bytecode and its semantics as algebraic constraints over finite fields (Hassanzadeh-Nazarabadi et al., 6 Oct 2025):
- Arithmetization (Constraint Formulation):
  - R1CS: Each opcode becomes a sequence of quadratic constraints; resource-intensive for wide 256-bit operations; SSTORE, SHA3, and CALL require tens to hundreds of thousands of constraints per instance.
  - PLONKish: Employs selector polynomials and custom gates per opcode, supporting within-row and cross-row consistency (stack, memory, pc). Reduces constraint count via lookups and permutation arguments.
  - AIR: Encodes global transition polynomials across state vectors; modularity suffers for the EVM's highly heterogeneous opcode set.
- Constraint Dispatch: Selector-based (per-opcode activation), sparse polynomial activation, and ROM-based lookups (explicit program counter/opcode mapping with committed code).
- Compatibility Spectrum:
  - Type 1: Full bytecode and gas semantic equivalence.
  - Type 2/2.5: EVM opcode fidelity with relaxed gas or precompile rules.
  - Type 3–4: Source or IR compatibility, mapping Solidity to ZK-friendly custom ISAs.
The choice of encoding impacts constraint complexity, prover cost, auditability, and cross-tool equivalence. No deployed system yet achieves machine-verified semantic equivalence across the more than 140 EVM opcodes together with all gas, success, and exception behaviors.
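Selector-based dispatch can be illustrated with a toy execution-trace checker. Everything here is an assumption made for the example: the modulus `P` is a small Mersenne prime rather than a production field, the column names (`s_add`, `s_mul`, `a`, `b`, `c`, `pc`) are invented, and plain field arithmetic does not model 256-bit wraparound, which real systems typically handle by splitting words into limbs with range checks.

```python
P = 2**61 - 1  # toy prime modulus, not a production proving field


def row_ok(row):
    """Within-row gates: each selector activates the constraint of one opcode."""
    add_gate = row["s_add"] * (row["a"] + row["b"] - row["c"])
    mul_gate = row["s_mul"] * (row["a"] * row["b"] - row["c"])
    return add_gate % P == 0 and mul_gate % P == 0


def transition_ok(cur, nxt):
    """Cross-row gate: in this toy, every instruction advances pc by one."""
    return (nxt["pc"] - cur["pc"] - 1) % P == 0


def trace_ok(trace):
    rows = all(row_ok(r) for r in trace)
    links = all(transition_ok(a, b) for a, b in zip(trace, trace[1:]))
    return rows and links


# Two-row witness: an ADD row followed by a MUL row.
trace = [
    {"pc": 0, "s_add": 1, "s_mul": 0, "a": 2, "b": 3, "c": 5},
    {"pc": 1, "s_add": 0, "s_mul": 1, "a": 5, "b": 4, "c": 20},
]
tampered = [dict(trace[0]), dict(trace[1], c=21)]

print(trace_ok(trace), trace_ok(tampered))
```

Multiplying each opcode's constraint by its selector means a single set of polynomial identities covers every row: inactive gates evaluate to zero regardless of the witness values, while the active gate forces the row to satisfy that opcode's relation.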
6. Design Considerations and Practical Implications
The diversity of EVM bytecode representations reflects the competing requirements of analysis soundness, optimization potential, symbolic reasoning, and compatibility with polynomial proof systems.
- Stack flattening, explicit block and context parameters, and per-instruction formal rule sets are essential for analyzability and formal proofs (Grishchenko et al., 2018, Albert et al., 2018, Cassez et al., 2023).
- Precise CFGs must account for compiler-induced code reuse to prevent infeasible paths and masking of reentrancy or tx.origin vulnerabilities (Wang et al., 20 May 2025).
- Data-driven representations (Eth2Vec, SSG) enable scalable, rewrite-resilient similarity and vulnerability detection despite compiler changes or opcode-level obfuscation (Ashizawa et al., 2021, Chen et al., 17 Nov 2025).
- The translation of EVM bytecode into constraints or algebraic representations (for SMT-based verification or zkEVMs) requires faithful emulation of stack, gas, storage, exception, and halting behaviors, with formalization in proof assistants (F*) or verification-oriented languages (Dafny) serving as ground truth (Nagele et al., 2020, Cassez et al., 2023, Hassanzadeh-Nazarabadi et al., 6 Oct 2025).
A plausible implication is that future advances in EVM analysis tooling will increasingly depend on compositional, context-sensitive representations, leveraging both symbolic and learned embeddings, in parallel with machine-checked formalizations to achieve both precision and scalability.