Bottom-Up Reverse Engineering
- Bottom-up reverse engineering is a method that infers system structures, relations, and semantics solely from low-level artifacts without relying on pre-defined models.
- It employs data-driven pipelines and formal algorithms, such as those in language modeling, binary decompilation, database schema mining, and UI layout analysis, to extract and synthesize high-level abstractions.
- This approach enables scalable, explainable, and adaptable system understanding even when traditional design documentation or formal specifications are unavailable.
Bottom-up reverse engineering denotes methodologies that reconstruct high-level abstractions or system specifications exclusively by analyzing low-level artifacts or raw behaviors, rather than assuming access to design documentation, formal specifications, or strong top-down priors. Unlike traditional top-down RE, which maps observed data to hand-crafted models, bottom-up approaches infer structure, constraints, and semantics inductively from the multiplicity of observations or executions, rendering them particularly suitable when system internals, documentation, or theory are unavailable or unreliable. These methodologies have emerged as the dominant paradigm across language understanding, binary analysis, software architecture recovery, database schema reconstruction, and graphical user interface modeling, unified by a data-driven, incremental assembly of increasingly abstract, explainable, and manipulable representations.
1. Foundational Principles and Definitions
In "Towards Explainable and Language-Agnostic LLMs: Symbolic Reverse Engineering of Language at Scale," bottom-up reverse engineering is characterized as inferring the building blocks of a system (e.g., language categories, relations, compositional rules) purely from empirical observation, with no presumption of pre-existing, innate grammar or specification. This stands in contrast to top-down methodologies such as Universal Grammar, static architectural blueprints, or fixed database schemas. Motivating principles include distributional semantics, Frege's context principle, and the philosophy that conceptual knowledge is strictly contextual and compositional, anchored in usage patterns and statistical co-occurrence (Saba, 2023).
In binary decompilation, provenance-guided superset decompilation (PGSD) implements bottom-up RE by monotonically accumulating relational facts about a program, derived from bytecode upwards through a hierarchy of intermediate representations (IRs), with all candidate abstractions preserved until final selection. Ambiguity is not resolved prematurely but carried explicitly with provenance through the pipeline (Liu et al., 30 Mar 2026).
Database reverse engineering, as exemplified by NoWARs ("Normalization With Association Rules"), similarly hinges on inferring conceptual schema by mining functional dependencies from raw instance data, using algorithms like Apriori to identify 1.0-confidence association rules, which then determine candidate keys and drive normalization (Pannurat et al., 2010).
A unifying theme is the absence of privileged a priori abstraction. Instead, higher-level structure emerges through robust, reproducible algorithms that process empirical data, producing explicit, testable, and extensible intermediate artifacts at each stage.
2. Formal Frameworks and Representational Substrates
Explicit formalism is crucial for bottom-up RE. In language, Saba's framework relies on:
- The predicate $\mathrm{app}(p, c)$, representing that predicate $p$ sensibly applies to concept $c$ (e.g., $\mathrm{app}(\textit{heavy}, \textit{table})$).
- A fixed set of primitive, language-agnostic relations and their cognates, where each relation admits nominalization from linguistic usage (e.g., articulate $\to$ articulation) (Saba, 2023).
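As a concrete illustration, the sketch below models this substrate in Python; the names `app`, `PRIMITIVE_RELATIONS`, and the judgment table are expository assumptions, not identifiers from Saba (2023).

```python
# Illustrative sketch of a symbolic-vector substrate for concepts. The
# names `app`, `PRIMITIVE_RELATIONS`, and the judgment table are
# expository assumptions, not identifiers from Saba (2023).
from dataclasses import dataclass, field

# A small, fixed inventory of primitive, language-agnostic relations
# (placeholder names standing in for the paper's inventory).
PRIMITIVE_RELATIONS = ("property_of", "instance_of", "part_of")

@dataclass
class Concept:
    name: str
    # Symbolic vector: which predicates sensibly apply to this concept.
    judgments: dict = field(default_factory=dict)

def app(predicate: str, concept: Concept) -> bool:
    """app(p, c): does predicate p sensibly apply to concept c?

    A full system would decide this from corpus statistics or masked-LM
    queries; this sketch just reads a precomputed judgment table.
    """
    return concept.judgments.get(predicate, False)

table = Concept("table", judgments={"heavy": True, "articulate": False})
assert app("heavy", table) and not app("articulate", table)
```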
In binary analysis, PGSD defines a finite poset $(\mathcal{I}, \preceq)$ of IR levels (e.g., $\mathrm{Asm} \prec \mathrm{Mach} \prec \mathrm{LTL} \prec \mathrm{RTL} \prec \mathrm{Clight} \prec \mathrm{C99}$), and a relation store $\mathcal{S}$ mapping IR relations and tuples to polynomial-provenance semiring annotations in $\mathbb{N}[X]$, encoding all derivational histories for each candidate abstraction. The system guarantees that no feasible derivation is lost (completeness) by monotonic fixpoint iteration, and ambiguity is handled explicitly in a superset lattice (Liu et al., 30 Mar 2026).
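A minimal sketch of such a monotone, append-only relation store follows, with provenance simplified to sets of derivation tags standing in for full provenance polynomials; the interface is an assumption, not PGSD's API.

```python
# Minimal sketch of a monotone, append-only relation store. Provenance is
# simplified to sets of derivation tags; PGSD's store carries full
# provenance-polynomial semiring annotations.
from collections import defaultdict

class RelationStore:
    def __init__(self):
        # (relation name, fact tuple) -> set of derivation tags.
        self.facts = defaultdict(set)

    def insert(self, relation, tup, provenance):
        """Append-only insert: annotations only ever grow (monotonicity)."""
        self.facts[(relation, tup)] |= set(provenance)

    def fixpoint(self, rules):
        """Apply rules until no new fact or annotation is derived."""
        changed = True
        while changed:
            changed = False
            for rule in rules:
                # Each rule inspects the store and yields derived facts
                # as (relation, tuple, provenance) triples.
                for rel, tup, prov in list(rule(self)):
                    before = len(self.facts[(rel, tup)])
                    self.insert(rel, tup, prov)
                    changed |= len(self.facts[(rel, tup)]) != before
```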
For layout analysis, ReverseORC captures widget geometry across window sizes, building recursive tree decompositions (Row/Column) and inferring OR-constraint systems of the form
$$C = C_1 \lor C_2 \lor \cdots \lor C_k,$$
where each $C_i$ is a conjunction of linear inequalities modeling a discrete resizing mode (Jiang et al., 2022).
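As a sketch, an OR-constraint can be modeled as a disjunction over conjunctions of linear inequalities and checked against a candidate size assignment; the encoding below is illustrative, not ReverseORC's internal representation.

```python
# Sketch: an OR-constraint as a disjunction of conjunctions of linear
# inequalities over layout variables. The encoding is illustrative.

def holds(inequality, assignment):
    """One linear inequality: sum(coeff[v] * assignment[v]) <= bound."""
    coeffs, bound = inequality
    return sum(c * assignment[v] for v, c in coeffs.items()) <= bound

def or_constraint_satisfied(disjuncts, assignment):
    """Satisfied if any disjunct (a conjunction of inequalities) holds."""
    return any(all(holds(ineq, assignment) for ineq in conj)
               for conj in disjuncts)

# Example: a toolbar laid out in one row (height <= 40) when the window is
# wide (width >= 600), or in two rows (height <= 80) otherwise.
wide_mode = [({"width": -1}, -600), ({"height": 1}, 40)]   # width >= 600
narrow_mode = [({"width": 1}, 599), ({"height": 1}, 80)]   # width <= 599
ok = or_constraint_satisfied([wide_mode, narrow_mode],
                             {"width": 800, "height": 36})  # True
```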
Database RE defines the key measures:
- Support: $\mathrm{supp}(X \Rightarrow Y) = \frac{|\{t \in D : X \cup Y \subseteq t\}|}{|D|}$, the fraction of instances containing both sides of a rule.
- Confidence: $\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$, with rule-filtering heuristics to ensure only 1.0-confidence dependencies are realized in schema synthesis (Pannurat et al., 2010).
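A minimal sketch of these standard measures over an instance table (the row format, one dict per tuple, is an assumption for exposition):

```python
# Standard support and confidence over rows of an instance table, where
# X and Y are sets of (attribute, value) pairs.
def support(rows, itemset):
    """Fraction of rows containing every (attribute, value) pair."""
    hits = sum(1 for row in rows
               if all(row.get(a) == v for a, v in itemset))
    return hits / len(rows)

def confidence(rows, x, y):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    sx = support(rows, x)
    return support(rows, x | y) / sx if sx else 0.0

rows = [{"id": 1, "dept": "A"}, {"id": 2, "dept": "A"}]
# "id = 1 => dept = A" holds with confidence 1.0 in this toy instance.
c = confidence(rows, {("id", 1)}, {("dept", "A")})
```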
3. Methodologies, Pipelines, and Core Algorithms
Generic bottom-up RE pipelines share a common set of workflow primitives:
Language (Symbolic Reverse Engineering):
- Extract candidate concepts (nouns) and predicates (verbs, adjectives) from a corpus.
- Compute $\mathrm{app}(p, c)$ by contextual masking or large-scale masked language modeling (e.g., "The [c] is [MASK]…"), using statistical or LLM-informed co-occurrence.
- Each validated (predicate, concept) pair induces a nominalized, primitive relation, assembling symbolic vectors for each concept.
- Algorithms then compute similarity between concepts over all symbolic dimensions, e.g., as a normalized measure of agreement across the validated predicate dimensions (Saba, 2023).
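A minimal sketch of such a similarity, assuming symbolic vectors are boolean maps over a shared predicate inventory; the agreement-fraction formula is an illustrative stand-in for the paper's measure.

```python
# Illustrative similarity over symbolic concept vectors: the fraction of
# predicate dimensions on which two concepts agree. This stands in for
# the paper's actual measure, which is not reproduced here.
def similarity(vec_a, vec_b):
    dims = set(vec_a) | set(vec_b)
    if not dims:
        return 0.0
    agree = sum(vec_a.get(d, False) == vec_b.get(d, False) for d in dims)
    return agree / len(dims)

# Toy symbolic vectors built from validated app(p, c) judgments.
omelet = {"edible": True, "loud": False, "animate": False}
person = {"edible": False, "loud": True, "animate": True}
s = similarity(omelet, person)  # 0.0: no shared judgments
```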
Binary Decompilation (PGSD):
- Disassemble raw bytes; parse to Asm.
- Incrementally lift each instruction through defined IR levels, recording all candidate abstractions at each node, with provenance annotations.
- Structure the intermediate representations in a superset relation store and defer ambiguity resolution.
- Final C99 output is selected by a greedy, error-minimizing pass that invokes Clang as an oracle to ensure type-checking, with candidate swaps guided by compilation error reduction (Liu et al., 30 Mar 2026).
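A minimal sketch of this final selection step, assuming a candidate-swapping loop that invokes clang and counts diagnostics; the harness and the `render`/`choices` interfaces are assumptions, not the paper's implementation.

```python
# Sketch: greedy, error-minimizing candidate selection with clang as an
# oracle. The render/choices interfaces and error counting are
# assumptions, not the paper's implementation.
import os
import subprocess
import tempfile

def error_count(c_source):
    """Compile with clang (syntax/type check only); count errors."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(c_source)
        path = f.name
    try:
        proc = subprocess.run(["clang", "-std=c99", "-fsyntax-only", path],
                              capture_output=True, text=True)
        return proc.stderr.count("error:")
    finally:
        os.unlink(path)

def select_candidates(render, choices):
    """Greedily swap each site's candidate if it reduces clang errors.

    choices: site -> list of candidate abstractions for that site.
    render:  full selection -> C99 source text.
    """
    selection = {site: cands[0] for site, cands in choices.items()}
    best = error_count(render(selection))
    for site, cands in choices.items():
        for cand in cands[1:]:
            trial = dict(selection)
            trial[site] = cand
            errs = error_count(render(trial))
            if errs < best:
                selection, best = trial, errs
    return selection
```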
Software Architecture (RE + LLM):
- Parse source code (e.g., C++) to extract structural graphs (class and dependency diagrams).
- Cluster via degree centrality, threshold filtering, or community detection; encode as PlantUML for human and LLM consumption.
- The LLM is prompted with the entire diagram and an abstracting instruction to propose the minimal core set ("core components") and ignore auxiliary artifacts.
- Generate behavioral state-machines for key classes using few-shot prompting, parsing explicit state variables and guarded transitions in code (Hatahet et al., 7 Nov 2025).
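A minimal sketch of the structural half of this pipeline, using networkx for degree-centrality filtering and emitting PlantUML; the edge list, threshold, and function name are assumptions, and a real pipeline would populate the graph from a C++ parser.

```python
# Sketch: filter a class-dependency graph by degree centrality and emit
# PlantUML for LLM consumption. The edge list and threshold are assumed.
import networkx as nx

def core_plantuml(dependencies, threshold=0.2):
    """dependencies: iterable of (client_class, supplier_class) edges."""
    g = nx.DiGraph(dependencies)
    centrality = nx.degree_centrality(g)
    core = {n for n, c in centrality.items() if c >= threshold}
    lines = ["@startuml"]
    lines += [f"class {n}" for n in sorted(core)]
    lines += [f"{a} --> {b}" for a, b in g.edges
              if a in core and b in core]
    lines.append("@enduml")
    return "\n".join(lines)

uml = core_plantuml([("Parser", "Lexer"), ("Parser", "Ast"),
                     ("Lexer", "Token"), ("Ast", "Token")])
```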
Database Schema Mining (NoWARs):
- Given an un-normalized table, run Apriori to enumerate all frequent itemsets and then derive all association rules with 1.0 confidence.
- Apply rule-filtering to retain maximal-support keys per functional dependency.
- Group filtered rules into 3NF relations, ensuring attribute coverage. Output schema replaces the original with fully normalized relations (Pannurat et al., 2010).
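A minimal sketch of the filtering and grouping steps, assuming rules arrive as (lhs, rhs, support, confidence) tuples already mined by Apriori; the grouping shown is a simplified stand-in for full 3NF synthesis.

```python
# Sketch: keep only exact (confidence 1.0) rules, retain the
# maximal-support rule per dependency, and group attributes determined by
# the same key into one relation; a simplified stand-in for 3NF synthesis.
from collections import defaultdict

def filter_and_group(rules):
    """rules: iterable of (lhs, rhs, support, confidence) tuples, with
    lhs and rhs given as frozensets of attribute names."""
    exact = [r for r in rules if r[3] == 1.0]
    best = {}
    for lhs, rhs, supp, _ in exact:
        if (lhs, rhs) not in best or supp > best[(lhs, rhs)]:
            best[(lhs, rhs)] = supp
    relations = defaultdict(set)
    for lhs, rhs in best:
        relations[lhs] |= lhs | rhs
    return {tuple(sorted(k)): sorted(v) for k, v in relations.items()}

rules = [(frozenset({"id"}), frozenset({"name"}), 0.9, 1.0),
         (frozenset({"id"}), frozenset({"dept"}), 0.9, 1.0),
         (frozenset({"name"}), frozenset({"id"}), 0.4, 0.7)]
schema = filter_and_group(rules)  # {("id",): ["dept", "id", "name"]}
```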
UI Layout Analysis (ReverseORC):
- Sample the UI at a small, adaptively chosen set of window sizes.
- For each, extract widget bounding boxes and build a tree (Row/Column decomposition).
- Diff adjacent trees to identify structural edits, and match edit-sets to known OR-constraint layout patterns (e.g., optional sublayout, pivot, flow).
- Synthesize a minimal set of OR-constraints covering all observed behaviors, sufficient for layout regeneration or extension (Jiang et al., 2022).
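A minimal sketch of the Row/Column decomposition step over widget bounding boxes; the gap-based splitting heuristic is a simplification of ReverseORC's extraction, which also handles overlap and adaptive sampling.

```python
# Sketch: recursively decompose widget bounding boxes (x, y, w, h) into a
# Row/Column tree by splitting on visual gaps; a simplification of
# ReverseORC's extraction.
def decompose(boxes, horizontal=True, retried=False):
    """Return a ("Row"|"Column"|"Leaf", children-or-boxes) tree."""
    if len(boxes) <= 1:
        return ("Leaf", boxes)
    pos, size = (0, 2) if horizontal else (1, 3)  # x/w for rows, y/h else
    ordered = sorted(boxes, key=lambda b: b[pos])
    groups, current = [], [ordered[0]]
    for b in ordered[1:]:
        prev = current[-1]
        if b[pos] > prev[pos] + prev[size]:  # gap: start a new group
            groups.append(current)
            current = [b]
        else:
            current.append(b)
    groups.append(current)
    if len(groups) == 1:  # no gap on this axis: try the other one
        if retried:
            return ("Leaf", boxes)  # inseparable on both axes
        return decompose(boxes, not horizontal, retried=True)
    kind = "Row" if horizontal else "Column"
    return (kind, [decompose(g, not horizontal) for g in groups])

# A toolbar of two buttons above a full-width status bar:
tree = decompose([(0, 0, 50, 20), (60, 0, 50, 20), (0, 30, 110, 20)])
# -> ("Column", [("Row", [Leaf, Leaf]), ("Leaf", [...])])
```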
4. Applications and Case Studies
Empirical validation spans multiple domains:
- Language Modeling: The symbolic framework uniquely addresses the non-invertibility, lack of compositionality, extensional ambiguity, scope misanalysis, and absent truth evaluation inherent in neural LLMs. A metonymy case study ("The loud omelet wants a beer.") illustrates correct type unification and sense recovery (Saba, 2023).
- Binary Decompilation: PGSD/Manifold demonstrates output parity with Ghidra, IDA Pro, angr, and RetDec on GNU coreutils, yielding fewer total compiler errors than Ghidra and a matching call-graph F1 score (Liu et al., 30 Mar 2026).
- Software Architecture: Automated SAD recovery in C++ projects reduces manual reconstruction effort from days to minutes; LLM abstraction substantially reduces the number of class-diagram nodes, and state-machine extraction achieves high transition/state recovery in benchmark systems (Hatahet et al., 7 Nov 2025).
- Database Reconstruction: NoWARs, over datasets like REGISTER, Video_Rental, etc., achieves an order-of-magnitude reduction from the raw mined rules to the filtered subset used in synthesis, with derived schemas matching those created by expert designers (Pannurat et al., 2010).
- UI Layout: ReverseORC synthesizes resizable GUIs in under a second, reproducing or adapting complex layouts across platforms (e.g., MS Word Ribbon, BBC News); automatically detects legacy layout errors and enables by-example design/extension (Jiang et al., 2022).
5. Comparison with Top-Down and Subsymbolic Approaches
Bottom-up RE differs fundamentally from approaches that start with fixed design intentions, reference models, or human-expert supervision:
- In symbolic language engineering, top-down approaches envision an innate combinatorial system (e.g., Montague semantics), whereas bottom-up methods ground meaning in observed usage, statistical co-occurrence, and explicit symbolic relations, achieving both scalability and interpretability (Saba, 2023).
- Decompilation with PGSD retains possible interpretations up the abstraction hierarchy, avoiding premature commitment, in contrast to monolithic toolchains that collapse ambiguities early, risking unsoundness or information loss (Liu et al., 30 Mar 2026).
- Database schema mining from instance-level data captures actual usage patterns, rather than relying on incomplete design documents or outdated schemas (Pannurat et al., 2010).
- UI layout reverse engineering via OR-constraints generalizes from observed behaviors, making it robust to missing, undocumented, or proprietary layout specification (Jiang et al., 2022).
Consistent advantages are invertibility, explicitness, explainability, and greater resilience to bias or incomplete information, at the cost of increased complexity in candidate management and ambiguity resolution.
6. Scalability, Extensibility, and Limitations
Scaling is facilitated by modular composition, deferred resolution, and pruning heuristics:
- Symbolic language reverse engineering leverages corpus-scale masked querying, and the core primitive set is language-agnostic, supporting cross-linguistic application (Saba, 2023).
- PGSD's append-only, monotonic relation store, combined with modular nano-passes and provenance semantics, supports parallelism and easy extensibility: new IRs or analyses can be injected as independent passes; e.g., adding variable-length array support requires only a new pass without perturbing the pipeline (Liu et al., 30 Mar 2026) (see the sketch after this list).
- Association rule mining in databases is subject to exponential complexity with large attribute sets, mitigated somewhat by aggressive rule filtering (Pannurat et al., 2010).
- UI layout inference by exemplar sampling and structural differencing remains efficient in practice (sub-second synthesis) and can transfer generated layouts between platforms without recoding (Jiang et al., 2022).
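A minimal sketch of the nano-pass extensibility idea referenced above, assuming a driver that re-runs registered passes over a shared store until a fixpoint; the interfaces are illustrative, not PGSD's API.

```python
# Sketch: modular nano-passes over a shared store. A new analysis (e.g.,
# variable-length-array support) is added by registering one more pass;
# the driver and pass signature are illustrative, not PGSD's API.
class Pipeline:
    def __init__(self, store):
        self.store = store
        self.passes = []

    def register(self, nano_pass):
        """nano_pass(store) -> number of new facts it derived."""
        self.passes.append(nano_pass)

    def run(self):
        """Re-run every pass until none derives anything new."""
        while sum(p(self.store) for p in self.passes) > 0:
            pass

def vla_pass(store):
    # Hypothetical pass deriving variable-length-array facts; it returns
    # how many facts it added so the driver can detect the fixpoint.
    return 0

pipeline = Pipeline(store={})
pipeline.register(vla_pass)  # extending the pipeline is one call
pipeline.run()
```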
Nevertheless, worst-case costs can grow quickly with attribute explosion or extreme ambiguity. Database mining is limited to 3NF and strict (confidence 1.0) dependencies; language RE assumes sufficient coverage in the corpus and reliable masking or co-occurrence; decompilation depends on viable fixpoint iteration for large-scale binaries.
7. Outlook and Synthesis
Bottom-up reverse engineering, in its many instantiations, systematically builds high-level, explainable, and extensible models by accumulating, filtering, and synthesizing from the lowest-level artifacts available. Recent advances unify symbolic methods, modular relational frameworks, and neural prompting to reconstruct systems beyond the reach of legacy top-down or manual approaches. Future progress is likely to target improved memory/performance trade-offs (e.g., proof pruning), richer domain coverage (4NF, new architectures), and tighter integration of statistical and symbolic induction pipelines (Saba, 2023; Liu et al., 30 Mar 2026; Pannurat et al., 2010; Jiang et al., 2022; Hatahet et al., 7 Nov 2025). The paradigm demonstrates the learnability and transferability of complex behaviors and structures from empirical data alone, under rigorous, explainable pipelines.