Propositionalisation Technique Overview
- Propositionalisation is a family of methods that converts structured, relational, or temporal data into Boolean or attribute-value form to facilitate reasoning and learning.
- It enables neural-symbolic integration, SAT-based verification, and interpretable feature extraction by systematically mapping complex representations to flat feature sets.
- Practical algorithms balance computational complexity with accuracy, as validated by competitive performance in domains like neural-symbolic systems, software product line analysis, and clinical informatics.
The propositionalisation technique encompasses a family of algorithmic procedures that convert structured, relational, integer, or logic-based representations into propositional (attribute–value, Boolean, or “flattened”) forms to facilitate reasoning, learning, or analysis by downstream systems. It is foundational in several subfields: neural-symbolic learning, SAT-based software verification, and interpretable feature extraction from relational or time series data. Below, major approaches and their technical details are summarized, as developed in key works addressing symbolic–neural integration (Tran, 2017), software product line variability (Krafczyk et al., 2021), and interpretable feature engineering for time series and clinical data (Gay et al., 2021, Lemaire et al., 29 Jan 2026).
1. Theoretical Formulations and Domains of Propositionalisation
Propositionalisation is formally defined as the process of mapping elements from expressive, structured, or higher-order representations—such as first-order logic, integer arithmetic, or relational schemas—into Boolean or vector-valued propositional features. The most common motivations are:
- Logical inference compatibility: mapping knowledge bases (KBs) or constraints to Boolean logic for probabilistic/neural models or SAT-solvers.
- Data flattening: converting multi-relational/time-dependent data into fixed-length attribute–value vectors suitable for standard machine learning algorithms.
Three principal domains and their typical formal bases are:
| Domain | Input Formalism | Propositionalisation Target |
|---|---|---|
| Symbolic logic to neural models | Propositional logic, Horn rules | RBM energy functions, binary units |
| Integer-based program variability | Integer expressions, ASTs | Pure Boolean (SAT) formulas |
| Relational/time series feature mining | Relational tables, time series | Flat numerical/categorical vectors |
In each case, the core challenge is to preserve interpretability, satisfiability, or logical content while obtaining an explicit propositional representation.
2. Symbolic Knowledge Propositionalisation for Neural Models
Propositionalisation in neural-symbolic learning is exemplified by the mapping of propositional logic KBs into energy-based models such as restricted Boltzmann machines (RBMs) (Tran, 2017). The procedure comprises:
- Formula transformation: Each knowledge base formula is converted to strict disjunctive normal form (SDNF), ensuring that at most one conjunctive clause holds for any assignment.
- Hidden unit assignment: Each conjunctive clause $C_j$ receives a dedicated hidden RBM unit $h_j$, which activates if and only if $C_j$ holds.
- Energy and weights: For each clause $C_j$ with positive-literal set $P_j$ and negative-literal set $N_j$, the energy contribution is
  $$E(\mathbf{x}, h_j) = -c\, h_j \Big( \sum_{i \in P_j} x_i - \sum_{k \in N_j} x_k - |P_j| + \epsilon \Big)$$
  with $0 < \epsilon < 1$, weights $c$ for positive literals, $-c$ for negative literals, and hidden bias $-c(|P_j| - \epsilon)$.
- Complexity advantage: For implications or Horn clauses, SDNF expansion yields only linearly many hidden units per clause, in contrast to the exponential blowup or universal-approximator constructions in prior art.
- End-to-end system (CRILP): Background rules are encoded via this mapping, additional hidden units model unknown patterns, parameters are learned by hybrid generative–discriminative loss, and inference is approximate via Gibbs sampling or direct calculation.
This method achieves compact and sound symbolic–neural integration, outperforming traditional ILP systems in several benchmarks (Tran, 2017).
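The clause-to-hidden-unit mapping above can be sketched in a few lines of Python. This is a minimal illustration, not Tran's implementation: the names `clause_holds` and `energy_term`, and the concrete values $c = 5$ and $\epsilon = 0.5$, are assumptions chosen for the sketch; it checks the key property that the hidden unit's energy term is negative (i.e., activating it is favourable) exactly when the clause is satisfied.

```python
import itertools

def clause_holds(x, pos, neg):
    """True iff the conjunction (AND of pos literals, AND of negated neg literals) holds."""
    return all(x[i] == 1 for i in pos) and all(x[k] == 0 for k in neg)

def energy_term(x, pos, neg, c=5.0, eps=0.5):
    """Energy contribution of switching on hidden unit h_j for visible assignment x:
    -c * (sum_{i in pos} x_i - sum_{k in neg} x_k - |pos| + eps)."""
    s = sum(x[i] for i in pos) - sum(x[k] for k in neg) - len(pos) + eps
    return -c * s

# Example clause: x0 AND x1 AND NOT x2. The unit "activates" (lowers the
# energy when on) if and only if the clause holds, for every assignment:
pos, neg = {0, 1}, {2}
for bits in itertools.product([0, 1], repeat=3):
    x = list(bits)
    assert (energy_term(x, pos, neg) < 0) == clause_holds(x, pos, neg)
```

Because a satisfied clause gives $s = \epsilon > 0$ and any violated clause gives $s \le \epsilon - 1 < 0$, the sign of the energy term cleanly separates the two cases for any $0 < \epsilon < 1$.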
3. Integer-Based Variability Propositionalisation for SAT-based Analysis
When analyzing software product lines or code with integer-based configuration, propositionalisation enables overloading SAT-based tools originally designed for Boolean logic (Krafczyk et al., 2021). The approach operates as follows:
- Finite domain encoding: Each integer variable $v$ with finite range $D(v) = \{d_1, \dots, d_n\}$ is mapped to "one-hot" Boolean variables $v_{=d_1}, \dots, v_{=d_n}$, representing $v = d_i$ for each $d_i \in D(v)$, along with a variable $\mathit{def}(v)$ for "definedness."
- Atomic conversion: Expressions such as $v = d_i$ are rewritten as the literal $v_{=d_i}$, and $v \neq d_i$ as $\neg v_{=d_i}$.
- Arithmetic and logical composition: For cross-variable expressions (e.g., $v_1 + v_2 = c$), all satisfying value assignments are enumerated and expressed as disjunctions of conjunctions over the one-hot variables.
- Handling large/unbounded domains: For variables with unmanageable ranges, the encoding of the affected constraint is weakened to $\mathit{true}$, causing potential over-approximation of satisfiability.
- Correctness: For purely finite domains, the propositional formula is satisfiable iff the original integer formula is, under the encoding’s one-hot constraints. For unbounded cases, only an over-approximation is guaranteed.
The method efficiently supports SAT-based feature analyses without modifying the solver, contingent on practical domain size for the integer variables (Krafczyk et al., 2021).
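The enumeration step for cross-variable expressions can be sketched as follows. This is an illustrative reconstruction, not the tool from Krafczyk et al.: the helper names (`one_hot_vars`, `encode_sum_eq`) and the string convention `"x=0"` for the one-hot literal $x_{=0}$ are assumptions of the sketch. It builds the DNF for $v_1 + v_2 = c$ over small finite domains.

```python
from itertools import product

def one_hot_vars(name, domain):
    """One Boolean variable per domain value, named 'v=d' for v_{=d}."""
    return {d: f"{name}={d}" for d in domain}

def encode_sum_eq(v1, dom1, v2, dom2, c):
    """DNF (list of conjunctions of one-hot literals) for v1 + v2 == c:
    enumerate all satisfying value pairs and emit one conjunction each."""
    b1, b2 = one_hot_vars(v1, dom1), one_hot_vars(v2, dom2)
    return [[b1[a], b2[b]] for a, b in product(dom1, dom2) if a + b == c]

# x, y in {0, 1, 2, 3};  x + y == 3 has four satisfying pairs,
# so the encoding is a disjunction of four two-literal conjunctions.
dnf = encode_sum_eq("x", range(4), "y", range(4), 3)
```

A real encoding would additionally conjoin the one-hot constraints (exactly one $v_{=d}$ true per defined variable); the sketch shows only the enumeration that turns an arithmetic atom into propositional logic.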
4. Relational and Temporal Data Propositionalisation for Interpretable Feature Construction
Propositionalisation is a cornerstone in relational learning and time series analysis with interpretable models (Gay et al., 2021, Lemaire et al., 29 Jan 2026). The process comprises:
- Relational schema design: Data is decomposed into a root entity table and multiple secondary tables representing temporal or multi-view measurements.
- Aggregation and selection operators: A language of feature constructors is defined, including unary aggregates (mean, std, min, max, sum, median, count, last) and selection predicates (interval or value-based filters).
- Feature flattening: Random (or MDL-prioritised) candidate features are constructed by composing aggregates and selection operators; each feature corresponds to an SQL-like query. The result is a flat matrix where rows are entities and columns are computed feature values.
- Feature selection: A Bayesian model selection or MDL compression criterion is applied: for each feature, a discretisation is built jointly over the feature and target (for regression or classification), and features are selected if their “level” or compression gain exceeds a threshold.
- Downstream modeling: The selected flat features are used with standard regressors or classifiers. In the clinical context (Lemaire et al., 29 Jan 2026), a selective Naive Bayes classifier with feature weights optimised under MDL regularization yields both high accuracy and robust interpretability.
- Interpretability: Constructions intrinsically support univariate, global, local (per-instance, by Shapley value), and counterfactual explanations, due to the explicit origin and semantics of each generated feature.
Experimental validation demonstrates significant gains in interpretability and competitive predictive accuracy on both time series regression (Gay et al., 2021) and clinical event prediction (Lemaire et al., 29 Jan 2026) benchmarks.
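The aggregate-plus-selection feature language above can be illustrated with a small sketch. This is not the pipeline of Gay et al. or Lemaire et al.; the table layout, the `make_feature` constructor, and the zero fallback for empty selections are assumptions. It composes an aggregate with a timestamp-interval selection and flattens a toy per-entity time series into a fixed-length vector per entity.

```python
from statistics import mean

# Toy secondary table: per-entity time series as (timestamp, value) pairs.
series = {
    "e1": [(1, 2.0), (5, 4.0), (12, 6.0)],
    "e2": [(2, 1.0), (9, 3.0)],
}

def make_feature(agg, lo, hi):
    """Feature constructor: aggregate of values whose timestamp lies in [lo, hi).
    Corresponds to an SQL-like query: SELECT agg(value) WHERE lo <= t < hi."""
    def f(points):
        vals = [v for t, v in points if lo <= t < hi]
        return agg(vals) if vals else 0.0  # fallback for an empty selection
    f.__name__ = f"{agg.__name__}_t{lo}_{hi}"
    return f

features = [make_feature(mean, 0, 10),   # mean of values in window [0, 10)
            make_feature(max, 0, 20),    # max over [0, 20)
            make_feature(len, 0, 10)]    # count over [0, 10)

# Flatten: rows are entities, columns are computed feature values.
flat = {e: [f(pts) for f in features] for e, pts in series.items()}
```

Each column of `flat` has an explicit, human-readable definition (aggregate plus selection window), which is what makes downstream univariate, Shapley, and counterfactual explanations straightforward.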
5. Practical Algorithms, Complexity, and Limitations
The algorithms underlying modern propositionalisation are tailored to target use cases and computational constraints:
- SDNF-based logic mapping: Efficient for Horn and implication rules (linear-size expansion); intractable for arbitrary formulas.
- Finite domain enumeration: Propositional encodings scale linearly with the number of conditions for small domain sizes $|D|$; the enumeration of value-tuples in expressions with deep compositionality or high arity $k$ leads to $O(|D|^k)$ complexity and necessitates pragmatic bounding strategies.
- Relational feature construction: Overall complexity is $O(F \cdot N)$ for $F$ candidate features and $N$ samples, with additional overhead for Bayesian/MDL feature selection. Extreme arithmetic compositionality or unrestricted domains are mitigated by fallback strategies (e.g., using coarse over-approximation or ignoring intractable features).
- Limitations:
- Exponential blowup for unconstrained formulas or unrestricted integer domains.
- Over-approximation in integer encoding can admit spurious configurations.
- Flat feature harvesting can return zero informative features, requiring fallback to trivial predictors.
- Extension to full first-order logic or richer aggregation languages remains an open research area.
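The enumeration blowup named above is easy to demonstrate concretely. A minimal sketch (the function `dnf_size` is a hypothetical counter, not from any of the cited tools): counting the conjunctions produced when all value-tuples of a $k$-ary expression over a domain of size $|D|$ are enumerated.

```python
from itertools import product

def dnf_size(domain_size, arity, pred):
    """Number of conjunctions emitted when enumerating all value tuples
    of a k-ary predicate over a finite domain of the given size."""
    return sum(1 for t in product(range(domain_size), repeat=arity) if pred(t))

# Worst case (always-true predicate): |D|**k conjunctions,
# exponential in arity, which motivates pragmatic bounding strategies.
sizes = [dnf_size(8, k, lambda t: True) for k in (1, 2, 3)]
# → [8, 64, 512]
```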
6. Applications and Empirical Outcomes
Modern propositionalisation methodologies have been validated across diverse domains:
- Neural-symbolic systems: CRILP achieves superior or comparable accuracy to logic programming baselines on DNA, ILP, and relational benchmarks (Tran, 2017).
- Software product line analysis: Enables SAT-based methods to handle integer variability directly, requiring only linear preprocessing overhead for typical industrial scenarios (Krafczyk et al., 2021).
- Time series regression and clinical informatics: Relational and aggregation-based propositionalisation produces small, interpretable feature sets that often improve accuracy relative to raw or deep feature baselines; in temporal sepsis modeling, AUCs above 0.97 are attained using fewer than 200 features, with the full suite of interpretability explanations natively supported (Gay et al., 2021, Lemaire et al., 29 Jan 2026).
These successes underscore the dual role of propositionalisation as both an enabler of scalable automated reasoning and a systematic pathway to interpretable, high-utility feature engineering.
7. Summary Table: Key Propositionalisation Paradigms
| Domain | Encoding Method | Strengths | Limitations |
|---|---|---|---|
| Symbolic→Neural (RBM) | SDNF/Horn → RBM hidden units | Linear scale-up for Horn, energy-function mapping | Hard for full logic; approximate inference |
| Integer→Boolean (SAT) | One-hot per value, AST walk/enumerate | SAT integration for small domains | Enumeration blowup, over-approx. |
| Relational→Flat features | Aggregation/selection, MDL selection | Interpretable features, supports time series | May yield no features, limited by agg lang. |