Unsupervised Hierarchical Symbolic Regression
- UHiSR is a computational paradigm that automatically extracts interpretable, multi-level mathematical models from raw data without supervised guidance.
- It constructs models in successive stages through recursive feature transformation and descriptor pooling, reducing combinatorial complexity and supporting knowledge transfer.
- The approach integrates domain constraints such as dimensional analysis and feature sparsification to ensure physical plausibility and improve prediction accuracy across diverse scientific fields.
Unsupervised Hierarchical Symbolic Regression (UHiSR) is a computational paradigm focused on automatically extracting interpretable, compositional mathematical models from data without supervised guidance, by organizing learned structures hierarchically. UHiSR aims to balance expressivity, interpretability, and computational efficiency by building models that capture relationships at multiple levels of abstraction, enabling knowledge transfer across related tasks, facilitating physical insight, and reducing combinatorial search complexity.
1. Conceptual Foundations of UHiSR
UHiSR advances classical symbolic regression by structuring learning and model formation in a hierarchy: primitive features are recursively transformed into complex descriptors, which are themselves used as inputs to subsequent modeling layers. This layered construction reflects the compositional nature of scientific knowledge, where base quantities are combined to form intermediate physical parameters, which in turn govern bulk behavior.
In UHiSR, discovery proceeds without explicit labeling or strong prior supervision; instead, the models self-organize, often guided only by domain-informed constraints or unsupervised clustering. Hierarchical strategies explicitly leverage multi-level relationships, improving both efficiency and interpretability compared to flat symbolic regression approaches (Foppa et al., 2022).
2. Hierarchical Structure and Model Construction
UHiSR methods decompose the search for analytic relationships into successive stages. At each stage, symbolic regression (SR) is performed over an expanded feature set:
- Stage 1: SR identifies descriptors from primary features (e.g., atomic parameters in materials science).
- Stage 2: The outputs or selected subexpressions (components of prior models) are injected as new primary features into a subsequent SR round.
- Stage N: The process is repeated, stacking complexity as needed.
A canonical form for descriptor-based models used in hierarchical SR is:
$$\hat{y} = c_0 + \sum_{i=1}^{n} c_i d_i,$$
where the $c_i$ are fitted coefficients and the $d_i$ are recursively built descriptor components (Foppa et al., 2022). Recycling previously discovered expressions as input features enables efficiency and targeted complexity growth.
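The canonical form can be illustrated with a minimal sketch: build descriptor components from primary features by one round of operator application, then fit the linear coefficients by least squares. The feature names, the operators, and the synthetic target law below are all illustrative assumptions, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "primary features" (illustrative stand-ins for atomic parameters).
x1 = rng.uniform(1.0, 2.0, size=50)
x2 = rng.uniform(0.5, 1.5, size=50)

# Hypothetical ground truth: y = 3 + 2*(x1/x2) - 0.5*log(x1*x2).
y = 3.0 + 2.0 * (x1 / x2) - 0.5 * np.log(x1 * x2)

# Recursively built descriptor components d_i (one "rung" of operators).
d1 = x1 / x2
d2 = np.log(x1 * x2)

# Fit y ≈ c0 + c1*d1 + c2*d2 by ordinary least squares.
D = np.column_stack([np.ones_like(d1), d1, d2])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
c0, c1, c2 = coef
print(np.round(coef, 3))  # recovers c0 = 3, c1 = 2, c2 = -0.5
```

Because the descriptors enter linearly, the expensive symbolic search is confined to constructing the $d_i$; the coefficients come cheaply from a single linear solve.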
A plausible implication is that this staged approach not only controls combinatorial explosion but also functions as a form of architectural regularization by enforcing model reuse and hierarchy.
3. Algorithmic Frameworks and Constraints
Several algorithmic ingredients distinguish UHiSR:
- Descriptor Pooling and Operator Rungs: Candidate expressions are generated via recursive application of algebraic and transcendental operators, with the hierarchical depth (“rung”) controlling pool complexity.
- Dimensional Analysis and Domain Constraints: Physical plausibility is enforced by integrating linear constraints on the unit exponents of candidate expressions at every hierarchical level (Austel et al., 2020, Tenachi et al., 2023). For example, requiring a candidate's exponents over (mass, length, time) to equal $(1, 1, -2)$ ensures that force expressions respect units.
- Greedy Expansion and Pruning: SymTree-style heuristics incrementally expand models within a constrained representation, always selecting terms that yield improved error metrics and aggressively pruning terms with negligible coefficients (Franca, 2018).
- Compressed Sensing and Feature Sparsification: SISSO combines sure independence screening with $\ell_0$-regularized sparsification to extract a minimal set of target-correlated features from an exponentially large candidate pool (Foppa et al., 2022).
- Genetic and Quality-Diversity Search: Evolutionary strategies including MAP-Elites diversify candidates across multiple structural niches, mitigating premature convergence and bloat (Bruneton et al., 2019).
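Several of these ingredients can be combined in a toy sketch: one operator rung expands the pool via products and ratios, unit-exponent bookkeeping discards dimensionally inconsistent candidates, and a correlation-based selection picks the surviving term. The features, units, and "measured" target below are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 40

# Primary features with unit-exponent vectors over (mass, length, time).
pool = {
    "m": (rng.uniform(1, 5, n), (1, 0, 0)),   # mass
    "a": (rng.uniform(1, 5, n), (0, 1, -2)),  # acceleration
    "v": (rng.uniform(1, 5, n), (0, 1, -1)),  # velocity
}

def expand(pool):
    """One operator rung: products and ratios of existing pool entries."""
    new = dict(pool)
    for (na, (xa, ua)), (nb, (xb, ub)) in itertools.product(pool.items(), repeat=2):
        new[f"({na}*{nb})"] = (xa * xb, tuple(p + q for p, q in zip(ua, ub)))
        new[f"({na}/{nb})"] = (xa / xb, tuple(p - q for p, q in zip(ua, ub)))
    return new

pool = expand(pool)  # rung 1

# Dimensional constraint: keep only candidates with force units M L T^-2.
target_units = (1, 1, -2)
candidates = {k: x for k, (x, u) in pool.items() if u == target_units}

# Greedy selection: pick the candidate most correlated with the target.
F = pool["m"][0] * pool["a"][0]  # hypothetical "measured" force
best = max(candidates, key=lambda k: abs(np.corrcoef(candidates[k], F)[0, 1]))
print(best)
```

Even in this toy, the dimensional filter removes most of the expanded pool before any fitting occurs, which is the mechanism the constraint-based approaches exploit.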
4. Interpretability and Knowledge Transfer
UHiSR explicitly links discovered mathematical forms with physical or domain knowledge. Each hierarchical descriptor or subexpression corresponds to a human-interpretable component (e.g., lattice constant, cohesive energy, or polarity index):
- Domain-Informed Feature Extraction: Models are built from features such as electron affinity, orbital radii, or counts of functional groups, not opaque fingerprints (Lou et al., 25 Jan 2024).
- Transfer Across Properties: Discovered descriptors for one property (e.g., the lattice constant) are reused in modeling another (e.g., the bulk modulus), elucidating inter-property physical relationships (Foppa et al., 2022).
- Conciseness and Transparency: Final equations are concise, with each term traceable to constituent variables; in the chromatography application, for instance, the terms correspond to interpretable polarity indices (Lou et al., 25 Jan 2024).
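The transfer mechanism can be sketched with synthetic data: a descriptor discovered while modeling one property is recycled as an input feature for a second property, where it succeeds while raw features alone fall short. The descriptor $d = x_1/x_2$ and both property laws are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.uniform(1.0, 3.0, 80)
x2 = rng.uniform(1.0, 3.0, 80)

# Stage 1 (property A): suppose SR discovered the descriptor d = x1 / x2.
d = x1 / x2

# Stage 2 (property B): B depends on the same descriptor (hypothetical law).
B = 1.0 + 4.0 * d**2

def r2(X, y):
    """Coefficient of determination of a least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()

raw = np.column_stack([np.ones_like(x1), x1, x2])        # raw features only
transferred = np.column_stack([np.ones_like(d), d**2])   # recycled descriptor

print(round(r2(raw, B), 3), round(r2(transferred, B), 3))
```

The recycled descriptor reaches an essentially perfect fit with two terms, whereas a linear model on the raw inputs cannot capture the shared nonlinearity; this is the sense in which inter-property relationships are made explicit.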
5. Computational Efficiency and Scalability
The hierarchical paradigm dramatically reduces the effective size of the search space. By first compressing raw features into distilled indices, subsequent SR processes operate on lower-dimensional and more meaningful inputs. SISSO-based hierarchical SR results in orders-of-magnitude reduction in considered features for second-stage modeling (Foppa et al., 2022).
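The screening step behind this reduction can be sketched as follows: rank every candidate in a large pool by the magnitude of its correlation with the target and keep only a small subset for the subsequent sparsifying fit. This is a toy illustration of the sure-independence-screening idea, not the actual SISSO implementation; the pool here is random noise except for two planted features.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 2000                     # modest samples, large candidate pool

X = rng.normal(size=(n, p))          # stand-in for an expanded descriptor pool
y = 2.0 * X[:, 7] - 3.0 * X[:, 42]   # target depends on two candidates only

# Sure independence screening: score every candidate by |corr(candidate, y)|
# and retain only the top few for the sparsifying/regression step.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
scores = np.abs(Xc.T @ yc) / n
keep = np.argsort(scores)[-20:]      # pool shrinks 2000 -> 20

print(7 in keep, 42 in keep)
```

The two planted features survive the screen while the pool shrinks by two orders of magnitude, mirroring the reduction reported for second-stage modeling.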
Greedy search techniques (e.g., SymTree) and combinatorial pruning further mitigate computational cost. Dimensional analysis constraints efficiently eliminate spurious solutions early (Austel et al., 2020). Nonetheless, combinatorial complexity and outlier robustness remain open challenges, especially as depth or operator set grows.
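The aggressive pruning step can be sketched in isolation: fit the current term pool by least squares and drop any term whose coefficient is numerically negligible. The term pool and ground-truth law below are illustrative, not SymTree's actual grammar.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.5, 2.0, 100)
y = 5.0 * x**2 + 2.0 / x          # hypothetical ground-truth law

# A small constrained term pool (illustrative).
terms = {"x": x, "x^2": x**2, "1/x": 1.0 / x, "log(x)": np.log(x)}

# Fit the full pool by least squares, then aggressively prune terms whose
# coefficients are numerically negligible.
D = np.column_stack(list(terms.values()))
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
kept = {name: c for name, c in zip(terms, coef) if abs(c) > 1e-6}
print(kept)  # only x^2 and 1/x survive, with coefficients 5 and 2
```

Pruning keeps the expression compact at every stage, which is what allows the incremental expansion to stay tractable as depth grows.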
6. Applications, Benchmark Results, and Implications
UHiSR has found direct application in:
- Materials Science: Identifying compositional dependencies for perovskites' bulk properties and enabling data-driven discovery for flexible electronics (Foppa et al., 2022).
- Chemistry: Uncovering interpretable structure–property relationships in chromatography using hierarchical neural networks and SR for latent variable extraction and governing equation discovery (Lou et al., 25 Jan 2024).
- Physics and Astronomy: Class SR frameworks recover analytic laws governing entire classes of physical systems, e.g., stellar orbit potentials, by discovering a single universal expression fitted jointly across multiple datasets (Tenachi et al., 2023).
- Interdisciplinary Science: Hierarchical SR methodologies are adaptable to chemical, biological, and engineering contexts where multi-level dependencies require explicit structure.
Extracted equations are shown to be accurate and interpretable; hierarchical knowledge transfer between physical parameters is consistently demonstrated (Foppa et al., 2022). Domain-guided variable selection also enhances out-of-sample prediction.
A plausible implication is that as databases and operator sets scale, further algorithmic innovation will be needed to balance expressiveness, tractability, and domain validity, potentially via adaptive rung selection or automated operator curation.
7. Current Challenges and Future Directions
Notable challenges include:
- Combinatorial Explosion: The initial candidate pool remains large, particularly before hierarchy is invoked; more sophisticated operator filtering or ML-based pre-selection may further enhance scalability.
- Feature Redundancy and Correlation Management: Correlated features can aid modeling but also introduce redundancy; systematic redundancy control is necessary.
- Out-of-Distribution Generalization: UHiSR models can produce outliers when the training regime does not span the full chemical or physical diversity encountered at prediction time; strategies for extrapolation and uncertainty quantification are under development.
- Scalability and Complexity Control: Adaptive hierarchical layer depth and integrated validation are needed as applications expand to high-throughput and broader fields.
Further research is expected in integrating UHiSR outputs with traditional black-box machine learning, boosting both predictive accuracy and interpretability, and in automating the generation of composite hierarchical features.
In summary, Unsupervised Hierarchical Symbolic Regression provides a principled approach to extracting interpretable, multi-level analytic models from data. By combining hierarchical decomposition, domain-guided constraints, and efficient search and pruning, UHiSR enables robust scientific discovery, knowledge transfer, and mechanistic insight across diverse disciplines. Ongoing research addresses scalability, redundancy management, and the integration of hierarchical symbolic models with broader machine learning frameworks.