Extended-Connectivity Fingerprint (ECFP)
- ECFP is a circular substructure fingerprinting method that iteratively hashes atomic neighborhoods to encode chemical motifs for molecular analysis.
- It uses a fixed-radius approach (typically ECFP4) to generate variable-length substructure sets folded into fixed-length vectors for machine learning applications.
- Advancements like the Sort & Slice algorithm replace hash-based folding with collision-free, interpretable fingerprints that improve prediction performance.
Extended-Connectivity Fingerprint (ECFP) is a circular substructure-based molecular fingerprinting method that has become a foundational feature extraction technique in computational chemistry and molecular machine learning. ECFP encodes local molecular environments by iteratively hashing atomic neighborhoods up to a fixed radius, yielding variable-length sets of substructure identifiers that are typically folded into fixed-length vectors for use in statistical and machine learning models. The method is widely used for structure-activity prediction, virtual screening, and quantitative structure–activity relationship (QSAR) studies due to its capacity to represent both detailed local chemical motifs and broad scaffold diversity (Dablander et al., 2024, Notwell et al., 2023). Recent methodological advances, such as the introduction of the Sort & Slice pooling algorithm, have addressed major limitations in classical ECFP vectorisation, notably the issue of hash-based bit collisions.
1. ECFP Construction and Algorithmic Details
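The ECFP algorithm represents a molecule $M$ as a graph $G = (V, E)$, with atoms $V$ and bonds $E$. Each atom $v \in V$ is initially assigned an integer identifier $I_0(v)$, obtained by hashing atomic invariants such as atomic number, isotope, heavy-atom neighbor count, formal charge, and ring status; atoms with identical invariants receive the same identifier. Iteratively for $r = 1, \dots, R$ (where $R$ is the radius parameter), each atom gathers a multiset of tuples containing the bond type and the current hash value of each neighbor. The atomic environment at each radius is defined as
$$e_r(v) = r \,\|\, I_{r-1}(v) \,\|\, \operatorname{sort}\big(\{(b(v,u),\, I_{r-1}(u)) : u \in N(v)\}\big)$$
where $\|$ denotes concatenation, $N(v)$ is the set of neighbors of $v$, and $b(v,u)$ is the type of the bond between $v$ and $u$. A hash function compresses each atomic environment into a new identifier $I_r(v) = \operatorname{hash}(e_r(v))$. The set of all such hashed identifiers across all atoms and radii forms the set of detected substructures for the molecule:
$$\varphi(M) = \{\, I_r(v) : v \in V,\ 0 \le r \le R \,\}$$
These identifiers are then folded into a fixed-length bit vector of length $L$ by mapping each identifier $s$ to index $s \bmod L$ and setting the corresponding bit (or incrementing a count for the count variant).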
The standard parameterisation ECFP4 uses a radius of 2, so each encoded substructure spans a diameter of at most 4 bonds, with a typical bit vector length of 1,024 or 2,048 (Notwell et al., 2023).
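The iterative hashing and folding steps described in this section can be sketched in plain Python. This is a minimal illustration and not a reference implementation. The function names, the two-component atom invariants, the toy molecule, and the use of CRC32 as the hash are all assumptions made for the example. The sketch also omits the removal of structurally duplicate substructures that full ECFP implementations perform.

```python
import zlib

def hash_ints(values):
    """Deterministic 32-bit hash of an integer sequence (CRC32 as a stand-in hash)."""
    data = ",".join(str(v) for v in values).encode("ascii")
    return zlib.crc32(data)

def circular_identifiers(invariants, bonds, radius):
    """Collect identifiers of all atoms for radii 0..radius.

    invariants: one tuple of integers per atom (hypothetical atom invariants).
    bonds: list of (atom_i, atom_j, bond_type) triples.
    """
    n = len(invariants)
    neighbours = {i: [] for i in range(n)}
    for i, j, bond_type in bonds:
        neighbours[i].append((j, bond_type))
        neighbours[j].append((i, bond_type))

    current = [hash_ints(inv) for inv in invariants]  # radius-0 identifiers
    detected = set(current)
    for r in range(1, radius + 1):
        updated = []
        for v in range(n):
            # Sorted (bond type, neighbour identifier) pairs make the result
            # independent of the order in which neighbours are listed.
            pairs = sorted((bt, current[u]) for u, bt in neighbours[v])
            environment = [r, current[v]] + [x for pair in pairs for x in pair]
            updated.append(hash_ints(environment))
        current = updated
        detected.update(current)
    return detected

def fold(identifiers, length):
    """Fold identifiers into a binary vector by taking each one modulo length."""
    bits = [0] * length
    for s in identifiers:
        bits[s % length] = 1
    return bits

# Heavy atoms of ethanol (C-C-O); invariants = (atomic number, heavy-atom degree).
invariants = [(6, 1), (6, 2), (8, 1)]
bonds = [(0, 1, 1), (1, 2, 1)]
ids = circular_identifiers(invariants, bonds, radius=2)
fingerprint = fold(ids, 1024)
```

Because several identifiers can map to the same index, the number of set bits in `fingerprint` is at most `len(ids)`, and can be smaller.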
2. Mathematical Framework for Substructure Pooling
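Let $\mathcal{S}$ denote the universe of possible circular substructures (i.e., unique ECFP fragment identifiers), $\mathcal{M}$ the space of molecules, and $\mathcal{P}(\mathcal{S})$ the power set of $\mathcal{S}$. The substructure enumeration process is a function:
$$\varphi : \mathcal{M} \to \mathcal{P}(\mathcal{S})$$
assigning to each molecule $M \in \mathcal{M}$ the subset $\varphi(M) \subseteq \mathcal{S}$ comprising its detected substructures.
A substructure-pooling operator $\Psi$ of output dimension $L$ is any set function:
$$\Psi : \mathcal{P}(\mathcal{S}) \to \mathbb{R}^L$$
that converts $\varphi(M)$ into a real (practically binary or count) vector $\Psi(\varphi(M)) \in \mathbb{R}^L$. In conventional ECFP, $\Psi$ is implemented by hash-based folding; in enhanced schemes, alternative pooling methods are deployed (Dablander et al., 2024).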
3. Hash-Based Folding and Its Limitations
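Classical ECFP pooling employs a data-agnostic hash function $h : \mathcal{S} \to \{1, \dots, L\}$, mapping substructures to fingerprint indices, where $\mathcal{S}$ is the universe of circular substructures and $L$ the fingerprint length (for example, $h(s) = (s \bmod L) + 1$ under one-based indexing). The fingerprint $F(M) \in \{0,1\}^L$ of a molecule $M$ with detected substructure set $\varphi(M)$ is then defined componentwise as:
$$F(M)_i = \begin{cases} 1 & \text{if there exists } s \in \varphi(M) \text{ with } h(s) = i, \\ 0 & \text{otherwise,} \end{cases}$$
or equivalently:
$$F(M)_i = \mathbb{1}\big[\, i \in h(\varphi(M)) \,\big], \qquad i = 1, \dots, L.$$
Hash-based folding introduces collisions when $|\mathcal{S}| > L$; among the substructures observed in a dataset $\mathcal{D}$, they become unavoidable once the total number of observed substructures $\left|\bigcup_{M \in \mathcal{D}} \varphi(M)\right|$ exceeds $L$, degrading interpretability and potentially predictive power. Bit collisions prevent direct mapping of fingerprint bits to unique chemical fragments, complicating downstream analyses and feature attribution (Dablander et al., 2024).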
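To make the collision problem concrete, the following sketch folds a handful of made-up identifiers by a modulo operation and reports which fingerprint indices are shared. The helper name `fold_with_collisions`, the identifier values, and the zero-based modulo hash are assumptions chosen for illustration.

```python
from collections import defaultdict

def fold_with_collisions(identifiers, length):
    """Fold identifiers by modulo and report indices hit by more than one identifier."""
    buckets = defaultdict(list)
    for s in sorted(identifiers):
        buckets[s % length].append(s)
    bits = [1 if i in buckets else 0 for i in range(length)]
    collisions = {i: ids for i, ids in buckets.items() if len(ids) > 1}
    return bits, collisions

identifiers = {3, 11, 12, 20, 27}  # made-up substructure identifiers
bits, collisions = fold_with_collisions(identifiers, length=8)
print(bits)        # [0, 0, 0, 1, 1, 0, 0, 0]
print(collisions)  # {3: [3, 11, 27], 4: [12, 20]}
```

Five distinct substructures collapse into two set bits here even though the vector has eight positions. Collisions can therefore occur well before the number of substructures exceeds the fingerprint length, and a set bit can no longer be attributed to a single fragment.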
4. The Sort & Slice Algorithm for Collision-Free Pooling
Sort & Slice is a recently proposed, collision-free pooling algorithm that vectorises ECFP substructures using only substructure frequencies in the training set. For each substructure $s$, its support count on the training set $\mathcal{D}$ is $c(s) = |\{M \in \mathcal{D} : s \in \varphi(M)\}|$, where $\varphi(M)$ is the set of substructures detected in molecule $M$. The algorithm proceeds by sorting all observed substructures by descending $c(s)$ (using the numerical identifier $s$ to break ties), and selecting the $L$ most prevalent (top-$L$) substructures, denoted $s_1, \dots, s_L$. Each molecule $M$ is then vectorised by a binary fingerprint $F(M) \in \{0,1\}^L$, where $F(M)_i = 1$ if and only if $s_i \in \varphi(M)$.
This process is formally specified as:
$$F(M) = \big(\mathbb{1}[s_1 \in \varphi(M)],\ \dots,\ \mathbb{1}[s_L \in \varphi(M)]\big)$$
Sort & Slice thus guarantees collision-free bit assignment, as each bit position corresponds to one and only one substructure. This directly addresses the interpretability and information loss limitations of hash-based folding and requires only basic prevalence statistics, making implementation straightforward (Dablander et al., 2024).
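The sorting and slicing steps above can be sketched as two small functions. This is a minimal illustration under stated assumptions rather than the authors' code. The function names and the toy data are hypothetical, and ties are broken by ascending identifier, which is one possible reading of "using the numerical identifier". If fewer substructures are observed than the requested length, the sketch returns a shorter vector instead of padding.

```python
from collections import Counter

def fit_sort_and_slice(training_sets, length):
    """Return the `length` most prevalent substructures in the training set."""
    support = Counter()
    for substructures in training_sets:
        support.update(set(substructures))  # count each molecule at most once
    # Descending support; ascending identifier as tie-break (an assumption).
    ranked = sorted(support, key=lambda s: (-support[s], s))
    return ranked[:length]

def transform(substructures, selected):
    """Binary fingerprint: bit i is set iff the i-th selected substructure is present."""
    present = set(substructures)
    return [1 if s in present else 0 for s in selected]

# Each training molecule is given as its set of substructure identifiers (made up).
train = [{10, 20, 30}, {10, 20}, {10, 40}, {20, 50}]
selected = fit_sort_and_slice(train, length=3)
print(selected)                           # [10, 20, 30]
print(transform({20, 30, 99}, selected))  # [0, 1, 1]
```

The identifier 99 never occurs in the training set and is simply ignored at transform time. This is how the method trades a fixed vocabulary for the guarantee that every bit maps to exactly one substructure.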
5. Comparative Substructure Selection Methods
In addition to Sort & Slice, two supervised substructure selection schemes have been benchmarked:
- Filtered Fingerprints: Substructures observed only once in the training set are filtered out. Redundant (non-closed) substructures whose training-set support is identical to that of a smaller pattern are also removed. The remaining substructures are then ranked by their $\chi^2$ contingency p-value with respect to the labels; the top $L$ are selected.
- Mutual-Information Maximisation (MIM): Each substructure is scored by its empirical mutual information with the (possibly binarised) labels; duplicate-support features are merged for redundancy control, and the top $L$ are retained.
Both methods yield binary indicator vectors over the selected fragment set, with one bit per fragment, ensuring collision-free assignments but potentially injecting task-specific label information into the fingerprint (Dablander et al., 2024).
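As an illustration of label-dependent scoring, the sketch below ranks substructures by empirical mutual information with binary labels. It is a simplified stand-in and not the benchmarked MIM procedure: the merging of duplicate-support features is omitted, the function names and toy data are hypothetical, and ties are broken by ascending identifier as an assumption.

```python
import math

def mutual_information(feature, labels):
    """Empirical mutual information (in nats) between two binary sequences."""
    n = len(labels)
    mi = 0.0
    for f in (0, 1):
        for y in (0, 1):
            joint = sum(1 for a, b in zip(feature, labels) if a == f and b == y) / n
            p_f = sum(1 for a in feature if a == f) / n
            p_y = sum(1 for b in labels if b == y) / n
            if joint > 0:
                mi += joint * math.log(joint / (p_f * p_y))
    return mi

def select_by_mutual_information(training_sets, labels, length):
    """Score every observed substructure and keep the `length` highest-scoring ones."""
    universe = sorted(set().union(*training_sets))
    scores = {
        s: mutual_information([int(s in m) for m in training_sets], labels)
        for s in universe
    }
    ranked = sorted(universe, key=lambda s: (-scores[s], s))
    return ranked[:length], scores

train = [{1, 2}, {1, 3}, {2}, {3}]  # made-up substructure sets
labels = [1, 1, 0, 0]               # made-up binary labels
selected, scores = select_by_mutual_information(train, labels, length=2)
```

In this toy example substructure 1 coincides exactly with the label and scores $\ln 2$, while substructures 2 and 3 are independent of the label and score zero. Because the ranking uses the labels, it has to be fitted on training data only, which is the source of the label leakage risk mentioned above.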
6. Empirical Validation and Performance
Large-scale benchmark experiments compared the effectiveness of pooling methods for ECFP-based molecular property prediction. Key findings include:
- Sort & Slice outperformed hash-folding on five molecular prediction tasks (lipophilicity, aqueous solubility, SARS-CoV-2 main protease ($\mathrm{M^{pro}}$) binding, Ames mutagenicity, ERα antagonism), across different data splits (random, scaffold-based), and for both random forest and MLP models.
- Relative improvement: On lipophilicity with ECFP4 (1024 bits) and MLP, Sort & Slice achieved mean absolute error (MAE) ≈ 0.60 versus hash-folding's MAE ≈ 0.68, an MAE reduction of approximately 11.4 % (Dablander et al., 2024).
- Factor dependence: Gains from Sort & Slice increase when the fingerprint length $L$ is small (higher hash collision rate), when the ECFP diameter $D$ is large (more unique substructures), and when standard atom invariants are used (greater substructure diversity).
- Supervised selection: Filtered Fingerprints generally outperform standard hashing but not Sort & Slice, while MIM yields inconsistent improvements and trails Sort & Slice in nearly all settings.
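The relative improvement quoted in the list above is the difference in MAE divided by the baseline MAE. A short sketch of that arithmetic, using the rounded values from the text, follows; the function name is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline

# Rounded MAE values quoted above: hash-folding 0.68, Sort & Slice 0.60.
reduction = relative_reduction(0.68, 0.60)
print(f"{100 * reduction:.1f} %")  # 11.8 %
```

The rounded inputs give roughly 11.8 %, slightly above the reported figure of about 11.4 %. The reported figure was presumably computed from the unrounded MAE values, which are not given here.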
Complementary work underscores that even classical hash-based ECFP with random forests, SVMs, or GBDT models outperforms modern D-MPNNs and chemical LLMs on ADMET property prediction, but these results do not employ collision-free pooling (Notwell et al., 2023).
7. Recommendations and Implementation Considerations
Dablander et al. (2024) recommend replacing hash-based folding with Sort & Slice for ECFP vectorisation in supervised molecular machine learning. The method provides full feature interpretability, eliminates bit collisions, and requires minimal computational resources. Increasing the fingerprint length $L$ is only marginally beneficial beyond a certain point, as Sort & Slice prioritises the most informative substructures even at smaller $L$. Supervised selection (filtering or MIM) may not confer additional predictive value and can risk overfitting due to label dependence.
Limitations of Sort & Slice include the potential exclusion of rare but highly predictive fragments; hybrid strategies that combine frequency-based selection with supervised refinement may alleviate this. Possible advances involve trainable set functions for substructure pooling, e.g., deep set architectures for end-to-end fragment embedding and pooling, as well as adaptive or task-specific selection of the fingerprint length $L$.
In summary, ECFP with Sort & Slice pooling constitutes a robust, interpretable, and empirically superior strategy for molecular feature extraction in cheminformatics and supervised molecular machine learning (Dablander et al., 2024, Notwell et al., 2023).