
Extended-Connectivity Fingerprint (ECFP)

Updated 31 March 2026
  • ECFP is a circular substructure fingerprinting method that iteratively hashes atomic neighborhoods to encode chemical motifs for molecular analysis.
  • It uses a fixed-radius approach (typically ECFP4) to generate variable-length substructure sets folded into fixed-length vectors for machine learning applications.
  • Advancements like the Sort & Slice algorithm yield collision-free, interpretable fingerprints that improve predictive performance over hash-based folding.

Extended-Connectivity Fingerprint (ECFP) is a circular substructure-based molecular fingerprinting method that has become a foundational feature extraction technique in computational chemistry and molecular machine learning. ECFP encodes local molecular environments by iteratively hashing atomic neighborhoods up to a fixed radius, yielding variable-length sets of substructure identifiers that are typically folded into fixed-length vectors for use in statistical and machine learning models. The method is widely used for structure-activity prediction, virtual screening, and quantitative structure–activity relationship (QSAR) studies due to its capacity to represent both detailed local chemical motifs and broad scaffold diversity (Dablander et al., 2024, Notwell et al., 2023). Recent methodological advances, such as the introduction of the Sort & Slice pooling algorithm, have addressed major limitations in classical ECFP vectorisation, notably the issue of hash-based bit collisions.

1. ECFP Construction and Algorithmic Details

The ECFP algorithm represents a molecule as a graph $G = (V, E)$, with atoms $i \in V$ and bonds $(i, j) \in E$. Each atom is initially assigned an integer identifier $h_i^{(0)}$ that encodes atomic invariants such as atomic number, isotope, heavy-atom neighbor count, formal charge, and ring status. Then, for each iteration $t = 1, \ldots, r$ (where $r$ is the radius parameter), each atom gathers a multiset $S_i^{(t-1)}$ of tuples, each containing the bond type and the previous-iteration identifier of one neighbor. The atomic environment at each radius is defined as

$$U_i^{(t)} = [ h_i^{(t-1)} \,\|\, \text{sort}(S_i^{(t-1)}) ]$$

where $\|$ denotes concatenation. A hash function compresses each atomic environment into a new identifier $h_i^{(t)} = \text{HASH}(U_i^{(t)})$. The set of all such hashed identifiers across all atoms and radii forms the set of detected substructures for the molecule:

$$I = \{ h_i^{(t)} \,:\, i \in V,\, t = 0, \ldots, r \}$$
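The iteration above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the toy invariant tuples, and the use of CRC32 as the hash are all hypothetical choices, and the duplicate-substructure removal performed by production ECFP implementations is omitted.

```python
import zlib


def hash_tuple(values):
    """Deterministic 32-bit hash of a sequence of ints (stand-in for HASH)."""
    data = ",".join(str(v) for v in values).encode("ascii")
    return zlib.crc32(data)


def ecfp_identifiers(atom_invariants, bonds, radius):
    """Return the set I of identifiers h_i^(t) for t = 0..radius.

    atom_invariants: one tuple of ints per atom (hypothetical encoding).
    bonds: list of (i, j, bond_order) triples with integer bond_order.
    """
    neighbours = {i: [] for i in range(len(atom_invariants))}
    for i, j, order in bonds:
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))

    # h_i^(0): hash of the atomic invariants of atom i.
    h = [hash_tuple(inv) for inv in atom_invariants]
    identifiers = set(h)

    for _ in range(radius):
        new_h = []
        for i in range(len(h)):
            # S_i^(t-1): sorted (bond type, neighbour identifier) pairs.
            s = sorted((order, h[j]) for order, j in neighbours[i])
            # U_i^(t) = [h_i^(t-1) || sort(S_i^(t-1))], flattened to ints.
            u = [h[i]] + [x for pair in s for x in pair]
            new_h.append(hash_tuple(u))
        h = new_h
        identifiers.update(h)
    return identifiers


# Toy molecule: heavy atoms of ethanol (C-C-O), with invariants
# (atomic number, heavy-atom neighbour count).
ethanol_atoms = [(6, 1), (6, 2), (8, 1)]
ethanol_bonds = [(0, 1, 1), (1, 2, 1)]
ids_r0 = ecfp_identifiers(ethanol_atoms, ethanol_bonds, radius=0)
ids_r2 = ecfp_identifiers(ethanol_atoms, ethanol_bonds, radius=2)
print(len(ids_r0), len(ids_r2))
```

Because every radius only adds identifiers, the radius-0 set is always contained in the radius-2 set; atoms with identical environments receive identical identifiers, which is what lets the same substructure be recognised across molecules.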

These identifiers are then folded into a fixed-length bit vector of length $L$ by mapping each identifier $h \in I$ to the index $h \bmod L$ and setting the corresponding bit (or incrementing a count for the count variant).

The standard parameterisation ECFP4 uses a radius $r = 2$, so each substructure spans a diameter of up to 4 bonds, with a typical bit vector length of 1,024 or 2,048 (Notwell et al., 2023).
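The folding step can be illustrated as follows. This is a sketch assuming integer identifiers and modulo folding; the function name `fold` and the identifier values are made up for the example.

```python
def fold(identifiers, length, counts=False):
    """Fold integer identifiers into a vector of the given length via modulo.

    identifiers: iterable of ints; repeated entries only matter when counts=True.
    """
    vector = [0] * length
    for ident in identifiers:
        index = ident % length
        if counts:
            vector[index] += 1   # count variant: increment
        else:
            vector[index] = 1    # binary variant: set the bit
    return vector


# Made-up identifiers; the value 7 occurs twice (e.g. two equivalent atoms).
toy_identifiers = [7, 7, 1030, 4000]
bits = fold(toy_identifiers, 1024)
cnts = fold(toy_identifiers, 1024, counts=True)
print(sum(bits), cnts[7])
```

Here the indices are 7, 6, and 928, so the binary vector has three set bits, while the count vector records the repeated identifier at index 7.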

2. Mathematical Framework for Substructure Pooling

Let $\mathcal{J}$ denote the universe of possible circular substructures (i.e., unique ECFP fragment identifiers). The substructure enumeration process is a function:

$$\varphi : \mathcal{M} \to P(\mathcal{J})$$

assigning to each molecule $M$ in the space of molecules $\mathcal{M}$ the subset $\varphi(M) \subseteq \mathcal{J}$ comprising its detected substructures, where $P(\mathcal{J})$ is the power set of $\mathcal{J}$.

A substructure-pooling operator $\Psi$ of output dimension $L$ is any set function:

$$\Psi : P(\mathcal{J}) \to \mathbb{R}^L$$

that converts $\varphi(M)$ into a real (practically binary or count) vector $\Psi(\varphi(M)) \in \mathbb{R}^L$. In conventional ECFP, $\Psi$ is implemented by hash-based folding; in enhanced schemes, alternative pooling methods are deployed (Dablander et al., 2024).

3. Hash-Based Folding and Its Limitations

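Classical ECFP pooling employs a data-agnostic hash function $g : \mathcal{J} \to \{1, \ldots, L\}$, mapping substructures to fingerprint indices. The fingerprint $\Psi(\varphi(M)) \in \{0, 1\}^L$ of a molecule $M$ with detected substructure set $\varphi(M)$ is then defined componentwise as:

$$\Psi(\varphi(M))_k = \begin{cases} 1 & \text{if } \exists\, J \in \varphi(M) \text{ with } g(J) = k, \\ 0 & \text{otherwise,} \end{cases}$$

or equivalently:

$$\Psi(\varphi(M))_k = \mathbb{1}\left[ k \in g(\varphi(M)) \right]$$

where $g(\varphi(M))$ is the image of $\varphi(M)$ under $g$.

Hash-based folding introduces collisions whenever two distinct substructures $J_1 \neq J_2$ are mapped to the same index, $g(J_1) = g(J_2)$; this is unavoidable once the total number of observed substructures $m$ exceeds $L$. Collisions degrade interpretability and potentially predictive power: they prevent direct mapping of fingerprint bits to unique chemical fragments, complicating downstream analyses and feature attribution (Dablander et al., 2024).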
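To make the collision problem concrete, the sketch below groups identifiers by their folded index and also evaluates a standard occupancy estimate. Both helper names are hypothetical, modulo folding is assumed as the hash, and the estimate assumes an idealised hash that spreads substructures uniformly and independently over the bits, which real hash functions only approximate.

```python
from collections import defaultdict


def collision_report(identifiers, length):
    """Map each colliding index to the set of distinct identifiers sharing it."""
    buckets = defaultdict(set)
    for ident in set(identifiers):
        buckets[ident % length].add(ident)
    return {idx: ids for idx, ids in buckets.items() if len(ids) > 1}


def expected_occupied_bits(m, length):
    """Expected number of set bits for m distinct substructures,
    assuming an idealised uniform, independent hash."""
    return length * (1.0 - (1.0 - 1.0 / length) ** m)


# Made-up identifiers: 3 and 1027 both fold to index 3 when length is 1024.
report = collision_report({3, 1027, 5000}, 1024)
print(report)
print(expected_occupied_bits(2000, 1024))
```

Under this idealised model, the gap between the number of distinct substructures and the expected number of occupied bits gives a rough sense of how much information folding merges; when more substructures are observed than there are bits, at least one collision is guaranteed by the pigeonhole principle.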

4. The Sort & Slice Algorithm for Collision-Free Pooling

Sort & Slice is a recently proposed, collision-free pooling algorithm that vectorises ECFP substructures using only substructure frequencies in the training set. For each substructure $J$, its support count on the training set $\mathcal{D}$ is $c(J) = |\{M \in \mathcal{D} : J \in \varphi(M)\}|$, i.e. the number of training molecules $M$ whose detected substructure set $\varphi(M)$ contains $J$. The algorithm proceeds by sorting all observed substructures by descending $c(J)$ (using the numerical identifier $J$ to break ties), and selecting the $L$ most prevalent (top-$L$) substructures, denoted $J_1, \ldots, J_L$. Each molecule $M$ is then vectorised by a binary fingerprint $\Psi(\varphi(M)) \in \{0, 1\}^L$, where $\Psi(\varphi(M))_k = 1$ if and only if $J_k \in \varphi(M)$.

This process is formally specified as:

$$\Psi(\varphi(M))_k = \mathbb{1}\left[ J_k \in \varphi(M) \right], \quad k = 1, \ldots, L.$$

Sort & Slice thus guarantees collision-free bit assignment, as each bit position corresponds to one and only one substructure. This directly addresses the interpretability and information loss limitations of hash-based folding and requires only basic prevalence statistics, making implementation straightforward (Dablander et al., 2024).
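A minimal sketch of the procedure, assuming each molecule has already been reduced to a set of integer substructure identifiers. The function names are hypothetical, and breaking ties in ascending identifier order is an assumption, since the text does not state the direction.

```python
from collections import Counter


def sort_and_slice_fit(train_substructure_sets, length):
    """Select the `length` most prevalent substructures in the training set."""
    support = Counter()
    for substructures in train_substructure_sets:
        # Each molecule contributes at most once per substructure.
        support.update(set(substructures))
    # Descending support; ties broken by ascending identifier (assumption).
    ranked = sorted(support, key=lambda j: (-support[j], j))
    # If fewer than `length` substructures were observed, fewer are returned.
    return ranked[:length]


def sort_and_slice_transform(substructures, selected):
    """Binary fingerprint: bit k is 1 iff the k-th selected substructure is present."""
    present = set(substructures)
    return [1 if j in present else 0 for j in selected]


# Toy training set of three molecules with made-up identifiers.
train = [{10, 20, 30}, {10, 20}, {10, 40}]
selected = sort_and_slice_fit(train, length=2)
fingerprint = sort_and_slice_transform({20, 40, 99}, selected)
print(selected, fingerprint)
```

In this toy case substructures 10 and 20 have the highest support, so they define the two bit positions; a new molecule containing 20, 40, and 99 is encoded through bit 20 only, and the unselected substructures are dropped rather than merged into other bits.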

5. Comparative Substructure Selection Methods

In addition to Sort & Slice, two supervised substructure selection schemes have been benchmarked:

  • Filtered Fingerprints: Substructures observed only once in the training set are filtered out. Redundant substructures with identical training-set support as smaller patterns (non-closed) are also removed. The remaining substructures are then ranked by their $\chi^2$ contingency p-value with respect to the labels; the top $L$ are selected.
  • Mutual-Information Maximisation (MIM): Each substructure is scored by its empirical mutual information with the (possibly binarised) labels; duplicate-support features are merged for redundancy control, and the top $L$ are retained.

Both methods assign one dedicated bit to each selected fragment, ensuring collision-free assignments but potentially injecting task-specific label information into the fingerprint (Dablander et al., 2024).
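The mutual-information scoring step can be sketched as below for binary labels. This is a simplified illustration: the function names are hypothetical, the redundancy-merging step described above is omitted, and ties are broken by ascending identifier as an assumption.

```python
import math


def mutual_information(feature, labels):
    """Empirical mutual information (in nats) between two binary sequences."""
    n = len(feature)
    joint = {}
    for x, y in zip(feature, labels):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    # Marginal counts for the feature (px) and the labels (py).
    px = {x: sum(c for (a, _), c in joint.items() if a == x) for x in (0, 1)}
    py = {y: sum(c for (_, b), c in joint.items() if b == y) for y in (0, 1)}
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_by_mutual_information(substructure_sets, labels, length):
    """Rank substructures by mutual information with the labels; keep the top ones."""
    universe = sorted(set().union(*substructure_sets))
    scores = {
        j: mutual_information([1 if j in s else 0 for s in substructure_sets], labels)
        for j in universe
    }
    return sorted(universe, key=lambda j: (-scores[j], j))[:length]


# Toy data: substructures 1 and 3 track the label perfectly, 2 is uninformative.
toy_sets = [{1, 2}, {1}, {2, 3}, {3}]
toy_labels = [1, 1, 0, 0]
chosen = select_by_mutual_information(toy_sets, toy_labels, length=2)
print(chosen)
```

Unlike frequency-based selection, the ranking here depends on the labels, which is the route by which task-specific information can enter the fingerprint.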

6. Empirical Validation and Performance

Large-scale benchmark experiments compared the effectiveness of pooling methods for ECFP-based molecular property prediction. Key findings include:

  • Sort & Slice outperformed hash-folding on five molecular prediction tasks (lipophilicity, aqueous solubility, SARS-CoV-2 main protease binding, Ames mutagenicity, ERα antagonism), across different data splits (random, scaffold-based), and for both random forest and MLP models.
  • Relative improvement: On lipophilicity with ECFP4 (1024 bits) and MLP, Sort & Slice achieved mean absolute error (MAE) ≈ 0.60 versus hash-folding's MAE ≈ 0.68, an approximately 11.4% MAE reduction (Dablander et al., 2024).
  • Factor dependence: Gains from Sort & Slice increase when the fingerprint length $L$ is small (higher hash collision rate), when the ECFP diameter $D$ is large (more unique substructures), and when standard atom invariants are used (greater substructure diversity).
  • Supervised selection: Filtered Fingerprints generally outperform standard hashing but not Sort & Slice, while MIM yields inconsistent improvements and trails Sort & Slice in nearly all settings.
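The relative-improvement figure is a relative MAE reduction, which can be checked with a short calculation. The helper name below is hypothetical, and the inputs are the rounded MAE values quoted in the list; the small gap to the reported figure presumably reflects rounding of those values, which is an assumption.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - improved) / baseline


# Rounded MAE values from the lipophilicity comparison above.
reduction = relative_reduction(0.68, 0.60)
print(round(100 * reduction, 1))
```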

Complementary work reports that even classical hash-based ECFP with random forests, SVMs, or GBDT models outperforms modern D-MPNNs and chemical LLMs on ADMET property prediction, although these results do not employ collision-free pooling (Notwell et al., 2023).

7. Recommendations and Implementation Considerations

The benchmark results above support replacing hash-based folding with Sort & Slice for ECFP vectorisation in supervised molecular machine learning. The method provides full feature interpretability, eliminates bit collisions, and requires minimal computational resources. Increasing the fingerprint length $L$ is only marginally beneficial beyond a certain point, as Sort & Slice retains the most prevalent substructures even at smaller $L$. Supervised selection (filtering or MIM) may not confer additional predictive value and can risk overfitting due to label dependence.

Limitations of Sort & Slice include the potential exclusion of rare but highly predictive fragments; hybrid strategies that combine frequency-based selection with supervised refinement may alleviate this. Possible advances involve trainable set functions for substructure pooling, e.g., deep set architectures for end-to-end fragment embedding and pooling, as well as adaptive or task-specific selection of $L$.

In summary, ECFP with Sort & Slice pooling constitutes a robust, interpretable, and empirically superior strategy for molecular feature extraction in cheminformatics and supervised molecular machine learning (Dablander et al., 2024, Notwell et al., 2023).
