MOFid: Dual-Encoding MOF Descriptor
- MOFid is a chemically informed string that merges SMILES encoding for chemical building blocks with RCSR topology codes to represent MOF structures.
- It enables scalable, language-model-driven generative design by integrating property prediction and reinforcement learning pipelines.
- MOFid bypasses computationally intensive 3D geometry methods, offering a compact and universal representation for inverse MOF discovery.
MOFid refers to a chemically informed, text-based representation for metal–organic frameworks (MOFs) that integrates both chemical building block information and global topological information into a single sequential string. MOFid was specifically developed to enable scalable, language-model-driven generative design of MOFs, and is central to high-throughput, deep learning–based exploration of reticular chemistry (Badrinarayanan et al., 30 May 2025). Owing to its dual encoding of local chemistry (via SMILES notation for organic and inorganic secondary building units, SBUs) and global framework topology (via standardized codes from resources like the Reticular Chemistry Structure Resource, RCSR), MOFid serves as the “molecular language” for advanced generative modeling pipelines, including reinforcement learning–driven frameworks for inverse materials design.
1. Structure and Semantic Content of MOFid
MOFid is formulated as a string comprising two primary sections:
- Chemical Building Block Encoding: Organic and inorganic SBUs are encoded by SMILES strings where possible. This component preserves atomic identities, local bonding, and immediate coordination environments within the framework.
- Topological Encoding: The net-level connectivity (the arrangement of SBUs into a crystalline lattice) is encoded using RCSR topology codes (such as “pcu,” “nbo,” etc.), which specify the extended 3D network. Additional catenation or interpenetration states may be appended.
Formally, the MOFid concatenates the building-block SMILES with the topology code (and, where relevant, a catenation flag) into a single delimited string. The syntax enforces a unique, parsable descriptor that is directly usable by sequence-based generative models.
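As a concrete illustration, a minimal parser can split a MOFid string into its sections. The layout assumed below (`<SMILES> MOFid-v1.<topology>.<catenation>;<name>`, with `.` separating SMILES components) follows the commonly cited MOFid-v1 convention; the exact delimiters should be checked against the MOFid specification.

```python
def parse_mofid(mofid: str) -> dict:
    """Split a MOFid string into SMILES building blocks, topology,
    catenation state, and name.

    Assumes the MOFid-v1 layout: '<SMILES> MOFid-v1.<topology>.<cat>;<name>'.
    This is an illustrative sketch, not the reference implementation.
    """
    core, _, name = mofid.partition(";")          # name is optional
    smiles_part, _, meta = core.rpartition(" ")   # metadata follows last space
    fields = meta.split(".")                      # e.g. ['MOFid-v1', 'pcu', 'cat0']
    return {
        "smiles": smiles_part.split("."),         # individual building blocks
        "format": fields[0],
        "topology": fields[1] if len(fields) > 1 else None,
        "catenation": fields[2] if len(fields) > 2 else None,
        "name": name or None,
    }

# Hypothetical MOFid for an IRMOF-1-like framework (pcu net, no catenation)
example = "[Zn].[O-]C(=O)c1ccc(cc1)C(=O)[O-] MOFid-v1.pcu.cat0;IRMOF-1"
parts = parse_mofid(example)
```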
2. Role in Generative and Predictive Frameworks
MOFid is integral to advanced generative design pipelines, notably the RL-augmented, transformer-based framework described in (Badrinarayanan et al., 30 May 2025). The design loop consists of:
- Generative Model (GPT):
- Trained on a large corpus of MOFid sequences with an autoregressive next-token prediction objective, the model learns both the chemical and topological “grammar” of valid MOF construction.
- Property Predictor (MOFormer):
- MOFormer, a transformer-based regression model, takes MOFid as input and predicts MOF properties (e.g., gas adsorption, band gaps) in a supervised manner.
- Because MOFid encodes both composition and topology, MOFormer can capture structure–property relationships rapidly and at scale without the need for expensive geometry optimization.
- Reinforcement Learning (RL) Optimization:
- Generated MOFids are evaluated by MOFormer and scored by a multi-objective reward function that balances target properties with validity, novelty, and diversity.
- The sequence-generation policy is updated via policy gradient methods (e.g., REINFORCE).
- This loop biases the generative model toward synthesizable, valid, and property-optimized MOFs.
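The objectives in the loop above can be written in standard form. These are generic formulations; the specific weights and reward terms used by Badrinarayanan et al. are not reproduced here and the notation below is an assumption:

```latex
% Autoregressive pretraining over MOFid token sequences x = (x_1, ..., x_T)
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Composite reward for a generated MOFid x (weights w_i are illustrative)
R(x) = w_p\, r_{\mathrm{prop}}(x) + w_v\, r_{\mathrm{valid}}(x)
     + w_n\, r_{\mathrm{novel}}(x) + w_d\, r_{\mathrm{div}}(x)

% REINFORCE policy-gradient update of the generation policy \pi_\theta
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \pi_\theta}
    \left[ R(x)\, \nabla_\theta \log \pi_\theta(x) \right]
```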
3. Comparison with Traditional Representations
Traditional methods for MOF representation in computational chemistry rely either on explicit atomic coordinates or graph-based encodings. Such approaches require:
- Conversion of crystallographic CIFs to 3D atomic models.
- Extraction of large, often non-unique atom graphs or connectivity matrices.
- Computationally intensive preprocessing (e.g., DFT optimizations).
In contrast, the MOFid representation:
- Compresses all necessary chemical and topological information into a compact, single-line string format.
- Bypasses the need for coordinate-based geometry, significantly accelerating virtual screening and property prediction.
- Is inherently suitable for LLM architectures, leveraging advances in NLP for chemical structure generation.
4. Advantages for Data-Driven and LLM–Driven MOF Discovery
MOFid unlocks several key capabilities:
- Scalability: By reducing MOF representation to a string, LLMs (e.g., GPT) can be employed to efficiently explore the immense combinatorial space of possible frameworks.
- Expressiveness: Preservation of both SBU chemistry and global connectivity ensures that generated candidates correspond to physically reasonable, potentially synthesizable MOFs.
- Integrability: MOFid serves as a universal interface for generative models, property predictors, and RL agents, allowing for closed-loop optimization and inverse design workflows.
- Data Efficiency: As demonstrated by comparisons to structure-based models, language-model-driven pipelines using MOFid can outperform or match graph and 3D geometry–based predictors, especially in low-data regimes.
5. Technical Details in Model Training and Reward Engineering
In generative model pretraining, the next-character (or word) prediction objective is optimized over sequences of MOFid tokens, adhering to chemical grammar and allowable topology codes. During RL-guided exploration:
- Reward functions are composite, typically balancing target property optimization with chemical validity, novelty (e.g., measured by Tanimoto distance between MOFids), and diversity.
- Validity checks ensure that the generated MOFid encodes chemically correct building blocks and topologically consistent frameworks.
- Diversity reward ensures broad exploration of chemical space and mitigates mode collapse.
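The novelty term mentioned above can be sketched with a Tanimoto (Jaccard) similarity. Production pipelines typically compute Tanimoto similarity over chemical fingerprints (e.g., via RDKit); the version below uses character n-grams of the MOFid string purely as a self-contained, fingerprint-free illustration:

```python
def ngrams(s: str, n: int = 3) -> set:
    """Character n-gram set of a string (a crude, fingerprint-free proxy)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def tanimoto(a: str, b: str, n: int = 3) -> float:
    """Tanimoto (Jaccard) similarity between two MOFid strings,
    computed over character n-grams rather than chemical fingerprints."""
    fa, fb = ngrams(a, n), ngrams(b, n)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)

def novelty_reward(candidate: str, corpus: list) -> float:
    """Novelty as one minus the maximum similarity to any reference MOFid."""
    if not corpus:
        return 1.0
    return 1.0 - max(tanimoto(candidate, ref) for ref in corpus)
```

A candidate identical to a training MOFid scores zero novelty, while an unseen string scores closer to one; in the full reward this term is weighted against property and validity terms.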
6. Broader Implications and Future Directions
The MOFid framework, due to its dual encoding and amenability to language modeling, advances the state-of-the-art in computational reticular chemistry by:
- Enabling rapid, targeted generation of functional MOFs.
- Supporting integration of property prediction and uncertainty estimation directly within the sequence modeling pipeline.
- Providing a basis for further extensions, e.g., conditional MOF generation, transfer learning across reticular materials, and hybrid generative–simulation workflows.
A plausible implication is that similar “chemically and topologically informed string encodings” may be generalized to other classes of crystalline or porous materials, extending the successes of MOFid-driven generative modeling to a broad array of material discovery problems.
Summary Table: Key Elements of the MOFid Pipeline
| Component | Representation | Function |
|---|---|---|
| MOFid | SMILES + RCSR code string | Compact, chem./topo-aware MOF descriptor |
| GPT Generator | Autoregressive transformer | MOFid sequence generation (pretrained/RL) |
| MOFormer | Transformer regressor | Property prediction from MOFid string |
| RL Module | Reward-based optimization | Targets property, validity, diversity |
In summary, MOFid is a central enabler for generative, property-optimized design of MOFs, integrating chemical and topological information in a machine-learning-compatible string format that underpins transformer-based, reinforcement learning–augmented materials discovery pipelines (Badrinarayanan et al., 30 May 2025).