GIMLET: PDE Discovery & Molecule Learning

Updated 4 July 2026

GIMLET is a term for two separate research frameworks: one for continuum modeling via thermodynamic closures and one for instruction-based molecular prediction.
The thermodynamic framework leverages neural parameterization, automatic differentiation, and variational principles to accurately infer unknown closure terms in fluid dynamics PDEs.
The molecular framework employs a unified graph-text transformer with decoupled attention to achieve zero-shot prediction in molecule property tasks.

GIMLET is an acronym that denotes two unrelated research frameworks in recent arXiv literature. In continuum modeling, it refers to Generalizable and Interpretable Model Learning through Embedded Thermodynamics, a data-driven framework for discovering constitutive relations in models of fluid flow and scalar transport by embedding a nonequilibrium thermodynamic variational structure into neural parameterizations of free-energy and dissipation functionals (Shiratori et al., 22 Dec 2025). In molecular machine learning, it refers to Graph Instruction based MolecuLe zEro-shoT learning, a unified graph-text transformer for instruction-based molecule zero-shot learning that encodes molecular graphs and natural-language task descriptions within a single T5-style architecture (Zhao et al., 2023). The shared acronym can obscure the fact that the two systems address different scientific objects, use different inductive biases, and target different notions of generalization.

1. Disambiguation and nomenclature

The thermodynamics-oriented GIMLET is explicitly formulated for gray-box PDE discovery in continuum systems where the PDE structure is known but some constitutive terms are unknown. Its stated target is “continuum PDEs for fluid flow and scalar transport where some terms are known and others (constitutive/closure relations) are missing,” and its demonstrations include the viscous Burgers equation, the Kuramoto–Sivashinsky equation, and the incompressible Navier–Stokes equations for both Newtonian and non-Newtonian fluids (Shiratori et al., 22 Dec 2025).

The molecular-learning GIMLET is instead a unified graph-text model for instruction-conditioned molecular prediction. Its motivating problem is label insufficiency in molecule property prediction, and its central question is whether natural-language instructions can specify molecule-related tasks in a zero-shot setting. The model is described as using “a single T5-style LLM to encode both” molecular graphs and natural-language instructions, with generalized position embedding and decoupled attention to support transfer to novel tasks (Zhao et al., 2023).

A common misconception would be to treat GIMLET as a single technical framework spanning both scientific machine learning and molecular representation learning. The literature does not support that reading. The two uses share only the acronym; their governing objects, architectures, and training objectives are distinct.

2. Thermodynamic GIMLET: gray-box PDE discovery through embedded variational structure

In the continuum-modeling usage, GIMLET assumes that temporal derivative, convective transport, and pressure-gradient contributions are known, while closure terms are unknown and must be inferred from data. The framework is rooted in a variational principle from nonequilibrium thermodynamics in which dynamics are determined by a free-energy functional and a dissipation functional, and the unknown constitutive terms arise as functional derivatives of those functionals with respect to the state variables (Shiratori et al., 22 Dec 2025).

For a two-component viscous fluid in the Eulerian description, the stated state variables are density $\rho(\mathbf{x},t)$, velocity $\mathbf{u}(\mathbf{x},t)$, entropy $s(\mathbf{x},t)$, and mass fraction $\phi(\mathbf{x},t)$. The free energy of the mixture is written as

$G[\phi] \equiv \int_V g(\phi)\,dV,$

with chemical potential

$\mu = \frac{1}{\rho}\frac{\delta G}{\delta \phi},$

and the dissipation potential is

$\Theta[\mathbf{u},\phi] \equiv \int_V \theta(\mathbf{u},\phi)\,dV.$

Using a variational principle following Fukagawa and Fujitani, the momentum and scalar transport equations are expressed so that the unknown reversible and irreversible closures appear as $(\nabla \phi)\,\delta G/\delta \phi$ and $-\delta\Theta/\delta\mathbf{u}$, while the diffusive flux is written as $\mathbf{j}=-M\nabla(\delta G/\delta\phi)$ in the isothermal case with constant $M$ (Shiratori et al., 22 Dec 2025).

In the paper’s abstract notation, the target form places the known operator (temporal derivative, convective transport, and pressure gradient) on one side and the unknown closure terms, expressed as functional derivatives of the free-energy and dissipation functionals, on the other. This formulation makes the distinction between known operator and unknown closure term explicit. The significance is methodological: unlike sparse regression or symbolic identification approaches, GIMLET does not begin with a predefined library of candidate functions, but with a thermodynamic parameterization of admissible constitutive structure.
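
As a worked illustration of how a closure term can arise from a functional derivative, consider the one-dimensional viscous Burgers case. The quadratic density $\theta = \tfrac{1}{2}\nu u_x^2$ and the boundary conditions (periodic or vanishing boundary terms) are assumptions chosen to reproduce the standard viscous term; they are not quoted from the paper.

```latex
\begin{align*}
\Theta[u] &= \int \theta(u_x)\,dx, \qquad \theta(u_x) = \tfrac{1}{2}\,\nu\,u_x^{2} \\
\frac{\delta \Theta}{\delta u}
  &= -\frac{\partial}{\partial x}\!\left(\frac{\partial \theta}{\partial u_x}\right)
   = -\nu\,u_{xx} \\
u_t + u\,u_x &= -\frac{\delta \Theta}{\delta u} = \nu\,u_{xx}
\end{align*}
```

Under these assumptions the known operator sits on the left and the closure on the right is fixed entirely by the scalar density $\theta$.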

3. Neural parameterization, automatic differentiation, and thermodynamic consistency

The free-energy and dissipation functionals are parameterized using neural networks, and their functional derivatives are obtained via automatic differentiation. For a scalar field $\phi$, the free-energy functional is written as

$G[\phi] = \int_V g(\phi, \nabla\phi, \nabla^2\phi, \ldots)\,dV,$

and for a vector field $\mathbf{u}$, the dissipation functional is

$\Theta[\mathbf{u}] = \int_V \theta(\mathbf{u}, \nabla\mathbf{u}, \ldots)\,dV.$

The paper states that $g$ is implemented via FreeEnergyNet and $\theta$ via DissipationNet, with practical truncation often taken up to first or second derivatives, such as $\nabla\phi$ and $\nabla^2\phi$ (Shiratori et al., 22 Dec 2025).

FreeEnergyNet uses an Integrable Neural Network (INN) whose output is analytically differentiable with respect to its inputs. For the Kuramoto–Sivashinsky example, the free-energy density is decomposed into a sum of components, each represented by an INN, and the functional derivative is assembled from the derivatives of those components.
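
One decomposition consistent with the Kuramoto–Sivashinsky equation $u_t + u\,u_x + u_{xx} + u_{xxxx} = 0$ is sketched below. The component names $g_1$, $g_2$, their quadratic forms, and the sign convention are assumptions made for illustration, not the paper's stated expressions.

```latex
\begin{align*}
g &= g_1(u_x) + g_2(u_{xx}), \qquad
g_1 = -\tfrac{1}{2}\,u_x^{2}, \quad g_2 = \tfrac{1}{2}\,u_{xx}^{2} \\
\frac{\delta G}{\delta u}
  &= -\frac{\partial}{\partial x}\!\left(\frac{\partial g_1}{\partial u_x}\right)
     + \frac{\partial^{2}}{\partial x^{2}}\!\left(\frac{\partial g_2}{\partial u_{xx}}\right)
   = u_{xx} + u_{xxxx} \\
u_t + u\,u_x &= -\frac{\delta G}{\delta u}
\end{align*}
```

With these choices the two quadratic components map one-to-one onto the second- and fourth-order terms of the equation.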

DissipationNet uses a Convex Integrable Neural Network (CINN), with nonnegative weights and a convex, nondecreasing activation such as softplus, to guarantee convexity of the dissipation density in its arguments (Shiratori et al., 22 Dec 2025).
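
The convexity argument can be checked on a toy network. The sketch below is a minimal input-convex network in plain Python, not the paper's CINN: the layer sizes, the random initialization, and the names `make_cinn` and `softplus` are all hypothetical. It relies only on the stated ingredients, nonnegative weights after the first layer and a convex, nondecreasing activation.

```python
import math
import random


def softplus(z):
    # Numerically stable softplus; convex and nondecreasing in z.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def make_cinn(hidden=8, seed=0):
    """Build a scalar-input, scalar-output convex network (illustrative only)."""
    rng = random.Random(seed)
    # First layer is affine in x, so its weights may have any sign.
    a = [rng.uniform(-1, 1) for _ in range(hidden)]
    b = [rng.uniform(-1, 1) for _ in range(hidden)]
    # Later layers must use nonnegative weights to preserve convexity.
    w = [[abs(rng.uniform(-1, 1)) for _ in range(hidden)] for _ in range(hidden)]
    c = [rng.uniform(-1, 1) for _ in range(hidden)]
    v = [abs(rng.uniform(-1, 1)) for _ in range(hidden)]

    def theta(x):
        z1 = [softplus(a[k] * x + b[k]) for k in range(hidden)]
        z2 = [
            softplus(sum(w[j][k] * z1[k] for k in range(hidden)) + c[j])
            for j in range(hidden)
        ]
        # Nonnegative combination of convex functions stays convex.
        return sum(v[j] * z2[j] for j in range(hidden))

    return theta
```

Each hidden unit is a convex, nondecreasing function of a nonnegative combination of convex functions, so the output is convex in `x` regardless of the random draw.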

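For densities $g(\phi,\nabla\phi,\nabla^2\phi)$ and $\theta(\mathbf{u},\nabla\mathbf{u})$, the functional derivatives are written as

$\frac{\delta G}{\delta \phi} = \frac{\partial g}{\partial \phi} - \nabla\cdot\frac{\partial g}{\partial(\nabla\phi)} + \nabla^2\frac{\partial g}{\partial(\nabla^2\phi)}$

and

$\frac{\delta \Theta}{\delta \mathbf{u}} = \frac{\partial \theta}{\partial \mathbf{u}} - \nabla\cdot\frac{\partial \theta}{\partial(\nabla\mathbf{u})}.$

These are computed by automatic differentiation through a joint graph in which a PINN maps coordinates $(\mathbf{x},t)$ to field variables and FreeEnergyNet and DissipationNet consume field and gradient values to produce the free-energy and dissipation densities (Shiratori et al., 22 Dec 2025).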
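
To make the chain from a scalar density to a closure term concrete, the sketch below evaluates $-\delta\Theta/\delta u = \partial_x(\partial\theta/\partial u_x)$ on a periodic one-dimensional grid. It swaps automatic differentiation for central finite differences so that it runs with the standard library alone, and the function names (`ddx`, `closure_from_dissipation`) and the test density are hypothetical.

```python
import math


def ddx(values, dx):
    # Central difference on a periodic grid (index -1 wraps around).
    n = len(values)
    return [(values[(i + 1) % n] - values[i - 1]) / (2.0 * dx) for i in range(n)]


def closure_from_dissipation(theta, u, dx, eps=1e-6):
    """Return -dTheta/du = d/dx( d theta / d u_x ) for a density theta(u_x).

    Finite differences stand in for autodiff here; this is an assumption
    of the sketch, not the method described in the text.
    """
    u_x = ddx(u, dx)
    # Derivative of the density with respect to its argument u_x.
    dtheta = [(theta(g + eps) - theta(g - eps)) / (2.0 * eps) for g in u_x]
    return ddx(dtheta, dx)
```

For example, with `u = sin(x)` and `theta = lambda g: 0.5 * nu * g * g`, the returned values approximate `-nu * sin(x)`, which is the viscous term `nu * u_xx`.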

Thermodynamic consistency is enforced structurally and through the loss design. Structurally, the INN yields a well-defined scalar potential and the CINN makes the dissipation density convex in its arguments. In training, the total loss is

$\mathcal{L} = \mathcal{L}_{\mathrm{PDE}} + \mathcal{L}_{\mathrm{data}} + \mathcal{L}_{\mathrm{reg}},$

where $\mathcal{L}_{\mathrm{PDE}}$ is the PDE residual loss, $\mathcal{L}_{\mathrm{data}}$ is the data loss, and $\mathcal{L}_{\mathrm{reg}}$ collects the regularization terms. The paper states that this construction enforces thermodynamic consistency by design, ensuring monotonic decay of the total free energy and non-negative entropy production (Shiratori et al., 22 Dec 2025).
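
The three-part loss can be sketched as a plain function. The mean-square form of the residual and data terms, the weights `w_pde`, `w_data`, `w_reg`, and the function names are assumptions for illustration; the source does not state how the terms are weighted.

```python
def mean_square(values):
    # Mean of squared entries; an empty input contributes zero.
    values = list(values)
    return sum(v * v for v in values) / len(values) if values else 0.0


def total_loss(pde_residuals, predictions, measurements, reg_terms=(),
               w_pde=1.0, w_data=1.0, w_reg=1.0):
    """Combine PDE residual, data misfit, and regularization (hypothetical weights)."""
    loss_pde = mean_square(pde_residuals)
    loss_data = mean_square(p - m for p, m in zip(predictions, measurements))
    loss_reg = sum(reg_terms)
    return w_pde * loss_pde + w_data * loss_data + w_reg * loss_reg
```

For instance, `total_loss([1.0, -1.0], [1.0], [0.0], [0.5])` adds a residual term of 1.0, a data term of 1.0, and a regularization term of 0.5.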

4. Optimization protocol, benchmarks, and interpretability in the thermodynamic formulation

The training pipeline combines measurement data and residual collocation points. The measurement dataset is

$\{(\mathbf{x}_i, t_i, \mathbf{u}_i)\}_{i=1}^{N_{\mathrm{data}}},$

typically from high-fidelity CFD data or experimental data, and “only state variables are required (no fluxes or stresses).” Residual points

$\{(\mathbf{x}_j, t_j)\}_{j=1}^{N_{\mathrm{res}}}$

are randomly sampled in space-time and used to evaluate PDE residuals via automatic differentiation. Optimization is two-stage: Adam for 1000 epochs in all cases, followed by Self-Scaled Broyden (SSBroyden) for 10,000–200,000 epochs. Residual points are resampled every 1000 epochs using Residual-based Adaptive Distribution (RAD) (Shiratori et al., 22 Dec 2025).
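
A residual-weighted resampling step in the spirit of RAD can be sketched as follows. The weighting $|\varepsilon|^k / \mathrm{mean}(|\varepsilon|^k) + c$ follows the commonly cited form of the method; the defaults `k=1`, `c=1`, sampling with replacement, and the names `rad_weights` and `rad_resample` are assumptions of this sketch rather than settings reported in the source.

```python
import random


def rad_weights(residuals, k=1.0, c=1.0):
    # Unnormalized sampling weights: |eps|^k / mean(|eps|^k) + c.
    powered = [abs(r) ** k for r in residuals]
    mean_powered = sum(powered) / len(powered)
    if mean_powered == 0.0:
        return [1.0] * len(powered)  # all residuals zero: fall back to uniform
    return [p / mean_powered + c for p in powered]


def rad_resample(candidates, residual_fn, n_points, k=1.0, c=1.0, seed=0):
    """Draw n_points collocation points from candidates, favoring large residuals."""
    weights = rad_weights([residual_fn(p) for p in candidates], k, c)
    # random.choices samples with replacement; a simplification of this sketch.
    return random.Random(seed).choices(candidates, weights=weights, k=n_points)
```

In a training loop, `candidates` would be a dense random set of `(x, t)` pairs and `residual_fn` the current PDE residual, with the call repeated at each resampling interval.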

The reported demonstrations cover four benchmark problems, each with a discovery dataset A and a generalization dataset B.

System	Unknown quantity	Representative result
1D viscous Burgers	dissipation density	Correlation coefficient between learned and reference quantities on dataset A: […]
1D Kuramoto–Sivashinsky	two free-energy components	[…]; […]
2D incompressible Navier–Stokes, Newtonian	dissipation density	Correlation between true and learned quantities on dataset A: […]
2D incompressible Navier–Stokes, non-Newtonian	nonlinear rheological dissipation density	Reynolds scaling factor converges to […] with relative error […]

For 1D viscous Burgers, the PDE is

$u_t + u\,u_x = \nu\,u_{xx},$

and the paper reports that the learned dissipation matches the reference form almost exactly, well beyond the observed range of its argument. For Kuramoto–Sivashinsky, the learned free-energy components recover the quadratic forms associated with the unknown second- and fourth-order derivative terms, $u_{xx}$ and $u_{xxxx}$, and transfer from an early time window to a later, more complex one. For Newtonian Navier–Stokes, the dissipation density is recovered from a lid-driven cavity and transferred to flow past a cylinder, with the learned coefficient and its relative error reported in the paper. For the Bird–Carreau–Yasuda non-Newtonian case, the learned dissipation density matches the shape of the nonlinear rheology over the gradient range seen in dataset A (Shiratori et al., 22 Dec 2025).

The framework’s interpretability claim follows directly from the fact that $G$ and $\Theta$ are scalar physical potentials. The paper identifies $G$ as a generalized free energy and $\Theta$ as a generalized Rayleigh dissipation functional, and emphasizes that one can inspect scalar plots of the learned densities $g$ and $\theta$ against their arguments, or projections of them. A plausible implication is that interpretability here is not a post hoc attribution device but a property of the model class itself.

5. Molecular GIMLET: unified graph-text transformer for instruction-based zero-shot learning

In molecular machine learning, GIMLET is defined as a model that “unifies LLMs for both graph and text data” for instruction-based molecule zero-shot learning. The input consists of a molecule graph $G$ and a natural-language instruction $T$ describing a task, and the output is a label string generated by a T5-style decoder (Zhao et al., 2023).

The architecture uses a single transformer encoder for both modalities. For hidden states

$H = [h_1, \ldots, h_n, h_{n+1}, \ldots, h_{n+m}]^\top \in \mathbb{R}^{(n+m)\times d_h},$

with the first $n$ tokens corresponding to graph nodes and the next $m$ tokens to instruction text tokens, attention incorporates a generalized position bias

$\hat{A}_{ij} = \frac{(W^Q h_i)^\top (W^K h_j)}{\sqrt{d_k}} + b(i,j), \qquad b(i,j) = b^{D}_{\mathrm{POS}(i,j)} + b^{M}_{i,j} + \operatorname{Mean}_{k \in \mathrm{SP}(i,j)} b^{E}_{e_k},$

where $b^{D}$ is the position-code bias, $b^{M}$ is the attention-mask term, and $b^{E}$ is an edge-feature bias averaged over the shortest path $\mathrm{SP}(i,j)$. Here $\mathrm{POS}(i,j)$ is defined by three cases: relative index difference for text-text pairs, graph shortest-path distance for graph-graph pairs, and a special $\langle\mathrm{CROSS}\rangle$ code for graph-text or text-graph pairs (Zhao et al., 2023).
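
The three-case position code can be sketched directly. The snippet below uses zero-based token indices, breadth-first search for shortest-path distances, and the string `"<CROSS>"` as the cross-modal code; the function names and the indexing convention are illustrative assumptions rather than the released implementation.

```python
from collections import deque

CROSS = "<CROSS>"  # placeholder code for graph-text and text-graph pairs


def shortest_path_lengths(n_nodes, edges):
    """All-pairs shortest-path lengths on an undirected graph via BFS."""
    adj = [[] for _ in range(n_nodes)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[None] * n_nodes for _ in range(n_nodes)]
    for src in range(n_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if dist[src][w] is None:
                    dist[src][w] = dist[src][v] + 1
                    queue.append(w)
    return dist


def position_code(i, j, n_nodes, spd):
    """Tokens 0..n_nodes-1 are graph nodes; later indices are text tokens."""
    i_is_graph, j_is_graph = i < n_nodes, j < n_nodes
    if i_is_graph and j_is_graph:
        return spd[i][j]          # graph-graph: shortest-path distance
    if not i_is_graph and not j_is_graph:
        return i - j              # text-text: relative index difference
    return CROSS                  # mixed pair: special cross-modal code
```

For a three-atom chain with edges `[(0, 1), (1, 2)]` followed by two text tokens, `position_code(0, 2, 3, spd)` gives the graph distance 2, while any pair that mixes a node and a text token returns the cross code.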

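The model also introduces a decoupled attention mask, added to the attention logits, which for graph tokens indexed $1,\ldots,n$ and text tokens indexed $n+1,\ldots,n+m$ is

$b^{M}_{i,j} = \begin{cases} -\infty & \text{if } i \le n \text{ and } j > n, \\ 0 & \text{otherwise}, \end{cases}$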

so graph tokens can attend only to graph tokens, while instruction tokens can attend to both graph and text tokens. The stated purpose is to preserve task-independent graph representations while allowing instructions to read graph information and build task-conditioned representations (Zhao et al., 2023).
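
The effect of the mask on attention weights can be seen in a few lines. This is a minimal sketch with zero-based indices and hypothetical helper names (`decoupled_mask`, `attention_weights`); it applies an additive mask before a row-wise softmax, which is the usual way such masks are used.

```python
import math


def decoupled_mask(n_graph, n_text):
    """Additive mask: graph queries (rows) may not attend to text keys (columns)."""
    total = n_graph + n_text
    return [
        [-math.inf if (i < n_graph <= j) else 0.0 for j in range(total)]
        for i in range(total)
    ]


def softmax(row):
    peak = max(row)  # every row keeps at least one finite entry
    exps = [math.exp(v - peak) for v in row]
    norm = sum(exps)
    return [e / norm for e in exps]


def attention_weights(scores, mask):
    # Add the mask to raw scores, then normalize each query row.
    return [
        softmax([s + b for s, b in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]
```

With two graph tokens, two text tokens, and uniform scores, a graph query spreads its weight over the two graph tokens only, whereas a text query attends uniformly to all four.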

Input construction proceeds by mapping each node to a node embedding and each instruction subword to a text embedding, concatenating them into a single token sequence (node embeddings first, then text embeddings), and then applying transformer layers with generalized graph-text positional biases. The encoder output is consumed by a T5-style decoder that autoregressively generates answers such as “Yes” or “No” for classification or numeric strings for regression (Zhao et al., 2023).

The model’s conceptual claim is that no separate GNN encoder is required. Instead, graph structure is injected through shortest-path distances and edge-type biases inside the transformer’s positional machinery. This suggests that the unification is architectural rather than merely multimodal late fusion.

6. Dataset construction, evaluation protocol, and empirical profile of the molecular formulation

The pretraining corpus combines ChEMBL bioassay activity with ChEMBL property data. The paper reports 1,048 classification tasks and approximately 365k molecules for bioactivity, together with 13 regression tasks on the same approximately 365k molecules for physico-chemical properties. It selects 80% of tasks and 80% of molecules for pretraining, with the remaining tasks and molecules forming ChEMBL Zero-Shot (Zhao et al., 2023).

Instructions are constructed from human-readable task descriptions. The stated pipeline is: collect raw textual descriptions, apply a mixed summarization strategy using template-based summarization and GPT-3.5-turbo-based summarization, then construct a final instruction by concatenating a concise explanation with a direct question specifying the desired output. These instructions are “reviewed and lightly edited by a PhD-level biologist.” Labels are rendered as strings: “Yes” or “No” for classification, and decimal numbers, typically rounded to 2 digits, for regression (Zhao et al., 2023).
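
The label-as-string convention can be sketched with a pair of helpers. Interpreting "2 digits" as two decimal places is an assumption here, and `render_label` and `parse_label` are hypothetical names; a parser of this kind is also one way to count how many generated outputs are correctly formatted numbers.

```python
def render_label(value):
    """Turn a task label into the target string for text generation."""
    if isinstance(value, bool):
        return "Yes" if value else "No"
    return f"{value:.2f}"  # assumed: two decimal places for regression


def parse_label(text):
    """Map generated text back to a label; None marks a malformed output."""
    cleaned = text.strip()
    if cleaned in ("Yes", "No"):
        return cleaned == "Yes"
    try:
        return float(cleaned)
    except ValueError:
        return None
```

For example, `render_label(-3.14159)` yields `"-3.14"`, and `parse_label` returns `None` for free text that is neither a class word nor a number.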

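The supervised pretraining objective is length-normalized text generation:

$\mathcal{L} = -\frac{1}{|o|}\sum_{i=1}^{|o|} \log P(o_i \mid o_{<i}, G, T),$

where $o$ is the target label string, $G$ the molecule graph, and $T$ the instruction.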
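
Length normalization amounts to averaging token-level log-probabilities rather than summing them. The sketch below assumes the decoder's per-token log-probabilities for the target string are already available as a list; the function name is hypothetical.

```python
import math


def length_normalized_nll(token_logprobs):
    """Negative mean log-probability of the target tokens."""
    if not token_logprobs:
        raise ValueError("target sequence must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)
```

Under this normalization a one-token answer such as "Yes" and a multi-token numeric string contribute on the same per-token scale, so longer targets do not dominate the loss.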

The backbone is a T5-style encoder-decoder with 64M parameters, using only “basic features” as in Hu et al. (2019), concretely the first two dimensions of node and edge features from ogb.smiles2graph. Downstream train-validation-test splits use scaffold split with ratio 0.8 / 0.1 / 0.1 (Zhao et al., 2023).
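
A scaffold split of the kind referenced above can be sketched without a chemistry toolkit if scaffold keys are supplied. The snippet assumes each molecule already has a scaffold string (in practice computed with a cheminformatics library, which is omitted here) and follows the common deterministic recipe of assigning the largest scaffold groups first; names and tie-breaking are illustrative.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold spans two subsets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first molecule index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_total = len(scaffolds)
    train_cut = frac_train * n_total
    valid_cut = (frac_train + frac_valid) * n_total
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole scaffold groups are assigned at once, the realized ratios only approximate 0.8 / 0.1 / 0.1, and rare scaffolds tend to land in the validation and test sets.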

Its zero-shot results are reported against KVPLM, MoMu, and Galactica variants, as well as supervised graph baselines.

Benchmark group	GIMLET result	Comparison stated in the paper
BACE / HIV / MUV average	0.667 ROC-AUC	Outperforms all zero-shot molecule-text baselines
Tox21 / ToxCast average	0.601 ROC-AUC	Improves Avg Tox by 5–10 AUC points over the best zero-shot baseline
BBBP / CYP450 average	0.653 ROC-AUC	On CYP450, 0.713 is described as close to supervised GIN at 0.821
ChEMBL Zero-Shot	0.786 ROC-AUC	Far ahead of all baselines
PCBA	0.621 ROC-AUC	Substantially outperforms zero-shot baselines
ESOL / Lipo / FreeSolv average	2.527 RMSE	Only model reported to handle regression in a zero-shot instruction-based manner
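
For readers less familiar with the metric in the table, ROC-AUC can be computed as the fraction of positive-negative pairs that a score ranks correctly. The sketch assumes each molecule receives a scalar score (for example, a probability assigned to the answer "Yes"); how the score is extracted from the generative model is not specified in the source, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: P(score_pos > score_neg), counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a reference scale for the averages reported above.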

The paper reports that GIMLET “successfully outputs correctly formatted numbers for >98% of test samples.” It also states that the model significantly outperforms molecule-text baselines in instruction-based zero-shot learning and achieves results close to those of supervised GNN models on tasks such as ToxCast and MUV (Zhao et al., 2023).

Ablation studies isolate three factors. First, removing the unified transformer in favor of a separate GIN encoder lowers average ROC-AUC. Second, removing the decoupled attention mask harms performance, especially on Bio and Pha tasks. Third, replacing full instructions with task names and simple questions reduces Avg Bio from 0.667 to 0.600, Avg Tox from 0.601 to 0.554, and Avg Pha from 0.653 to 0.577. The paper additionally reports robustness to instruction rephrasing generated by GPT-3.5 and improved performance under few-shot linear-only tuning of the last mapping to vocabulary (Zhao et al., 2023).

Taken together, the two GIMLET frameworks exemplify different strategies for encoding scientific structure into learnable models. One embeds thermodynamic admissibility into closure discovery for PDEs; the other embeds graph geometry and instruction semantics into a unified transformer for molecular zero-shot prediction. The shared acronym therefore indexes a naming coincidence rather than a common methodological lineage.

References

Shiratori et al. (2025). GIMLET: Generalizable and Interpretable Model Learning through Embedded Thermodynamics.

Zhao et al. (2023). GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning.
