CHILI Dataset for Inorganic Nanomaterials

Updated 29 December 2025

CHILI is a benchmark dataset offering graph-structured representations of inorganic nanomaterials with detailed crystallographic and physical property data.
It comprises two subsets for GNN training: CHILI-3K, with ~3,000 mono-metallic oxide graphs, and CHILI-100K, with over 100,000 diverse nanomaterial structures.
The dataset supports tasks such as node classification, edge regression, and structure prediction, and has been used to evaluate GNN architectures built on KAN layers, which report state-of-the-art results on several of these tasks.

CHILI (Chemically-Informed Large-scale Inorganic Nanomaterials Dataset) is a benchmark resource for graph-based machine learning on inorganic nanomaterials, providing graph-structured representations of crystal structures, physical properties, and problem-specific task labels. Designed to address the scarcity of large, structurally and chemically diverse inorganic datasets, CHILI supports research on the predictive and generative capabilities of graph neural networks (GNNs) for complex inorganic systems, extending graph ML benchmarks beyond their earlier focus on small organic molecules (Friis-Jensen et al., 2024).

1. Dataset Structure and Versions

CHILI contains two principal subsets:

CHILI-3K: Approximately 3,000 carefully stratified inorganic nanomaterial graphs, each corresponding to a mono-metallic oxide nanoparticle. Structures are selected from 12 prototypical crystal types. Each graph contains between 7 and 14,793 nodes (atoms); the median is 1,377 nodes per graph. Edges (representing interatomic proximity) range from 7 to 118,258 per graph (median: 7,212). The dataset contains 6,959,085 nodes and 49,624,440 edges in total.
CHILI-100K: Over 100,000 nanomaterial graphs (104,408 total), ranging from 2 to 21,427 nodes per structure (median: 1,054), for a total of 183,398,463 nodes and 1,251,841,365 edges. This large collection spans all seven crystal systems (triclinic to cubic) and includes up to 7 distinct elements per material, drawn from 68 metals and 11 non-metals. CHILI-100K is derived from experimentally determined crystal structures, facilitating broad generalization (Volzhin et al., 22 Dec 2025, Friis-Jensen et al., 2024).
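As a quick sanity check on the totals above, the per-graph means for CHILI-100K and the edge-to-node ratios of both subsets follow directly from the quoted counts. This is plain arithmetic on the figures in the text; the variable names are illustrative, and no per-graph mean is computed for CHILI-3K because only an approximate graph count is given.

```python
# Totals quoted in the text above.
chili_3k = {"nodes": 6_959_085, "edges": 49_624_440}
chili_100k = {"graphs": 104_408, "nodes": 183_398_463, "edges": 1_251_841_365}

# Per-graph means for CHILI-100K (exact graph count is stated in the text).
mean_nodes_100k = chili_100k["nodes"] / chili_100k["graphs"]
mean_edges_100k = chili_100k["edges"] / chili_100k["graphs"]

# Edge-to-node ratios for both subsets.
ratio_3k = chili_3k["edges"] / chili_3k["nodes"]
ratio_100k = chili_100k["edges"] / chili_100k["nodes"]

print(f"CHILI-100K mean nodes per graph: {mean_nodes_100k:.1f}")
print(f"CHILI-100K mean edges per graph: {mean_edges_100k:.1f}")
print(f"Edges per node: CHILI-3K {ratio_3k:.2f}, CHILI-100K {ratio_100k:.2f}")
```

The CHILI-100K mean of roughly 1,757 nodes per graph sits well above the reported median of 1,054, which suggests a size distribution skewed toward a minority of large structures.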

Both subsets are split into 80% training, 10% validation, and 10% test sets, stratified by crystal system (Volzhin et al., 22 Dec 2025).
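A stratified 80/10/10 split of this kind can be sketched as follows. This is a minimal illustration, not the procedure used to produce the published splits: the function name `stratified_split`, the seed, and the toy label counts are all assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that preserve per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # whatever remains goes to test
    return train, val, test

# Toy example: seven hypothetical crystal-system labels with unequal frequencies.
class_counts = [400, 300, 100, 80, 60, 40, 20]
labels = [system for system, count in enumerate(class_counts) for _ in range(count)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Because the split sizes are rounded per class, very small classes can deviate from the exact 80/10/10 proportions.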

2. Graph Construction and Features

Nanomaterial instances in CHILI are encoded as undirected graphs $G=(V,E)$:

Nodes ($V$): Atoms, described by feature vectors including atomic number (one-hot), atomic mass, electronegativity, and local coordination metrics. The node feature matrix is $X\in\mathbb{R}^{N\times K}$.
Edges ($E$): Unordered pairs of atoms separated by $<6~\text{Å}$, with an adjacency matrix $A\in\{0,1\}^{N\times N}$. Each pair can have a continuous edge feature (e.g., interatomic distance).
Metadata: Each graph includes the crystal system label (out of up to 7), space group label (out of 230), other crystallographic parameters, and nanoparticle size.
Geometric data: Absolute ($\mathbb{R}^{N\times 3}$) and fractional coordinates are included, as well as simulated experimental signals (e.g., XRD, SAXS), computed via the Debye scattering equation (Friis-Jensen et al., 2024).
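The edge rule above (connect every unordered atom pair closer than 6 Å and store the distance as the edge feature) can be sketched in a few lines. This is an illustrative brute-force version under stated assumptions: the function names and the toy coordinates are hypothetical, and the dataset's own construction code may differ.

```python
import math
from itertools import combinations

CUTOFF = 6.0  # Å, the edge threshold stated in the text

def build_radius_graph(positions, cutoff=CUTOFF):
    """Return (edges, distances) for all unordered atom pairs closer than cutoff."""
    edges, distances = [], []
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        d = math.dist(p, q)
        if d < cutoff:
            edges.append((i, j))
            distances.append(d)  # continuous edge feature
    return edges, distances

def adjacency_matrix(n_atoms, edges):
    """Build the symmetric 0/1 adjacency matrix A of an undirected graph."""
    A = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1
    return A

# Toy coordinates in Å (not taken from the dataset).
positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
edges, distances = build_radius_graph(positions)
A = adjacency_matrix(len(positions), edges)
print(edges, distances)
```

The pairwise loop scales quadratically with the number of atoms, so for graphs with thousands of nodes a spatial index (cell list or k-d tree) would be the more practical choice.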

3. Task Definitions and Benchmarks

CHILI supports a broad range of supervised learning tasks. In both CHILI-3K and CHILI-100K, the main categories are:

Node-level tasks:
- Atomic-number classification (118 classes).
- Absolute atomic position regression (MAE).
Edge-level tasks:
- Edge-attribute regression (predicting $d_{uv}$, MSE).
Graph-level tasks:
- Crystal system classification (weighted $F_1$, 7 classes).
- Space group classification (weighted $F_1$, up to 230 classes).
- Small-angle X-ray and neutron scattering (SAXS, SANS) regression (MSE, $2\times 300$ points).
- X-ray/neutron diffraction (XRD, ND) regression (MSE, $2\times 580$ points).
- Pair distribution function regression (xPDF, nPDF) (MSE, $2\times 6000$ points).

Structure-prediction (inverse design) tasks are also defined: recovering unit cell or atomic coordinates from experimental signals (Friis-Jensen et al., 2024, Volzhin et al., 22 Dec 2025).
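The scattering signals used as regression targets and as inputs to the inverse tasks are simulated from atomic coordinates with the Debye scattering equation (Section 2), $I(Q)=\sum_i\sum_j f_i f_j \,\sin(Q r_{ij})/(Q r_{ij})$. A minimal sketch of this forward model is given below. It assumes constant, $Q$-independent scattering factors and no thermal damping, which is a simplification; the function name and toy inputs are illustrative and do not reproduce the dataset's simulation settings.

```python
import math

def debye_intensity(positions, scattering_factors, q_values):
    """Evaluate I(Q) = sum_ij f_i f_j sin(Q r_ij) / (Q r_ij) for each Q.

    Simplifying assumption: scattering factors are constants, not functions of Q.
    """
    n = len(positions)
    intensities = []
    for q in q_values:
        # Self terms (i == j): sin(x)/x -> 1 as x -> 0.
        total = sum(f * f for f in scattering_factors)
        for i in range(n):
            for j in range(i + 1, n):
                x = q * math.dist(positions[i], positions[j])
                sinc = math.sin(x) / x if x > 1e-12 else 1.0
                # Factor 2 covers both (i, j) and (j, i).
                total += 2.0 * scattering_factors[i] * scattering_factors[j] * sinc
        intensities.append(total)
    return intensities

# Toy example: two identical atoms 2 Å apart, unit scattering factors.
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
factors = [1.0, 1.0]
q_values = [0.0, math.pi / 2]
intensities = debye_intensity(positions, factors, q_values)
print(intensities)
```

At $Q=0$ the intensity reduces to $(\sum_i f_i)^2$, which is a convenient check on any implementation.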

Task Table (Selected)

Task Type	Output	Metric (CHILI)
Node classification	atomic_number	weighted $F_1$
Node regression	abs. position	MAE (Å)
Edge regression	edge distance	MSE
Graph classification	crystal system	weighted $F_1$
Graph regression	SAXS, XRD, xPDF	MSE
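For reference, the three metrics in the table can be written out explicitly. The sketch below uses the standard definitions, with weighted $F_1$ taken as the support-weighted mean of per-class $F_1$ scores (the same definition as scikit-learn's `f1_score` with `average="weighted"`); the function names and toy labels are illustrative.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n_cls / total) * f1
    return score

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels: three classes with supports 3, 2, and 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
f1_score_value = weighted_f1(y_true, y_pred)
mae_value = mae([1.0, 2.0], [1.5, 1.0])
mse_value = mse([1.0, 2.0], [1.5, 1.0])
print(f1_score_value, mae_value, mse_value)
```

Weighting by support means that frequent classes dominate the score, which matters for the imbalanced space group labels.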

Benchmarking experiments show that high-capacity GNNs that exploit edge features, such as EdgeCNN, consistently outperform vanilla GCN/GIN models on various property prediction tasks. Several regression problems, notably absolute position and high-dimensional xPDF regression, remain challenging, with performance close to that of naive baselines. Classification tasks, by contrast, are learned well, especially on CHILI-3K, where the balanced class distribution aids training (Friis-Jensen et al., 2024).

4. Application in Graph Neural Networks and Kolmogorov–Arnold Layers

CHILI has served as an experimental environment for advanced GNN architectures. Notably, Kolmogorov–Arnold Graph Neural Networks (KAGNNs), which substitute traditional point-wise MLP nonlinearities with B-spline-based univariate functions (KAN layers), have established new state-of-the-art results on several CHILI tasks (Volzhin et al., 22 Dec 2025).

KAN layers build on the Kolmogorov–Arnold representation (superposition) theorem, yielding expressive, smooth point-wise nonlinearities in message-passing GNNs. The “KAN-trick” replaces every nonlinearity/MLP in a GNN pipeline by a KAN layer $\Phi_\ell$, supporting architectures such as:

KAGCN: KAN-augmented Graph Convolutional Network.
KAGIN: KAN-augmented Graph Isomorphism Network.
KAEdgeCNN: KAN-augmented EdgeCNN.

Among these, KAGCN achieves up to 0.995 $F_1$ on crystal system classification in CHILI-3K, and KAEdgeCNN attains 0.966 $F_1$ on space group classification, both substantially outperforming their vanilla counterparts (e.g., vanilla GCN ≤ 0.367 $F_1$) (Volzhin et al., 22 Dec 2025). These models are reported to be robust across the explored hyperparameter ranges (layers 1–3, hidden dims 16–64, grid size 3–5, spline order 3–5, learning rate $10^{-4}$–$10^{-2}$).

A major insight from KAGNN results is that the expressivity benefit of KAN layers largely neutralizes architectural differences between GCN, GIN, and EdgeCNN, unifying their performance once KAN is applied. This suggests the critical bottleneck in prior GNN architectures was the point-wise MLP (Volzhin et al., 22 Dec 2025).
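To make the KAN layer described in this section concrete, the sketch below implements one layer in plain Python: every input-output pair carries its own learnable univariate function, expressed as a linear combination of B-spline basis functions on a uniform grid, and each output sums those functions over the inputs. It is an illustration under stated assumptions, not the KAGNN implementation: the class name, initialization, and input domain are hypothetical, and the residual base activation used in the original KAN formulation is omitted for brevity.

```python
import random

def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x (Cox-de Boor)."""
    # Degree 0: indicator functions on half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        new_basis = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den else 0.0
            right = (knots[i + d + 1] - x) / right_den * basis[i + 1] if right_den else 0.0
            new_basis.append(left + right)
        basis = new_basis
    return basis

class KANLayer:
    """Sketch of a KAN layer: out_j = sum_i phi_ji(x_i), each phi_ji a B-spline."""

    def __init__(self, in_dim, out_dim, grid_size=5, spline_order=3,
                 domain=(-1.0, 1.0), seed=0):
        rng = random.Random(seed)
        lo, hi = domain
        step = (hi - lo) / grid_size
        self.degree = spline_order
        # Uniform grid extended by `spline_order` knots on each side.
        self.knots = [lo + (i - spline_order) * step
                      for i in range(grid_size + 2 * spline_order + 1)]
        n_basis = grid_size + spline_order
        # coeffs[j][i][k]: weight of basis k in the function from input i to output j.
        self.coeffs = [[[rng.gauss(0.0, 0.1) for _ in range(n_basis)]
                        for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, x):
        # Inputs are assumed to lie inside `domain`; outside it the basis vanishes.
        bases = [bspline_basis(xi, self.knots, self.degree) for xi in x]
        outputs = []
        for row in self.coeffs:
            total = 0.0
            for coeffs_i, basis_i in zip(row, bases):
                total += sum(c * b for c, b in zip(coeffs_i, basis_i))
            outputs.append(total)
        return outputs

layer = KANLayer(in_dim=4, out_dim=2)
out = layer([0.1, -0.5, 0.3, 0.9])
print(out)
```

In a KAGNN, a layer of this kind would take the place of the MLP or activation applied to the aggregated neighbor features in each message-passing step.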

5. Dataset Strengths, Limitations, and Benchmark Observations

The CHILI datasets are distinguished by scale (roughly 7 million to 183 million nodes), elemental and structural diversity, and the inclusion of both property-prediction and structure-generation tasks. Key benchmark observations:

Classification tasks such as crystal system and space group, especially on the balanced CHILI-3K subset, are straightforward for expressive GNNs.
Edge-feature learning is critical; EdgeCNN models excel on property tasks.
Regression targets (notably SAXS, XRD, xPDF) are noisy and difficult; even advanced GNNs approach mean-prediction baselines. High-dimensional regression and global structure prediction (sub-Å positional accuracy) are unresolved challenges.
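The mean-prediction baseline referred to above can be made explicit: predict the per-point mean of the training targets for every test sample and compare a model's MSE against it. The sketch below is illustrative only; the function names, the skill-score convention, and the toy numbers are assumptions and do not reproduce any reported result.

```python
def mean_baseline(train_targets):
    """Per-point mean over the training targets (one value per output dimension)."""
    n = len(train_targets)
    return [sum(column) / n for column in zip(*train_targets)]

def mse_against(test_targets, prediction):
    """MSE when the same prediction vector is used for every test sample."""
    errors = [(t - p) ** 2 for row in test_targets for t, p in zip(row, prediction)]
    return sum(errors) / len(errors)

def skill_score(model_mse, baseline_mse):
    """1 means a perfect model, 0 means no better than the mean baseline."""
    return 1.0 - model_mse / baseline_mse

# Toy two-point "signals" standing in for a scattering pattern.
train_targets = [[0.0, 2.0], [2.0, 4.0]]
test_targets = [[1.0, 5.0]]

baseline = mean_baseline(train_targets)
baseline_mse = mse_against(test_targets, baseline)
hypothetical_model_mse = 1.0  # placeholder value, not a reported number
skill = skill_score(hypothetical_model_mse, baseline_mse)
print(baseline, baseline_mse, skill)
```

A skill score near zero is what "approaching the mean-prediction baseline" means in practice: the model has learned little beyond the average signal.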

Limitations include under-sampling of rare space groups in CHILI-100K and the lack of sub-Å precision for inverse scattering problems. The dataset highlights a need for both more robust 3D coordinate learning in GNNs and generative/inverse-design methods accommodating large, periodic graphs (Friis-Jensen et al., 2024, Volzhin et al., 22 Dec 2025).

6. Future Directions and Recommendations

The CHILI datasets set a new baseline for inorganic nanomaterial GNN benchmarks. Recommendations for practitioners include:

Model development: Consider replacing the MLPs and nonlinearities in message-passing pipelines with KAN layers, and tune the spline settings (grid size, spline order) together with model capacity; the reported gains are largest on classification tasks.
Interpretable features: KAN spline weights yield interpretable univariate functions; visualization can reveal chemically meaningful relations between features.
Scalability: Be aware of memory demands; grid size and spline order directly increase parameter count.
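The scalability point can be quantified with a rough parameter count. With grid size $G$ and spline order $k$, a B-spline on a uniform extended grid has $G+k$ basis functions, so a KAN layer stores about $d_\text{in}\cdot d_\text{out}\cdot(G+k)$ spline coefficients. The sketch below tabulates this over the grid size and spline order ranges quoted in Section 4. It counts spline coefficients only; actual implementations may add base weights or scaling terms, and the function names are illustrative.

```python
def kan_layer_params(in_dim, out_dim, grid_size, spline_order):
    """Spline coefficients in one KAN layer: (G + k) per input-output pair."""
    return in_dim * out_dim * (grid_size + spline_order)

def linear_layer_params(in_dim, out_dim):
    """Weights plus biases of a dense layer, for comparison."""
    return in_dim * out_dim + out_dim

hidden = 64  # upper end of the hidden-dimension range quoted in Section 4
dense = linear_layer_params(hidden, hidden)
print(f"dense {hidden}x{hidden}: {dense} parameters")

counts = {}
for grid_size in (3, 4, 5):
    for spline_order in (3, 4, 5):
        n = kan_layer_params(hidden, hidden, grid_size, spline_order)
        counts[(grid_size, spline_order)] = n
        print(f"G={grid_size}, k={spline_order}: {n} coefficients "
              f"({n / dense:.1f}x dense)")
```

Under these assumptions a 64-wide KAN layer holds roughly six to ten times as many parameters as a dense layer of the same width, which is the memory cost to weigh against the accuracy gains.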

Open challenges include generative modeling for variable-size, periodic graphs, integration of scattering data into inverse-design frameworks, and improvement of regression accuracy for structural and high-dimensional physical targets. A plausible implication is that progress in graph ML on CHILI will accelerate materials discovery workflows by enabling more accurate structure-property predictions in the inorganic domain (Friis-Jensen et al., 2024, Volzhin et al., 22 Dec 2025).