DataForge Pipeline: Automated GB Analysis

Updated 27 August 2025
  • DataForge Pipeline is a modular system that converts high-dimensional atomic configurations into an 8-dimensional feature space for efficient grain boundary classification.
  • It combines density-based clustering, which identifies structural motifs without requiring a pre-defined cluster count, with parallel K-means clustering for fast, scalable partitioning.
  • The pipeline integrates automation and batch processing to rapidly analyze thousands of GB structures, accelerating materials discovery and design.

The DataForge Pipeline is a modular, automated analysis pipeline developed for the classification and recognition of grain boundary (GB) structures generated by evolutionary algorithms, particularly those employing the USPEX method. Designed to operate on thousands of high-dimensional atomic configurations, it systematically transforms raw atomic coordinate data into a low-dimensional engineered feature space, applies robust clustering algorithms, and automates the overall workflow to achieve accurate, efficient structural classification in materials science.

1. Architecture and Workflow

The DataForge Pipeline comprises four principal components:

  • Feature Engineering Module: Transforms each input GB atomic configuration (originally located in a 3N-dimensional coordinate space) into an 8-dimensional feature vector based on physically and structurally meaningful excess properties.
  • Density-Based Clustering Module: Utilizes local density and distance-to-higher-density metrics to identify cluster centers within the engineered feature space, enabling unsupervised classification that does not require pre-specified cluster counts.
  • Parallel K-Means Clustering Module: Employs parallelized K-means clustering (accelerated via OpenACC or CUDA) to partition the feature vectors into clusters by iteratively updating cluster assignments and centroids.
  • Automation Infrastructure: Orchestrates batch processing and job scheduling (e.g., with Linux Crontab) to automate feature extraction and clustering analyses over large ensembles of structures, minimizing manual intervention and maximizing throughput.

This architecture enables the efficient processing of thousands of GB structures, facilitating automated structure analysis and improved physical understanding of GB phases.
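
To make the modular composition concrete, the sketch below shows how the four stages could be wired together in Python. It is a minimal illustration under stated assumptions, not the pipeline's actual code: all function names are hypothetical, and the two clustering stages are sketched individually in Section 3.

```python
import numpy as np

def run_pipeline(structures, extract_features, find_density_centers, run_kmeans):
    """Compose the pipeline stages. The stage implementations are injected
    as callables, mirroring the modular design: each module can be swapped
    or parallelized independently."""
    # Feature engineering: one 8-dimensional vector per GB structure.
    feats = np.array([extract_features(s) for s in structures])
    # Density-based pass: locate cluster centers without a preset count.
    centers = find_density_centers(feats)
    # Parallel K-means pass: refine the partition with k = len(centers).
    labels, _ = run_kmeans(feats, len(centers))
    return feats, labels
```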

2. Feature Engineering for Grain Boundaries

Feature engineering in the DataForge Pipeline is focused on mapping each complex atomic structure into an interpretable, lower-dimensional feature space. The process involves:

  • Calculation of Excess Properties:

    • Free Energy [γ]ₙ: Extracted from the GB free energy relation, capturing energetic differences with respect to bulk:

    \gamma A = E - TS - \sigma_{33}V = [E]_N - T[S]_N - \sigma_{33}[V]_N

    • Excess Atomic Volume [V]ₙ and Stress Components (τ₁₁, τ₂₂): Quantified as deviations from bulk properties using comparative expressions:

    [V]_N = \frac{1}{A}[V - V^{bulk}(N/N^{bulk})]

    \tau_{11,22} = \frac{1}{A}[\sigma_{11,22}V - \sigma_{11,22}^{bulk}V^{bulk}(N/N^{bulk})]

    • Excess Steinhardt Order Parameters ([Q₄]ₙ, [Q₆]ₙ, [Q₈]ₙ, [Q₁₂]ₙ): Capture local atomic ordering and structural motif differences with respect to bulk, using Voronoi constructions for atomic volumes:

    [Q]_N = \frac{1}{A}(Q - Q^{bulk}(N/N^{bulk}))

Each GB structure is ultimately encoded as:

f = ([\gamma]_N, [V]_N, \tau_{11}, \tau_{22}, [Q_4]_N, [Q_6]_N, [Q_8]_N, [Q_{12}]_N)

This compact feature representation preserves essential physical, mechanical, and geometric information, enabling subsequent clustering algorithms to discriminate GB phases and structural families.
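
As a concrete illustration of this encoding, the following Python sketch computes the eight excess properties from per-cell totals, assuming the GB and bulk reference quantities (energy, entropy, volume, stress components, and summed Steinhardt parameters) have already been extracted from the simulation; the dictionary schema and function names are illustrative, not from the source.

```python
import numpy as np

def excess(x, x_bulk, n, n_bulk, area):
    """Generic excess property per unit GB area:
    [X]_N = (X - X_bulk * (N / N_bulk)) / A."""
    return (x - x_bulk * (n / n_bulk)) / area

def gb_feature_vector(gb, bulk, area, temperature):
    """Encode one GB structure as
    f = ([gamma]_N, [V]_N, tau_11, tau_22, [Q4]_N, [Q6]_N, [Q8]_N, [Q12]_N).

    `gb` and `bulk` are dicts of per-cell totals (illustrative schema):
    E, S, V, N, stress components s11/s22/s33, and Q4..Q12.
    """
    n, nb = gb["N"], bulk["N"]
    # Excess energy, entropy, and volume per unit GB area.
    e_n = excess(gb["E"], bulk["E"], n, nb, area)
    s_n = excess(gb["S"], bulk["S"], n, nb, area)
    v_n = excess(gb["V"], bulk["V"], n, nb, area)
    # GB free energy per unit area: gamma = [E]_N - T[S]_N - sigma_33 [V]_N.
    gamma = e_n - temperature * s_n - gb["s33"] * v_n
    # In-plane excess stresses from stress-volume products.
    tau11 = excess(gb["s11"] * gb["V"], bulk["s11"] * bulk["V"], n, nb, area)
    tau22 = excess(gb["s22"] * gb["V"], bulk["s22"] * bulk["V"], n, nb, area)
    # Excess Steinhardt order parameters.
    q_n = [excess(gb[q], bulk[q], n, nb, area) for q in ("Q4", "Q6", "Q8", "Q12")]
    return np.array([gamma, v_n, tau11, tau22, *q_n])
```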

3. Clustering Algorithms

3.1 Density-Based Clustering (Rodriguez-Laio Approach)

Each feature vector f^i is analyzed for its local density \rho_i and its distance \delta_i to higher-density points:

  • Local Density:

\rho_i = \sum_j \chi(d_{ij} - d_c)

where d_{ij} = \|f^i - f^j\|_2, \chi(x) = 1 for x < 0 and 0 otherwise, and d_c is a cutoff chosen based on the typical nearest-neighbor separation.

  • Distance to Higher Density Points:

\delta_i = \min_{j: \rho_j > \rho_i}(d_{ij})

The (\rho, \delta) decision graph reveals cluster centers as points with anomalously high \rho and \delta, yielding a natural classification of GB structures into groups (e.g., Kite, Split Kite, Extended Kite) without an a priori cluster count. This method is robust for clusters with arbitrary geometries and readily exposes outliers.
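
A compact numpy sketch of the decision-graph quantities follows, assuming the feature vectors are stacked in an (n, 8) array; cutoff selection and the final reading-off of centers from the graph are omitted.

```python
import numpy as np

def decision_graph(features, d_c):
    """Compute (rho_i, delta_i) for each feature vector f^i.

    rho_i counts neighbors within the cutoff d_c; delta_i is the distance
    to the nearest point of higher density (for the global density maximum,
    the usual convention is the maximum pairwise distance).
    """
    # Pairwise Euclidean distances d_ij = ||f^i - f^j||_2.
    diff = features[:, None, :] - features[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density: rho_i = sum_j chi(d_ij - d_c), with chi(x) = 1 for x < 0.
    rho = (d < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    # Distance to the nearest higher-density point.
    delta = np.empty(len(features))
    for i in range(len(features)):
        higher = d[i, rho > rho[i]]
        delta[i] = higher.min() if higher.size else d[i].max()
    return rho, delta
```

Centers are then picked as the points with simultaneously large \rho and \delta, and each remaining point inherits the cluster label of its nearest higher-density neighbor.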

3.2 Parallel K-Means Clustering

Standard K-means is implemented with parallel acceleration:

  • Assignment:

S_i^{(t)} = \{x_p : \|x_p - m_i^{(t)}\|_2^2 \leq \|x_p - m_j^{(t)}\|_2^2 \ \forall j\}

  • Centroid Update:

m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x \in S_i^{(t)}} x

Parallelization is achieved via OpenACC or CUDA, exploiting data-level independence in distance calculations. This enables scalable, rapid clustering of thousands of GB structures.
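
The source attributes the speedup to OpenACC or CUDA kernels; as a language-neutral illustration of the same data-parallel structure, here is a vectorized numpy sketch in which the per-point distance computations of the assignment step are independent, which is exactly what maps onto GPU threads.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means. The assignment step is embarrassingly parallel
    across points, which the OpenACC/CUDA implementations exploit."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid
        # (squared distances computed for all point-centroid pairs at once).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```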

4. Automation and Efficiency Strategy

The pipeline is fully automated:

  • Input Acquisition: Raw atomic positions from evolutionary predictions are parsed automatically.
  • Batch Feature Calculation: Scripts compute eight excess properties per structure in a highly parallel fashion.
  • Automated Clustering Execution: Density-based and parallel K-means clustering routines assign cluster labels en masse.
  • Post-processing: Automated generation of decision graphs and feature maps allows immediate validation and interpretation of the cluster assignments.
  • Job Scheduling: Linux Crontab and related tools manage scheduled batch operations, maximizing computational resource utilization.
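
For the scheduling layer itself, a minimal (hypothetical) setup is a crontab entry such as `0 2 * * * python3 /opt/dataforge/run_batch.py >> /var/log/dataforge.log 2>&1` invoking a batch driver along the lines below; the paths, glob pattern, and `analyze_gb.py` worker script are illustrative, not from the source.

```python
#!/usr/bin/env python3
"""Hypothetical nightly batch driver invoked from crontab: runs the
feature-extraction-plus-clustering analysis over every structure file."""
import glob
import subprocess
import sys

def main() -> int:
    ok = True
    for path in sorted(glob.glob("/data/gb_structures/*.cfg")):
        # One worker process per structure file; a failure is logged but
        # does not abort the batch, so the sweep always runs to completion.
        result = subprocess.run([sys.executable, "analyze_gb.py", path])
        if result.returncode != 0:
            print(f"analysis failed for {path}", file=sys.stderr)
            ok = False
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())
```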

Compared with manual “eye detection” approaches, the pipeline achieves dramatically higher classification throughput and accuracy, while parallelization of the clustering steps keeps processing times tractable for large data volumes.

5. Applications and Scientific Implications

The DataForge Pipeline has direct utility in computational materials science:

  • Accelerated Discovery: Facilitates high-throughput mapping of the GB phase space, revealing new structural motifs and phases.
  • Structure-Property Mapping: Links physically meaningful engineered features to mechanical and thermal GB behaviors, aiding prediction and design.
  • High-Throughput Studies: Enables systematic exploration of GB response to environmental perturbations (e.g., temperature, pressure).
  • Database Construction: Forms the analytical backbone for data-driven materials property databases and machine learning–assisted structural classification.
  • Experimental Guidance: Informs experimental design by computationally predicting interface structures most likely to exhibit desirable macroscopic properties.

These capabilities support broad-scale simulation and experimental campaigns, enhancing both the depth and practicality of materials discovery and optimization.

6. Key Equations and Algorithmic Steps

Critical mathematical formulations in the pipeline include:

| Algorithmic Step | Formula (LaTeX notation) | Purpose |
| --- | --- | --- |
| GB free energy | \gamma A = E - TS - \sigma_{33}V = [E]_N - T[S]_N - \sigma_{33}[V]_N | Quantifies energetic difference of GB vs. bulk |
| Excess volume and stress | [V]_N, \tau_{11}, \tau_{22} (see Section 2) | Measures mechanical property deviations |
| Steinhardt order parameters | [Q]_N = \frac{1}{A}(Q - Q^{bulk}(N/N^{bulk})) | Captures local order motifs |
| Density-based clustering | \rho_i = \sum_j \chi(d_{ij} - d_c); \delta_i = \min_{j: \rho_j > \rho_i}(d_{ij}) | Identifies cluster centers in feature space |
| K-means assignment/update | S_i^{(t)} and m_i^{(t+1)} (see Section 3.2) | Iteratively assigns and refines cluster membership |

This formalization underpins both the physical interpretability and computational tractability of the pipeline methodology.

7. Conclusion

The DataForge Pipeline exemplifies a rigorously engineered integration of feature extraction, unsupervised clustering, and workflow automation for the classification of GB atomic structures. By mapping high-dimensional atomic configurations into a concise and physically meaningful feature space, and employing parallelized clustering, the pipeline achieves highly efficient, accurate, and scalable classification. Its implications extend across computational materials science, providing foundational infrastructure for high-throughput simulation studies, data-driven materials informatics, and fundamental investigations into the structural phases and transitions of polycrystalline systems.