DataForge Pipeline: Automated GB Analysis

Updated 27 August 2025
  • DataForge Pipeline is a modular system that converts high-dimensional atomic configurations into an 8-dimensional feature space for efficient grain boundary classification.
  • It combines density-based clustering, which identifies structural motifs without requiring a pre-defined cluster count, with parallel K-means clustering for fast, scalable partitioning.
  • The pipeline integrates automation and batch processing to rapidly analyze thousands of GB structures, accelerating materials discovery and design.

The DataForge Pipeline is a modular, automated analysis pipeline developed for the classification and recognition of grain boundary (GB) structures generated by evolutionary algorithms, particularly those employing the USPEX method. Designed to operate on thousands of high-dimensional atomic configurations, it systematically transforms raw atomic coordinate data into a low-dimensional engineered feature space, applies robust clustering algorithms, and automates the overall workflow to achieve accurate, efficient structural classification in materials science.

1. Architecture and Workflow

The DataForge Pipeline comprises four principal components:

  • Feature Engineering Module: Transforms each input GB atomic configuration (originally located in a 3N-dimensional coordinate space) into an 8-dimensional feature vector based on physically and structurally meaningful excess properties.
  • Density-Based Clustering Module: Utilizes local density and distance-to-higher-density metrics to identify cluster centers within the engineered feature space, enabling unsupervised classification that does not require pre-specified cluster counts.
  • Parallel K-Means Clustering Module: Employs parallelized K-means clustering (accelerated via OpenACC or CUDA) to partition the feature vectors into clusters by iteratively updating cluster assignments and centroids.
  • Automation Infrastructure: Orchestrates batch processing and job scheduling (e.g., with Linux Crontab) to automate feature extraction and clustering analyses over large ensembles of structures, minimizing manual intervention and maximizing throughput.

This architecture enables the efficient processing of thousands of GB structures, facilitating automated structure analysis and improved physical understanding of GB phases.
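
To make the modular composition concrete, the sketch below shows how the four stages could be wired together in Python. It is a minimal illustration under stated assumptions, not the pipeline's actual code: all function names are hypothetical, and the two clustering stages are sketched individually in Section 3.

```python
import numpy as np

def run_pipeline(structures, extract_features, find_density_centers, run_kmeans):
    """Compose the pipeline stages. The stage implementations are injected
    as callables, mirroring the modular design: each module can be swapped
    or parallelized independently."""
    # Feature engineering: one 8-dimensional vector per GB structure.
    feats = np.array([extract_features(s) for s in structures])
    # Density-based pass: locate cluster centers without a preset count.
    centers = find_density_centers(feats)
    # Parallel K-means pass: refine the partition with k = len(centers).
    labels, _ = run_kmeans(feats, len(centers))
    return feats, labels
```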

2. Feature Engineering for Grain Boundaries

Feature engineering in the DataForge Pipeline is focused on mapping each complex atomic structure into an interpretable, lower-dimensional feature space. The process involves:

  • Calculation of Excess Properties:

    • Free Energy [γ]ₙ: Extracted from the GB free energy relation, capturing energetic differences with respect to bulk:

    \gamma A = E - TS - \sigma_{33}V = [E]_N - T[S]_N - \sigma_{33}[V]_N

    • Excess Atomic Volume [V]ₙ and Stress Components (τ₁₁, τ₂₂): Quantified as deviations from bulk properties using comparative expressions:

    [V]_N = \frac{1}{A}[V - V^{bulk}(N/N^{bulk})]

    \tau_{11,22} = \frac{1}{A}[\sigma_{11,22}V - \sigma_{11,22}^{bulk}V^{bulk}(N/N^{bulk})]

    • Excess Steinhardt Order Parameters ([Q₄]ₙ, [Q₆]ₙ, [Q₈]ₙ, [Q₁₂]ₙ): Capture local atomic ordering and structural motif differences with respect to bulk, using Voronoi constructions for atomic volumes:

    [Q]_N = \frac{1}{A}(Q - Q^{bulk}(N/N^{bulk}))

Each GB structure is ultimately encoded as:

f = ([\gamma]_N, [V]_N, \tau_{11}, \tau_{22}, [Q_4]_N, [Q_6]_N, [Q_8]_N, [Q_{12}]_N)

This compact feature representation preserves essential physical, mechanical, and geometric information, enabling subsequent clustering algorithms to discriminate GB phases and structural families.
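
As a concrete illustration of this encoding, the following Python sketch computes the eight excess properties from per-cell totals, assuming the GB and bulk reference quantities (energy, entropy, volume, stress components, and summed Steinhardt parameters) have already been extracted from the simulation; the dictionary schema and function names are illustrative, not from the source.

```python
import numpy as np

def excess(x, x_bulk, n, n_bulk, area):
    """Generic excess property per unit GB area:
    [X]_N = (X - X_bulk * (N / N_bulk)) / A."""
    return (x - x_bulk * (n / n_bulk)) / area

def gb_feature_vector(gb, bulk, area, temperature):
    """Encode one GB structure as
    f = ([gamma]_N, [V]_N, tau_11, tau_22, [Q4]_N, [Q6]_N, [Q8]_N, [Q12]_N).

    `gb` and `bulk` are dicts of per-cell totals (illustrative schema):
    E, S, V, N, stress components s11/s22/s33, and Q4..Q12.
    """
    n, nb = gb["N"], bulk["N"]
    # Excess energy, entropy, and volume per unit GB area.
    e_n = excess(gb["E"], bulk["E"], n, nb, area)
    s_n = excess(gb["S"], bulk["S"], n, nb, area)
    v_n = excess(gb["V"], bulk["V"], n, nb, area)
    # GB free energy per unit area: gamma = [E]_N - T[S]_N - sigma_33 [V]_N.
    gamma = e_n - temperature * s_n - gb["s33"] * v_n
    # In-plane excess stresses from stress-volume products.
    tau11 = excess(gb["s11"] * gb["V"], bulk["s11"] * bulk["V"], n, nb, area)
    tau22 = excess(gb["s22"] * gb["V"], bulk["s22"] * bulk["V"], n, nb, area)
    # Excess Steinhardt order parameters.
    q_n = [excess(gb[q], bulk[q], n, nb, area) for q in ("Q4", "Q6", "Q8", "Q12")]
    return np.array([gamma, v_n, tau11, tau22, *q_n])
```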

3. Clustering Algorithms

3.1 Density-Based Clustering (Rodriguez-Laio Approach)

Each feature vector f^i is analyzed for its local density \rho_i and its distance \delta_i to higher-density points:

  • Local Density:

\rho_i = \sum_j \chi(d_{ij} - d_c)

where d_{ij} = \|f^i - f^j\|_2, \chi(x) = 1 for x < 0 and 0 otherwise, and d_c is a cutoff chosen based on the typical nearest-neighbor separation.

  • Distance to Higher Density Points:

\delta_i = \min_{j: \rho_j > \rho_i}(d_{ij})

The (\rho, \delta) decision graph reveals cluster centers as points with anomalously high \rho and \delta, yielding a natural classification of GB structures into groups (e.g., Kite, Split Kite, Extended Kite) without an a priori cluster count. This method is robust for clusters with arbitrary geometries and readily exposes outliers.
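
A compact numpy sketch of the decision-graph quantities follows, assuming the feature vectors are stacked in an (n, 8) array; cutoff selection and the final reading-off of centers from the graph are omitted.

```python
import numpy as np

def decision_graph(features, d_c):
    """Compute (rho_i, delta_i) for each feature vector f^i.

    rho_i counts neighbors within the cutoff d_c; delta_i is the distance
    to the nearest point of higher density (for the global density maximum,
    the usual convention is the maximum pairwise distance).
    """
    # Pairwise Euclidean distances d_ij = ||f^i - f^j||_2.
    diff = features[:, None, :] - features[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density: rho_i = sum_j chi(d_ij - d_c), with chi(x) = 1 for x < 0.
    rho = (d < d_c).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    # Distance to the nearest higher-density point.
    delta = np.empty(len(features))
    for i in range(len(features)):
        higher = d[i, rho > rho[i]]
        delta[i] = higher.min() if higher.size else d[i].max()
    return rho, delta
```

Centers are then picked as the points with simultaneously large \rho and \delta, and each remaining point inherits the cluster label of its nearest higher-density neighbor.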

3.2 Parallel K-Means Clustering

Standard K-means is implemented with parallel acceleration:

  • Assignment:

S_i^{(t)} = \{x_p : \|x_p - m_i^{(t)}\|_2^2 \leq \|x_p - m_j^{(t)}\|_2^2 \ \forall j\}

  • Centroid Update:

m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x \in S_i^{(t)}} x

Parallelization is achieved via OpenACC or CUDA, exploiting data-level independence in distance calculations. This enables scalable, rapid clustering of thousands of GB structures.
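
The source attributes the speedup to OpenACC or CUDA kernels; as a language-neutral illustration of the same data-parallel structure, here is a vectorized numpy sketch in which the per-point distance computations of the assignment step are independent, which is exactly what maps onto GPU threads.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means. The assignment step is embarrassingly parallel
    across points, which the OpenACC/CUDA implementations exploit."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: each point goes to its nearest centroid
        # (squared distances computed for all point-centroid pairs at once).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```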

4. Automation and Efficiency Strategy

The pipeline is fully automated:

  • Input Acquisition: Raw atomic positions from evolutionary predictions are parsed automatically.
  • Batch Feature Calculation: Scripts compute eight excess properties per structure in a highly parallel fashion.
  • Automated Clustering Execution: Density-based and parallel K-means clustering routines assign cluster labels en masse.
  • Post-processing: Automated generation of decision graphs and feature maps allows immediate validation and interpretation of the cluster assignments.
  • Job Scheduling: Linux Crontab and related tools manage scheduled batch operations, maximizing computational resource utilization.
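
For the scheduling layer itself, a minimal (hypothetical) setup is a crontab entry such as `0 2 * * * python3 /opt/dataforge/run_batch.py >> /var/log/dataforge.log 2>&1` invoking a batch driver along the lines below; the paths, glob pattern, and `analyze_gb.py` worker script are illustrative, not from the source.

```python
#!/usr/bin/env python3
"""Hypothetical nightly batch driver invoked from crontab: runs the
feature-extraction-plus-clustering analysis over every structure file."""
import glob
import subprocess
import sys

def main() -> int:
    ok = True
    for path in sorted(glob.glob("/data/gb_structures/*.cfg")):
        # One worker process per structure file; a failure is logged but
        # does not abort the batch, so the sweep always runs to completion.
        result = subprocess.run([sys.executable, "analyze_gb.py", path])
        if result.returncode != 0:
            print(f"analysis failed for {path}", file=sys.stderr)
            ok = False
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())
```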

Compared with manual “eye detection” approaches, the pipeline achieves dramatically higher classification throughput and accuracy, while parallelization of the clustering steps keeps processing times tractable for large data volumes.

5. Applications and Scientific Implications

The DataForge Pipeline has direct utility in computational materials science:

  • Accelerated Discovery: Facilitates high-throughput mapping of the GB phase space, revealing new structural motifs and phases.
  • Structure-Property Mapping: Links physically meaningful engineered features to mechanical and thermal GB behaviors, aiding prediction and design.
  • High-Throughput Studies: Enables systematic exploration of GB response to environmental perturbations (e.g., temperature, pressure).
  • Database Construction: Forms the analytical backbone for data-driven materials property databases and machine learning–assisted structural classification.
  • Experimental Guidance: Informs experimental design by computationally predicting interface structures most likely to exhibit desirable macroscopic properties.

These capabilities support broad-scale simulation and experimental campaigns, enhancing both the depth and practicality of materials discovery and optimization.

6. Key Equations and Algorithmic Steps

Critical mathematical formulations in the pipeline include:

| Algorithmic Step | Formula (LaTeX notation) | Purpose |
| --- | --- | --- |
| GB free energy | \gamma A = E - TS - \sigma_{33}V = [E]_N - T[S]_N - \sigma_{33}[V]_N | Quantifies energetic difference of GB vs. bulk |
| Excess volume and stress | [V]_N, \tau_{11}, \tau_{22} (see Section 2) | Measures mechanical property deviations |
| Steinhardt order parameters | [Q]_N = \frac{1}{A}(Q - Q^{bulk}(N/N^{bulk})) | Captures local order motifs |
| Density-based clustering | \rho_i = \sum_j \chi(d_{ij} - d_c); \delta_i = \min_{j: \rho_j > \rho_i}(d_{ij}) | Identifies cluster centers in feature space |
| K-means assignment/update | S_i^{(t)} and m_i^{(t+1)} (see Section 3.2) | Iteratively assigns and refines cluster membership |

This formalization underpins both the physical interpretability and computational tractability of the pipeline methodology.

7. Conclusion

The DataForge Pipeline exemplifies a rigorously engineered integration of feature extraction, unsupervised clustering, and workflow automation for the classification of GB atomic structures. By mapping high-dimensional atomic configurations into a concise and physically meaningful feature space, and employing parallelized clustering, the pipeline achieves highly efficient, accurate, and scalable classification. Its implications extend across computational materials science, providing foundational infrastructure for high-throughput simulation studies, data-driven materials informatics, and fundamental investigations into the structural phases and transitions of polycrystalline systems.