ManufactuBERT: Optimized NLP for Manufacturing

Updated 14 November 2025
  • ManufactuBERT is a Transformer-based NLP model optimized for manufacturing through continual pretraining and semantic deduplication, achieving state-of-the-art performance.
  • It employs a two-stage deduplication process, MinHash followed by SemDeDup, that shrinks the filtered corpus by roughly 80%, ensuring efficient and robust training.
  • Training on the deduplicated corpus reaches equivalent downstream performance in 35.3% fewer steps, yielding roughly 32.8% net energy savings while matching or surpassing baselines on manufacturing-centric benchmarks.

ManufactuBERT is a Transformer-based language model specialized for NLP applications in the manufacturing domain. Built by continually pretraining a RoBERTa backbone on an extensive, curated, and semantically deduplicated manufacturing corpus, ManufactuBERT demonstrates superior performance on a suite of manufacturing-centric NLP benchmarks while substantially improving training efficiency. The approach combines systematic data filtering with multi-stage deduplication, enabling state-of-the-art performance and resource savings, and offers a reproducible pipeline extendable to other specialized domains.

1. Model Architecture and Pretraining Objective

ManufactuBERT utilizes the RoBERTa-base architecture, consisting of 12 Transformer layers, 768-dimensional hidden representations, and 12 self-attention heads, without introducing architectural changes such as adapters or custom attention modules. The key deviation from Liu et al. (2019) lies in extending continual pretraining steps from 12,500 to 17,500.

The model is trained exclusively with the Masked Language Modeling (MLM) objective. The MLM loss for a batch is defined as

$$\mathcal{L}_{\text{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log p_\theta\bigl(x_i \mid x_{\setminus M}\bigr)$$

where $M$ denotes the set of masked positions, $x_{\setminus M}$ the input sequence with masked tokens replaced by [MASK], and $p_\theta$ the model's predicted token distribution. The per-token masking probability is set at 0.15. The next sentence prediction objective is omitted.
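
For concreteness, this setup maps onto standard Hugging Face components. The sketch below is illustrative (the example sentence and printed checks are not from the paper) and shows the RoBERTa-base dimensions together with 15% dynamic masking for the MLM objective.

```python
# Minimal sketch of the MLM setup (illustrative; the example sentence is a placeholder).
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# RoBERTa-base: 12 layers, 768-dimensional hidden states, 12 attention heads.
cfg = model.config
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 12 768 12

# Dynamic masking with the standard 15% masking probability; no NSP objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# The collator masks ~15% of tokens and builds labels only for the masked
# positions, matching the MLM loss defined above.
batch = collator([tokenizer("Additive manufacturing reduces tooling costs.")])
print(batch["input_ids"].shape, batch["labels"].shape)
```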

2. Manufacturing Domain Corpus: Filtering and Deduplication

2.1 Domain-Specific Filtering

ManufactuBERT pretraining draws from FineWeb, a web-scale (≈15 T tokens) corpus with existing quality filters. A FastText classifier is trained to select manufacturing-relevant documents, with positive examples curated from Elsevier abstracts (industrial and manufacturing journals), arXiv (cond-mat, physics, and engineering subdomains), Wikipedia (manufacturing-related articles), and BigPatent (patents containing "manufacturing"). Selection criteria and sample sizes are summarized below:

Source | Selection Criterion | #Documents
Elsevier abstracts | Industrial & Manufacturing journals | 27,000
arXiv | cond-mat, physics, eess subdomains with keywords | 2,300
Wikipedia | "Manufacturing" and related pages | 5,000
BigPatent | Patents containing "manufacturing" | 26,000

Negative samples are randomly selected FineWeb documents at a 10:1 ratio. The classifier threshold yields a filtered Manu-FineWeb corpus of approximately 10 billion tokens (21 million documents).
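
Such a domain classifier can be trained with the fastText library. The sketch below is a hedged illustration: the training file, hyperparameters, and acceptance threshold are assumptions rather than the authors' released configuration.

```python
# Sketch of the domain-filtering step with fastText (file name, threshold,
# and hyperparameters are illustrative assumptions, not the released config).
import fasttext

# train.txt contains one document per line in fastText format, e.g.:
#   __label__manufacturing  Abstract text from an industrial-engineering journal ...
#   __label__other          Randomly sampled FineWeb document ...
clf = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)

def is_manufacturing(doc_text: str, threshold: float = 0.5) -> bool:
    """Keep a FineWeb document if the classifier scores it as manufacturing-related."""
    labels, probs = clf.predict(doc_text.replace("\n", " "))
    return labels[0] == "__label__manufacturing" and probs[0] >= threshold

print(is_manufacturing("CNC machining tolerances for aerospace components ..."))
```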

2.2 Multi-Stage Deduplication Pipeline

Residual near-duplicate content is mitigated via a two-stage process:

  • Stage A (MinHash deduplication): employs 20 buckets with 20 signatures each, efficiently removing exact and near-lexical duplicates.
  • Stage B (semantic deduplication, SemDeDup): documents are segmented into non-overlapping 512-token chunks, embedded with all-MiniLM-L6-v2, and aggregated into single 384-dimensional document vectors. K-means clustering (K = 1000) groups the vectors, and within each cluster documents whose cosine distance to their nearest neighbor is below 0.15 are removed (a sketch of this stage follows below).

The combined process reduces the corpus by ~80%, yielding Manu-FineWeb-Dedup of 2 billion tokens across 4.5 million documents.
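
Stage B can be approximated with off-the-shelf components. The sketch below follows the chunk-embed-cluster-filter description above, while the `docs` placeholder, whitespace-based chunking, mean pooling over chunk embeddings, and the greedy keep-one-representative rule are simplifying assumptions.

```python
# Sketch of SemDeDup-style semantic deduplication (illustrative; the `docs` list,
# whitespace chunking, mean pooling, and greedy removal rule are assumptions).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def embed_document(text: str, chunk_words: int = 512) -> np.ndarray:
    """Split into non-overlapping chunks of ~512 whitespace tokens, embed, and average."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, max(len(words), 1), chunk_words)]
    return encoder.encode(chunks).mean(axis=0)

docs = ["..."]  # filtered Manu-FineWeb documents (placeholder)
doc_vecs = np.stack([embed_document(d) for d in docs])

# Cluster document vectors (K = 1000 in the paper; capped here for tiny inputs).
kmeans = KMeans(n_clusters=min(1000, len(docs)), n_init=10, random_state=0).fit(doc_vecs)

# Within each cluster, drop a document if its cosine distance to an already-kept
# document is below 0.15, i.e. keep one representative per near-duplicate group.
to_remove: set[int] = set()
for c in range(kmeans.n_clusters):
    idx = np.where(kmeans.labels_ == c)[0]
    sims = cosine_similarity(doc_vecs[idx])
    kept_local: list[int] = []
    for i in range(len(idx)):
        if any(1.0 - sims[i, j] < 0.15 for j in kept_local):
            to_remove.add(int(idx[i]))
        else:
            kept_local.append(i)

kept_docs = [d for i, d in enumerate(docs) if i not in to_remove]
```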

The corpus construction workflow can be schematized as:

FineWeb (15T tokens) → FastText filter → Manu-FineWeb (10B tokens, 21M docs) → MinHash deduplication → SemDeDup → Manu-FineWeb-Dedup (2B tokens, 4.5M docs)

3. Training Regime, Efficiency, and Resource Utilization

3.1 Hyperparameters and Computational Setup

  • Initialization from public roberta-base checkpoint.
  • AdamW optimizer with weight decay 0.1; linear warmup over 6% of steps to a peak learning rate of 5×10⁻⁴, then linear decay.
  • Batch size: 16 sequences per GPU, with gradient accumulation for an effective batch size of 2,048 across 8 GPUs.
  • Maximum steps: 17,500; checkpointing every 500 steps.
  • Hardware: 8 × NVIDIA V100 (32 GB) GPUs; wall-clock time for the full run: 51 hours.
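
Mapped onto the Hugging Face Trainer, these hyperparameters would look roughly as follows; the dataset object and output path are placeholders, and the gradient-accumulation factor is inferred from the stated per-GPU and effective batch sizes rather than quoted from the paper.

```python
# Sketch of the training configuration (the dataset object is a placeholder and
# the gradient-accumulation factor is inferred from the stated batch sizes).
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="manufactubert",          # illustrative output path
    max_steps=17_500,
    per_device_train_batch_size=16,      # 16 sequences per GPU, 8 GPUs
    gradient_accumulation_steps=16,      # 16 * 16 * 8 = 2,048 effective batch (inferred)
    learning_rate=5e-4,                  # peak LR reached after linear warmup
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                   # warmup over 6% of steps
    weight_decay=0.1,                    # AdamW (Trainer's default optimizer)
    save_steps=500,                      # checkpoint every 500 steps
)

train_dataset = ...  # tokenized Manu-FineWeb(-Dedup); not reproduced here

trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
trainer.train()
```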

3.2 Effect of Semantic Deduplication on Convergence

Two model variants are trained:

  • ManufactuBERT: no semantic deduplication, entire Manu-FineWeb (10B tokens).
  • ManufactuBERTD: SemDeDup, Manu-FineWeb-Dedup (2B tokens).

Evaluations on FabNER demonstrate that ManufactuBERTD matches the final F1 of ManufactuBERT after only 11,308 steps, a 35.3% reduction in training steps:

$$\frac{17{,}500 - 11{,}308}{17{,}500} \approx 0.353 = 35.3\%$$

3.3 Energy Consumption and Efficiency

Assuming each of the 8 V100s operates at 250 W (a 2,000 W total draw):

  • Full training energy for 17,500 steps: E_full ≈ 102,200 Wh.
  • Early convergence at 11,308 steps: E_conv ≈ 66,032 Wh.
  • Additional deduplication pipeline cost: 2,606 Wh.

Net energy savings:

$$\frac{102{,}200 - (66{,}032 + 2{,}606)}{102{,}200} \approx 32.8\%$$
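
These figures follow directly from the quoted numbers; the short calculation below assumes a constant 2,000 W draw (8 × 250 W) and uses the reported energy values as given.

```python
# Reproduce the quoted efficiency figures, assuming a constant 8 * 250 W = 2,000 W
# draw over wall-clock time and the step/energy values reported above.
total_steps, conv_steps = 17_500, 11_308
step_reduction = (total_steps - conv_steps) / total_steps   # ~0.354 (quoted as 35.3%)

e_full = 102_200    # Wh, full 17,500-step run (2,000 W over ~51 h)
e_conv = 66_032     # Wh, run stopped at 11,308 steps
e_dedup = 2_606     # Wh, cost of the deduplication pipeline itself

net_savings = (e_full - (e_conv + e_dedup)) / e_full
print(f"step reduction: {step_reduction:.1%}, net energy savings: {net_savings:.1%}")
# -> step reduction: 35.4%, net energy savings: 32.8%
```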

A plausible implication is that semantic deduplication not only reduces corpus size and redundancy but also yields significant resource and environmental benefits during domain pretraining.

4. Performance Evaluation: Downstream Tasks and Statistical Significance

4.1 Benchmark Tasks

Evaluation covers six NER corpora and three RE/SC tasks, with GLUE serving as a general-domain baseline:

  • NER: FabNER, Materials Synthesis, SOFC & SOFC-Slot, MatScholar, CHEMDNER
  • RE/SC: Materials Synthesis (RE), SOFC (SC), BigPatent (9-class SC)

4.2 Results Overview

ManufactuBERTD achieves the highest average F1 on both the NER and RE/SC task groups.

NER Results (µ-F1):

Model | Avg. F1
RoBERTa | 80.37
SciBERT | 81.18
MatSciBERT | 82.16
ManufactuBERT | 82.45
ManufactuBERTD | 82.63

RE/SC Results (µ-F1):

Model | Avg. F1
RoBERTa | 84.07
SciBERT | 84.69
MatSciBERT | 84.67
ManufactuBERT | 84.85
ManufactuBERTD | 85.03

ManufactuBERTD sets a new state-of-the-art on 4 of the 9 tasks. Adopting SemDeDup consistently confers gains of 0.2–0.3 F1 points despite an 80% reduction in pretraining data, indicating that model performance is not strictly proportional to corpus size but depends heavily on corpus quality and relevance.

4.3 Statistical Assessment

The Almost Stochastic Order (ASO) test is employed (five fine-tuning seeds, α = 0.05 with Bonferroni correction, rejection threshold T = 0.5). ManufactuBERTD is stochastically dominant over:

  • ManufactuBERT on all 9 tasks,
  • MatSciBERT on 5 tasks,
  • RoBERTa and NeoBERT on all 9 tasks.
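
As a hedged illustration, a per-task comparison of this kind could be run with the deepsig package, which implements the ASO test; the per-seed scores below are dummy placeholders, and the paper's exact correction and confidence settings may differ.

```python
# Illustrative per-task ASO comparison with the deepsig package; the score lists
# are dummy placeholders, not the paper's actual per-seed results.
from deepsig import aso

# Micro-F1 from five fine-tuning seeds on a single task (placeholder values).
scores_dedup = [82.7, 82.5, 82.8, 82.6, 82.6]   # ManufactuBERTD
scores_base  = [82.4, 82.3, 82.5, 82.4, 82.5]   # ManufactuBERT

# eps_min below the threshold (T = 0.5 here) indicates that the first model is
# almost stochastically dominant over the second.
eps_min = aso(scores_dedup, scores_base)
print(f"eps_min = {eps_min:.3f}")
```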

4.4 General-Domain Language Performance

GLUE benchmark comparisons indicate that both ManufactuBERT variants maintain general-language capabilities:

Model | GLUE Avg.
RoBERTa | 86.35
SciBERT | 78.13
MatSciBERT | 77.00
ManufactuBERT | 81.84
ManufactuBERTD | 81.78

Both variants exhibit ≤4.6 points average loss versus RoBERTa but outperform SciBERT/MatSciBERT by 3.6–4.8 points, demonstrating that manufacturing specialization does not severely diminish general-domain utility.

5. Reproducibility and Public Release

ManufactuBERT provides a reproducible pipeline:

  • Data Filtering: FastText classifier with code and sampling scripts.
  • Deduplication: Datatrove library for MinHash plus a custom SemDeDup implementation; all deduplication hyperparameters are declared: MinHash (20 buckets, 20 signatures) and SemDeDup (K = 1000, cosine-distance threshold T = 0.15). A rough MinHash stand-in is sketched after this list.
  • Training: HuggingFace Transformers with fixed seeds, regular checkpointing.
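
For Stage A, a rough stand-in using the datasketch library (rather than the Datatrove pipeline actually released) is sketched below; the 20 bands × 20 hashes mirror the stated configuration, while the shingling scheme and toy corpus are assumptions.

```python
# Rough stand-in for the MinHash stage using the datasketch library (the released
# pipeline uses Datatrove); shingle size and the toy corpus dict are assumptions.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 400) -> MinHash:
    """MinHash over word 5-gram shingles (shingle size is an assumption)."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

# 20 bands of 20 hashes each, mirroring the stated 20-bucket / 20-signature setup.
lsh = MinHashLSH(num_perm=400, params=(20, 20))

docs = {"doc-0": "...", "doc-1": "..."}   # placeholder corpus
kept = []
for key, text in docs.items():
    sig = minhash_of(text)
    if lsh.query(sig):                     # collides with an already-kept document
        continue
    lsh.insert(key, sig)
    kept.append(key)
```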

Model checkpoints (ManufactuBERT and ManufactuBERTD), as well as the filtered and deduplicated corpora, will be released via HuggingFace at https://huggingface.co/cea-list-ia for downstream use and further research.

6. Implications and Extensibility

The ManufactuBERT paradigm demonstrates the efficacy of careful domain-specific filtering followed by semantic deduplication for constructing compact, high-utility pretraining corpora. This approach yields performance improvements, reduced training duration and energy costs, and reproducibility. A plausible implication is that the outlined methodology could serve as a generalizable template for developing similarly optimized LLMs in other specialized scientific or technical domains. The public release of models, code, and corpora ensures access for both comparative benchmarking and domain adaptation studies.
