TSpec-LLM: 3GPP Telecom Standards LLM Dataset
- TSpec-LLM is a comprehensive open-source dataset comprising 30,137 documents from 3GPP Releases 8 to 19, offering detailed telecom specifications.
- It is built with a robust, reproducible pipeline covering repository mirroring, format conversion, and section segmentation with structured metadata to prepare documents for LLM tasks.
- The dataset supports retrieval-augmented generation by segmenting texts into token chunks and leveraging embedding-based indexing for precise question answering.
TSpec-LLM is an open-source dataset designed to facilitate LLM understanding of 3GPP (3rd Generation Partnership Project) technical specifications, which underpin modern telecommunication standards. It comprises the full corpus of 3GPP documents spanning Release 8 (1999) through Release 19 (2023), corresponding to the evolution from 3G through 4G/LTE, LTE-Advanced, and 5G NR. This resource provides both raw and preprocessed technical content tailored for information retrieval and LLM fine-tuning, addressing the intensive demands of telecom standards comprehension (Nikbakht et al., 2024).
1. Dataset Scope and Statistics
TSpec-LLM encapsulates the entirety of publicly available 3GPP specifications over a 24-year window, covering the main document series and technical reports relevant to mobile and wireless systems. The dataset includes:
- 30,137 documents distributed across releases 8–19.
- Aggregate word count: 535 million.
- Average document length: ~17,770 words.
- Document formats: original Microsoft Word (.docx) and preprocessed Markdown (.md).
- Series coverage: 3G, 4G/LTE, LTE-Advanced, 5G NR.
The following table summarizes per-release statistics (values approximate):
| Release | Year | # Docs | Size (MB) | Word Count (M) |
|---|---|---|---|---|
| 8 | 1999 | 1,200 | 150 | 4.2 |
| 9 | 2000 | 1,450 | 180 | 5.0 |
| 10 | 2004 | 1,800 | 210 | 6.1 |
| 11 | 2008 | 2,100 | 240 | 7.0 |
| 12 | 2014 | 2,500 | 300 | 8.8 |
| 13 | 2016 | 2,900 | 340 | 9.9 |
| 14 | 2017 | 3,200 | 380 | 11.2 |
| 15 | 2018 | 3,700 | 430 | 12.5 |
| 16 | 2019 | 3,900 | 460 | 13.1 |
| 17 | 2020 | 4,100 | 490 | 13.8 |
| 18 | 2021 | 2,800 | 330 | 9.4 |
| 19 | 2023 | 1,587 | 180 | 6.0 |
| **Total** | 1999–2023 | 30,137 | 3,940 | 535 |
This coverage ensures inclusion of the full context for feature evolution, technical parameterization, and standard-building processes in cellular systems.
2. Data Acquisition and Preprocessing Pipeline
The assembly of TSpec-LLM utilizes a robust, reproducible workflow:
- Mirroring: Employs the download3gpp tool (v0.7.0) to recursively mirror the entire 3GPP repository for Releases 8–19.
- Document Conversion: Batch conversion of .docx files to Markdown using headless LibreOffice:
  `libreoffice --headless --convert-to markdown filename.docx`
- Cleaning: Metadata headers and footers are stripped; all Office Math ML equations are transformed into inline LaTeX; table contents and captions are fully retained.
- Section Segmentation: The Markdown heading hierarchy mirrors 3GPP section numbering; the regex `^(\#+)\s+Section\s+([\d\.]+)` maps heading depth to section/subsection indices.
- Metadata Structure: Each Markdown file's YAML front matter encodes `doc_id` (e.g., "3GPP TS 38.901"), `release` (integer), `series` (integer), and `title` (the verbatim specification title).
- Directory Structure: `{Release}_{N}/Series_{M}/TS_{X}.docx`, with a parallel Markdown tree.
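The segmentation step described above can be sketched in a few lines. This is a minimal illustration assuming headings of the form `# Section 7.3 <title>` (the function name `index_sections` is illustrative, not part of the pipeline):

```python
import re

# Regex from the pipeline: heading depth (number of #'s) plus the 3GPP section number.
SECTION_RE = re.compile(r"^(\#+)\s+Section\s+([\d\.]+)")

def index_sections(markdown_text):
    """Map each 3GPP section number to its heading depth and title."""
    index = {}
    for line in markdown_text.splitlines():
        m = SECTION_RE.match(line)
        if m:
            hashes, number = m.groups()
            index[number] = {"depth": len(hashes), "title": line[m.end():].strip()}
    return index

doc = """# Section 7 Channel models
## Section 7.3 Antenna modelling
### Section 7.3.1 Antenna port mapping
"""
print(index_sections(doc))
```

The resulting map lets a retrieval index attach every chunk to its exact specification section.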
The final artifact is a publicly accessible Hugging Face repository (rasoul-nikbakht/TSpec-LLM), occupying ~15 GB and reflecting the official specification hierarchy.
3. Retrieval-Augmented Generation (RAG) Setup
TSpec-LLM supports retrieval-augmented workflows for question answering and automated document parsing.
- Chunking Strategy: Each Markdown document is split into fixed-size token windows with a 100-token overlap; the resulting chunks serve as "nodes" in retrieval.
- Embeddings: Chunks are embedded using OpenAI text-embedding-ada-002, interfaced via Google Generative Language Semantic Retriever.
- Indexing: LlamaIndex provides an in-memory vector store; the baseline configuration (“naive-RAG”) forgoes IVF or vector quantization.
- Similarity Metric: Cosine similarity between the query embedding $\mathbf{q}$ and each chunk embedding $\mathbf{c}$:
  $$\text{sim}(\mathbf{q}, \mathbf{c}) = \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}$$
- Inference Pipeline:
  - The input query is embedded as a vector $\mathbf{q}$.
  - The top-$k$ closest chunks are retrieved by similarity.
  - The retrieved texts are concatenated into a single context block.
- Prompt construction: “You are a telecom standards expert. Use the following extracted excerpts to answer the question below.\n\n[Context]\n\nQuestion: …\nAnswer:”
- Prompt submitted to the LLM backbone.
This framework enables direct context-enhanced QA from the source specifications, reducing ambiguity and hallucinations typical of base LLMs.
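This retrieval loop can be sketched end-to-end in pure Python. In the sketch below a bag-of-words counter stands in for the ada-002 embeddings, the corpus is already chunked, and all names are illustrative:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for ada-002 vectors)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-k chunks ranked by cosine similarity to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Assemble the RAG prompt from the retrieved context."""
    context = "\n\n".join(context_chunks)
    return ("You are a telecom standards expert. Use the following extracted "
            f"excerpts to answer the question below.\n\n{context}\n\n"
            f"Question: {query}\nAnswer:")

chunks = [
    "The maximum directional gain of an antenna element is 8 dBi (Table 7.3-1).",
    "NR supports subcarrier spacings of 15, 30, 60, 120 and 240 kHz.",
]
top = retrieve("What is the maximum antenna element gain?", chunks, k=1)
print(build_prompt("What is the maximum antenna element gain?", top))
```

A production setup would swap `embed` for the real embedding API and the list for a LlamaIndex vector store, but the ranking and prompt assembly follow the same shape.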
4. LLM Baseline Evaluation
Comprehensive model validation is performed via structured multiple-choice technical questions derived from the 3GPP corpus.
- Models Used: GPT-3.5 Turbo, Gemini 1.0 Pro, GPT-4.
- Prompt Template:
- Direct question (“Select one option (A/B/C/D)”).
- Temperature: 0.0, top-p: 1.0, max tokens: 512.
- Context Window:
- Non-RAG: question + choices only.
- RAG: top-$k$ retrieved chunks (1,000 tokens) plus the question.
- Evaluation Metric: accuracy, i.e., the fraction of questions answered correctly (evaluated over 100 questions).
Results table:
| Model | Base | + naive-RAG |
|---|---|---|
| GPT-3.5 Turbo | 44% | 71% |
| Gemini 1.0 Pro | 46% | 72% |
| GPT-4 | 51% | 75% |
- Difficulty Analysis:
- RAG+TSpec-LLM accuracy: 93% (easy), 65% (intermediate), 68% (hard).
- Non-RAG: ~80%, 45%, 30% respectively.
- Model Confidence: For Gemini 1.0 Pro with RAG, 72 answers were given with a self-reported confidence of 1.0; 6 incorrect answers carried 0.8–0.9 confidence, and the remaining errors clustered around 0.6.
- Example:
- Q: “What is the maximum directional gain of an antenna element?”
- GPT-4 base answer: "10 dBi" (incorrect).
- With RAG: retrieval surfaces Table 7.3-1 of TR 38.901, and the model answers correctly: "8 dBi."
- Comparative Baseline:
- Naive-RAG on SPEC5G yields 60% accuracy versus 75% on TSpec-LLM.
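A minimal sketch of the multiple-choice accuracy harness described above; the answer-extraction rule, the stubbed model, and the sample questions are illustrative assumptions, not the paper's evaluation code:

```python
import re

def format_question(q, choices):
    """Render a question with A)-D) options and the selection instruction."""
    lines = [q] + [f"{letter}) {text}" for letter, text in zip("ABCD", choices)]
    lines.append("Select one option (A/B/C/D).")
    return "\n".join(lines)

def extract_choice(reply):
    """Pull the first standalone A-D letter out of a model reply."""
    m = re.search(r"\b([ABCD])\b", reply)
    return m.group(1) if m else None

def accuracy(model, questions):
    """questions: list of (question, choices, correct_letter) tuples."""
    correct = sum(
        extract_choice(model(format_question(q, c))) == ans
        for q, c, ans in questions
    )
    return correct / len(questions)

# Stub model that always answers "B", for illustration only.
stub = lambda prompt: "The answer is B."
qs = [
    ("Max directional gain of an antenna element (TR 38.901)?",
     ["10 dBi", "8 dBi", "6 dBi", "5 dBi"], "B"),
    ("Which document series covers NR radio aspects?",
     ["36", "38", "23", "24"], "B"),
    ("How many documents does TSpec-LLM contain?",
     ["10,000", "20,000", "30,137", "40,000"], "C"),
]
print(accuracy(stub, qs))
```

Swapping `stub` for an actual LLM call (with or without retrieved context prepended) reproduces the base versus naive-RAG comparison.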
5. Dataset Access and Usage Protocols
TSpec-LLM is readily available for download, loading, and model fine-tuning:
- Accessing:
- Install Git LFS.
- Clone:
  ```shell
  git clone https://huggingface.co/datasets/rasoul-nikbakht/TSpec-LLM
  ```
- Structure:
  ```
  TSpec-LLM/
    Release_15/Series_38/TS_38.901.docx
    Release_15/Series_38/TS_38.901.md
    ...
  ```
- Loading for RAG/fine-tuning:
- Read each Markdown, window into 1024-token segments with overlap, compute embeddings.
- Use LlamaIndex or FAISS for retrieval-based applications.
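A minimal loading sketch under stated assumptions: whitespace splitting stands in for a real tokenizer, and `window_tokens`/`iter_segments` are illustrative names, not part of the dataset tooling:

```python
from pathlib import Path

def window_tokens(text, window=1024, overlap=100):
    """Split text into overlapping windows of whitespace-delimited tokens."""
    tokens = text.split()
    step = window - overlap
    return [" ".join(tokens[i:i + window])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

def iter_segments(root, window=1024, overlap=100):
    """Yield (relative path, segment) for every Markdown file under root."""
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        for segment in window_tokens(text, window, overlap):
            yield str(path.relative_to(root)), segment
```

Each (path, segment) pair can then be embedded and pushed into LlamaIndex or FAISS, keyed by the file's release/series metadata.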
- Fine-tuning Recommendations (for open-source LLMs like Phi-3, LLaMA2):
- Learning rate:
- Batch size: 4–8
- Epochs: 2–3
- LoRA adapters: rank 8–16
- Sequence length: 2,048 tokens
- Early stopping on held-out technical QA.
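Collected as a plain configuration dictionary; the key names are assumptions loosely following Hugging Face `transformers`/`peft` conventions, and the learning rate is left unset because no value is given above:

```python
# Hypothetical fine-tuning configuration assembled from the recommendations
# above. Key names loosely follow Hugging Face transformers/peft conventions.
finetune_config = {
    "learning_rate": None,       # not specified here; tune per model
    "per_device_batch_size": 8,  # recommended range: 4-8
    "num_train_epochs": 3,       # recommended range: 2-3
    "lora_r": 16,                # LoRA adapter rank, recommended range: 8-16
    "max_seq_length": 2048,      # tokens
    "early_stopping": True,      # on held-out technical QA
}
print(finetune_config["lora_r"])
```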
- Best Practices:
- Retain LaTeX-formatted tables/equations for improved retrieval on technical queries.
- Utilize metadata (release/series/docID) for granular index partitioning and efficient lookup.
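The metadata-driven partitioning suggested above can be sketched by parsing the flat YAML front matter by hand (illustrative only; a real pipeline would use a YAML library):

```python
import re
from collections import defaultdict

FRONT_MATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.S)

def parse_front_matter(markdown_text):
    """Parse flat `key: value` YAML front matter (no nesting, illustrative)."""
    m = FRONT_MATTER_RE.match(markdown_text)
    meta = {}
    if m:
        for line in m.group(1).splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta

def partition_by_release_series(docs):
    """Group documents into per-(release, series) buckets for index sharding."""
    buckets = defaultdict(list)
    for text in docs:
        meta = parse_front_matter(text)
        buckets[(meta["release"], meta["series"])].append(meta["doc_id"])
    return buckets

doc = ('---\ndoc_id: "3GPP TS 38.901"\nrelease: 15\nseries: 38\n'
       'title: "Channel model"\n---\nBody text.')
print(partition_by_release_series([doc]))
```

Partitioning the vector index along these buckets keeps lookups scoped to the relevant release and series.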
6. Significance and Prospective Applications
TSpec-LLM represents a comprehensive, structured corpus for LLM adaptation in telecom standards interpretation (Nikbakht et al., 2024). A plausible implication is that such datasets will become foundational for specialized QA systems in regulated engineering domains. Its retrieval-oriented chunking and diligent semantic annotation set a precedent for domain-specific dataset construction, while its documented impact on model accuracy—especially for technically difficult queries—demonstrates the necessity of full-context specification corpora. TSpec-LLM can be leveraged for:
- Retrieval-augmented QA for telecom standards engineering.
- Direct instruction tuning and pre-training of open-source LLMs on telco corpora.
- Document parsing and automated report generation from 3GPP artifacts.
This approach points toward scalable methods of automating standards comprehension, with extension opportunities to other highly technical domains that rely on sprawling documentation.