
TSpec-LLM: 3GPP Telecom Standards LLM Dataset

Updated 1 February 2026
  • TSpec-LLM is a comprehensive open-source dataset comprising 30,137 documents from 3GPP Releases 8 to 19, offering detailed telecom specifications.
  • It utilizes a robust, reproducible pipeline with mirroring, format conversion, and rigorous metadata segmentation to prepare documents for LLM tasks.
  • The dataset supports retrieval-augmented generation by segmenting texts into token chunks and leveraging embedding-based indexing for precise question answering.

TSpec-LLM is an open-source dataset designed to facilitate LLM understanding of 3GPP (3rd Generation Partnership Project) technical specifications, as used in modern telecommunication standards. It comprises the full corpus of 3GPP documents spanning Release 8 (1999) through Release 19 (2023), corresponding to the evolution from 3G through 4G/LTE, LTE-Advanced, and 5G NR. This unique resource provides both raw and preprocessed technical content tailored for information retrieval and LLM fine-tuning scenarios, addressing the intensive demands of telecom standards comprehension (Nikbakht et al., 2024).

1. Dataset Scope and Statistics

TSpec-LLM encapsulates the entirety of publicly available 3GPP specifications over a 24-year window, covering the main document series and technical reports relevant to mobile and wireless systems. The dataset includes:

  • 30,137 documents distributed across releases 8–19.
  • Aggregate word count: 535 million.
  • Average document length: ~17,770 words.
  • Document formats: original Microsoft Word (.docx) and preprocessed Markdown (.md).
  • Series coverage: 3G, 4G/LTE, LTE-Advanced, 5G NR.

The following table summarizes per-release statistics (values approximate):

$\begin{array}{c|c|c|c|c} \text{Release} & \text{Year} & \#\text{Docs} & \text{Size (MB)} & \text{Word Count (M)} \\ \hline 8 & 1999 & 1,200 & 150 & 4.2 \\ 9 & 2000 & 1,450 & 180 & 5.0 \\ 10 & 2004 & 1,800 & 210 & 6.1 \\ 11 & 2008 & 2,100 & 240 & 7.0 \\ 12 & 2014 & 2,500 & 300 & 8.8 \\ 13 & 2016 & 2,900 & 340 & 9.9 \\ 14 & 2017 & 3,200 & 380 & 11.2 \\ 15 & 2018 & 3,700 & 430 & 12.5 \\ 16 & 2019 & 3,900 & 460 & 13.1 \\ 17 & 2020 & 4,100 & 490 & 13.8 \\ 18 & 2021 & 2,800 & 330 & 9.4 \\ 19 & 2023 & 1,587 & 180 & 6.0 \\ \hline \textbf{Total} & 1999\text{–}2023 & 30,137 & 3,940 & 535 \end{array}$

This coverage ensures inclusion of the full context for feature evolution, technical parameterization, and standard-building processes in cellular systems.

2. Data Acquisition and Preprocessing Pipeline

The assembly of TSpec-LLM utilizes a robust, reproducible workflow:

  • Mirroring: Employs the download3gpp tool (v0.7.0) to recursively mirror the entire 3GPP repository for Releases 8–19.
  • Document Conversion: Batch conversion of .docx files to Markdown using headless LibreOffice; command syntax: libreoffice --headless --convert-to markdown filename.docx.
  • Cleaning: Metadata headers and footers are stripped; Office Math ML equations are converted to inline LaTeX; table contents and captions are fully retained.
  • Section Segmentation: Markdown heading hierarchy mirrors 3GPP section numbers. Regex (^(\#+)\s+Section\s+([\d\.]+)) maps heading depth to section/subsection indices.
  • Metadata Structure: Each Markdown file’s YAML front matter encodes:
    • doc_id (e.g., “3GPP TS 38.901”)
    • release (integer)
    • series (integer)
    • title (verbatim specification title)
  • Directory Layout: {Release}_{N}/Series_{M}/TS_{X}.docx, with a parallel Markdown file alongside each document.
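The section-segmentation step can be sketched with the regex given above; a minimal illustration, assuming Markdown headings of the form “# Section 7.3” (the helper name is illustrative, not from the released pipeline code):

```python
import re

# Regex from the pipeline description: heading depth ('#' count) plus
# the dotted 3GPP section number, e.g. "## Section 7.3.1".
HEADING_RE = re.compile(r"^(#+)\s+Section\s+([\d\.]+)", re.MULTILINE)

def index_sections(markdown_text):
    """Map each heading to (depth, section number) for segmentation."""
    sections = []
    for match in HEADING_RE.finditer(markdown_text):
        depth = len(match.group(1))          # number of '#' characters
        number = match.group(2).rstrip(".")  # e.g. "7.3.1"
        sections.append((depth, number))
    return sections

doc = "# Section 7\nAntenna models.\n## Section 7.3\nGain tables.\n"
print(index_sections(doc))  # [(1, '7'), (2, '7.3')]
```

Heading depth maps directly to the 3GPP section hierarchy, so retrieved chunks can later be attributed to a specific section and subsection.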

The final artifact is a publicly accessible Hugging Face repository (rasoul-nikbakht/TSpec-LLM), occupying ~15 GB and reflecting the official specification hierarchy.

3. Retrieval-Augmented Generation (RAG) Setup

TSpec-LLM supports retrieval-augmented workflows for question answering and automated document parsing.

  • Chunking Strategy: Each Markdown document is windowed with a fixed size of $W = 1024$ tokens and a 100-token overlap; chunks serve as “nodes” in retrieval.
  • Embeddings: Chunks are embedded using OpenAI text-embedding-ada-002, interfaced via Google Generative Language Semantic Retriever.
  • Indexing: LlamaIndex provides an in-memory vector store; the baseline configuration (“naive-RAG”) forgoes IVF or vector quantization.
  • Similarity Metric: Cosine similarity for vectors $u, v \in \mathbb{R}^{D}$:

$\mathrm{sim}(u,v) = \dfrac{u^\top v}{\|u\|\,\|v\|} \in [-1,1]$

  • Inference Pipeline:
  1. The input query $q$ is embedded as $e_q$.
  2. The top-$K$ closest chunks $\{c_i\}_{i=1}^{K}$ are retrieved.
  3. Retrieved texts are concatenated:

    $\texttt{[Context]} = c_1 \,\|\, \cdots \,\|\, c_K$

  4. Prompt construction: “You are a telecom standards expert. Use the following extracted excerpts to answer the question below.\n\n[Context]\n\nQuestion: …\nAnswer:”
  5. Prompt submitted to the LLM backbone.
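The five steps above can be sketched end to end. In this minimal illustration a toy hashing embedder stands in for text-embedding-ada-002, and the prompt follows the template quoted above; all function names are illustrative:

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for a real embedding model: hash tokens into a vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity, as defined in the formula above."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-k chunks ranked by similarity to the query embedding."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate retrieved chunks into [Context] and fill the template."""
    context = "\n".join(retrieve(query, chunks, k))
    return ("You are a telecom standards expert. Use the following extracted "
            f"excerpts to answer the question below.\n\n{context}\n\n"
            f"Question: {query}\nAnswer:")

chunks = [
    "Table 7.3-1: maximum directional gain of an antenna element is 8 dBi.",
    "The channel model covers frequencies from 0.5 to 100 GHz.",
]
prompt = build_prompt("What is the maximum antenna element gain?", chunks)
print(prompt.splitlines()[0])
```

In the actual setup, `embed` would call the embedding API and the vector store (LlamaIndex in the baseline) would handle ranking; the retrieval-then-prompt flow is the same.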

This framework enables direct context-enhanced QA from the source specifications, reducing ambiguity and hallucinations typical of base LLMs.

4. LLM Baseline Evaluation

Comprehensive model validation is performed via structured multiple-choice technical questions derived from the 3GPP corpus.

  • Models Used: GPT-3.5 Turbo, Gemini 1.0 Pro, GPT-4.
  • Prompt Template:
    • Direct question (“Select one option (A/B/C/D)”).
    • Temperature: 0.0, top-p: 1.0, max tokens: 512.
  • Context Window:
    • Non-RAG: question + choices only.
    • RAG: top-$K$ chunks ($\sim$1,000 tokens) plus question.
  • Evaluation Metric: accuracy

$\mathrm{Accuracy} = \dfrac{\#\text{correct answers}}{\#\text{total questions}}$

(Evaluated over 100 questions.)
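The metric can be computed mechanically from the models’ letter answers; a minimal sketch, where the answer-extraction regex is an assumption rather than the paper’s actual harness:

```python
import re

def extract_choice(response):
    """Pull the first standalone option letter (A-D) from a model response."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None

def accuracy(responses, gold):
    """Fraction of responses whose extracted letter matches the key."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)

responses = ["The answer is B.", "C", "Option A is correct.", "B"]
gold = ["B", "C", "D", "B"]
print(accuracy(responses, gold))  # 0.75
```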

Results table:

$\begin{array}{l|cc} \text{Model} & \text{Base} & \text{+ naive-RAG} \\ \hline \text{GPT-3.5 Turbo} & 44\% & 71\% \\ \text{Gemini 1.0 Pro} & 46\% & 72\% \\ \text{GPT-4} & 51\% & 75\% \end{array}$

  • Difficulty Analysis:
    • RAG+TSpec-LLM accuracy: 93% (easy), 65% (intermediate), 68% (hard).
    • Non-RAG: ~80%, 45%, 30% respectively.
  • Model Confidence: For Gemini + RAG, 72 answers were given at probability 1.0; 6 false positives carried 0.8–0.9 confidence; errors occurred predominantly at confidence $< 0.6$.
  • Example:
    • Q: “What is the maximum directional gain of an antenna element?”
    • GPT-4 base: “10 dBi.”
    • RAG context retrieval: Table 7.3-1 from TR 38.901 → “8 dBi.”
  • Comparative Baseline:
    • Naive-RAG on SPEC5G yields 60% accuracy versus 75% on TSpec-LLM.

5. Dataset Access and Usage Protocols

TSpec-LLM is readily available for download, loading, and model fine-tuning:

  • Accessing:
  1. Install Git LFS.
  2. Clone: git clone https://huggingface.co/datasets/rasoul-nikbakht/TSpec-LLM
  3. Resulting structure:

    TSpec-LLM/
      Release_15/Series_38/TS_38.901.docx
      Release_15/Series_38/TS_38.901.md
      ...
  • Loading for RAG/fine-tuning:
    • Read each Markdown, window into 1024-token segments with overlap, compute embeddings.
    • Use LlamaIndex or FAISS for retrieval-based applications.
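The windowing step above can be sketched directly; here a pre-tokenized list stands in for a real tokenizer’s output (in practice, tiktoken or the target model’s tokenizer would produce the tokens):

```python
def chunk_tokens(tokens, window=1024, overlap=100):
    """Slide a fixed-size window over a token list with the given overlap."""
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # final window already covers the tail
    return chunks

tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]), chunks[1][0])  # 3 1024 tok924
```

Each chunk shares its first 100 tokens with the tail of the previous one, so section boundaries are never lost at a window edge.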
  • Fine-tuning Recommendations (for open-source LLMs like Phi-3, LLaMA2):
    • Learning rate: $1\sim3\times 10^{-5}$
    • Batch size: 4–8
    • Epochs: 2–3
    • LoRA adapters: rank 8–16
    • Sequence length: 2,048 tokens
    • Early stopping on held-out technical QA.
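A sketch of wiring these recommendations into the Hugging Face peft API; this illustrates the hyperparameters listed above under stated assumptions (model checkpoint and target modules are illustrative choices, not from the paper), and it is not the authors’ training script:

```python
# Sketch only: requires `pip install peft transformers` and model access.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=16,                                 # rank in the 8-16 range above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for LLaMA-family models
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()        # only adapter weights should train
```

Training would then use a learning rate of 1–3 × 10⁻⁵, batch size 4–8, 2–3 epochs, a 2,048-token sequence length, and early stopping on held-out technical QA, per the recommendations above.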
  • Best Practices:
    • Retain LaTeX-formatted tables/equations for improved retrieval on technical queries.
    • Utilize metadata (release/series/docID) for granular index partitioning and efficient lookup.

6. Significance and Prospective Applications

TSpec-LLM represents a comprehensive, structured corpus for LLM adaptation in telecom standards interpretation (Nikbakht et al., 2024). A plausible implication is that such datasets will become foundational for specialized QA systems in regulated engineering domains. Its retrieval-oriented chunking and diligent semantic annotation set a precedent for domain-specific dataset construction, while its documented impact on model accuracy—especially for technically difficult queries—demonstrates the necessity of full-context specification corpora. TSpec-LLM can be leveraged for:

  • Retrieval-augmented QA for telecom standards engineering.
  • Direct instruction tuning and pre-training of open-source LLMs on telco corpora.
  • Document parsing and automated report generation from 3GPP artifacts.

This approach points toward scalable methods of automating standards comprehension, with extension opportunities to other highly technical domains that rely on sprawling documentation.

References

  • Nikbakht et al. (2024). TSpec-LLM: An Open-source Dataset for LLM Understanding of 3GPP Specifications.