SPBERTQA: Span Pre-training & QA Retrieval
- SPBERTQA is a QA approach that integrates span selection pre-training with a two-stage BM25+SBERT retrieval pipeline to improve answer extraction.
- The framework adapts BERT with additional span selection losses and pointer-based prediction, leading to enhanced performance on SQuAD, Natural Questions, and HotpotQA.
- Its two-stage retrieval component efficiently ranks candidate sentences, proving especially effective for low-resource domains like Vietnamese medical texts.
SPBERTQA refers in the literature both to advances in question answering (QA) pre-training objectives for language models and to practical two-stage QA retrieval systems. Depending on context, the term denotes either a span selection pre-training objective for BERT or a BM25+SBERT-based answer retrieval pipeline. This entry covers the principal variants as described in "Span Selection Pre-training for Question Answering" (Glass et al., 2019) and "SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts" (Nguyen et al., 2022).
1. Span Selection Pre-training for Question Answering
The SPBERTQA pre-training paradigm introduced in (Glass et al., 2019) seeks to adapt the inductive bias of BERT from token-level cloze tasks toward the extractive QA paradigm, where the model must select answer spans from natural passages given a question.
Construction and Objective
Each training instance involves:
- A query (natural sentence with answer masked as a special [BLANK] token).
- A passage (Wikipedia paragraph retrieved using Lucene BM25).
- An answer span contained in the passage, matching the originally masked portion of the query.
The SPBERTQA pre-training loss supplements BERT's original masked language model (MLM) and next sentence prediction (NSP) objectives with a span selection loss. This encompasses:
- Pointer-based start/end prediction over the concatenated query and passage.
- (Optionally) a binary answerability classification for "impossible" instances where no answer string is present.
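Mathematically, each instance contributes a span selection term computed from the predicted start and end distributions over the input tokens, and the total loss adds this term, together with the optional answerability term, to the objectives above.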
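As an illustration of pointer-based start/end prediction, the following is a minimal sketch rather than the exact loss of Glass et al. (2019). It assumes softmax distributions over token positions and sums probability mass over all gold positions when the answer occurs more than once; the function names and that summing convention are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def span_selection_loss(start_logits, end_logits, start_positions, end_positions):
    """Negative log of the probability mass placed on gold start and end tokens.

    start_positions / end_positions are lists of token indices, so an answer
    that occurs several times in the passage contributes all of its occurrences
    (an assumption of this sketch).
    """
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    start_mass = sum(p_start[i] for i in start_positions)
    end_mass = sum(p_end[i] for i in end_positions)
    return -math.log(start_mass) - math.log(end_mass)
```

With uniform logits over four tokens and a single gold span, the loss is 2·ln 4; it shrinks as the logits at the gold start and end positions grow.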
Data Creation
Instances are synthesized from Wikipedia by automatically selecting answer spans via heuristics (POS/NER), removing superficial lexical signals, and retrieving passages with BM25. About 30% of instances are "impossible" with no exact answer present.
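The instance construction described above can be sketched as follows. This is a simplified illustration: the answer term is supplied by the caller (the POS/NER heuristics are not reproduced), the passage is assumed to have been retrieved already, and matching is exact at the character level. The function name `make_instance` and the dictionary layout are hypothetical.

```python
BLANK = "[BLANK]"

def make_instance(sentence, answer, passage):
    """Build one cloze-style span selection instance.

    The answer term is replaced by [BLANK] in the source sentence to form the
    query. If the passage does not contain the answer string, the instance is
    marked "impossible".
    """
    if answer not in sentence:
        raise ValueError("answer term must appear in the source sentence")
    query = sentence.replace(answer, BLANK, 1)
    start = passage.find(answer)
    if start < 0:
        return {"query": query, "passage": passage,
                "answer_span": None, "impossible": True}
    return {"query": query, "passage": passage,
            "answer_span": (start, start + len(answer)), "impossible": False}
```

The returned character offsets would still need to be mapped to token positions before they can supervise start/end pointers.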
Architectural Modifications
A single new [BLANK] token is added to the vocabulary, and linear heads are introduced for start/end prediction and, optionally, answerability. The transformer encoder remains unchanged.
Pre-training Regimen
The model is initialized from BERT (base or large), and trained for 100M span-selection instances (batch size 256, LR 0), requiring about 40% more compute than standard BERT pre-training.
2. Downstream QA Adaptation and Results
SPBERTQA is directly fine-tuned on standard extractive QA benchmarks:
- SQuAD 1.1/2.0: Performance increases over BERT-base and BERT-large, with +SSPT-large achieving F1/EM of 92.75/86.86 on SQuAD 1.1 and 85.03/82.07 on SQuAD 2.0, a substantial gain over non-span-pretrained models.
- Natural Questions: +SSPT-large yields short-answer F1 of 54.2 vs. 52.7 for BERT-large.
- HotpotQA: Joint reasoning with local/global encoding produces answer F1 of 86.17 and supporting fact F1 of 79.39 (base/large settings).
A pronounced advantage is observed in limited-resource settings: on SQuAD, the span-pretrained model gains 5–6 F1 over BERT at 10% supervised data.
| Model | SQuAD 1.1 F1 | Natural Questions Short-Answer F1 | HotpotQA Answer F1 | Low-data Gain (10% of SQuAD) |
|---|---|---|---|---|
| BERT-large | 90.97 | 52.7 | 85.27 | – |
| +SSPT-large | 92.75 | 54.2 | 86.17 | +5–6 F1 |
3. BM25+SBERT-Based Answer Selection for Low-Resource Languages
A separate system named SPBERTQA is proposed in (Nguyen et al., 2022) for Vietnamese medical QA using a two-stage extractive pipeline:
- BM25-based sentence retrieval: Passages (avg. 495 words, often exceeding transformer length limits) are split into sentences, then ranked via BM25. For each passage, the top K sentences with the highest BM25 scores against the query are selected.
- SBERT semantic re-ranking: Both the question and the K-sentence candidate passage are encoded by Sentence-BERT (Siamese PhoBERT) trained with Multiple Negatives Ranking (MNR) loss. Embeddings use mean pooling, and candidates are ranked by cosine similarity.
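The two stages above can be sketched in plain Python. This is an illustrative approximation, not the implementation of Nguyen et al. (2022): BM25 is written out in its common Okapi form with assumed parameters `k1=1.5` and `b=0.75`, tokenization is a simple regex instead of VnCoreNLP, selected sentences are kept in their original order (an assumption), and the re-ranker takes precomputed embedding vectors in place of a PhoBERT-based encoder.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Stand-in tokenizer for this sketch; the source uses VnCoreNLP.
    return re.findall(r"\w+", text.lower())

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def top_k_sentences(question, sentences, k):
    """Stage 1: keep the k best-scoring sentences, in their original order."""
    scores = bm25_scores(tokenize(question), [tokenize(s) for s in sentences])
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(question_vec, candidate_vecs):
    """Stage 2: candidate indices sorted by cosine similarity, best first."""
    sims = [cosine(question_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: sims[i], reverse=True)
```

In a full system, `question_vec` and `candidate_vecs` would come from mean-pooled sentence encoder outputs over the question and over each condensed K-sentence passage.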
Key features of the architecture:
- Input truncation at 256 tokens, using VnCoreNLP for segmentation.
- Optimal performance with PhoBERT as the base encoder following contrastive (MNR) fine-tuning.
- Retrieval and re-ranking metrics are Precision@K and mean Average Precision.
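The Multiple Negatives Ranking objective mentioned in the list above treats, for each question in a batch, its paired answer as the positive and every other answer in the batch as a negative. A minimal sketch follows, assuming cosine similarity multiplied by a scale factor and a softmax cross-entropy over the batch; the value `scale=20.0` and the function names are assumptions of this example, not settings reported in the source.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mnr_loss(question_vecs, answer_vecs, scale=20.0):
    """In-batch contrastive loss: answer i is the positive for question i."""
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [scale * cosine(q, a) for a in answer_vecs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])  # cross-entropy with target index i
    return sum(losses) / len(losses)
```

The loss approaches zero when each question embedding is much closer to its own answer than to the other answers in the batch, and grows when the pairs are mismatched.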
On the ViHealthQA Vietnamese medical test set (2,013 examples):
| Model | P@1 (%) | P@10 (%) | mAP (%) |
|---|---|---|---|
| BM25 baseline | 44.96 | 70.09 | 56.93 |
| BM25-SXLMR | 46.05 | 79.04 | 53.85 |
| SPBERTQA | 50.92 | 83.76 | 62.25 |
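
The reported P@K values grow with K, which is consistent with reading P@K as the share of questions whose gold answer appears among the top K results. Under that reading, and assuming exactly one gold passage per question (so average precision reduces to reciprocal rank), the metrics could be computed as in the sketch below. Both assumptions are ours and are not confirmed by the source.

```python
def precision_at_k(rankings, gold, k):
    """Fraction of questions whose gold item appears in the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

def mean_average_precision(rankings, gold):
    """Mean of 1/rank of the gold item (0 when the gold item is not retrieved)."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)
```

For example, three questions whose gold passage is ranked first, second, and third respectively give P@1 = 1/3, P@2 = 2/3, and mAP = (1 + 1/2 + 1/3) / 3.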
SPBERTQA surpasses all bag-of-words and multilingual BERT/XLM-R baselines. The advantage is most visible at low lexical overlap, where P@1 remains at 50–80% even when question and answer share only 0–3 words, indicating that the model generalizes across the lexical gap (Nguyen et al., 2022).
4. Analysis, Limitations, and Future Directions
For the span selection pre-training approach:
- The method aligns pre-training with downstream extractive QA by requiring the model to select spans rather than generate answers. This yields more efficient transfer, shown empirically by significant F1/EM gains and faster convergence in low-resource scenarios (Glass et al., 2019).
- There is no modification to the core transformer; all changes are at the input or output head level.
For the retrieval-based system:
- The two-stage mechanism addresses token-length limitations in input representations and the lexical-semantic gap in Vietnamese medical QA.
- Limitations include the absence of explicit answer span highlighting in retrieved passages; only sentence-level extraction is provided. A plausible implication is that integrating a machine reading comprehension span predictor atop the system could deliver concise answer extraction.
Both lines of work demonstrate the effectiveness of specialized pre-training (for span selection) and monolingual semantic encoding (contrastive PhoBERT in Vietnamese) for QA.
5. Related Directions and Distinction from Other SPBERT Variants
SPBERTQA in the span selection context is distinct from the SPBERT model pre-trained on SPARQL queries for knowledge-graph QA (Tran et al., 2021), which addresses answer generation and SPARQL query construction using a different pre-training regime. By contrast, span selection SPBERTQA aligns with classical reading comprehension objectives, while the BM25+SBERT-based SPBERTQA exemplifies semantic re-ranking on top of lexical retrieval in low-resource cQA.
6. Applications and Domain Transfer
The span-selection paradigm is particularly potent for extractive question answering, multi-hop reasoning (HotpotQA), and settings with incomplete lexical overlap between query and source. The BM25+SBERT SPBERTQA pipeline is deployed for end-user health information seeking (Vietnamese QA) and provides a template for building domain-adapted cQA back-ends in other languages with limited annotated resources.
Future work for SPBERTQA systems includes integrated span highlighting, real-time deployment for conversational agents, and extension to slot filling and speech-based SLU domains.
References:
- "Span Selection Pre-training for Question Answering" (Glass et al., 2019)
- "SPBERTQA: A Two-Stage Question Answering System Based on Sentence Transformers for Medical Texts" (Nguyen et al., 2022)
- For contrast with pre-training on SPARQL syntax: "SPBERT: An Efficient Pre-training BERT on SPARQL Queries for Question Answering over Knowledge Graphs" (Tran et al., 2021)