- The paper introduces span-level masking and a span-boundary objective to better capture contiguous text spans during pre-training.
- Experiments demonstrate significant improvements, including 3.3%-5.4% F1 gains on SQuAD and a 6.6-point boost in coreference resolution.
- Ablation studies confirm that randomly masking contiguous spans consistently enhances performance across various NLP benchmarks.
SpanBERT: Improving Pre-training by Representing and Predicting Spans
SpanBERT is a pre-training method that improves on BERT by representing and predicting spans of text rather than individual tokens. It modifies both BERT's masking scheme and its training objectives, targeting tasks that involve span-level reasoning such as question answering and coreference resolution.
Key Contributions
SpanBERT introduces two primary modifications, both sketched in code after this list:
- Masking Contiguous Spans: Rather than masking individual random tokens, SpanBERT masks contiguous spans of text, with span lengths sampled from a geometric distribution and roughly 15% of tokens masked in total. This forces the model to predict entire spans from the surrounding context instead of relying on isolated token predictions.
- Span-Boundary Objective (SBO): This objective trains the model to predict every token of a masked span using only the representations of the tokens at the span's boundary, plus a position embedding for the target token. This encourages the model to pack span-level information into the boundary representations, where it can be accessed cheaply during fine-tuning.
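The sketch below is a minimal, simplified PyTorch illustration of both ideas, not the released SpanBERT code: the geometric span-length distribution (p = 0.2, clipped at 10 tokens) and the two-layer GeLU/LayerNorm SBO head follow the paper's description, while the function and class names (`sample_masked_spans`, `SpanBoundaryObjective`) are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn


def sample_masked_spans(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    """Pick contiguous spans to mask until ~15% of positions are covered.

    Span lengths are drawn from a geometric distribution (mean ~3.8 in the
    paper) and clipped at `max_span`; start positions are uniform. This is a
    simplified sketch that ignores word-boundary handling.
    """
    rng = rng or np.random.default_rng()
    budget = int(seq_len * mask_budget)
    masked = set()
    while len(masked) < budget:
        length = min(int(rng.geometric(p)), max_span, seq_len)
        start = int(rng.integers(0, seq_len - length + 1))
        masked.update(range(start, start + length))
    return sorted(masked)


class SpanBoundaryObjective(nn.Module):
    """Predict each token inside a masked span from its boundary tokens.

    Inputs are the encoder states of the tokens just outside the span
    (x_{s-1}, x_{e+1}) plus a relative position embedding for the target
    token; a two-layer feed-forward network with GeLU and LayerNorm produces
    a vector that is scored against the vocabulary, following the paper's
    description of the SBO head.
    """

    def __init__(self, hidden, vocab_size, max_span=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, hidden),
            nn.GELU(),
            nn.LayerNorm(hidden),
        )
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, left_boundary, right_boundary, rel_positions):
        # left_boundary, right_boundary: (num_targets, hidden) encoder states
        # rel_positions: (num_targets,) position of each target within its span
        h = torch.cat(
            [left_boundary, right_boundary, self.pos_emb(rel_positions)], dim=-1
        )
        return self.decoder(self.mlp(h))  # logits over the vocabulary
```

Predicting from the boundaries alone forces a span's content to be summarized at its endpoints, which mirrors how downstream span-selection models for question answering and coreference typically represent candidate spans.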
Experimental Results
SpanBERT's efficacy is demonstrated across various NLP benchmarks:
- SQuAD 1.1 and 2.0: SpanBERT achieves 94.6% F1 on SQuAD 1.1 and 88.7% F1 on SQuAD 2.0, outperforming BERT by 3.3% and 5.4%, respectively.
- OntoNotes Coreference Resolution: SpanBERT reaches 79.6% F1, a new state of the art at the time of publication and an improvement of 6.6 percentage points over the previous best model.
- TACRED Relation Extraction: The model attains 70.8% F1, outperforming the BERT baselines on this benchmark.
- GLUE Benchmark: SpanBERT also improves on tasks such as QNLI and RTE, with QNLI accuracy reaching 94.3% and RTE improving by 6.9 points over the baseline; the overall GLUE average rises to 82.8%.
Comparative Baselines
Three BERT variants were used as baselines for comparison:
- Google BERT: The original pre-trained models released by Devlin et al. (2019).
- Our BERT: A reimplementation with improved preprocessing and optimization.
- Our BERT-1seq: The reimplementation trained on single sequences without the next sentence prediction (NSP) objective (the difference in how training instances are built is sketched after this list).
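To make the distinction between the bi-sequence NSP setup and single-sequence training concrete, here is a rough sketch of how pre-training instances could be constructed in each regime. It is an illustration only, assuming the corpus is already tokenized into a list of sentences; the function names and packing logic are hypothetical, not taken from either codebase.

```python
import random


def nsp_style_instance(sentences, idx, max_len=512):
    """BERT-style pair for the Google BERT / Our BERT baselines: segment A
    plus either the following segment (is_next = 1) or a random segment from
    elsewhere in the corpus (is_next = 0). Assumes idx + 1 is a valid index."""
    seg_a = sentences[idx][: max_len // 2]
    if random.random() < 0.5:
        seg_b, is_next = sentences[idx + 1][: max_len // 2], 1
    else:
        seg_b, is_next = random.choice(sentences)[: max_len // 2], 0
    return seg_a, seg_b, is_next


def single_sequence_instance(sentences, idx, max_len=512):
    """BERT-1seq / SpanBERT-style instance: one contiguous block of text of
    up to max_len tokens, with no second segment and no NSP label."""
    tokens = []
    while idx < len(sentences) and len(tokens) + len(sentences[idx]) <= max_len:
        tokens.extend(sentences[idx])
        idx += 1
    return tokens
```

Dropping the second segment lets the model condition on longer contiguous contexts, which the paper identifies as part of the reason the single-sequence baseline outperforms the original NSP setup.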
Observations from Ablation Studies
Ablation studies highlight the advantages of SpanBERT's design choices:
- Masking Schemes: Random span masking outperformed linguistically informed schemes (e.g., masking named entities or noun phrases) on most tasks, underscoring the robustness of random span selection.
- Auxiliary Objectives: Removing the NSP objective and training on single sequences generally yielded better results. In addition, combining the SBO with span masking consistently improved performance across tasks, particularly coreference resolution; the two losses are simply summed per masked token, as sketched below.
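A minimal sketch of that combination, reusing the hypothetical `SpanBoundaryObjective` head from the earlier example: for each masked token, the MLM cross-entropy (computed from the token's own encoder output) and the SBO cross-entropy (computed from the span-boundary representations) are added together.

```python
import torch.nn.functional as F


def spanbert_token_loss(mlm_logits, sbo_logits, target_ids):
    """Pre-training loss over the masked positions: the usual MLM
    cross-entropy plus the SBO cross-entropy, both predicting the same
    original token ids (a sketch of L = L_MLM + L_SBO from the paper)."""
    mlm_loss = F.cross_entropy(mlm_logits, target_ids)
    sbo_loss = F.cross_entropy(sbo_logits, target_ids)
    return mlm_loss + sbo_loss
```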
Theoretical and Practical Implications
SpanBERT's results underscore how much the design of the pre-training objective matters for downstream performance. By focusing pre-training on spans, SpanBERT not only improves accuracy on span-intensive tasks but also remains competitive across a broad range of NLP benchmarks.
Future Directions
Several potential avenues can be explored based on SpanBERT's contributions:
- Broader Application: Applying span-based pre-training to other types of spans such as syntactic structures or semantic roles may uncover further performance gains.
- Cross-lingual Pre-training: Extending the span-based pre-training approach to multilingual corpora could enhance cross-lingual understanding and performance.
- Large-scale Training: Leveraging larger corpora and increased computational resources could further elevate the performance ceilings observed with SpanBERT.
Conclusion
SpanBERT offers a refined approach to pre-training that captures and exploits span-level information, yielding significant improvements across a range of NLP tasks. Its design advances the state of the art on span-related benchmarks and provides a strong foundation for further research on pre-trained language models.