ELECTRA Large Discriminator SQuAD2 (512)
- ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built on a 24-layer ELECTRA large discriminator architecture fine-tuned on SQuAD v2.0, enabling precise span selection and effective no-answer classification.
- The model achieves an aggregate accuracy of 43% across eight diverse QA datasets, with exceptional performance in biomedical contexts evidenced by a 96.45% accuracy on biomedical_cpgQA.
- Advanced pretraining using replaced-token detection, combined with a genetic algorithm ensemble and robust input representation, enhances its cross-domain generalizability and precision.
ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built upon the ELECTRA “large” discriminator architecture and fine-tuned on the SQuAD v2.0 dataset. It employs a 24-layer Transformer encoder with approximately 335M parameters and achieves high accuracy on diverse, multi-domain QA benchmarks. The model is particularly noted for outperforming other contemporary models in aggregate accuracy across eight QA datasets, measured without additional fine-tuning, and shows distinctive strengths in biomedical and factoid domains. Its design integrates advanced pretraining objectives and robust fine-tuning regimes, facilitating precise span prediction and effective no-answer classification.
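The model is distributed via the Hugging Face Hub, so a minimal inference sketch uses the standard `transformers` question-answering pipeline (the `handle_impossible_answer` flag enables SQuAD v2-style no-answer outputs; exact pipeline behavior is version-dependent):

```python
# Minimal usage sketch; requires the transformers and torch packages.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="ahotrod/electra_large_discriminator_squad2_512",
)

result = qa(
    question="What objective is ELECTRA pretrained with?",
    context=(
        "ELECTRA is pretrained with a replaced-token detection "
        "objective rather than masked language modeling."
    ),
    handle_impossible_answer=True,  # allow SQuAD v2-style no-answer
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```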
1. Model Architecture and Pretraining
ahotrod/electra_large_discriminator_squad2_512 utilizes the ELECTRA “large” discriminator, as described in Clark et al. (2020). This architecture comprises 24 Transformer encoder layers with a hidden size of 1024 and an intermediate feed-forward dimension of 4096. Every layer leverages 16 self-attention heads, each operating on a 64-dimensional subspace.
Pretraining is achieved via the replaced-token detection (RTD) objective: a smaller masked-language-model generator proposes replacements for masked tokens, and the discriminator predicts, for each token position $t$, whether the token is original or replaced. The RTD loss is expressed as:

$$\mathcal{L}_{\text{RTD}} = -\sum_{t=1}^{n} \Big[ y_t \log D(\mathbf{x}, t) + (1 - y_t) \log\big(1 - D(\mathbf{x}, t)\big) \Big]$$

where $y_t \in \{0, 1\}$ indicates the replaced/original label for the $t$-th token, and $D(\mathbf{x}, t)$ is the discriminator’s output probability for position $t$.
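As a concrete illustration, a minimal PyTorch sketch of this per-token binary cross-entropy follows (tensor names and shapes are illustrative assumptions, not the reference ELECTRA implementation):

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """RTD objective as token-level binary cross-entropy.

    disc_logits: (batch, seq_len) raw discriminator scores per token.
    labels: (batch, seq_len) with 1.0 = replaced by generator, 0.0 = original.
    """
    return F.binary_cross_entropy_with_logits(disc_logits, labels)

logits = torch.randn(2, 512)                    # fake discriminator outputs
labels = torch.randint(0, 2, (2, 512)).float()  # fake replaced/original labels
print(rtd_loss(logits, labels))
```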
Fine-tuning on SQuAD v2.0 introduces three classification heads atop the discriminator: a start-position softmax, an end-position softmax, and a binary “no-answer” classifier operating on the [CLS] token. The composite fine-tuning loss is:

$$\mathcal{L} = \mathcal{L}_{\text{start}} + \mathcal{L}_{\text{end}} + \lambda\, \mathcal{L}_{\text{no-ans}}$$

with $\lambda$ typically set to 1. This regime supports span selection and explicit no-answer prediction.
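A hedged sketch of computing this composite loss (head names and argument shapes are hypothetical placeholders consistent with the description above):

```python
import torch
import torch.nn.functional as F

def squad2_loss(start_logits, end_logits, na_logit,
                start_pos, end_pos, is_impossible, lam=1.0):
    """Composite SQuAD v2 fine-tuning loss.

    start_logits, end_logits: (batch, seq_len) span-boundary scores.
    na_logit: (batch,) score from the [CLS] no-answer head.
    start_pos, end_pos: (batch,) gold token indices.
    is_impossible: (batch,) floats, 1.0 for unanswerable questions.
    """
    loss_start = F.cross_entropy(start_logits, start_pos)
    loss_end = F.cross_entropy(end_logits, end_pos)
    loss_na = F.binary_cross_entropy_with_logits(na_logit, is_impossible)
    return loss_start + loss_end + lam * loss_na
```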
2. Input Representation and Context Handling
The model applies WordPiece tokenization with a vocabulary of roughly 30,000 entries and a maximum sequence length of 512 tokens. Input sequences are constructed by concatenating:
- [CLS] token
- Tokenized question
- [SEP] token
- Tokenized context
- [SEP] token
Segment embeddings distinguish question from context, and positional embeddings are assigned up to length 512. All tokens are encoded via the 24-layer stack, and fine-tuning heads consume the resulting hidden representations at the relevant indices (start, end, [CLS]).
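This packing is what the standard `transformers` tokenizer produces for a question/context pair; a minimal sketch (the truncation policy shown is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512")

enc = tok(
    "Who proposed ELECTRA?",                         # question
    "ELECTRA was proposed by Clark et al. (2020).",  # context
    max_length=512,
    truncation="only_second",    # truncate the context, never the question
    return_token_type_ids=True,  # segment ids: 0 = question, 1 = context
)
print(tok.convert_ids_to_tokens(enc["input_ids"])[:8])
print(enc["token_type_ids"][:8])
```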
3. Performance Metrics Across QA Datasets
In a comparative benchmark spanning eight datasets, ahotrod/electra_large_discriminator_squad2_512 yields an aggregate accuracy of 43%, outperforming 46 other CBQA models without further fine-tuning (Muneeb et al., 29 Nov 2025). Dataset-specific top-line accuracies include:
| Dataset | Accuracy (%) |
|---|---|
| bioasq10b-factoid | 87.30 |
| biomedical_cpgQA | 96.45 |
| QuAC | 70.10 |
| ScienceQA | 62.20 |
| Atlas-math-sets | 55.80 |
| QA Dataset (Wikipedia) | 52.70 |
| JournalQA | 48.90 |
| IELTS reading | 45.50 |
Span-selection tasks report F1 scores from 82% on QuAC down to 60% on linguistically complex passages. Accuracy is defined as the ratio of correct predictions to total examples in each dataset.
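For clarity, the metric reduces to a simple ratio; a trivial sketch with hypothetical predictions:

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```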
4. Computational Complexity and Inference Characteristics
The self-attention mechanism in each Transformer layer gives rise to a theoretical time complexity of $O(n^2 d)$ for sequence length $n$ and hidden dimension $d$, while the feed-forward network scales as $O(n d^2)$. Overall, the per-layer complexity is $O(n^2 d + n d^2)$. Empirical results on a single V100 GPU for batch size 1 yield:
- 128-token context: ≈ 40 ms
- 256 tokens: ≈ 80 ms
- 512 tokens: ≈ 160 ms
Within this range, measured latency roughly doubles with each doubling of sequence length, i.e. near-linear scaling; this is consistent with the $O(n d^2)$ feed-forward term dominating while $n \leq d$, with the quadratic $O(n^2 d)$ attention term expected to dominate only at substantially longer sequences.
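A reproduction sketch of this micro-benchmark (absolute timings are hardware-dependent; the warm-up and averaging loop are assumptions about reasonable measurement practice):

```python
import time
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512").eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for seq_len in (128, 256, 512):
    # Random token ids are sufficient for latency measurement.
    ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device=device)
    with torch.no_grad():
        model(input_ids=ids)  # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(10):
            model(input_ids=ids)
        if device == "cuda":
            torch.cuda.synchronize()
    print(f"{seq_len} tokens: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms")
```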
5. Effects of Answer Length and Context Complexity
Analysis within the aggregate paper establishes that accuracy decreases approximately linearly with increasing gold-answer length $L$ (in tokens):

$$\text{Acc}(L) \approx \text{Acc}_0 - 0.012\, L$$

where $\text{Acc}_0$ is the fitted intercept; each additional token in the gold answer reduces exact-match accuracy by approximately 1.2%. Context complexity, quantified by dependency-tree depth and lexical entropy, also impacts results. Passages with average depth > 4 and entropy > 5.0 bits/token incur a further 5–8% drop in accuracy relative to simpler contexts.
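A toy numeric illustration of the linear trend (the intercept is hypothetical; only the ~1.2%-per-token slope comes from the analysis above):

```python
def predicted_accuracy(answer_len, intercept=0.60, slope=0.012):
    """Fitted linear model: accuracy falls ~1.2% per gold-answer token."""
    return max(0.0, intercept - slope * answer_len)

for L in (1, 5, 10):
    print(L, f"{predicted_accuracy(L):.3f}")  # 0.588, 0.540, 0.480
```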
6. Genetic Algorithm Ensemble Enhancement
To augment robustness across QA domains, a genetic algorithm (GA) ensemble combines weighted model outputs. This approach involves:
- Population size: 50 individuals
- Tournament selection size: 5
- Two-point crossover, applied with a fixed crossover probability
- Per-weight mutation probability
Each individual encodes a real-valued weight $w_i$ for model $i$, and ensemble predictions are determined by the argmax of the weighted sum of per-model logits. Over 100 GA generations, ensemble accuracy improves from 43% to 45.2%, an absolute gain of 2.2 percentage points (Muneeb et al., 29 Nov 2025).
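A compact, self-contained sketch of such a GA weight search follows (NumPy only; the fitness data, crossover/mutation probabilities, and elitism step are assumptions the source does not specify):

```python
import numpy as np

rng = np.random.default_rng(0)
N_MODELS, POP, TOURN, GENS = 4, 50, 5, 100
P_CX, P_MUT = 0.9, 0.1  # assumed probabilities; values omitted in the source

# Hypothetical cached per-model logits: shape (models, examples, classes).
logits = rng.normal(size=(N_MODELS, 200, 10))
gold = rng.integers(0, 10, size=200)

def fitness(w):
    """Ensemble accuracy: argmax over the weighted sum of per-model logits."""
    preds = np.tensordot(w, logits, axes=1).argmax(axis=-1)
    return (preds == gold).mean()

pop = rng.random((POP, N_MODELS))
for _ in range(GENS):
    fit = np.array([fitness(ind) for ind in pop])
    new = [pop[fit.argmax()].copy()]                 # elitism (assumed)
    while len(new) < POP:
        # Tournament selection (size 5) for each parent.
        i = rng.choice(POP, TOURN)
        j = rng.choice(POP, TOURN)
        p1 = pop[i[fit[i].argmax()]].copy()
        p2 = pop[j[fit[j].argmax()]].copy()
        if rng.random() < P_CX:                      # two-point crossover
            a, b = sorted(rng.choice(N_MODELS + 1, size=2, replace=False))
            p1[a:b], p2[a:b] = p2[a:b].copy(), p1[a:b].copy()
        for p in (p1, p2):                           # per-weight mutation
            mask = rng.random(N_MODELS) < P_MUT
            p[mask] += rng.normal(scale=0.1, size=mask.sum())
            new.append(np.clip(p, 0.0, None))
    pop = np.array(new[:POP])

best = max(pop, key=fitness)
print("best weights:", best, "ensemble accuracy:", fitness(best))
```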
7. Domain-Specific Competence and Use Cases
ahotrod/electra_large_discriminator_squad2_512 demonstrates superior accuracy, notably achieving 96.45% on biomedical_cpgQA and 87.30% on bioasq10b-factoid, indicating particular aptitude for biomedical QA applications. Performance on dialog-style (QuAC: 70.10%) and mathematics-focused datasets (Atlas-math-sets: 55.80%) further confirms cross-domain generalizability. A plausible implication is that robust RTD pretraining and SQuAD v2 fine-tuning are effective for specialized factual extraction tasks, even without domain-specific retraining. Practical use in information retrieval, user support, and educational platforms is substantiated by metric-driven outcomes across varied contexts (Muneeb et al., 29 Nov 2025).