ELECTRA Large Discriminator SQuAD2 (512)

Updated 6 December 2025
  • ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built on the 24-layer ELECTRA large discriminator architecture and fine-tuned on SQuAD v2.0, enabling precise span selection and effective no-answer classification.
  • The model achieves an aggregate accuracy of 43% across eight diverse QA datasets, with exceptional performance in biomedical contexts evidenced by a 96.45% accuracy on biomedical_cpgQA.
  • Advanced pretraining using replaced-token detection, combined with a genetic algorithm ensemble and robust input representation, enhances its cross-domain generalizability and precision.

ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built upon the ELECTRA “large” discriminator architecture and fine-tuned on the SQuAD v2.0 dataset. It employs a 24-layer Transformer encoder with approximately 335 million parameters and achieves high accuracy across diverse, multi-domain QA benchmarks. The model is particularly noted for outperforming other contemporary models in aggregate accuracy across eight QA datasets, measured without additional fine-tuning, and shows distinctive strengths in biomedical and factoid domains. Its design integrates advanced pretraining objectives and robust fine-tuning regimes, facilitating precise span prediction and effective no-answer classification.

1. Model Architecture and Pretraining

ahotrod/electra_large_discriminator_squad2_512 utilizes the ELECTRA “large” discriminator, as described in Clark et al. (2020). This architecture comprises 24 Transformer encoder layers, each with a hidden size $d = 1024$ and an intermediate feed-forward dimension of 4096. Every layer uses 16 self-attention heads, each operating on a 64-dimensional subspace.
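These figures can be instantiated directly as a model configuration; the following is a minimal sketch using the Hugging Face `transformers` library, assuming the checkpoint follows the standard ELECTRA-large discriminator configuration (the embedding size of 1024 is part of that assumption):

```python
from transformers import ElectraConfig, ElectraForQuestionAnswering

# ELECTRA-large discriminator shape as described above:
# 24 layers, hidden size 1024, 16 heads, FFN size 4096, 512 positions.
config = ElectraConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    embedding_size=1024,          # assumed to match the standard large configuration
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=512,
)
model = ElectraForQuestionAnswering(config)  # randomly initialized, for shape inspection only

print(config.hidden_size // config.num_attention_heads)              # 64-dimensional heads
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```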

Pretraining is achieved via the replaced-token detection (RTD) objective: a smaller masked-language-model generator proposes replacements for masked tokens, and the discriminator predicts, for each token $x_i$, whether it is original or replaced. The RTD loss is expressed as:

$$L_{\text{RTD}} = -\sum_{i=1}^{N} \left[ z_i \log p_\theta(z_i = 1 \mid x) + (1 - z_i) \log\left(1 - p_\theta(z_i = 1 \mid x)\right) \right]$$

where $z_i \in \{0, 1\}$ indicates the replaced/original label for the $i$-th token, and $p_\theta$ is the discriminator’s output probability.
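A minimal PyTorch sketch of this objective, assuming `disc_logits` are the discriminator's per-token scores and `replaced_labels` are the targets $z_i$ produced by the generator's corruptions; this is illustrative, not the ELECTRA training code:

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor, replaced_labels: torch.Tensor) -> torch.Tensor:
    """Replaced-token detection loss: per-token binary cross-entropy between the
    discriminator's replaced/original prediction and the labels z_i.
    Uses a mean over tokens; the formula's sum differs only by a constant factor."""
    return F.binary_cross_entropy_with_logits(disc_logits, replaced_labels.float())

# Example: a batch of 2 sequences with 6 tokens each.
logits = torch.randn(2, 6)            # scores for p_theta(z_i = 1 | x), pre-sigmoid
labels = torch.randint(0, 2, (2, 6))  # 1 = replaced, 0 = original
print(rtd_loss(logits, labels))
```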

Fine-tuning on SQuAD v2.0 introduces three classification heads atop the discriminator: start-position softmax, end-position softmax, and a binary “no-answer” classifier operating on the [CLS] token. The composite fine-tuning loss is:

$$L_{\text{FT}} = L_{\text{CE}}^{\text{start}} + L_{\text{CE}}^{\text{end}} + \lambda\, L_{\text{CE}}^{\text{no-ans}}$$

with $\lambda$ typically set to 1. This regime supports span selection and explicit no-answer prediction.
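The composite loss can be sketched as follows; the tensor names and the binary [CLS]-based no-answer head are assumptions for illustration, not the checkpoint's exact head implementation:

```python
import torch
import torch.nn.functional as F

def squad2_loss(start_logits, end_logits, noans_logits,
                start_pos, end_pos, is_impossible, lam: float = 1.0):
    """Composite fine-tuning loss L_FT = CE(start) + CE(end) + lambda * CE(no-answer)."""
    loss_start = F.cross_entropy(start_logits, start_pos)   # softmax over start positions
    loss_end = F.cross_entropy(end_logits, end_pos)         # softmax over end positions
    loss_noans = F.cross_entropy(noans_logits, is_impossible)  # binary head on [CLS]
    return loss_start + loss_end + lam * loss_noans

# Example shapes: batch of 4, sequence length 512.
B, L = 4, 512
loss = squad2_loss(
    torch.randn(B, L), torch.randn(B, L), torch.randn(B, 2),
    torch.randint(0, L, (B,)), torch.randint(0, L, (B,)), torch.randint(0, 2, (B,)),
)
print(loss)
```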

2. Input Representation and Context Handling

The model applies WordPiece tokenization with a vocabulary size of approximately 28,000 and a maximum sequence length of 512 tokens. Input sequences are constructed by concatenating:

  • [CLS] token
  • Tokenized question
  • [SEP] token
  • Tokenized context
  • [SEP] token

Segment embeddings distinguish question from context, and positional embeddings are assigned up to length 512. All tokens are encoded via the 24-layer stack, and fine-tuning heads consume the resulting hidden representations at the relevant indices (start, end, [CLS]).
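This input construction is handled automatically by the model's tokenizer. A minimal inference sketch with the Hugging Face `transformers` question-answering pipeline, assuming the checkpoint is available on the Hub under the identifier quoted above:

```python
from transformers import pipeline

# The tokenizer builds the [CLS] question [SEP] context [SEP] layout and truncates
# to the 512-token limit; the pipeline maps start/end logits back to a character span.
qa = pipeline("question-answering",
              model="ahotrod/electra_large_discriminator_squad2_512")

result = qa(
    question="What objective is ELECTRA pretrained with?",
    context="ELECTRA is pretrained with a replaced-token detection objective, in which "
            "a discriminator predicts whether each token was replaced by a small generator.",
    handle_impossible_answer=True,  # enable the SQuAD v2.0-style no-answer prediction
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```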

3. Performance Metrics Across QA Datasets

In a comparative benchmark spanning eight datasets, ahotrod/electra_large_discriminator_squad2_512 yields an aggregate accuracy of 43%, outperforming 46 other CBQA models without further fine-tuning (Muneeb et al., 29 Nov 2025). Dataset-specific top-line accuracies include:

| Dataset | Accuracy (%) |
| --- | --- |
| bioasq10b-factoid | 87.30 |
| biomedical_cpgQA | 96.45 |
| QuAC | 70.10 |
| ScienceQA | 62.20 |
| Atlas-math-sets | 55.80 |
| QA Dataset (Wikipedia) | 52.70 |
| JournalQA | 48.90 |
| IELTS reading | 45.50 |

Span-selection tasks report F1 scores from 82% on QuAC down to 60% on linguistically complex passages. Accuracy is defined as the ratio of correct predictions to total examples in each dataset.
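For concreteness, a small sketch of the accuracy metric as defined here, i.e. exact match between predicted and gold answer strings; the lowercasing and whitespace normalization are assumptions, since the benchmark's matching rules are not reproduced in this summary:

```python
def exact_match_accuracy(predictions, references):
    """Accuracy = (# examples whose prediction matches the gold answer) / (# examples)."""
    normalize = lambda s: " ".join(s.lower().split())
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["Paris", "42"], ["paris", "41"]))  # 0.5
```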

4. Computational Complexity and Inference Characteristics

The self-attention mechanism in each Transformer layer gives rise to a theoretical time complexity of $O(L^2 \cdot d)$ for sequence length $L$ and hidden dimension $d$, while the feed-forward network scales as $O(L \cdot d^2)$. Overall, the per-layer complexity is $O(L^2 d + L d^2)$. Empirical results on a single V100 GPU for batch size 1 yield:

  • 128-token context: ≈ 40 ms
  • 256 tokens: ≈ 80 ms
  • 512 tokens: ≈ 160 ms

Over this range, latency grows roughly linearly with sequence length rather than quadratically: at $L = 512$ and $d = 1024$, the feed-forward term $L d^2 \approx 5.4 \times 10^8$ still exceeds the attention term $L^2 d \approx 2.7 \times 10^8$, so the quadratic attention cost would dominate only at substantially longer sequences.
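A minimal sketch of how such single-example latencies could be reproduced, assuming a CUDA-capable GPU and that the checkpoint is available from the Hugging Face Hub; absolute figures will vary with hardware and software versions:

```python
import time
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "ahotrod/electra_large_discriminator_squad2_512"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name).eval().cuda()

for length in (128, 256, 512):
    # Dummy batch of one sequence at the target length (content is irrelevant for timing).
    ids = torch.full((1, length), tok.pad_token_id, dtype=torch.long, device="cuda")
    with torch.no_grad():
        for _ in range(3):                       # warm-up passes
            model(input_ids=ids)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(10):
            model(input_ids=ids)
        torch.cuda.synchronize()
    print(f"{length} tokens: {(time.perf_counter() - t0) / 10 * 1000:.1f} ms")
```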

5. Effects of Answer Length and Context Complexity

Analysis within the aggregate paper establishes that accuracy decreases approximately linearly with increasing answer length $\ell$:

$$\text{Accuracy}(\ell) \approx A_0 - \kappa \ell, \qquad \kappa \approx 0.012$$

Each additional token in the gold answer reduces exact-match accuracy by approximately 1.2%. Context complexity—quantified by dependency-tree depth and lexical entropy—also impacts results. Passages with average depth > 4 and entropy > 5.0 bits/token incur a further 5–8% drop in accuracy relative to simpler contexts.
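For illustration, with a hypothetical baseline of $A_0 = 0.60$ (not a figure reported in the source), a ten-token gold answer would be expected to score roughly

$$\text{Accuracy}(10) \approx 0.60 - 0.012 \times 10 = 0.48,$$

i.e. about twelve percentage points below the intercept $A_0$.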

6. Genetic Algorithm Ensemble Enhancement

To augment robustness across QA domains, a genetic algorithm (GA) ensemble combines weighted model outputs. This approach involves:

  • Population size: 50 individuals
  • Tournament selection size: 5
  • Crossover probability $P_c = 0.8$ (two-point crossover)
  • Mutation probability $P_m = 0.05$ per weight

Each individual encodes real-valued weights $w_j$ for model $j$, and ensemble predictions are taken as the argmax of the weighted sum of the member models' per-token logits. Over 100 GA generations, ensemble accuracy improves from 43% to 45.2%, an absolute gain of 2.2 percentage points (Muneeb et al., 29 Nov 2025).
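The weight search can be sketched as below, using the stated GA hyperparameters (population 50, tournament size 5, two-point crossover with $P_c = 0.8$, per-weight mutation with $P_m = 0.05$, 100 generations). For clarity the fitness function reduces the QA ensemble to a generic weighted-logit classification; the mutation scale, weight clipping, and data handling are assumptions rather than details reported in the source:

```python
import random
import numpy as np

POP, TOURN, P_C, P_M, GENS = 50, 5, 0.8, 0.05, 100

def fitness(weights, logits, labels):
    """Ensemble accuracy: argmax of the weighted sum of per-model logits.
    logits: (n_models, n_examples, n_classes); labels: (n_examples,)."""
    combined = np.tensordot(weights, logits, axes=1)        # (n_examples, n_classes)
    return float((combined.argmax(-1) == labels).mean())

def tournament(pop, fits):
    idx = max(random.sample(range(len(pop)), TOURN), key=lambda i: fits[i])
    return pop[idx].copy()

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(len(a)), 2))
    a[i:j], b[i:j] = b[i:j].copy(), a[i:j].copy()            # swap the middle segment
    return a, b

def mutate(w):
    for k in range(len(w)):
        if random.random() < P_M:
            w[k] += random.gauss(0.0, 0.1)                   # small Gaussian step (assumed)
    return np.clip(w, 0.0, None)                             # non-negative weights (assumed)

def ga_search(logits, labels, n_models):
    pop = [np.random.rand(n_models) for _ in range(POP)]
    for _ in range(GENS):
        fits = [fitness(w, logits, labels) for w in pop]
        nxt = []
        while len(nxt) < POP:
            a, b = tournament(pop, fits), tournament(pop, fits)
            if random.random() < P_C:
                a, b = two_point_crossover(a, b)
            nxt += [mutate(a), mutate(b)]
        pop = nxt[:POP]
    return max(pop, key=lambda w: fitness(w, logits, labels))
```

In practice the weights would be searched on held-out validation logits and then applied unchanged to test-time predictions.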

7. Domain-Specific Competence and Use Cases

ahotrod/electra_large_discriminator_squad2_512 demonstrates superior accuracy, notably achieving 96.45% on biomedical_cpgQA and 87.30% on bioasq10b-factoid, indicating particular aptitude for biomedical QA applications. Performance on dialog-style (QuAC: 70.10%) and mathematics-focused datasets (Atlas-math-sets: 55.80%) further confirms cross-domain generalizability. A plausible implication is that robust RTD pretraining and SQuAD v2 fine-tuning are effective for specialized factual extraction tasks, even without domain-specific retraining. Practical use in information retrieval, user support, and educational platforms is substantiated by metric-driven outcomes across varied contexts (Muneeb et al., 29 Nov 2025).
