ELECTRA Large Discriminator SQuAD2 (512)
- ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built on a 24-layer ELECTRA large discriminator architecture fine-tuned on SQuAD v2.0, enabling precise span selection and effective no-answer classification.
- The model achieves an aggregate accuracy of 43% across eight diverse QA datasets, with exceptional performance in biomedical contexts evidenced by a 96.45% accuracy on biomedical_cpgQA.
- Advanced pretraining using replaced-token detection, combined with a genetic algorithm ensemble and robust input representation, enhances its cross-domain generalizability and precision.
ahotrod/electra_large_discriminator_squad2_512 is a context-based question answering (CBQA) model built upon the ELECTRA “large” discriminator architecture and fine-tuned on the SQuAD v2.0 dataset. It employs a 24-layer Transformer encoder with approximately 335M parameters and achieves high accuracy on diverse, multi-domain QA benchmarks. The model is particularly noted for outperforming other contemporary models in aggregate accuracy across eight QA datasets, measured without additional fine-tuning, and shows distinctive strengths in biomedical and factoid domains. Its design integrates advanced pretraining objectives and robust fine-tuning regimes, facilitating precise span prediction and effective no-answer classification.
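The model is distributed via the Hugging Face Hub, so a minimal inference sketch uses the standard `transformers` question-answering pipeline (the `handle_impossible_answer` flag enables SQuAD v2-style no-answer outputs; exact pipeline behavior is version-dependent):

```python
# Minimal usage sketch; requires the transformers and torch packages.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="ahotrod/electra_large_discriminator_squad2_512",
)

result = qa(
    question="What objective is ELECTRA pretrained with?",
    context=(
        "ELECTRA is pretrained with a replaced-token detection "
        "objective rather than masked language modeling."
    ),
    handle_impossible_answer=True,  # allow SQuAD v2-style no-answer
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```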
1. Model Architecture and Pretraining
ahotrod/electra_large_discriminator_squad2_512 utilizes the ELECTRA “large” discriminator, as described in Clark et al. (2020). This architecture comprises 24 Transformer encoder layers with a hidden size of 1024 and an intermediate feed-forward dimension of 4096. Every layer leverages 16 self-attention heads, each operating on a 64-dimensional subspace.
Pretraining is achieved via the replaced-token detection (RTD) objective: a smaller masked-language-model generator proposes replacements for masked tokens, and the discriminator predicts, for each token position $t$, whether the token is original or replaced. The RTD loss is expressed as:

$$\mathcal{L}_{\text{RTD}} = -\sum_{t=1}^{n} \Big[ y_t \log D(\mathbf{x}, t) + (1 - y_t) \log\big(1 - D(\mathbf{x}, t)\big) \Big]$$

where $y_t \in \{0, 1\}$ indicates the replaced/original label for the $t$-th token, and $D(\mathbf{x}, t)$ is the discriminator’s output probability for position $t$.
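As a concrete illustration, a minimal PyTorch sketch of this per-token binary cross-entropy follows (tensor names and shapes are illustrative assumptions, not the reference ELECTRA implementation):

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """RTD objective as token-level binary cross-entropy.

    disc_logits: (batch, seq_len) raw discriminator scores per token.
    labels: (batch, seq_len) with 1.0 = replaced by generator, 0.0 = original.
    """
    return F.binary_cross_entropy_with_logits(disc_logits, labels)

logits = torch.randn(2, 512)                    # fake discriminator outputs
labels = torch.randint(0, 2, (2, 512)).float()  # fake replaced/original labels
print(rtd_loss(logits, labels))
```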
Fine-tuning on SQuAD v2.0 introduces three classification heads atop the discriminator: a start-position softmax, an end-position softmax, and a binary “no-answer” classifier operating on the [CLS] token. The composite fine-tuning loss is:

$$\mathcal{L} = \mathcal{L}_{\text{start}} + \mathcal{L}_{\text{end}} + \lambda\, \mathcal{L}_{\text{no-ans}}$$

with $\lambda$ typically set to 1. This regime supports span selection and explicit no-answer prediction.
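A hedged sketch of computing this composite loss (head names and argument shapes are hypothetical placeholders consistent with the description above):

```python
import torch
import torch.nn.functional as F

def squad2_loss(start_logits, end_logits, na_logit,
                start_pos, end_pos, is_impossible, lam=1.0):
    """Composite SQuAD v2 fine-tuning loss.

    start_logits, end_logits: (batch, seq_len) span-boundary scores.
    na_logit: (batch,) score from the [CLS] no-answer head.
    start_pos, end_pos: (batch,) gold token indices.
    is_impossible: (batch,) floats, 1.0 for unanswerable questions.
    """
    loss_start = F.cross_entropy(start_logits, start_pos)
    loss_end = F.cross_entropy(end_logits, end_pos)
    loss_na = F.binary_cross_entropy_with_logits(na_logit, is_impossible)
    return loss_start + loss_end + lam * loss_na
```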
2. Input Representation and Context Handling
The model applies WordPiece tokenization with a vocabulary of roughly 30,000 entries and a maximum sequence length of 512 tokens. Input sequences are constructed by concatenating:
- [CLS] token
- Tokenized question
- [SEP] token
- Tokenized context
- [SEP] token
Segment embeddings distinguish question from context, and positional embeddings are assigned up to length 512. All tokens are encoded via the 24-layer stack, and fine-tuning heads consume the resulting hidden representations at the relevant indices (start, end, [CLS]).
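This packing is what the standard `transformers` tokenizer produces for a question/context pair; a minimal sketch (the truncation policy shown is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512")

enc = tok(
    "Who proposed ELECTRA?",                         # question
    "ELECTRA was proposed by Clark et al. (2020).",  # context
    max_length=512,
    truncation="only_second",    # truncate the context, never the question
    return_token_type_ids=True,  # segment ids: 0 = question, 1 = context
)
print(tok.convert_ids_to_tokens(enc["input_ids"])[:8])
print(enc["token_type_ids"][:8])
```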
3. Performance Metrics Across QA Datasets
In a comparative benchmark spanning eight datasets, ahotrod/electra_large_discriminator_squad2_512 yields an aggregate accuracy of 43%, outperforming 46 other CBQA models without further fine-tuning (Muneeb et al., 29 Nov 2025). Dataset-specific top-line accuracies include:
| Dataset | Accuracy (%) |
|---|---|
| bioasq10b-factoid | 87.30 |
| biomedical_cpgQA | 96.45 |
| QuAC | 70.10 |
| ScienceQA | 62.20 |
| Atlas-math-sets | 55.80 |
| QA Dataset (Wikipedia) | 52.70 |
| JournalQA | 48.90 |
| IELTS reading | 45.50 |
Span-selection tasks report F1 scores from 82% on QuAC down to 60% on linguistically complex passages. Accuracy is defined as the ratio of correct predictions to total examples in each dataset.
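For clarity, the metric reduces to a simple ratio; a trivial sketch with hypothetical predictions:

```python
def accuracy(preds, golds):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```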
4. Computational Complexity and Inference Characteristics
The self-attention mechanism in each Transformer layer gives rise to a theoretical time complexity of $O(n^2 d)$ for sequence length $n$ and hidden dimension $d$, while the feed-forward network scales as $O(n d^2)$. Overall, the per-layer complexity is $O(n^2 d + n d^2)$. Empirical results on a single V100 GPU for batch size 1 yield:
- 128-token context: ≈ 40 ms
- 256 tokens: ≈ 80 ms
- 512 tokens: ≈ 160 ms
Within this range, measured latency roughly doubles with each doubling of sequence length, i.e. near-linear scaling; this is consistent with the $O(n d^2)$ feed-forward term dominating while $n \leq d$, with the quadratic $O(n^2 d)$ attention term expected to dominate only at substantially longer sequences.
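A reproduction sketch of this micro-benchmark (absolute timings are hardware-dependent; the warm-up and averaging loop are assumptions about reasonable measurement practice):

```python
import time
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512").eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

for seq_len in (128, 256, 512):
    # Random token ids are sufficient for latency measurement.
    ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device=device)
    with torch.no_grad():
        model(input_ids=ids)  # warm-up pass
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(10):
            model(input_ids=ids)
        if device == "cuda":
            torch.cuda.synchronize()
    print(f"{seq_len} tokens: {(time.perf_counter() - t0) / 10 * 1e3:.1f} ms")
```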
5. Effects of Answer Length and Context Complexity
Analysis within the aggregate paper establishes that accuracy decreases approximately linearly with increasing gold-answer length $L$ (in tokens):

$$\text{Acc}(L) \approx \text{Acc}_0 - 0.012\, L$$

where $\text{Acc}_0$ is the fitted intercept; each additional token in the gold answer reduces exact-match accuracy by approximately 1.2%. Context complexity, quantified by dependency-tree depth and lexical entropy, also impacts results. Passages with average depth > 4 and entropy > 5.0 bits/token incur a further 5–8% drop in accuracy relative to simpler contexts.
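A toy numeric illustration of the linear trend (the intercept is hypothetical; only the ~1.2%-per-token slope comes from the analysis above):

```python
def predicted_accuracy(answer_len, intercept=0.60, slope=0.012):
    """Fitted linear model: accuracy falls ~1.2% per gold-answer token."""
    return max(0.0, intercept - slope * answer_len)

for L in (1, 5, 10):
    print(L, f"{predicted_accuracy(L):.3f}")  # 0.588, 0.540, 0.480
```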
6. Genetic Algorithm Ensemble Enhancement
To augment robustness across QA domains, a genetic algorithm (GA) ensemble combines weighted model outputs. This approach involves:
- Population size: 50 individuals
- Tournament selection size: 5
- Two-point crossover, applied with a fixed crossover probability
- Per-weight mutation probability
Each individual encodes a real-valued weight $w_i$ for model $i$, and ensemble predictions are determined by the argmax of the weighted sum of per-model logits. Over 100 GA generations, ensemble accuracy improves from 43% to 45.2%, an absolute gain of 2.2 percentage points (Muneeb et al., 29 Nov 2025).
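A compact, self-contained sketch of such a GA weight search follows (NumPy only; the fitness data, crossover/mutation probabilities, and elitism step are assumptions the source does not specify):

```python
import numpy as np

rng = np.random.default_rng(0)
N_MODELS, POP, TOURN, GENS = 4, 50, 5, 100
P_CX, P_MUT = 0.9, 0.1  # assumed probabilities; values omitted in the source

# Hypothetical cached per-model logits: shape (models, examples, classes).
logits = rng.normal(size=(N_MODELS, 200, 10))
gold = rng.integers(0, 10, size=200)

def fitness(w):
    """Ensemble accuracy: argmax over the weighted sum of per-model logits."""
    preds = np.tensordot(w, logits, axes=1).argmax(axis=-1)
    return (preds == gold).mean()

pop = rng.random((POP, N_MODELS))
for _ in range(GENS):
    fit = np.array([fitness(ind) for ind in pop])
    new = [pop[fit.argmax()].copy()]                 # elitism (assumed)
    while len(new) < POP:
        # Tournament selection (size 5) for each parent.
        i = rng.choice(POP, TOURN)
        j = rng.choice(POP, TOURN)
        p1 = pop[i[fit[i].argmax()]].copy()
        p2 = pop[j[fit[j].argmax()]].copy()
        if rng.random() < P_CX:                      # two-point crossover
            a, b = sorted(rng.choice(N_MODELS + 1, size=2, replace=False))
            p1[a:b], p2[a:b] = p2[a:b].copy(), p1[a:b].copy()
        for p in (p1, p2):                           # per-weight mutation
            mask = rng.random(N_MODELS) < P_MUT
            p[mask] += rng.normal(scale=0.1, size=mask.sum())
            new.append(np.clip(p, 0.0, None))
    pop = np.array(new[:POP])

best = max(pop, key=fitness)
print("best weights:", best, "ensemble accuracy:", fitness(best))
```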
7. Domain-Specific Competence and Use Cases
ahotrod/electra_large_discriminator_squad2_512 demonstrates superior accuracy, notably achieving 96.45% on biomedical_cpgQA and 87.30% on bioasq10b-factoid, indicating particular aptitude for biomedical QA applications. Performance on dialog-style (QuAC: 70.10%) and mathematics-focused datasets (Atlas-math-sets: 55.80%) further confirms cross-domain generalizability. A plausible implication is that robust RTD pretraining and SQuAD v2 fine-tuning are effective for specialized factual extraction tasks, even without domain-specific retraining. Practical use in information retrieval, user support, and educational platforms is substantiated by metric-driven outcomes across varied contexts (Muneeb et al., 29 Nov 2025).