High-Recall RoBERTa Classifier
- A novel loss strategy blends cross-entropy with a precision–recall surrogate to substantially boost recall in document classification (Trivedi et al., 27 Jan 2026).
- High-recall RoBERTa classifiers are Transformer-based models tuned to capture as many actionable positives as possible while maintaining competitive precision across diverse applications.
- Continuous active learning and careful threshold calibration enable these models to adapt across domains, limiting precision loss and improving operational outcomes.
A high-recall RoBERTa classifier designates a class of Transformer-based models trained and calibrated to maximize recall—capturing the highest possible proportion of relevant or actionable positive instances—while maintaining competitive precision, particularly in document classification and information triage pipelines. High-recall configurations are essential in applications where missed positives (false negatives) are unrecoverable or costly, such as actionable suggestion mining or comprehensive document review. This entry surveys the architecture, training objectives, optimization procedures, recall–precision trade-offs, domain adaptation considerations, and practical recommendations associated with state-of-the-art high-recall RoBERTa classifiers (Trivedi et al., 27 Jan 2026; Sadri et al., 2022).
1. Architecture and Model Variants
Trivedi et al. (Trivedi et al., 27 Jan 2026) selected RoBERTa-base (12 layers, 768 hidden units, 12 heads, ≈110M parameters) for actionable suggestion mining, balancing effectiveness and computation. They used the model’s standard tokenizer and omitted additional embedding layers. Sadri and Cormack (Sadri et al., 2022) employed RoBERTa-large and recommended fine-tuning all weights, optionally freezing layer normalization early in training for stability. Both approaches used a classification head: typically a single-layer projection on the final <s> (RoBERTa’s equivalent of [CLS]) hidden state with a sigmoid output, sometimes preceded by dropout (p=0.1). For richer representations, the <s> embeddings of the last four Transformer layers can optionally be concatenated and reduced by an MLP.
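The following sketch illustrates both head variants in PyTorch, assuming the Hugging Face transformers library (neither paper prescribes a specific framework, and the class and argument names here are illustrative):

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class HighRecallRobertaClassifier(nn.Module):
    """RoBERTa encoder with a sigmoid classification head on the <s> token."""

    def __init__(self, model_name: str = "roberta-base", concat_last_four: bool = False):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(model_name, output_hidden_states=True)
        self.concat_last_four = concat_last_four
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(p=0.1)
        if concat_last_four:
            # Optional richer representation: concatenate the <s> embeddings
            # of the last four layers, then reduce with a small MLP.
            self.head = nn.Sequential(
                nn.Linear(4 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
            )
        else:
            # Standard head: single-layer projection on the final <s> state.
            self.head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if self.concat_last_four:
            # hidden_states[-4:] are the last four layers; position 0 is <s>.
            cls = torch.cat([h[:, 0] for h in out.hidden_states[-4:]], dim=-1)
        else:
            cls = out.last_hidden_state[:, 0]
        logits = self.head(self.dropout(cls)).squeeze(-1)
        return torch.sigmoid(logits)  # scores s_i in (0, 1)
```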
2. Training Objectives: Precision–Recall Surrogate and Loss Innovations
To bias optimization toward recall, novel loss functions were introduced:
- Precision–Recall Surrogate Loss: The target function blends conventional cross-entropy with a differentiable surrogate for (soft) precision and recall. For batch size $B$, model scores $s_i \in (0,1)$, and labels $y_i \in \{0,1\}$, soft confusion counts are accumulated at each of $K$ thresholds $\tau_k$ via a temperature-$T$ sigmoid $\sigma((s_i - \tau_k)/T)$, yielding soft precision $\widetilde{P}_k$ and soft recall $\widetilde{R}_k$, which enter the objective as
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \, \frac{1}{K}\sum_{k=1}^{K}\bigl(1-\widetilde{F}_{\beta,k}\bigr), \qquad \widetilde{F}_{\beta,k} = \frac{(1+\beta^{2})\,\widetilde{P}_k\,\widetilde{R}_k}{\beta^{2}\,\widetilde{P}_k+\widetilde{R}_k+\epsilon}.$$
Key hyperparameters: number of thresholds $K$, temperature $T$, stability constant $\epsilon$, surrogate weight $\lambda$, and recall emphasis $\beta$.
- Cost-sensitive and Margin Ranking Losses: In continuous active learning regimes (Sadri et al., 2022), class imbalance is addressed by upweighting positive instances in the cross-entropy:
$$\mathcal{L}_{\mathrm{CS}} = -\frac{1}{B}\sum_{i=1}^{B}\bigl[w_{+}\,y_i\log s_i+(1-y_i)\log(1-s_i)\bigr], \qquad w_{+}>1.$$
An optional margin ranking loss enforces that positive examples are scored higher than negatives by at least a margin $m$:
$$\mathcal{L}_{\mathrm{rank}} = \sum_{i:\,y_i=1}\;\sum_{j:\,y_j=0}\max\bigl(0,\; m-(s_i-s_j)\bigr).$$
This objective design ensures upward pressure on positive predictions across a range of thresholds, directly improving batch-level recall without substantial loss of probability calibration.
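A minimal PyTorch sketch of these components follows; the exact surrogate in (Trivedi et al., 27 Jan 2026) may differ in detail, so this implements the soft-$F_\beta$ reconstruction given above, and the function names and default values are illustrative:

```python
import torch
import torch.nn.functional as F

def pr_surrogate_loss(scores, labels, lam=0.6, beta=1.25, K=25, T=0.05, eps=1e-7):
    """Cross-entropy blended with a soft-F_beta penalty over K thresholds."""
    ce = F.binary_cross_entropy(scores, labels)
    # K evenly spaced interior thresholds in (0, 1).
    taus = torch.linspace(0.0, 1.0, K + 2, device=scores.device)[1:-1]
    # Soft "predicted positive" indicator for every (example, threshold) pair.
    p = torch.sigmoid((scores.unsqueeze(1) - taus.unsqueeze(0)) / T)  # [B, K]
    y = labels.unsqueeze(1)                                           # [B, 1]
    tp = (y * p).sum(dim=0)
    fp = ((1 - y) * p).sum(dim=0)
    fn = (y * (1 - p)).sum(dim=0)
    prec = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f_beta = (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec + eps)
    return ce + lam * (1.0 - f_beta).mean()

def cost_sensitive_bce(scores, labels, pos_weight=5.0, eps=1e-7):
    """Upweighted positives, as in the CAL regime (Sadri et al., 2022)."""
    w = 1.0 + (pos_weight - 1.0) * labels  # w_+ on positives, 1 on negatives
    return -(w * (labels * torch.log(scores + eps)
                  + (1 - labels) * torch.log(1 - scores + eps))).mean()

def margin_ranking_loss(scores, labels, margin=0.2):
    """Optional: every positive should outscore every negative by `margin`."""
    pos, neg = scores[labels > 0.5], scores[labels < 0.5]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # pairwise score gaps
    return F.relu(margin - diff).mean()
```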
3. Training Procedures and Thresholding
Training Data and Preprocessing: In (Trivedi et al., 27 Jan 2026), 1,110 domain-labeled customer reviews (440 positives, 670 negatives) from the hospitality/food sector were used, with only standard tokenization. In human-in-the-loop CALBERT scenarios (Sadri et al., 2022), training data is incrementally grown through continuous active learning cycles, starting with an empty label set and progressively expanding via uncertainty- and score-driven sampling.
Hyperparameter Configuration:
- Batch size 16, learning rate on the order of $10^{-5}$, AdamW optimizer, weight decay $0.01$, 10% learning rate warmup.
- Surrogate loss weights and thresholds as above.
Threshold Calibration: For soft-label models, training occurs across multiple thresholds (e.g., 25 evenly spaced $\tau_k \in (0,1)$). At inference, a default threshold of 0.5 is often sufficient if probabilities are well calibrated, but practitioners may adopt lower thresholds (e.g., 0.2–0.3) to further raise recall, particularly in document retrieval contexts (Sadri et al., 2022). Post hoc calibration (e.g., Platt scaling, isotonic regression) can further stabilize decision points.
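One simple calibration recipe is to pick, on a held-out set, the most selective threshold that still meets a recall target; a sketch using scikit-learn follows (the function name and the 0.95 target are illustrative, not from either paper):

```python
from sklearn.metrics import precision_recall_curve

def calibrate_threshold(dev_labels, dev_scores, target_recall=0.95):
    """Highest decision threshold whose dev-set recall meets the target."""
    precision, recall, thresholds = precision_recall_curve(dev_labels, dev_scores)
    # precision/recall have len(thresholds) + 1 entries; drop the final
    # threshold-less point so the arrays align with `thresholds`.
    ok = recall[:-1] >= target_recall
    if not ok.any():
        return 0.0  # no threshold reaches the target; accept everything
    # Among thresholds meeting the recall target, the largest is the most
    # selective, i.e., it gives the best precision at that recall level.
    return float(thresholds[ok].max())
```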
4. Recall–Precision Trade-off and Performance Metrics
On held-out datasets with a positive rate of 13–18% (Trivedi et al., 27 Jan 2026), the high-recall RoBERTa classifier trained with a precision–recall surrogate achieved:
- Precision: 0.9039
- Recall: 0.9221
In comparison, standard cross-entropy yielded similar precision but lower recall (0.8873), indicating a +3.49% absolute recall gain at comparable precision. F$_1$ increased from ≈0.895 to ≈0.913. These gains matter most when missed positives are costly.
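The reported F$_1$ figures follow directly from the harmonic mean of the precision and recall above:

$$F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.9039 \times 0.9221}{0.9039 + 0.9221} \approx 0.913.$$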
Learning curves demonstrate recall saturation by 70% of the training set, and ablations confirm that the surrogate shifts performance upward along the precision–recall (PR) frontier, providing higher recall at matched precision levels.
5. Continuous Active Learning and High-Recall Retrieval
In document triage and recall-intensive workflows (Sadri et al., 2022), the CALBERT paradigm adapts RoBERTa to iterative human-feedback loops:
- Stage 1: Retrieve candidates (e.g., top-$k$ from BM25).
- Stage 2: Score with RoBERTa, sample diverse positives/uncertains for labeling.
- Continuously fine-tune on the growing labeled set, applying cost-sensitive weighting to maintain class balance.
- Dynamically adjust classification thresholds to keep recall above a target percentage for candidate pools; stop iterating when newly found positives plateau.
Additional strategies include balancing exploitation and exploration in candidate selection, ablation of margin loss vs. standard cross-entropy, and validation across train/dev/test topic splits to optimize annotation efficiency and final recall. Embedding-based approximations (late interaction) may be employed for scalable reranking.
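The loop below sketches one CAL cycle in this style; `bm25_retrieve`, `score_batch`, `request_labels`, and `fine_tune` are hypothetical helpers standing in for the retrieval engine, the RoBERTa scorer, the human annotation step, and the training routine, and the sampling split is illustrative:

```python
def cal_iteration(query, labeled, model, k=1000, batch=50, explore_frac=0.3):
    # Stage 1: recall-oriented candidate retrieval (e.g., top-k from BM25).
    candidates = bm25_retrieve(query, k=k)   # hypothetical helper
    scores = score_batch(model, candidates)  # hypothetical helper
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    # Stage 2: mix exploitation (top-scored) with exploration (uncertain,
    # i.e., scores near the decision boundary) when sampling for labeling.
    n_explore = int(batch * explore_frac)
    top = [doc for doc, _ in ranked[: batch - n_explore]]
    uncertain = [doc for doc, s in sorted(ranked, key=lambda x: abs(x[1] - 0.5))[:n_explore]]
    to_label = list(dict.fromkeys(top + uncertain))  # dedupe, keep order
    labeled.update(request_labels(to_label))         # human-in-the-loop step
    # Fine-tune on the grown label set with cost-sensitive weighting.
    model = fine_tune(model, labeled, pos_weighted=True)
    return model, labeled
```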
6. Domain Adaptation and Generalization
No specialized domain adaptation was performed beyond initial training on hospitality/food reviews (Trivedi et al., 27 Jan 2026). Cross-domain evaluations spanning real estate, healthcare, finance, and automotive sectors showed recall persistence (0.90–0.98), although precision dropped as vocabulary divergence increased. This suggests that domain transfer without further tuning preserves high recall but raises false-positive rates in specialized contexts. Lightweight additional adaptation (such as vocabulary expansion or in-domain fine-tuning) is recommended if precision drops are problematic.
7. Implementation Recommendations and Operational Insights
Key operational advice includes (gathered into a configuration sketch after this list):
- Choose RoBERTa-base for most use cases; larger variants yield diminishing returns on business-classification metrics relative to compute.
- Tune $\lambda$ (0.5–0.7) and $\beta$ (1.0–1.5) to balance calibration with recall uplift.
- Employ 20–30 soft thresholds ($K$) and a low temperature ($T \lesssim 0.05$) in the surrogate.
- Use learning rate warm-up over 5–10% of steps to stabilize training.
- Consistently verify on held-out data that recall grows without catastrophic precision loss.
- For new domains, begin with cross-domain validation before investing in further adaptation.
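As a convenience, the recommendations above can be gathered into a single configuration object; the dataclass below is a hypothetical packaging of the reported ranges, not an API from either paper:

```python
from dataclasses import dataclass

@dataclass
class HighRecallConfig:
    model_name: str = "roberta-base"  # larger variants: diminishing returns
    batch_size: int = 16
    weight_decay: float = 0.01
    warmup_fraction: float = 0.10     # warm up over 5-10% of steps
    lam: float = 0.6                  # surrogate weight, tuned in 0.5-0.7
    beta: float = 1.25                # recall emphasis, tuned in 1.0-1.5
    num_thresholds: int = 25          # 20-30 soft thresholds
    temperature: float = 0.05         # low T sharpens the soft indicators
    inference_threshold: float = 0.5  # drop to 0.2-0.3 for more recall
```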
Even a modest absolute gain in recall (e.g., 3–5%) translates into many additional actionable findings in operational deployments. In suggestion-mining pipelines, a high-recall RoBERTa classifier serves as an effective first-stage filter for downstream LLM-based extraction, categorization, and summarization systems (Trivedi et al., 27 Jan 2026).
References:
- Trivedi et al. (2026). "A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews."
- Sadri & Cormack (2022). "Continuous Active Learning Using Pretrained Transformers."