- The paper presents BTZSC, a comprehensive evaluation suite for zero-shot text classification across four model families using 22 strictly zero-shot English datasets.
- It details a standardized evaluation protocol comparing NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs on tasks spanning sentiment, topic, intent, and emotion.
- Empirical findings reveal that while rerankers achieve the highest aggregate performance, embedding models offer superior inference speed and LLMs scale strongly despite higher latency.
BTZSC: A Unified Benchmark for Zero-Shot Text Classification Across Model Families
Introduction
Zero-shot text classification (ZSC) aims to assign labels to text instances without any labeled examples from the target task. While early ZSC methods were dominated by NLI cross-encoders, recent years have seen the emergence of strong embedding models, rerankers, and instruction-tuned LLMs. Existing evaluations, notably MTEB, often do not disentangle genuine zero-shot capability from performance obtained through linear probes or indirect supervision. BTZSC addresses this gap with an evaluation suite of 22 strictly zero-shot English datasets spanning sentiment, topic, intent, and emotion classification with varying label cardinality and document length. The benchmark provides a systematic, controlled platform for directly comparing the four principal model paradigms under a shared evaluation methodology.
Benchmark Design
BTZSC evaluates four model families—NLI cross-encoders, embedding models, rerankers, and instruction-tuned LLMs—in a strictly zero-shot regime. The design prioritizes several axes of diversity:
- Task and domain diversity: The 22 datasets represent sentiment, topic, intent, and emotion classification, drawing from product reviews, political documents, social media, news, and conversations.
- Class granularity: The benchmark includes binary, medium- (e.g., 4-way), and high-cardinality (up to 77-way) label sets.
- Document length: Both short (single-sentence) and long-form (several paragraphs) documents are considered.
- Evaluation protocol: All models are evaluated on standardized, semantically rich verbalizers for each class. No labeled examples or task-relevant supervision are used at test time.
This careful curation mitigates domain adaptation effects and ensures a fair assessment of each model’s semantic matching ability.
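To make the shared protocol concrete, the sketch below (with placeholder verbalizers and a generic scoring callback, not BTZSC's actual templates) shows the label-selection loop that every model family plugs into:

```python
# Minimal sketch of the shared zero-shot protocol. The verbalizers below are
# placeholders; BTZSC fixes its own standardized templates for each dataset.
VERBALIZERS = {
    "positive": "This review expresses a positive opinion about the product.",
    "negative": "This review expresses a negative opinion about the product.",
    "neutral": "This review expresses a neutral opinion about the product.",
}

def zero_shot_predict(text: str, verbalizers: dict[str, str], scorer) -> str:
    """Score the text against each class verbalizer and return the best label.

    `scorer(text, verbalizer) -> float` is supplied by the model family:
    entailment probability, cosine similarity, reranker score, or next-token probability.
    """
    labels = list(verbalizers)
    scores = [scorer(text, verbalizers[label]) for label in labels]
    return labels[scores.index(max(scores))]
```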
Model Families and Methodological Details
NLI Cross-Encoders: These models are fine-tuned on large NLI datasets and perform classification by treating the text as a premise and each label verbalizer as a hypothesis. Both base and large model variants are considered, as well as different learning objectives (binary vs. triplet loss).
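A minimal sketch of this entailment-style scoring, assuming an off-the-shelf MNLI cross-encoder from Hugging Face (the checkpoint and its label ordering are illustrative, not the specific models evaluated in BTZSC):

```python
# Sketch of NLI-style zero-shot scoring. The checkpoint and its
# [contradiction, neutral, entailment] label order are assumptions of this example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def nli_score(text: str, hypothesis: str) -> float:
    """Entailment probability of a label verbalizer (hypothesis) given the text (premise)."""
    inputs = tok(text, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return torch.softmax(logits, dim=-1)[2].item()  # index 2 = entailment for this checkpoint
```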
Embedding Models: Models such as E5, BGE, and GTE generate vector encodings for texts and candidate labels. Label assignment is based on cosine similarity in embedding space, with no cross-attention between input and label.
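The following sketch illustrates embedding-based label assignment with a sentence-transformers checkpoint; the model name is a placeholder rather than BTZSC's exact configuration:

```python
# Sketch of embedding-based zero-shot classification via cosine similarity.
# The checkpoint is illustrative; any sentence-transformers model works the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")

def embed_classify(text: str, label_verbalizers: list[str]) -> int:
    """Return the index of the verbalizer whose embedding is closest to the text embedding."""
    text_emb = model.encode(text, convert_to_tensor=True, normalize_embeddings=True)
    label_embs = model.encode(label_verbalizers, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(text_emb, label_embs)[0]  # no cross-attention: pure vector similarity
    return int(sims.argmax())
```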
Rerankers: Typically sequence-to-sequence or cross-encoder models, rerankers score the relevance of candidate labels (treated as short documents) to a query text. BTZSC evaluates both lightweight rerankers and large autoregressive rerankers (notably, Qwen3-Reranker).
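A small sketch of reranker-style scoring, assuming a publicly available cross-encoder reranker (the checkpoint is illustrative; the benchmark's reranker set, including Qwen3-Reranker, differs):

```python
# Sketch of reranker-style scoring: each label verbalizer is treated as a candidate
# "document" for the input "query". The checkpoint is illustrative only.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank_classify(text: str, label_verbalizers: list[str]) -> int:
    """Score every (text, verbalizer) pair jointly and return the top-scoring label index."""
    pairs = [(text, verb) for verb in label_verbalizers]
    scores = reranker.predict(pairs)  # one relevance score per candidate label
    return int(scores.argmax())
```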
Instruction-tuned LLMs: ZSC is posed as a multiple-choice task where the model predicts the label via next-token probability over candidate answers using prompt templates. The study considers models from 270M to 12B parameters (Gemma, Llama, Qwen, Mistral, Phi).
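The multiple-choice formulation can be sketched as follows; the prompt wording, answer-letter tokenization, and model checkpoint are assumptions of this illustration, not BTZSC's prompt templates:

```python
# Sketch of LLM zero-shot classification as multiple choice via next-token probabilities.
# The checkpoint, prompt wording, and answer-letter tokenization are assumptions here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def llm_classify(text: str, labels: list[str]) -> int:
    """Build a multiple-choice prompt and pick the answer letter with the highest logit."""
    options = "\n".join(f"{chr(65 + i)}. {lab}" for i, lab in enumerate(labels))
    prompt = f"Classify the following text.\nText: {text}\nOptions:\n{options}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = lm(ids).logits[0, -1]  # distribution over the token after "Answer:"
    # Token ids for " A", " B", ...; assumes each letter is a single token with this tokenizer.
    letter_ids = [tok.encode(f" {chr(65 + i)}", add_special_tokens=False)[0]
                  for i in range(len(labels))]
    return int(next_logits[letter_ids].argmax())
```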
Experimental Setup
All models are evaluated under an identical zero-shot protocol with fixed verbalizer templates and tokenization strategies. No dataset-specific prompt tuning or post-hoc calibration is performed. For each input, the label-assignment method is matched to the model family (e.g., entailment score for cross-encoders, cosine similarity for embeddings, next-token probability for LLMs). Results are reported as macro-F1 (primary), accuracy, precision, and recall, with averages computed over datasets and tasks.
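Assuming per-dataset predictions are collected, the reporting step reduces to standard metric computation; the helpers below are a sketch using scikit-learn rather than the benchmark's own tooling:

```python
# Sketch of the reporting step with scikit-learn; macro-F1 is the primary metric,
# with per-dataset scores averaged into a benchmark-level figure.
from statistics import mean
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def dataset_scores(y_true: list[int], y_pred: list[int]) -> dict:
    """Compute the reported metrics for a single dataset."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }

def benchmark_average(per_dataset: list[dict]) -> dict:
    """Average each metric over all datasets."""
    return {k: mean(d[k] for d in per_dataset) for k in per_dataset[0]}
```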
Main Empirical Findings
- Rerankers lead on aggregate metrics: Qwen3-Reranker-8B achieves macro-F1 = 0.72 (accuracy = 0.76), dominating all other models and achieving the highest scores across most task types.
- Embedding models close the gap: Strong models like gte-large-en-v1.5 reach macro-F1 = 0.62, slightly surpassing the best NLI cross-encoders, while offering significantly better inference speed. Importantly, further scaling embedding models yields only incremental improvements.
- Instruction-tuned LLMs excel at scale: LLMs with 8–12B parameters (e.g., Mistral-Nemo-Instruct-2407) achieve macro-F1 up to 0.67. Below 3B parameters, LLMs remain noncompetitive for ZSC.
- NLI-based cross-encoders plateau: Despite increases in backbone size or data diversity, NLI cross-encoders peak around macro-F1 = 0.60. Advancements from model scaling are limited in this family.
- Latency-accuracy trade-offs: Embedding models dominate the Pareto efficient front for deployment, combining high accuracy with low computational cost. Large LLMs, while accurate, exhibit high latency; rerankers strike a middle ground depending on architecture.
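As an illustration of how such a front can be identified, the sketch below computes Pareto-optimal models from (latency, macro-F1) pairs; the numbers are placeholders, not measurements from the benchmark:

```python
# Sketch of extracting a latency-accuracy Pareto front from (latency, macro-F1) pairs.
# The example numbers are placeholders, not results reported by BTZSC.
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """Keep models for which no other model is at least as fast and strictly more accurate."""
    front = []
    for name, (lat, f1) in models.items():
        dominated = any(other_lat <= lat and other_f1 > f1
                        for other_lat, other_f1 in models.values())
        if not dominated:
            front.append(name)
    return front

example = {"embedding-model": (5.0, 0.62), "reranker-8b": (40.0, 0.72), "llm-12b": (300.0, 0.67)}
print(pareto_front(example))  # the hypothetical LLM is dominated by the faster, stronger reranker
```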
Task-Specific Insights
- Sentiment classification is saturated: All leading model families achieve macro-F1 > 0.9 on binary and ternary sentiment tasks, so these tasks no longer discriminate among state-of-the-art architectures.
- Emotion and high-cardinality intent are most challenging: Even the best models score < 0.5 macro-F1 on datasets like EmpatheticDialogues and banking77.
- Topic and intent tasks reveal model family strengths: Rerankers and LLMs outperform embeddings and NLI cross-encoders on harder topic and intent tasks, especially as the number of classes increases.
Scaling and Family-Wise Trends
- Rerankers benefit monotonically from scaling, outperforming all other models when parameter count exceeds 1B.
- Embedding model scaling saturates quickly: Increases in parameter count and pretraining data size improve performance up to ~500M parameters, beyond which performance plateaus.
- LLMs show steep scaling returns: Marked performance jumps occur between 3B and 8B parameters, especially when paired with advanced instruction-tuning.
NLI as a Proxy for ZSC
- Strong correlation for NLI cross-encoders and LLMs: NLI AUROC predicts ZSC F1, confirming direct transfer from entailment task competence to label matching.
- No reliable correlation for embedding models: Above a certain threshold of NLI performance, further improvements do not transfer to zero-shot classification, implying that richer embedding-space structure is the limiting factor.
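The proxy analysis itself amounts to correlating per-model NLI scores with zero-shot macro-F1; the sketch below uses hypothetical values purely to show the computation:

```python
# Sketch of the proxy analysis: correlate per-model NLI AUROC with zero-shot macro-F1.
# The arrays are hypothetical illustrations, not values reported by the benchmark.
from scipy.stats import pearsonr, spearmanr

nli_auroc = [0.82, 0.86, 0.88, 0.91, 0.93]  # hypothetical NLI performance, one entry per model
zsc_f1 = [0.48, 0.53, 0.55, 0.60, 0.63]     # hypothetical zero-shot macro-F1 for the same models

r, p_r = pearsonr(nli_auroc, zsc_f1)
rho, p_rho = spearmanr(nli_auroc, zsc_f1)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}); Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```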
Robustness and External Validation
- Evaluations on eight English classification tasks from MTEB-v2 confirm strong consistency with BTZSC family-wise trends (Kendall τ = 0.69 rank correlation). Absolute scores for embeddings are higher on MTEB, reflecting its relative ease (especially in topic prediction), but the ordinal ordering and scaling conclusions persist.
Practical and Theoretical Implications
BTZSC reframes the assessment of zero-shot text classifiers by enforcing a genuinely unsupervised evaluation regime and by facilitating comparison across modern model families. The results indicate a shift in the optimal trade-off regime for real-world applications: specialized rerankers attain the highest accuracy but incur nontrivial inference costs, while advanced embedding models approach similar performance at dramatically lower latency. Instruction-tuned LLMs at moderate-to-large scale are competitive for most tasks but remain efficiency-constrained. NLI cross-encoders, while formerly dominant, now constitute a plateaued baseline.
These findings have direct implications for both research and deployment:
- For practical deployment, embedding models (notably GTE and E5 families) should be prioritized where inference speed, scalability, and cost matter.
- For maximizing raw accuracy (e.g., when latency/cost is less of a constraint), modern rerankers such as Qwen3-Reranker should be selected.
- For theoretical research, the decoupling between NLI ability and ZSC performance in embeddings signals that future work should focus on optimizing embedding spaces for fine-grained discrimination across large label inventories and difficult domains (emotion, intent).
- For methodological advances, the benchmark highlights the need for model architectures that combine the accuracy of rerankers, the flexibility of LLMs, and the efficiency of embeddings.
Future Directions
BTZSC is positioned as a platform for reproducible, fine-grained progress in ZSC. Promising avenues for extension include:
- Multilingual ZSC evaluation: Current datasets are English only; extending to other languages will further challenge model generalization.
- Dataset and label verbalizer enrichment: Investigating more nuanced verbalization techniques or richer context exploitation per label.
- Continued scaling and distillation: Scaling reranker and embedding architectures further, as well as distilling large models into efficient baselines suitable for edge deployment.
- Generalization to open-ontology settings: Moving beyond static label sets to test open and compositional label construction.
Conclusion
BTZSC provides the first comprehensive, truly zero-shot benchmark spanning the four principal model families in text classification. The analysis demonstrates that contemporary rerankers and embedding models have largely eclipsed NLI cross-encoders, with instruction-tuned LLMs achieving competitive results at the cost of higher latency. The benchmark, code, and leaderboard are shared publicly to ensure reproducibility and to support further advances in universal, zero-shot text understanding (2603.11991).