Mapping-Free Automatic Verbalizer (MAV)
- Mapping-Free Automatic Verbalizer (MAV) is a prompt-based classification approach that replaces manual label-word mapping with a learnable verbalization function using full MLM outputs.
- It employs neural projection and prototype-based architectures to enhance scalability and information efficiency in few-shot and semi-supervised text classification.
- Experimental results show that MAV methods achieve significant accuracy gains and tighter class clusters compared to traditional verbalizers.
Mapping-Free Automatic Verbalizer (MAV) refers to a class of approaches in prompt-based classification frameworks that eliminate the dependence on explicit, manually crafted mappings between class labels and vocabulary tokens (“label words”). MAV methods instead implement a learnable verbalization function that leverages the full contextual output of pre-trained language models (PLMs) for few-shot and semi-supervised text classification. These techniques have been proposed to address the information loss and scalability bottlenecks associated with traditional verbalizers, particularly in multi-class regimes or where expert knowledge for token selection is limited (Kho et al., 2023, Wei et al., 2022). MAVs are instantiated either as neural projections operating on the full masked language model (MLM) output vector (Kho et al., 2023) or as prototype-based systems using learned continuous label embeddings (Wei et al., 2022), fundamentally shifting prompt-based NLP paradigms from token-level lexical mapping to end-to-end representation learning.
1. Motivation and Conceptual Evolution
Prompt-based classification methods recast standard classification as a cloze task, wrapping the input in a template that contains a single [MASK] token. Standard verbalizers map each class label to one or a few vocabulary tokens and interpret the MLM output probabilities of those tokens as class likelihoods. Current practice suffers from two major limitations:
- Manual verbalizer bottleneck: Expert selection of label words does not scale to large or fine-grained label sets, and manual mapping introduces variance and domain bias.
- Information compression: Typical verbalizers only use a small subset of the MLM vocabulary, discarding the majority of representational information in the output, which is especially detrimental in multi-class settings and low-resource regimes.
MAVs address these problems by removing the need for explicit label-token associations, instead learning to aggregate the entirety of the MLM output signal or its hidden representation into class scores in an end-to-end trainable way (Kho et al., 2023, Wei et al., 2022). This procedural shift enables more robust, scalable, and information-efficient self-training for both few-shot and semi-supervised text classification.
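As a minimal sketch of this contrast, the snippet below shows how a conventional manual verbalizer scores a cloze prompt with a Hugging Face masked LM: only a handful of hand-picked label-word probabilities are read off, while the rest of the vocabulary distribution is discarded. The model name, template, and label words are illustrative assumptions, not taken from the cited papers; an MAV instead feeds the entire distribution into a learned head (Section 2).

```python
# Illustration only (not the MAV method itself): a manual verbalizer keeps a
# few label-word probabilities at the [MASK] position and ignores the rest
# of the |V|-dimensional distribution that MAV would consume in full.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # illustrative PLM
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

text = "The match ended in a dramatic last-minute goal."
prompt = f"{text} This text is about {tokenizer.mask_token}."   # illustrative template
label_words = {"sports": " sports", "politics": " politics"}    # hand-crafted mapping

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # (1, seq_len, |V|)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
vocab_probs = logits[0, mask_pos].softmax(dim=-1)                # full |V|-dim distribution

# Manual verbalizer: read off only the chosen label-word probabilities.
scores = {
    label: vocab_probs[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))[0]].item()
    for label, word in label_words.items()
}
print(scores)   # MAV would instead map the whole vocab_probs vector to class logits
```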
2. Formal Architectures and Mathematical Frameworks
Two principal MAV realizations have been developed:
2.1 Neural Mapping MAV (Kho et al., 2023)
Given the MLM output $\mathbf{v} \in \mathbb{R}^{|V|}$ (probabilities for each vocabulary word at the [MASK] position), MAV passes $\mathbf{v}$ through two fully connected layers:

$$\mathbf{z} = W_c\,\sigma(W_v \mathbf{v} + b_v) + b_c,$$

where:
- $W_v$, $b_v$ are the parameters of the “vocab extractor,” compressing full-vocabulary MLM logits into a $d$-dimensional feature.
- $\sigma$ is an elementwise activation (ReLU or tanh).
- $W_c$, $b_c$ produce the class logits.
- The MLM head remains frozen throughout to preserve pre-trained vocabulary priors.
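A minimal PyTorch sketch of this head is given below, assuming the frozen MLM supplies a $|V|$-dimensional output vector per [MASK]; the class and attribute names, hidden size, and vocabulary size are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class MAVHead(nn.Module):
    """Mapping-free verbalizer head: full MLM vocab distribution -> class logits.

    Sketch of Section 2.1: z = W_c * sigma(W_v v + b_v) + b_c, applied on top
    of a frozen pre-trained MLM head. Hidden size is an illustrative choice.
    """
    def __init__(self, vocab_size: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.vocab_extractor = nn.Linear(vocab_size, hidden)   # W_v, b_v
        self.act = nn.ReLU()                                    # elementwise sigma
        self.classifier = nn.Linear(hidden, num_classes)        # W_c, b_c

    def forward(self, mlm_output: torch.Tensor) -> torch.Tensor:
        # mlm_output: (batch, |V|) probabilities or logits at the [MASK] position
        return self.classifier(self.act(self.vocab_extractor(mlm_output)))

# Usage: class logits from a frozen MLM's [MASK] distribution.
head = MAVHead(vocab_size=50265, num_classes=6)   # vocab size illustrative
dummy = torch.randn(4, 50265)                     # stand-in for MLM output vectors
print(head(dummy).shape)                          # torch.Size([4, 6])
```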
2.2 Prototype-based MAV (Wei et al., 2022; also called the Prototypical Prompt Verbalizer)
Here, each class $c$ is represented by a continuous vector “prototype” $\mathbf{p}_c$ in a learned metric space. The [MASK] hidden state $\mathbf{h}$ is projected to $\mathbf{z} = f(\mathbf{h})$, and an input is classified by its cosine similarity to each $\mathbf{p}_c$:

$$\hat{y} = \arg\max_c \cos(\mathbf{z}, \mathbf{p}_c) = \arg\max_c \frac{\mathbf{z} \cdot \mathbf{p}_c}{\lVert \mathbf{z} \rVert \, \lVert \mathbf{p}_c \rVert}.$$
Prototypes are initialized by aggregating representations from cloze-elicited sentences and refined using contrastive/metric learning.
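A minimal sketch of prototype-based scoring is shown below, assuming a linear projection of the [MASK] hidden state and one learnable prototype per class; the projection dimension and random initialization are illustrative stand-ins for the cloze-based initialization described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeVerbalizer(nn.Module):
    """Score an input by cosine similarity between its projected [MASK]
    representation and learnable class prototypes (cf. Section 2.2)."""
    def __init__(self, hidden_size: int, proj_dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, proj_dim)              # f(h)
        self.prototypes = nn.Parameter(torch.randn(num_classes, proj_dim))

    def forward(self, mask_hidden: torch.Tensor) -> torch.Tensor:
        z = F.normalize(self.proj(mask_hidden), dim=-1)           # (batch, d)
        p = F.normalize(self.prototypes, dim=-1)                  # (C, d)
        return z @ p.t()                                          # cosine similarities

verbalizer = PrototypeVerbalizer(hidden_size=768, proj_dim=128, num_classes=4)
sims = verbalizer(torch.randn(2, 768))   # (2, 4); predict with sims.argmax(-1)
```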
3. Training Regimes and Objective Functions
3.1 Neural MAV Semi-supervised Self-training (Kho et al., 2023)
The objective combines:
- Supervised loss ($\mathcal{L}_{\text{sup}}$): Cross-entropy over labeled samples.
- Self-training loss ($\mathcal{L}_{\text{st}}$): FixMatch-style consistency regularization, using high-confidence pseudo-labels obtained from weakly augmented inputs to supervise predictions on strongly augmented inputs.
- Auxiliary MLM loss ($\mathcal{L}_{\text{mlm}}$): Standard cross-entropy over randomly masked tokens (outside the prompt) to prevent catastrophic forgetting.
The optimization target is

$$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda_{\text{st}}\,\mathcal{L}_{\text{st}} + \lambda_{\text{mlm}}\,\mathcal{L}_{\text{mlm}},$$

where $\lambda_{\text{st}}$ and $\lambda_{\text{mlm}}$ are tunable hyperparameters.
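The sketch below shows one way these three terms can be combined in a single training step. The `model(...)` and `model.mlm_loss(...)` interfaces, the confidence threshold, and the loss weights are hypothetical placeholders for illustration only.

```python
import torch
import torch.nn.functional as F

def mav_loss(model, labeled_batch, unlabeled_batch,
             lam_st: float = 1.0, lam_mlm: float = 0.1, tau: float = 0.95):
    """Combined objective: supervised CE + FixMatch-style self-training + MLM.

    Assumes a hypothetical interface: model(x) returns class logits from the
    MAV head, and model.mlm_loss(x) returns an auxiliary masked-token loss.
    """
    x, y = labeled_batch
    loss_sup = F.cross_entropy(model(x), y)                  # supervised term

    u_weak, u_strong = unlabeled_batch                       # two augmented views
    with torch.no_grad():
        probs = model(u_weak).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= tau                                    # confident pseudo-labels only
    loss_st = (F.cross_entropy(model(u_strong), pseudo, reduction="none") * keep).mean()

    loss_mlm = model.mlm_loss(x)                              # auxiliary MLM term

    return loss_sup + lam_st * loss_st + lam_mlm * loss_mlm
```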
3.2 Contrastive Prototype MAV (Wei et al., 2022)
Here, the loss is a weighted sum of three contrastive objectives:
- Instance–instance ($\mathcal{L}_{\text{ins-ins}}$): Encourages same-class representations to be close and different-class representations to be separated.
- Instance–prototype ($\mathcal{L}_{\text{ins-pro}}$): Pulls instances towards their class prototype.
- Prototype–instance ($\mathcal{L}_{\text{pro-ins}}$): Pushes prototypes away from instances of other classes.
The combined loss is

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\text{ins-ins}} + \lambda_2\,\mathcal{L}_{\text{ins-pro}} + \lambda_3\,\mathcal{L}_{\text{pro-ins}}.$$
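As an illustration of one of these terms, the sketch below implements an InfoNCE-style instance–prototype objective; the temperature, normalization, and exact form are assumptions and not necessarily the formulation used in the cited work.

```python
import torch
import torch.nn.functional as F

def instance_prototype_loss(z: torch.Tensor, prototypes: torch.Tensor,
                            labels: torch.Tensor, temperature: float = 0.1):
    """Pull each instance toward its own class prototype and away from others.

    z: (batch, d) projected [MASK] representations; prototypes: (C, d);
    labels: (batch,). The InfoNCE-style form here is illustrative.
    """
    z = F.normalize(z, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = z @ p.t() / temperature           # (batch, C) scaled cosine similarities
    return F.cross_entropy(logits, labels)     # softmax over prototypes

loss = instance_prototype_loss(torch.randn(8, 128), torch.randn(5, 128),
                               torch.randint(0, 5, (8,)))
```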
4. Comparative Experimental Results
Comprehensive evaluations have been conducted on datasets such as TREC (6 classes), TREC50 (22 classes), GoEmotions (26 classes), Yahoo Answers (10 classes), AG’s News (4 classes), and DBPedia (14 classes) (Kho et al., 2023, Wei et al., 2022).
- MAV's semi-supervised accuracy exceeds prior self-training baselines by an average of +12.8% on multi-class tasks and achieves the highest benefit ratio—closest to 1, indicating optimal utilization of the unlabeled set (Kho et al., 2023).
- Prototype MAV (PPV) outperforms manual and soft verbalizer approaches on complex few-shot setups, particularly for many-class problems: e.g., on Yahoo and DBpedia, PPV matches or surpasses manual prompt-tuning at 10/20 shots per class (Wei et al., 2022).
Illustrative performance table from (Kho et al., 2023):
| Model/Baseline | Small-supervised | Semi-supervised | Full-supervised | Benefit Ratio |
|---|---|---|---|---|
| Manual Verbalizer | – | – | – | – |
| AMuLaP | – | – | – | – |
| Neural MAV (proposed) | + | +12.8% | + | ~1.0 |
5. Analysis, Interpretability, and Ablation Studies
Analyses in (Kho et al., 2023) reveal:
- Cluster quality: t-SNE of [MASK] representations shows that MAV leads to tighter class clusters (higher Silhouette score) than token-based verbalizers.
- Model interpretability: SHAP attributions over the learned MAV parameters reveal emphasis on semantically relevant vocabulary tokens without explicit label-word exposure.
- Ablations: The default hidden dimension of the vocab extractor performed best; freezing only the MLM head preserved nearly full performance, and MAVs remained robust across strong/weak augmentation variants (FixMatch vs. FlexMatch).
Studies in (Wei et al., 2022) show that the use of all three contrastive loss terms is critical for optimal performance; prototypes retain substantial task-specific structure even when PLM weights are frozen.
6. Position Relative to Alternative Automatic Verbalization Approaches
Compared with manual and soft verbalizers, MAVs:
- Do not require human curation or search for optimal label words.
- Outperform discrete mapping approaches, especially as label-cardinality rises.
- Exploit the full span of MLM knowledge rather than isolated token probabilities.
Manifold-based verbalizer alternatives such as LLE-INC (Wang et al., 2023) avoid explicit mapping by re-embedding the token space using intra-class neighborhood constraints for k-NN classification, but they remain fundamentally non-parametric and do not optimize a learned verbalization head as MAVs do.
7. Significance and Implications for Prompt-based Classification
Mapping-Free Automatic Verbalizers represent a systematic advancement in prompt-based classification methodology, especially for multi-class and resource-constrained scenarios. By learning mappings from full-model output distributions or latent representations to class logits or prototypes, MAVs address scalability, robustness, and information efficiency. Empirical evidence substantiates their advantage in leveraging unlabeled data, optimizing benefit ratios, and producing tighter, more meaningful class clusters than competing approaches (Kho et al., 2023, Wei et al., 2022). Furthermore, MAVs provide a framework for interpretable, automated verbalization that capitalizes directly on pre-trained model semantics and circumvents the limitations inherent in discrete, hand-crafted mapping paradigms.