Text-to-Vec: Token-Level Contextual Embedding
- Text-to-Vec is a method that converts text into an interpretable vector by computing token-level local n-gram perplexities.
- It aggregates n-gram probabilities from autoregressive transformers to highlight model uncertainty and detect errors.
- Empirical evaluations show substantial gains in local error detection over a random baseline, with consistent results across tasks and sentence lengths.
A Text-to-Vec module produces a vectorial representation of a text input (sentence, document, or token sequence), encoding properties such as local or global contextual probability, semantics, or structure in an embedding suitable for downstream tasks. In "Text vectorization via transformer-based LLMs and n-gram perplexities" (Škorić, 2023), the author proposes a method for contextual text vectorization that departs from traditional scalar perplexity by generating an “N-dimensional perplexity vector” tied to local n-gram surprisal as scored by an autoregressive transformer LLM.
1. Algorithmic Framework: Local N-gram Perplexity Vector
The core algorithm computes a per-token vector of local perplexities by aggregating n-gram window probabilities as estimated by a transformer.
- Tokenization: The input text of $N$ tokens is segmented as $T = (t_1, t_2, \ldots, t_N)$. In the principal experiments, the tokenizer yields word-level tokens; subword tokenizers are permissible.
- N-gram Extraction: From $T$, extract all contiguous n-grams $g_i = (t_i, t_{i+1}, \ldots, t_{i+n-1})$ for $i$ from $1$ to $N - n + 1$.
- Probabilistic Scoring: Each $g_i$ is evaluated via next-token prediction of a pre-trained transformer LM, computing the joint probability $P(g_i) = \prod_{j=0}^{n-1} P(t_{i+j} \mid t_i, \ldots, t_{i+j-1})$.
- N-gram Perplexity: The n-gram perplexity is defined as $\mathrm{PP}(g_i) = P(g_i)^{-1/n}$.
- Token-level Local Perplexity: For each position $k$, aggregate the perplexities of all n-grams including $t_k$: $v_k = \frac{1}{|W_k|} \sum_{i \in W_k} \mathrm{PP}(g_i)$, where $W_k = \{\, i : i \le k \le i + n - 1 \,\}$.
- Vector Assembly: The final embedding is $\mathbf{v} = (v_1, v_2, \ldots, v_N)$. Optionally, $\mathbf{v}$ may be standardized for "relative perplexity", but this is not part of the baseline.
This process returns an interpretable vector $\mathbf{v}$ that highlights localized model uncertainty, sensitive to rare words or improbable sequences, as opposed to a single scalar summary; a minimal code sketch follows.
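The sketch below is a minimal illustration of this pipeline, not the paper's reference implementation. It assumes GPT-2 via Hugging Face `transformers`, whitespace splitting as a stand-in for word-level tokenization, and a prepended BOS token so that the first token of each window also receives a conditional probability; the helper names `window_perplexity` and `text_to_vec` are ours.

```python
# Minimal sketch of the local n-gram perplexity vector (hedged: GPT-2 via
# Hugging Face `transformers`; whitespace split stands in for word-level tokens).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def window_perplexity(words: list[str]) -> float:
    """PP(g) = P(g)^(-1/n) for one n-gram window of word-level tokens."""
    # Prepend BOS so the first window token also gets a conditional probability.
    ids = tokenizer.encode(tokenizer.bos_token + " " + " ".join(words),
                           return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits                     # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    n = len(words)                                     # normalize by n-gram length
    return float(torch.exp(-token_lp.sum() / n))

def text_to_vec(text: str, n: int = 3) -> list[float]:
    """v_k = mean of PP over all n-gram windows covering token k."""
    tokens = text.split()                              # word-level tokenization
    N = len(tokens)
    pp = [window_perplexity(tokens[i:i + n]) for i in range(N - n + 1)]
    vec = []
    for k in range(N):                                 # windows i with i <= k <= i+n-1
        covering = pp[max(0, k - n + 1): min(k, N - n) + 1]
        vec.append(sum(covering) / len(covering))
    return vec
```

Note that each window is re-encoded with the model's subword tokenizer, so the joint probability runs over subwords while the exponent $-1/n$ uses the word-level n-gram length; other normalization choices are equally defensible.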
2. Mathematical Formulation
All key statistics are computed exactly as defined in the framework above:
| Quantity | Expression | Comment |
|---|---|---|
| Joint probability | $P(g_i) = \prod_{j=0}^{n-1} P(t_{i+j} \mid t_i, \ldots, t_{i+j-1})$ | Over the n-gram $g_i$ |
| N-gram perplexity | $\mathrm{PP}(g_i) = P(g_i)^{-1/n}$ | Local surprisal measure |
| Token-wise index set | $W_k = \{\, i : i \le k \le i + n - 1 \,\}$ | All n-grams covering $t_k$ |
| Local perplexity | $v_k = \frac{1}{\lvert W_k \rvert} \sum_{i \in W_k} \mathrm{PP}(g_i)$ | Centered at token $t_k$ |
| Final vector | $\mathbf{v} = (v_1, \ldots, v_N)$ | N-dimensional output |
This explicit per-token aggregation preserves distributional detail that is discarded in classical scalar perplexity.
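As a quick numerical check with illustrative values (not taken from the paper): a trigram with joint probability $P(g_i) = 10^{-3}$ has perplexity $\mathrm{PP}(g_i) = (10^{-3})^{-1/3} = 10$, and a token covered by three windows with perplexities $10$, $10$, and $40$ receives $v_k = (10 + 10 + 40)/3 = 20$, while its well-predicted neighbours stay near $10$.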
3. Architectural and Hyperparameter Choices
- Transformer Model: Any autoregressive transformer LM. Examples are GPT-2 and, for the Serbian evaluation, a GPT-2 variant trained on a Serbian corpus. Only the final softmax probabilities are required.
- Sliding Window Size $n$: A small window is used for the worked example; for empirical tasks, $n$ is kept small relative to the sentence length $N$ to ensure a reasonable number of windows per sentence.
- Stride: Always 1 (fully overlapping windows).
- Normalization: Optional (subtract mean, divide by standard deviation); not included as default.
- Layer Output: The method discards hidden activations, utilizing only the probability estimates.
The overlapping coverage of the n-gram windows ensures that every token position receives a value, although tokens near the boundaries participate in fewer n-grams (down to a single window at the sentence edges); the short sketch below makes these counts explicit.
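A minimal sketch (the helper name `coverage_counts` is ours) computes $|W_k|$ for every position under the indexing defined above:

```python
def coverage_counts(N: int, n: int) -> list[int]:
    """Number of n-gram windows |W_k| covering token position k (1-indexed)."""
    # Window g_i covers tokens i .. i+n-1, for i = 1 .. N-n+1.
    return [min(k, n, N - k + 1, N - n + 1) for k in range(1, N + 1)]

print(coverage_counts(N=10, n=3))   # [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]
```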
4. Worked Example
Consider the input “When in Rome, do as the Romans do.” ($N = 10$ word-level tokens, with punctuation split off; we take $n = 3$ purely for illustration):
| Token ($t_k$) | Windows covering $t_k$ | $v_k$ computation |
|---|---|---|
| 1 (When) | $g_1$ | $v_1 = \mathrm{PP}(g_1)$ |
| 2 (in) | $g_1, g_2$ | $v_2 = \frac{1}{2}\left(\mathrm{PP}(g_1) + \mathrm{PP}(g_2)\right)$ |
| 3 (Rome) | $g_1, g_2, g_3$ | $v_3 = \frac{1}{3}\left(\mathrm{PP}(g_1) + \mathrm{PP}(g_2) + \mathrm{PP}(g_3)\right)$ |
| ... | ... | ... |
| 10 (.) | $g_8$ | $v_{10} = \mathrm{PP}(g_8)$ |
This yields $\mathbf{v} \in \mathbb{R}^{10}$. A high $v_k$ (e.g., at token 2) indicates a local probability dip, often due to a modeling anomaly or typo.
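Reusing the hypothetical `text_to_vec` helper from the sketch in Section 1 (punctuation pre-separated so that whitespace splitting yields the ten tokens above):

```python
vec = text_to_vec("When in Rome , do as the Romans do .", n=3)
print(len(vec))                                      # 10: one local perplexity per token
suspect = max(range(len(vec)), key=lambda k: vec[k])
print(suspect + 1, vec[suspect])                     # 1-indexed position of the peak
```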
5. Empirical Evaluation and Use Cases
The method's diagnostic value is demonstrated through three error-detection tasks (removal, insertion, or replacement of a word) on expert-translated Serbian sentences, each altered at one position. The evaluation protocol is as follows:
- Task: Predict the error index by selecting the token with maximum local perplexity, $\hat{k} = \arg\max_k v_k$ (see the sketch after this list).
- Metrics: Accuracy (correct position identified exactly), weighted accuracy (scaled by $1/N$), and comparison to a random baseline.
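A minimal sketch of this protocol, assuming the earlier `text_to_vec` helper and lists of perturbed sentences with gold (0-indexed) error positions; the names and data handling here are ours, not the paper's:

```python
def predict_error_index(vec: list[float]) -> int:
    """Predicted error position: token with maximum local perplexity (argmax_k v_k)."""
    return max(range(len(vec)), key=lambda k: vec[k])

def accuracy(sentences: list[str], gold_indices: list[int], n: int = 3) -> float:
    """Fraction of sentences whose perturbed position is identified exactly."""
    hits = sum(predict_error_index(text_to_vec(s, n=n)) == g
               for s, g in zip(sentences, gold_indices))
    return hits / len(sentences)
```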
Results (Table 4, paper):
| Task | Random | Text-to-Vec |
|---|---|---|
| Removal | 5.80 % | 10.37 % |
| Insertion | 3.12 % | 17.26 % |
| Replacement | 2.02 % | 18.56 % |
Weighted accuracy improves by a factor of 3–8 over chance. The high Pearson correlation between accuracy and weighted accuracy confirms robustness across sentence lengths. Notably, scalar perplexity does not provide this localization power, and no direct comparison to BERT embeddings or global perplexity was reported.
6. Implementation, Complexity, and Limitations
- Computational Cost: For a sentence of $N$ tokens and window size $n$, $N - n + 1$ LM forward passes are required, each over $n$ tokens. For small $n$ and typical text lengths, this cost is dominated by the LM inference; parallel processing of windows is practical.
- Limitations: The Text-to-Vec method is inherently tied to the LM’s probabilistic calibration. Mis-calibrated transformer LMs (or domain mismatch) will affect interpretability. The vector’s dimensionality scales with input length, which may pose issues for downstream models expecting fixed-size vectors.
- Deployment Considerations: Efficient inference calls for batched n-gram scoring (one possible realization is sketched below) and, where scale invariance matters, optional vector normalization. Adopters should select $n$ to balance localization against window sparsity.
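A hedged sketch of batched window scoring under the same GPT-2 assumptions as in Section 1; right padding with the EOS token is safe here because only non-padded positions contribute to each window's score:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def batched_window_perplexities(windows: list[list[str]]) -> list[float]:
    """Score all n-gram windows in one forward pass; returns PP(g_i) per window."""
    texts = [tokenizer.bos_token + " " + " ".join(w) for w in windows]
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**enc).logits                      # (B, L, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = enc["input_ids"][:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(2)).squeeze(2)
    mask = enc["attention_mask"][:, 1:].float()           # zero out padded positions
    joint_lp = (token_lp * mask).sum(dim=1)               # log P(g_i) per window
    n = torch.tensor([float(len(w)) for w in windows])
    return torch.exp(-joint_lp / n).tolist()              # PP(g_i) = P(g_i)^(-1/n)
```

Per-token aggregation then proceeds exactly as in the unbatched sketch.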
7. Applications and Potential Extensions
The primary utility is local error detection: identifying improbable tokens within otherwise high-probability contexts. Beyond this, plausible extensions include:
- Fine-grained quality assessment in translation, ASR, or OCR—flagging outlier tokens.
- Surprisal pattern-based similarity search—retrieving sentences with similar distributions of local modeling "surprise."
- Integration of the vector into error-detection classifiers or explanation systems, leveraging per-token probabilities as features.
- Open areas: comparison with dense sentence embeddings, calibration on out-of-distribution text, and use in languages without robust LM support.
As an algorithmic primitive, the Text-to-Vec module offers practitioners an explicit, interpretable embedding of token-level LLM uncertainty—distinct from both scalar perplexity and black-box dense embedding methods—enabling novel downstream analytics and diagnostics in transformer-based NLP systems (Škorić, 2023).