
Text-to-Vec: Token-Level Contextual Embedding

Updated 14 November 2025
  • Text-to-Vec is a method that converts text into an interpretable vector by computing token-level local n-gram perplexities.
  • It aggregates n-gram probabilities from autoregressive transformers to highlight model uncertainty and detect errors.
  • Empirical evaluations show significant improvements in local error detection compared to scalar perplexity, with robust metrics across tasks.

A Text-to-Vec module produces a vector representation of a text input (sentence, document, or token sequence), encoding properties such as local or global contextual probability, semantics, or structure in an embedding suitable for downstream tasks. In "Text vectorization via transformer-based LLMs and n-gram perplexities," Škorić (2023) proposes a method for contextual text vectorization that departs from traditional scalar perplexity by generating an "N-dimensional perplexity vector" tied to local n-gram surprisal as scored by an autoregressive transformer LLM.

1. Algorithmic Framework: Local N-gram Perplexity Vector

The core algorithm computes a per-token vector of local perplexities by aggregating n-gram window probabilities as estimated by a transformer.

  • Tokenization: The input text of $N$ tokens is segmented as $w_1, w_2, \dots, w_N$. In the principal experiments, the tokenizer yields word-level tokens; subword tokenizers are permissible.
  • N-gram Extraction: From $w_1, \dots, w_N$, extract all contiguous n-grams $t_i = (w_i, w_{i+1}, \dots, w_{i+n-1})$ for $i$ from $1$ to $N-n+1$.
  • Probabilistic Scoring: Each $t_i$ is evaluated via next-token prediction of a pre-trained transformer LM, computing the joint probability

$$p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$$

  • N-gram Perplexity: The n-gram perplexity is defined as

$$PP(t_i) = p(t_i)^{-1/n} = \exp\!\left(-\frac{1}{n} \sum_{j=0}^{n-1} \log P(w_{i+j} \mid w_1, \dots, w_{i+j-1}) \right)$$

  • Token-level Local Perplexity: For each position $k$, aggregate the perplexities of all n-grams that include $w_k$:

$$I(k) = \{\, i \mid i \le k \le i+n-1,\; 1 \le i \le N-n+1 \,\}$$

$$LP_k = \frac{1}{|I(k)|} \sum_{i \in I(k)} PP(t_i)$$

  • Vector Assembly: The final embedding is $v = [LP_1, LP_2, \dots, LP_N]^T$. Optionally, $v$ may be standardized for "relative perplexity," but this is not part of the baseline.

This process returns an interpretable vector that highlights localized model uncertainty, sensitive to rare words or improbable sequences, as opposed to a single scalar summary.
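
To make the pipeline concrete, the following is a minimal sketch assuming a Hugging Face GPT-2 checkpoint as the scoring LM and its subword tokenizer; the function name text_to_vec and all variable names are illustrative rather than taken from the paper. Because every factor in $p(t_i)$ conditions on the full left context, a single forward pass over the sentence supplies all required conditional log-probabilities.

```python
# Minimal sketch of the local n-gram perplexity vector, assuming a Hugging Face
# GPT-2 model as the scoring LM; all names are illustrative, not from the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def text_to_vec(text: str, n: int = 3, model_name: str = "gpt2") -> list[float]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, N)
    N = ids.shape[1]
    assert N >= n, "input must contain at least n tokens"

    # One causal forward pass yields log P(w_j | w_1..w_{j-1}) for every j >= 2.
    with torch.no_grad():
        logits = model(ids).logits  # shape (1, N, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = torch.zeros(N)
    # The first token has no prefix; without a BOS token its probability is
    # undefined, so it is treated as log P = 0 here (a simplification).
    token_logp[1:] = log_probs[0, :-1, :].gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

    # N-gram perplexity: PP(t_i) = exp(-(1/n) * sum of the window's token log-probs).
    pp = [math.exp(-token_logp[i:i + n].mean().item()) for i in range(N - n + 1)]

    # Local perplexity LP_k: average PP over all windows covering token k
    # (0-based indices here; boundary tokens are covered by fewer windows).
    vec = []
    for k in range(N):
        covering = pp[max(0, k - n + 1): min(k, N - n) + 1]
        vec.append(sum(covering) / len(covering))
    return vec
```

With a different LM and tokenizer the absolute magnitudes will differ from the paper's word-level figures; the relative peaks in the vector carry the diagnostic signal.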

2. Mathematical Formulation

All key statistics are computed strictly as described in the main text:

| Quantity | Expression | Comment |
|---|---|---|
| Joint probability | $p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$ | Over n-gram $t_i$ |
| N-gram perplexity | $PP(t_i) = p(t_i)^{-1/n}$ | Local surprisal measure |
| Token-wise index set | $I(k) = \{\, i \mid \max(1, k-n+1) \leq i \leq \min(k, N-n+1) \,\}$ | All n-grams covering $w_k$ |
| Local perplexity | $LP_k = \frac{1}{\lvert I(k)\rvert} \sum_{i \in I(k)} PP(t_i)$ | Centered at token $w_k$ |
| Final vector | $v \in \mathbb{R}^N$, $v_k = LP_k$ | $N$-dimensional output |

This explicit per-token aggregation preserves distributional detail that is discarded in classical scalar perplexity.
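
As a concrete check of the index-set definition (using the worked-example parameters $N = 10$, $n = 4$ from Section 4), the second token is covered by exactly two windows:

```latex
% Index set and local perplexity for token k = 2 with N = 10, n = 4;
% this matches the second row of the worked-example table in Section 4.
I(2) = \{\, i \mid \max(1,\, 2-4+1) \le i \le \min(2,\, 10-4+1) \,\} = \{1, 2\},
\qquad
LP_2 = \tfrac{1}{2}\bigl( PP(t_1) + PP(t_2) \bigr).
```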

3. Architectural and Hyperparameter Choices

  • Transformer Model: Any auto-regressive transformer LM. Examples are GPT-2 and (for Serbian evaluation) a GPT-2 variant trained on the Serbian corpus. Only the final softmax probabilities are required.
  • Sliding Window Size $n$: For the worked example, $n=4$; for the empirical tasks, $n=3$, ensuring a reasonable number of windows for sentences with $N \geq 2n$.
  • Stride: Always 1 (fully overlapping windows).
  • Normalization: Optional (subtract mean, divide by standard deviation); not included as default.
  • Layer Output: The method discards hidden activations, utilizing only the probability estimates.

The overlap and full coverage of the n-gram windows ensure that edge tokens are not neglected, although tokens near the boundaries participate in fewer $n$-grams than interior tokens, as the coverage check below illustrates.
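
A short illustrative helper (not from the paper) makes the coverage pattern explicit, here for $N = 10$ and $n = 3$:

```python
# Illustrative check of window coverage |I(k)| for N tokens and window size n:
# interior tokens are covered by n windows, boundary tokens by fewer.
def coverage_counts(N: int, n: int) -> list[int]:
    # 0-based analogue of I(k) = {i : max(1, k-n+1) <= i <= min(k, N-n+1)}
    return [min(k, N - n) - max(0, k - n + 1) + 1 for k in range(N)]

print(coverage_counts(10, 3))  # [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]
```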

4. Worked Example

Consider the input "When in Rome, do as the Romans do." with $N=10$ word-level tokens and $n=4$:

| Token ($w_k$) | Windows covering $w_k$ | $LP_k$ computation |
|---|---|---|
| 1 (When) | $t_1$ | $LP_1 = PP(t_1) = 76.83$ |
| 2 (in) | $t_1, t_2$ | $LP_2 = (76.83 + 569.06)/2 = 322.95$ |
| 3 (Rome) | $t_1, t_2, t_3$ | $LP_3 = 252.94$ |
| ... | ... | ... |
| 10 (.) | $t_7$ | $LP_{10} = PP(t_7) = 94.20$ |

This yields $v = [76.83,\ 322.95,\ 252.94,\ 219.67,\ 190.22,\ 193.69,\ 99.85,\ 95.48,\ 83.31,\ 94.20]$. A high $LP_k$ (e.g., at token 2) indicates a local probability dip, often due to a modeling anomaly or typo.
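
The optional standardization from Section 1 can be applied directly to this vector; below is a minimal standard-library sketch (the variable names are illustrative, and the values are those of the worked example):

```python
# Optional standardization ("relative perplexity"): subtract the mean and
# divide by the standard deviation of the worked-example vector.
import statistics

v = [76.83, 322.95, 252.94, 219.67, 190.22, 193.69, 99.85, 95.48, 83.31, 94.20]
mu, sigma = statistics.mean(v), statistics.pstdev(v)
v_rel = [(x - mu) / sigma for x in v]
print([round(x, 2) for x in v_rel])
```

Because z-scoring is order-preserving, token 2 remains the most prominent outlier after standardization.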

5. Empirical Evaluation and Use Cases

The method's diagnostic value is demonstrated through three error-detection tasks (removal, insertion, or replacement of a word) on expert-translated Serbian sentences ($\sim$8,188 in total), each altered at one position. The evaluation protocol is as follows:

  • Task: Predict the error index by selecting the token with maximum local perplexity ($\arg\max_k LP_k$).
  • Metrics: Accuracy (correct position identified) and weighted accuracy (each sentence scaled by $1/N$), compared against a random baseline; see the sketch after this list.
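
A hedged sketch of this protocol, reusing the text_to_vec function from the Section 1 sketch; dataset is a hypothetical list of (sentence, true error index) pairs rather than the paper's corpus loader, and the per-sentence $1/N$ weighting is one plausible reading of the weighted-accuracy metric:

```python
# Hypothetical evaluation loop for argmax-based error localization.
# `dataset` is an assumed placeholder: a list of (sentence, true_error_index) pairs.
def evaluate(dataset, n: int = 3):
    acc_hits, weighted_sum = 0, 0.0
    for sentence, true_idx in dataset:
        lp = text_to_vec(sentence, n=n)                  # per-token LP vector
        pred = max(range(len(lp)), key=lambda k: lp[k])  # argmax_k LP_k
        hit = int(pred == true_idx)
        acc_hits += hit
        weighted_sum += hit / len(lp)                    # scale by 1/N
    return acc_hits / len(dataset), weighted_sum / len(dataset)
```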

Results (Table 4 of the paper):

| Task | Random | Text-to-Vec |
|---|---|---|
| Removal | 5.80 % | 10.37 % |
| Insertion | 3.12 % | 17.26 % |
| Replacement | 2.02 % | 18.56 % |

Weighted accuracy improves by a factor of 3–8 over chance. A high Pearson correlation ($>0.99$) between accuracy and weighted accuracy confirms robustness across sentence lengths. Notably, scalar perplexity does not provide this localization power, and no direct comparison to BERT embeddings or global perplexity was reported.

6. Implementation, Complexity, and Limitations

  • Computational Cost: For a sentence of $N$ tokens and window size $n$, $N-n+1$ LM forward passes are required, each over $n$ tokens. For small $n$ and typical text lengths, this cost is dominated by the LM inference; parallel processing of windows is practical.
  • Limitations: The Text-to-Vec method is inherently tied to the LM's probabilistic calibration. Mis-calibrated transformer LMs (or domain mismatch) will affect the interpretability of $LP_k$. The vector's dimensionality scales with input length, which may pose issues for downstream models expecting fixed-size vectors.
  • Deployment Considerations: Efficient inference calls for batched n-gram scoring and, for applications demanding scale invariance, optional vector normalization. Adopters should select $n$ to balance localization against window sparsity.

7. Applications and Potential Extensions

The primary utility is local error detection: identifying low-probability tokens embedded in otherwise high-probability contexts. Beyond this, plausible extensions include:

  • Fine-grained quality assessment in translation, ASR, or OCR—flagging outlier tokens.
  • Surprisal pattern-based similarity search—retrieving sentences with similar distributions of local modeling "surprise."
  • Integration of the $v$ vector into error-detection classifiers or explanation systems, leveraging per-token probabilities as features.
  • Open areas: comparison with dense sentence embeddings, calibration on out-of-distribution text, and use in languages without robust LM support.

As an algorithmic primitive, the Text-to-Vec module offers practitioners an explicit, interpretable embedding of token-level LLM uncertainty—distinct from both scalar perplexity and black-box dense embedding methods—enabling novel downstream analytics and diagnostics in transformer-based NLP systems (Škorić, 2023).

References

Škorić (2023). "Text vectorization via transformer-based LLMs and n-gram perplexities."
