
Text-to-Vec: Token-Level Contextual Embedding

Updated 14 November 2025
  • Text-to-Vec is a method that converts text into an interpretable vector by computing token-level local n-gram perplexities.
  • It aggregates n-gram probabilities from autoregressive transformers to highlight model uncertainty and detect errors.
  • Empirical evaluations show significant improvements in local error detection compared to scalar perplexity, with robust metrics across tasks.

A Text-to-Vec module produces a vector representation of a text input (sentence, document, or token sequence), encoding properties such as local or global contextual probability, semantics, or structure in an embedding suitable for downstream tasks. In "Text vectorization via transformer-based LLMs and n-gram perplexities," Škorić (2023) proposes a method for contextual text vectorization that departs from traditional scalar perplexity by generating an "N-dimensional perplexity vector" tied to local n-gram surprisal as scored by an autoregressive transformer LLM.

1. Algorithmic Framework: Local N-gram Perplexity Vector

The core algorithm computes a per-token vector of local perplexities by aggregating n-gram window probabilities as estimated by a transformer.

  • Tokenization: The input text of $N$ tokens is segmented as $w_1, w_2, \dots, w_N$. In the principal experiments, the tokenizer yields word-level tokens; subword tokenizers are permissible.
  • N-gram Extraction: From $w_1, \dots, w_N$, extract all contiguous n-grams $t_i = (w_i, w_{i+1}, \dots, w_{i+n-1})$ for $i$ from $1$ to $N-n+1$.
  • Probabilistic Scoring: Each $t_i$ is evaluated via next-token prediction of a pre-trained transformer LM, computing the joint probability

$$p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$$

  • N-gram Perplexity: The n-gram perplexity is defined as

$$PP(t_i) = p(t_i)^{-1/n} = \exp\!\left(-\frac{1}{n} \sum_{j=0}^{n-1} \log P(w_{i+j} \mid w_1, \dots, w_{i+j-1}) \right)$$

  • Token-level Local Perplexity: For each position $k$, aggregate the perplexities of all n-grams that include $w_k$:

$$I(k) = \{\, i \mid i \le k \le i+n-1,\; 1 \le i \le N-n+1 \,\}$$

$$LP_k = \frac{1}{|I(k)|} \sum_{i \in I(k)} PP(t_i)$$

  • Vector Assembly: The final embedding is $v = [LP_1, LP_2, \dots, LP_N]^T$. Optionally, $v$ may be standardized for "relative perplexity," but this is not part of the baseline.

This process returns an interpretable vector that highlights localized model uncertainty, sensitive to rare words or improbable sequences, as opposed to a single scalar summary.
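
To make the pipeline concrete, the following is a minimal sketch assuming a Hugging Face GPT-2 checkpoint as the scoring LM and its subword tokenizer; the function name text_to_vec and all variable names are illustrative rather than taken from the paper. Because every factor in $p(t_i)$ conditions on the full left context, a single forward pass over the sentence supplies all required conditional log-probabilities.

```python
# Minimal sketch of the local n-gram perplexity vector, assuming a Hugging Face
# GPT-2 model as the scoring LM; all names are illustrative, not from the paper.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def text_to_vec(text: str, n: int = 3, model_name: str = "gpt2") -> list[float]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids  # shape (1, N)
    N = ids.shape[1]
    assert N >= n, "input must contain at least n tokens"

    # One causal forward pass yields log P(w_j | w_1..w_{j-1}) for every j >= 2.
    with torch.no_grad():
        logits = model(ids).logits  # shape (1, N, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_logp = torch.zeros(N)
    # The first token has no prefix; without a BOS token its probability is
    # undefined, so it is treated as log P = 0 here (a simplification).
    token_logp[1:] = log_probs[0, :-1, :].gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

    # N-gram perplexity: PP(t_i) = exp(-(1/n) * sum of the window's token log-probs).
    pp = [math.exp(-token_logp[i:i + n].mean().item()) for i in range(N - n + 1)]

    # Local perplexity LP_k: average PP over all windows covering token k
    # (0-based indices here; boundary tokens are covered by fewer windows).
    vec = []
    for k in range(N):
        covering = pp[max(0, k - n + 1): min(k, N - n) + 1]
        vec.append(sum(covering) / len(covering))
    return vec
```

With a different LM and tokenizer the absolute magnitudes will differ from the paper's word-level figures; the relative peaks in the vector carry the diagnostic signal.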

2. Mathematical Formulation

All key statistics are computed strictly as described in the main text:

| Quantity | Expression | Comment |
|---|---|---|
| Joint probability | $p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$ | Over n-gram $t_i$ |
| N-gram perplexity | $PP(t_i) = p(t_i)^{-1/n}$ | Local surprisal measure |
| Token-wise index set | $I(k) = \{\, i \mid \max(1, k-n+1) \leq i \leq \min(k, N-n+1) \,\}$ | All n-grams covering $w_k$ |
| Local perplexity | $LP_k = \frac{1}{\lvert I(k)\rvert} \sum_{i \in I(k)} PP(t_i)$ | Centered at token $w_k$ |
| Final vector | $v \in \mathbb{R}^N$, $v_k = LP_k$ | $N$-dimensional output |

This explicit per-token aggregation preserves distributional detail that is discarded in classical scalar perplexity.
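
As a concrete check of the index-set definition (using the worked-example parameters $N = 10$, $n = 4$ from Section 4), the second token is covered by exactly two windows:

```latex
% Index set and local perplexity for token k = 2 with N = 10, n = 4;
% this matches the second row of the worked-example table in Section 4.
I(2) = \{\, i \mid \max(1,\, 2-4+1) \le i \le \min(2,\, 10-4+1) \,\} = \{1, 2\},
\qquad
LP_2 = \tfrac{1}{2}\bigl( PP(t_1) + PP(t_2) \bigr).
```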

3. Architectural and Hyperparameter Choices

  • Transformer Model: Any auto-regressive transformer LM. Examples are GPT-2 and (for Serbian evaluation) a GPT-2 variant trained on the Serbian corpus. Only the final softmax probabilities are required.
  • Sliding Window Size $n$: For the worked example, $n=4$; for the empirical tasks, $n=3$, ensuring a reasonable number of windows for sentences with $N \geq 2n$.
  • Stride: Always 1 (fully overlapping windows).
  • Normalization: Optional (subtract mean, divide by standard deviation); not included as default.
  • Layer Output: The method discards hidden activations, utilizing only the probability estimates.

The overlap and full coverage of the n-gram windows ensure that edge tokens are not neglected, although tokens near the boundaries participate in fewer $n$-grams than interior tokens, as the coverage check below illustrates.
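
A short illustrative helper (not from the paper) makes the coverage pattern explicit, here for $N = 10$ and $n = 3$:

```python
# Illustrative check of window coverage |I(k)| for N tokens and window size n:
# interior tokens are covered by n windows, boundary tokens by fewer.
def coverage_counts(N: int, n: int) -> list[int]:
    # 0-based analogue of I(k) = {i : max(1, k-n+1) <= i <= min(k, N-n+1)}
    return [min(k, N - n) - max(0, k - n + 1) + 1 for k in range(N)]

print(coverage_counts(10, 3))  # [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]
```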

4. Worked Example

Consider the input "When in Rome, do as the Romans do." with $N=10$ word-level tokens and $n=4$:

| Token ($w_k$) | Windows covering $w_k$ | $LP_k$ computation |
|---|---|---|
| 1 (When) | $t_1$ | $LP_1 = PP(t_1) = 76.83$ |
| 2 (in) | $t_1, t_2$ | $LP_2 = (76.83 + 569.06)/2 = 322.95$ |
| 3 (Rome) | $t_1, t_2, t_3$ | $LP_3 = 252.94$ |
| ... | ... | ... |
| 10 (.) | $t_7$ | $LP_{10} = PP(t_7) = 94.20$ |

This yields $v = [76.83,\ 322.95,\ 252.94,\ 219.67,\ 190.22,\ 193.69,\ 99.85,\ 95.48,\ 83.31,\ 94.20]$. A high $LP_k$ (e.g., at token 2) indicates a local probability dip, often due to a modeling anomaly or typo.
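
The optional standardization from Section 1 can be applied directly to this vector; below is a minimal standard-library sketch (the variable names are illustrative, and the values are those of the worked example):

```python
# Optional standardization ("relative perplexity"): subtract the mean and
# divide by the standard deviation of the worked-example vector.
import statistics

v = [76.83, 322.95, 252.94, 219.67, 190.22, 193.69, 99.85, 95.48, 83.31, 94.20]
mu, sigma = statistics.mean(v), statistics.pstdev(v)
v_rel = [(x - mu) / sigma for x in v]
print([round(x, 2) for x in v_rel])
```

Because z-scoring is order-preserving, token 2 remains the most prominent outlier after standardization.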

5. Empirical Evaluation and Use Cases

The method's diagnostic value is demonstrated through three error-detection tasks (removal, insertion, or replacement of a word) on expert-translated Serbian sentences ($\sim$8,188 in total), each altered at one position. The evaluation protocol is as follows:

  • Task: Predict the error index by selecting the token with maximum local perplexity ($\arg\max_k LP_k$).
  • Metrics: Accuracy (correct position identified) and weighted accuracy (each sentence scaled by $1/N$), compared against a random baseline; see the sketch after this list.
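
A hedged sketch of this protocol, reusing the text_to_vec function from the Section 1 sketch; dataset is a hypothetical list of (sentence, true error index) pairs rather than the paper's corpus loader, and the per-sentence $1/N$ weighting is one plausible reading of the weighted-accuracy metric:

```python
# Hypothetical evaluation loop for argmax-based error localization.
# `dataset` is an assumed placeholder: a list of (sentence, true_error_index) pairs.
def evaluate(dataset, n: int = 3):
    acc_hits, weighted_sum = 0, 0.0
    for sentence, true_idx in dataset:
        lp = text_to_vec(sentence, n=n)                  # per-token LP vector
        pred = max(range(len(lp)), key=lambda k: lp[k])  # argmax_k LP_k
        hit = int(pred == true_idx)
        acc_hits += hit
        weighted_sum += hit / len(lp)                    # scale by 1/N
    return acc_hits / len(dataset), weighted_sum / len(dataset)
```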

Results (Table 4 of the paper):

| Task | Random | Text-to-Vec |
|---|---|---|
| Removal | 5.80 % | 10.37 % |
| Insertion | 3.12 % | 17.26 % |
| Replacement | 2.02 % | 18.56 % |

Weighted accuracy improves by a factor of 3–8 over chance. A high Pearson correlation ($>0.99$) between accuracy and weighted accuracy confirms robustness across sentence lengths. Notably, scalar perplexity does not provide this localization power, and no direct comparison to BERT embeddings or global perplexity was reported.

6. Implementation, Complexity, and Limitations

  • Computational Cost: For a sentence of $N$ tokens and window size $n$, $N-n+1$ LM forward passes are required, each over $n$ tokens. For small $n$ and typical text lengths, this cost is dominated by the LM inference; parallel processing of windows is practical.
  • Limitations: The Text-to-Vec method is inherently tied to the LM's probabilistic calibration. Mis-calibrated transformer LMs (or domain mismatch) will affect the interpretability of $LP_k$. The vector's dimensionality scales with input length, which may pose issues for downstream models expecting fixed-size vectors.
  • Deployment Considerations: Efficient inference calls for batched n-gram scoring and, for applications demanding scale invariance, optional vector normalization. Adopters should select $n$ to balance localization against window sparsity.

7. Applications and Potential Extensions

The primary utility is local error detection: identifying low-probability tokens embedded in otherwise high-probability contexts. Beyond this, plausible extensions include:

  • Fine-grained quality assessment in translation, ASR, or OCR—flagging outlier tokens.
  • Surprisal pattern-based similarity search—retrieving sentences with similar distributions of local modeling "surprise."
  • Integration of the $v$ vector into error-detection classifiers or explanation systems, leveraging per-token probabilities as features.
  • Open areas: comparison with dense sentence embeddings, calibration on out-of-distribution text, and use in languages without robust LM support.

As an algorithmic primitive, the Text-to-Vec module offers practitioners an explicit, interpretable embedding of token-level LLM uncertainty—distinct from both scalar perplexity and black-box dense embedding methods—enabling novel downstream analytics and diagnostics in transformer-based NLP systems (Škorić, 2023).

References

Škorić (2023). "Text vectorization via transformer-based LLMs and n-gram perplexities."
