Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models (2405.05417v2)

Published 8 May 2024 in cs.CL

Abstract: The disconnect between tokenizer creation and model training in LLMs allows for specific inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour. Although such 'glitch tokens', tokens present in the tokenizer vocabulary but that are nearly or entirely absent during model training, have been observed across various models, a reliable method to identify and address them has been missing. We present a comprehensive analysis of LLM tokenizers, specifically targeting this issue of detecting under-trained tokens. Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop novel and effective methods for automatically detecting these problematic tokens. Our findings demonstrate the prevalence of such tokens across a diverse set of models and provide insights into improving the efficiency and safety of LLMs.

Citations (14)

Summary

  • The paper introduces a methodology combining tokenizer analysis, model-based indicators, and prompting to detect under-trained tokens.
  • Findings reveal that under-trained tokens are prevalent across LLMs, affecting both efficiency and safety.
  • The approach employs unembedding normalization and cosine distance metrics, offering model-specific insights and guiding tokenizer improvements.

The paper "FISHING FOR MAGIKARP: AUTOMATICALLY DETECTING UNDER-TRAINED TOKENS IN LLMs" explores a critical gap in the development and training of tokenizers used in LLMs. It specifically addresses the persistent issue concerning 'glitch tokens'—tokens present in the tokenizer's vocabulary that are rarely or never encountered during model training. These tokens can provoke unforeseen behaviors in LLMs, such as hallucinations or jumbled outputs, thus representing a potential threat to the robustness and reliability of LLMs.

The authors propose a methodical approach to identifying these problematic tokens through three steps:

  1. Tokenizer Analysis: The authors first catalogue token categories that should be excluded from detection, including partial UTF-8 sequences, unreachable tokens, and special tokens. Filtering these out is essential for an accurate under-trained-token detection pipeline.
  2. Model-based Indicators: Examining the model's 'unembedding' matrix, the authors derive indicators that pinpoint under-trained tokens. The matrix is first normalized by removing its constant component; cosine distances are then computed between each row and the mean embedding of known-unused tokens (a sketch of this computation follows the list). The indicators are evaluated differently depending on whether the model ties its input and output embeddings; for tied-embedding models, complementary metrics such as L2 norms are also inspected.
  3. Verification via Prompting: Candidate tokens are then validated with a prompting technique that tests whether they actually induce erroneous model outputs, filtering the candidates down to genuinely under-trained tokens (sketched after the next paragraph).
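The core of step 2 can be made concrete in a few lines. The following is a minimal sketch, not the paper's released code: the model name is an example, and KNOWN_UNUSED_IDS is a hypothetical placeholder for the reference set of known-unused token ids (e.g. reserved or unreachable tokens found in step 1).

```python
import torch
from transformers import AutoModelForCausalLM

# Example model; any Hugging Face causal LM with accessible weights works.
model = AutoModelForCausalLM.from_pretrained("gpt2")
U = model.get_output_embeddings().weight.detach().float()  # [vocab, d_model]

# Remove the constant component shared by all rows, so directions common
# to every token do not dominate the cosine metric.
U = U - U.mean(dim=0, keepdim=True)

# Reference direction: mean unembedding of tokens known to be unused
# (hypothetical ids; in practice these come from the tokenizer analysis).
KNOWN_UNUSED_IDS = [1849, 1850]
ref = U[KNOWN_UNUSED_IDS].mean(dim=0)

# Rows closest to the unused-token mean are candidate under-trained tokens.
cos = torch.nn.functional.cosine_similarity(U, ref.unsqueeze(0), dim=-1)
candidates = torch.topk(cos, k=50).indices.tolist()

# For tied-embedding models, small L2 norms are a complementary signal,
# since rarely updated rows tend to stay near their initialization.
norms = U.norm(dim=-1)
```

Rows whose normalized unembedding sits close to the unused-token mean have barely moved from their untrained state, which is exactly the signal the indicator exploits.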

This methodology was applied to a variety of popular open-weight models, including Cohere Command R, Google Gemma, Meta Llama 2, Mistral, Alibaba Qwen, and OpenAI GPT-2. For closed-source models, the paper suggests that insights gained from open models can still guide token assessments, even though direct weight examination is infeasible.
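The verification in step 3 needs only generation access, which is also why it is the one check that transfers to closed models through their APIs. Below is a minimal sketch against an open Hugging Face model; the prompt template and the placeholder token ids are illustrative assumptions, not the paper's exact prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def can_repeat(token_id: int) -> bool:
    """Ask the model to echo one token; failure suggests under-training."""
    s = tokenizer.decode([token_id])
    prompt = f'Please repeat the string "{s}" back to me: "'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:])
    return s in completion

# `candidates` would come from the indicator sketch above; the ids here
# are placeholders.
candidates = [30898, 39752]
glitch_tokens = [i for i in candidates if not can_repeat(i)]
```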

Key Findings

  • Prevalence of Under-Trained Tokens: Under-trained, problematic tokens are prevalent across multiple LLMs, and this prevalence stems directly from the disconnect between the tokenizer's training corpus and the model's training data.
  • Effectiveness of Detection Indicators: The proposed indicators were notably predictive of token misbehavior, enabling effective identification of under-trained tokens; combining tokenizer-level analysis with model-weight-based indicators was crucial for isolating candidates reliably.
  • Model-Specific Observations: Common patterns include under-trained single-byte tokens, fragments of merged tokens that form partial Unicode sequences, and special tokens. The paper also reports model-specific anomalies, such as under-trained tokens corresponding to words from other languages, usernames, or punctuation sequences, depending on the tokenizer and model architecture.
  • Impact on Model Efficiency and Safety: Identifying these tokens has major implications for both model efficiency and resilience to malicious inputs: reducing under-trained tokens avoids wasting model capacity on tokens that are never used and guards against exploiting such tokens to bypass model constraints.

Recommendations and Future Directions

To mitigate the occurrence and impacts of under-trained tokens, the authors offer several guidelines:

  • Synchronizing pre-processing steps across tokenizer training, model training, and inference.
  • Aligning model training data closely with tokenizer properties.
  • Careful handling of rare or partial UTF-8 sequences in token construction.
  • Regular checks for unreachable tokens and for disparities between fast and slow tokenizer versions (see the sketch after this list).
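The last two checks lend themselves to automation. A minimal sketch, assuming Hugging Face tokenizers (the model name is an example): a token is flagged as unreachable when decoding it and re-encoding the resulting string never reproduces the token id, and a fast/slow disparity is flagged when the two implementations disagree on that same string.

```python
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("gpt2", use_fast=True)   # example model
slow = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

unreachable, disparities = [], []
for token_id in range(len(fast)):
    s = fast.decode([token_id])
    fast_ids = fast.encode(s, add_special_tokens=False)
    # Unreachable (necessary-condition check): re-encoding the token's own
    # decoded string does not yield the token id.
    if token_id not in fast_ids:
        unreachable.append(token_id)
    # Disparity: fast and slow tokenizers disagree on the same string.
    if fast_ids != slow.encode(s, add_special_tokens=False):
        disparities.append(token_id)
```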

The paper underscores the need for further research on tokenizer training mechanisms, such as preventing tokens defined by a single document in BPE training and systematically applying weight decay to unused tokens. Such enhancements promise to alleviate the problems posed by under-trained tokens and to improve the overall efficiency and safety of LLMs.
