Idiosyncrasies in Large Language Models (2502.12150v2)
Abstract: In this work, we unveil and study idiosyncrasies in LLMs -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/LLM-idiosyncrasies.
Summary
- The paper demonstrates that LLM outputs have unique idiosyncrasies identified via a synthetic N-way classification task achieving up to 97.1% accuracy.
- The study employs a pre-trained text embedding model with a linear classifier to analyze lexical, formatting, and semantic patterns across diverse models.
- The findings highlight risks of idiosyncrasy propagation in synthetic data training and offer a robust framework for inferring model similarity.
LLMs exhibit unique, model-specific patterns in their outputs, termed "idiosyncrasies," which allow for their differentiation. The paper investigates these patterns by framing the problem as a classification task: given a text output generated from a shared prompt, identify the source LLM. This approach reveals robust, distinguishing characteristics across various state-of-the-art models (2502.12150).
Methodology for Identifying Idiosyncrasies
The core methodology revolves around a synthetic N-way classification task. For N distinct LLMs, a classifier is trained to predict the source model index i∈{1,...,N} given a text output x.
- Data Generation: A diverse set of prompts (e.g., from datasets like UltraChat) is used to query each of the N LLMs, producing parallel text outputs {x1,...,xN} for each prompt. For instance, 11,000 prompts yield 11,000 outputs per model, typically split into training (e.g., 10,000) and validation (e.g., 1,000) sets; a minimal data-collection sketch follows this list.
- Classifier Architecture: A pre-trained text embedding model, such as LLM2vec, serves as the base. A linear classification head is added on top of the embeddings, and the combined model is fine-tuned on the generated (text, source-model label) pairs. The input is the LLM-generated text x, and the output is a probability distribution over the N possible source models.
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Example using a generic transformer encoder as the embedding backbone.
class LLMClassifier(nn.Module):
    def __init__(self, base_model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model_name)
        # A single linear head maps the pooled embedding to N source-model logits.
        self.classifier_head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the CLS token embedding (mean pooling is a common alternative).
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.classifier_head(cls_embedding)


def train(model, tokenizer, dataloader, num_epochs=3, lr=1e-5, device="cpu"):
    """Fine-tune the encoder and head on (text, source-model label) pairs."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        for batch in dataloader:  # each batch: {'text': list[str], 'label': LongTensor}
            inputs = tokenizer(batch["text"], return_tensors="pt",
                               padding=True, truncation=True).to(device)
            labels = batch["label"].to(device)
            optimizer.zero_grad()
            logits = model(inputs["input_ids"], inputs["attention_mask"])
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()


# Example setup (the model id below is a placeholder used in this summary;
# substitute the actual embedding backbone, e.g. an LLM2vec checkpoint):
# model = LLMClassifier(base_model_name='locuslab/LLM2vec-xlm-roberta-large', num_classes=N)
# tokenizer = AutoTokenizer.from_pretrained('locuslab/LLM2vec-xlm-roberta-large')
# train(model, tokenizer, dataloader)
```
- Evaluation: Classification accuracy on a held-out set of generated texts is the primary metric. High accuracy suggests strong, learnable idiosyncrasies differentiating the models.
- LLM Groups Studied: The approach was applied to various groups:
- Proprietary Chat APIs (e.g., ChatGPT-4o, Claude-3.5-Sonnet, Grok-2, Gemini-1.5-Pro, DeepSeek-V3)
- Open-weight Instruction-tuned LLMs (e.g., Llama3.1-8b, Gemma2-9b, Qwen2.5-7b, Mistral-v3-7b)
- Base pre-trained LLMs corresponding to the instruct models.
- Models within the same family but varying sizes (e.g., Qwen2.5 series).
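The data-collection step referenced in the Data Generation item can be sketched as follows. This is a minimal illustration rather than the paper's released pipeline: query_model is a hypothetical stand-in for whatever API or local inference call produces a completion from each model, and prompts.jsonl is an assumed file of UltraChat-style prompts.

```python
import json
import random

# Hypothetical helper: returns one completion from the named model for a prompt.
# In practice this wraps the API client or local inference code for each LLM.
def query_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("plug in the API / inference call for each model")

def build_split(models, prompts):
    """Query every model with every shared prompt and label each output
    with the index of the model that produced it."""
    examples = []
    for prompt in prompts:
        for label, model_name in enumerate(models):
            examples.append({"text": query_model(model_name, prompt), "label": label})
    return examples

if __name__ == "__main__":
    models = ["chatgpt", "claude", "grok", "gemini", "deepseek"]  # N = 5
    # Assumed prompt file: one JSON object with a "prompt" field per line.
    prompts = [json.loads(line)["prompt"] for line in open("prompts.jsonl")]
    random.seed(0)
    random.shuffle(prompts)
    # Split by prompt so the same prompt never appears in both sets.
    train_set = build_split(models, prompts[:10000])
    val_set = build_split(models, prompts[10000:11000])
```

Splitting by prompt rather than by individual example keeps the parallel outputs for a given prompt confined to either the training or the validation set.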
Key Findings on Distinguishability
The classification framework revealed significant distinguishability among LLMs.
- High Accuracy: For a five-way classification task involving top chat APIs (ChatGPT, Claude, Grok, Gemini, DeepSeek), the fine-tuned LLM2vec classifier achieved 97.1% accuracy on held-out validation data; pairwise comparisons often exceeded 99% accuracy (a minimal held-out accuracy computation is sketched after this list).
- Instruct and Base Models: Similar high performance was observed for instruction-tuned models (96.3% for a 4-way task) and, to a lesser extent, base pre-trained models (87.3% for a 4-way task), indicating that both pre-training and fine-tuning contribute to unique output signatures.
- Within-Family Differences: Even models from the same family but different sizes showed non-trivial separability (e.g., 59.8% accuracy for 4 sizes of Qwen2.5 models), suggesting scale influences output characteristics.
- Robustness: These idiosyncrasies proved robust, generalizing across different prompt distributions (tested on OOD datasets like Cosmopedia, LmsysChat, WildChat) and persisting even when prompts included constraints on output length or formatting.
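The held-out accuracies quoted above are plain N-way classification accuracies. A minimal evaluation loop for the LLMClassifier sketched earlier might look like the following; the batching details and the (text, label) record format are assumptions carried over from the data-collection sketch, not the paper's released code.

```python
import torch

@torch.no_grad()
def evaluate(model, tokenizer, val_set, batch_size=32, device="cpu"):
    """Compute N-way source-model classification accuracy on held-out examples."""
    model.to(device).eval()
    correct = 0
    for i in range(0, len(val_set), batch_size):
        batch = val_set[i:i + batch_size]
        inputs = tokenizer([ex["text"] for ex in batch], return_tensors="pt",
                           padding=True, truncation=True).to(device)
        labels = torch.tensor([ex["label"] for ex in batch], device=device)
        preds = model(inputs["input_ids"], inputs["attention_mask"]).argmax(dim=-1)
        correct += (preds == labels).sum().item()
    return correct / len(val_set)

# accuracy = evaluate(model, tokenizer, val_set)
# The paper reports 97.1% on the five-way chat-API task with its fine-tuned classifier.
```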
Sources of LLM Idiosyncrasies
Further analysis pinpointed several factors contributing to these unique model signatures:
- Word-Level Distributions:
- Word choice and frequency are primary drivers. Classification accuracy remains high after removing special characters (95.1% for chat models) or even shuffling words within the text (88.9% for chat models), highlighting the importance of the bag-of-words signal.
- Specific phrases identified via TF-IDF and logistic regression coefficients act as strong indicators (e.g., "certainly" for ChatGPT, "based on" for Claude); a sketch of this analysis follows this list.
- The distribution of the very first word generated also varies significantly across models.
- Letter-level frequencies, conversely, appear highly similar and do not aid distinguishability.
- Markdown Formatting:
- The usage patterns of markdown elements (bold, italics, headers, lists, code blocks) are highly characteristic, particularly for chat and instruction-tuned models. A classifier trained solely on the sequence of markdown tags (ignoring text content) achieved 73-77% accuracy for these model types; a tag-extraction sketch follows this list.
- Models exhibit distinct frequency distributions for different markdown types (e.g., Claude uses fewer bolded segments and headers compared to other chat APIs). Base models show less distinctive formatting patterns.
- Semantic Content and Style:
- Idiosyncrasies are deeply embedded and not merely superficial lexical choices. Accuracy remains high (>90% for chat/instruct) when texts undergo transformations preserving meaning but altering wording, such as:
- Paraphrasing: Rewriting the text using an external LLM (e.g., Mixtral).
- Translation: Translating the text to another language (e.g., German) and back to English.
- Even after aggressive summarization using an external LLM, which significantly shortens and alters the text, classification accuracy remains substantially above chance (e.g., 58.1% for chat models), indicating that high-level semantic structure and stylistic choices contribute to the unique signature.
- Qualitative analysis using LLMs as judges described differences in tone (e.g., descriptive vs. concise), level of detail, and structural preferences (e.g., use of lists, paragraph length).
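One way to surface the word-level indicators noted above (e.g., "certainly" for ChatGPT) is to fit an interpretable bag-of-words model and inspect its weights, in the spirit of the paper's TF-IDF and logistic-regression analysis. The sketch below is an approximation: the n-gram range, min_df threshold, and preprocessing are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def top_phrases_per_model(train_set, model_names, ngram_range=(1, 2), k=10):
    """Fit TF-IDF + logistic regression on (text, label) pairs and return the
    phrases with the largest positive coefficient for each source model."""
    texts = [ex["text"] for ex in train_set]
    labels = [ex["label"] for ex in train_set]
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, min_df=5)
    features = vectorizer.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    vocab = np.array(vectorizer.get_feature_names_out())
    return {
        name: vocab[np.argsort(clf.coef_[i])[::-1][:k]].tolist()
        for i, name in enumerate(model_names)
    }

# top_phrases_per_model(train_set, ["chatgpt", "claude", "grok", "gemini", "deepseek"])
```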
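The markdown-only result above (73-77% accuracy from tag sequences alone) can be approximated by reducing each text to the ordered sequence of markdown elements it contains and classifying that sequence instead of the raw text. The regular expressions below are illustrative assumptions about the tag inventory, not the paper's exact extraction rules.

```python
import re

# Illustrative patterns for common markdown elements; the paper's tag set may differ.
_MARKDOWN_PATTERNS = [
    ("HEADER", re.compile(r"^#{1,6}\s", re.MULTILINE)),
    ("BOLD", re.compile(r"\*\*[^*\n]+\*\*")),
    ("ITALIC", re.compile(r"(?<!\*)\*[^*\n]+\*(?!\*)")),
    ("LIST_ITEM", re.compile(r"^\s*(?:[-*+]|\d+\.)\s", re.MULTILINE)),
    ("CODE_BLOCK", re.compile(r"```")),
]

def markdown_tag_sequence(text: str) -> str:
    """Replace a text by the ordered sequence of markdown tags it contains,
    discarding all other content."""
    hits = []
    for tag, pattern in _MARKDOWN_PATTERNS:
        hits.extend((match.start(), tag) for match in pattern.finditer(text))
    return " ".join(tag for _, tag in sorted(hits))

# A classifier (e.g. the TF-IDF + logistic regression sketch above) can then be
# trained on markdown_tag_sequence(ex["text"]) instead of the raw text.
```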
Broader Implications
The existence and nature of these idiosyncrasies have significant practical implications:
- Training on Synthetic Data:
- Training LLMs on synthetic data generated by other LLMs poses the risk of idiosyncrasy propagation. The model being trained may inadvertently learn and replicate the unique stylistic patterns of the data-generating model.
- Experiments showed that training different base models (e.g., Llama, Mistral) on synthetic data from the same source (e.g., ChatGPT) made their outputs less distinguishable from each other and more similar to the source model's style.
- Conversely, training the same base model on data from different sources (e.g., ChatGPT vs. Claude) resulted in models whose outputs reflected the distinct idiosyncrasies of their respective data sources. This highlights how synthetic data can imprint source model characteristics onto newly trained models.
- Inferring Model Similarity:
- The classification framework offers a quantitative method to assess the similarity between LLMs, including black-box APIs. By training a classifier on outputs from N-1 known models and evaluating it on outputs from an Nth (potentially unknown or held-out) model, one can analyze the confusion matrix or prediction patterns (a minimal confusion-matrix sketch follows this list).
- If the classifier frequently misclassifies outputs from model X as originating from model Y, it suggests a higher degree of similarity in their output characteristics.
- This analysis revealed potential relationships, such as outputs from Claude, Grok, and Gemini often being misclassified as ChatGPT, potentially indicating shared architectural elements, training data overlaps (including synthetic data), or fine-tuning procedures. It also highlighted similarities between models like Phi-4, ChatGPT, and DeepSeek.
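The similarity analysis described above reads off a confusion matrix: systematic misclassification of model X's outputs as model Y is treated as evidence of similar output characteristics. The sketch below reuses the hypothetical (text, label) record format from the earlier sketches and is not the paper's released analysis code.

```python
import torch

@torch.no_grad()
def confusion_counts(model, tokenizer, val_set, num_classes, batch_size=32, device="cpu"):
    """Count how often texts from source model i are predicted as model j."""
    model.to(device).eval()
    counts = torch.zeros(num_classes, num_classes, dtype=torch.long)
    for i in range(0, len(val_set), batch_size):
        batch = val_set[i:i + batch_size]
        inputs = tokenizer([ex["text"] for ex in batch], return_tensors="pt",
                           padding=True, truncation=True).to(device)
        preds = model(inputs["input_ids"], inputs["attention_mask"]).argmax(dim=-1).cpu()
        for ex, pred in zip(batch, preds):
            counts[ex["label"], pred] += 1
    return counts  # row-normalize to read per-source misclassification rates

# Rows with large off-diagonal mass (e.g. Claude outputs frequently predicted as
# ChatGPT) point to pairs of models with similar output characteristics.
```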
The code implementing the classification framework and experiments is available at https://github.com/locuslab/llm-idiosyncrasies, facilitating further research and application of these findings.
Conclusion
The research demonstrates that LLMs possess distinct, measurable idiosyncrasies in their generated text, originating from lexical choices, formatting habits, and deeper semantic or stylistic patterns. These findings are quantified through a robust classification methodology achieving high accuracy in distinguishing models. The primary implications concern the potential inheritance of these unique signatures when training on synthetic data and the utility of this framework for assessing similarities between different LLMs, offering a new lens through which to analyze the rapidly evolving LLM landscape.