BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models (2506.02204v2)

Published 2 Jun 2025 in cs.CL

Abstract: LLM evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choice of benchmarks is endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose a methodology for automated comparison of LLMs that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as "conditional 'were' in the phrase 'if you were'" and "exclamation marks after emotional statements", where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.

Summary

  • The paper introduces BehaviorBox, a novel pipeline that uncovers fine-grained performance differences between language models using performance-aware contextual embeddings.
  • It employs sparse autoencoders to extract semantically coherent features, enabling detailed analysis beyond traditional corpus-level perplexity metrics.
  • Experiments across model size, post-training, and family comparisons reveal actionable insights for optimizing model selection and training strategies.

Overview of BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models

The paper introduces BehaviorBox, a methodology that automates the comparison of language models (LMs) by identifying nuanced performance differences in fine-grained contexts. It addresses a core difficulty of LM evaluation: brittle prompts and corpus-level perplexity often mask where two models actually diverge. BehaviorBox employs performance-aware contextual embeddings to surface specific text features where one LM outperforms another, offering a more granular view of model efficacy than the aggregate benchmarks typically used.

BehaviorBox is structured as a three-part pipeline:

  1. Data Generation: The process begins by computing contextual embeddings for words and aligning these embeddings with LM probabilities, creating a performance-aware representation of the text. This alignment addresses the mismatch between the tokenization schemes of different LMs and of the embedding model, ensuring that comparisons are meaningful across distinct architectures and training regimes (a minimal sketch of this alignment appears after this list).
  2. Feature Extraction: BehaviorBox trains sparse autoencoders (SAEs) to learn linear decompositions of these representations, extracting features that reveal performance differences between models. The key contribution is the ability to find semantically coherent, fine-grained features without predetermined data partitions, which improves interpretability and context-awareness in model evaluation (see the SAE sketch below).
  3. Labeling and Analysis: The system generates natural-language descriptions of the discovered features, leveraging an LLM to annotate and validate the groups, giving researchers an accessible interface for understanding the contexts in which model performance diverges (see the labeling sketch below).
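
To ground step 1, here is a minimal sketch of the cross-tokenizer alignment idea, assuming Hugging Face transformers causal LMs with fast tokenizers; the whitespace word-splitting rule and the heuristic crediting each token to the word containing its first non-space character are illustrative assumptions, not the paper's exact procedure.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def word_logprobs(model, tokenizer, text):
        """Sum token log-probs into word-level scores.

        Words are whitespace-separated spans; each token is credited to the
        word containing its first non-space character, so two LMs with
        different tokenizers yield scores over the same word sequence.
        """
        # Offset mapping requires a fast (Rust-backed) tokenizer.
        enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(input_ids=enc["input_ids"]).logits
        # log P(token_i | tokens_<i); the first token has no score and is skipped.
        logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
        token_lp = logprobs.gather(1, enc["input_ids"][0, 1:, None]).squeeze(1)

        # Character index of each word's first character.
        word_starts = [i for i, c in enumerate(text)
                       if c != " " and (i == 0 or text[i - 1] == " ")]
        scores = [0.0] * len(word_starts)
        for (start, end), lp in zip(enc["offset_mapping"][0, 1:].tolist(),
                                    token_lp.tolist()):
            if start == end:            # special tokens have empty spans
                continue
            while text[start] == " ":   # tokens that begin with whitespace
                start += 1
            idx = sum(1 for s in word_starts if s <= start) - 1
            scores[idx] += lp
        return scores

Because both models' scores now index the same word sequence, the per-word gap between two LMs is well-defined even when they tokenize the text differently; concatenating each word's contextual embedding with these scores yields the performance-aware vectors used in step 2.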
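
For step 2, a compact PyTorch sparse-autoencoder sketch over those performance-aware vectors: X is assumed to be an (N, d) float tensor of word vectors and gap an (N,) tensor of per-word log-prob differences between the two models. The layer sizes, L1 coefficient, and top-k ranking heuristic are illustrative choices, not the paper's reported settings.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """One-hidden-layer SAE: ReLU encoder, linear decoder."""
        def __init__(self, d_in, d_hidden):
            super().__init__()
            self.encoder = nn.Linear(d_in, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_in)

        def forward(self, x):
            z = torch.relu(self.encoder(x))   # non-negative, sparse activations
            return self.decoder(z), z

    def train_sae(X, d_hidden=4096, l1=1e-3, epochs=200, lr=1e-3):
        sae = SparseAutoencoder(X.shape[1], d_hidden)
        opt = torch.optim.Adam(sae.parameters(), lr=lr)
        for _ in range(epochs):               # full-batch training, for brevity
            recon, z = sae(X)
            loss = ((recon - X) ** 2).mean() + l1 * z.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return sae

    def rank_features(sae, X, gap, k=100):
        """Order features by the mean log-prob gap of their top-k words."""
        with torch.no_grad():
            _, z = sae(X)
        scores = [gap[z[:, j].topk(k).indices].mean().item()
                  for j in range(z.shape[1])]
        return sorted(range(len(scores)), key=lambda j: -abs(scores[j]))

Features whose top-activating words show the most lopsided gap are exactly the "one model outperforms the other here" behaviors the pipeline surfaces.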
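
Step 3 then amounts to prompting an annotation LLM with a feature's top-activating words shown in context; the prompt wording and the query_llm callable below are hypothetical placeholders for whatever chat model is used.

    def label_feature(snippets, query_llm):
        """Ask an LLM to describe what unites a feature's top-activating words.

        `snippets` are contexts with the activating word marked (e.g., **word**);
        `query_llm` is a hypothetical text-in/text-out callable.
        """
        prompt = (
            "The following marked words were grouped together by one feature "
            "of a sparse autoencoder. In one sentence, describe what unites "
            "them:\n\n" + "\n".join(snippets)
        )
        return query_llm(prompt)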

Strong Numerical Results and Insights

The paper presents comprehensive experiments using BehaviorBox to compare models across three dimensions: model size (7B vs. 13B), post-training (base models vs. their RLHF-tuned counterparts), and model family (Llama vs. OLMo). Key insights include:

  • Model Size: Larger models consistently outperform smaller ones in scenarios requiring intricate text understanding, such as long-tailed stylistic and typographic phenomena, which supports existing scaling laws linking model size to improved performance.
  • Post-training: Interestingly, post-trained models diverge from their base counterparts: they handle conversational contexts better even while showing higher corpus-level perplexity, evidence that aggregate perplexity can mask task-specific gains.
  • Model Family: Despite minimal perplexity differences, comparisons across model families reveal distinct behavior patterns, particularly in text formatting and punctuation usage, underscoring the diverse architectural biases inherent to different LM families.

Furthermore, BehaviorBox can discern performance discrepancies in model-generated text, demonstrating its utility in practical generation settings beyond static corpus evaluations.

Implications and Future Directions

BehaviorBox provides a promising avenue for refining LM evaluation methodology. By uncovering fine-grained, interpretable differences between models, the approach not only aids model selection and deployment but can also inform training decisions, for instance where post-training improves behavior in specific contexts at the cost of raw next-token prediction.

In theoretical terms, the pipeline can serve as a hypothesis generation tool for further behavioral analysis, augmenting traditional approaches to model diagnostics. Future developments might focus on extending BehaviorBox's applicability across non-textual modalities and broader NLP tasks, optimizing feature extraction processes, and exploring integration with real-time LM deployment monitoring systems.

The methodology presents a significant step towards more transparent LM assessment, facilitating better understanding of model biases and aiding in ethical AI deployment. As AI systems continue advancing, methodologies like BehaviorBox are integral to ensuring these models operate reliably within diverse operational contexts.