
Evaluating Steering Techniques using Human Similarity Judgments (2505.19333v1)

Published 25 May 2025 in cs.AI

Abstract: Current evaluations of LLM steering techniques focus on task-specific performance, overlooking how well steered representations align with human cognition. Using a well-established triadic similarity judgment task, we assessed steered LLMs on their ability to flexibly judge similarity between concepts based on size or kind. We found that prompt-based steering methods outperformed other methods both in terms of steering accuracy and model-to-human alignment. We also found LLMs were biased towards 'kind' similarity and struggled with 'size' alignment. This evaluation approach, grounded in human cognition, adds further support to the efficacy of prompt-based steering and reveals privileged representational axes in LLMs prior to steering.

Summary

Analysis of Steering Techniques in LLMs Through Human Similarity Judgments

The paper "Evaluating Steering Techniques using Human Similarity Judgments" investigates the effectiveness of various steering techniques for LLMs by employing human similarity judgments as a performance metric. The research aims to bridge the existing gap where traditional evaluations of steering techniques predominantly focus on task-specific performance without considering the alignment of model representations with human cognitive processes.

The paper employs the triadic similarity judgment task, a method grounded in cognitive science in which a judge selects which of two alternatives is more similar to a reference concept. By requiring models to judge similarity along two dimensions (size and kind), the authors evaluate how well different steering methods adapt LLMs toward human-like judgment processes.
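To make the task concrete, a minimal sketch is shown below, assuming a simple prompt format: each trial presents a reference concept and two alternatives, the model is steered (here via the prompt) toward one dimension, and alignment is scored as agreement with the majority human choice on each triad. The prompt wording, the `query_model` callable, and the agreement-based score are illustrative assumptions, not the paper's materials or metric.

```python
# Hypothetical sketch of a triadic similarity trial and a simple alignment score.
# Prompt wording, concept names, and `query_model` are assumptions for illustration.
from typing import Callable, List, Tuple

def triad_prompt(reference: str, option_a: str, option_b: str, dimension: str) -> str:
    """Build a triadic judgment prompt steered toward one dimension (e.g. 'size' or 'kind')."""
    return (
        f"Judge similarity strictly by {dimension}. "
        f"Which is more similar to '{reference}': (A) '{option_a}' or (B) '{option_b}'? "
        f"Answer with A or B."
    )

def alignment_score(
    triads: List[Tuple[str, str, str]],
    human_choices: List[str],           # majority human answer per triad, 'A' or 'B'
    query_model: Callable[[str], str],  # returns the model's raw answer text
    dimension: str,
) -> float:
    """Fraction of triads on which the model agrees with the human majority choice."""
    agree = 0
    for (ref, a, b), human in zip(triads, human_choices):
        answer = query_model(triad_prompt(ref, a, b, dimension)).strip().upper()[:1]
        agree += int(answer == human)
    return agree / len(triads)

# Toy usage (hypothetical data):
# triads = [("whale", "goldfish", "elephant")]
# human_choices_size = ["B"]  # by size, a whale is closer to an elephant
# score = alignment_score(triads, human_choices_size, my_llm, "size")
```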

Key Findings

  1. Prompt-Based Steering Superiority: The research found that prompt-based steering methods significantly outperform others in terms of steering accuracy and alignment with human judgments. This highlights the versatility of prompting in enabling models to align their outputs with human-like cognitive processes.
  2. Innate Biases in LLMs: A critical observation was an inherent bias in LLMs toward 'kind' similarity over 'size' similarity, independent of steering interventions. Models aligned more closely with kind-based judgments, while size-based judgments showed markedly weaker alignment, suggesting a pre-steering representational axis that privileges kind over size.
  3. Competence vs. Alignment: The findings reveal a discrepancy between model competence (task performance) and alignment with human representations. Models could achieve high accuracy on size judgments even though those judgments aligned poorly with human cognitive representations, showing that accuracy alone is not sufficient to evaluate whether steering produces human-like internal representations.
  4. Effectiveness of Steering Techniques: Among the techniques examined, prompt-based steering yielded higher alignment with human representations than intervention-based methods such as Task Vectors or DiffMean (a sketch of the latter follows this list), reaffirming the utility of prompting for steering models toward more cognitively aligned outputs.

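To illustrate what an intervention-based method involves, below is a minimal sketch of difference-in-means (DiffMean) activation steering, assuming access to hidden states at a chosen layer: a steering vector is computed as the difference between mean hidden states elicited by two contrastive prompt sets (for example, size-framed versus kind-framed) and then added, scaled, to the hidden state during generation. The array shapes, layer choice, and scaling factor `alpha` are generic assumptions, not the paper's implementation.

```python
# Minimal sketch of difference-in-means (DiffMean) activation steering.
# Shapes and the point of intervention are generic assumptions.
import numpy as np

def diffmean_vector(acts_target: np.ndarray, acts_baseline: np.ndarray) -> np.ndarray:
    """Steering vector = mean(target activations) - mean(baseline activations).

    acts_target / acts_baseline: shape (n_examples, hidden_dim), e.g. hidden
    states collected from size-framed vs. kind-framed prompts at one layer.
    """
    return acts_target.mean(axis=0) - acts_baseline.mean(axis=0)

def apply_steering(hidden_state: np.ndarray, steer: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden_state + alpha * steer

# Usage sketch: collect layer-l activations for the two contrastive prompt sets,
# then add the resulting vector at the same layer during generation.
# steer = diffmean_vector(size_prompt_acts, kind_prompt_acts)
# steered_hidden = apply_steering(hidden, steer, alpha=4.0)
```
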
Implications and Future Directions

The paper underscores the importance of incorporating cognitive science methods into LLM evaluation frameworks, advocating a shift toward assessing models not only on performance metrics but also on how closely their representations resemble human cognitive structures. Such an approach could advance AI interpretability by bringing model representations closer to human-like thought processes.

For future developments, the paper advocates for leveraging these cognitive benchmarks across various contextual domains beyond size and kind, possibly integrating more complex and dynamic semantic dimensions reflecting real-world reasoning and decision-making. Such expansions could further refine the interpretability and efficacy of LLM steering techniques, aiding in the creation of more robust AI systems.

Moreover, exploring additional LLMs and steering methods beyond the computational constraints of this study could provide broader insight into the scalability and generality of these findings.

Ultimately, this research contributes a more nuanced understanding of LLM steering, presenting findings that motivate continued refinement of steering methodologies to foster cognitive alignment and strengthen human-AI collaboration.