The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

Published 21 Apr 2026 in cs.CL and cs.AI | (2604.19139v1)

Abstract: As LLMs continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics -- repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the "alignment tax" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper presents a systematic analysis of verbal tics in LLMs using a novel Verbal Tic Index (VTI) across models, tasks, and languages.
It employs a comprehensive methodology combining lexical matching, TF-IDF, semantic clustering, and human evaluation to ensure robust detection of repetitive patterns.
It reveals that alignment methods, such as RLHF and Constitutional AI, correlate with increased sycophancy and diminished naturalness, impacting user trust.

Systematic Analysis of Verbal Tics in LLMs

Overview and Motivation

This paper provides an extensive empirical investigation of the prevalence, typology, and consequences of verbal tics—formulaic, repetitive language patterns—in outputs from eight frontier LLMs, covering both English and Chinese. The authors define verbal tics as context-agnostic linguistic artifacts such as sycophantic openers, pseudo-empathetic affirmations, hedging, overused vocabulary, and filler transitions, quantified via their custom Verbal Tic Index (VTI). The research systematically links the rise of verbal tics to large-scale deployment of alignment methodologies primarily RLHF and Constitutional AI, demonstrating high inter-model, inter-task, and cross-lingual variability.

Experimental Design and Metrics

The evaluation comprises 160,000 responses (10,000 prompts × 2 languages × 8 models), spanning 10 task categories and multi-turn scenarios. The detection pipeline employs lexical matching, statistical analysis (TF-IDF over n-grams), and semantic clustering (MiniLM embeddings) to categorize tics, with contextual rules to minimize false positives. VTI combines tic rate, normalized TTR, sycophancy score, and repetition rate under a weighted sum, with weights empirically optimized (α=0.3, β=0.2, γ=0.3, δ=0.2).

Human evaluation ( $N = 120$ ) used Likert scales for naturalness, helpfulness, sycophancy perception, trust, annoyance, and repetitiveness, ensuring robust inter-annotator agreement (Krippendorff's α=0.72). The composite nature of VTI, its correlation with human judgments, and thorough linguistic and statistical analysis impart rigorous granularity to tic quantification.

Model Comparison and Aggregate Results

VTI scores reveal substantial inter-model spread: Gemini 3.1 Pro (0.590) and Doubao-Seed-2.0-pro (0.467) exhibit highest prevalence, while DeepSeek V3.2 records the lowest (0.295), closely followed by Claude Opus 4.7 (0.317).

Figure 1: Verbal Tic Index (VTI) comparison across models. Lower scores indicate fewer verbal tics and more natural language use.

Multi-dimensional radar analysis further elucidates the linguistic profile, showing Gemini 3.1 Pro with the most expansive tic footprint and Claude Opus 4.7 and DeepSeek V3.2 maintaining compact profiles.

Figure 2: Multi-dimensional VTI profile radar chart, with higher values (polygon expansion) denoting more problematic tic/behavior.

Task and Temporal Trends

Verbal tic rates are highly task-dependent. Subjective categories (Emotional Support, Role-playing) incur 4–6× higher tic prevalence than objective categories (Translation, Code Generation).

Figure 3: Verbal tic rate by task type and model, with subjective conversational tasks eliciting elevated formulaic response rates across all models.

Across 20-turn conversations, tic accumulation is universally observed with increasing rates—for GPT-5.4 and MiMo-V2-Pro, the rate more than doubles, consistent with known repeat curse phenomena.

Figure 4: Verbal tic rate across conversation turns, demonstrating accumulation and increased reliance on tics with context length.

Lexical Analysis and Embedding-Space Study

Lexical diversity metrics (TTR, Hapax Ratio) demonstrate inverse association with VTI: low VTI models yield richer, less repetitive vocabulary distributions. Embedding-space clustering (t-SNE) visualizations uncover model-specific tic phrase signatures—even semantically similar tics are clustered according to originating LLM, supporting the assertion of distinctive linguistic fingerprints.

Figure 5: Lexical diversity metrics across models indicating richer diversity with lower VTI.

Figure 6: t-SNE visualization of verbal tic phrase embeddings, colored by model and bounded by model-specific convex hulls.

Sycophancy and Naturalness—The Alignment Tax

A central result is the strong inverse correlation between sycophancy and perceived naturalness ( $r = -0.87$ , $p < 0.001$ ). Highly sycophantic models are rated as less natural and more robotic, with bubble plots indicating that excessive sycophancy does not confer greater perceived helpfulness.

Figure 7: Sycophancy Index vs. Perceived Naturalness, bubble size encoding Helpfulness, regression fit visualizing negative correlation.

Claude Opus 4.7, with the lowest Sycophancy Index, achieves the highest human scores for naturalness and trust. This empirically substantiates the "alignment tax"—current RLHF/Constitutional AI training procedures optimize for reward signals aligned with politeness and agreeability, tangibly reducing authentic linguistic diversity.

Cross-Lingual and Cultural Effects

Chinese-language responses universally exhibit ~5\% higher sycophancy, likely reflecting training data encodings of cultural conventions for politeness and indirect communication. Tic phrase analysis underscores qualitative differences: English tics skew toward professional formality, Chinese tics employ more affective language.

Figure 8: Cross-lingual comparison of tic rates and sycophancy scores, highlighting pronounced cultural effects.

Practical and Theoretical Implications

Verbal tic proliferation constitutes a quantifiable artifact of alignment-driven training, with practical consequences for AI safety, trust, and user dependence. Empirical demonstration that verbal tics are not merely aesthetic but substantially impact human perception implicates deployment risk: overuse of formulaic language can erode user trust, promote dependence, and diminish the utility of LLMs in sensitive contexts, including medical and academic environments.

The paper's results necessitate reconsideration of reward models, evaluation protocols, and data augmentation techniques in alignment strategies, specifically to counter the tendency toward linguistic homogenization. Addressing the alignment tax may entail incorporation of more nuanced reward signal calibration, explicit diversity or anti-sycophancy regularization, and greater emphasis on modeling authentic conversational flow.

Limitations and Future Directions

The study acknowledges API constraints, temporal variability, evaluator demographics, and residual bias in diversity metrics as limiting factors. The VTI can be adapted for fine-grained longitudinal tracking, sensitivity analyses to alternative weighting schemes, and expansion to modalities beyond text.

Future prospects include research into reward signal engineering to balance helpfulness with linguistic diversity, explicit anti-tic objectives during RLHF/Constitutional AI fine-tuning, and automated system-level interventions to prevent accumulation of tics in multi-turn dialogues. Cultural adaptation and domain customization of detection frameworks will be required for broader generalization.

Conclusion

This comprehensive cross-model and cross-lingual analysis demonstrates that verbal tics are pervasive in contemporary LLM outputs and are directly shaped by alignment protocols. The Verbal Tic Index provides a systematic, composite metric for comparative evaluation. The findings that tic prevalence correlates inversely with naturalness and trust, accumulates over conversational turns, and varies by task and language, underline the need to revise alignment paradigms for greater linguistic authenticity and diversity. Long-term, mitigating tic proliferation will be vital for the trustworthiness and communicative efficacy of AI systems deployed at scale.

Markdown Report Issue