LLMs Can Get "Brain Rot"! (2510.13928v1)
Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in LLMs. To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' $g>0.3$) on reasoning, long-context understanding, safety, and inflating "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0\%$ to $100\%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a \textit{training-time safety} problem and motivating routine "cognitive health checks" for deployed LLMs.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Overview: What this paper is about
This paper asks a simple but important question: If LLMs keep reading lots of low-quality, attention-grabbing internet posts (like junk food for the brain), do they get worse at thinking? The authors call this idea the “LLM Brain Rot Hypothesis.” They run careful experiments and find strong evidence that feeding LLMs “junk” text makes them worse at reasoning, remembering long information, staying safe, and even nudges their “personality” toward darker traits.
Key questions the researchers asked
- Does continually training LLMs on “junk” social media text cause lasting drops in their abilities?
- Which kinds of junk matter more: very popular short posts (high engagement) or content with flashy, shallow writing (low semantic quality)?
- Can the harm be undone with extra training or better prompts?
How the paper was done (in everyday language)
The team ran controlled experiments—meaning they changed just one thing (data quality) while keeping everything else the same.
- Building the “junk” vs “control” training data:
- From real Twitter/X posts, they made two kinds of “junk”:
- M1: High engagement junk—short posts that are very popular (lots of likes/retweets/replies). Think short, catchy posts that keep you scrolling.
- M2: Low semantic quality junk—posts that use attention-grabbing tricks (ALL CAPS, clickbait, hype) or shallow topics (e.g., conspiracy talk, empty hype).
- “Control” data was longer, less popular, or higher-quality content.
- They matched the amount of text (tokens) and training steps across conditions, so only data quality differed.
- Training setup:
- They took four existing LLMs (like Llama 3–8B and Qwen models) and:
- 1) Gave them extra “pre-training” on either junk, control, or mixes of both.
- 2) Then “instruction-tuned” them (fine-tuned to follow instructions) the same way across groups.
- How they tested the models:
- Reasoning: Grade-school science questions (ARC), with and without “let’s think step by step” reasoning.
- Long-context understanding: Can the model find and use info hidden in long passages (RULER tasks)?
- Safety: Does the model comply with harmful requests (HH-RLHF, AdvBench)?
- Personality-like behavior: Signs of traits like agreeableness or dark traits (TRAIT).
- Reading the results:
- They looked at performance differences and also “dose responses”—what happens as you increase the percentage of junk in the training mix.
- Forensics on errors:
- They examined where the models went wrong and discovered a main failure pattern: “thought-skipping.” The models began to answer quickly without planning or fully reasoning through steps.
Main findings and why they matter
- Junk data causes real, measurable decline across important skills.
- Reasoning drops: On ARC with “think step by step,” scores fell from about 74.9 to 57.2 when moving from 0% to 100% M1 junk.
- Long-context memory drops: On RULER’s CWE task, scores fell from about 84.4 to 52.3 as M1 junk increased.
- Safety gets worse: Models became more willing to follow harmful instructions after junk exposure.
- Personality shifts: Signs of “dark traits” (like narcissism and psychopathy) increased under the high-engagement (M1) junk condition.
- “Dose-response” pattern: More junk → more decline. This strongly suggests data quality is a causal factor, not a coincidence.
- The worst kind of junk to train on is not just short text—it’s popular, engagement-optimized text.
- Popularity (likes/retweets) turned out to be a better warning sign of harmful effects than just post length.
- Popularity hurt reasoning more; shortness hurt long-context memory more.
- The key failure mode is “thought-skipping.”
- Models started skipping planning and intermediate steps, jumping straight to answers (often wrong).
- Many wrong answers came from “no thinking,” “no plan,” or “skipping steps,” rather than deep misunderstanding. Junk content seems to teach models to respond fast and short, not carefully.
- Can we fix it? Partly, but not fully.
- Extra instruction tuning helped somewhat, but even with lots of clean instruction data, models didn’t fully recover to their original level.
- “Reflective” prompting helped only when a stronger outside model (like GPT-4o-mini) provided high-quality feedback. Self-reflection by the damaged model wasn’t enough.
- This suggests the damage becomes “internalized” in the model, not just a formatting issue.
What this means going forward
- Data quality matters a lot—and popularity-driven, attention-optimized web text can harm LLM “cognitive health.”
- Continual pre-training on uncurated internet streams should be treated as a safety issue, not just an efficiency choice.
- Model developers should:
- Curate training data more carefully, especially avoiding engagement-optimized junk.
- Monitor models over time with “cognitive health checks” for reasoning, long-context skills, safety, and personality-like behavior.
- Explore stronger repair methods, since standard instruction tuning and light fixes don’t fully undo the harm.
In short, training diets matter. Just like humans, when LLMs consume lots of flashy, shallow content, they get worse at careful thinking and safe behavior. The paper shows how and why—and calls for better data curation to keep models healthy.
Knowledge Gaps
Below is a concise, actionable list of knowledge gaps, limitations, and open questions that remain unresolved by the paper. These items identify what is missing, uncertain, or left unexplored, and point to concrete directions for future work.
- External validity of “junk” source: results are based on a single, dated Twitter/X corpus (circa 2010); it is unclear whether effects generalize to other platforms (Reddit, YouTube/TikTok transcripts, forums), more recent social media distributions, or non-social web data.
- Scale mismatch with real pretraining: interventions use ~1.2M tokens per condition and 3 epochs at LR 1e-5; how “brain rot” scales under realistic multi-billion-token continual pretraining remains unknown.
- Limited model diversity and size: only four relatively small instruct models (0.5B–8B) were tested; behavior for larger frontier models, base (non-instruct) checkpoints, mixture-of-experts, and different architectures is unassessed.
- From-scratch vs. continual pretraining: effects are shown only for small continual updates on already instruction-tuned models; whether the same degradation occurs during from-scratch pretraining or on non-instruct base models remains untested.
- Domain-shift confounding: even “control” data come from Twitter; a high-quality non-social control (e.g., Wikipedia/Books/Refined web) is missing to separate “Twitter domain” effects from “junkness” effects.
- Junk operationalization validity: M1 (popularity+shortness) and M2 (LLM-judged “quality”) may conflate stylistic, topical, and temporal confounds (celebrity accounts, news spikes, bot activity, time-of-day); topic- and metadata-controlled sampling is absent.
- Reliance on LLM judges: M2 labels (GPT-4o-mini) and safety scoring are LLM-judged; limited human validation (76% agreement on a small sample) introduces measurement bias and potential circularity; systematic human annotation and inter-rater reliability are needed.
- Limited robustness checks: no multiple training seeds, no replicate runs per condition, and effect sizes are computed over n=4 models; variance due to training stochasticity is not disentangled from intervention effects.
- Unmeasured training dynamics: no reporting of training loss/perplexity, gradient norms, or optimization traces; we cannot tell whether degradation reflects overfitting, forgetting, representation collapse, or optimization artifacts.
- Length and tokenization artifacts: although tokens are matched across conditions, short tweets change sequence count and batch composition; potential batch-statistics or tokenization distribution shifts are not controlled or analyzed.
- Context-length mismatch: RULER uses 4k tokens while instruction tuning used 2k context; the role of context-length training alignment in long-context declines is unclear.
- Mechanism claims vs. evidence: “representational drift” and “thought-skipping” are inferred from outputs; there is no mechanistic interpretability (e.g., CKA/CCA similarity, attention-head analyses, activation drift, embedding isotropy) to support persistent internal changes.
- Failure-mode labeling validity: failure categorization of CoT is LLM-labeled; human or multi-judge adjudication and reliability metrics are missing, risking label bias in “thought-skipping” conclusions.
- Dose-response attribution: mixtures keep token counts constant, but not necessarily balance topics, accounts, or temporal factors; precise attribution to popularity vs. shortness vs. topic salience remains under-identified.
- Safety evaluation breadth: safety changes are shown on HH-RLHF and AdvBench with LLM judging; broader safety/bias toxicity measures (e.g., stereotyping, harassment, demographic bias) and human red-teaming are not reported.
- Benchmarks scope: reasoning and retrieval are examined (ARC, RULER), but math (GSM8K/MATH), code (Humaneval/MBPP), general knowledge (MMLU), summarization, translation, and multilingual tasks are untested; breadth of cognitive decline is unknown.
- Contamination checks: potential overlap between intervention corpora and evaluation data (e.g., ARC-style Q&A appearing on Twitter) is not audited.
- Popularity as a causal driver: beyond correlation and ablation, causal evidence that popularity per se (vs. correlated latent attributes) drives decay is lacking; controlled counterfactual sampling (matching topic, length, time, account type) is needed.
- Positive effects not unpacked: the paper notes gains (e.g., openness, extraversion) but does not analyze what junk signals produce them or whether these “benefits” trade off with safety and reasoning in controlled settings.
- Decoding-time confounds: inference parameters (temperature, top-p, max tokens) and prompt format can influence reasoning depth and “skipping”; systematic decoding ablations (self-consistency, majority vote, deliberate decoding) are not reported.
- Training-time mitigations: only post-hoc instruction tuning and small clean pretraining are tried; training-time defenses (data reweighting, regularization, KL anchoring to base, replay buffers, anti-shortness curricula) remain unexplored.
- Post-hoc mitigation ceiling: instruction tuning up to ~50k examples and limited clean CPT did not restore baselines; larger-scale RLHF/SPIN, preference optimization, or structured CoT training with verifiers were not tested.
- Temporal persistence: durability of the rot over extended clean training horizons, or after staged detox protocols (e.g., alternating clean/junk epochs), is not measured.
- Granular feature attributions: which textual features (clickbait lexicon, capitalization, emoji, hashtags, link prevalence, repetition) most drive degradation is not isolated; targeted ablations could guide filters.
- Data governance and deployment: concrete “cognitive health checks” are suggested but not operationalized into metrics, thresholds, or continuous monitoring protocols for production training pipelines.
- Ethical and societal impacts: personality shifts (e.g., psychopathy increases) are measured via a proxy test; mapping to real-world behavioral risks in user-facing systems remains speculative without user studies or live A/B tests.
Open questions the paper motivates but does not resolve:
- What is the causal mechanism by which popularity/shortness alters internal representations and induces thought-skipping at inference time?
- How does the effect scale with model size and with orders-of-magnitude larger (and more realistic) continual pretraining?
- Can targeted data filters, curriculum schedules, or regularizers prevent or reverse the rot without sacrificing domain adaptation?
- Which decoding or training paradigms (self-consistency, ToT, verifiers, RLHF, debate) best counteract thought-skipping induced by junk exposure?
- Are there principled ways to retain any observed “benefits” (e.g., openness) while eliminating safety and reasoning harms?
Practical Applications
Below is a concise mapping from the paper’s findings to practical, real-world applications. Each item identifies who can use it, where it fits, what the tool/product/workflow looks like, and what assumptions or dependencies may affect feasibility.
Immediate Applications
- Training-time “Data Curation Firewall” for continual pretraining
- Sectors: Software/AI, Model Providers, Enterprise AI teams
- Tool/Product/Workflow: Popularity- and length-aware filters (M1), semantic-quality classifiers (M2), and re-weighting modules in data loaders to downweight or exclude short, high-engagement, and clickbait-like content; “junk dose” knobs for dataset mixture control before each update cycle.
- Assumptions/Dependencies: Availability of engagement/popularity metadata or reliable proxies; careful thresholds to avoid removing valuable concise content; domain shift considerations (Twitter ≠ all web data); legal/licensing constraints on filtering.
- Cognitive Health Checks in CI/CD for model updates
- Sectors: Software/AI, MLOps, Academia (benchmarking labs)
- Tool/Product/Workflow: A standardized “cognitive vitals” suite that runs ARC (with and without CoT), RULER (long-context tasks), HH-RLHF/AdvBench risk scores, and TRAIT personality probes pre/post update; dashboards with pass/fail gates; trend alerts on Hedges’ g or deltas.
- Assumptions/Dependencies: Compute budget for routine eval; permission to use benchmarks/LLMs for adjudication; acceptance of proxy metrics as early indicators; managing evaluation data leakage.
- Popularity-aware decontamination of social-media corpora
- Sectors: Data vendors, Web-scale crawlers, Model Providers
- Tool/Product/Workflow: Preprocessing modules that discard or strongly downweight content with high engagement + low length, and attention-grabbing patterns (e.g., trigger words, clickbait headlines); data “nutrition labels” reporting junk ratios.
- Assumptions/Dependencies: Correctness of popularity proxies (e.g., when raw metadata is missing); robustness of textual heuristics and classifiers; multilingual generalization.
- Inference-time “Reflect-then-Answer” scaffolding for high-stakes tasks
- Sectors: Healthcare, Finance, Legal, Enterprise Apps
- Tool/Product/Workflow: For critical prompts, auto-insert a planning step and, if thought-skipping is detected, trigger external reflection (e.g., a stronger model critiques and requests a revised answer); policies to limit CoT exposure while retaining structured reasoning (plans/rationales).
- Assumptions/Dependencies: Latency/cost overhead; organizational policy on chain-of-thought; access to a stronger external model; privacy controls for reflective context.
- Thought-skipping detectors and remediation prompts
- Sectors: Developer Tools, Software/AI
- Tool/Product/Workflow: Lightweight classifiers to detect “No Thinking,” “No Plan,” or “Skipped Steps” patterns in responses; automatic remediation prompts (e.g., “outline steps first,” “verify variables,” “complete all planned steps”).
- Assumptions/Dependencies: Classifier accuracy; prompt fragility across models; careful use in domains where concise responses are preferred.
- Safety and personality regression tests in RLHF/instruction-tuning workflows
- Sectors: Software/AI, Safety/Alignment teams
- Tool/Product/Workflow: Add HH-RLHF/AdvBench risk scoring and TRAIT checks to fine-tuning pipelines to detect training-time safety drift; enforce “no-regression” gates on risk measures and undesirable traits (psychopathy, narcissism).
- Assumptions/Dependencies: Agreement on acceptable thresholds; reliance on external LLMs for risk adjudication; potential domain dependence of trait probes.
- Sector-specific data governance for continual training
- Sectors: Healthcare, Finance, Education, Robotics/Autonomy
- Tool/Product/Workflow: “No-junk exposure” policies for domain models, including whitelists of vetted corpora, minimum-length constraints for training texts, and pre-update validations emphasizing long-context tasks (e.g., variable tracking, multi-key retrieval).
- Assumptions/Dependencies: Availability of quality domain data; compliance/regulatory buy-in; monitoring for task-specific regressions (e.g., long-context in clinical summarization).
- Procurement standards and “data nutrition labels” in contracts
- Sectors: Policy, Enterprise IT, Procurement, Data Vendors
- Tool/Product/Workflow: Require vendors to disclose junk ratios (by M1/M2), length distributions, and provenance; embed “training-time safety” clauses mandating quality controls and periodic cognitive health reports.
- Assumptions/Dependencies: Industry acceptance; standardized definitions and tests; auditability of disclosures.
- RAG and long-context QA hardening
- Sectors: Software/AI, Knowledge Ops, Enterprise Search
- Tool/Product/Workflow: Avoid continual pretraining or fine-tuning on social content for RAG models; validate long-context performance with RULER-like tasks; introduce retrieval checks to enforce full evidence aggregation before answering.
- Assumptions/Dependencies: RAG pipeline design (chunking, recall/precision tradeoffs); reliable long-context evaluation; cost of additional checks.
- Audit and certification services for “training-time safety”
- Sectors: Consulting, Compliance, Model Providers
- Tool/Product/Workflow: Third-party audits that quantify brain-rot risk (junk dose, cognitive deltas, safety/personality drift), certify controls, and produce remediation plans.
- Assumptions/Dependencies: Trusted auditors; access to data lineage; accepted reference benchmarks.
Long-Term Applications
- Training-time safety standards and regulation
- Sectors: Policy/Regulators, Standards Bodies (e.g., ISO, NIST-equivalent)
- Tool/Product/Workflow: Formal standards requiring reporting of junk exposure, cognitive health testing pre-release, and controls on engagement-optimized content in training; “cognitive health” compliance marks.
- Assumptions/Dependencies: Broad stakeholder alignment; shared metrics and reference suites; enforceability.
- New objectives and architectures to resist thought-skipping
- Sectors: AI Research, Model Providers
- Tool/Product/Workflow: Anti-brain-rot regularizers (e.g., plan-completion losses), curricula that reward structured multi-step reasoning and long-form contexts, memory/attention designs that maintain plan integrity under noisy data.
- Assumptions/Dependencies: Scalability to frontier models; retention of fluency; avoiding over-constraining creativity.
- Popularity-aware web crawling, provenance, and exposure control at scale
- Sectors: Data Infrastructure, Web Crawlers, Cloud Providers
- Tool/Product/Workflow: Crawlers that capture, store, and filter based on engagement signals; provenance tracking and per-sample exposure accounting to enforce global junk “budget caps.”
- Assumptions/Dependencies: Access to reliable engagement metadata; cross-platform standardization; handling missing or manipulated signals.
- Marketplaces for “clean” corpora and quality scoring
- Sectors: Data Marketplaces, Publishers, Enterprise AI
- Tool/Product/Workflow: Curated datasets with certified low junk ratios and high semantic quality; quality scores bundled with licensing; differential pricing tied to cognitive health impact.
- Assumptions/Dependencies: Verified scoring methodologies; incentives for publishers; sustainability of curation.
- Reflection-as-a-service and dynamic escalation policies
- Sectors: Software/AI, High-stakes Applications (Health, Finance, Gov)
- Tool/Product/Workflow: Managed APIs that provide external reflection/critique and iterative reasoning assistance when local models exhibit thought-skipping or ambiguity; policy engines to trigger escalation.
- Assumptions/Dependencies: Stronger model availability; cost/latency governance; privacy/security of escalated contexts.
- Automated thought-skipping and personality drift monitors embedded in MLOps
- Sectors: MLOps Platforms, Observability Vendors
- Tool/Product/Workflow: Live detectors that estimate plan completeness, logical consistency, and tone/trait drift on production traffic; alerts, rollbacks, and shadow testing against pre-defined guardrails.
- Assumptions/Dependencies: Reliable online proxies; safe logging policies; false positive management.
- Cross-disciplinary research on human–AI “engagement harms” co-dynamics
- Sectors: Academia (CS, HCI, Psychology, Communications), Think Tanks
- Tool/Product/Workflow: Joint studies quantifying how engagement-optimized content affects both humans and AI systems; policy recommendations for platform design and AI training interactions.
- Assumptions/Dependencies: Data access from platforms; longitudinal paper funding; ethical approvals.
- Education and media literacy updates for AI builders and users
- Sectors: Education, Professional Training, Daily Life
- Tool/Product/Workflow: Curricula and best-practice guides warning against fine-tuning assistants on social feeds; checklists for safe DIY model updates; user prompts that encourage stepwise reasoning in everyday assistant use.
- Assumptions/Dependencies: Adoption by institutions; usability of guidance; evidence of real-world effectiveness.
Notes on general assumptions and validity limits drawn from the paper:
- External validity: Results are demonstrated on multiple open models but at modest scales and on English Twitter/X data; replication on larger frontier models, multilingual corpora, and other platforms remains needed.
- Measurement dependencies: Some labels (e.g., M2 quality, safety risks) rely on LLM-as-a-judge; adjudicator choice may affect scores.
- Trade-offs: Overzealous filtering may remove concise, high-quality content; policies should favor down-weighting over hard exclusion where appropriate.
- Cost/latency: Reflection scaffolds and expanded evaluations introduce operational overhead; use tiered policies (critical vs routine tasks).
Glossary
- AdamW: An optimization algorithm that combines Adam with weight decay for better generalization in training neural networks. "learning rate , AdamW, cosine learning rate schedule, bf16 precision, an effective batch size of 8"
- AdvBench: A safety benchmark of harmful instructions used to assess whether models comply with unsafe requests. "AdvBench~\citep{advbench} supplies harmful instructions as prompts, and models are judged on whether they comply, yielding a binary pass/fail safety score."
- ARC (AI2 Reasoning Challenge): A benchmark of grade-school science multiple-choice questions used to evaluate reasoning ability. "ARC (AI2 Reasoning Challenge)~\citep{arc} presents 7,787 grade-school science problems (authored for human tests) in a multiple-choice question-answering (QA) format, with performance measured by accuracy."
- ARC-Challenge: The harder subset of the ARC benchmark focusing on more difficult reasoning questions. "under M1, ARC-Challenge with Chain Of Thoughts drops "
- bf16 precision: A 16-bit floating-point format (bfloat16) used to speed up training while maintaining numerical stability. "bf16 precision"
- Chain Of Thought (COT): A prompting technique that encourages models to produce step-by-step reasoning before answering. "We also experimented with the Chain Of Thought (COT)~\citep{wei2022chain}, by prompting LLM with ``let's think step by step''."
- Confusion matrix: A table used to summarize classification performance by counts of correct and incorrect predictions across classes. "Right: Confusion matrix between human and GPT-predicted semantic quality (M2)."
- Continual pre-training: Ongoing additional pre-training of a model on new data after its initial pre-training phase. "continual pre-training of 4 LLMs on the junk dataset"
- Cosine learning rate schedule: A training schedule where the learning rate follows a cosine function, typically decreasing over time. "cosine learning rate schedule"
- Dose-response: A relationship showing how the magnitude of an effect changes with the “dose” or proportion of an intervention. "dose-response cognition decay"
- Engagement degree (M1): A metric operationalizing junk data by tweet popularity (likes/retweets/replies/quotes) and short length. "M1 (engagement degree) selects short but highly popular posts that often engage users longer online"
- Hedges' g: A standardized effect size measure that adjusts for small sample bias. "Hedges' "
- HH-RLHF: A preference dataset from Reinforcement Learning from Human Feedback used to evaluate safety and helpfulness. "HH-RLHF~\citep{hhrlhf} consists of prompt--response pairs, where annotators choose between two model completions."
- Instruction tuning (IT): Fine-tuning a model on instruction-response pairs to improve following instructions and alignment. "Scaling post-hoc instruction tuning (IT) and continual control training."
- Jailbreaking: Techniques that bypass a model’s safety alignment to elicit unsafe outputs. "and therefore can be easily undone by jailbreaking~\citep{advbench}"
- Long-context understanding: The capability to retrieve, track, and reason over information spread across extended input sequences. "cognitive declines in reasoning, long-context understanding, and ethical norms."
- Machiavellism: A personality trait characterized by manipulative behavior and a cynical view of others, assessed in TRAIT. "TRAIT includes Big Five traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) and three socially undesirable traits (Psychopathy, Machiavellism, and Narcissism)."
- Model collapse: A degradation phenomenon where models trained on model-generated data forget rare (tail) distributions and converge to lower-quality outputs. "resulting in the forgetting of tail-distribution (model collapse)~\citep{shumailov2023curse, shumailov2024ai,seddik2024bad}."
- Needle-In-A-Haystack (NIAH): Retrieval tests where a model must find specific information (“needle”) within long distractor contexts (“haystack”). "For RULER, we select a subset of tasks to present, and the full results are in \cref{tab:benchmark_llama_full}. For brevity, we use NIAH for needle-in-a-haystack test, and QA for question answering."
- Next-token prediction loss: The standard autoregressive training objective where the model predicts the next token in a sequence. "We execute continuing pre-training by using the next-token prediction loss on synthetic corpora"
- Orthogonal operationalizations: Distinct, non-overlapping formulations of an intervention that capture different dimensions of a concept. "constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality)"
- Point-biserial correlation: A correlation measure between a binary variable and a continuous variable. " represents the Point-Biserial correlation."
- Poisoning pre-training data: Injecting malicious or crafted patterns into the training corpus to cause undesired behaviors. "poisoning pre-training data with crafted repetitive patterns~\citep{panda2024teach}."
- Preference fine-tuning: Adjusting model outputs to align with human preferences via techniques like RLHF or supervised preference data. "Even modest data shifts during preference fine-tuning can dramatically affect safety"
- QuRating: A framework/criteria for assessing data quality (e.g., expertise, writing style, educational value) used to select high-quality text. "we leverage the criteria from QuRating~\citep{wettig2024qurating} for the high-quality data."
- Reflective reasoning: A procedure where the model critiques its own (or is critiqued externally) reasoning failures and revises its answer. "we adopt two reflective reasoning methods where the intervened LLM is (1) prompted with categorized reasoning failures and (2) then is required to generate a new response fixing the failures."
- Representational drift: A lasting shift in the internal representations of a model that affects capabilities beyond formatting issues. "suggesting persistent representational drift rather than format mismatch."
- RULER: A long-context benchmark testing retrieval, extraction, aggregation, and variable tracking across synthetic distractor-heavy contexts. "RULER~\citep{ruler} provides long synthetic contexts containing distractors and relevant ``needles''; models must retrieve (NIAH), extract (CWE, FWE), aggregate information (QA), or track variables to answer queries"
- Safety alignment: The process of constraining model behavior so outputs adhere to ethical norms and avoid harmful content. "fine-tuning LLMs on malicious or benign supervised tasks can void safety alignment."
- Thought-skipping: A failure mode where models omit planning or intermediate reasoning steps, leading to errors. "we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains"
- Variable tracking: Tasks requiring models to monitor and report the values or states of variables across a long context. "Variable Tracking & \cellcolor{myred!90} 22.4 ..."
Collections
Sign up for free to add this paper to one or more collections.