Shaping AI Capabilities with Token-Level Data Filtering

This lightning talk explores a groundbreaking approach to AI safety that goes beyond traditional post-hoc safeguards. The researchers demonstrate how filtering training data at the token level during pretraining can effectively remove unwanted capabilities while preserving benign ones, offering a more robust alternative to techniques like refusal training that can be bypassed through jailbreaks or fine-tuning.
Script
What if we could prevent AI models from learning dangerous capabilities in the first place, rather than trying to control them after training? This research introduces a revolutionary approach that shapes model capabilities during pretraining by filtering training data at the token level, offering a more robust foundation for AI safety.
Let's start by examining why current safety approaches fall short.
Traditional safety measures like refusal training and classifiers don't actually remove capabilities from the base model. The authors recognized that we need to shape capabilities during pretraining itself, before unwanted abilities are ever learned.
The researchers asked whether filtering training data could robustly shape capabilities at scale. They used medical knowledge as a proxy for dangerous capabilities, attempting to remove it while preserving related biology knowledge.
The key innovation lies in moving from document-level to token-level filtering.
Traditional document-level filtering throws away entire documents containing any unwanted content, discarding valuable benign text. Token-level filtering precisely targets problematic spans while preserving the surrounding useful information, achieving better precision at the same recall.
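The contrast between the two filtering granularities can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: the "detector" here is a hypothetical keyword set, whereas the researchers use a learned token classifier.

```python
# Toy contrast: document-level vs token-level filtering.
# UNWANTED is a hypothetical flagged-term list standing in for a
# learned per-token classifier.
UNWANTED = {"dosage", "pathogen"}

def document_filter(docs):
    """Drop any document containing a flagged token (loses the benign text too)."""
    return [d for d in docs if not (set(d.split()) & UNWANTED)]

def token_filter(docs):
    """Remove only the flagged tokens, keeping the surrounding benign text."""
    return [" ".join(t for t in d.split() if t not in UNWANTED) for d in docs]

docs = ["cells divide by mitosis",
        "the recommended dosage is unclear"]

print(document_filter(docs))  # only the first document survives
print(token_filter(docs))     # both survive, minus the flagged token
```

Document filtering discards the entire second sentence; token filtering keeps its benign remainder, which is why it achieves better precision at matched recall.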
The authors developed two token-level interventions: loss masking keeps problematic tokens visible for context but prevents learning from them, while removal replaces them entirely. They used innovative weak supervision with sparse autoencoders to efficiently label tokens.
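The two interventions differ in what the model sees versus what it is trained on. A minimal sketch over token-id lists (the sentinel value -100 follows the common cross-entropy `ignore_index` convention; the exact implementation details are assumptions, not taken from the paper):

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label for cross-entropy loss

def loss_mask(tokens, flagged):
    """Loss masking: flagged tokens stay in the input (so the model can
    condition on them as context) but their labels are masked, so no
    gradient flows from predicting them."""
    inputs = list(tokens)
    labels = [IGNORE_INDEX if i in flagged else t for i, t in enumerate(tokens)]
    return inputs, labels

def removal(tokens, flagged):
    """Removal: flagged tokens are deleted outright; the model never sees them."""
    kept = [t for i, t in enumerate(tokens) if i not in flagged]
    return kept, list(kept)

tokens = [10, 11, 12, 13]
print(loss_mask(tokens, {2}))  # ([10, 11, 12, 13], [10, 11, -100, 13])
print(removal(tokens, {2}))    # ([10, 11, 13], [10, 11, 13])
```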
Let's examine how they implemented this approach at scale.
The researchers trained models from 61 million to 1.8 billion parameters on filtered versions of FineWeb-Edu. They evaluated capability suppression using perplexity on medical text, multiple choice questions, and free-form medical question answering.
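As a refresher on the perplexity metric used for capability suppression: perplexity is the exponential of the mean negative log-likelihood per token, so a filtered model that has learned less medicine shows higher perplexity on held-out medical text. A minimal sketch (the log-prob values are hypothetical):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# hypothetical per-token log-probs from a model on a medical passage
print(perplexity([-2.0, -3.0, -2.5]))  # exp(2.5), roughly 12.18
```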
Since token-level annotations are expensive, they developed a clever weak supervision approach using sparse autoencoders to identify medical concepts, then distilled this knowledge into efficient bidirectional classifiers for production filtering.
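The weak-supervision pipeline can be caricatured as: an expensive labeler annotates a corpus once, and its labels are compressed into something cheap enough to run over the full pretraining set. In this toy sketch, a keyword check stands in for the SAE-based labeler and a per-token lookup table stands in for the distilled bidirectional classifier; both substitutions are illustrative assumptions.

```python
from collections import Counter

def weak_label(token):
    """Stand-in for the SAE-based weak labeler: the paper uses sparse
    autoencoder features to flag medical concepts; here a keyword set does."""
    return token in {"dosage", "syndrome"}

def distill(corpus_tokens):
    """Compress the weak labels into a cheap per-token lookup, a toy
    analogue of training a fast classifier for production filtering."""
    votes, totals = Counter(), Counter()
    for tok in corpus_tokens:
        totals[tok] += 1
        votes[tok] += weak_label(tok)
    return {tok: votes[tok] / totals[tok] > 0.5 for tok in totals}

classifier = distill(["the", "dosage", "of", "the", "syndrome", "drug"])
print(classifier["dosage"], classifier["the"])  # True False
```

The point of the two-stage design is cost: the expensive labeler runs once over a sample, while the distilled classifier is fast enough to filter the entire corpus.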
The results demonstrate clear advantages for token-level filtering across multiple dimensions.
Token filtering achieved the same level of medical capability suppression as document filtering while preserving more biology knowledge. For the largest models, this translated to a 7000-fold effective compute slowdown on the medical domain.
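One way to read the "effective compute slowdown" figure, under a standard scaling-law framing (an interpretation, not necessarily the paper's exact definition), is as the ratio of training compute a filtered model would need to reach the same loss on the medical domain as an unfiltered baseline:

$$\text{slowdown} \;=\; \frac{C_{\text{filtered}}(L_{\text{med}})}{C_{\text{baseline}}(L_{\text{med}})} \;\approx\; 7000$$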
The capability suppression extended beyond perplexity to actual task performance. Medical multiple choice accuracy dropped to chance while biology performance remained intact, and medical question answering saw dramatic quality decreases with minimal impact on general tasks.
Crucially, the filtered models showed strong robustness against attempts to recover medical capabilities through fine-tuning. Token removal required 13 times more training data than unlearning baselines to restore medical performance, with robustness increasing at larger scales.
The research uncovered some unexpected benefits of the filtering approach.
Surprisingly, token-filtered models retained the ability to recognize medical content and showed improved alignment behavior. When trained to refuse medical questions, they generalized better than document-filtered models, achieving higher refusal rates on medical queries without over-refusing on general topics.
The authors acknowledge several important limitations of their approach.
The authors note that filtering remains a relatively blunt instrument: fine-grained control over which capabilities survive within the target domain is still difficult. There is also uncertainty about whether the approach will remain effective as models scale to much larger sizes.
This work opens several promising research directions.
Future research could move beyond content-based proxies to direct influence attribution, integrate model internals into filtering decisions, and explore defense-in-depth strategies that combine pretraining filtering with post-training safeguards.
Let's consider the broader implications of this research.
This research demonstrates a shift from reactive to proactive AI safety, showing that we can shape model capabilities during training itself. The robustness advantages over traditional approaches could be crucial as AI systems become more powerful.
Token-level data filtering represents a fundamental advance in AI safety, offering a robust foundation for controlling model capabilities before they're learned rather than after they emerge. For more cutting-edge AI research insights, visit EmergentMind.com to stay at the forefront of the field.