
Value-Based Pre-Training with Downstream Feedback

This presentation explores a revolutionary approach to foundation model pretraining that uses lightweight downstream feedback to steer the pretraining process. Instead of relying on fixed proxy objectives like next-token prediction, this method introduces a task designer that reshapes pretraining tasks online using gradient alignment with downstream goals, achieving better performance per compute step without training the large model on downstream labels.
Script
What if we could teach foundation models to learn exactly what we care about, instead of hoping that next-token prediction will magically give us the capabilities we need? This breakthrough research shows how a small amount of downstream feedback can dramatically reshape pretraining to deliver more value per gradient step.
Let's start by understanding the fundamental limitation this work addresses.
Foundation models today train on massive datasets using proxy objectives that don't directly align with what we want them to learn. This misalignment wastes precious compute cycles on patterns that may not translate to better reasoning or perception capabilities.
The authors ask a provocative question: can we use just a tiny bit of downstream feedback to guide pretraining toward the capabilities we actually need? The key insight is to keep everything else the same but intelligently reshape how the pretraining task itself is constructed.
Now let's dive into their elegant solution called value-based pretraining.
The solution introduces a lightweight task designer that continuously reshapes pretraining tasks based on how well they align with downstream goals. Crucially, the main model never sees downstream labels; it only benefits from better-designed pretraining tasks.
At the heart of their approach is a value function that computes the dot product between pretraining gradients and downstream gradients. This simple mathematical operation captures how much a pretraining step will help with downstream performance.
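As a minimal sketch of this idea (the function name and setup here are illustrative, not the paper's exact formulation), the value of a pretraining step can be scored by the dot product between its gradient and a downstream gradient:

```python
import numpy as np

def alignment_value(grad_pretrain, grad_downstream):
    # Hypothetical helper: score a pretraining step by how well its
    # gradient aligns with the downstream gradient (positive = helpful).
    return float(np.dot(grad_pretrain.ravel(), grad_downstream.ravel()))

# Aligned gradients yield a positive value; opposed ones a negative value.
print(alignment_value(np.array([1.0, 2.0]), np.array([1.0, 1.0])))   # 3.0
print(alignment_value(np.array([1.0, 2.0]), np.array([0.5, -1.0])))  # -1.5
```

A positive value means the pretraining update also moves the model downhill on the downstream loss; a negative value means it actively works against it.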
This diagram beautifully illustrates the complete framework where the task designer sits between unlabeled data and the learner, continuously optimizing how pretraining tasks are constructed. The downstream feedback creates a control loop that steers pretraining toward more valuable learning, while the learner itself remains focused purely on the reshaped pretraining objective.
Let's examine how this works in practice for both language and vision tasks.
The researchers demonstrate their approach across two very different domains. For language, they replace hard next-token targets with learnable soft distributions over the most likely tokens. For vision, they replace fixed data augmentations with learned, instance-specific masking that creates more informative self-supervised learning views.
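For the language case, one way to picture the reshaped target (a hypothetical sketch, assuming a top-k soft distribution mixed with the original hard target; the authors' parameterization may differ) is:

```python
import numpy as np

def soft_targets(logits, hard_idx, k=5, alpha=0.7):
    # Hypothetical sketch: replace the one-hot next-token target with a
    # soft distribution over the k most likely tokens, mixed with the
    # original hard target. `alpha` is the weight kept on the true token.
    top = np.argsort(logits)[-k:]                 # indices of top-k tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                          # softmax over the top-k
    target = np.zeros_like(logits)
    target[top] = (1.0 - alpha) * probs
    target[hard_idx] += alpha                     # keep mass on the true token
    return target

t = soft_targets(np.array([0.1, 2.0, 0.3, 1.5, 0.0]), hard_idx=1, k=3)
print(t.sum())      # 1.0: still a valid distribution
print(t.argmax())   # 1: the true token keeps the most mass
```

The designer can then tune how this soft distribution is built so that training on it produces gradients better aligned with the downstream task.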
The algorithm elegantly alternates between updating the task designer to maximize gradient alignment and updating the learner on the designer-shaped pretraining tasks. This creates a continuous feedback loop that steers pretraining toward more downstream-relevant patterns without ever exposing the learner to downstream labels.
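A toy scalar version of this alternating loop (purely for intuition, not the authors' implementation) makes the dynamics concrete: a learner parameter theta minimizes a pretraining loss (theta - phi)^2 shaped by a designer parameter phi, while the designer ascends the gradient-alignment value against a downstream loss (theta - 1)^2:

```python
# Toy sketch of the alternating scheme (hypothetical, for intuition only).
# Pretraining loss (shaped by the designer): (theta - phi)^2
# Downstream loss (used only by the designer): (theta - 1)^2
theta, phi, lr = 0.0, 0.0, 0.1
for _ in range(200):
    g_pre = 2.0 * (theta - phi)     # pretraining gradient w.r.t. theta
    g_down = 2.0 * (theta - 1.0)    # downstream gradient w.r.t. theta
    # Designer step: ascend the alignment value g_pre * g_down w.r.t. phi,
    # whose derivative is -2 * g_down.
    phi += lr * (-2.0 * g_down)
    # Learner step: descend only the designer-shaped pretraining loss.
    theta -= lr * g_pre
print(theta)  # approaches 1.0: pretraining is steered toward the downstream goal
```

The learner never touches the downstream loss directly, yet it ends up at the downstream optimum because the designer keeps reshaping its pretraining target.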
Now let's look at the impressive experimental validation across language and vision domains.
The language experiments focus on mathematical reasoning using continued pretraining on math problems. Remarkably, they use only 12% of the GSM8K training set as feedback while processing the same amount of unlabeled data and compute as the baselines.
For vision, they tackle dense prediction tasks that traditional self-supervised learning struggles with. Using just 512 labeled examples each from segmentation and depth estimation datasets, they guide continued pretraining on ImageNet toward better dense understanding.
The results are striking: consistent improvements across all model sizes on mathematical reasoning tasks. The smaller models benefit most dramatically, suggesting that value-based pretraining is particularly effective when compute is limited and every gradient step needs to count.

The vision results demonstrate that steering pretraining toward dense prediction tasks doesn't hurt general visual representation quality. In fact, models often improve on both the target dense tasks and maintain or enhance performance on other visual tasks like retrieval.
This figure reveals two key insights: first, value-based pretraining reaches target performance levels using fewer tokens, demonstrating improved sample efficiency. Second, by adjusting the feedback weights, researchers can smoothly trade off between different capabilities like segmentation and depth estimation, giving practitioners fine-grained control over what their models learn.
The researchers conducted thorough ablations to understand what drives their improvements. Random feedback destroys the benefits, confirming that intelligent gradient alignment is crucial. Even after removing potential data contamination, the advantages persist, and the method shows good transfer to related but distinct tasks.
Like all good research, this work acknowledges important limitations and future directions.
The method currently requires differentiable feedback signals and adds modest computational overhead. Future work could extend this to handle preferences, pass-fail metrics, or other non-differentiable feedback that's more natural for many applications.
The future looks promising for extending this approach to handle human preferences, tool use success rates, and other real-world feedback signals. This could fundamentally change how we think about the boundary between pretraining and fine-tuning.
This work addresses one of the most fundamental challenges in modern machine learning: how to efficiently train foundation models that excel at the capabilities we actually care about. By providing a principled way to steer pretraining with minimal supervision, it opens new possibilities for building more capable and aligned models.
Value-based pretraining represents a paradigm shift from hoping proxy objectives will give us what we want, to actively steering learning toward our goals from the very beginning. Visit EmergentMind.com to dive deeper into this research and explore how gradient alignment could reshape the future of foundation model training.