Learn Before Represent (LBR) in ML Generalization

Updated 23 January 2026
  • LBR is a framework that formalizes learning invariant representations by sampling multiple related tasks, drastically reducing per-task sample complexity.
  • The two-stage approach first learns a shared representation, then fine-tunes task-specific heads, ensuring efficient adaptation with minimal labeled data.
  • Empirical results on Boolean-function learning, video action discovery, and LLM embedding tasks demonstrate LBR’s robust performance with significant reductions in labeled-data requirements.

Learn Before Represent (LBR) is a principle and emerging framework in machine learning that formalizes the process of first acquiring knowledge or invariant structure from an environment—over many related tasks or passive observations—prior to extracting representations for efficient task- or domain-specific generalization. LBR provides theoretical and algorithmic grounding for a broad class of transfer, representation, and meta-learning methods and is instantiated in diverse settings ranging from neural networks and internal representation learning to action space inference from video and domain embedding for LLMs.

1. Foundational Principle and Formalization

LBR posits that the information required for robust generalization across tasks cannot, in general, be contained in a single task. Instead, generalization demands access to an environment, a probability measure $Q$ over a family of tasks $\mathcal{T}$, from which the learner can sample multiple related tasks. The learner’s ultimate goal is to extract a representation $f$, usually a map $f: X \to V$, that captures structure common to all tasks in the environment, such that future task learning requires dramatically fewer labeled examples.

Formally, the LBR workflow consists of:

  • Sampling $n$ tasks $h_1, \dots, h_n$ independently from $Q$; for each task $h_i$, collecting $m$ examples $(x_{ij}, y_{ij})$ drawn from the joint distribution associated with $h_i$.
  • Fixing hypothesis families: representations $\mathcal{F} = \{ f: X \to V \}$ and per-task heads $\mathcal{G} = \{ g: V \to \hat{A} \}$.
  • Finding $\hat f \in \mathcal{F}$ that minimizes the empirical representation loss across the $n$ tasks (a computational sketch follows this list), where

$$E^*_Z(f) = \frac{1}{n} \sum_{i=1}^n \inf_{g \in \mathcal{G}} \left[ \frac{1}{m} \sum_{j=1}^m \ell\big(g(f(x_{ij})),\, y_{ij}\big) \right].$$

  • For a new task, using only a small sample to learn a new $g$ on top of the frozen $\hat f$, yielding strong generalization with minimal supervision (Baxter, 2019).
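
When the head class $\mathcal{G}$ is linear and $\ell$ is the squared loss, the inner infimum in $E^*_Z(f)$ is an ordinary per-task least-squares fit, so the objective can be computed directly. The following is a minimal sketch under exactly those assumptions; the representation, dimensions, and data are hypothetical placeholders.

```python
import numpy as np

def empirical_representation_loss(f, xs, ys):
    """Empirical representation loss E*_Z(f), assuming G = linear maps
    and l = squared loss, so the inner inf over g is a per-task
    least-squares solve. xs: (n, m, x_dim), ys: (n, m, 1)."""
    per_task = []
    for x_i, y_i in zip(xs, ys):                      # loop over the n tasks
        v = f(x_i)                                    # (m, v_dim) features
        g, *_ = np.linalg.lstsq(v, y_i, rcond=None)   # inf over linear heads g
        per_task.append(np.mean((v @ g - y_i) ** 2))  # (1/m) sum_j of losses
    return float(np.mean(per_task))                   # (1/n) sum over tasks

# Usage with a random fixed nonlinear representation (illustrative only):
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
f = lambda x: np.tanh(x @ W)
xs, ys = rng.normal(size=(20, 10, 16)), rng.normal(size=(20, 10, 1))
print(empirical_representation_loss(f, xs, ys))
```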

LBR is formally characterized by its sample complexity bounds, which demonstrate that joint representation learning over $n$ tasks reduces per-task sample complexity by an approximately $n$-fold factor compared to learning each task independently.

2. Generalization Guarantees and Sample Complexity

Theoretical analyses provide uniform generalization bounds for LBR. For permissible function classes (e.g., neural networks), loss $\ell$ bounded by $M$, accuracy $\varepsilon$, and confidence $\delta$, the sample complexity for achieving uniform approximation is

$$n \geq O\!\left(\frac{1}{\varepsilon^2} W_{\mathcal{G}}\right), \qquad m \geq O\!\left(\frac{1}{\varepsilon^2} \left( W_{\mathcal{G}} + \frac{1}{n} W_{\mathcal{F}} \right)\right),$$

where $W_{\mathcal{F}}$ and $W_{\mathcal{G}}$ are the parameter counts of the representation and head classes, respectively; the shared representation cost $W_{\mathcal{F}}$ is amortized across the $n$ tasks, matching the $O(a + b/n)$ decomposition below.

Critically, the per-task sample requirement for subsequent, unseen tasks reduces from $O(W_{\mathcal{F}} + W_{\mathcal{G}})$ to $O(W_{\mathcal{G}})$ when a shared $f$ has already been learned. This results in an $O(n)$-fold reduction in the number of examples required per new task in the high-$n$ regime, provided tasks are sufficiently related (Baxter, 2019).

The generalization argument leverages symmetrization and empirical covering-number bounds, which show that sample complexity decomposes as $O(a + b/n)$ per task, where $a$ depends on the output map (head) capacity and $b$ on the representation capacity.
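
As a rough numerical illustration of this decomposition (the capacities below are hypothetical, chosen only to show the shape of the bound):

```python
# O(a + b/n) per-task sample complexity, with hypothetical capacities
# a ~ W_G = 100 (head weights) and b ~ W_F = 10_000 (representation weights).
a, b = 100, 10_000
for n in (1, 10, 100, 1_000):
    print(f"n = {n:>5,}: per-task examples ~ {a + b / n:,.0f}")
# n = 1 needs ~10,100 examples per task (learning alone), while n = 1,000
# needs ~110, approaching the O(W_G) cost of fitting the head alone.
```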

3. Algorithmic Instantiations

3.1 Classical and Neural LBR

The two-stage LBR procedure is algorithmically realized as:

  1. Representation learning stage: Obtain $\hat f$ minimizing $E^*_Z(f)$ across the $n$ tasks using the full $(n, m)$ array of samples.
  2. Task-specific learning: For a new task, learn only the head $g$ using minimal data, keeping $\hat f$ fixed.

Optimization can proceed by gradient descent, with parameter updates for $\hat f$ averaging gradients over all tasks and per-task heads trained independently (Baxter, 2019).
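
A minimal end-to-end sketch of the two-stage procedure in PyTorch is given below. The architecture, dimensions, optimizer, and synthetic data are illustrative assumptions rather than a prescribed implementation; the point is the structure: Stage 1 averages the loss (and hence $\hat f$'s gradients) over all $n$ tasks, and Stage 2 freezes $\hat f$ and fits only a fresh head.

```python
import torch
import torch.nn as nn

X_DIM, V_DIM, N_TASKS, M_PER_TASK = 16, 8, 20, 10  # hypothetical sizes

# Shared representation f: X -> V and one head g_i: V -> R per task.
f = nn.Sequential(nn.Linear(X_DIM, 32), nn.ReLU(), nn.Linear(32, V_DIM))
heads = nn.ModuleList([nn.Linear(V_DIM, 1) for _ in range(N_TASKS)])
loss_fn = nn.MSELoss()
opt = torch.optim.Adam(list(f.parameters()) + list(heads.parameters()), lr=1e-3)

# Synthetic (n, m) sample array: xs[i], ys[i] are task i's m examples.
xs = torch.randn(N_TASKS, M_PER_TASK, X_DIM)
ys = torch.randn(N_TASKS, M_PER_TASK, 1)

# Stage 1: minimize E*_Z(f); averaging per-task losses means f's
# gradient is the average of its gradients across all tasks.
for _ in range(1000):
    opt.zero_grad()
    loss = sum(loss_fn(heads[i](f(xs[i])), ys[i]) for i in range(N_TASKS)) / N_TASKS
    loss.backward()
    opt.step()

# Stage 2: freeze f, fit only a new head on a small novel-task sample.
for p in f.parameters():
    p.requires_grad_(False)
new_head = nn.Linear(V_DIM, 1)
x_new, y_new = torch.randn(M_PER_TASK, X_DIM), torch.randn(M_PER_TASK, 1)
opt_new = torch.optim.Adam(new_head.parameters(), lr=1e-3)
for _ in range(200):
    opt_new.zero_grad()
    loss_fn(new_head(f(x_new)), y_new).backward()
    opt_new.step()
```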

3.2 LBR in Action-Space Discovery from Video

In the CLASP framework (Rybkin et al., 2018), LBR is used to discover an agent’s action manifold via unsupervised stochastic video prediction. The method learns a latent variable $z_t$ that encodes scene dynamics, subject to composability constraints. A second, semi-supervised stage grounds $z_t$ in physical actions with minimal labeled data:

  • Stage 1: Passive, unsupervised video prediction using a variational autoencoder with information bottlenecks and a compositionality loss.
  • Stage 2: Supervised grounding (MLP mappings) between $z_t$ and action labels $u_t$ using a few annotated trajectories, as sketched below.

Empirical results confirm that, after Stage 1, fewer than 1% of the labeled data are required to achieve performance comparable to fully supervised methods (Rybkin et al., 2018).
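
To make Stage 2 concrete, the sketch below fits a small MLP from already-extracted latents $z_t$ to action labels $u_t$ on a handful of labeled trajectories. The dimensions, layer sizes, and data are hypothetical, and the Stage 1 video-prediction model is assumed to have been trained separately; an inverse mapping $u_t \to z_t$ can be trained symmetrically so that actions can be executed through the learned dynamics.

```python
import torch
import torch.nn as nn

Z_DIM, U_DIM = 8, 2  # hypothetical latent and action dimensions

# Stage 2 grounding: MLP from dynamics latents z_t to physical actions u_t.
z_to_u = nn.Sequential(nn.Linear(Z_DIM, 64), nn.ReLU(), nn.Linear(64, U_DIM))

# A few labeled pairs (z_t inferred by the frozen Stage 1 model, u_t given).
z_lab = torch.randn(100, Z_DIM)
u_lab = torch.randn(100, U_DIM)

opt = torch.optim.Adam(z_to_u.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    nn.functional.mse_loss(z_to_u(z_lab), u_lab).backward()
    opt.step()
```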

3.3 Generative-Contrastive LBR for LLM Embeddings

Recent work extends LBR to domain-specific LLM embeddings (Liang et al., 16 Jan 2026), addressing the known deficiency of contrastive learning (CL) in acquiring vertical-domain knowledge:

  • Stage 1: IB-constrained generative learning (IB-GL). Introduce bottleneck “memory” tokens $Z$ and enforce the Markov chain $X \rightarrow Z \rightarrow Y$ with a custom attention mask (sketched after this list), driving $I(X;Z)$ down while driving $I(Z;Y)$ up.
  • Stage 2: Generative-refined contrastive learning (Gen-CL) over $Z$, using InfoNCE while preserving the causal decoder architecture. Architecturally, representation extraction uses the final hidden state of the last bottleneck token as the embedding.
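
The attention mask can be pictured over a token layout [X | Z | Y]: the memory tokens $Z$ may read all of $X$, while the target tokens $Y$ may read only $Z$ (and earlier $Y$ tokens), enforcing the $X \rightarrow Z \rightarrow Y$ chain on an otherwise causal decoder. The sketch below is an assumed illustration of such a mask, not the paper's exact construction.

```python
import torch

def ib_attention_mask(n_x: int, n_z: int, n_y: int) -> torch.Tensor:
    """Boolean mask (True = may attend) for token layout [X | Z | Y],
    enforcing X -> Z -> Y. Illustrative assumption, not the paper's code."""
    n = n_x + n_z + n_y
    mask = torch.zeros(n, n, dtype=torch.bool)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    mask[:n_x, :n_x] = causal[:n_x, :n_x]                     # X: causal within X
    mask[n_x:n_x + n_z, :n_x] = True                          # Z: reads all of X ...
    mask[n_x:n_x + n_z, n_x:n_x + n_z] = causal[:n_z, :n_z]   # ... causal within Z
    mask[n_x + n_z:, n_x:n_x + n_z] = True                    # Y: reads only Z ...
    mask[n_x + n_z:, n_x + n_z:] = causal[:n_y, :n_y]         # ... causal within Y
    return mask
```

Because $Y$ can reach $X$ only through the small budget of $|Z|$ memory tokens, the bottleneck limits $I(X;Z)$ while the generative loss on $Y$ pushes $I(Z;Y)$ up.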

This sequential training enables robust, domain-aware, and compact representations, outperforming both standard CL and generative learning alone in chemistry, medical, and code retrieval by large margins. A bottleneck ratio of $R = |X|/|Z| \approx 500$ is optimal for compression and performance (Liang et al., 16 Jan 2026).
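
Stage 2's contrastive objective is standard InfoNCE computed on embeddings pooled from the final hidden state of the last bottleneck token. A generic sketch follows; in-batch negatives and the temperature value are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def info_nce(z_query: torch.Tensor, z_pos: torch.Tensor, temperature: float = 0.05):
    """In-batch InfoNCE: rows of z_query / z_pos are paired embeddings,
    e.g., last-bottleneck-token hidden states of queries and documents."""
    q = F.normalize(z_query, dim=-1)
    p = F.normalize(z_pos, dim=-1)
    logits = q @ p.T / temperature           # (B, B) cosine similarities
    labels = torch.arange(q.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```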

4. Representative Applications and Empirical Findings

LBR has been evaluated and validated in multiple domains:

  • Low-dimensional environments: Translation-invariant and symmetric Boolean functions learned with feedforward networks exhibit a rapid reduction in the required $m$ as $n$ increases; after learning $n \approx 20$ patterns, only $m \sim 10$ examples suffice for new patterns (Baxter, 2019).
  • Action modeling from videos: In Reacher/BAIR setups, CLASP shows absolute action prediction errors of $2.9^\circ$, matching fully supervised models with $\sim 100$ times less labeled data (Rybkin et al., 2018).
  • Domain-specific LLMs: In chemistry, medical, and code retrieval, LBR (Qwen2.5-1.5B, Causal) achieves Recall@10 of 0.797/0.979/0.980 versus 0.712/0.961/0.850 for the strongest baseline, and an average score of 87.9 versus 79.3. The IB-GL stage alone outperforms pure CL on retrieval and nearly matches supervised fine-tuning in generation (Liang et al., 16 Jan 2026).

Ablations demonstrate the importance of adequate compression ($R \approx 500$ for memory tokens) and of maintaining the causal attention pattern throughout both stages to preserve knowledge-rich representations.

| Application Domain | Framework | Empirical Finding |
| --- | --- | --- |
| Boolean functions | Neural LBR | $m$ per task drops sharply as $n$ increases |
| Action video prediction | CLASP (LBR) | $< 3^\circ$ error with $\sim 100$ labeled clips, matching supervised models |
| LLM domain embeddings | LBR (LLMs) | Substantial gains in domain retrieval and generation over CL and GL baselines |

5. Context Within Representation, Transfer, and Meta-Learning

LBR generalizes the representation learning and meta-learning paradigm by providing formal backing for leveraging inductive bias and environmental information across tasks. Unlike conventional empirical risk minimization, which focuses on learning a single task, LBR mathematically explains why and how shared representation learning—via the explicit use of a task environment—yields drastic reductions in sample complexity and supports fast adaptation.

LBR subsumes and provides theoretical context for a range of methods:

  • Multi-task learning (shared feature spaces)
  • Transfer/few-shot learning (post-hoc rapid adaptation)
  • Contrastive pretraining (alignment after compression/knowledge acquisition)
  • Information bottleneck approaches (controlled semantic compression)

The objective separation between knowledge acquisition (generative, bottlenecked) and representation alignment (contrastive, discriminative) resolves practical conflicts in model objectives and architectural mismatches, especially relevant for LLM adaptation (Liang et al., 16 Jan 2026).

6. Limitations, Open Directions, and Extensions

Current LBR research faces several open challenges:

  • Task-structure sensitivity: Performance depends on the degree of shared structure across tasks; unrelated tasks afford no sample-complexity gains (Baxter, 2019).
  • Scalability: Extensions to higher-dimensional, real-world domains (e.g., complex agent body-plans, law, and finance) remain non-trivial (Rybkin et al., 2018, Liang et al., 16 Jan 2026).
  • Adaptive bottleneck design: Fixed-size bottleneck tokens in IB-GL may over- or under-compress, motivating research into adaptive or hierarchical bottlenecks.
  • Objective coupling: Joint optimization of generative and contrastive objectives, or multi-task blends, is not fully resolved.
  • Planning and control: In CLASP, moving from fixed MPC-style planning to optimal control, and extending to real robot loops, are ongoing directions.

A plausible implication is that further development of adaptive representations, hierarchical IB constraints, and hybrid generative-contrastive pipelines may expand LBR’s impact on both generalization theory and practical domain adaptation.

7. Historical Significance and Modern Implications

LBR’s theoretical roots date to the 1990s (Baxter, 2019), and its analysis presaged many developments in modern machine learning, from explicit multi-task pretraining to contrastive and bottleneck-based representation frameworks. The LBR perspective provides a rigorous explanation for the observed effectiveness of pretraining on large task or data corpora followed by fine-tuning, quantifying efficiency gains via formal sample complexity reductions.

In modern LLM-based settings, LBR addresses knowledge alignment and objective conflicts, establishing a new paradigm for accurate and robust representations in vertical domains (Liang et al., 16 Jan 2026). Its unifying framework underscores the value of first acquiring domain-invariant structure before extracting and aligning representations for downstream use, providing a principled alternative to ad hoc transfer and adaptation heuristics.
