Filtered Behavior Cloning (FBC)
- Filtered Behavior Cloning (FBC) is an offline RL method that enhances policy training by discarding suboptimal trajectories based on cumulative rewards.
- It applies trajectory-level filtering or per-transition advantage estimation (AFBC) to emphasize high-quality demonstrations in sparse reward and noisy environments.
- Empirical evaluations show FBC methods achieve faster convergence, lower variance, and improved performance compared to standard behavioral cloning and Decision Transformer.
Filtered Behavior Cloning (FBC) is a class of offline reinforcement learning (RL) algorithms that address the challenge of learning high-quality policies from static datasets by explicitly discarding suboptimal data before, or during, behavior cloning. Unlike standard behavioral cloning (BC)—which fits a policy to mimic all actions in the offline dataset—FBC applies explicit filtering at the trajectory or transition level, focusing policy learning on high-return demonstrations. Empirical evaluations show that FBC-type approaches can provide substantial gains in sparse reward settings and in the presence of high-noise or low-quality supervision (Omori et al., 14 Jul 2025, Grigsby et al., 2021).
1. Formal Framework and Variants
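Let $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$ denote an offline dataset of trajectories $\tau = (s_0, a_0, r_0, \dots, s_T, a_T, r_T)$. In the prototypical FBC implementation, each trajectory is scored by its cumulative return $R(\tau) = \sum_{t=0}^{T} r_t$. Given a filtering threshold $\eta$, the filtered dataset is
$$\mathcal{D}_{\eta} = \{\tau \in \mathcal{D} : R(\tau) \ge \eta\}.$$
The policy $\pi_\theta$ is then trained by minimizing the negative log-likelihood over state-action pairs extracted from $\mathcal{D}_{\eta}$:
$$\mathcal{L}_{\text{FBC}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}_{\eta}}\left[\log \pi_\theta(a \mid s)\right].$$
Variants such as Advantage-Filtered Behavioral Cloning (AFBC) [Editor's term] employ per-transition advantage estimates $\hat{A}(s,a)$ in place of trajectory-level return, optionally weighting or masking the BC loss by $f(\hat{A}(s,a))$, e.g., $f(\hat{A}) = \mathbb{1}[\hat{A} > 0]$ (Grigsby et al., 2021).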
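The masked or weighted behavior cloning loss described above can be sketched as follows. This is a minimal illustration that assumes a diagonal Gaussian policy; the function names `gaussian_log_prob` and `filtered_bc_loss`, and the choice to normalize by the total weight, are assumptions made for the example and are not taken from the cited papers.

```python
import numpy as np

def gaussian_log_prob(actions, mean, log_std):
    """Log-density of a diagonal Gaussian policy, summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((actions - mean) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return per_dim.sum(axis=-1)

def filtered_bc_loss(actions, mean, log_std, weights=None):
    """Weighted negative log-likelihood; weights=None recovers plain BC."""
    log_prob = gaussian_log_prob(actions, mean, log_std)
    if weights is None:
        weights = np.ones_like(log_prob)
    # Assumption: normalize by the total weight so that masked-out samples
    # do not shrink the scale of the loss.
    return -(weights * log_prob).sum() / max(weights.sum(), 1e-8)
```

Passing a 0/1 mask as `weights` gives the binary filter, while passing positive real weights gives a soft (for example exponential) weighting.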
2. Algorithms and Implementation Details
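The canonical FBC algorithm computes the cumulative return of every trajectory in the offline dataset, discards the trajectories whose return falls below the filtering threshold, and then trains the policy by maximum likelihood (standard behavior cloning) on the state-action pairs of the remaining trajectories.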
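The filter-then-clone procedure can be sketched end to end on toy data. This is an illustration under stated assumptions, not the authors' implementation: trajectories are represented as dictionaries with hypothetical keys `states`, `actions`, and `rewards`, and a linear policy fitted by least squares (the maximum-likelihood solution for a fixed-variance Gaussian) stands in for the MLP policy.

```python
import numpy as np

def filter_trajectories(trajectories, threshold):
    """Keep trajectories whose cumulative reward is at least `threshold`."""
    return [t for t in trajectories if np.sum(t["rewards"]) >= threshold]

def fit_linear_gaussian_policy(trajectories):
    """Maximum-likelihood fit of a = [s, 1] @ W under fixed-variance Gaussian noise."""
    states = np.concatenate([t["states"] for t in trajectories], axis=0)
    actions = np.concatenate([t["actions"] for t in trajectories], axis=0)
    features = np.hstack([states, np.ones((len(states), 1))])  # append bias column
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights

def act(weights, state):
    """Mean action of the fitted policy for a single state."""
    return np.append(state, 1.0) @ weights

# Toy dataset: successful trajectories use a = 2 * s, failed ones use a = -s.
rng = np.random.default_rng(0)

def make_trajectory(gain, success):
    states = rng.normal(size=(20, 3))
    rewards = np.r_[np.zeros(19), float(success)]  # sparse terminal reward
    return {"states": states, "actions": gain * states, "rewards": rewards}

dataset = [make_trajectory(2.0, True) for _ in range(5)]
dataset += [make_trajectory(-1.0, False) for _ in range(15)]

kept = filter_trajectories(dataset, threshold=1.0)
policy = fit_linear_gaussian_policy(kept)
```

On this toy data the filter keeps only the five successful trajectories, so the fitted policy reproduces the successful behavior instead of averaging it with the failures.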
In high-noise regimes, AFBC augments this with a learned critic $Q_\phi$ (optimizing a double Q-learning loss as in TD3/SAC) and computes batch-wise advantages, $\hat{A}(s,a) = Q_\phi(s,a) - \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}\left[Q_\phi(s,a')\right]$ for $(s,a)$ in the sampled batch. The actor loss is masked to positive-advantage transitions or weighted (e.g., exponentially). Prioritized Experience Replay (PER) is used to upsample the rare high-advantage samples and stabilize learning (Grigsby et al., 2021).
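A batch-wise advantage estimate of this kind might be computed as in the sketch below. It assumes the critic and the policy sampler are available as plain callables (`q_fn` and `sample_policy_actions` are hypothetical names), and it approximates the expectation over policy actions with a Monte Carlo average; critic training itself is not shown.

```python
import numpy as np

def estimate_advantages(q_fn, sample_policy_actions, states, actions, n_samples=8):
    """A(s, a) ~= Q(s, a) - mean_k Q(s, a_k), with a_k drawn from the current policy."""
    q_data = q_fn(states, actions)  # shape (batch,)
    q_policy = np.stack(
        [q_fn(states, sample_policy_actions(states)) for _ in range(n_samples)]
    )  # shape (n_samples, batch)
    return q_data - q_policy.mean(axis=0)

def positive_advantage_mask(advantages):
    """Binary filter: 1 for transitions that look better than the policy, else 0."""
    return (advantages > 0).astype(np.float64)

# Toy critic that prefers actions equal to the state, and a policy that outputs zeros.
toy_q = lambda s, a: -np.sum((a - s) ** 2, axis=-1)
toy_policy = lambda s: np.zeros_like(s)

states = np.array([[1.0], [1.0]])
actions = np.array([[1.0], [-1.0]])
advantages = estimate_advantages(toy_q, toy_policy, states, actions)
mask = positive_advantage_mask(advantages)
```

Here the first dataset action beats the policy's action under the toy critic and is kept, while the second is masked out of the actor loss.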
3. Hyperparameterization and Practical Guidelines
For trajectory-level FBC in sparse-reward domains, the return threshold $\eta$ is typically set so that only successful trajectories are retained, or to an empirical percentile (the 90th percentile, i.e., the top 10% of returns, for sparsified-reward domains). Experiments in (Omori et al., 14 Jul 2025) used the same filter setting across all tasks without cross-validation. Practitioners are advised to ensure the filtered dataset has at least 100 trajectories to mitigate overfitting; keeping the top 5–20% of trajectories is typical.
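A percentile-based threshold with a minimum-size safeguard could look like the sketch below. The function name and the fallback rule (relaxing the threshold until at least `min_trajectories` remain) are assumptions made for illustration; the text above only recommends the size floor and does not prescribe how to enforce it.

```python
import numpy as np

def percentile_threshold(returns, keep_fraction=0.10, min_trajectories=100):
    """Return (threshold, n_kept) for a top-`keep_fraction` return filter."""
    returns = np.asarray(returns, dtype=np.float64)
    threshold = np.percentile(returns, 100.0 * (1.0 - keep_fraction))
    n_kept = int(np.sum(returns >= threshold))
    if n_kept < min_trajectories:
        # Assumed fallback: lower the threshold to the return of the
        # min_trajectories-th best trajectory (or keep everything if fewer exist).
        k = min(min_trajectories, len(returns))
        threshold = np.sort(returns)[-k]
        n_kept = int(np.sum(returns >= threshold))
    return float(threshold), n_kept
```

With 1,000 trajectories, a 10% filter already keeps 100 of them; with only 200 trajectories the same filter would keep 20, so the fallback relaxes the threshold until 100 remain.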
For AFBC in high-noise scenarios, the per-sample filter $f(\hat{A})$ is set as a binary mask $\mathbb{1}[\hat{A} > 0]$ or an exponential weighting $\exp(\hat{A}/\beta)$ with temperature $\beta$. PER parameters (priority exponent $\alpha$, small $\epsilon$ offset) are adopted to maintain sampling of scarce expert-like transitions (Grigsby et al., 2021).
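The two filter shapes and a proportional prioritization rule can be sketched as follows. All numeric defaults are illustrative, the weight clipping is an added assumption for numerical safety, and using the clipped advantage as the priority is one plausible choice rather than a confirmed detail of the cited work.

```python
import numpy as np

def binary_filter(advantages):
    """Keep only transitions with positive estimated advantage."""
    return (advantages > 0).astype(np.float64)

def exponential_filter(advantages, beta=1.0, max_weight=20.0):
    """Soft weighting exp(A / beta), clipped to avoid exploding weights (assumption)."""
    return np.minimum(np.exp(advantages / beta), max_weight)

def per_sampling_probs(advantages, alpha=0.6, eps=1e-3):
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha."""
    # Assumed priority: positive part of the advantage plus a small offset,
    # so that no transition has exactly zero sampling probability.
    priorities = np.maximum(advantages, 0.0) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

advantages = np.array([-1.0, 0.0, 2.0])
probs = per_sampling_probs(advantages)
batch_indices = np.random.default_rng(0).choice(len(advantages), size=4, p=probs)
```

The offset `eps` keeps low-advantage transitions reachable, while the exponent `alpha` controls how strongly sampling concentrates on high-advantage transitions.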
FBC eliminates the need for model-specific hyperparameters required by Decision Transformer (e.g., return-to-go conditioning, context window); only standard BC/Gaussian policy and MLP hyperparameters are required (Omori et al., 14 Jul 2025).
4. Empirical Performance and Comparative Analysis
Extensive benchmarking (Omori et al., 14 Jul 2025, Grigsby et al., 2021) demonstrates that FBC methods surpass both standard BC and Decision Transformer in key settings:
- On D4RL locomotion tasks with sparsified rewards, FBC achieved an aggregate normalized score of 78.2 (vs. DT 75.4, BC 38.7, CQL 40.8). FBC outperformed DT on 7/9 tasks, and provided a ≈4% aggregate improvement over DT.
- On Robomimic robotic manipulation (sparse reward), FBC delivered an average success rate of 0.89 (vs. DT 0.86); the best per-task results were held by either FBC or filtered DT (FDT).
- In high-noise datasets (e.g., an expert:noise ratio of 1:44), AFBC with PER retained >90% of expert performance, while vanilla BC collapsed and unprioritized AFBC degraded rapidly once the noise-to-expert ratio exceeded 10:1 (Grigsby et al., 2021).
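The "≈4%" figure quoted for the D4RL comparison is a relative improvement, which can be checked from the aggregate scores reported above. The helper name below is hypothetical.

```python
def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline

fbc_score, dt_score = 78.2, 75.4  # aggregate normalized scores quoted above
gain_percent = 100.0 * relative_improvement(fbc_score, dt_score)
print(f"{gain_percent:.1f}%")  # prints "3.7%"
```

The absolute gap is 2.8 normalized points; dividing by the DT score gives about 3.7%, which rounds to the stated 4%.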
FBC requires less data than DT and converges with lower variance and in less wall-clock time. With the referenced MLP backbone, FBC uses 0.5 M parameters, compared with 1.0 M for DT's transformer backbone, and trains roughly 3× faster in wall-clock time (Omori et al., 14 Jul 2025).
| Algorithm | Supervision Filter | Critic Required | Extra Overhead |
|---|---|---|---|
| BC | None | No | Minimal |
| FBC | Trajectory return | No (standard) | Minor (filtering) |
| AFBC | Q-based advantage | Yes | Critic + filtering |
| DT | RTG conditioning (implicit) | No | Transformer, RTG |
| AFBC+PER | Q-based adv., PER sample | Yes | Critic, PER |
5. Theoretical and Empirical Rationale
- In sparse-reward problems, failed demonstrations encode little learning signal and introduce label noise; explicit filtering focuses model capacity on genuine successes, reducing overfitting and variance.
- In high-noise/offline datasets with a majority of low-quality transitions, trajectory-level FBC is insufficient; per-transition filters (advantage-based) are more robust. However, binary filters can dramatically reduce effective batch size (as most transitions are discarded), leading to high gradient variance. Prioritized sampling (PER) counteracts this signal scarcity.
- Decision Transformer’s return-to-go conditioning acts as a form of implicit filtering, but with additional computational complexity. In sparse regimes, FBC’s explicit trajectory removal is more direct and efficient (Omori et al., 14 Jul 2025).
- AFBC provides resilience to extreme noise but depends on reliable value estimation—estimator variance near the filtration threshold remains a challenge (Grigsby et al., 2021).
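The batch-size shrinkage caused by binary filtering can be made concrete with an effective sample size calculation. The sketch below uses the Kish formula, $(\sum_i w_i)^2 / \sum_i w_i^2$, which is a standard diagnostic chosen here for illustration and is not taken from the cited papers; the batch size and the number of surviving samples are likewise assumed values.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2); 0 if all weights are zero."""
    weights = np.asarray(weights, dtype=np.float64)
    total = weights.sum()
    if total == 0.0:
        return 0.0
    return float(total ** 2 / np.sum(weights ** 2))

batch_size = 256
uniform = np.ones(batch_size)      # plain BC: every sample contributes
mask = np.zeros(batch_size)
mask[:6] = 1.0                     # binary filter keeping 6 of 256 samples

ess_uniform = effective_sample_size(uniform)
ess_masked = effective_sample_size(mask)
```

In this example the gradient is averaged over 6 samples instead of 256, which is the source of the high gradient variance; prioritized sampling raises the share of surviving samples in each batch.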
6. Limitations and Appropriate Usage
FBC offers clear benefits when reward signals are sparse and trajectory quality is highly variable. If trajectories are uniformly good (dense reward, medium/expert datasets), filtering can remove desirable state-action diversity and harm generalization. In tasks requiring credit assignment across suboptimal intermediate actions ("stitching"), discarding all but the top returns may preclude robust policy recovery; more advanced methods (e.g., CQL, return-conditioned or Q-regularized DT) may be preferable.
The size of the filtered dataset is critical—overly aggressive thresholds risk overfitting and instability. In extremely low-quality or adversarial datasets where expert behavior is very rare, AFBC combined with PER is empirically necessary for positive signal extraction (Grigsby et al., 2021).
7. Connections to Related Areas and Future Directions
FBC and AFBC intersect with research on imitation learning under imperfect supervision, robust policy learning, and prioritized experience replay. Techniques such as advantage estimation, value-based filtering, and multi-modal data selection may offer further improvements. More sophisticated filtering heuristics (e.g., statistical confidence bounds, uncertainty estimates) have been explored empirically but have not yet surpassed the classic filter+PER baseline in large-scale experiments (Grigsby et al., 2021). A plausible implication is that simplicity and ease of implementation make FBC and AFBC appealing defaults, but their effectiveness largely depends on the underlying distributional properties of the offline dataset.
Further analysis of sample complexity, theoretical regret bounds under different noise models, and generalization to domains with temporal abstraction or hierarchical RL remain open research directions.