- The paper introduces a closed-loop paradigm that couples proposal generation with scoring to improve autonomous driving planning.
- It employs a two-stage training process with pseudo-expert coverage and conservative self-distillation to enhance both proposal diversity and quality.
- It reports state-of-the-art results on NAVSIM v1 and v2 and on nuScenes open-loop evaluation, with seed-to-seed variation below 0.02 PDMS.
CLOVER: Closed-Loop Value Estimation & Ranking for End-to-End Autonomous Driving Planning
End-to-end autonomous driving planners are predominantly trained on imitation of logged human trajectories but are evaluated using rule-based planning metrics that reflect safety, feasibility, progress, and comfort. This creates a mismatch between training and evaluation: adherence to logged paths does not guarantee satisfaction of planning metrics, and valid high-scoring alternatives may exist outside the demonstration set. Proposal-selection planners, which generate a set of candidate trajectories, often suffer from limited candidate diversity and inadequate ranking quality due to single-trajectory supervision. The performance ceiling is determined by both the coverage of high-quality proposals and the efficacy of the ranking mechanism.
CLOVER Framework
CLOVER introduces a closed-loop value estimation and ranking paradigm that couples proposal generation and scoring. The architecture consists of a generator producing diverse candidate trajectories and a scorer predicting planning-metric sub-scores for each proposal. Inference selects the top-ranked trajectory based on the composed score. Training proceeds in two stages:
Stage 1: Pseudo-expert coverage is achieved by constructing evaluator-filtered pseudo-expert trajectories from diverse action families (lateral offsets, speed profiles, etc.), providing set-level supervision to expand proposal diversity and quality. This increases the oracle upper bound for the candidate set.
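A minimal sketch of the Stage 1 idea, under stated assumptions: candidates are enumerated from a grid of lateral offsets and target speeds, and only those that a rule-based evaluator accepts are kept as pseudo-experts. The functions `make_candidates`, `filter_pseudo_experts`, and `toy_evaluator`, the linear lateral ramp, and the 0.9 threshold are all hypothetical stand-ins; the source does not specify how the action families are parameterized or filtered.

```python
import itertools
import numpy as np

def make_candidates(lateral_offsets, target_speeds, horizon_s=4.0, dt=0.5, ramp_s=2.0):
    """Enumerate (lateral offset, speed) families as (T, 2) arrays of x/y in metres."""
    t = np.arange(dt, horizon_s + 1e-9, dt)
    candidates = []
    for offset, speed in itertools.product(lateral_offsets, target_speeds):
        x = speed * t                              # constant-speed forward motion
        y = offset * np.clip(t / ramp_s, 0.0, 1.0) # linear ramp to the lateral offset
        candidates.append(np.stack([x, y], axis=1))
    return candidates

def toy_evaluator(traj, lane_half_width=2.0, min_progress=10.0):
    """Hypothetical stand-in for a rule-based evaluator: in-lane and enough progress."""
    in_lane = bool(np.all(np.abs(traj[:, 1]) <= lane_half_width))
    enough_progress = bool(traj[-1, 0] >= min_progress)
    return 1.0 if (in_lane and enough_progress) else 0.0

def filter_pseudo_experts(candidates, evaluator, min_score=0.9):
    """Keep only candidates whose evaluator score clears the (assumed) threshold."""
    return [c for c in candidates if evaluator(c) >= min_score]

candidates = make_candidates(lateral_offsets=[-3.0, 0.0, 1.5], target_speeds=[2.0, 5.0])
pseudo_experts = filter_pseudo_experts(candidates, toy_evaluator)
print(len(candidates), "candidates,", len(pseudo_experts), "kept as pseudo-experts")
```

The surviving set, rather than the single logged trajectory, would then serve as the set-level supervision target for the generator.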
Stage 2: Conservative closed-loop self-distillation alternates scorer fitting (to true evaluator scores) and generator refinement. The scorer is used to select top-k and vector-Pareto targets, and the generator is trained to cover these with stability regularization. This process avoids diversity collapse and prevents exploitation of scorer imperfections.
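The difference between scalar top-k targets and vector-Pareto targets can be illustrated with a small sketch. It assumes each proposal carries a vector of predicted sub-scores where higher is better, and that the scalar score is a weighted average of those sub-scores; the function names and weights are hypothetical, and the source's actual score composition is not reproduced here.

```python
import numpy as np

def compose_score(sub_scores, weights):
    """Weighted average of per-metric sub-scores: one scalar per proposal."""
    w = np.asarray(weights, dtype=float)
    return np.asarray(sub_scores, dtype=float) @ (w / w.sum())

def top_k_indices(scalar_scores, k):
    """Indices of the k highest scalar scores (stable order on ties)."""
    order = np.argsort(-np.asarray(scalar_scores, dtype=float), kind="stable")
    return order[:k]

def pareto_front_indices(sub_scores):
    """Indices of proposals not dominated by any other (higher is better)."""
    s = np.asarray(sub_scores, dtype=float)
    keep = []
    for i in range(s.shape[0]):
        dominated = any(
            j != i and np.all(s[j] >= s[i]) and np.any(s[j] > s[i])
            for j in range(s.shape[0])
        )
        if not dominated:
            keep.append(i)
    return keep

def select_targets(sub_scores, weights, k):
    """Union of scalar top-k and vector-Pareto proposals, as sorted indices."""
    composed = compose_score(sub_scores, weights)
    chosen = set(top_k_indices(composed, k).tolist()) | set(pareto_front_indices(sub_scores))
    return sorted(chosen)

# Two sub-scores per proposal, e.g. a safety-like and a progress-like term.
sub_scores = [[0.9, 0.2], [0.6, 0.6], [0.2, 0.9], [0.4, 0.4], [0.9, 0.1]]
print(top_k_indices(compose_score(sub_scores, [1, 1]), k=1))  # only the balanced proposal
print(select_targets(sub_scores, [1, 1], k=1))                # also keeps both specialists
```

In this toy case the scalar top-1 keeps only the balanced proposal, while the Pareto front also retains the two proposals that excel on a single sub-score, which is the sense in which vector targets help preserve diversity.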
Theoretical Guarantees
CLOVER’s refinement mechanism relies on a selected-set enrichment condition: if scorer-selected targets are statistically enriched for true high-quality trajectories relative to the existing proposal distribution, conservative set-level distillation increases the generator's support for high-quality candidates. CLOVER does not require a globally perfect scorer; target enrichment suffices for proposal set improvement. Empirical analysis supports this premise, showing substantial enrichment of true high-score proposals among scorer-selected candidates.
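One way to make the enrichment premise concrete is to compare the rate of truly high-quality proposals inside the scorer-selected subset with the rate in the full proposal set. The sketch below is an illustrative formalization, not the paper's exact condition: the function `enrichment_ratio` and the 0.9 quality threshold are assumptions.

```python
import numpy as np

def enrichment_ratio(true_scores, selected_idx, threshold=0.9):
    """Ratio of the high-quality rate among selected proposals to the base rate.

    A value above 1 means the selected subset is enriched for proposals whose
    true evaluator score is at least `threshold` (an assumed cut-off).
    """
    good = np.asarray(true_scores, dtype=float) >= threshold
    base_rate = good.mean()
    if base_rate == 0:
        return float("nan")  # no high-quality proposals at all: ratio undefined
    selected_rate = good[np.asarray(selected_idx, dtype=int)].mean()
    return float(selected_rate / base_rate)

true_scores = [0.95, 0.30, 0.92, 0.50, 0.60, 0.91, 0.20, 0.40]
print(enrichment_ratio(true_scores, selected_idx=[0, 2, 3]))  # above 1: enriched
print(enrichment_ratio(true_scores, selected_idx=[1, 3, 4]))  # 0.0: not enriched
```

Under this reading, the scorer can mis-rank many individual proposals and still be useful, as long as the subset it selects has a ratio above 1.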
Empirical Results
CLOVER reports state-of-the-art or matching results on the NAVSIM benchmarks and on nuScenes open-loop evaluation:
- NAVSIM v1: 94.5 PDMS, outperforming prior generator-scorer baselines and approaching the human-driver reference.
- NAVSIM v2: 90.4 EPDMS with the updated evaluator and 87.2 EPDMS* under the original evaluation code. On the challenging NavHard split, 48.3 EPDMS matches the strongest previously reported results.
- nuScenes open-loop: CLOVER achieves the lowest L2 displacement error and collision rate among compared approaches.
Seed-level reproducibility studies show that results vary by less than 0.02 PDMS across training seeds, indicating that the gains do not depend on a particular run.
Proposal Quality and Diversity
Analysis of generated proposals demonstrates significant stage-wise improvements:
- Stage 1: Expands proposal diversity and raises the oracle upper bound (Oracle@64 PDMS increases from 0.9933 to 0.9976), though it also introduces a low-score tail.
- Stage 2: Refines the expanded distribution: it increases the mean proposal score (from 0.7972 to 0.8277), reduces low-score proposals (PDMS < 0.50 drops from 9.05 to 6.83), and preserves diversity (Qualified Cluster Count@2m is 8.71, up from 6.02 for the baseline).
Qualitative visualizations confirm that CLOVER produces broader candidate sets, covering multiple feasible driving modes, while maintaining high proposal quality.
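The proposal-set statistics quoted above can be approximated along the following lines. This is a sketch under assumptions: Oracle@K is taken as the best true score in the set, the low-score share as the fraction below 0.50, and the cluster count as a greedy count of qualified proposals whose endpoints lie more than 2 m from every earlier cluster centre. The exact definition of Qualified Cluster Count@2m is not given in the source, so `qualified_cluster_count` and its 0.9 qualification threshold are hypothetical.

```python
import numpy as np

def oracle_at_k(scores):
    """Best true score available in the proposal set."""
    return float(np.max(scores))

def low_score_fraction(scores, threshold=0.5):
    """Fraction of proposals whose true score falls below the threshold."""
    return float(np.mean(np.asarray(scores, dtype=float) < threshold))

def qualified_cluster_count(endpoints, scores, min_score=0.9, radius_m=2.0):
    """Greedy cluster count over endpoints of proposals with score >= min_score."""
    pts = np.asarray(endpoints, dtype=float)[np.asarray(scores, dtype=float) >= min_score]
    centres = []
    for p in pts:
        # Open a new cluster only if p is farther than radius_m from all centres.
        if all(np.linalg.norm(p - c) > radius_m for c in centres):
            centres.append(p)
    return len(centres)

endpoints = [[0.0, 0.0], [0.5, 0.0], [5.0, 0.0], [5.0, 1.0], [10.0, 0.0]]
scores = [0.95, 0.92, 0.91, 0.30, 0.99]
print(oracle_at_k(scores), low_score_fraction(scores),
      qualified_cluster_count(endpoints, scores))
```

Reading the three numbers together matters: a higher oracle score with a larger low-score share indicates broader but noisier coverage, which is the pattern attributed to Stage 1.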
Ablation and Diagnostic Studies
Critical ablations reinforce the effectiveness of CLOVER’s components:
- Pseudo-expert coverage and closed-loop refinement are mutually complementary; full CLOVER reaches 94.5 PDMS.
- Vector-Pareto guidance outperforms scalar top-k targets and distance suppression, preserving diversity among high-scoring proposals.
- Anchor-assisted soft reranking (for EPDMS) improves extended comfort and total score by reducing temporal selection jitter, without sacrificing progress.
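The source does not spell out how anchor-assisted soft reranking works. As a purely hypothetical illustration of how a soft term can damp frame-to-frame switching, the sketch below assumes the anchor is a reference trajectory (for example, the previous frame's selection) and subtracts a small, saturating penalty for straying from it. The function `soft_rerank` and the parameters `weight` and `scale_m` are invented for this example.

```python
import numpy as np

def soft_rerank(scores, proposals, anchor, weight=0.05, scale_m=2.0):
    """Return the index maximizing score minus a bounded anchor-distance penalty.

    scores:    (K,) predicted scores, higher is better
    proposals: (K, T, 2) candidate trajectories in metres
    anchor:    (T, 2) reference trajectory (assumed: previous selection)
    """
    scores = np.asarray(scores, dtype=float)
    proposals = np.asarray(proposals, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    # Mean pointwise distance of each proposal to the anchor, shape (K,).
    dist = np.linalg.norm(proposals - anchor[None], axis=-1).mean(axis=-1)
    # The penalty saturates at `weight`, so a clearly better proposal still wins.
    penalty = weight * (1.0 - np.exp(-dist / scale_m))
    return int(np.argmax(scores - penalty))

t = np.arange(1, 5, dtype=float)
keep_lane = np.stack([5.0 * t, np.zeros_like(t)], axis=1)
shifted = np.stack([5.0 * t, np.full_like(t, 4.0)], axis=1)
proposals = [keep_lane, shifted]

print(soft_rerank([0.900, 0.905], proposals, anchor=keep_lane))  # near-tie: stays with anchor
print(soft_rerank([0.900, 0.990], proposals, anchor=keep_lane))  # clear gain: switches
```

Because the penalty is bounded, it only breaks near-ties in favour of consistency, which matches the stated goal of reducing selection jitter without giving up progress.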
Proposal count studies indicate diminishing returns beyond K=64 candidates. Fixed-proposal scorer diagnostics show that larger video and vision backbones (e.g., Wan2.2-5B) improve ranking quality, highlighting the importance of scorer design for final performance.
Limitations and Future Directions
CLOVER focuses on trajectory-level scoring and ranking at the per-scene level. Temporally aggregated metrics (e.g., extended comfort) depend on cross-frame consistency, currently mitigated only by optional anchor-assisted reranking. Future advances could integrate sequence-level or history-aware scorers for improved temporal coherence. Scaling the scorer architecture (with more powerful features or world models) may further enhance ranking quality, though computational costs must be considered.
Implications and Outlook
CLOVER narrows the gap between imitation-based training and rule-based evaluation in end-to-end autonomous driving by combining evaluator-guided proposal coverage with scorer-mediated conservative refinement. Its two-stage paradigm targets both diversity and quality in candidate generation, while conservative closed-loop distillation shifts probability mass toward high-value regions without requiring a perfect surrogate. The theoretical and empirical evidence indicates that even an imperfect scorer suffices, provided its target selection yields statistical enrichment.
Practically, CLOVER's training scheme and generator-scorer inference design are compatible with existing proposal-selection planners, and the reported results show gains on safety, comfort, and progress metrics. The methodology can be extended to more demanding metrics, more diverse driving environments, and richer evaluation schemes. The framework's fixed-proposal diagnostics for scorer development offer a repeatable path toward further improvements.
Conclusion
CLOVER establishes a closed-loop value estimation and ranking blueprint for end-to-end autonomous driving planning, achieving strong numerical results and improved proposal-set quality and diversity on benchmark evaluations. The framework couples evaluator-filtered pseudo-expert coverage and closed-loop self-distillation, demonstrating both practical efficacy and theoretical rigor in guiding proposal generation via imperfect but enriched scorer targets. This paves the way for more robust and generalizable autonomous planning systems with flexible integration of scoring and proposal-generation modules (2605.15120).