
Reinforcing Diffusion Planners through
Closed-Loop and Sample-Efficient Fine-Tuning

Hongchen Li1,2,3    Tianyu Li2,3    Jiazhi Yang3    Haochen Tian3   
Caojun Wang1,2,3    Lei Shi4    Mingyang Shang5    Zengrong Lin5   
Gaoqiang Wu5    Zhihui Hao5    Xianpeng Lang5    Jia Hu1    Hongyang Li3   
1Tongji University   2Shanghai Innovation Institute   3OpenDriveLab at The University of Hong Kong
4Meituan Inc.   5Li Auto Inc.

🎯 TL;DR: A closed-loop and sample-efficient RFT framework for diffusion-based planners.

📐 Orthogonal Guided Denoising. An orthogonally decoupled guided denoising strategy that perturbs lateral displacement and longitudinal velocity around a reference trajectory, facilitating diverse and continuous trajectory sampling.

🧭 Scenario-Adaptive Exploration. A learnable exploration policy that dynamically adjusts the denoising process based on the current scenario, steering exploration toward more promising regions and thus improving sample efficiency.

🔁 Closed-Loop Optimization. A dual-branch optimization framework that jointly refines trajectory distributions and the exploration policy through closed-loop rollouts.

Abstract

Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance robustness through reward-oriented optimization in a generation-evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, which limits how efficiently informative rewards can be exploited during fine-tuning. To address this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves 10x faster rollouts than the native nuPlan simulator. Extensive experiments show that PlannerRFT achieves state-of-the-art performance, with distinct behaviors emerging during the learning process.

Model Overview


We enhance multi-modality during RL sampling through Guided Denoising, with guidance scales modulated by the Exploration Policy to generate scenario-adaptive trajectories. The planner gathers on-policy interaction data through Closed-Loop Rollout in simulation. A dual-branch optimization framework performs trajectory optimization and exploration optimization to steer the denoising process.
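
As a rough illustration of this loop, the sketch below strings the pieces together in PyTorch-style pseudocode. All module names (encode, guided_denoise, rollout, log_prob) and the group-relative, reward-weighted losses are our own placeholders under assumed interfaces, not the released API; in particular, the trajectory log-probability stands in for the actual diffusion fine-tuning objective:

import torch

def fine_tune_step(planner, exploration_policy, simulator, scene_batch, optimizer):
    # Exploration branch: predict scenario-adaptive guidance scales (lateral, longitudinal).
    ctx = planner.encode(scene_batch)                       # scene features
    scale_dist = exploration_policy(ctx)                    # distribution over guidance scales
    scales = scale_dist.sample()                            # [B, 2]

    # Guided denoising around a reference trajectory (e.g., the IL-pretrained plan).
    reference = planner.reference_trajectory(scene_batch)
    trajs = planner.guided_denoise(ctx, reference, scales)  # [B, K, T, 2] trajectory group

    # Closed-loop rollout in simulation to obtain rewards.
    rewards = simulator.rollout(scene_batch, trajs)         # [B, K]
    adv = rewards - rewards.mean(dim=1, keepdim=True)       # group-relative advantage

    # Dual-branch objective: reinforce high-reward trajectories and the scales that produced them.
    traj_loss = -(adv.detach() * planner.log_prob(trajs, ctx)).mean()
    expl_loss = -(adv.mean(dim=1).detach() * scale_dist.log_prob(scales).sum(-1)).mean()

    optimizer.zero_grad()
    (traj_loss + expl_loss).backward()
    optimizer.step()
    return rewards.mean().item()

The two loss terms mirror the two branches: the first refines the trajectory distribution, the second steers the exploration policy toward guidance scales that led to higher rewards.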

Experiment Results

Closed-loop Performance

Closed-loop planning results on the nuPlan dataset. PlannerRFT adopts the DDIM sampler with 5 denoising steps. DPM: DPM-Solver with 10 denoising steps. DDIM: DDIM sampler with 5 denoising steps. NR: non-reactive traffic. R: reactive traffic.

More Quantitative Results

Comparison of Exploration Policies

We evaluate four exploration policies in RL sampling: 1) w/o Guidance: denoising from random noise without any guidance, 2) w/ Uniform Dist: sampling the guidance scale from a uniform distribution, 3) w/ Fixed Beta Dist: sampling the guidance scale from a fixed Beta distribution, and 4) our policy-guided denoising. All planners use the same 5-step DDIM denoising setup. \( \mathcal{D} \) denotes the modality of the sampled trajectory group, following the definition in DiffusionDrive, while \( \bar{r} \) and \( s_{r} \) represent the mean and standard deviation of the corresponding rewards, respectively.
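
For reference, the four variants differ only in how the guidance scale is drawn before denoising. A minimal sketch (our own illustration with assumed Beta hyperparameters, not the paper's code):

import torch

def sample_guidance_scale(variant, batch=16, ctx=None, policy=None):
    if variant == "no_guidance":
        return None                                                   # denoise from pure noise
    if variant == "uniform":
        return torch.rand(batch, 2)                                   # U(0, 1) per (lat, lon) scale
    if variant == "fixed_beta":
        return torch.distributions.Beta(2.0, 5.0).sample((batch, 2))  # fixed, hand-picked parameters
    if variant == "policy":
        return policy(ctx).sample()                                   # learned, scenario-adaptive (ours)
    raise ValueError(f"unknown variant: {variant}")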


Comparison of Guidance Type

We evaluate the effectiveness of lateral and longitudinal guidance by enabling them individually in the policy-guided denoising process. Results are reported on the Test14-random reactive benchmark.
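
One way to read this ablation: the lateral and longitudinal guidance terms are the two orthogonal components of the offset from the reference trajectory, and each can be enabled on its own. A schematic decomposition (our own illustration, not the paper's exact formulation):

import torch
import torch.nn.functional as F

def decompose_offset(traj, reference):
    # traj, reference: [T, 2] waypoints in the ego frame.
    tangent = F.normalize(torch.gradient(reference, dim=0)[0], dim=-1)  # along-track direction
    normal = torch.stack([-tangent[..., 1], tangent[..., 0]], dim=-1)   # cross-track direction
    offset = traj - reference
    lon = (offset * tangent).sum(-1, keepdim=True) * tangent            # longitudinal component
    lat = (offset * normal).sum(-1, keepdim=True) * normal              # lateral component
    return lon, lat  # enabling only one of them yields the two ablation settings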


Comparison of Fine-tuning Data Distribution

We collect 144,494 non-overlapping scenarios from nuPlan at 10 Hz sampling rate. We evaluate all scenes using the pretrained planner and construct three datasets according to performance scores: 1) Fail, including 10,417 collision or off-road cases; 2) Lt90, including all low-score (less than 90) cases, totaling 24,691 scenes; and 3) All, which includes all available scenes.
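
The three subsets can be reproduced from per-scenario evaluation results of the pretrained planner; a small sketch with placeholder field names:

def build_finetune_sets(records):
    # records: list of dicts such as {"scene_id": ..., "score": 87.5,
    #          "collision": False, "off_road": False} from the pretrained planner.
    fail = [r for r in records if r["collision"] or r["off_road"]]  # 10,417 scenes
    lt90 = [r for r in records if r["score"] < 90]                  # 24,691 scenes
    return {"Fail": fail, "Lt90": lt90, "All": list(records)}       # 144,494 scenes in total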


Qualitative Results

Left: IL-Pretrained. Right: PlannerRFT. S-Curve Lane Change. The ego vehicle starts in the left lane of an S-curve, with a stationary vehicle ahead in the same lane.

Left: IL-Pretrained. Right: PlannerRFT. Multi-Lane. The ego vehicle is driving on a multi-lane road.

Left: IL-Pretrained. Right: PlannerRFT. Blocked Right-Turn in Reactive Traffic. The ego vehicle intends to turn right at the upcoming intersection.

Left: IL-Pretrained. Right: PlannerRFT. Left-Turn in Reactive Traffic. The ego vehicle intends to turn left at an unsignalized intersection.

Left: IL-Pretrained. Right: PlannerRFT. Intersection Pedestrian Avoidance. The ego vehicle intends to make a right turn at an intersection while pedestrians are crossing.

Left: IL-Pretrained. Right: PlannerRFT. Traffic-Cone Narrowing. The ego vehicle is driving on a curved road with cones.

Left: IL-Pretrained. Right: PlannerRFT. A Causal-Confusion Scenario. The ego vehicle turns right at an intersection. The IL-pretrained planner directs the ego vehicle to turn right and then pull over to the side of the road. This behavior is likely due to causal confusion: the training data contains many scenarios in which vehicles turn right and then stop to pick up passengers.

Closed-Loop Simulator: nuMax

For scalable closed-loop training, we develop nuMax, a GPU-parallel simulator built upon Waymax and calibrated for the large-scale nuPlan benchmark, achieving up to 10x faster simulation speed than the native nuPlan simulator.


Illustration of nuMax. (a) Scenario cache: nuPlan scenarios are preprocessed and cached for fast loading during large-scale RL rollouts; (b) LQR tracker and scorer: vehicle kinematics and reward computation are calibrated to match nuPlan; and (c) Distributed RL training pipeline: enables communication between PyTorch DistributedDataParallel (DDP) workers and the JAX-based simulator.
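
For readers unfamiliar with the tracker component, the core of such an LQR tracker is a finite-horizon Riccati recursion over a linearized vehicle model. The generic sketch below is illustrative only and does not use nuMax's calibrated matrices:

import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    # Backward Riccati recursion for u_t = -K_t x_t, where x_t is the tracking error.
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal feedback gain
        P = Q + A.T @ P @ (A - B @ K)                      # cost-to-go update
        gains.append(K)
    return gains[::-1]                                     # gains ordered from t = 0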


Distributed RL Training Pipeline. Policy inference and learning run in PyTorch DDP, while environment simulation is executed in JAX on rank-0 to avoid XLA conflicts. Observations and replay samples are scattered to all ranks for guided denoising and gradient updates, and planned trajectories are gathered back to rank-0 for simulation, with synchronization at every step.
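
The scatter/gather pattern described above can be sketched with torch.distributed primitives; the simulator and planner interfaces here are placeholders, not nuMax's actual API:

import torch
import torch.distributed as dist

def closed_loop_step(planner, simulator, local_obs, world_size, rank):
    # Rank 0 reads observations from the JAX simulator and scatters shards to all ranks.
    obs_chunks = list(simulator.observe().chunk(world_size)) if rank == 0 else None
    dist.scatter(local_obs, scatter_list=obs_chunks, src=0)

    # Every DDP worker runs guided denoising on its shard of scenarios.
    local_plan = planner.plan(local_obs)

    # Planned trajectories are gathered back to rank 0, which steps the simulator.
    plan_chunks = [torch.empty_like(local_plan) for _ in range(world_size)] if rank == 0 else None
    dist.gather(local_plan, gather_list=plan_chunks, dst=0)
    if rank == 0:
        simulator.step(torch.cat(plan_chunks))
    dist.barrier()                                         # per-step synchronization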

Ecosystem

This work is part of a broader research program at OpenDriveLab focusing on large-scale simulation, scene generation, behavior modeling, and end-to-end training for autonomous driving.

Please stay tuned for more work from OpenDriveLab: R2SE, ReSim, NEXUS, MTGS, Centaur.

BibTeX

If you find the project helpful for your research, please consider citing our paper:
@article{li2026plannerrft,
  title={PlannerRFT: Reinforcing Diffusion Planners through Closed-Loop and Sample-Efficient Fine-Tuning},
  author={Hongchen Li and Tianyu Li and Jiazhi Yang and Haochen Tian and Caojun Wang and Lei Shi and Mingyang Shang and Zengrong Lin and Gaoqiang Wu and Zhihui Hao and Xianpeng Lang and Jia Hu and Hongyang Li},
  journal={arXiv preprint arXiv:2601.12901},
  year={2026}
}