Optimization-Guided Diffusion for Interactive Scene Generation

Shihao Li1,2,3    Naisheng Ye2    Tianyu Li2    Kashyap Chitta4    Tuo An3   
Peng Su5    Boyang Wang1    Haiou Liu1    Chen Lv5    Hongyang Li2

1 Intelligent Vehicle Research Center at Beijing Institute of Technology     2 OpenDriveLab at The University of Hong Kong
3 Nanyang Technological University      4 NVIDIA Research     5 Yinwang Intelligent Tech. Co. Ltd.
Work done while interning at OpenDriveLab
OMEGA (\(\Omega\)) teaser
OMEGA (\(\Omega\)). Our plug-and-play, training-free guidance module can be dropped into existing diffusion-based driving scene generators to provide significantly more realistic, controllable, and interactive traffic scenarios. Color transitions indicate the direction and speed of motion: ego (red gradient), attacker (yellow gradient), and other vehicles (blue gradient).
Top

Our guidance produces diverse and realistic future scenarios from the same historical initialization, increasing the rate of physically and behaviorally valid scenes on both the nuPlan benchmark (used during the base model's training) and the Waymo benchmark (zero-shot deployment) compared to unguided baselines.

Middle

Controllability-conditioned generation synthesizes scenes where vehicles must reach predefined goal points (green flags) for one or multiple target agents, achieving precise behavioral control while maintaining natural interactions.

Bottom

Finally, we enable adversarial scene generation, adaptively producing diverse attack scenarios without explicitly specifying attacker trajectories.

Abstract

Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but the safety-critical events essential for this task are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, raising the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% in free exploration and from 11% to 80% in controllability-focused generation. Our approach also generates \(5\times\) more near-collision frames with a time-to-collision under three seconds while maintaining overall scene realism.

Method

Standard diffusion vs optimization-guided OMEGA

Standard diffusion sampling (top) vs. our optimization-guided OMEGA (bottom). OMEGA re-anchors each reverse diffusion step using constrained optimization within a KL-bounded trust region, steering sampling toward physically consistent and behaviorally coherent scenes.
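For intuition, here is a minimal, hypothetical PyTorch-style sketch of this re-anchoring step, not the released implementation: the model's predicted reverse-step mean is nudged down a differentiable scene cost (e.g., collision, off-road, or kinematic terms), projected back into a KL-bounded trust region around the unguided mean (for Gaussians sharing a covariance, the KL reduces to a scaled squared distance between means), and the next latent is then sampled from the adjusted distribution. The names cost_fn and kl_budget and the simple norm projection are illustrative assumptions.

import torch

def reanchor_step(mu_theta, sigma_t, cost_fn, kl_budget=0.1, n_iters=5, lr=0.05):
    # Hypothetical sketch of optimization-guided re-anchoring for one reverse step.
    # mu_theta:  unguided posterior mean predicted by the diffusion model
    # sigma_t:   standard deviation of the reverse kernel at this step
    # cost_fn:   differentiable scene cost (e.g., collision / off-road / kinematics)
    # kl_budget: trust-region radius; for Gaussians with shared covariance,
    #            KL(guided || unguided) = ||mu - mu_theta||^2 / (2 * sigma_t^2)
    mu = mu_theta.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        cost_fn(mu).backward()
        opt.step()
        # Project the adjusted mean back into the KL-bounded trust region.
        with torch.no_grad():
            delta = mu - mu_theta
            max_norm = sigma_t * (2.0 * kl_budget) ** 0.5
            norm = delta.norm()
            if norm > max_norm:
                mu.copy_(mu_theta + delta * (max_norm / norm))
    # Re-anchor: sample the next latent from the adjusted Gaussian.
    return mu.detach() + sigma_t * torch.randn_like(mu_theta)

The trust region is what keeps guided samples close to the base model's learned distribution, which is why the guidance can remain training-free.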

Two-phase noise scheduling

Two-phase noise scheduling. Warmup reduces noise along full trajectories to establish macro-level layout and agent-wise dynamic plausibility, minimizing drift and stabilizing long-horizon trends. Rolling-Zero then performs fine-grained, time-indexed denoising with residual-noise updates that incorporate environmental feedback, improving reactivity and yielding structurally feasible, responsive multi-agent scenes.
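To illustrate the shape of this schedule, the snippet below builds a per-frame noise table under simplified assumptions; it is a sketch, not the exact schedule used by OMEGA. Warmup decays a single noise level shared across the whole future horizon, after which Rolling-Zero drives frames to zero noise from the near end of the horizon outward while leaving residual noise on later frames. The parameters warmup_frac and sigma_mid and the linear decay are placeholders.

import numpy as np

def two_phase_noise_schedule(num_steps, horizon, warmup_frac=0.4,
                             sigma_max=1.0, sigma_mid=0.3):
    # Illustrative coarse-to-fine schedule; returns an array of shape
    # (num_steps, horizon) giving the noise level per denoising step and frame.
    n_warm = int(warmup_frac * num_steps)
    sched = np.zeros((num_steps, horizon))
    # Phase 1: Warmup -- one shared noise level across the full horizon,
    # decaying from sigma_max to sigma_mid (macro-level layout, stable trends).
    for s in range(n_warm):
        frac = s / max(n_warm - 1, 1)
        sched[s, :] = sigma_max + frac * (sigma_mid - sigma_max)
    # Phase 2: Rolling-Zero -- time-indexed denoising: near-term frames are
    # driven to zero noise first, later frames keep residual noise so they
    # can still react to environmental feedback.
    for s in range(n_warm, num_steps):
        progress = (s - n_warm + 1) / max(num_steps - n_warm, 1)
        cutoff = int(progress * horizon)   # frames treated as committed
        sched[s, :cutoff] = 0.0
        sched[s, cutoff:] = sigma_mid
    return sched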

Key Contributions✨

Optimization-guided diffusion sampling. A plug-and-play, training-free method that re-anchors each reverse step via constrained optimization, enforcing structural consistency and stable diffusion guidance.

🕐 Two-phase noise scheduling. A coarse-to-fine denoising strategy for interaction fidelity, ensuring macro-level plausibility with fine-grained reactivity, enabling context-aware and dynamically coordinated multi-agent scene evolution.

♟️ OMEGA\(_\mathrm{Adv}\). A game-theoretic formulation that models ego-attacker interactions as a distributional optimization game, enabling the adaptive generation of realistic yet challenging driving scenarios (a schematic sketch follows below).
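To make the game-theoretic idea concrete, the following is a schematic, hypothetical sketch of alternating best responses between the ego and the attacker. OMEGA formulates this game over distributions during diffusion sampling, whereas the simplified version here operates directly on trajectory tensors; the risk and realism cost functions are assumed placeholders.

import torch

def approximate_nash(traj_ego, traj_att, risk, realism, rounds=3, steps=10, lr=0.05):
    # Schematic alternating best-response loop between ego and attacker.
    # risk(ego, att): scalar danger the attacker imposes on the ego
    # realism(traj):  scalar penalty for implausible motion
    ego = traj_ego.detach().clone().requires_grad_(True)
    att = traj_att.detach().clone().requires_grad_(True)
    for _ in range(rounds):
        # Attacker best response: increase ego risk while staying plausible.
        opt_a = torch.optim.Adam([att], lr=lr)
        for _ in range(steps):
            opt_a.zero_grad()
            (-risk(ego.detach(), att) + realism(att)).backward()
            opt_a.step()
        # Ego best response: reduce risk while staying plausible.
        opt_e = torch.optim.Adam([ego], lr=lr)
        for _ in range(steps):
            opt_e.zero_grad()
            (risk(ego, att.detach()) + realism(ego)).backward()
            opt_e.step()
    return ego.detach(), att.detach()

A few such rounds approximate a Nash equilibrium, where neither agent can unilaterally improve its objective, yielding attacks that are aggressive yet still consistent with plausible driving.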

Video Results

Click any GIF to view in full resolution. (Click outside / press Esc to close.)

Free Exploration on nuPlan

Scenes 1-2: Two rollouts sampled from the same initial scene, resulting in different future trajectories. Scenes 3-4: Additional rollouts from distinct nuPlan scenarios. In the BEV view, the ego vehicle is shown in red; in the three camera-centric views, the ego vehicle appears in black. Surrounding agents are consistently rendered in blue.

Scene 1 (Run A)
Same init, different sample
nuPlan FE Scene 1 BEV
Bird’s-eye
nuPlan FE Scene 1 Axo
Axonometric
nuPlan FE Scene 1 Front
Front
nuPlan FE Scene 1 Rear
Rear
Scene 2 (Run B)
Same init, different sample
nuPlan FE Scene 2 BEV
Bird’s-eye
nuPlan FE Scene 2 Axo
Axonometric
nuPlan FE Scene 2 Front
Front
nuPlan FE Scene 2 Rear
Rear
Scene 3
Different scenario
nuPlan FE Scene 3 BEV
Bird’s-eye
nuPlan FE Scene 3 Axo
Axonometric
nuPlan FE Scene 3 Front
Front
nuPlan FE Scene 3 Rear
Rear
Scene 4
Different scenario
nuPlan FE Scene 4 BEV
Bird’s-eye
nuPlan FE Scene 4 Axo
Axonometric
nuPlan FE Scene 4 Front
Front
nuPlan FE Scene 4 Rear
Rear

Zero-Shot Free Exploration on Waymo

Scenes 1-2: Two rollouts sampled from the same initial scene, resulting in different future trajectories. Scenes 3-4: Additional rollouts from distinct Waymo scenarios. In the BEV view, the ego vehicle is shown in red; in the three camera-centric views, the ego vehicle appears in black. Surrounding agents are consistently rendered in blue.

Scene 1 (Run A)
Same init, different sample
Waymo FE Scene 1 BEV
Bird’s-eye
Waymo FE Scene 1 Axo
Axonometric
Waymo FE Scene 1 Front
Front
Waymo FE Scene 1 Rear
Rear
Scene 2 (Run B)
Same init, different sample
Waymo FE Scene 2 BEV
Bird’s-eye
Waymo FE Scene 2 Axo
Axonometric
Waymo FE Scene 2 Front
Front
Waymo FE Scene 2 Rear
Rear
Scene 3
Different scenario
Waymo FE Scene 3 BEV
Bird’s-eye
Waymo FE Scene 3 Axo
Axonometric
Waymo FE Scene 3 Front
Front
Waymo FE Scene 3 Rear
Rear
Scene 4
Different scenario
Waymo FE Scene 4 BEV
Bird’s-eye
Waymo FE Scene 4 Axo
Axonometric
Waymo FE Scene 4 Front
Front
Waymo FE Scene 4 Rear
Rear

Goal-Conditioned Controllability

Scenes 1–2: A paired rollout from the same scene, produced by specifying different goal points for the ego and selected agents. Scenes 3–4: Another paired rollout under a different scene initialization, with alternative goal specifications. In the BEV view, the ego vehicle is shown in red; in the three camera-centric views, the ego vehicle appears in black. Surrounding agents are consistently rendered in blue. Goal markers: red stars/dots indicate the ego goal, and green stars/dots indicate agent goals.

Scene 1
Same init, different goals
nuPlan GC Scene 1 BEV
Bird’s-eye
nuPlan GC Scene 1 Axo
Axonometric
nuPlan GC Scene 1 Front
Front
nuPlan GC Scene 1 Rear
Rear
Scene 2
Same init, different goals
nuPlan GC Scene 2 BEV
Bird’s-eye
nuPlan GC Scene 2 Axo
Axonometric
nuPlan GC Scene 2 Front
Front
nuPlan GC Scene 2 Rear
Rear
Scene 3
Same init, different goals
nuPlan GC Scene 3 BEV
Bird’s-eye
nuPlan GC Scene 3 Axo
Axonometric
nuPlan GC Scene 3 Front
Front
nuPlan GC Scene 3 Rear
Rear
Scene 4
Same init, different goals
nuPlan GC Scene 4 BEV
Bird’s-eye
nuPlan GC Scene 4 Axo
Axonometric
nuPlan GC Scene 4 Front
Front
nuPlan GC Scene 4 Rear
Rear

Adversarial Scene Generation

Adversarial scenes generated by adapting to different attacker relative positions, resulting in diverse interaction outcomes (e.g., Lane Intrusion, Hard Brake, Forced Merge, and Cross-Traffic Encroachment). In the BEV view, the ego vehicle is shown in red; in the three camera-centric views, the ego vehicle appears in black. Surrounding agents are rendered in blue, while the attacking vehicle is highlighted in yellow.

Scene 1
Lane Intrusion
Adv Scene 1 BEV
Bird’s-eye
Adv Scene 1 Axo
Axonometric
Adv Scene 1 Front
Front
Adv Scene 1 Rear
Rear
Scene 2
Hard Brake
Adv Scene 2 BEV
Bird’s-eye
Adv Scene 2 Axo
Axonometric
Adv Scene 2 Front
Front
Adv Scene 2 Rear
Rear
Scene 3
Forced Merge
Adv Scene 3 BEV
Bird’s-eye
Adv Scene 3 Axo
Axonometric
Adv Scene 3 Front
Front
Adv Scene 3 Rear
Rear
Scene 4
Cross-Traffic Encroachment
Adv Scene 4 BEV
Bird’s-eye
Adv Scene 4 Axo
Axonometric
Adv Scene 4 Front
Front
Adv Scene 4 Rear
Rear

Quantitative Results

Free Exploration

Free-exploration scene generation on nuPlan and zero-shot evaluation on Waymo. Oracle denotes the original logged data. We report N-Dist (JSD of nearest-agent distance, \(\times 10^{-3}\)), L-Dev (lateral deviation JSD, \(\times 10^{-3}\)), A-Dev (angular deviation JSD, \(\times 10^{-3}\)), Spd (speed JSD, \(\times 10^{-3}\)), per-agent / per-scene validity (P-Ag. / P-Sc.), and overall scene valid rate (no collision / off-road / kinematic violation). Arrows \(\downarrow\) / \(\uparrow\) indicate lower / higher is better.

Free-exploration scene generation results
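For readers curious how such distribution-level metrics are typically computed, the snippet below is a hedged sketch (the exact binning and support used in the paper may differ): histogram a per-frame statistic, e.g., speed, from generated and logged scenes on a shared support and report the Jensen-Shannon divergence, scaled by \(10^{3}\) as in the table.

import numpy as np
from scipy.spatial.distance import jensenshannon

def distribution_jsd(generated, logged, bins=50):
    # Histogram a per-frame statistic (e.g., speed) from generated and logged
    # scenes on a shared support, then report the Jensen-Shannon divergence.
    lo = min(generated.min(), logged.min())
    hi = max(generated.max(), logged.max())
    p, _ = np.histogram(generated, bins=bins, range=(lo, hi))
    q, _ = np.histogram(logged, bins=bins, range=(lo, hi))
    # scipy returns the JS distance (square root of the divergence); square it
    # and scale by 1e3 to match the table's units.
    return (jensenshannon(p, q) ** 2) * 1e3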

Goal-Conditioned Controllability

Controllability under user-specified intents. Each method is evaluated by assigning multiple user-defined goal points to the target vehicle and measuring whether it can reliably reach these goals while maintaining scene realism and validity.

Goal-conditioned controllability comparison
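One simple way to realize such goal conditioning inside the optimization-guided sampler is a goal-reaching cost. The sketch below is purely illustrative (names and tensor shapes are assumptions) and could be plugged in as the scene cost of the re-anchoring step sketched in the Method section, alongside the usual realism terms.

import torch

def goal_cost(trajs, goals, agent_ids, weight=1.0):
    # trajs:     predicted trajectories of shape (A, T, 2)
    # goals:     user-specified goal points of shape (K, 2)
    # agent_ids: indices of the K controlled agents
    final_pos = trajs[agent_ids, -1, :]
    # Penalize the squared distance between each controlled agent's final
    # position and its assigned goal point.
    return weight * ((final_pos - goals) ** 2).sum()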

Adversarial Scene Generation

Adversarial scene generation with ego–attacker interactions. We compare the ground truth (Oracle) with Nexus variants finetuned or guided for attacking behavior: Nexus-FT (finetuned on adversarial scenarios), Nexus-GC (goal-attacking conditioning), and Nexus-CTG\(_\mathrm{Adv}\) (cost-based attacking with CTG). Please refer to the supplementary material for details. We report ego risk metrics (mean time-to-collision per frame, TTC [s], and the percentage of frames with \(\mathrm{TTC} < 1, 2, 3\,\mathrm{s}\)), motion intensity (mean acceleration [\(\mathrm{m/s^2}\)] and jerk [\(\mathrm{m/s^3}\)]), and the ego non-responsible collision rate (Ego NC, \%). Right: Nexus-\(\Omega_\mathrm{Adv}\) produces more low-speed, short-TTC frames, indicating riskier yet realistic scenarios.

Adversarial scene generation results
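As a reference for how the risk statistics above can be computed, below is a hedged sketch (not necessarily the paper's exact protocol) that derives a per-frame TTC from the relative position and closing speed between the ego and the attacker, then reports the mean TTC and the fraction of frames below each threshold.

import numpy as np

def ttc_metrics(ego_pos, ego_vel, att_pos, att_vel, thresholds=(1.0, 2.0, 3.0)):
    # Per-frame time-to-collision from relative position and closing speed.
    # Inputs are arrays of shape (T, 2) in meters and m/s.
    rel_pos = att_pos - ego_pos
    rel_vel = att_vel - ego_vel
    dist = np.linalg.norm(rel_pos, axis=1)
    # Range rate < 0 means the two vehicles are approaching each other.
    range_rate = np.einsum('ij,ij->i', rel_pos, rel_vel) / np.maximum(dist, 1e-6)
    ttc = np.where(range_rate < 0, dist / np.maximum(-range_rate, 1e-6), np.inf)
    finite = ttc[np.isfinite(ttc)]
    mean_ttc = finite.mean() if finite.size else float('inf')
    fractions = {t: float((ttc < t).mean()) for t in thresholds}
    return mean_ttc, fractions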
More Results

Comparison of Guidance Strategies

Comparison of different guidance strategies

Ablation Study

Ablation study

Ego-Kinematic Distributions in Adversarial Scenarios

Joint distributions of ego speed with TTC, acceleration, jerk, and yaw rate are shown for ground truth, Nexus, and Nexus-\(\Omega_{\mathrm{Adv}}\). Marginal histograms use a logarithmic scale to highlight rare events. Nexus-\(\Omega_{\mathrm{Adv}}\) produces more short-TTC frames, shifts ego speeds toward lower and medium ranges, and broadens acceleration and jerk distributions, indicating stronger adversarial effects on ego motion.

Ego-kinematic distributions in adversarial scenarios

More Qualitative Results

Click any image to view in full resolution.
nuPlan

Free Exploration

Top: Multiple inference runs from the same initial scene yield diverse future evolutions. Bottom: Additional scenarios illustrating broad variation in free exploration.

Free exploration on nuPlan
Waymo (Zero-shot)

Free Exploration

Top: Zero-shot rollouts on Waymo produce diverse future evolutions. Bottom: Additional scenarios demonstrating strong cross-domain generalization.

Zero-shot free exploration on Waymo
nuPlan

Goal-Conditioned Generation

Top: Different goal points yield diverse but consistent goal-directed behaviors. Bottom: Additional cases showing controllable, coherent multi-agent outcomes.

Goal-conditioned scene generation
nuPlan

Adversarial Generation

Top: Different attacker identities generate distinct adversarial outcomes. Bottom: Nexus-\(\Omega_\mathrm{Adv}\) adapts to attacker positions and produces plausible attack behaviors.

Adversarial scene generation

Ecosystem

This work is part of a broader research program at OpenDriveLab focusing on large-scale simulation, scene generation, behavior modeling, and end-to-end training for autonomous driving.

Please stay tuned for more work from OpenDriveLab: R2SE, ReSim, NEXUS, MTGS, Centaur.

BibTeX

If you find the project helpful for your research, please consider citing our paper:
@article{li2025optimization,
        title={Optimization-Guided Diffusion for Interactive Scene Generation},
        author={Li, Shihao and Ye, Naisheng and Li, Tianyu and Chitta, Kashyap and An, Tuo and Su, Peng and Wang, Boyang and Liu, Haiou and Lv, Chen and Li, Hongyang},
        journal={arXiv preprint arXiv:2512.07661},
        year={2025}
      }