DriveLM: Driving with Graph Visual Question Answering

Chonghao Sima*, Katrin Renz*, Kashyap Chitta, Li Chen,
Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li

* Equal contribution.   Equal co-advising.


In DriveLM, we study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users.

Specifically, we aim to facilitate the Perception, Prediction, Planning, Behavior, and Motion tasks with human-written reasoning logic as the connection between them. We propose the task of Graph Visual Question Answering (GVQA), which connects QA pairs in a graph-style structure. To support this novel task, we provide DriveLM-Data.


In the DriveLM dataset, QAs are connected in a graph-style structure, with each QA pair as a node and object relationships as the edges.

Perception, Prediction, Planning

The most central element of DriveLM is frame-wise P3 QA, where P3 stands for Perception, Prediction, and Planning. This allows us to achieve complete functionality in full-stack autonomous driving.

What if

We try to reason about future events that have not yet happened. We do this by asking many "What if"-style questions, a common way for humans to imagine the future through language.

What is GVQA?

The most exciting aspect of the dataset is that the QA pairs are connected in a graph-style structure, with each QA pair as a node and potential logical progressions as the edges. We adopt this structure in the autonomous driving (AD) domain because AD tasks are well-defined per stage, from raw sensor input to final control action through perception, prediction, and planning.

Its key difference from prior VQA tasks for AD is the availability of logical dependencies between QAs, which can be used to guide the answering process. Below is a demo video illustrating the idea.
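The graph structure described above can be sketched in a few lines of code. This is a hypothetical illustration, not the DriveLM codebase: the names `QANode` and `answer_order` are invented here, and the three example questions are placeholders. It shows how logical dependencies between QA pairs induce an answering order from perception to prediction to planning.

```python
from dataclasses import dataclass, field

# Illustrative sketch (not the DriveLM API): each node is a QA pair,
# and each edge points to the QAs that must be answered beforehand.

@dataclass
class QANode:
    qid: str
    question: str
    answer: str = ""
    parents: list = field(default_factory=list)  # qids this QA depends on

def answer_order(nodes: dict) -> list:
    """Return qids in dependency order (parents before children)."""
    order, seen = [], set()
    def visit(qid):
        if qid in seen:
            return
        seen.add(qid)
        for parent in nodes[qid].parents:
            visit(parent)
        order.append(qid)
    for qid in nodes:
        visit(qid)
    return order

# A toy three-node chain: perception -> prediction -> planning.
graph = {
    "q1": QANode("q1", "What objects are near the ego vehicle?"),
    "q2": QANode("q2", "Where will the pedestrian move next?", parents=["q1"]),
    "q3": QANode("q3", "What should the ego vehicle do?", parents=["q2"]),
}
print(answer_order(graph))  # ['q1', 'q2', 'q3']
```

Answering in this order lets a model condition each downstream answer on the upstream context, which is exactly the guidance that the logical dependencies provide.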

Features of the DriveLM-Data

  • 🛣 Completeness in functionality (covering Perception, Prediction, and Planning QA pairs).

  • 🔜 Reasoning for future events that have not yet happened.
    • Many "What If"-style questions: imagine the future by language.

  • ♻ Task-driven decomposition.
    • Decomposing one scene-level description into many frame-level trajectory & planning QA pairs.

How about the annotation process?

The annotation process is different for DriveLM-nuScenes and DriveLM-CARLA.

For DriveLM-nuScenes, we divide the annotation process into three steps:

1️⃣ Keyframe selection. Given all frames in one clip, the annotator selects the keyframes that need annotation. The criterion is that those frames should involve changes in ego-vehicle movement status (lane changes, sudden stops, start after a stop, etc.).

2️⃣ Key object selection. Given the keyframes, the annotator picks out key objects in the six surrounding images. The criterion is that those objects should be able to affect the action of the ego vehicle (traffic signals, pedestrians crossing the road, other vehicles moving in the direction of the ego vehicle, etc.).

3️⃣ Question and answer annotation. Given those key objects, we automatically generate questions regarding single or multiple objects about perception, prediction, and planning. More details can be found in our dataset.
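Step 3️⃣ can be sketched as simple template filling over the annotated key objects. This is a minimal, hypothetical illustration: the object identifiers, fields, and question templates below are invented for the example and are not the actual DriveLM templates.

```python
# Illustrative sketch of automatic question generation (assumed templates,
# not the real DriveLM ones). Each key object yields one question per
# stage: perception, prediction, and planning.

KEY_OBJECTS = [
    {"id": "<c1,CAM_FRONT>", "category": "pedestrian"},
    {"id": "<c2,CAM_FRONT_LEFT>", "category": "vehicle"},
]

TEMPLATES = {
    "perception": "What is the moving status of {id}?",
    "prediction": "What will {id} do next?",
    "planning":   "What should the ego vehicle do considering {id}?",
}

def generate_questions(objects):
    """Fill every template for every key object."""
    qas = []
    for obj in objects:
        for stage, template in TEMPLATES.items():
            qas.append({"stage": stage,
                        "object": obj["id"],
                        "question": template.format(id=obj["id"])})
    return qas

qas = generate_questions(KEY_OBJECTS)
print(len(qas))  # 2 objects x 3 stages = 6 questions
```

In the real pipeline the generated questions are then answered by annotators or from privileged information; the point here is only that key-object selection makes question generation fully mechanical.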

For DriveLM-CARLA, we employ an automated annotation approach:

We collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert. We set up a series of routes in urban, residential, and rural areas and execute the expert on these routes. During this process, we collect the necessary sensor data, generate relevant QAs based on privileged information about objects and the scene, and organize the logical relationships to connect this series of QAs into a graph.


Please consider citing our project if it helps your research.

    @article{sima2023drivelm,
      title={DriveLM: Driving with Graph Visual Question Answering},
      author={Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and Xie, Chengen and Luo, Ping and Geiger, Andreas and Li, Hongyang},
      journal={arXiv preprint arXiv:2312.14150},
      year={2023}
    }

    @misc{drivelm2023,
      title={DriveLM: Driving with Graph Visual Question Answering},
      author={DriveLM contributors},
      howpublished={\url{https://github.com/OpenDriveLab/DriveLM}},
      year={2023}
    }


The OpenDriveLab team is part of the Shanghai AI Lab and is kindly supported by the National Key R&D Program of China (2022ZD0160104) and NSFC (62206172). This work was also supported by the BMBF (Tübingen AI Center, FKZ: 01IS18039A), the DFG (SFB 1233, TP 17, project number: 276693517), and the EXC (number 2064/1, project number 390727645). We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting K. Renz and K. Chitta. Our gratitude goes to Tai Wang for the valuable feedback, Jens Beißwenger for assisting with the CARLA setup, Qingwen Bu for refining the figures, Jiajie Xu for refining DriveLM-nuScenes and cleaning the DriveLMAgent codebase, and Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Tianyu Li, Yunsong Zhou, and Zetong Yang for the fruitful discussions.