In DriveLM, we study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users.
Specifically, we aim to facilitate the Perception, Prediction, Planning, Behavior, and Motion
tasks, using human-written reasoning logic as the connection between them. We propose the task of GVQA (Graph Visual Question Answering), which connects QA pairs in a graph-style structure. To support this novel task, we provide DriveLM-Data.
In the DriveLM dataset, QAs are connected in a graph-style structure, with each QA pair as a node and the relationships between objects as the edges.
The central element of DriveLM is frame-wise P3 QA, where P3 stands for Perception, Prediction, and Planning. Together, these cover the complete functionality of a full-stack autonomous driving system.
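To make this concrete, here is a minimal sketch of how a single frame's P3 QA graph could be represented in Python. The layout and field names (`qa_id`, `stage`, `nodes`, `edges`, etc.) are illustrative assumptions and do not necessarily match the released DriveLM-Data schema.

```python
# A minimal sketch of a frame-wise P3 QA graph. The field names are
# illustrative and may not match the released DriveLM-Data schema exactly.
frame_qa_graph = {
    "nodes": [
        {"qa_id": "q0", "stage": "perception",
         "question": "What are the important objects in the current scene?",
         "answer": "A pedestrian crossing ahead and a vehicle in the left lane."},
        {"qa_id": "q1", "stage": "prediction",
         "question": "What is the future state of the pedestrian?",
         "answer": "The pedestrian will continue crossing the road."},
        {"qa_id": "q2", "stage": "planning",
         "question": "What is the safe action for the ego vehicle?",
         "answer": "Slow down and yield to the pedestrian."},
    ],
    # Edges encode the logical progression between QA pairs,
    # here Perception -> Prediction -> Planning.
    "edges": [("q0", "q1"), ("q1", "q2")],
}
```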
We also try to reason about future events that have not yet happened. We do this by asking many "What if"-style questions, a common way for humans to imagine the future through language.
The most exciting aspect of the dataset is that the question-answer (QA) pairs
are connected in a graph-style structure, with each QA pair as a node and the potential logical progression between them as the edges. This fits the AD domain well because AD tasks are well-defined per stage, from raw sensor input to the final control action, passing through perception, prediction, and planning.
Its key difference from prior VQA tasks for AD is the availability of logical dependencies between QAs, which can be used to guide the answering process. Below is a demo video illustrating the idea.
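As a rough illustration of how these logical dependencies could guide answering, the sketch below answers QAs in topological order, passing each parent's answer as context to its children. The `answer_fn` callback is a placeholder for any VLM call and is not part of the DriveLM codebase.

```python
from collections import defaultdict, deque

def answer_in_graph_order(nodes, edges, answer_fn):
    """Answer QAs parent-first so each question can condition on the
    answers of the QAs it logically depends on.

    `answer_fn(question, context)` stands in for an arbitrary VLM call.
    """
    children = defaultdict(list)
    indegree = {n["qa_id"]: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    by_id = {n["qa_id"]: n for n in nodes}
    queue = deque([qa_id for qa_id, deg in indegree.items() if deg == 0])
    answers = {}
    while queue:
        qa_id = queue.popleft()
        node = by_id[qa_id]
        # Collect answers of upstream QAs as reasoning context.
        context = [answers[p] for p, c in edges if c == qa_id]
        answers[qa_id] = answer_fn(node["question"], context)
        for child in children[qa_id]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return answers
```

With the toy frame graph above, the perception answer would be available as context when the prediction question is asked, and both would precede the planning question.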
The annotation process is different for DriveLM-nuScenes and DriveLM-CARLA.
For DriveLM-nuScenes, we divide the annotation process into three steps:
1️⃣ Keyframe selection. Given all frames in one clip, the annotator selects the keyframes that need annotation. The criterion is that those frames should involve changes in ego-vehicle movement status (lane changes, sudden stops, starting after a stop, etc.).
2️⃣ Key object selection. Given the keyframes, the annotator picks out key objects in the six surrounding images. The criterion is that those objects should be able to affect the action of the ego vehicle (traffic signals, pedestrians crossing the road, other vehicles moving in the direction of the ego vehicle, etc.).
3️⃣ Question and answer annotation. Given those key objects, we automatically generate questions about perception, prediction, and planning regarding single or multiple objects. More details can be found in the released data.
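For illustration, a template-based generation step along these lines might look like the sketch below. The question templates and the object-tag format are simplified assumptions, not the exact templates used to build DriveLM-nuScenes.

```python
# A rough sketch of template-based QA generation from annotated key
# objects. Templates and object fields below are illustrative only.
def generate_qas(key_objects):
    qas = []
    for obj in key_objects:
        tag = f"<{obj['id']},{obj['camera']}>"
        qas.append({
            "stage": "perception",
            "question": f"What is the moving status of object {tag}?",
        })
        qas.append({
            "stage": "prediction",
            "question": f"What is the future state of object {tag}?",
        })
    qas.append({
        "stage": "planning",
        "question": "What actions should the ego vehicle take "
                    "considering these objects?",
    })
    return qas

example_objects = [{"id": "c1", "camera": "CAM_FRONT"}]
print(generate_qas(example_objects))
```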
For DriveLM-CARLA, we employ an automated annotation approach:
We collect data using CARLA 0.9.14 in the Leaderboard 2.0 framework with a privileged rule-based expert. We set up a series of routes in urban, residential, and rural areas and execute the expert on these routes. During this process, we collect the necessary sensor data, generate relevant QAs based on privileged information about objects and the scene, and organize the logical relationships to connect this series of QAs into a graph.
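As a loose sketch of this pipeline, the code below turns privileged actor information into rule-based QAs and links them with logical edges. `get_privileged_actors`, the distance threshold, and the QA rules are hypothetical placeholders, not functions from the CARLA API or from our released tooling.

```python
# Simplified sketch: read privileged information about nearby actors,
# emit rule-based QAs, and connect them into a graph via logical edges.
def annotate_frame(get_privileged_actors):
    actors = get_privileged_actors()  # e.g. ground-truth vehicles, walkers, traffic lights
    nodes, edges = [], []
    for i, actor in enumerate(actors):
        perception_id = f"p{i}"
        nodes.append({
            "qa_id": perception_id,
            "stage": "perception",
            "question": f"Is there a {actor['type']} near the ego vehicle?",
            "answer": "Yes" if actor["distance"] < 30.0 else "No",
        })
        if actor["distance"] < 30.0:
            prediction_id = f"f{i}"
            nodes.append({
                "qa_id": prediction_id,
                "stage": "prediction",
                "question": f"Will the {actor['type']} affect the ego route?",
                "answer": "Yes" if actor["on_route"] else "No",
            })
            # Logical edge: the prediction QA depends on the perception QA.
            edges.append((perception_id, prediction_id))
    return {"nodes": nodes, "edges": edges}

print(annotate_frame(lambda: [
    {"type": "pedestrian", "distance": 12.0, "on_route": True},
]))
```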
Please consider citing our project if it helps your research.