Workshop


Autonomous systems, such as robots and self-driving cars, have rapidly evolved over the past decades. Recently, foundation models have emerged as a promising approach to building more generalist autonomous systems due to their ability to learn from vast amounts of data and generalize to new tasks. The motivation behind this workshop is to explore the potential of foundation models for autonomous agents and discuss the challenges and opportunities associated with this approach.

Contact


Contact us via [email protected]
Join discussions in our WeChat group

Autonomous Grand Challenge

The field of autonomy is rapidly evolving, and recent advancements from the machine learning community, such as large language models (LLMs) and world models, bring great potential. We believe the future lies in explainable, end-to-end models that understand the world and generalize to unseen environments. In light of this, we propose seven new challenges that push the boundary of existing perception, prediction, and planning pipelines.

Check out the Challenge website for more details.






Schedule


Time zone:

Hongyang Li: Opening Remarks
Sergey Levine (UC Berkeley, USA): Robotic Foundation Models
Sherry Yang (Google DeepMind, USA): Foundation Models as Real-World Simulators
Coffee Break ☕️
Alex Kendall (Wayve, UK): Building Embodied AI to be Safe and Scalable
Autonomous Grand Challenge Part Ⅰ
  • End-to-End Driving at Scale
    • Introduction (5 min, Igor Gilitschenski)
    • Innovation Award & Outstanding Champion (10 min)
  • Predictive World Model
    • Introduction (5 min, Zetong Yang)
    • Innovation Award & Honorable Runner-up (10 min)
  • Multi-View 3D Visual Grounding
    • Introduction (5 min, Tai Wang)
    • Innovation Award & Outstanding Champion (10 min)
  • Occupancy and Flow
    • Introduction (5 min, Jiazhi Yang)
    • Outstanding Champion (10 min)
Lunch Break: Poster Session of the Autonomous Grand Challenge
Rares Ambrus (Toyota Research Institute, USA): Visual Foundation Models for Embodied Applications
Autonomous Grand Challenge Part Ⅱ
  • Mapless Driving
    • Introduction (5 min, Huijie Wang)
    • Innovation Award & Outstanding Champion (10 min)
  • Driving with Language
    • Introduction (5 min, Chonghao Sima)
    • 1st Place (5 min)
    • 2nd Place (5 min)
  • CARLA Autonomous Driving Challenge
    • Introduction (5 min, Matt Rowe)
    • Innovation Award & Outstanding Champion (10 min)
Andrei Bursuc (Valeo, France): Foundation Models in the Automotive Industry
Coffee Break ☕️
Ted Xiao (Google DeepMind, USA): What's Missing for Robotics-first Foundation Models?
Li Chen (Shanghai AI Lab, China): Visual World Models as Foundation Models for Autonomous Systems
Panel: Challenges in Building Foundation Models for Embodied AI
  Host: Anthony Hu
  Panelists: Andrei Bursuc, Alex Kendall, Hongyang Li, Christos Sakaridis, Ted Xiao

Brief takeaways on the afternoon session by Christos Sakaridis


Visual Foundation Models for Embodied Applications
- Geometric representations and models that are metric, domain-generic, and uncertainty-aware are key to embodied AI.
- Motion is a central cue for unsupervised learning of semantic and geometric features and concepts.
Foundation Models in the Automotive Industry
- Two stages: large-scale self-supervised pre-training, followed by versatile supervised adaptation/fine-tuning to different tasks and input-output configurations.
- Modular pre-training is preferable for addressing diverse setups.
- Driving scenes usually do not have the diversity of internet-scale data: data richness is very important for driving-related pre-training.
What's Missing for Robotics-first Foundation Models?
- Narrow communication bottlenecks between different intelligence spaces.
- Missing properties from robotics foundation models: positive transfer from scaling, promptability, scalable evaluation.
- Injecting physics, kinematics, and trajectory information into our robotic foundation models can upgrade language-heavy models.
Visual World Models as Foundation Models for Autonomous Systems
- LiDARs capture accurate geometric cues while cameras capture rich semantic cues: combine them for pre-training representations for automated driving.
- BEV models are useful for building world models.






Speakers



Sergey Levine

Associate Professor
UC Berkeley, USA

Alex Kendall

Co-Founder & CEO
Wayve, UK

Andrei Bursuc

Deputy Scientific Director
Valeo, France

Rares Ambrus

Head of Computer Vision
Toyota Research Institute, USA

Ted Xiao

Senior Research Scientist
Google DeepMind, USA

Sherry Yang

Senior Research Scientist
Google DeepMind, USA

Li Chen

Research Scientist
Shanghai AI Lab, China




Organizers



Hongyang Li

Shanghai AI Lab

Kashyap Chitta

University of Tübingen

Andreas Geiger

University of Tübingen

Holger Caesar

TU Delft

Christos Sakaridis

ETH Zürich

Anthony Hu

Wayve

Fatma Güney

Koç University

German Ros

NVIDIA

Hang Qiu

UC Riverside

Dian Chen

Waabi

Huijie Wang

Shanghai AI Lab

Jiajie Xu

CMU

For the list of challenge organizers, please refer to the Challenge website.