Hai Zhang*, Siqi Liang*, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
The University of Hong Kong
*Equal Contribution
SparseVideoNav achieves a 20-second future horizon with sub-second trajectory inference latency, a 27× speed-up over its unoptimized counterpart.
Introduces video generation models into vision-language navigation for the first time, overcoming the short-horizon limitations of LLM-based methods in challenging beyond-the-view cases.
Achieves sub-second trajectory inference with sparse 20-second future prediction, enabling a 27× speed-up for practical real-world deployment.
Achieves 2.5× the success rate of LLM-based baselines on beyond-the-view tasks, and marks the first realization of such capability in challenging night scenes.
(Top) Our overall training architecture. The current observation, historical observations, and the language instruction are fed into the video generation model (VGM) backbone to generate sparse future video latents. A DiT-based action head then predicts continuous actions conditioned on the generated sparse future and the language instruction.
(Bottom) Our four-stage training pipeline: Stage 1 adapts the T2V backbone to I2V; Stage 2 injects history into the I2V backbone; Stage 3 distills the backbone to reduce denoising steps; Stage 4 learns actions from the generated sparse future. Components not used in a given stage are shown as gray blocks.
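The inference-time data flow in the figure can be sketched as follows. This is a minimal schematic only: the function names, latent shapes, and placeholder bodies are our illustrative assumptions, not the authors' implementation.

```python
# Schematic of the SparseVideoNav inference flow described in the figure:
# observations + instruction -> VGM backbone -> sparse future latents
# -> DiT-based action head -> continuous trajectory.
# All shapes and function bodies are illustrative placeholders.

def vgm_backbone(current_obs, history_obs, instruction, num_latents=5, dim=8):
    """Stand-in for the VGM backbone: emits one latent vector per
    sparse future frame, conditioned on observations and instruction."""
    context = len(history_obs) + len(instruction)  # dummy conditioning signal
    return [[float(context + i)] * dim for i in range(num_latents)]

def dit_action_head(future_latents, instruction, horizon=20):
    """Stand-in for the DiT-based action head: maps generated sparse
    future latents to a continuous trajectory over the horizon."""
    total = sum(sum(z) for z in future_latents)
    scale = total / (len(future_latents) * len(future_latents[0]))
    # Dummy waypoints: one (x, y) pair per second of the 20 s horizon.
    return [(scale * t / horizon, 0.0) for t in range(1, horizon + 1)]

def navigate(current_obs, history_obs, instruction):
    latents = vgm_backbone(current_obs, history_obs, instruction)
    return dit_action_head(latents, instruction)

trajectory = navigate(current_obs="rgb_t",
                      history_obs=["rgb_t-2", "rgb_t-1"],
                      instruction="go to the red door")
print(len(trajectory))  # 20 waypoints spanning the 20-second horizon
```

Note that the actual action head is a diffusion transformer; the deterministic placeholder above only illustrates the conditioning interface, not the denoising process.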
The largest real-world VLN dataset to date.
We will open-source the dataset to benefit the community. The estimated release time is Q3 2026, due to data-release restrictions.
@article{zhang2026sparse,
title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
journal={arXiv preprint arXiv:2602.05827},
year={2026}
}