Navigation World Models

Abstract

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.

Following Trajectories in Known Environments

Qualitative examples of NWM synthesizing videos from a single input frame: given that frame, the model autoregressively predicts the subsequent frames conditioned on the input actions. The environments were seen during training, but the trajectories are novel.
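The autoregressive rollout described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `rollout`, `toy_predict_next`, and `context_size` are hypothetical names, and frames are stood in for by 2D positions rather than images. It is not the NWM implementation, only the shape of the loop in which each predicted frame is fed back as context for the next prediction.

```python
from typing import Callable, List, Sequence, Tuple

# Toy stand-ins (assumptions): a "frame" is a 2D position and an "action"
# is a 2D displacement. In NWM these would be images and navigation actions.
Frame = Tuple[float, float]
Action = Tuple[float, float]


def rollout(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
    context_size: int = 4,  # hypothetical value, not taken from the source
) -> List[Frame]:
    """Autoregressively extend a trajectory from a single starting frame."""
    frames = [first_frame]
    for action in actions:
        # Condition on the most recent frames, including predicted ones.
        context = frames[-context_size:]
        frames.append(predict_next(context, action))
    return frames


def toy_predict_next(context: Sequence[Frame], action: Action) -> Frame:
    """Placeholder for the learned model: shift the last frame by the action."""
    x, y = context[-1]
    dx, dy = action
    return (x + dx, y + dy)


trajectory = rollout(toy_predict_next, (0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)])
print(trajectory)
```

Because every step consumes earlier predictions, errors can accumulate over long rollouts, which is one reason generated videos may degrade as the horizon grows.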

Navigating Unknown Environments

NWM is conditioned on a single input image that was first seen at test time. The model then autoregressively predicts the subsequent frames given input actions.

Baselines & Ablations

Qualitative comparisons of different models, showcasing video generation examples.

Planning with NWM + NoMaD

We include planning examples using NWM with an external navigation policy (NoMaD). Trajectories from NoMaD are ranked by NWM, which generates trajectory videos and selects the highest-scoring one.
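The ranking step can be sketched roughly as follows. This is a hedged illustration, not the authors' code: `select_best`, `toy_simulate`, and `toy_score` are hypothetical names, the candidate action sequences are hard-coded rather than sampled from NoMaD, and the score is a simple negative distance to a goal position, standing in for whatever goal-similarity measure is applied to the generated videos.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # toy stand-in for a generated frame
Action = Tuple[float, float]  # toy stand-in for a navigation action


def select_best(
    candidates: Sequence[Sequence[Action]],
    simulate: Callable[[Sequence[Action]], List[State]],
    score: Callable[[List[State]], float],
) -> Tuple[int, List[float]]:
    """Simulate every candidate, score it, and return the best index and all scores."""
    scores = [score(simulate(actions)) for actions in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


GOAL: State = (3.0, 0.0)  # hypothetical goal used only in this toy example


def toy_simulate(actions: Sequence[Action]) -> List[State]:
    """Placeholder for the world model: integrate displacements from the origin."""
    states = [(0.0, 0.0)]
    for dx, dy in actions:
        x, y = states[-1]
        states.append((x + dx, y + dy))
    return states


def toy_score(states: List[State]) -> float:
    """Higher is better: negative distance between the final state and the goal."""
    return -math.dist(states[-1], GOAL)


candidates = [
    [(1.0, 0.0), (1.0, 0.0)],  # ends at (2, 0)
    [(1.0, 0.0), (2.0, 0.0)],  # ends at (3, 0), exactly at the goal
    [(0.0, 1.0), (0.0, 1.0)],  # ends at (0, 2)
]
best_index, scores = select_best(candidates, toy_simulate, toy_score)
print(best_index, scores)
```

Keeping `simulate` and `score` as injected callables mirrors the division of labor in the text: the external policy proposes trajectories, while the world model only evaluates them.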

Limitations

One common failure, especially in unknown environments, is mode collapse, where the generated frames gradually become more similar to the training data.

BibTeX

@misc{bar2024navigationworldmodels,
    title={Navigation World Models}, 
    author={Amir Bar and Gaoyue Zhou and Danny Tran and Trevor Darrell and Yann LeCun},
    year={2024},
    eprint={2412.03572},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2412.03572},
}