DreamDojo: NVIDIA's Open-Source Robot World Model Trained on 44,711 Hours of Human Video
DreamDojo is NVIDIA's open-source generalist robot world model, released in February 2026. It trains a foundation video model on 44,711 hours of first-person human video (6,015 tasks, 9,869 scenes), then fine-tunes on small amounts of real robot data. The key innovation is "continuous latent actions": a spatiotemporal Transformer VAE, trained in a self-supervised manner, extracts motion information between video frames and so converts unlabeled human video into training data. After distillation, the model runs in real time at about 10 FPS.
DreamDojo is NVIDIA's open-source generalist robot world model, released in February 2026. It addresses the fundamental data scarcity problem in robotics: robots need massive training data, but robot data is expensive and slow to collect. The solution: train primarily on abundant human video, then fine-tune on small amounts of real robot data.

## Core Innovation: Continuous Latent Actions

The key technical contribution is a method for extracting useful training signal from unlabeled human video. A spatiotemporal Transformer VAE (Variational Autoencoder) learns to extract "what moved between these two frames" in a self-supervised manner. This converts raw human video into (state, action, next-state) triples, the format needed for training a world model, without any human labeling.

These "continuous latent actions" are not discrete robot commands but continuous vectors in a learned latent space that capture the essence of motion and interaction. The approach works because the physics of object manipulation is the same whether performed by a human hand or a robot gripper.

## Training Data Scale

The model was trained on 44,711 hours of first-person human video spanning 6,015 distinct tasks across 9,869 scenes. This is 15x longer, 96x more skills, and 2,000x more scenes than any prior robot world model dataset. The scale is the differentiator: previous approaches used hundreds or low thousands of hours.

## Architecture and Performance

After training, the model generates stable autoregressive video predictions for over a minute at approximately 10 FPS real-time (after distillation for efficiency). The model can predict how a scene will evolve given a proposed action, allowing robots to plan by simulating outcomes before acting.

## Open Source Release

NVIDIA released the paper, code, and model weights (2B and 7B parameter versions). The dataset details and training pipeline are documented for reproducibility.
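
To make two of the ideas above more concrete (turning consecutive frames into (state, action, next-state) triples, and planning by simulating outcomes), here is a toy sketch in Python with NumPy. It is not DreamDojo's code or API. Every name in it (`encode_latent_action`, `video_to_triples`, `predict_next`, `rollout`, `plan`, `FRAME_DIM`, `ACTION_DIM`) is hypothetical. A fixed random linear projection stands in for the learned spatiotemporal Transformer VAE, and its pseudo-inverse stands in for the video world model. Only the data flow is meant to be illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 16   # hypothetical size of a flattened frame
ACTION_DIM = 4   # hypothetical size of a continuous latent action

# Stand-in for the learned latent-action encoder (assumption: the real model
# is a spatiotemporal Transformer VAE, not a fixed linear map).
W_enc = rng.normal(size=(ACTION_DIM, FRAME_DIM)) / np.sqrt(FRAME_DIM)
# Stand-in for the world model's dynamics: maps a latent action to a frame delta.
W_dec = np.linalg.pinv(W_enc)


def encode_latent_action(frame_t, frame_next):
    """Summarize 'what moved between these two frames' as a continuous vector."""
    return W_enc @ (frame_next - frame_t)


def video_to_triples(frames):
    """Convert an unlabeled frame sequence into (state, action, next_state) triples."""
    return [
        (frames[t], encode_latent_action(frames[t], frames[t + 1]), frames[t + 1])
        for t in range(len(frames) - 1)
    ]


def predict_next(frame, action):
    """Toy world-model step: predict the next frame given a latent action."""
    return frame + W_dec @ action


def rollout(frame, actions):
    """Autoregressive rollout: each prediction is fed back in as the next input."""
    predictions = []
    for action in actions:
        frame = predict_next(frame, action)
        predictions.append(frame)
    return predictions


def plan(frame, goal, candidates):
    """Return the index of the candidate action sequence whose simulated
    final frame is closest to the goal frame."""
    costs = [np.linalg.norm(rollout(frame, seq)[-1] - goal) for seq in candidates]
    return int(np.argmin(costs))
```

In this sketch, `video_to_triples` needs no labels, which mirrors how unlabeled human video can become world-model training data, and `plan` scores candidate action sequences purely by simulating them. A real system would replace both stand-ins with learned networks operating on video frames.
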
## Significance

DreamDojo represents a shift in robotics from task-specific training to foundation model transfer, the same paradigm shift that occurred in NLP (GPT) and computer vision (CLIP/DINO). If robot world models can leverage the vast supply of human video on the internet, the data bottleneck that has limited robotics for decades may be broken.