Transforming robotics with video-driven learning
1X's NEO humanoid integrates 1XWM, a video-pretrained world model that reshapes how the robot learns and predicts actions. Where traditional vision-language-action (VLA) models focus primarily on visual and semantic understanding, 1XWM derives robot actions from text-conditioned video generation. This lets the robot generalize to new objects, motions, and tasks without extensive pre-training on large-scale robot data or teleoperated demonstrations.

At inference time, the system uses a two-stage grounding process in which a text prompt and a starting frame guide the robot's actions in real-world scenarios. The 1XWM backbone is a 14B-parameter generative video model, which translates knowledge learned from video into executable robot actions and so bridges the gap between human-like motion and robotic execution. Hardware designed for high-fidelity transfer further helps the robot mimic human interaction dynamics, keeping the model's learned priors relevant and effective on the physical system.
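To make the two-stage pipeline concrete, here is a minimal Python sketch of the data flow: a text prompt and a starting frame condition a video rollout, and the rollout is then grounded into a sequence of actions. All names, shapes, and the inverse-dynamics-style second stage are illustrative assumptions for this sketch, not the actual 1XWM architecture or API.

```python
import numpy as np

FRAME_SHAPE = (64, 64, 3)   # toy resolution for illustration only
ACTION_DIM = 7              # assumed action size (e.g., 6-DoF pose + gripper)

def generate_video(prompt: str, start_frame: np.ndarray, horizon: int) -> np.ndarray:
    """Stage 1 (stub): text-conditioned video generation.

    Stands in for the 14B generative video model: given a task prompt and a
    starting frame, it would roll out a plausible future video of the task.
    Here we return a dummy rollout seeded by the prompt length.
    """
    rng = np.random.default_rng(len(prompt))
    drift = rng.normal(scale=0.01, size=(horizon,) + start_frame.shape)
    return np.clip(start_frame + np.cumsum(drift, axis=0), 0.0, 1.0)

def actions_from_video(video: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): ground the generated video into robot actions.

    One common mechanism for this step is an inverse-dynamics model mapping
    consecutive frame pairs to the action connecting them; whether 1XWM uses
    exactly this mechanism is an assumption of this sketch.
    """
    diffs = video[1:] - video[:-1]
    # Collapse each frame-to-frame change into a fixed-size action vector.
    return diffs.reshape(len(diffs), -1)[:, :ACTION_DIM]

def plan(prompt: str, start_frame: np.ndarray, horizon: int = 8) -> np.ndarray:
    """Full pipeline: prompt + starting frame -> video -> action sequence."""
    video = generate_video(prompt, start_frame, horizon)
    return actions_from_video(video)

actions = plan("pick up the mug", np.zeros(FRAME_SHAPE), horizon=8)
print(actions.shape)  # (7, 7): one action vector per frame transition
```

The point of the sketch is the shape of the interface: no robot action data enters stage 1, so the generalization burden sits on the video model, while stage 2 only has to translate generated frames into the robot's action space.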