DeepMind Blog

RT-2: New model translates vision and language into action

  • Robotic Transformer 2 (RT-2) is a novel vision-language-action (VLA) model that learns from both web and robotics data and translates that knowledge into generalized instructions for robotic control.
  • RT-2 builds on Robotic Transformer 1 (RT-1) and incorporates chain-of-thought prompting, enabling multi-stage semantic reasoning (illustrated in the first sketch after this list).
  • The model represents robot actions as text tokens: each action is written as a string of discretized values, so actions can be processed with the same natural-language tokenizer as ordinary text (see the second sketch after this list).
  • RT-2 demonstrates increased generalization performance compared to previous models, with more than a 3x improvement across categories such as symbol understanding, reasoning, and human recognition.
  • The model retains performance on the original tasks seen in the robot data while improving on scenarios the robot has not seen before, showing the benefit of large-scale pre-training.
  • RT-2 is also able to generalize to novel objects in the real world, demonstrating its ability to combine vision, language, and action in robotic control.
  • The research shows that vision-language models can be transformed into powerful vision-language-action models for advancing robotic control.
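
To make the chain-of-thought point concrete, here is a minimal sketch of what an augmented training example might look like: the model is trained to emit an intermediate natural-language plan before the action tokens. The field names, helper function, and exact phrasing below are illustrative assumptions, not DeepMind's actual data pipeline.

```python
# A hedged illustration of chain-of-thought-augmented robot data: the model
# first produces a "Plan" step in natural language, then the discretized
# action tokens. Field names and strings here are illustrative only.

def format_cot_example(instruction: str, plan: str, action_tokens: str) -> str:
    """Render one training target: an intermediate plan followed by
    discretized action tokens (see the tokenization sketch below)."""
    return f"Instruction: {instruction}\nPlan: {plan}\nAction: {action_tokens}"


print(format_cot_example(
    instruction="pick up the object that is different from all other objects",
    plan="pick up the rxbar chocolate",
    action_tokens="1 128 91 241 5 101 127 217",
))
```

The intermediate plan lets the model resolve the semantics of the instruction ("which object is different?") in language before committing to low-level motion.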
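
And here is a minimal sketch of the action-as-text scheme, assuming each continuous action dimension is discretized into 256 integer bins and the bin indices are joined into a plain string, as described in the RT-2 paper. The helper names, value ranges, and the example action below are hypothetical, not DeepMind's API.

```python
# A hedged sketch of representing robot actions as token strings: discretize
# each continuous action dimension into 256 bins, then write the bin indices
# as space-separated integers that a natural-language tokenizer can consume.

NUM_BINS = 256  # bin count per dimension, following the RT-2 paper


def action_to_token_string(action, low=-1.0, high=1.0):
    """Map each continuous action value to one of 256 integer bins and join
    the bin indices into a plain string."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(str(bin_index))
    return " ".join(tokens)


def token_string_to_action(token_string, low=-1.0, high=1.0):
    """Invert the mapping: parse the bin indices back to continuous values,
    recovering an (approximately) executable robot action."""
    bins = [int(t) for t in token_string.split()]
    return [low + b / (NUM_BINS - 1) * (high - low) for b in bins]


# Example: a hypothetical 8-dim action
# (terminate flag, Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper extension).
action = [0.0, 0.13, -0.45, 0.02, 0.0, 0.1, -0.2, 0.9]
s = action_to_token_string(action)
print(s)                             # "127 144 70 130 127 140 102 242"
print(token_string_to_action(s))     # approximately recovers the input
```

Because actions are just strings of integers, no new output head is needed: the vision-language model's existing text decoder emits actions the same way it emits words, which is what allows web-scale pre-training to transfer directly to control.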