Vision-Language-Action Model (VLA)

A class of multimodal foundation models for robotics that takes camera images and a text instruction as input and directly outputs low-level robot actions.

Last updated July 2026

A vision-language-action model (VLA) is a class of multimodal foundation model that integrates perception, language understanding, and motor control in a single system. Given an image or video of a robot's surroundings together with a natural-language instruction, a VLA directly outputs the low-level actions, such as joint positions or end-effector movements, needed to carry out the task. VLAs extend the recipe that produced large language models and vision-language models into the physical world, and they are central to current research on general-purpose robots and embodied AI.

Background

The approach was defined by RT-2, introduced by Google DeepMind in 2023. RT-2 built on a vision-language model and treated robot actions as another form of text: each dimension of a continuous movement was discretised into a fixed number of bins, so that an action such as moving the gripper a set distance became a short sequence of tokens the model could predict in the same way it predicts words. This let the model reuse a transformer backbone pretrained on billions of web images and text, transferring semantic knowledge from internet-scale data into robot control. The key insight was that a robot could benefit from web knowledge it never encountered during physical training, for example recognising an unfamiliar object by name and acting on it.
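The discretisation step can be illustrated with a minimal sketch. It assumes uniform binning with 256 bins per action dimension and a normalised range of [-1, 1]; the function names and ranges are hypothetical and chosen for illustration, not taken from any released codebase.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of bins per action dimension


def discretise_action(action: Sequence[float],
                      low: Sequence[float],
                      high: Sequence[float],
                      num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer bin index (a 'token')."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)          # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)      # position in [0, 1]
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def undiscretise_action(tokens: Sequence[int],
                        low: Sequence[float],
                        high: Sequence[float],
                        num_bins: int = NUM_BINS) -> List[float]:
    """Recover an approximate continuous action from bin indices (bin centres)."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 4-dimensional action, every dimension normalised to [-1, 1].
low, high = [-1.0] * 4, [1.0] * 4
action = [0.0, -1.0, 1.0, 0.5]
tokens = discretise_action(action, low, high)
decoded = undiscretise_action(tokens, low, high)
```

The round trip loses at most half a bin width per dimension, which is the precision cost of treating actions as vocabulary items.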

Architecture and training

Most VLAs share a common structure. A vision encoder converts camera frames into visual features, a language model processes the instruction, and an action decoder produces the control signal. Training typically combines two ingredients: large-scale pretraining on internet image-text data, which supplies broad semantic grounding, and fine-tuning on robot demonstration data, which teaches the mapping from perception and instruction to action.
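The data flow through these three components can be sketched as follows. This is a toy illustration only: the class `ToyVLA` and the three stand-in functions are hypothetical placeholders for learned networks, and real systems usually let the language model attend jointly over visual and text tokens rather than simply concatenating two feature lists.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]  # toy greyscale image: rows of pixel intensities


@dataclass
class ToyVLA:
    """Wires together the three components; each field stands in for a network."""
    vision_encoder: Callable[[Image], List[float]]
    language_encoder: Callable[[str], List[float]]
    action_decoder: Callable[[List[float]], List[float]]

    def act(self, image: Image, instruction: str) -> List[float]:
        visual_features = self.vision_encoder(image)
        text_features = self.language_encoder(instruction)
        # Fuse both modalities, then decode a control signal.
        return self.action_decoder(visual_features + text_features)


def mean_pixel_encoder(image: Image) -> List[float]:
    """Stand-in vision encoder: a single feature, the mean pixel intensity."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def word_count_encoder(instruction: str) -> List[float]:
    """Stand-in language encoder: a single feature, the number of words."""
    return [float(len(instruction.split()))]


def linear_decoder(features: List[float], action_dim: int = 7) -> List[float]:
    """Stand-in action decoder: a fixed linear map to a 7-dimensional action."""
    total = sum(features)
    return [0.01 * total * (i + 1) for i in range(action_dim)]


policy = ToyVLA(mean_pixel_encoder, word_count_encoder, linear_decoder)
action = policy.act([[0.0, 1.0], [1.0, 0.0]], "pick up the red block")
```

In training, the first two components would typically start from internet-pretrained weights, while the mapping to actions is learned from robot demonstrations.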

Robot demonstration datasets are a limiting resource. The Open X-Embodiment collection, which aggregates trajectories from many robot types, has been important in allowing a single model to generalise across different robot bodies. OpenVLA, released in 2024 as a fully open-source and commercially usable model, was trained on this data across many robot embodiments and was reported to outperform the much larger, closed RT-2-X model on its evaluation tasks. Later systems moved beyond discrete action tokens: the pi0 family from Physical Intelligence uses a flow-matching approach to generate smooth, continuous actions, improving dexterity on fine manipulation tasks. By 2026, systems such as pi0.6 and Gemini Robotics represented the state of the art, the former emphasising practical dexterity and self-improvement and the latter emphasising reasoning and generality.
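To make the flow-matching idea concrete, the sketch below samples an action by starting from Gaussian noise and integrating a velocity field with Euler steps. It is a minimal illustration under stated assumptions, not the pi0 implementation: the velocity field here is a hand-written closed form that pulls noise towards a fixed hypothetical target action, whereas a real model would use a learned network conditioned on the image and instruction. The names `sample_action` and `make_toy_velocity` are invented for this example.

```python
import random
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def sample_action(velocity: Velocity,
                  action_dim: int,
                  num_steps: int = 10,
                  seed: int = 0) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (action)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]         # one Euler step
    return x


def make_toy_velocity(target: List[float]) -> Velocity:
    """Closed-form velocity for a straight-line path ending at a fixed target.

    Stand-in for a learned network; valid for t < 1, which the sampler
    guarantees because its last step starts at t = 1 - dt.
    """
    def velocity(x: List[float], t: float) -> List[float]:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target = [0.1, -0.2, 0.3]  # hypothetical continuous action
action = sample_action(make_toy_velocity(target), action_dim=len(target))
```

Because the output is a real-valued vector rather than a bin index, there is no quantisation error of the kind that discrete action tokens introduce.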

The table below contrasts several representative models.

| Model | Developer | Notable trait |
| --- | --- | --- |
| RT-2 | Google DeepMind | Defined the token-based VLA approach |
| OpenVLA | Academic collaboration | First fully open-source VLA |
| pi0 | Physical Intelligence | Continuous action via flow matching |
| Gemini Robotics | Google DeepMind | Strong reasoning and generality |

Applications and limitations

VLAs are studied for warehouse picking, household assistance, manufacturing, and humanoid robotics, where a single model that follows natural-language instructions across many tasks is expected to scale better than hand-coded, task-specific controllers. Significant challenges remain: robot data is far scarcer than web data, real-time control demands low inference latency, and safety guarantees for physical action are harder to provide than for text output. Generalisation to genuinely novel environments and long-horizon tasks is an active research frontier.

References

  1. Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind.
  2. Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv.
  3. Physical Intelligence. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. physicalintelligence.company.
  4. Wikipedia contributors. (2026). Vision-language-action model. Wikipedia.