Vision-Language-Action Model (VLA)

A class of multimodal foundation models for robotics that takes camera images and a text instruction as input and directly outputs low-level robot actions.

5 min read | Last updated July 2026 | Models

A vision-language-action model (VLA) is a class of multimodal foundation model that integrates perception, language understanding, and motor control in a single system. Given an image or video of a robot's surroundings together with a natural-language instruction, a VLA directly outputs the low-level actions, such as joint positions or end-effector movements, needed to carry out the task. VLAs extend the recipe that produced large language models and vision-language models into the physical world, and they are central to current research on general-purpose robots and embodied AI.

Background

The approach was defined by RT-2, introduced by Google DeepMind in 2023. RT-2 built on a vision-language model and treated robot actions as another form of text: continuous movements were discretised into tokens, so that an action such as moving the gripper a set distance became a token the model could predict in the same way it predicts words. This let the model reuse a transformer architecture trained on billions of web images and text, transferring semantic knowledge from internet-scale data into robot control. The key insight was that a robot could benefit from web knowledge it never encountered during physical training, for example recognising an unfamiliar object by name and acting on it.
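The tokenisation step described above can be sketched in a few lines of Python. This is a minimal illustration, not RT-2's actual code: the function names, the action ranges, and the choice of 256 uniform bins per dimension are assumptions made for the example.

```python
from typing import List, Sequence

NUM_BINS = 256  # assumed number of uniform bins per action dimension


def tokenize_action(action: Sequence[float], low: Sequence[float],
                    high: Sequence[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, num_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)          # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)      # normalise to [0, 1]
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens: Sequence[int], low: Sequence[float],
                      high: Sequence[float], num_bins: int = NUM_BINS) -> List[float]:
    """Map tokens back to the centre of their bins, giving an approximate action."""
    return [lo + (t + 0.5) / num_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 3-dimensional action with every dimension ranging over [-1, 1].
LOW, HIGH = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
tokens = tokenize_action([-1.0, 0.0, 1.0], LOW, HIGH)
recovered = detokenize_action(tokens, LOW, HIGH)
print(tokens)  # [0, 128, 255]
```

Because each token stands for a whole bin, the recovered action differs from the original by at most half a bin width. That quantisation error is one motivation for the continuous-action methods used by later systems.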

Architecture and training

Most VLAs share a common structure. A vision encoder converts camera frames into visual features, a language model processes the instruction, and an action decoder produces the control signal. Training typically combines two ingredients: large-scale pretraining on internet image-text data, which supplies broad semantic grounding, and fine-tuning on robot demonstration data, which teaches the mapping from perception and instruction to action.
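The three-part structure can be illustrated with a toy pipeline. Everything in this sketch is a stand-in chosen for readability: the class name `ToyVLAPolicy`, the hand-written "encoders", and the two-number action are hypothetical and bear no relation to the learned networks in a real VLA. Only the flow of data from image and instruction to action is the point.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]  # grayscale frame: rows of pixel values in [0, 1]
Features = List[float]


def toy_vision_encoder(image: Image) -> Features:
    """Stand-in vision encoder: mean brightness of the left and right image halves."""
    half = len(image[0]) // 2
    left = [p for row in image for p in row[:half]]
    right = [p for row in image for p in row[half:]]
    return [sum(left) / len(left), sum(right) / len(right)]


def toy_language_encoder(instruction: str) -> Features:
    """Stand-in language model: indicator features for the words 'left' and 'right'."""
    words = instruction.lower().split()
    return [float("left" in words), float("right" in words)]


def toy_action_decoder(visual: Features, text: Features) -> List[float]:
    """Stand-in action decoder: move toward the named side, scaled by its brightness."""
    dx = text[1] * visual[1] - text[0] * visual[0]
    gripper = 1.0 if any(text) else 0.0
    return [dx, gripper]


@dataclass
class ToyVLAPolicy:
    vision_encoder: Callable[[Image], Features]
    language_encoder: Callable[[str], Features]
    action_decoder: Callable[[Features, Features], List[float]]

    def act(self, image: Image, instruction: str) -> List[float]:
        """One control step: encode both inputs, then decode an action."""
        visual = self.vision_encoder(image)
        text = self.language_encoder(instruction)
        return self.action_decoder(visual, text)


policy = ToyVLAPolicy(toy_vision_encoder, toy_language_encoder, toy_action_decoder)
frame = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 0.5, 0.5]]
action = policy.act(frame, "move right")
print(action)  # [0.75, 1.0]
```

In a real system each of the three callables would be a trained neural network, and in many designs the language model and action decoder share weights rather than being separate components.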

Robot demonstration data is a limiting resource. The Open X-Embodiment collection, which aggregates trajectories from many robot types, has been important in allowing a single model to generalise across different robot bodies. OpenVLA, released in June 2024 as a fully open-source and commercially usable model, was trained on this data across many robot embodiments; its authors reported that the 7-billion-parameter model matched or exceeded the much larger, closed RT-2-X in their evaluations. Later systems moved beyond discrete action tokens: the pi0 family from Physical Intelligence uses a flow-matching approach to generate smooth, continuous actions, improving dexterity on fine manipulation tasks. By 2026, systems such as pi0.6 and Gemini Robotics represented the state of the art, the former emphasising practical dexterity and self-improvement and the latter emphasising reasoning and generality.
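The sampling side of flow matching can be sketched as follows: start from random noise and repeatedly step along a velocity field until a continuous action is reached. This is a toy under stated assumptions, not the pi0 implementation. In particular, `oracle_velocity` is a hypothetical stand-in that already knows the target action, whereas a real model predicts the velocity with a neural network conditioned on the camera images and the instruction.

```python
import random
from typing import List, Sequence


def oracle_velocity(x: Sequence[float], t: float, target: Sequence[float]) -> List[float]:
    """Velocity that carries x to the target along a straight line by time t = 1.

    Hypothetical stand-in for a learned velocity network.
    """
    remaining = 1.0 - t
    return [(goal - xi) / remaining for xi, goal in zip(x, target)]


def sample_action(target: Sequence[float], num_steps: int = 10,
                  seed: int = 0) -> List[List[float]]:
    """Integrate the velocity field from noise (t = 0) to an action (t = 1) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from Gaussian noise
    trajectory = [list(x)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps                    # t stays below 1, so no division by zero
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return trajectory


# Hypothetical 3-dimensional end-effector displacement used as the target action.
target_action = [0.3, -0.2, 0.05]
trajectory = sample_action(target_action)
final_action = trajectory[-1]
```

The output is a real-valued vector rather than a bin index, so no quantisation error is introduced. The cost is several evaluations of the velocity model per action instead of one forward pass per token.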

The table below contrasts several representative models.

| Model | Developer | Notable trait |
| --- | --- | --- |
| RT-2 | Google DeepMind | Defined the token-based VLA approach |
| OpenVLA | Academic collaboration | First fully open-source VLA |
| pi0 | Physical Intelligence | Continuous action via flow matching |
| Gemini Robotics | Google DeepMind | Strong reasoning and generality |

Applications and limitations

VLAs are studied for warehouse picking, household assistance, manufacturing, and humanoid robotics, where a single model that follows natural-language instructions across many tasks scales better than hand-coded controllers written task by task. Significant challenges remain: robot data is far scarcer than web data, real-time control demands low inference latency, and safety guarantees for physical action are harder to provide than for text output. Generalisation to genuinely novel environments and long-horizon tasks is an active research frontier.

Malaysian Context — Robotics and Industry 4.0

Malaysia's manufacturing sector, which contributes a substantial share of gross domestic product, is a natural setting for VLA-driven automation. The Industry4WRD national policy on Industry 4.0 is led by the Ministry of Investment, Trade and Industry (MITI). Under this policy, electronics and semiconductor manufacturers in Penang and the Klang Valley, including facilities operated by multinationals and local contract manufacturers, are investing in flexible robotics that can be reconfigured for new products without extensive reprogramming. That flexibility is precisely the promise of instruction-following VLAs.

Government research bodies such as MIMOS and universities including Universiti Teknologi Malaysia (UTM), Universiti Sains Malaysia (USM), and Universiti Teknologi PETRONAS conduct robotics and automation research relevant to VLA adoption. PETRONAS has explored robotics for inspection and maintenance in oil and gas environments, a domain where language-instructable robots could reduce human exposure to hazardous conditions.

Adoption faces the same data constraints seen globally, compounded by the cost of collecting robot demonstrations for local industrial tasks. Skills development is addressed partly through HRD Corp training levies and technical and vocational education and training (TVET) programmes, which are being oriented toward automation and robotics competencies to prepare the Malaysian workforce for increasingly autonomous factory floors.

Palm oil and agriculture, pillars of the Malaysian economy, present longer-term opportunities for embodied AI in tasks such as fruit harvesting and field inspection, though outdoor manipulation remains technically demanding for current VLA systems.

References

Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Google DeepMind.
Kim, M. J., et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv.
Physical Intelligence. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. physicalintelligence.company.
Wikipedia contributors. (2026). Vision-language-action model. Wikipedia.

Tags: robotics, multimodal, foundation model, embodied AI

Type	Multimodal robotics foundation model
Inputs	Images/video and text instructions
Output	Low-level robot actions
Notable models	RT-2, OpenVLA, pi0
Origin	RT-2 (Google DeepMind, 2023)
Related	Multimodal AI, Physical AI, Foundation model