AI Video Generation

AI video generation refers to the automated creation of video content from text prompts, images, or other inputs using generative neural networks, enabling synthetic video production without cameras or traditional animation.

6 min read | Last updated June 2026 | Applications

AI video generation is the automated synthesis of video content using neural networks trained on large corpora of video and associated text. Given a natural language description, a reference image, or an existing video clip, AI video generation systems produce temporally coherent moving images that realise the described scene. The field represents an extension of image generation into the time dimension, requiring models to maintain spatial consistency within frames while also producing plausible motion across frames.

The rapid maturation of text-to-image models between 2021 and 2023 established the foundational architectures (diffusion models, transformer-based attention, and large-scale contrastive text-image training) that were subsequently applied to video. OpenAI's unveiling of Sora in February 2024 marked a significant public milestone, demonstrating that AI systems could produce video clips of up to one minute with coherent physics, lighting, and motion that were difficult to distinguish from real footage at a casual glance.

Technical Foundations

Spatial and Temporal Modelling

Video can be understood as a sequence of image frames sampled at regular intervals (typically 24 to 60 frames per second). Early AI video approaches generated each frame largely independently using image generation techniques, producing flickering or temporally inconsistent results. Modern video generation architectures address temporal consistency by treating the full video as a three-dimensional spatiotemporal volume (height, width, and time) and applying attention mechanisms that span both space and time.

Spatiotemporal attention allows a model to relate each pixel or patch not only to nearby pixels in the same frame but also to corresponding regions in adjacent frames. This enables the model to propagate visual features — the colour of a shirt, the position of a moving object — coherently across time.
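The difference between per-frame and spatiotemporal attention can be illustrated with a minimal sketch. The code below is a toy, single-head dot-product self-attention in plain Python, not the attention layer of any named system; the function names and the `spatiotemporal` flag are hypothetical and exist only for this illustration. When the flag is off, each patch token attends only to tokens in its own frame; when it is on, it attends to every token in the clip.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens, frame_ids, spatiotemporal):
    """Toy single-head self-attention over a flat list of patch tokens.

    tokens:         list of feature vectors, one per patch
    frame_ids:      frame index of each token
    spatiotemporal: if False, a token only sees tokens from its own frame
                    (hypothetical flag, for illustration only)
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        # Indices of the tokens this query is allowed to attend to.
        visible = [
            j for j in range(len(tokens))
            if spatiotemporal or frame_ids[j] == frame_ids[i]
        ]
        scores = [
            sum(q * k for q, k in zip(query, tokens[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible tokens.
        outputs.append([
            sum(w * tokens[j][d] for w, j in zip(weights, visible))
            for d in range(dim)
        ])
    return outputs


# Two frames with two patches each; the feature changes between frames.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
frame_ids = [0, 0, 1, 1]
per_frame = attend(tokens, frame_ids, spatiotemporal=False)
full_clip = attend(tokens, frame_ids, spatiotemporal=True)
```

In the per-frame case the first token's output contains nothing from the second frame, while in the full-clip case it mixes in features from both frames, which is the mechanism that lets a model carry appearance and position across time.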

Diffusion Transformers

The dominant architecture for state-of-the-art video generation as of 2025-2026 is the Diffusion Transformer (DiT), which combines the denoising diffusion probabilistic model (DDPM) framework with the transformer architecture by using a transformer, rather than the convolutional U-Net of earlier diffusion models, as the denoising network. Diffusion models generate content by learning to reverse a noise-corruption process: starting from Gaussian noise, the model iteratively removes noise, guided at each step by a text or image conditioning input.
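The iterative denoising loop can be sketched in a few lines. The example below is a toy under stated assumptions: it works on a short list of numbers rather than video, it uses a deterministic DDIM-style update in place of DDPM ancestral sampling for simplicity, and the trained, prompt-conditioned network is replaced by a hypothetical `oracle_noise_predictor` that already knows the clean target. The linear beta schedule values are common defaults, assumed here rather than taken from any specific system.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule.

    Index 0 is 1.0 (no noise); higher indices are progressively noisier.
    """
    alpha_bars = [1.0]
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def oracle_noise_predictor(target):
    """Hypothetical stand-in for a trained denoiser conditioned on a prompt.

    It returns the noise that would explain x_t if the clean signal were
    `target`. A real model has to learn this mapping from data.
    """
    def predict(x_t, alpha_bar):
        return [
            (x - math.sqrt(alpha_bar) * c) / math.sqrt(1.0 - alpha_bar)
            for x, c in zip(x_t, target)
        ]
    return predict


def sample(predict_noise, alpha_bars, dim, rng):
    """Start from Gaussian noise and denoise step by step (DDIM-style)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(alpha_bars) - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = predict_noise(x, ab_t)
        # Estimate the clean signal implied by the predicted noise.
        x0_hat = [
            (xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
            for xi, e in zip(x, eps)
        ]
        # Re-noise the estimate to the (lower) noise level of step t - 1.
        x = [
            math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e
            for c, e in zip(x0_hat, eps)
        ]
    return x


target = [0.5, -1.0, 2.0]
alpha_bars = alpha_bar_schedule(50)
result = sample(oracle_noise_predictor(target), alpha_bars, len(target), random.Random(0))
```

Because the oracle is perfect, the loop lands on the target; with a learned network the same loop produces a sample from the distribution the network was trained on, steered by the conditioning input.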

OpenAI's Sora applied DiT to video by representing video as a sequence of spatiotemporal patches — analogous to the tokens used in language models — and training a transformer to denoise these patches. This approach scales well with compute and data, explaining the dramatic quality improvements observed with larger training runs.
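As a rough illustration of what "spatiotemporal patches" means, the sketch below cuts a tiny video, stored as nested lists indexed by time, row, and column, into non-overlapping blocks and flattens each block into one token. The function name `patchify` and the patch sizes are assumptions made for this example; the text above does not specify Sora's actual patch dimensions, and production systems typically patchify a compressed latent representation rather than raw pixels.

```python
def patchify(video, pt, ph, pw):
    """Split a video of shape (T, H, W) into flattened spatiotemporal patches.

    video: nested lists indexed as video[t][y][x]
    pt, ph, pw: patch extent in time, height, and width (illustrative values)
    Returns a list of tokens, each a flat list of pt * ph * pw values.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Gather every value inside this time-by-height-by-width block.
                tokens.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return tokens


# A 4-frame, 4x4 "video" whose values encode their own position.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = patchify(video, pt=2, ph=2, pw=2)
```

The token count is (T / pt) x (H / ph) x (W / pw), so longer or higher-resolution clips yield proportionally longer sequences for the transformer to process.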

Physics and Consistency

One of the key challenges in video generation is physical plausibility. Learned world models within video generators must respect gravity, fluid dynamics, object permanence, and lighting consistency. Early models frequently produced objects that changed shape, passed through each other, or reversed direction without cause. Modern systems such as Kling 3.0 and Veo 3.1 include explicit physical simulation priors and long-range temporal attention that substantially reduce these artefacts.

Major Systems

Sora (OpenAI, 2024) was the first publicly demonstrated model capable of generating realistic video up to one minute long from text prompts. It uses a DiT architecture and was trained on a large proprietary video dataset. OpenAI announced in March 2026 that the Sora consumer product would be discontinued in April 2026, with the API being deprecated in September 2026, as the company consolidated its video capabilities into other products.

Veo (Google DeepMind, 2024-2025) is Google's flagship video generation model. Veo 3 (2025) introduced native audio generation alongside video, allowing a single model to synthesise synchronised dialogue, ambient sound, and music, and Veo 3.1 (2025) builds on that capability. It is available to enterprise customers through Google's Vertex AI platform.

Kling (Kuaishou, 2024-2025) is a Chinese video generation model from the social video platform Kuaishou. Kling 3.0 (2025) introduced native multilingual lip-sync, supporting up to five languages, and continuous clip generation up to two minutes in length. It has become popular in creative and commercial production in Asia.

Runway Gen-4 (Runway ML, 2025) focuses on professional creative workflows. Its reference image controls allow users to maintain character consistency across shots, making it well-suited for marketing and narrative production.

Pika (Pika Labs, 2024-2025) specialises in short-form and social media video, with particular strength in image-to-video animation and lip-sync for talking-head content.

Applications

AI video generation is transforming production workflows across advertising, entertainment, education, and journalism. Advertising agencies use it to produce localised video variants — changing the setting, language, or characters in an ad without re-shooting — at a fraction of traditional production cost. Independent filmmakers and game studios use it for storyboarding, pre-visualisation, and concept exploration.

Educational content creators produce explainer videos from written scripts without camera equipment. News organisations have piloted AI video for data journalism visualisations and historical reconstruction. In e-commerce, product demonstration videos are increasingly generated from product images and specifications.

The technology also raises significant concerns about deepfakes and the use of synthetic media for disinformation. Detection tools and provenance standards are being developed to address these risks, notably the C2PA (Coalition for Content Provenance and Authenticity) standard, which attaches cryptographically signed provenance metadata to generated media.
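The general idea of binding provenance metadata to a media file can be sketched as follows. This is not the C2PA format: the real standard uses certificate-based signatures and a structured manifest embedded in the file, whereas this toy uses an HMAC with a shared key from the Python standard library purely as a stand-in. The function names and claim fields are hypothetical.

```python
import hashlib
import hmac
import json


def make_manifest(media_bytes, generator, key):
    """Build a signed provenance record for a media file (illustrative only)."""
    claim = {
        "generator": generator,
        "content_sha256": hashlib.sha256(media_bytes).hexdigest(),
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(media_bytes, manifest, key):
    """Check that the claim is untampered and matches the media content."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False  # the claim itself was altered
    actual_hash = hashlib.sha256(media_bytes).hexdigest()
    return manifest["claim"]["content_sha256"] == actual_hash


key = b"demo-signing-key"            # placeholder key for the example
clip = b"\x00\x01fake-mp4-bytes"     # placeholder for real video bytes
manifest = make_manifest(clip, "example-video-model", key)
```

Verification fails if either the media bytes or the claim change after signing, which is the property provenance standards rely on to flag edited or relabelled content.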

Malaysian Context — AI Video Generation in Creative and Commercial Industries

Malaysia's creative industry, centred in Kuala Lumpur and Penang, has been an early adopter of AI video generation tools for advertising, branded content, and entertainment. Malaysian advertising agencies including Dentsu Malaysia, Leo Burnett Malaysia, and TBWA Malaysia have reported using Runway and Kling to produce localised campaign assets, reducing turnaround time for regional campaigns from weeks to days.

The National Film Development Corporation Malaysia (FINAS) has acknowledged AI video generation as a disruptive but potentially enabling technology for Malaysian content creators, particularly independent filmmakers who lack the budgets for conventional production. FINAS has begun consulting with industry stakeholders on guidelines for disclosure of AI-generated content in films seeking certification under Malaysian content rules.

Astro, the dominant pay-TV and digital media company in Malaysia, and Viu Malaysia have both explored AI video tools for content localisation, including AI-generated Bahasa Malaysia dubbing synchronised to on-screen lip movement. TV3, RTM, and other broadcasters have piloted AI video for news visualisation and sports highlight generation.

The Ministry of Communications has flagged deepfake video as a pressing concern under the Communications and Multimedia Act 1998, and the Malaysia AI Governance Framework includes provisions requiring disclosure of AI-generated media in commercial and political contexts. The Malaysian Communications and Multimedia Commission (MCMC) is working on technical standards for synthetic media watermarking that align with the international C2PA standard.

Malaysian talent in animation, motion graphics, and video production is adapting to AI tools through programmes offered by MDEC, the Creative Content Industry Guild Malaysia (CCIGM), and private training providers. Concerns about job displacement are balanced by recognition that AI video generation reduces barriers to entry for Malaysian creators seeking to compete in regional and global markets.

References

OpenAI. (2024). Sora: Creating video from text. openai.com/sora.
Brooks, T., et al. (2024). Video generation models as world simulators. openai.com/research.
Google DeepMind. (2024). Veo: Our most capable generative video model. deepmind.google/veo.
Peebles, W., and Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV 2023.
C2PA. (2024). Content Provenance and Authenticity Specification. c2pa.org.

Tags: video generation, text-to-video, diffusion, Sora, generative AI

Type	Generative AI modality
Key models	Sora, Veo, Kling, Runway Gen-4, Pika
Primary architecture	Diffusion Transformer (DiT)
Input types	Text prompt, image, video (continuation)
Output	MP4 video, typically 5-120 seconds
Related	Diffusion model, multimodal AI, Sora, generative AI