Sora

Sora is a text-to-video generative AI model developed by OpenAI that produces short, high-fidelity video clips with synchronised audio from natural-language prompts.

Last updated May 2026

Sora is a text-to-video generative model developed by OpenAI that produces short video clips from natural-language prompts. First demonstrated in February 2024 and released to ChatGPT Plus and Pro subscribers in the United States and Canada in December 2024, Sora pairs a diffusion-based generative backbone with a transformer architecture trained on video and image data. Sora 2, unveiled on 30 September 2025, added synchronised audio, sharper realism, and improved physical plausibility, and was distributed through a standalone iOS app followed by an Android version.

Background

Sora was introduced as part of OpenAI's broader push into multimodal generative systems following the success of DALL-E for images. OpenAI describes its goal as building a "general-purpose simulator of the physical world", treating video generation as a stepping stone toward models that can reason about the dynamics of objects, scenes, and agents. Sora builds on earlier research into latent diffusion models and on video diffusion systems such as Imagen Video and Make-A-Video.

Architecture

At a high level, Sora compresses input videos into a lower-dimensional latent space and splits that representation into spacetime patches, which play the role that tokens play in a language model. A diffusion transformer then learns to denoise these patches conditioned on text, so the model can generate novel sequences from a textual prompt. The transformer backbone scales much as a large language model does, with longer training and more parameters producing higher-quality, more temporally coherent video. Conditioning may also include reference images or short input clips for tasks such as image-to-video extension and video-to-video stylisation.
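The patch-based view described above can be illustrated with a toy sketch. This is not OpenAI's implementation, which has not been published in code form. The function names `patchify` and `add_noise`, the patch sizes, and the single-channel toy video are all assumptions made for illustration, and the noising step uses the generic DDPM-style formulation rather than anything confirmed for Sora.

```python
import math
import random
from typing import List, Tuple

Video = List[List[List[float]]]  # indexed as video[t][y][x], one channel


def patchify(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Split a T x H x W video into flattened spacetime patches (hypothetical helper)."""
    t_len, h_len, w_len = len(video), len(video[0]), len(video[0][0])
    if t_len % pt or h_len % ph or w_len % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, t_len, pt):
        for y0 in range(0, h_len, ph):
            for x0 in range(0, w_len, pw):
                # Each patch spans pt frames and a ph x pw window in space.
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


def add_noise(patch: List[float], alpha_bar: float,
              rng: random.Random) -> Tuple[List[float], List[float]]:
    """Generic DDPM-style forward step: blend clean values with Gaussian noise.

    Returns the noisy patch and the noise that a denoiser would be
    trained to predict. Assumed formulation, not Sora-specific.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in patch]
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noisy = [keep * v + mix * e for v, e in zip(patch, noise)]
    return noisy, noise


# Toy 4-frame, 4x4 "video" whose values encode their own position.
toy_video = [[[t * 100 + y * 10 + x for x in range(4)]
              for y in range(4)] for t in range(4)]
tokens = patchify(toy_video, pt=2, ph=2, pw=2)
noisy, target = add_noise(tokens[0], alpha_bar=0.5, rng=random.Random(0))
print(len(tokens), len(tokens[0]))  # 8 patches, 8 values each
```

In a real system the patches would be cut from a learned latent representation rather than raw pixels, and a transformer conditioned on the text prompt would be trained to recover `target` from `noisy`.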

Sora 2 introduced several upgrades over the first generation. According to OpenAI's release materials, the second-generation model adheres more closely to the laws of physics, handles multi-shot instructions while preserving world state, and generates a synchronised audio track containing dialogue, sound effects, and ambient noise. Clips in the consumer product run roughly 10 to 25 seconds.

Capabilities and limitations

Reported capabilities of Sora and Sora 2 include the generation of realistic, cinematic, and anime-style scenes, support for camera moves, and the ability to maintain character identity across cuts within a clip. The model can also be prompted to extend an existing image into motion or to interpolate between frames.

Documented limitations include difficulty with fine-grained text rendering inside videos, occasional violations of physical plausibility (objects passing through each other, limbs deforming), and content-safety constraints. OpenAI applies provenance markers such as C2PA metadata and visible watermarks to Sora outputs and uses classifier-based filtering to limit generation of non-consensual likenesses, explicit content, and depictions of real public figures.

| Feature | Sora 1 (Dec 2024) | Sora 2 (Sept 2025) |
| --- | --- | --- |
| Maximum clip length | ~20 seconds | ~25 seconds |
| Audio | No | Synchronised audio |
| Physics fidelity | Basic | Improved |
| Multi-shot control | Limited | Stronger |
| Distribution | ChatGPT Plus/Pro | Dedicated iOS and Android apps, API access |

Availability and access

At launch in December 2024, Sora was accessible through a web interface for ChatGPT Plus and Pro subscribers in the United States and Canada. Sora 2 was distributed as a dedicated mobile app and also exposed via the OpenAI API for developer use. Public availability has fluctuated as OpenAI has adjusted policies around compute, safety, and rights management. As of April 2026, the consumer Sora product was reported to be no longer available in its original form, while video generation features were being integrated into other OpenAI offerings.

Reception and controversies

Sora attracted attention for the visual quality of its outputs. It also renewed debate over the use of copyrighted video for training, the labour implications for the stock-footage and animation industries, and the risks that synthetic media poses to elections and personal reputation. Stock-media companies, screen guilds, and several rights-holder groups raised concerns about consent and compensation. OpenAI has responded with opt-out mechanisms, content credentials, and partnerships with selected studios and rights organisations.

Comparison with other systems

Sora competes with text-to-video models such as Google DeepMind's Veo, Runway's Gen-3, Luma's Dream Machine, Kling AI from Kuaishou, and Pika Labs. These systems differ in clip length, audio support, fine-tuning options, and access models. Open-weights video diffusion models such as Stable Video Diffusion and CogVideoX provide research-oriented alternatives, although typically at lower fidelity than the proprietary frontier systems.

References

  1. OpenAI. (2024). Sora: Creating video from text. OpenAI Research.
  2. OpenAI. (2025). Sora 2 System Card. OpenAI.
  3. Brooks, T. et al. (2024). Video generation models as world simulators. OpenAI Technical Report.
  4. Malaysian Communications and Multimedia Commission. (2024). Guidelines on Content for Online Services. MCMC.