AIWiki
Malaysia

AI Safety

AI safety is a field of research and practice concerned with the development of artificial intelligence systems that behave reliably, avoid harmful outputs, and remain aligned with human values, especially as systems become more capable.

6 min readLast updated May 2026Applications

AI safety is a multidisciplinary field concerned with ensuring that artificial intelligence systems behave in ways that are beneficial, predictable, and aligned with human values. The field encompasses technical research into the reliability, robustness, interpretability, and alignment of AI systems; policy work on governance, standards, and international cooperation; and operational practices such as evaluation, red teaming, deployment controls, and incident response. AI safety draws from machine learning, computer security, control theory, philosophy, cognitive science, and law, and has grown significantly in prominence since the public release of large language models in 2022–2023.

Scope of Concerns

Near-term Risks

Near-term AI safety addresses risks present in systems being deployed today. These include hallucination — confident production of factually incorrect statements; biased outputs that disadvantage protected groups; failure modes under distribution shift, where models perform well on test data but poorly in production; security vulnerabilities such as prompt injection, training data poisoning, and model extraction; and misuse for fraud, harassment, surveillance, generation of child sexual abuse material, generation of weapons-relevant information, or facilitation of cyberattacks.

Frontier and Long-term Risks

Frontier AI safety concerns risks that may emerge from highly capable future systems. These include catastrophic misuse, where powerful models could meaningfully assist non-state actors in producing weapons of mass destruction; loss of human oversight, where systems take actions misaligned with human intent; deceptive alignment, where systems behave well during evaluation but pursue different goals when deployed; and societal-scale impacts such as labour displacement, concentration of power, and erosion of epistemic ecosystems.

Technical Research Areas

Alignment

Alignment research investigates how to train AI systems whose objectives, reasoning, and behaviour reliably reflect human intent. Techniques include reinforcement learning from human feedback (RLHF), constitutional AI, direct preference optimisation (DPO), debate, and scalable oversight methods such as recursive reward modelling. Constitutional AI, developed by Anthropic, uses an explicit set of principles and AI feedback to train models that critique and revise their own outputs.

Interpretability

Mechanistic interpretability seeks to reverse-engineer the internal computations of neural networks, identifying circuits and features that explain model behaviour. Sparse autoencoders, probing, and activation patching are among the methods used. Interpretability supports safety by enabling external verification of model reasoning, anomaly detection, and the discovery of misaligned representations before deployment.

Robustness and Evaluation

Robustness research studies how models behave under adversarial inputs, distribution shifts, and corner cases. Evaluation practices include capability evaluations to measure what models can do, behaviour evaluations to measure how they respond to specific prompts, and red teaming — structured adversarial probing by human teams or automated agents — to elicit harmful behaviour before public release. Standardised benchmarks include MMLU, GPQA, SWE-bench, BBH, and biosecurity- and cybersecurity-specific evaluation suites.

Misuse Prevention

Misuse prevention combines content filtering, refusal training, watermarking of generated outputs, rate limiting, identity verification, and post-deployment monitoring. Frontier model developers maintain abuse teams that detect and respond to attempts to use models for prohibited purposes.

Institutional and Policy Landscape

AI Safety Institutes

Several jurisdictions established AI Safety Institutes from 2023 onward, including the UK AI Safety Institute (UK AISI), the US AI Safety Institute (US AISI) within NIST, the Japan AI Safety Institute, the Singapore AI Safety Institute, the European AI Office, and analogous bodies in Canada, South Korea, India, France, and others. These institutes conduct technical evaluations of frontier models, develop methodologies, and coordinate internationally.

Voluntary Commitments and Policies

Voluntary commitments — such as the White House Voluntary AI Commitments, the Seoul Frontier AI Safety Commitments, and the AI Safety Summit Bletchley Declaration — formalised industry pledges around responsible scaling, evaluation, and transparency. Companies including Anthropic, OpenAI, Google DeepMind, Microsoft, Meta, Amazon, and others maintain their own safety policies; Anthropic's Responsible Scaling Policy (RSP) and OpenAI's Preparedness Framework define capability thresholds that trigger additional safeguards.

Regulatory Frameworks

Binding regulation includes the EU AI Act, which classifies systems by risk level and imposes obligations on providers and deployers of high-risk and general-purpose AI; sectoral regulations issued by financial, healthcare, and aviation regulators; and emerging standards from ISO/IEC, IEEE, and NIST.

Industry Practice

AI safety is increasingly embedded in industry development workflows. Frontier laboratories run multi-stage evaluation pipelines that include automated evaluations, structured red teaming, third-party audits, and government evaluations under voluntary access agreements. Deployment is staged, with limited release, controlled access, and monitoring preceding broad availability.

References

  1. Bengio, Y. et al. (2024). International Scientific Report on the Safety of Advanced AI: Interim Report. UK Department for Science, Innovation and Technology.
  2. Anthropic. (2024). Responsible Scaling Policy v2. San Francisco: Anthropic PBC.
  3. OpenAI. (2023). Preparedness Framework. San Francisco: OpenAI.
  4. European Union. (2024). Regulation (EU) 2024/1689 (AI Act). Official Journal of the European Union.
  5. MOSTI Malaysia. (2024). National Guidelines on Artificial Intelligence Governance and Ethics. Putrajaya: MOSTI.