AIWiki
Malaysia

AI Red Teaming

A structured adversarial evaluation practice in which testers attempt to elicit harmful, unsafe, or policy-violating behaviour from AI systems in order to surface risks before deployment.

5 min readLast updated May 2026Applications

AI red teaming is a structured adversarial evaluation practice in which dedicated testers — human, automated, or both — attempt to elicit harmful, unsafe, biased, or policy-violating behaviour from artificial intelligence systems before those systems are deployed. The term is borrowed from military exercises and from offensive cybersecurity, where a "red team" simulates an adversary against a "blue team" defending the system. Applied to AI, red teaming has become a central component of frontier model release processes at Anthropic, OpenAI, Google DeepMind, Meta AI and Microsoft, and is embedded in the NIST AI Risk Management Framework and EU AI Act conformity assessment expectations.

Scope and objectives

Modern AI red teaming addresses risks across several dimensions. Capability risks include the elicitation of dangerous knowledge in domains such as chemical, biological, radiological and nuclear (CBRN) weapons, offensive cybersecurity, and election interference. Alignment risks include deceptive behaviour, scheming, sycophancy and reward hacking. Application risks include prompt injection through untrusted tool outputs, data exfiltration in retrieval-augmented systems, jailbreaks that bypass refusal training and content-policy violations affecting children, minorities, or other protected categories. Multimodal red teaming additionally covers image, audio and video inputs and outputs.

Methodologies

Red teaming combines manual probing, automated attack generation and structured evaluation. Manual probing engages domain experts — biosecurity researchers, lawyers, doctors, intelligence analysts — to attempt to surface risks specific to their field. Automated red teaming uses adversarial language models to generate jailbreak prompts, fuzz tool calls and search the input space efficiently. Structured evaluations apply standardised attack suites such as Microsoft's PyRIT and the AI Red Teaming Agent, released in April 2025 and integrated with Azure AI Foundry. The open-source DeepTeam framework, released in November 2025, brings adversarial testing to organisations without dedicated security teams.

Anthropic's approach

Anthropic operates a Frontier Red Team comprising roughly fifteen full-time researchers reporting through its policy organisation, deliberately separated from the teams developing model defences so that attackers and defenders do not share incentives to minimise findings. The team publishes evaluation results, attack scenarios and technical analyses on a dedicated blog launched in August 2025. Anthropic's evaluations focus on four domains: biology (with external collaborations including SecureBio's Virology Capability Test and Sepal AI's bioterrorism planning experiments), cybersecurity, autonomy and CBRN. Each Claude model receives a system card describing red-team findings and mitigations before release.

OpenAI's approach

OpenAI publishes detailed system cards for each major release, including the ChatGPT Agent System Card published in July 2025. The company combines an internal red team with external networks of contracted experts and conducts pre-deployment evaluations against capability thresholds defined in its Preparedness Framework. Microsoft has red-teamed more than one hundred generative AI products and open-sourced PyRIT, contributing to transparency across the industry.

Standards and ecosystem

Industry-wide methodology converges around four principles: regular evaluation cycles, rapid response to newly discovered attacks, adaptive methodology that evolves with model capabilities and cross-team integration with safety, policy and engineering functions. The MITRE ATLAS framework catalogues adversarial machine learning techniques, and the NIST AI Risk Management Framework provides governance language adopted in many enterprise procurement processes.

Limitations

Red teaming surfaces risks but does not exhaustively characterise them. Persistent adversaries with budget and time will routinely defeat current safety training, leading some researchers to argue that defence-in-depth — combining model-level training with deployment-time monitoring, input filtering and output post-processing — is required for production deployments. Red teaming also faces ethical tensions: documented attack techniques can be misused, and access to dangerous capabilities must be carefully managed during evaluation.

References

  1. Anthropic. (2025). Frontier Red Team: Methodology and Findings. Anthropic Red Blog.
  2. OpenAI. (2025). ChatGPT Agent System Card. OpenAI.
  3. Microsoft. (2025). PyRIT and the AI Red Teaming Agent. Microsoft Security.
  4. National Institute of Standards and Technology. (2023). AI Risk Management Framework (AI RMF 1.0).
  5. Ministry of Digital Malaysia. (2024). Malaysia AI Governance Framework.