AIWiki
Malaysia

Search Results

11 results for safety

Foundations

AI Alignment

AI alignment is the field of research dedicated to ensuring that artificial intelligence systems pursue goals, values, and behaviours that are consistent with human intentions.

5 min read · Updated May 2026
Ethics & Policy

AI Ethics

AI ethics is the branch of applied ethics addressing the moral dimensions of designing, deploying, and governing artificial intelligence systems — covering fairness, accountability, transparency, privacy, and safety.

5 min read · Updated May 2026
Infrastructure

AI Guardrails

AI guardrails are runtime safety mechanisms that validate, filter, and enforce policies on large language model inputs and outputs in production systems, preventing harmful content, data leakage, prompt injection, and off-topic behaviour.

6 min read · Updated June 2026
Applications

AI Red Teaming

A structured adversarial evaluation practice in which testers attempt to elicit harmful, unsafe, or policy-violating behaviour from AI systems in order to surface risks before deployment.

5 min read · Updated May 2026
Applications

AI Safety

AI safety is a field of research and practice concerned with the development of artificial intelligence systems that behave reliably, avoid harmful outputs, and remain aligned with human values, especially as systems become more capable.

6 min read · Updated May 2026
Companies & Tools

Anthropic

Anthropic is an American AI safety company and large language model developer founded in 2021 by former OpenAI researchers, best known for developing the Claude family of AI assistants and the Constitutional AI alignment technique.

7 min read · Updated May 2026
Models

Claude (Language Model)

A family of large language models developed by Anthropic for enterprise and consumer use, designed with a focus on safety and helpfulness and trained using Constitutional AI methods.

5 min read · Updated May 2026
Foundations

Constitutional AI

Constitutional AI is an alignment method developed by Anthropic that trains language models to follow a set of written ethical principles by using the model itself to critique and revise its own outputs, reducing dependence on human feedback for harmlessness.

6 min read · Updated May 2026
Foundations

Hallucination (AI)

A phenomenon in which an artificial intelligence system generates output that is factually incorrect, fabricated, or unsupported by its input, while presenting it with apparent confidence.

6 min read · Updated May 2026
Infrastructure

Prompt Injection

Prompt injection is a security vulnerability affecting large language model applications in which an attacker embeds adversarial instructions in model inputs to override the system's intended behaviour, bypass safety controls, or exfiltrate sensitive information.

7 min read · Updated June 2026
Foundations

Reinforcement Learning from Human Feedback

A machine learning technique that trains a reward model from human preference data and uses it to align large language models with human values, safety requirements, and intended behaviour through reinforcement learning.

7 min read · Updated May 2026