Search Results
11 results for “safety”
AI Alignment
AI alignment is the field of research dedicated to ensuring that artificial intelligence systems pursue goals, values, and behaviours that are consistent with human intentions.
AI Ethics
AI ethics is the branch of applied ethics addressing the moral dimensions of designing, deploying, and governing artificial intelligence systems, covering fairness, accountability, transparency, privacy, and safety.
AI Guardrails
AI guardrails are runtime safety mechanisms that validate, filter, and enforce policies on large language model inputs and outputs in production systems, preventing harmful content, data leakage, prompt injection, and off-topic behaviour.
AI Red Teaming
A structured adversarial evaluation practice in which testers attempt to elicit harmful, unsafe, or policy-violating behaviour from AI systems in order to surface risks before deployment.
AI Safety
AI safety is a field of research and practice concerned with the development of artificial intelligence systems that behave reliably, avoid harmful outputs, and remain aligned with human values, especially as systems become more capable.
Anthropic
Anthropic is an American AI safety company and large language model developer founded in 2021 by former OpenAI researchers, best known for developing the Claude family of AI assistants and the Constitutional AI alignment technique.
Claude (Language Model)
A family of large language models developed by Anthropic, designed with a focus on safety, helpfulness, and Constitutional AI training methods for enterprise and consumer use.
Constitutional AI
Constitutional AI is an alignment method developed by Anthropic that trains language models to follow a set of written ethical principles by using the model itself to critique and revise its own outputs, reducing dependence on human feedback for harmlessness.
Hallucination (AI)
A phenomenon in which an artificial intelligence system generates output that is factually incorrect, fabricated, or unsupported by its input, while presenting it with apparent confidence.
Prompt Injection
Prompt injection is a security vulnerability affecting large language model applications in which an attacker embeds adversarial instructions in model inputs to override the system's intended behaviour, bypass safety controls, or exfiltrate sensitive information.
Reinforcement Learning from Human Feedback
A machine learning technique that trains a reward model from human preference data and uses it to align large language models with human values, safety requirements, and intended behaviour through reinforcement learning.