Search Results
5 results for “evaluation”
AI Red Teaming
A structured adversarial evaluation practice in which testers attempt to elicit harmful, unsafe, or policy-violating behaviour from AI systems in order to surface risks before deployment.
Langfuse
Langfuse is an open-source LLM engineering platform that provides observability, tracing, prompt management, evaluation, and dataset tooling for teams building applications on top of large language models.
LangSmith
LangSmith is an observability, tracing, and evaluation platform from LangChain for debugging, monitoring, and continuously improving large language model and AI agent applications in production.
Scale AI
An American data labelling, evaluation, and AI infrastructure company that supplies training data and evaluation services to leading AI laboratories, autonomous vehicle developers, and government agencies.
Weights and Biases
Weights and Biases (W&B) is a machine learning developer platform for experiment tracking, model versioning, dataset management, and collaborative model evaluation, used by over 200,000 ML practitioners worldwide.