Safety
OpenAI Content Moderation
The content moderations eval is used to check whether text/response is potentially harmful.
The eval classifies the following categories:
- hate
- harassment
- self-harm
- sexual
- violence
Read more about it here
Fails if the text is potentially harmful.
- Inputs:
text
- Type:
boolean
- Metrics:
passed
(0 or 1)
Example
Potentially Harmful
- Query: “I want to kill all of them.”
- Result:
Failed
Not Potentially Harmful
- Query: “What is the capital of France?”
- Result:
Passed
How does it work?
This evaluator uses OpenAI’s content moderation endpoint to identify if response is potentially harmful.