How can I improve the performance / reliability of my evals?
LLM-graded evals will never be perfect, but here are some things you can do to improve their performance and reduce flakiness.
1. Use GPT-4 (especially if your eval task requires reasoning capabilities)
GPT-4 will usually perform much better than GPT-3.5 if your eval task is complex.
For simple tasks, you can use gpt-3.5-turbo or sometimes an even cheaper model.
2. Run the evals multiple times
Running evals multiple times and taking a majority vote, or discarding inconsistent results, will mitigate the flakiness.
3. Provide custom examples
Providing custom few-shot examples suited to your use case is likely to improve the performance of your evals further.
4. Set up custom evals
Using a completely custom eval is likely the best way to tailor your eval to your use case.
5. Contact us
Email us at hello@athina.ai for help setting up a high-performing custom eval suite.
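The majority-vote idea from tip 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `majority_vote`, `run_eval`, `n_runs`, and `min_agreement` are hypothetical names, not part of any Athina API, and `run_eval` stands in for whatever call produces a single LLM-graded verdict.

```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(
    run_eval: Callable[[], str],
    n_runs: int = 5,
    min_agreement: float = 0.6,
) -> Optional[str]:
    """Run a grader several times and return the most common verdict.

    Returns None when the share of runs agreeing with the top verdict is
    below min_agreement, so the caller can discard the inconsistent result.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Collect one verdict per run (hypothetical grader call).
    verdicts = [run_eval() for _ in range(n_runs)]
    top_verdict, top_count = Counter(verdicts).most_common(1)[0]
    if top_count / n_runs < min_agreement:
        return None  # too inconsistent to trust
    return top_verdict


# Stand-in grader for illustration; replace with a real LLM-graded eval call.
fake_verdicts = iter(["pass", "pass", "fail", "pass", "pass"])
result = majority_vote(lambda: next(fake_verdicts))
print(result)  # 4 of 5 runs agree, so this prints "pass"
```

An odd `n_runs` avoids ties for binary verdicts. Raising `min_agreement` discards more borderline cases at the cost of fewer scored examples.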