LLM-graded evals will never be perfect, but here are some things you can do to improve their performance and reduce flakiness.

1. Use GPT-4 (especially if your eval task requires reasoning capabilities)

  • GPT-4 will perform much better than GPT-3.5 if your eval task is complex.
  • For simple tasks, gpt-3.5-turbo or sometimes an even cheaper model will do.
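One way to apply this is to route grading calls to a model based on task complexity. This is a minimal sketch; the model names and the boolean heuristic are assumptions you should adapt to the models and criteria you actually use.

```python
def pick_grader_model(requires_reasoning: bool) -> str:
    """Pick a grading model for an eval task.

    Hypothetical routing rule: complex tasks that need reasoning go to
    GPT-4, simple checks go to the cheaper gpt-3.5-turbo.
    """
    return "gpt-4" if requires_reasoning else "gpt-3.5-turbo"

print(pick_grader_model(True))   # → gpt-4
print(pick_grader_model(False))  # → gpt-3.5-turbo
```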

2. Run the evals multiple times

Running each eval multiple times and taking a majority vote, or discarding inconsistent results, will mitigate flakiness.
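The majority-vote approach can be sketched as follows. The grader function, the `runs` count, and the `min_agreement` threshold are all illustrative assumptions; a real grader would call an LLM.

```python
from collections import Counter
from typing import Callable, Optional

def majority_vote(grade: Callable[[], str],
                  runs: int = 5,
                  min_agreement: float = 0.6) -> Optional[str]:
    """Run a (possibly flaky) LLM grader several times and keep the majority verdict.

    Returns None when no verdict reaches `min_agreement`, so inconsistent
    results can be discarded instead of trusted.
    """
    votes = Counter(grade() for _ in range(runs))
    verdict, count = votes.most_common(1)[0]
    return verdict if count / runs >= min_agreement else None

# Example with a stand-in grader that returns canned verdicts:
results = iter(["Pass", "Pass", "Fail", "Pass", "Pass"])
print(majority_vote(lambda: next(results)))  # → Pass (4/5 agreement)
```

Discarding runs that fall below the agreement threshold trades a little coverage for much more trustworthy grades.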

3. Provide custom examples

Providing a few custom few-shot examples suited to your use case is likely to improve the performance of your evals further.

4. Set up custom evals

Using a completely custom eval is likely the best way to tailor your eval to work perfectly for your use case.
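A custom eval often combines cheap deterministic checks with an LLM grade. This sketch assumes you can inject your own grader function; the specific rules (empty response, length cap) are placeholder policies, not a prescribed design.

```python
from typing import Callable

def custom_eval(response: str, llm_grade: Callable[[str], str]) -> str:
    """Grade a response: deterministic rules first, LLM grader as fallback.

    Running cheap rules before the LLM cuts both cost and flakiness,
    since the flaky grader is only consulted when rules can't decide.
    """
    if not response.strip():
        return "Fail"           # empty answers never pass
    if len(response) > 2000:
        return "Fail"           # hypothetical length policy
    return llm_grade(response)  # fall through to the LLM grader

print(custom_eval("", lambda r: "Pass"))               # → Fail
print(custom_eval("Short answer.", lambda r: "Pass"))  # → Pass
```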

5. Contact Us

Email us at hello@athina.ai for help setting up a high-performing custom eval suite.