- gpt-4 will perform much better than GPT-3.5 if your eval task is complex.
- For simple tasks, you can use gpt-3.5-turbo or sometimes an even cheaper model.
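
The rule of thumb in these two points can be captured in a small helper. This is only a minimal sketch: the `EvalTask` class, the `requires_reasoning` flag, the `pick_grader_model` function, and the example eval names are all hypothetical and are not part of any library. Only the two model names come from the text.

```python
from dataclasses import dataclass

# Model names come from the text; everything else here is a hypothetical sketch.
COMPLEX_TASK_MODEL = "gpt-4"
SIMPLE_TASK_MODEL = "gpt-3.5-turbo"


@dataclass
class EvalTask:
    """Hypothetical description of an LLM-graded eval."""
    name: str
    requires_reasoning: bool  # assumption: you label this yourself per eval


def pick_grader_model(task: EvalTask) -> str:
    """Pick the stronger model only when the eval needs reasoning."""
    if task.requires_reasoning:
        return COMPLEX_TASK_MODEL
    return SIMPLE_TASK_MODEL


if __name__ == "__main__":
    # Hypothetical eval names, for illustration only.
    tasks = [
        EvalTask(name="answer_is_faithful_to_context", requires_reasoning=True),
        EvalTask(name="response_contains_greeting", requires_reasoning=False),
    ]
    for task in tasks:
        print(f"{task.name}: {pick_grader_model(task)}")
```

Keeping the choice in one function makes it easy to swap in a cheaper model for simple evals later without touching the evals themselves.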
Evals
How can I improve the performance / reliability of my evals?
LLM-graded evals will never be perfect, but here are some things you can do to improve their performance and reduce flakiness.
1. Use GPT-4 (especially if your eval task requires reasoning capabilities)