At some point in your AI product development lifecycle, you will need to improve your model's performance. For Generative AI models, improving performance systematically is difficult because you don't have a quantitative way to measure results.
Usually, during the initial prototyping phase, people run a prompt on a single datapoint, eyeball the response, tweak the prompt, and repeat. This workflow is fine for getting an initial demo ready, but it does not work well after that stage.
This workflow is similar to the previous one, but instead of running one datapoint at a time, you run many datapoints together. However, you still don't have ground truth data (the ideal response from the LLM), so there's nothing to compare against.

Eyeball Workflow
- Run the prompt on a dataset with multiple datapoints
- Put the outputs onto a spreadsheet / CSV
- Manually review (eyeball) each response
- Repeat

This workflow is fine pre-MVP, but it is not great for iteration.
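In practice, this multi-datapoint eyeball workflow often boils down to a small script like the sketch below. The `call_llm` helper, the prompt template, and the dataset shape are hypothetical placeholders rather than any particular library's API.

```python
import csv

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model provider's SDK call here.
    return f"(model output for: {prompt})"

PROMPT_TEMPLATE = "Answer the customer's question: {question}"

# A handful of datapoints you want to eyeball.
dataset = [
    {"question": "How do I reset my password?"},
    {"question": "What is your refund policy?"},
]

# Run the prompt over every datapoint and dump the outputs to a CSV
# so you can review (eyeball) the responses in a spreadsheet.
with open("outputs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "response"])
    writer.writeheader()
    for row in dataset:
        response = call_llm(PROMPT_TEMPLATE.format(**row))
        writer.writerow({"question": row["question"], "response": response})
```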
Why doesn’t this workflow work for rapid iteration?
- Inspecting generations on a dataset is manual and time-consuming (even if the dataset is small!)
- You don't have quantitative metrics
- You have to maintain a historical record of prompts run
- You don't have a system to compare the outputs of prompt A vs prompt B
You now have a golden dataset: your datapoints paired with ideal responses. You can now set up some basic evals. Great! Now you actually have a way to improve performance systematically. The workflow looks something like this:

Iteration Workflow
- Create a golden dataset (multiple datapoints with expected responses)
- Run the prompt on the test dataset
- Option 1: Manual review
  - Put the outputs onto a spreadsheet / CSV
  - Manually compare the LLM responses against the expected responses
- Option 2: Evaluators (see the sketch below)
  - Create evaluators to compare each LLM response against the expected response
  - But what metrics do you use? How do you compare two pieces of unstructured text?
  - Build internal tooling to:
    - run these evaluators and score them
    - track the history of runs
    - provide a UI
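For example, Option 2 might start with a homegrown evaluator like the minimal sketch below, which scores each LLM response against the expected response using simple string similarity. All names here are illustrative, and real evaluators typically need fuzzier, task-specific metrics (often LLM-graded).

```python
from difflib import SequenceMatcher

# Golden dataset: datapoints paired with ideal (expected) responses,
# plus the LLM response you want to score.
golden_dataset = [
    {
        "query": "How do I reset my password?",
        "expected": "Go to Settings > Account and click 'Reset password'.",
        "response": "You can reset it under Settings > Account > Reset password.",
    },
]

def similarity_evaluator(response: str, expected: str) -> float:
    """Naive evaluator: string similarity between response and expected answer."""
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio()

# Score every datapoint so you can compare prompt versions later.
for row in golden_dataset:
    score = similarity_evaluator(row["response"], row["expected"])
    passed = score >= 0.7  # arbitrary threshold for this sketch
    print(f"{row['query']!r}: score={score:.2f} passed={passed}")
```

Even this toy version makes the pain points obvious: you still have to pick metrics, store the scores somewhere, track runs over time, and build a UI to review them.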
This is actually a good workflow for all stages.
What are the downsides of this workflow?
- Difficult and time-consuming to create good evals
- You need to create lots of internal tooling
- Does not capture variations between your golden dataset and production data
- You have to maintain a historical record of prompts run
Athina's workflow is designed for users at any stage of the AI product development lifecycle.

Athina Monitor: Demo / MVP / Production Stage

Setup time: < 5 mins
- Preserve historical data including prompt, response, cost, token usage, and latency (+ more)
- UI to manually grade your responses with 👍 / 👎
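To make this concrete, instrumenting your app for monitoring looks roughly like the sketch below. The `log_inference` wrapper, the hard-coded token counts, and the cost value are hypothetical stand-ins, not Athina's actual SDK; see Athina's docs for the real integration.

```python
import time

def log_inference(prompt: str, response: str, latency_ms: float,
                  prompt_tokens: int, completion_tokens: int, cost_usd: float) -> None:
    # Hypothetical wrapper: in a real integration this record would be sent
    # to your monitoring backend instead of printed.
    print({"prompt": prompt, "response": response, "latency_ms": latency_ms,
           "prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens,
           "cost_usd": cost_usd})

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model provider's SDK call here.
    return f"(model output for: {prompt})"

prompt = "Summarize this support ticket: ..."
start = time.time()
response = call_llm(prompt)
latency_ms = (time.time() - start) * 1000

# Preserve prompt, response, token usage, cost, and latency for every call.
# Token counts and cost normally come from the provider's response metadata.
log_inference(prompt, response, latency_ms,
              prompt_tokens=42, completion_tokens=87, cost_usd=0.0004)
```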
Athina Monitor works whether you are logging a single datapoint or many.

Athina Evaluate: Development / Iteration Stage

Setup time: 2 mins

Now that you're really focused on improving model performance, here's how you can do it:
Configure experiments and run evaluations programmatically
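As a rough picture of what running evaluations programmatically can look like, the sketch below runs two prompt variants over a golden dataset and scores each with an evaluator. The helpers here are hypothetical stand-ins rather than Athina's actual SDK; the specifics of Athina Evaluate live in Athina's docs.

```python
from difflib import SequenceMatcher

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model provider's SDK call here.
    return f"(model output for: {prompt})"

def similarity_evaluator(response: str, expected: str) -> float:
    # Naive stand-in for a real evaluator (semantic similarity, LLM-graded, etc.).
    return SequenceMatcher(None, response.lower(), expected.lower()).ratio()

golden_dataset = [
    {"query": "How do I reset my password?",
     "expected": "Go to Settings > Account and click 'Reset password'."},
]

# Two prompt variants you want to compare in an experiment.
experiments = {
    "prompt_a": "Answer the question: {query}",
    "prompt_b": "You are a support agent. Answer concisely: {query}",
}

# Run each variant over the golden dataset, score it, and compare averages.
for name, template in experiments.items():
    scores = []
    for row in golden_dataset:
        response = call_llm(template.format(query=row["query"]))
        scores.append(similarity_evaluator(response, row["expected"]))
    print(f"{name}: average score = {sum(scores) / len(scores):.2f}")
```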