You can generate synthetic datasets in Athina.
This feature is currently in beta. Please contact us if you’d like early access.
AI is only as good as your data.
But collecting robust datasets for training and testing can be a major challenge.
In Athina IDE, you can generate high-quality RAG Q&A datasets from your own documents.
You can then use the generated data for evaluation, testing prompts and models, running experiments, or exporting for fine-tuning.
Open Athina Develop.
Click Generate Synthetic Data.
Select the documents you want to use to generate the dataset.
a. You can either upload a .txt file.
b. Or you can choose to generate synthetic data similar to your production logs.
Choose the number of questions you want to generate.
Choose the type of questions you want to generate.
Currently, we support the following question types:
If you need something more custom than this, please contact us.
We partnered with Fiddlecube to leverage their advanced data generation techniques.
A lot of things are happening under the hood to generate high-quality data:
The source data is run through a data generation pipeline, which uses large language models to generate a diverse set of rows.
The dataset is then measured for quality, and rigorously filtered, cleaned and de-duped to meet the described criteria.
Ultimately, the output rows will be RAG Question-Answer style rows with a query, context, and response.
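To make the row shape and the clean-up step more concrete, here is a minimal Python sketch. It is not Athina's or Fiddlecube's actual pipeline, which is not documented here. The `RagRow` class, the `clean_rows` and `to_jsonl` helpers, and the JSONL export layout are all hypothetical names chosen for illustration. Only the three fields (query, context, response) come from the description above, and the de-duplication shown is a simple exact match on a normalized query, which is an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RagRow:
    """Hypothetical container for one generated row (query, context, response)."""
    query: str
    context: str
    response: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different queries compare equal.
    return " ".join(text.lower().split())


def clean_rows(rows: list[RagRow]) -> list[RagRow]:
    """Drop rows with any empty field, then drop rows whose normalized
    query has already been seen (the first occurrence is kept)."""
    seen: set[str] = set()
    kept: list[RagRow] = []
    for row in rows:
        if not (row.query.strip() and row.context.strip() and row.response.strip()):
            continue  # incomplete row
        key = normalize(row.query)
        if key in seen:
            continue  # duplicate question
        seen.add(key)
        kept.append(row)
    return kept


def to_jsonl(rows: list[RagRow]) -> str:
    # One JSON object per line, a common layout for fine-tuning exports (assumed here).
    return "\n".join(json.dumps(asdict(row)) for row in rows)
```

For example, passing a list that contains "What is X?" and "what  is x?" to `clean_rows` keeps only the first of the two, and `to_jsonl` then serializes the surviving rows one per line. A real quality filter would likely use semantic similarity and model-based scoring rather than exact matching.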