Generate a Synthetic Dataset
You can generate synthetic datasets in Athina.
This feature is currently in beta. Please contact us if you’d like early access.
AI is only as good as your data.
But collecting robust datasets for training and testing can be a major challenge.
In Athina IDE, you can generate high-quality RAG Q&A datasets from your own documents.
You can then use the generated data for evaluation, testing prompts and models, and running experiments, or export it for fine-tuning.
How to Generate a Synthetic Dataset
- Open Athina Develop.
- Click Generate Synthetic Data.
- Select the documents you want to use to generate the dataset.
  a. You can either upload a .txt file.
  b. Or you can choose to generate synthetic data similar to your production logs.
- Choose the number of questions you want to generate.
- Choose the type of questions you want to generate.
What type of synthetic data can I generate?
Currently, we support the following question types:
- Simple Q&A
- Reasoning-based Questions
- Multiple Choice Questions
- Negative Questions
- Unsafe Questions
- Conditional Questions
If you need something more custom than this, please contact us.
How does synthetic data generation work?
We partnered with Fiddlecube to leverage their advanced data generation techniques.
A lot of things happen under the hood to generate high-quality data:
- The source data is run through a data generation pipeline, which uses large language models to generate diverse rows.
- The dataset is then measured for quality, and rigorously filtered, cleaned and de-duped to meet the described criteria.
- Ultimately, the output rows will be RAG Question-Answer style rows with a query, context, and response.
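To make the row shape and the filter/de-dupe step concrete, here is a minimal Python sketch. It is not Athina's or Fiddlecube's actual pipeline: the `RagRow` class, the `normalize` and `filter_and_dedupe` helpers, and the sample rows are all hypothetical, and the only detail taken from this page is that each row carries a query, a context, and a response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagRow:
    """Hypothetical row shape: a query, the context it draws on, and a response."""
    query: str
    context: str
    response: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries compare equal."""
    return " ".join(text.lower().split())


def filter_and_dedupe(rows: list[RagRow]) -> list[RagRow]:
    """Drop rows with an empty field, then keep the first row per normalized query.

    This is an illustrative stand-in for a quality filter, not the real one.
    """
    seen: set[str] = set()
    kept: list[RagRow] = []
    for row in rows:
        if not (row.query.strip() and row.context.strip() and row.response.strip()):
            continue  # incomplete row: one of the three fields is empty
        key = normalize(row.query)
        if key in seen:
            continue  # duplicate query after normalization
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample rows, for illustration only.
rows = [
    RagRow("What is the refund window?", "Refunds are accepted within 30 days.", "30 days."),
    RagRow("what is the  refund window?", "Refunds are accepted within 30 days.", "Thirty days."),
    RagRow("Who approves refunds?", "", "The support team."),
]
clean = filter_and_dedupe(rows)
```

In this sketch the second row is dropped as a near-duplicate of the first, and the third is dropped because its context is empty. A production pipeline would likely use stronger quality measures than exact matching on a normalized query.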