Generate a Synthetic Dataset
You can generate synthetic datasets in Athina.
AI is only as good as your data.
But collecting robust datasets for training and testing can be a major challenge.
In Athina IDE, you can generate high-quality RAG Q&A datasets from your own documents.
You can then use the generated data for evaluation, for testing prompts and models, for running experiments, or export it for fine-tuning.
How to Generate a Synthetic Dataset
- Open Athina Develop.
- Click Generate Synthetic Data.
- Select the documents you want to use to generate the dataset.
  a. You can either upload a .txt file.
  b. Or you can choose to generate synthetic data similar to your production logs.
- Choose the number of questions you want to generate.
- Choose the type of questions you want to generate.
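If your source material is spread across several text files, one way to prepare it for the upload step is to merge the files into a single .txt file first. The following is a minimal sketch, not part of Athina: the function name `combine_documents`, the folder layout, and the choice to separate documents with a blank line are all assumptions made for illustration.

```python
from pathlib import Path


def combine_documents(source_dir: str, output_path: str) -> int:
    """Merge every .txt file in source_dir into one file (hypothetical helper).

    Returns the number of non-empty documents that were written.
    """
    out = Path(output_path)
    parts = []
    for path in sorted(Path(source_dir).glob("*.txt")):
        # Skip the output file itself if it lives in the same folder.
        if path.resolve() == out.resolve():
            continue
        text = path.read_text(encoding="utf-8").strip()
        if text:
            parts.append(text)
    # Assumption: a blank line between documents is enough separation.
    out.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(parts)
```

Calling `combine_documents("docs", "combined.txt")` would write `combined.txt`, which you could then upload in the document selection step.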
What type of synthetic data can I generate?
Currently, we support the following question types:
- Simple Q&A
- Reasoning-based Questions
- Multiple Choice Questions
- Negative Questions
- Unsafe Questions
- Conditional Questions
If you need something more custom than this, please contact us.
How does synthetic data generation work?
We partnered with Fiddlecube to leverage their advanced data generation techniques.
A lot of things happen under the hood to generate high-quality data:
- The source data is run through a data generation pipeline, which uses large language models to generate a diverse set of rows.
- The dataset is then measured for quality, and rigorously filtered, cleaned and de-duplicated to meet the described criteria.
- Ultimately, the output rows are RAG question-answer style rows, each with a query, a context, and a response.
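To make the shape of the output concrete, here is a minimal sketch of a row with the three fields named above, plus a simple filter-and-de-duplicate pass. This is not the actual Athina or Fiddlecube pipeline: the `RagRow` class, the `filter_and_dedupe` function, the minimum response length, and the idea of de-duplicating on a normalized query are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagRow:
    """One generated row: the three fields named in the docs."""
    query: str
    context: str
    response: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries match."""
    return " ".join(text.lower().split())


def filter_and_dedupe(rows: list[RagRow], min_response_chars: int = 10) -> list[RagRow]:
    """Drop incomplete rows and repeated queries (illustrative rules only)."""
    seen: set[str] = set()
    kept: list[RagRow] = []
    for row in rows:
        # Quality filter (assumed): every row needs a query and a context.
        if not row.query.strip() or not row.context.strip():
            continue
        # Quality filter (assumed): very short responses are discarded.
        if len(row.response.strip()) < min_response_chars:
            continue
        # De-duplication (assumed): keep the first row for each normalized query.
        key = normalize(row.query)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

A similar pass can also be useful on your side after exporting a dataset, for example to apply stricter rules than the defaults before fine-tuning.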