Step-by-Step Guide to Optimizing Your Dataset for Fine-Tuning Models in Athina
Fine-tuning a model requires structured and high-quality training data. Properly preparing data ensures the model learns effectively from relevant examples, improving its performance on specific tasks. This guide walks you through the step-by-step process of preparing data for fine-tuning in Athina AI, including uploading, processing (quality check), and formatting datasets.
In Athina, data preparation is easy with Dynamic Columns, allowing users to clean, transform, and format datasets without complex coding. You can detect errors, duplicates, and inconsistencies in datasets and even create custom evaluations to ensure data quality before fine-tuning. This results in optimized, high-quality data for better fine-tuning outcomes.

Now, let's go through the step-by-step process of preparing data for fine-tuning.
We are fine-tuning the TinyLlama (1.1B parameters) model with a 2048-token context window (Sequence Length), so the total token length for each sample must be ≤ 2048.
To check the token length for both questions and answers, we will create a tokenizer flow as shown below:
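For reference, here is a minimal sketch of what the token-counting logic can look like, assuming the Hugging Face transformers tokenizer for TinyLlama and illustrative column names ("question" and "answer"); Athina's actual flow editor may differ.

```python
# A minimal sketch of the token-counting step, assuming Hugging Face
# transformers and illustrative column names ("question", "answer").
from transformers import AutoTokenizer

MAX_SEQ_LEN = 2048  # TinyLlama's context window

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def count_tokens(row: dict) -> dict:
    """Return per-field and total token counts for one dataset row."""
    question_tokens = len(tokenizer.encode(row["question"]))
    answer_tokens = len(tokenizer.encode(row["answer"]))
    return {
        "question_tokens": question_tokens,
        "answer_tokens": answer_tokens,
        "total_tokens": question_tokens + answer_tokens,
    }
```

Counting the question and answer together matters because both are packed into a single training sample, so it is the sum that must stay within the 2048-token limit.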
Next, click on “Use in Dataset” to add this flow to the fine-tuning dataset.
You will then be redirected to your dataset. Here, select Configure Inputs and choose the second code block as the output. The result will appear as a new column in your dataset.
After this, create a custom evaluation to check whether a sample's total token length exceeds the 2048-token context window.
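As a sketch, such an evaluation can be a simple pass/fail check on the token counts computed above; the function name and row fields here are illustrative, not Athina's exact evaluation API.

```python
# A minimal pass/fail evaluation sketch; assumes each row carries the
# "total_tokens" count computed in the tokenizer flow above.
MAX_SEQ_LEN = 2048

def fits_context_window(row: dict) -> bool:
    """Pass if the question and answer together fit the context window."""
    return row["total_tokens"] <= MAX_SEQ_LEN

# Example usage: keep only the rows that pass, assuming `dataset`
# is a list of row dicts.
high_quality = [row for row in dataset if fits_context_window(row)]
```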
Once high-quality data has been selected, apply the Chat Template using Execute Custom Code. Here, we use the ChatML template from OpenAI.
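Below is a minimal sketch of what applying the ChatML template to one row can look like; the system prompt and column names are illustrative assumptions.

```python
# A minimal sketch of formatting one row with OpenAI's ChatML template;
# the system prompt and column names are illustrative assumptions.
def to_chatml(row: dict, system_prompt: str = "You are a helpful assistant.") -> str:
    """Wrap a question/answer pair in ChatML role markers."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{row['question']}<|im_end|>\n"
        f"<|im_start|>assistant\n{row['answer']}<|im_end|>\n"
    )
```

Using a consistent chat template during fine-tuning matters because the model learns the role markers along with the content, and the same template must then be used at inference time.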
By following these steps, you can properly prepare, clean, and format datasets for fine-tuning in Athina AI. This ensures your model is trained on high-quality, structured data, leading to better performance and improved results.