Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs
Yung-Chieh Chan∗, George Pu∗, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton
∗Denotes equal contribution. Work was done while Yung-Chieh was interning at Scale AI.
As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but it is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories – Answer Augmentation, Question Rephrase, and New Question – and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but generating new questions becomes optimal as this ratio increases. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low to mid-data regimes than in high-data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and using different LLMs for synthetic data generation.
Method
In this work, we examine synthetic data techniques initially introduced in math reasoning domains and extend the analysis of these strategies to a more diverse collection of tasks and constraints (Setlur et al., 2024; Yu et al., 2023; Li et al., 2024). We choose supervised fine-tuning (SFT) as the learning objective for the student model, which requires a dataset of instruction-response pairs. We identify three main types of synthetic data generation strategies pertinent to an SFT instruction set – Answer Augmentation, Question Rephrase, and New Question – that are, in essence, either transformations of the seed instructions by an augmenter LLM, generation of corresponding responses by a teacher LLM, or both.
In the above figure, we present an overview of our data generation methods. Given a seed instruction set, we have three different methods for creating instruction-response pairs to fine-tune our student model. We use an example seed instruction from the ARC-C training set, along with synthetic instructions and responses generated with Llama 3.1 70B Instruct.
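For concreteness, here is a minimal sketch of the three strategies. The prompts, model names, and the `query_llm` helper below are hypothetical placeholders for illustration, not the exact prompts or API used in our experiments.

```python
# Sketch of the three generation strategies. `query_llm(model, prompt)` is a
# hypothetical stand-in for a call to an LLM serving endpoint.

TEACHER = "teacher-llm"      # generates responses (e.g., Llama 3.1 70B Instruct)
AUGMENTER = "augmenter-llm"  # rewrites or invents instructions


def query_llm(model: str, prompt: str) -> str:
    """Placeholder: replace with a real call to your LLM backend."""
    raise NotImplementedError


def answer_augmentation(seed_question: str) -> tuple[str, str]:
    # Keep the seed instruction; sample a fresh response from the teacher.
    answer = query_llm(TEACHER, seed_question)
    return seed_question, answer


def question_rephrase(seed_question: str) -> tuple[str, str]:
    # Transform the seed instruction, then have the teacher answer the rephrased version.
    rephrased = query_llm(AUGMENTER, f"Rephrase this question:\n{seed_question}")
    answer = query_llm(TEACHER, rephrased)
    return rephrased, answer


def new_question(seed_question: str) -> tuple[str, str]:
    # Use the seed instruction only as inspiration for a brand-new question.
    question = query_llm(AUGMENTER, f"Write a new question inspired by:\n{seed_question}")
    answer = query_llm(TEACHER, question)
    return question, answer
```

Note that, under this sketch, Answer Augmentation consumes a single query per new pair, while Question Rephrase and New Question each consume an augmenter query plus a teacher query; this is one reason the available query budget interacts with the choice of strategy.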
For all experiments, we assess the effectiveness of each synthetic data generation strategy by comparing the accuracy of the fine-tuned student model on the evaluation split of the dataset used in the experiment. First, we study how each data generation strategy behaves under scalability and data constraints. Next, we investigate the cost-effectiveness of creating new instructions versus new responses. Finally, we run ablation studies on the impact of different teacher models, verification of responses, and a different student model.
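A simple decision helper captures the trend described above: when the ratio of query budget to seed set size is low, generating new answers to existing questions tends to be most effective, while generating new questions pays off as the ratio grows. The crossover threshold below is a hypothetical illustration, not a value measured in the paper.

```python
# Minimal rule-of-thumb sketch based on the budget-to-seed ratio.
# `crossover_ratio` is a hypothetical default, not a result from the paper.

def choose_strategy(query_budget: int, seed_set_size: int,
                    crossover_ratio: float = 4.0) -> str:
    ratio = query_budget / seed_set_size
    if ratio < crossover_ratio:
        return "answer_augmentation"  # cheap: new responses for existing questions
    return "new_question"             # broader coverage once the budget allows it

# Example: 1,000 seed instructions with a 2,000-query budget -> low ratio,
# so re-answering existing questions is the suggested default.
print(choose_strategy(query_budget=2_000, seed_set_size=1_000))
```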
This paper, Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs (arXiv), has been accepted to the Fine-Tuning in Machine Learning (FITML) Workshop at NeurIPS 2024.
BibTeX Citation
@article{chan2024balancing,
  title={Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs},
  author={Chan, Yung-Chieh and Pu, George and Shanker, Apaar and Suresh, Parth and Jenks, Penn and Heyer, John and Denton, Sam},
  journal={arXiv preprint arXiv:2409.19759},
  year={2024}
}