Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for Fine-Tuning LLMs
Scale’s paper, Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs (arxiv), has been accepted to the Fine-Tuning in Machine Learning (FITML) Workshop at NeurIPS 2024. This post overviews the research presented in that paper.
Paper Authors: Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton
In this paper, the Scale AI team investigates various synthetic data generation strategies for fine-tuning LLMs under different constraints, paying particular attention to cost-effectiveness. That focus makes the paper especially relevant to real-world enterprise fine-tuning decisions.
At Scale, we help make AI work for enterprises, so they can do better work. One way to make Enterprise AI work better for a particular enterprise task is to fine-tune LLMs. Fine-tuning involves two key steps: curating a high-quality, task-specific dataset and training the model on it to acquire new capabilities. We have found that most enterprises simply don’t have high-quality, task-specific datasets sitting around ready to be applied to a model, creating a data bottleneck.
Various solutions have emerged to overcome this data bottleneck, including manual human curation of datasets, automatic or synthetic data generation, and hybrid approaches. Some examples of these solutions include:
- Manually or automatically enhancing data quality
- Increasing the dataset quantity
- Extracting more informative learning signals from each data sample
While each of these methods has shown promise, their relative cost-effectiveness and performance across different tasks and data constraints remain unclear, especially in the resource-constrained scenarios typical of real-world enterprise use cases.
This lack of clarity poses a significant challenge for enterprises seeking to optimize their data generation strategies for specific tasks and available resources.
How we did it:
At Scale, we perform various experiments to better understand the data layer while building custom LLMs for various domains and tasks. Recently, we investigated this question using tools from Scale GenAI Platform: we ran a controlled experiment that compares groups of synthetic data strategies, using a small set of high-quality prompts and a smarter teacher model to teach a student model as much as possible from those few prompts.
We train our models with Supervised Fine-Tuning (SFT) on a dataset of prompt-response pairs and consider three main synthetic data strategies: Answer Augmentation, Question Rephrase, and New Question. Each strategy modifies a different part of the prompt or response space.
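To make the three strategies concrete, here is a minimal Python sketch of how each one might call a teacher model. The `teacher` callable, function names, and prompt templates are illustrative assumptions, not the implementation used in the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical teacher-model interface: maps a prompt string to a completion string.
Teacher = Callable[[str], str]

def answer_augmentation(teacher: Teacher, question: str, n: int) -> List[Tuple[str, str]]:
    """Keep the seed question fixed and sample n new responses from the teacher."""
    return [(question, teacher(question)) for _ in range(n)]

def question_rephrase(teacher: Teacher, question: str, n: int) -> List[Tuple[str, str]]:
    """Ask the teacher to reword the seed question, then answer each rephrasing."""
    pairs = []
    for _ in range(n):
        rephrased = teacher(f"Rephrase this question without changing its meaning:\n{question}")
        pairs.append((rephrased, teacher(rephrased)))
    return pairs

def new_question(teacher: Teacher, seed_questions: List[str], n: int) -> List[Tuple[str, str]]:
    """Use the seed questions as in-context examples to generate brand-new questions."""
    examples = "\n".join(f"- {q}" for q in seed_questions)
    pairs = []
    for _ in range(n):
        generated = teacher(f"Here are example questions:\n{examples}\nWrite one new, similar question.")
        pairs.append((generated, teacher(generated)))
    return pairs
```

In this sketch, Answer Augmentation costs one teacher call per new training pair, while Question Rephrase and New Question cost two; that per-call accounting is what the query budget described below constrains.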
We propose a novel framework to evaluate synthetic data generation strategies under data constraints and demonstrate the effectiveness of synthetic data on tasks beyond the traditional mathematics and coding scenarios. We selected tasks from three domains: Mathematics, General Question Answering, and Text2SQL.
What we found:
We show which methods squeeze out the most accuracy (the y-axis in the paper’s figures) for different numbers of starting prompts (one figure per seed size) and different “budgets” of synthetic data (the x-axis). We also measure the “query budget ratio” q/s, where q is the query budget (the number of times we can call an LLM) and s is the initial seed instruction size, i.e., the number of task-specific instructions available at the start.
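As a quick illustration of the ratio (the numbers below are hypothetical, not taken from the paper):

```python
# Hypothetical query-budget-ratio calculation.
seed_instructions = 100   # s: task-specific seed prompts we start with
query_budget = 1_000      # q: total LLM calls we can afford for data generation

ratio = query_budget / seed_instructions
print(f"query budget ratio q/s = {ratio:g}")  # 10 LLM calls available per seed prompt
```

A low ratio corresponds to the tightly budget-constrained regime discussed next, while a high ratio means we can afford many generation calls per seed prompt.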
With a limited budget, creating new responses (Answer Augmentation) is most effective; as the budget grows, creating new prompts (New Question) yields the best results. Surprisingly, this shift in the optimal method is consistent across all three tasks.
We investigate and identify important factors affecting the performance of a fine-tuned model under constrained budgets. For example, the choice of augmentation model is a key factor for New Question but matters less for Question Rephrase.
In this study, we provide a framework to analyze the effectiveness of various synthetic data generation strategies for training LLMs under different resource constraints and task types. Our findings reveal that the optimal strategy hinges on the query budget ratio: augmenting answers to existing questions proves most effective when this ratio is low, while generating new questions becomes advantageous as the ratio increases.
We also find that the choice of augmentation strategy is less critical in data-rich scenarios, potentially leading to future cost reductions and efficiency improvements. Furthermore, question rephrasing is robust even with weaker augmentation models, highlighting the potential for cost reduction in specific scenarios.
Finally, our observations indicate that verification of synthetic responses and the specific choice of student model have comparatively little impact. These insights should guide practitioners in selecting the most suitable data generation strategies for more efficient LLM training within their specific constraints.
Conclusion
Our work addresses the fine-tuning data bottleneck challenge from a new perspective by offering a general framework that guides machine learning engineers in defining and refining their synthetic data generation strategies to maximize cost-efficiency within budget constraints.
We can further improve the model development process by balancing the cost and effectiveness of our data generation strategies while understanding what design choices play a role. We include more details of our method, experiment framework, and results in our paper, which has been accepted to the FITML Workshop @ NeurIPS 2024.
If you want to learn more about how these results can be applied in your organization, request a demo below or visit our GenAI Solutions page.