How to Fine-Tune GPT-3.5 Turbo With OpenAI API

September 5, 2023

OpenAI recently announced the release of their GPT-3.5 Turbo fine-tuning APIs. At Scale, we have always believed that building custom LLMs through fine-tuning is the key to unlocking greater performance for any given organization’s specific use case. OpenAI also named Scale as their preferred enterprise fine-tuning partner for GPT-3.5, and we have already shown early results demonstrating the power of fine-tuning GPT-3.5 for customers like Brex. While the pre-trained GPT-3.5 model can often solve tasks with prompt engineering alone, the model becomes even more powerful when fine-tuned, in some cases surpassing GPT-4 in performance. In this blog post, we’ll walk you through how to fine-tune GPT-3.5.

The first and one of the most critical steps in fine-tuning is creating the right training dataset. OpenAI recommends creating a diverse training set of conversations that are similar to the conversations that the model will see in production. These datasets can be created with Scale’s Data Engine, which is trusted by leading ML teams and enterprises to provide large volumes of high-quality data. Scale has worked with OpenAI since 2019 on powering LLMs with better data. Scale's Data Engine has powered most of the leading LLMs, and is now also powering custom LLMs for leading companies like Brex, Chegg, and Accenture. Our cost-effective operations can give you expertly labeled and diverse data at any scale, making it an indispensable asset for fine-tuning.

Fine-tuning also requires reserving the right compute resources and properly implementing the training and inference code. OpenAI’s new fine-tuning API makes this process easy and accessible. Let's dive into how you can use this API and explore some examples.

Dataset Preparation

For this example, we will use ScienceQA, a popular dataset consisting of a diverse set of multiple-choice science questions. We used LLM Engine to fine-tune Llama 2 on this dataset in our previous blog post, and you can follow the data preparation steps described there.

Now, let's convert this dataset into OpenAI’s supported format, a JSONL file with lists of conversations. Each example in the dataset should be a conversation in the same format as the chat completions endpoint, specifically a list of messages where each message has a role and content (and optionally a name). The provided assistant messages in the data should be the ideal responses you want the model to provide.

import json

def format_chat(row):
    # Convert one dataset row into a chat-format training example.
    return json.dumps(
        {"messages": [
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": str(row["response"])},
        ]}
    )

def convert_dataset(df, file_name):
    # Write one JSON-encoded conversation per line (JSONL).
    df["conversation"] = df.apply(format_chat, axis=1)
    with open(file_name, 'w') as jsonl_file:
        for example in df["conversation"]:
            jsonl_file.write(example + '\n')
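
Assuming the data preparation steps left you with train and validation DataFrames containing "prompt" and "response" columns (the variable names below are just for illustration), converting both splits is a one-liner each:

convert_dataset(train_df, "scienceqa_train.jsonl")
convert_dataset(val_df, "scienceqa_val.jsonl")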

This results in our dataset looking like this:

{"messages:" [{"role": "user", "content": "Context: In a group of cows, ..."}, 
 {"role": "assistant", "content": "B"}]}
["messages": {"role": "user", "content": "Context: In a group of guppies..."},   
 {"role": "assistant", "content": "A"}]}

The OpenAI API supports training with both a training and a validation dataset, and provides loss numbers on both over the course of training. These numbers are your initial signal of how much the model is improving, and can be complemented by comparing the fine-tuned model’s generations to the base model’s on the same test conversations.

These dataset files must be uploaded to OpenAI’s file endpoint. Make sure to add your OpenAI API key to your system environment variables for authentication.

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
file = openai.File.create(
    file=open("scienceqa_train.jsonl", "rb"),
    purpose='fine-tune',
)

You should get a response similar to the following:

{
  "object": "file",
  "id": "file-CY0FPBluqbVcoHmuGLI80lqx",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 15974,
  "created_at": 1692743745,
  "status": "uploaded",
  "status_details": null
}
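
The validation file can be uploaded in exactly the same way:

val_file = openai.File.create(
    file=open("scienceqa_val.jsonl", "rb"),
    purpose='fine-tune',
)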

You can then check the status of the uploaded file using the provided file ID. 

openai.File.retrieve("file-CY0FPBluqbVcoHmuGLI80lqx")

You need to make sure that the file status is “processed” before proceeding with fine-tuning.
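
A minimal polling sketch (the five-second interval is arbitrary):

import time

# Wait until OpenAI has finished validating the uploaded file.
while openai.File.retrieve(file.id)["status"] != "processed":
    time.sleep(5)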

Fine-tuning

Fine-tuning your model requires only one API call. Currently, only GPT-3.5 Turbo ("gpt-3.5-turbo") is available for fine-tuning. You just need to provide the IDs of the training and validation data files, the model name, and an optional suffix for the output model name.

train_data = "file-CY0FPBluqbVcoHmuGLI80lqx"
val_data = "file-428ptqQcofNSLoEP0sUDsU5x"
model = openai.FineTuningJob.create(
    model="gpt-3.5-turbo",
    training_file=train_data,
    validation_file=val_data,
    suffix="scienceqa",
)

You would then get a response such as this:

{
  "object": "fine_tuning.job",
  "id": "ftjob-NmNcHalWN6yGJfGnwealj7MB",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692743811,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-xxx",
  "result_files": [],
  "status": "created",
  "validation_file": "file-428ptqQcofNSLoEP0sUDsU5x",
  "training_file": "file-CY0FPBluqbVcoHmuGLI80lqx",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null
}

There are several ways to experiment further; for example, increasing the number of epochs to 4 or 5 may yield different results.
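
At the time of writing, the API accepts an optional hyperparameters argument for this. A sketch of a run with one extra epoch:

model = openai.FineTuningJob.create(
    model="gpt-3.5-turbo",
    training_file=train_data,
    validation_file=val_data,
    suffix="scienceqa",
    hyperparameters={"n_epochs": 4},  # override the default of 3
)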

You can track the progress of the training job using the fine-tuning job ID.

openai.FineTuningJob.retrieve('ftjob-NmNcHalWN6yGJfGnwealj7MB')

It will have the status “running” while it is training, “succeeded” once it has finished, or “failed” if it failed. A successful training job will look like this:

{
  "object": "fine_tuning.job",
  "id": "ftjob-NmNcHalWN6yGJfGnwealj7MB",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692743811,
  "finished_at": 1692744189,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:scale-ai:test:7qUNXuwT",
  "organization_id": "org-Vn7k74eir0QoobuD5SrHfwSD",
  "result_files": ["file-hNvzZShh5Phmng1oNb1Qs4sy"],
  "status": "succeeded",
  "validation_file": "file-428ptqQcofNSLoEP0sUDsU5x",
  "training_file": "file-CY0FPBluqbVcoHmuGLI80lqx",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 9909
}
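
You can also follow the job’s progress step by step through its event stream:

events = openai.FineTuningJob.list_events(id="ftjob-NmNcHalWN6yGJfGnwealj7MB", limit=10)
for event in events["data"]:
    print(event["message"])  # e.g. per-step loss updates and status changes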

Inference and Evaluation

Once you are done fine-tuning, you can view your job results. The training job will automatically upload a result file named step_metrics.csv, which contains training and validation loss and accuracy for each training step.

We can use the training metrics as a rough sanity check on training stability:

from io import StringIO
import pandas as pd

def get_step_metrics(file_id):
    # Download the result file and load the per-step metrics into a DataFrame.
    content = openai.File.download(file_id)
    eval_result = StringIO(content.decode())
    df = pd.read_csv(eval_result, sep=",")
    return df
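
For example, using the result file ID from the finished job above:

metrics_df = get_step_metrics("file-hNvzZShh5Phmng1oNb1Qs4sy")
print(metrics_df.tail())  # loss and accuracy for the final training steps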

However, inspecting the model samples gives the most relevant sense of model quality. We can run inference with our fine-tuned model through the chat completions API.

completion = openai.ChatCompletion.create(
    model="ft:gpt-3.5-turbo-0613:scale-ai:scienceqa:7qUNXuwT",
    messages=[
        {"role": "user", "content": "Context: A shopper is buying food at..."}
    ]
)

You will get a result like this:

{
  "id": "chatcmpl-7qUvsuFrButlxKYSs2SispgL5KUrD",
  "object": "chat.completion",
  "created": 1692746316,
  "model": "ft:gpt-3.5-turbo-0613:scale-ai:scienceqa:7qUNXuwT",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "B"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 81,
    "completion_tokens": 1,
    "total_tokens": 82
  }
}
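
The predicted answer itself can be pulled straight out of the response object:

print(completion.choices[0].message.content)  # "B"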

Results

From this fine-tuning job, the model achieves 84.58% accuracy. This is pretty good! Let's see how this compares with the original base GPT-3.5 model. Since the pre-trained model was not fine-tuned on these examples, we need to provide a worked example in the prompt so the model adheres to the response format we expect.
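
A minimal sketch of such a 1-shot evaluation call (the worked example shown here is illustrative, not an actual dataset row):

one_shot_example = [
    {"role": "user", "content": "Context: ... Question: ... Options: (A) ... (B) ..."},
    {"role": "assistant", "content": "A"},
]
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    # Prepend the worked example so the base model answers with a letter only.
    messages=one_shot_example + [
        {"role": "user", "content": "Context: A shopper is buying food at..."}
    ],
)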

| ScienceQA Validation | GPT-3.5 | GPT-4 |
|----------------------|---------|-------|
| Pre-trained (1-shot) | 72.98%  | 74.8% |
| Fine-tuned           | 84.58%  | -     |

As we can see, fine-tuning makes a big difference. The gain from fine-tuning GPT-3.5 Turbo on ScienceQA was 11.6 percentage points in absolute terms, even outperforming 1-shot GPT-4!

We also experimented with different numbers of training examples. OpenAI recommends starting with 50 to 100 examples, but this can vary based on the exact use case. You can roughly estimate the expected quality gain from doubling the training data size by fine-tuning on half or a quarter of your dataset and observing the quality gap. For this experiment, we randomly sampled between 50 and 5,000 examples for our training set, and used the same validation set for all models.
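
Subsets like these can be drawn with pandas before converting to JSONL; a sketch, reusing the hypothetical train_df from earlier:

for n in [50, 100, 500, 1000, 2500, 5000]:
    subset = train_df.sample(n=n, random_state=42)  # fixed seed for reproducibility
    convert_dataset(subset, f"scienceqa_train_{n}.jsonl")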

| # Training samples | Test accuracy |
|--------------------|---------------|
| 50                 | 77.18%        |
| 100                | 78.23%        |
| 500                | 81.10%        |
| 1000               | 81.96%        |
| 2500               | 84.44%        |
| 5000               | 84.58%        |


We can see that performance generally increases as the number of training samples grows, but plateaus at some point. However, we sampled our dataset randomly; adding well-crafted training examples that target remaining model issues or edge cases can improve the model’s performance further.

Currently, only the number of epochs is available for customization. Other hyperparameters are set to what OpenAI believes are good defaults that should work well across a wide variety of datasets and scenarios.

If you would like higher performance, OpenAI recommends collecting higher quality and/or more data. You may need to consider the balance and diversity of your current training dataset, collect more examples that target edge cases, and ensure there are no lingering issues in the existing examples. The Scale Data Engine can deliver large volumes of high-quality data that can maximize the performance of your fine-tuned models.

Conclusion

Fine-tuning GPT-3.5 makes a huge difference in unlocking greater performance from the base model. Now you are all set to unleash the true potential of your fine-tuned model and see the power of AI-generated responses for yourself. If you are looking to optimize your model performance with even more powerful features, Scale also offers an enterprise model customization platform. Happy fine-tuning!
