How to Fine-Tune Llama 2 With LLM Engine

Published July 21, 2023

Earlier this week, Meta announced the release of Llama 2. Llama 2 includes both a base pre-trained model and a model fine-tuned for chat, each available in three sizes. While the performance of the pre-trained model is impressive, fine-tuning the base Llama-2 model can unlock even greater performance on most language tasks. This process can be quite complicated, requiring everything from reserving the right compute resources to correctly implementing the training and inference code.

Scale's new open-source LLM Engine repository makes this process easy and accessible. Let's dive into how you can use this API and explore some examples.

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="s3://...",
)


You can also follow the code examples in this notebook.


For this example, we will use ScienceQA, a popular dataset consisting of a diverse set of multiple-choice science questions. Each question may have textual context and image context and contains a thorough explanation and lecture supporting the solution.

An example sample from ScienceQA

Currently, LLM Engine supports fine-tuning on prompt-completion pairs. Let's first convert this dataset into the supported format, a CSV with two columns: prompt and response.

Before you get started, install the required dependencies.

pip install datasets==2.13.1 "smart_open[s3]==5.2.1" pandas==1.4.4

We can load the dataset from Hugging Face, and observe the dataset's features.

from datasets import load_dataset
from smart_open import smart_open
import pandas as pd

dataset = load_dataset('derek-thomas/ScienceQA')
print(dataset['train'].features)  # inspect column names and types

A commonly used format for feeding ScienceQA examples to a model is:

Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.
Question: Which type of force from the baby's hand opens the cabinet door?
Options: (A) pull (B) push
Answer: A.

Since the options in the Hugging Face dataset are stored as a plain list of possible answers, we need to convert this list into the example format above by adding the enumeration prefixes.

choice_prefixes = [chr(ord('A') + i) for i in range(26)] # A-Z
def format_options(options, choice_prefixes):
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])
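As a quick check, the helper can be run on the two options from the cabinet-door question above. This is a small sketch that re-declares the helper so it runs on its own; nothing here goes beyond the code already shown.

```python
# Enumeration prefixes A-Z, as defined in the walkthrough.
choice_prefixes = [chr(ord('A') + i) for i in range(26)]

def format_options(options, choice_prefixes):
    # Pair each option with its letter, e.g. 'pull' -> '(A) pull'.
    return ' '.join(f'({c}) {o}' for c, o in zip(choice_prefixes, options))

formatted = format_options(['pull', 'push'], choice_prefixes)
print(formatted)  # (A) pull (B) push
```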

Now, let's write the formatting function to turn a single sample from this dataset into a prompt and response to feed into the model.

def format_prompt(r, choice_prefixes):
    options = format_options(r['choices'], choice_prefixes)
    return f'''Context: {r["hint"]}\nQuestion: {r["question"]}\nOptions: {options}\nAnswer:'''

def format_response(r, choice_prefixes):
    return choice_prefixes[r['answer']]

Finally, let's construct the dataset. Note that some samples in ScienceQA only have an image for context - we'll be skipping those in this example as Llama-2 is purely a language model, and cannot accept image inputs.

def convert_dataset(ds):
    # Skip samples with no textual context (image-only questions)
    prompts = [format_prompt(i, choice_prefixes) for i in ds if i['hint'] != '']
    labels = [format_response(i, choice_prefixes) for i in ds if i['hint'] != '']
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df
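To make the target file layout concrete, here is a minimal sketch that writes toy records to an in-memory CSV using only the standard library. The records and the `rows_to_csv` helper are hypothetical stand-ins for the pandas-based conversion; only the `prompt` and `response` column names and the empty-`hint` filter come from the text.

```python
import csv
import io

def rows_to_csv(rows):
    """Write (prompt, response) pairs as a two-column CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(['prompt', 'response'])  # the two columns described above
    writer.writerows(rows)
    return buf.getvalue()

# Toy records (hypothetical). The second has no textual context and is
# skipped, mirroring the `hint != ''` filter in convert_dataset.
samples = [
    {'hint': 'A door opens.', 'prompt': 'Context: A door opens.\nAnswer:', 'response': 'A'},
    {'hint': '', 'prompt': 'Context: \nAnswer:', 'response': 'B'},
]
rows = [(s['prompt'], s['response']) for s in samples if s['hint'] != '']
csv_text = rows_to_csv(rows)
print(csv_text)
```

Note that prompts contain newlines, so the CSV writer quotes them; any standard CSV reader will recover the original multi-line prompt.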

LLM Engine supports training with a training and validation dataset. If you only provide a training dataset, LLM Engine will randomly hold out 10% of it for validation. Evaluating on held-out data lets you detect when the model is overfitting to the training data, which would otherwise result in poor generalization to the live data it sees during inference.
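For intuition, a random 10% hold-out can be sketched in a few lines. This is not LLM Engine's own splitting code, which is not shown here; `train_val_split`, its parameters, and the fixed seed are illustrative assumptions.

```python
import random

def train_val_split(rows, val_fraction=0.1, seed=0):
    """Shuffle a copy of `rows` and hold out `val_fraction` of it for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

rows = [f'sample-{i}' for i in range(100)]
train_rows, val_rows = train_val_split(rows)
print(len(train_rows), len(val_rows))  # 90 10
```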

These dataset files must be stored at a publicly accessible URL so that LLM Engine can read them. For this example, we will save the datasets to S3. We have also made the preprocessed train dataset and validation dataset publicly available as GitHub Gists, so you can directly replace train_url and val_url with these links.

train_url = 's3://...'
val_url = 's3://...'

df_train = convert_dataset(dataset['train'])
with smart_open(train_url, 'wb') as f:
    df_train.to_csv(f, index=False)  # two columns only: prompt, response

df_val = convert_dataset(dataset['validation'])
with smart_open(val_url, 'wb') as f:
    df_val.to_csv(f, index=False)
Now, we can start fine-tuning via the LLM Engine API.


First, we need to install LLM Engine.

pip install scale-llm-engine

Next, you'll need to set up your Scale API key. Follow the instructions in the README to acquire your unique API key. Advanced users can also follow the guide for self-hosting LLM Engine to avoid needing a Scale API key.

import os
os.environ['SCALE_API_KEY'] = 'xxx'

Once you have everything set up, fine-tuning your model requires only one API call. Today, we'll be working with the 7 billion parameter version of Llama-2, which proves powerful enough for most use cases.

from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b", training_file=train_url, validation_file=val_url
)
run_id = response.fine_tune_id

With the run_id, you can monitor the status of your job and get live updates on per-epoch metrics, like train and validation loss.

ScienceQA is a large dataset, so it may take an hour or two for training to complete.

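import time

while True:
    job_status = FineTune.get(run_id).status
    # Statuses include `PENDING`, `STARTED`, `RUNNING`, and `SUCCESS`
    print(job_status)
    if job_status not in ('PENDING', 'STARTED', 'RUNNING'):
        break  # finished: `SUCCESS` or another terminal status
    time.sleep(60)  # poll once a minute

# Logs for completed or running jobs can be fetched with
logs = FineTune.get_events(run_id)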
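Since training can run for an hour or more, a polling loop with a timeout is a reasonable safeguard. The sketch below takes the status lookup as a callable so it runs offline; `wait_for_job` and its parameters are hypothetical, and in practice you would pass something like `lambda: FineTune.get(run_id).status`.

```python
import time

def wait_for_job(get_status, poll_seconds=60.0, timeout_seconds=3 * 60 * 60,
                 in_progress=('PENDING', 'STARTED', 'RUNNING')):
    """Poll `get_status()` until it leaves the in-progress states or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in in_progress:
            return status  # 'SUCCESS' or any other terminal status
        if time.monotonic() >= deadline:
            raise TimeoutError(f'job still {status} after {timeout_seconds}s')
        time.sleep(poll_seconds)

# Stand-in for the real status call, so the sketch runs without the API.
fake_statuses = iter(['PENDING', 'RUNNING', 'RUNNING', 'SUCCESS'])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)  # SUCCESS
```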

Inference & Evaluation

Once you are done fine-tuning, you can start generating responses to any input. However, before that, let's ensure the model exists and is ready to accept inputs.

ft_model = FineTune.get(run_id).fine_tuned_model

It may take a few minutes for your first inference call to be served. After that, inference should be relatively quick.

Now, let's evaluate the performance of our Llama-2 model fine-tuned on ScienceQA.

import pandas as pd
from llmengine import Completion

# Helper function to get outputs from the fine-tuned model, with retries
def get_output(prompt: str, num_retry: int = 5):
    for _ in range(num_retry):
        try:
            response = Completion.create(
                model=ft_model,
                prompt=prompt,
                max_new_tokens=1,  # assumed: the answer is a single letter
                temperature=0.01,  # assumed: near-greedy decoding
            )
            return response.output.text.strip()
        except Exception as e:
            print(e)
    return ""

# Read the test data
test = pd.read_csv(val_url)
test["prediction"] = test["prompt"].apply(get_output)
print(f"Accuracy: {(test['response'] == test['prediction']).mean() * 100:.2f}%")
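The retry-and-score logic can be tried without calling the API. This is a minimal offline sketch: `with_retries`, `accuracy`, and the deliberately flaky fake model are hypothetical stand-ins for `get_output` and the pandas comparison above.

```python
def with_retries(fn, num_retry=5):
    """Call `fn()` up to `num_retry` times, returning '' if every attempt fails."""
    for _ in range(num_retry):
        try:
            return fn()
        except Exception as exc:
            print(f'retrying after error: {exc}')
    return ''

def accuracy(labels, predictions):
    """Percentage of predictions that exactly match their label."""
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    return 100.0 * correct / len(labels)

# Fake model (hypothetical): fails on its first call, then always answers 'A'.
calls = {'n': 0}
def flaky_model():
    calls['n'] += 1
    if calls['n'] == 1:
        raise RuntimeError('temporary failure')
    return 'A'

labels = ['A', 'B', 'A', 'A']
predictions = [with_retries(flaky_model) for _ in labels]
print(f'Accuracy: {accuracy(labels, predictions):.2f}%')  # Accuracy: 75.00%
```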

This achieves 82.15% accuracy, which is pretty good! Let's see how this compares with the original base Llama-2 model. Since the pre-trained model was not fine-tuned on these examples, we need to provide an example in the prompt so the model learns to adhere to the format we expect from the responses. We also compare against a fine-tuned model of similar size, Mosaic's MPT.

*Results from MM-COT paper.
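For the few-shot baseline, the prompt needs a solved example in front of each question. The exact prompt used for the comparison is not shown in this post, so the following is only a sketch of the idea; `make_few_shot_prompt` is a hypothetical helper, and the solved example reuses the cabinet-door question from earlier.

```python
# One solved example, taken from the sample question shown earlier in this post.
FEW_SHOT_EXAMPLE = (
    "Context: A baby wants to know what is inside of a cabinet. "
    "Her hand applies a force to the door, and the door opens.\n"
    "Question: Which type of force from the baby's hand opens the cabinet door?\n"
    "Options: (A) pull (B) push\n"
    "Answer: A"
)

def make_few_shot_prompt(prompt, example=FEW_SHOT_EXAMPLE):
    """Prepend a solved example so a base model can imitate the answer format."""
    return f'{example}\n\n{prompt}'

# Placeholder query in the same layout; real prompts come from the dataset.
query = 'Context: ...\nQuestion: ...\nOptions: (A) ... (B) ...\nAnswer:'
few_shot_prompt = make_few_shot_prompt(query)
print(few_shot_prompt)
print(len(query), len(few_shot_prompt))  # the few-shot prompt is much longer
```

The length comparison illustrates why few-shot prompting costs more per request than querying a fine-tuned model with the bare question.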

As we can see, fine-tuning makes a huge difference for both language models. Fine-tuning Llama-2 on ScienceQA yielded an absolute accuracy gain of 26.59 percentage points! In addition, inference with a fine-tuned model is cheaper than few-shot prompting, because the prompts are shorter. This fine-tuned Llama-2-7B model also outperforms GPT-3.5, a 175 billion parameter model!

We can also see that the new Llama-2 model outperforms MPT in both the fine-tuned and few-shot prompting settings, showcasing its strength as both a base and fine-tunable model.

As a side experiment, we also used LLM Engine to fine-tune and evaluate Llama-2's performance on several tasks from GLUE, a set of commonly used NLP benchmark datasets.


Fine-tuning makes a huge difference in unlocking greater performance from base pre-trained models. Now you are all set to unleash the true potential of your fine-tuned model and witness the magic of powerful AI-generated responses. Happy fine-tuning! 🚀
