Earlier this week, Meta announced the release of Llama 2. Llama 2 includes both a base pre-trained model and a fine-tuned model for chat, each available in three sizes. While the performance of the pre-trained model is impressive, fine-tuning the base Llama-2 model can unlock even greater performance on most language tasks. This process can be quite complicated, from reserving the right compute resources to properly implementing the training and inference code.
Scale's new open-source LLM Engine repository makes this process easy and accessible. Let's dive into how you can use this API and explore some examples.
from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file="s3://my-bucket/path/to/training-file.csv",
)

print(response.json())
You can also follow the code examples in this notebook.
Dataset
For this example, we will use ScienceQA, a popular dataset consisting of a diverse set of multiple-choice science questions. Each question may include textual and image context, and comes with a thorough explanation and lecture supporting the solution.
[Image: an example sample from ScienceQA]
Currently, LLM Engine supports fine-tuning on prompt-completion pairs. Let's first convert this dataset into the supported format: a CSV with two columns, prompt and response.
Before you get started, install the required dependencies.
pip install datasets==2.13.1 smart_open[s3]==5.2.1 pandas==1.4.4
We can load the dataset from Hugging Face and inspect its features.
from datasets import load_dataset
from smart_open import smart_open
import pandas as pd
dataset = load_dataset('derek-thomas/ScienceQA')
dataset['train'].features
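Before writing any formatting code, it helps to peek at a single example and the fields we'll rely on below (question, choices, answer, and hint):

# Peek at one training example and the fields used in this walkthrough
sample = dataset['train'][0]
print(sample['question'])
print(sample['choices'])   # list of answer options
print(sample['answer'])    # index of the correct option in `choices`
print(sample['hint'])      # textual context; empty for image-only samples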
A commonly used format for feeding ScienceQA examples to a model is:
Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.
Question: Which type of force from the baby's hand opens the cabinet door?
Options: (A) pull (B) push
Answer: A.
Since the options field in the Hugging Face dataset is just a list of possible answers, we need to convert this list into the example format above by adding the enumeration prefixes.
choice_prefixes = [chr(ord('A') + i) for i in range(26)]  # A-Z

def format_options(options, choice_prefixes):
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])
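For instance, running this on the options from the cabinet example above produces the enumerated string we want:

# Example usage with the options from the sample question above
print(format_options(['pull', 'push'], choice_prefixes))
# -> (A) pull (B) push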
Now, let's write the formatting functions that turn a single sample from this dataset into a prompt and response pair to feed into the model.
def format_prompt(r, choice_prefixes):
    options = format_options(r['choices'], choice_prefixes)
    return f'''Context: {r["hint"]}\nQuestion:{r["question"]}\nOptions:{options}\nAnswer:'''

def format_response(r, choice_prefixes):
    return choice_prefixes[r['answer']]
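To see both functions in action, here's a quick check against a hand-constructed row that mimics the cabinet example from earlier (the dict is illustrative, not a real dataset record):

# A hand-built row mimicking a ScienceQA sample, for illustration only
row = {
    'hint': 'A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.',
    'question': "Which type of force from the baby's hand opens the cabinet door?",
    'choices': ['pull', 'push'],
    'answer': 0,
}
print(format_prompt(row, choice_prefixes))
print(format_response(row, choice_prefixes))  # -> A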
Finally, let's construct the dataset. Note that some samples in ScienceQA only have an image for context; we'll skip those in this example, since Llama-2 is purely a language model and cannot accept image inputs.
def convert_dataset(ds):
    prompts = [format_prompt(i, choice_prefixes) for i in ds if i['hint'] != '']
    labels = [format_response(i, choice_prefixes) for i in ds if i['hint'] != '']
    df = pd.DataFrame.from_dict({'prompt': prompts, 'response': labels})
    return df
LLM Engine supports training with both a training and a validation dataset. If you only provide a training dataset, LLM Engine randomly splits off 10% of it for validation. Holding out a validation set lets you detect when the model is overfitting on your training data, which would result in poor generalization to live data it might see during inference.
These dataset files must be stored at a publicly accessible URL so that LLM Engine can read them. For this example, we will save the datasets to S3. We also have the preprocessed train dataset and validation dataset publicly available as GitHub Gists; you can directly replace train_url and val_url with those links.
train_url = 's3://...'
val_url = 's3://...'

df_train = convert_dataset(dataset['train'])
with smart_open(train_url, 'wb') as f:
    df_train.to_csv(f)

df_val = convert_dataset(dataset['validation'])
with smart_open(val_url, 'wb') as f:
    df_val.to_csv(f)
Now, we can start fine-tuning via the LLM Engine API.
Fine-Tuning
First, we need to install LLM Engine.
pip install scale-llm-engine
Next, you'll need to set up your Scale API key. Follow the instructions in the README to acquire your unique API key. Advanced users can also follow the guide for self-hosting LLM Engine to avoid needing a Scale API key.
import os
os.environ['SCALE_API_KEY'] = 'xxx'
Once you have everything set up, fine-tuning your model requires only one API call. Today, we'll be working with the 7 billion parameter version of Llama-2, which proves powerful enough for most use cases.
from llmengine import FineTune

response = FineTune.create(
    model="llama-2-7b",
    training_file=train_url,
    validation_file=val_url,
    hyperparameters={
        'lr': 2e-4,
    },
    suffix='science-qa-llama',
)

run_id = response.fine_tune_id
With the run_id, you can monitor the status of your job and get live updates on per-epoch metrics, like train and validation loss.
ScienceQA is a large dataset, so it may take an hour or two for training to complete.
import time

while True:
    job_status = FineTune.get(run_id).status
    # Returns one of `PENDING`, `STARTED`, `SUCCESS`, `RUNNING`,
    # `FAILURE`, `CANCELLED`, `UNDEFINED` or `TIMEOUT`
    print(job_status)
    if job_status == 'SUCCESS':
        break
    time.sleep(60)
# Logs for completed or running jobs can be fetched with
logs = FineTune.get_events(run_id)
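You can then print the fetched events. Note that the exact shape of the returned objects is an assumption here, so verify the field names against your installed version of the client:

# Print log events; `events` and `message` are assumed field names --
# check them against your version of the LLM Engine client
for event in logs.events:
    print(event.message)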
Inference & Evaluation
Once you are done fine-tuning, you can start generating responses to any input. However, before that, let's ensure the model exists and is ready to accept inputs.
ft_model = FineTune.get(run_id).fine_tuned_model
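Before running the full evaluation, it's worth sanity-checking the endpoint with a single completion. This sketch reuses the same Completion API that the evaluation below relies on; the prompt placeholder is illustrative:

from llmengine import Completion

# One quick test completion against the fine-tuned model
response = Completion.create(
    model=ft_model,
    prompt='Context: ...\nQuestion: ...\nOptions: (A) ... (B) ...\nAnswer:',
    max_new_tokens=1,
    temperature=0.01,
)
print(response.output.text)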
It may take a few minutes for your first inference call to be served. After that, inference should be relatively quick.
Now, let's evaluate the performance of our Llama-2 model fine-tuned on ScienceQA.
import pandas as pd
from llmengine import Completion

# Helper function to get outputs from the fine-tuned model, with retries
def get_output(prompt: str, num_retry: int = 5):
    for _ in range(num_retry):
        try:
            response = Completion.create(
                model=ft_model,
                prompt=prompt,
                max_new_tokens=1,
                temperature=0.01
            )
            return response.output.text.strip()
        except Exception as e:
            print(e)
    return ""
# Read the test data
test = pd.read_csv(val_url)
test["prediction"] = test["prompt"].apply(get_output)
print(f"Accuracy: {(test['response'] == test['prediction']).mean() * 100:.2f}%")
The fine-tuned model achieves 82.15% accuracy, which is pretty good! Let's see how this compares with the original base Llama-2 model. Since the pre-trained model was not fine-tuned on these examples, we need to provide an example in the prompt so the model learns to adhere to the format we expect in its responses. We'll also see how both compare against fine-tuning a model of similar size, MosaicML's MPT.
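A one-shot prompt for the base model might look like the sketch below: a fully worked example followed by the actual question, so the model can infer that it should answer with a single letter. (This is an illustration; the exact template we used is not reproduced here.)

# Illustrative one-shot prompt for the base (non-fine-tuned) model;
# fill in the placeholders with str.format for each test row
few_shot_prompt = '''Context: A baby wants to know what is inside of a cabinet. Her hand applies a force to the door, and the door opens.
Question: Which type of force from the baby's hand opens the cabinet door?
Options: (A) pull (B) push
Answer: A

Context: {hint}
Question: {question}
Options: {options}
Answer:'''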
[Table: ScienceQA accuracy for Llama-2 and MPT in the fine-tuned and few-shot settings. *Results from MM-CoT paper.]
As we can see, fine-tuning makes a huge difference for both language models. Fine-tuning Llama-2 on ScienceQA yielded a 26.59% absolute improvement in accuracy (putting the few-shot baseline at roughly 55.56%)! This comes on top of the fact that inference with a fine-tuned model is cheaper than few-shot prompting, thanks to the shorter prompt length. This fine-tuned Llama-2-7B model even outperforms GPT-3.5, a 175-billion-parameter model!
We can also see that the new Llama-2 model outperforms MPT in both the fine-tuned and few-shot prompting settings, showcasing its strength as both a base and fine-tunable model.
As a side experiment, we also used LLM Engine to fine-tune and evaluate Llama-2's performance on several tasks from GLUE, a set of commonly used NLP benchmark datasets.
Conclusion
Fine-tuning makes a huge difference in unlocking greater performance from base pre-trained models. Now you are all set to unleash the true potential of your fine-tuned model and witness the magic of powerful AI-generated responses. Happy fine-tuning! 🚀