Guide to Large Language Models
Large language models (LLMs) are transforming how we create, understand our world, and how we work. We created this guide to help you understand what LLMs are and how you can use these models to unlock the power of your data and accelerate your business.
What are Large Language Models?
Large language models (LLMs) are machine learning models trained on massive amounts of text data that can classify, summarize, and generate text. LLMs such as OpenAI’s GPT-4, Google’s PaLM 2, Cohere’s Command model, and Anthropic’s Claude, have demonstrated the ability to generate human-like text, often with impressive coherence and fluency. Until the arrival of ChatGPT, the most well-known examples of large language models were GPT-3 and BERT, which have been trained on vast amounts of text data from the internet and other sources. Generally, LLMs are capable of a wide variety of natural language processing (NLP) applications, including copywriting, content summarization, code generation and debugging, chatbots, question answering, and translation.
At a high level, large language models are language prediction models. These models aim to predict the most likely next word given the words provided as input to the model, also called prompts. These models generate text one word at a time based on a statistical analysis of all the “tokens” they have ingested during training (tokens are strings of characters that are combined to form words). LLMs have a wide array of capabilities and applications that we will explore in this guide.
Despite the significant progress in making these models more capable and widely accessible, many organizations are still uncertain about how to adopt them properly. From the Scale Zeitgeist 2023 report, we found that while most respondents (60%) are experimenting with generative models or plan on working with them in the next year, only 21% have these models in production. Many organizations cited a lack of the software and tools, expertise, and changing company culture as key challenges to adoption. We wrote this guide to help you get a better understanding of large language models and how you can start adopting them for your use cases.
Why are Large Language Models important?
Large language models have revolutionized natural language processing and have a wide range of applications. These models are transforming how we create, understand our world, and conduct business. Large language models help us write content like blogs, emails, or ad copy more quickly and creatively. They enable developers to write code more efficiently and help them find bugs in large code bases. Developers can also integrate their applications with LLMs using English-language prompts without needing a machine learning background, accelerating innovation. Large language models summarize long-form content so that we can quickly understand the most critical information from reports, news articles, and company knowledge bases. Chatbots are finally living up to their promise of enabling businesses to streamline operations while improving customer service.
Large Language Models are more available to a wider audience than ever, as companies such as OpenAI, Anthropic, Google, Stability AI, and Cohere, provide APIs or open-source models to be used by the larger community. Additionally, the talent pool of machine learning engineers is growing and new roles such as "prompt engineer" are becoming popular (source).
Due to the large amount of data they have been trained on, large language models generalize to a wide range of tasks and styles. These models can be given an example of a problem and are then able to solve problems of a similar type. However, out of the box, these models are poor specialists. To take full advantage of LLMs, businesses need to fine-tune models on their proprietary data. For example, consider a financial services company looking to perform investment research. As base models only have access to outdated publicly available data, they will provide generic information about stocks or other assets but often will be unable or will flat-out refuse to provide investment advice. Alternatively, a fine-tuned model with access to private research reports and databases is able to provide unique investment insights that can lead to higher productivity and investment returns.
When used properly, LLMs help organizations to empower their employees, increase their efficiency, and are the foundation for better customer experience. We will now explore how these models work and how to deploy them properly to maximize the benefits for your business.
Common Use Cases for Large Language Models
LLMs are capable of a wide variety of tasks, the most common of which we will outline here. We will also discuss some domain-specific tasks for a select few industries.
Classification and Content Moderation
Large language models can perform a wide range of natural language processing tasks, including classification tasks. Classification is the process of assigning a given input to one or multiple predefined categories or classes. For example, a model might be trained to classify a sentence as either positive or negative in sentiment. Beyond sentiment analysis, LLMs can be used to detect the reasons for customers' calls (no more needing to sit through long phone menus to get to the right agent) or properly organize user feedback between UX suggestions, bug reports, or feature requests.
Content moderation is also a common application of LLMs classification power. A common use case is flagging if users are posting toxic and inappropriate content. LLM can be fine-tuned to quickly adapt to new policies, making them highly versatile tools for content moderation.
Classification: Classify each statement as "Bearish", "Bullish", or "Neutral"
Text Generation
One of the most impressive capabilities of large language models is their ability to generate human-like text. Large language models can produce coherent and well-written prose on almost any topic in an instant. This ability makes them a valuable tool for a variety of applications, such as automatically generating responses to customer inquiries or even creating original content for social media posts. Users can request that the response is written in a specific tone, from humorous to professional, and can mimic the writing styles of authors such as William Shakespeare or Dale Carnegie.
Text Extraction
Large language models can also extract key information from unstructured text. This can be particularly helpful for search applications or more real-time use cases like call center optimization, such as automatically parsing a customer's name and address without a structured input. LLMs are particularly adept at text extraction because they importantly understand the context of words and phrases and can filter extraneous information from important details.
Summarization
Large language models can also perform text summarization, which is the process of creating a concise summary of a given piece of text that retains its key information and ideas. This can be useful for analyzing financial statements, historical market data, and other proprietary data sources and providing a summary of the documents for financial analysts. Text extraction combined with text summarization is a very powerful combination. For companies or industries with a large corpus of written data, these two properties can be combined to retrieve relevant information, such as unstructured text or knowledge in a structured database, and then summarize it accordingly for human consumption while also citing the specific source.
Question Answering
LLMs can retrieve data from knowledge bases, documents, or user-supplied text to answer specific questions, such as what types of products to select or what assets provide the highest returns. When a model is fine-tuned on domain-specific data and refined with RLHF, these models can answer questions incredibly accurately. For example, consider an eCommerce chatbot:
Search
Search engines like Bing, Google, and You.com already have or will incorporate LLMs into their search engines. Companies are also looking to implement LLMs as a form of enterprise search. While base foundation models are unreliable for citing facts, they summarize search results well. As we highlight throughout this guide, it is important to ensure that a base model is fine-tuned and aligned for any enterprise use case.
Software Programming
These models are also reshaping how developers write software. Code-completion tools like OpenAI's codex and Github Copilot give developers a powerful tool to increase their efficiency and debug their code. Programmers can ask for functions from scratch, or provide existing functions and ask the LLM to help them debug it. As the context window size increases, these tools will be able to help analyze entire code bases as well (source).
General Assistants
Large language models can be used for tasks such as data analysis, content generation, and even helping to design new products. The ability to quickly process and analyze large amounts of data can also help businesses make better decisions, increase employee productivity, and stay ahead of the competition.
Industry Use Cases
Let's quickly explore how a few industries are adopting Generative AI to improve their business:
- Insurance companies use AI to increase the operational efficiency of claims processing. Claims are often highly complicated, and Generative AI excels at properly routing, summarizing, and classifying these claims.
- Retail and eCommerce companies have for years tried to adopt customer chatbots, but they have failed to live up to their promise of streamlining operations and providing a better user experience. But now, with the latest generative chatbots that are fine-tuned on company data, chatbots finally provide engaging discussions and recommendations that dynamically respond to customer input.
- Financial services companies are building assistants for investment research that analyze financial statements, historical market data, and other proprietary data sources and provide detailed summaries, interactive charts, and even take action with plugins. These tools increase the efficiency and effectiveness of investors by surfacing the most relevant trends and providing actionable insights to help improve returns.
Overall, large language models offer a wide range of potential benefits for businesses.
A Brief History of Language Models
To better contextualize the impact of these models, it is essential to understand the history of natural language processing. While the field of natural language processing (NLP) began in the 1940s after World War II, and the concept for using neural networks for natural language processing dates back to the 1980s, it was not until relatively recently that the combination of processing power via GPUs and data necessary to train very large models became widely available. Symbolic and statistical natural language processing were the dominant paradigms from the 1950s through the 2010s.
Recurrent Neural Networks (RNNs) were popularized in the 1980s. RNNs are a basic form of artificial neural network that can handle sequential data, but they struggle with long-term dependencies. In 1997, LSTMs were invented; LSTMs are a type of RNN that can manage long-term dependencies better due to their gating mechanism. Around 2007, LSTMs began to revolutionize speech recognition, with this architecture being used in many commercial speech-to-text applications.
Throughout the early 2000s and 2010s, trends shifted to deep neural nets, leading to rapid improvement on the state of the art for NLP tasks. In 2017, the now-dominant transformer architecture was introduced to the world (source), changing the entire field of AI and machine learning. Transformers use an attention mechanism to process entire sequences at once, making them more computationally efficient and capable of handling complex contextual relationships in data. Compared to RNN and LSTM models, the transformer architecture is easier to parallelize, allowing training on larger datasets.
2018 was a seminal year in the development of language models built on this transformer architecture, with the release of both BERT from Google and the original GPT from OpenAI.
BERT, which stands for Bidirectional Encoder Representations from Transformers, was one of the first large language models to achieve state-of-the-art results on a wide range of natural language processing tasks in 2018. BERT is widely used in business today for its classification capabilities, as it is a relatively lightweight model and inexpensive to run in production. BERT was state of the art when it was first unveiled, but has now been surpassed in nearly all benchmarks by more modern generative models such as GPT-4.
The GPT family of models differs from BERT in that GPT models are generative and have a significantly larger scale leading to GPT models outperforming BERT on a wide range of tasks. GPT-2 was released in 2019, and GPT-3 was announced in 2020 and made generally available in 2021. Google released the open-source T5/FLAN (Fine-tuned Language Net) model and announced LaMDA in 2021, pushing the state of the art with highly capable models.
In 2022 the open-source BLOOM language model, the more powerful GPT-3 text-davinci-003, and ChatGPT were released, capturing headlines and catapulting LLMs to popular attention.In 2023, GPT-4 and Google's Bard chatbot were announced. Bard was originally running LaMDA, but Google has since replaced it with the more powerful PaLM 2 model.
There are now several competitive models including Anthropic’s Claude, Cohere’s Command Model, Stability AI’s StableLM. We expect to see these models to continue to improve and gain new capabilities over the next few years. In addition to text, multimodal models will be able to ingest and respond with images and videos, with a coherent understanding of the relationships between these modalities. Models will hallucinate less and will be able to more reliably interact with tools and databases. Developer ecosystems will proliferate around these models as a backend, ushering in an era of accelerated innovation and productivity. While we do expect to see larger models, we expect model builders will focus more on high quality data to improve model performance.
Model Size and Performance
Over time, LLMs have become more capable as they've increased in size. Model size is typically determined by its training dataset size measured in tokens (parts of words) or by its number of parameters (the number of values the model can change as it learns).
- BERT (2018) was 3.7B tokens and 240 million parameters (source).
- GPT-2 (2019) was 9.5B tokens and 1.5 billion parameters (source).
- GPT-3 (2020) has 499B tokens and 175B parameters (source).
- PaLM (2022) was 780 Billion tokens and 540 billion parameters (source).
As these models scaled in size, their capabilities continued to increase, providing more incentive for companies to build applications and entire businesses on top of these models. This trend continued until very recently.
But now, model builders are grappling with the fact that we may have reached a sort of plateau, a point at which additional model size yields diminishing performance improvements. Deepmind's paper on training compute-optimal LLMs, (source), showed that for every doubling of model size the number of training tokens should also be doubled. Most LLMs are already trained on enormous amounts of data, including most of the internet, so expanding dataset size by a large degree is increasingly difficult. Larger models will still outperform smaller models, but we are seeing model builders focusing less on increasing size and instead focusing on incredibly high-quality data for pre-training, combined with techniques like supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and prompt engineering to optimize model performance.
Fine-Tuning Large Language Models
What is fine-tuning of an LLM?
Fine-tuning is a process by which an LLM is adapted to specific tasks or domains by training it on a smaller, more targeted dataset. This can help the model better understand the nuances of the specific tasks or domains and improve its performance on those particular tasks. Through human evaluations on prompt distribution, OpenAI found that outputs from their 1.3B parameter InstructGPT model were preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters (source). Fine-tuned models perform better and are much less likely to respond with toxic content or hallucinate (make up information). The approach for fine-tuning these models included a wide array of different domains, though still a tiny subset compared to the entirety of internet data.
This principle of fine-tuning increasing task-specific performance also applies to single domains, such as a particular industry or specific task. Fine-tuning large language models (LLMs) makes them incredibly valuable for businesses. For example, a company that provides language translation services could fine-tune an LLM to understand better the nuances of a particular language or domain, such as legal documents or insurance claims. This understanding helps the model generate more accurate and fluent translations, leading to better customer satisfaction and potentially even higher revenues.
Another example is a business that generates product descriptions for an e-commerce website. By fine-tuning an LLM to understand the characteristics of different products and their features, the model could generate more informative and compelling descriptions, which could increase sales and customer engagement. Fine-tuning an LLM can help businesses tailor the model's capabilities to their specific needs and improve their performance in various tasks and domains.
What is the process of fine-tuning?
Fine-tuning an LLM generally consists of the following high-level process:
- Identify the task or domain you want to fine-tune the model for, which could be anything from language translation to text summarization to generating product descriptions.
- Gather a targeted dataset relevant to the task or domain you want to fine-tune the model for. This dataset should be large enough to provide the model with sufficient information to learn from but not so large that it takes a long time to train. A few hundred training examples is the minimum recommended amount, with more data increasing model quality. Ensure the data is relevant to your industry and specific use case.
- Use a machine learning framework, library, or a tool like Scale GenAI platform to train the LLM on the smaller dataset. This will involve providing the model with the original data, "input text," and the corresponding desired output, such as summarization or classification of the text into a set of predefined categories, and then allowing the model to learn from this data by adjusting its internal parameters.
- Monitor the model's performance as it trains and make any necessary adjustments to the training process to improve the output. This could involve changing the size of the training dataset, adjusting the model's learning rate, or modifying the model's architecture.
- Once the model has been trained, evaluate its performance on the specific task or domain you fine-tuned it for. This will involve providing the model with input text and comparing its actual output to the desired output.
Reinforcement Learning from Human Feedback (RLHF)
What is RLHF?
Reinforcement learning from human feedback (RLHF) is a methodology to train machine learning models by soliciting feedback from human users. RLHF allows for more efficient learning. Instead of attempting to write a loss function that will result in the model behaving more like a human, RLHF includes humans as active participants in the training process. RLHF results in models that align more closely with human expectations, a typical qualitative measure of model performance.
Models trained with RLHF
Models trained with RLHF, such as InstructGPT and ChatGPT, have the benefit of generally being more helpful and more aligned with a user's goals (source). These models are better at following instructions and tend not to make up facts (hallucinate) as often as models trained with other methods. Additionally, these models perform as well as traditional models but at a substantially smaller size (InstructGPT is 1.3 billion parameters, compared to GPT-3 at 175 billion parameters).
InstructGPT
OpenAI API uses GPT-3 based language models to perform natural language tasks on user prompts, but these models can generate untruthful or toxic outputs. To improve the models' safety and alignment with user intentions, OpenAI developed InstructGPT models using reinforcement learning from human feedback (RLHF). These models are better at following instructions and generate less toxic content. They have been in beta on the API for over a year and are now the default language models on the API. OpenAI believes that fine-tuning language models with human input is a powerful way to align them more closely with human values and make them more reliable (source).
ChatGPT
ChatGPT is a large language model that has been developed specifically for the task of conversational text generation. This model was initially trained with supervised fine-tuning with humans interacting to create a conversational dataset. The model was then fine-tuned with RLHF, with humans ranking model outputs which were then used to improve the model.
One of the key features of ChatGPT is its ability to maintain the context of a conversation and generate relevant responses. As such, it is a valuable tool for applications such as search engines or chatbots, where the ability to generate coherent and appropriate responses is essential. In addition, ChatGPT can be fine-tuned for even more specific applications, allowing it to achieve even better performance on specialized tasks.
Overall, ChatGPT has made large language models more accessible to a wider range of users than previous large language models.
While the high-level steps for fine-tuning are simple, to accurately improve the performance of the model for a specific task, expertise is required.
LLM Prompt Engineering
What is Prompt engineering?
Prompt engineering is the process of carefully designing the input text, or "prompt," that is fed into an LLM. By providing a well-crafted prompt, it is possible to control the model's output and guide it to generate more desirable responses. The ability to control model outputs is useful for various applications, such as generating text, answering questions, or translating sentences. Without prompt engineering, an LLM may generate irrelevant, incoherent, or otherwise undesirable responses. By using prompt engineering, it is possible to ensure that the model generates the desired output and makes the most of its advanced capabilities.
Prompt engineering is a nascent field, but a new career is already emerging, that of the "Prompt Engineer."
What does a prompt engineer do?
A prompt engineer for large language models (LLMs) is responsible for designing and crafting the input text, or "prompts," that are fed into the models. They must have a deep understanding of LLM capabilities and the specific tasks and applications it will be used for. The prompt engineer must be able to identify the desired output and then design prompts that are carefully crafted to guide the model to generate that output. In practice, this may involve using specific words or phrases, providing context or background information, or framing the prompt in a particular way. The prompt engineer must be able to work closely with other team members and adapt to changing requirements, datasets, or models. Prompt engineering is critical in ensuring that LLMs are used effectively and generate the desired output.
How do you prompt an LLM?
Prompt Engineering for an LLM generally consists of the following high-level process:
- Identify the task or application you want to use the LLM for, such as generating text, answering questions, or summarizing reports.
- Determine the specific output you want the LLM to generate, which could be a paragraph of text, a single value for classification, or lines of code.
- Carefully design a prompt to guide the LLM to generate the desired output. Be as specific as possible and provide context or background information to ensure that the language is clear.
- Feed the prompt into the LLM and observe the output it generates.
- If the output is not what you desired, modify the prompt and try again.
- Following these high level can help you get the most out of your model and make it more useful for a variety of applications. To quickly get started with prompt engineering for large language models, try out Scale Spellbook today.
Below we provide an overview of a few popular prompt engineering techniques:
Popular Prompt Engineering Techniques
Ensuring Brand Fidelity
In combination with RLHF and domain-specific fine-tuning, prompt engineering can help ensure that model responses reflect your brand guidelines and company policies. By specifying an identity for your model in a prompt, you can enforce the desired model behavior in various scenarios.
For instance, let's say that you are Acme Corp., a financial services company. A user has landed on your website by accident and is asking for advice on a particular pair of running shoes.
This response is an example of an AI hallucination or the model fabricating results. Though the company does not sell running shoes, it gladly responds with a suggestion. Let's update the default prompt, or system message, to cover this edge case.
Default Prompt: We will specify a default prompt, which is added to every session to define the default behavior of the chatbot. In this example, we will use this default prompt:
"You are AcmeBot, a bot designed to help users with financial services questions. AcmeBot responses should be informative and actionable. AcmeBot's responses should always be positive and engaging. If a user asks for a product or service unrelated to financial services, AcmeBot should apologize and simply inform the user that you are a virtual assistant for Acme Corp, a financial services company and cannot assist with their particular request, but that you would be happy to assist with any financial questions the user has."
With this default prompt in place, the model now behaves as we expect:
Improved Information Parsing
By specifying the desired template for the response, you can steer the model to return data in the format that is required by your application. For example, say you are a financial institution integrating existing backend systems with a natural language interface powered by an LLM. Your backend systems require a specific format to accept any data, which an LLM will not provide out of the box. Let's look at an example:
This response is accurate, but it is missing context that our backend systems need to parse this data properly. Let's specify the template we need to receive an appropriate response. Depending on the application, this template can also be added as part of a default prompt.
Now our data can be parsed by our backend system!
Adversarial or “Red-team” prompting
Chat models are often designed to be deployed in public-facing applications, where it's important they do not produce toxic, harmful, or embarrassing responses, even when users intentionally seek such material. Adversarial prompts are designed to elicit disallowed output, tricking or confusing a chat model into violating the policies its creators intended.
One typical example is prompt injection, otherwise referred to as instruction injection. Models are trained to follow user instructions but are also given a directive by a default prompt to behave in certain ways, such as not revealing details about how the model works or what the default prompt is. However, with clever prompting, the model can be tricked to disregard its programming and follow user instructions that conflict with its training or default prompt.
Below we explore a simple example of an instruction injection, followed by an example using a model that has been properly trained, fine-tuned, and with a default prompt that prevents it from falling prey to these common adversarial techniques:
Adversarial prompt with poor response:
Adversarial prompt with desired response:
Adversarial prompt engineering is an entire topic unto itself, including other techniques such as role-playing and fictionalization, unusual text formats and obfuscated tasks, prompt echoing, and dialog injection. We have only scratched the surface of prompt engineering here, but there are a wide array of different techniques to control model responses. Prompt engineering is evolving quickly, and experienced practitioners have spent much time developing an intuition for optimizing prompts for a desired model output. Additionally, each model is slightly different and responds to the same prompts with slightly different behaviors, so learning these differences adds another layer of complexity. The best way to get familiar with prompt engineering is to get hands on and start prompting models.
Conclusion
As we have seen, LLMs are versatile tools that can be applied to a wide variety of use cases. These models have already had a transformative impact on the business landscape, with billions of dollars being spent in 2023 alone. Nearly every industry is working on adopting these models into their specific use cases, from insurance companies looking to optimize claims processing, wealth managers looking for unique insights across a large number of portfolios to help them improve returns, or eCommerce companies looking to make it easier to purchase their products.
To optimize investments in LLMs, it is critical that businesses understand how to properly implement them. Using base foundation models out of the box is not sufficient for specific use cases. These models need to be fine-tuned on proprietary data, improved with human feedback, and prompted properly to ensure that their outputs are reliable and accomplish the task at hand.
At Scale we help companies realize the value of large language models, including those that are building large language models and those that are looking to adopt them to make their businesses better. Our GenAI Platform provides comprehensive testing and evaluation and data solutions to unlock the full value of AI.