Enhancing LLMs with Retrieval Augmented Generation

by Clemens Viernickel and Bryan Zhu on October 24, 2023

Retrieval Augmented Generation (RAG) has captured the attention of the Generative AI community because it significantly extends the available use cases and applications of Generative AI models. RAG means retrieving external information and injecting it into the prompt of an LLM call so that the model can generate outputs with specific context from unique data sources. In this blog post, we're going to discuss why RAG is important, how it works, and how to improve its usefulness for enterprise use cases.

What Is RAG and Why Does It Matter?

LLMs are trained on many billions of tokens, providing extensive knowledge and powerful reasoning capabilities. In many cases, enterprises may then opt to fine-tune an LLM with highly specialized data to further enhance the model's knowledge and capabilities for a specific use case or domain. However, after the training and fine-tuning are complete, the “knowledge” of the model, i.e., the data it can use to generate responses, is fixed. Hence, when asked a question about data or documents it has not seen before, the model cannot answer, such as in this example conversation:

In practice, it is not possible to continuously retrain the model on all the latest data before calling it. We want the model to “apply” its knowledge to new data. A good analogy would be teaching a law school student how to assess whether a transaction document has a tax deed. There is specific tax law course content that the student learns during the curriculum. However, after graduating, the student will encounter entirely new cases in practice. The student needs to apply her learned skills to assess tax deeds. But if she wants to know how many of the law firm's transaction documents had a tax deed, she needs to have access to all the relevant documents and information of this specific case and go through them. The course content alone will not suffice, as it would be impossible to include all the cases in the university course.

The easiest solution to this problem would be to include all the required new information in the prompt when calling the model. In the image above, this would mean taking the text of all the transaction documents, feeding it into the model, and asking, "How many of these mention a tax deed?". However, there is a problem: the model has a limited "context window." This means it can only accept a limited number of input tokens in its prompt due to computational constraints. While the latest models like GPT-4 can take up to 32,000 tokens (about 50 pages of text), this is still not enough to answer questions about large volumes of documents or data, not even enough for the sample question above.

The solution involves introducing an additional step before calling the LLM: retrieving the most relevant information from a so-called "knowledge base" and then adding this retrieved data to the prompt.

RAG vs. Fine-Tuning

Based on RAG's success in enabling many practical LLM applications, a debate has emerged about whether to use RAG or fine-tuning. This is a false dichotomy because these two techniques are complementary. Fine-tuning is for changing and adjusting the model's behavior, i.e., "teaching" the model new skills like "writing patent litigation claims." RAG is about providing additional context to the model at the time it is called. The two techniques are often needed in concert. For example, a tax lawyer needs both specialized training (fine-tuning) AND access to the relevant case documents (RAG) in order to assess their content.

Fine-tuning involves changing the actual weights of a pre-trained model, using highly domain- and task-specific data. There are multiple techniques, including instruction fine-tuning (prompt-response pairs) and RLHF. The most important part of fine-tuning is developing a high-quality dataset in the necessary domain and with data relevant to the use cases for which the model will be used.

RAG, on the other hand, inserts data into the prompt (context window) of the model during inference. The steps here involve chunking, embedding, vector stores, reranking, and more, all of which we will dive deeper into now.

How Does A RAG System Work?

A RAG implementation typically includes three primary components: pre-processing, retrieval, and reasoning, each of which has further sub-components. Pre-processing takes the raw data that should be used by the LLM and transforms it into a format that can be used for retrieval at inference time, that is, when the deployed model is asked to produce an output such as an answer to a question. This process involves data connectors, chunking, chunk-processing, metadata extraction, embedding generation, and storing embeddings in a vector store. Retrieval is the core process of searching a vector store for the embeddings most similar to a user query and then re-ranking the results for relevance. Finally, reasoning is the actual call to the language model combining the original prompt and the retrieved context, generating the answer that the user will eventually read.

Let’s look at each of the components in a bit more detail.


Pre-processing

The first step in RAG is to transform the raw data that should be used for retrieval into a format so that it can be effectively searched based on the user prompt. For example, the relevant database might be a collection of 10,000 legal transaction documents. The output of the pre-processing step is often called a "knowledge base." This is a combination of a vector store with searchable embeddings and a set of metadata that is associated with these embeddings.

After reading in the raw text (which for PDF documents can require additional OCR or language models), the first step is to split or "chunk" the raw text into smaller pieces. The most basic way is just by a fixed number of tokens. The ideal size of chunks is determined by the individual use case and input data and is to be iteratively optimized based on the system accuracy (see below). 
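As a minimal sketch of fixed-size chunking, the helper below splits text into overlapping windows. It assumes whitespace-separated words as a stand-in for model tokens (a real system would count tokens with the tokenizer of its embedding model), and the function name and sizes are illustrative only.

```python
def chunk_by_tokens(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_tokens(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Chunk size and overlap are the two knobs to tune against retrieval accuracy for a given document type.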

Next, it is often required to "sanitize" the chunks to improve their semantic meaning. For example, in legal contract documents, we want to ensure that the chunks always start and finish with clauses or sub-sections. When data is extracted from tables, HTML documents, or similar, sanitation might involve removing tags or other unnecessary artifacts.

An important but often overlooked next step is to extract metadata from the chunks and overall input documents. This could involve steps like generating summaries of each chunk, extracting references, headings, document metadata (like timestamps, etc.), and more. The metadata depends on the specific use case.

The next step is to generate embeddings for the chunks and store them in a vector store. The embedding model and vector store used vary widely and depend on the relevant business requirements. For example, some embedding models or vector stores can be hosted locally, while others are hosted externally. The embedding model can also be fine-tuned for specific types of text, leading to higher overall retrieval performance.

Lastly, we store the metadata in a relevant metadata store. This can either be directly with the chunks in the vector store (if it allows metadata storage) or in a separate relational database.
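To make the shape of a knowledge base entry concrete, here is a small in-memory sketch that keeps each chunk's text, embedding, and metadata together. `toy_embed` is a hypothetical stand-in (a hashed bag-of-words vector), not a real embedding model, and the record layout is an assumption for illustration rather than the schema of any particular vector store.

```python
import hashlib
import math
from dataclasses import dataclass, field


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class ChunkRecord:
    chunk_id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def build_knowledge_base(chunks: list[str], doc_metadata: dict) -> list[ChunkRecord]:
    """Embed each chunk and attach document-level plus chunk-level metadata."""
    records = []
    for position, text in enumerate(chunks):
        metadata = {**doc_metadata, "position": position}
        chunk_id = f"{doc_metadata['doc_id']}-{position}"
        records.append(ChunkRecord(chunk_id, text, toy_embed(text), metadata))
    return records


kb = build_knowledge_base(
    ["The seller delivers a tax deed at closing.", "Payment is due within 30 days."],
    {"doc_id": "contract-001", "month": "2023-03"},
)
print(kb[0].chunk_id, kb[0].metadata)
```

Storing metadata next to the embedding is the simpler of the two options described above; a separate relational store would key its rows on the same `chunk_id`.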

When accessing data from external applications and sources, the pre-processing step is often automated with so-called data connectors. A data connector would first authenticate to a source of data like Google Drive and then asynchronously execute the pre-processing steps on a regular basis to ensure the knowledge base stays in sync with the data stored in the external resource.

The following is an example of how to create a knowledge base using Scale’s EGP APIs and link it to a Google Drive data connector:

import requests

# NOTE: the endpoint URLs were not preserved in this post and the HTTP call was
# garbled; `requests.post` is assumed. Fill in the EGP API URL and your
# authentication headers before running.
headers = {}  # placeholder for the authentication headers

# create a new knowledge base
url = ""
payload = {
    "embedding_config": {"embedding_model": "sentence-transformers/all-MiniLM-L12-v2"},
    "knowledge_base_name": "example_knowledge_base",
}
response = requests.post(url, json=payload, headers=headers)
knowledge_base_id = response.json()["data"]["knowledge_base_id"]

# upload documents to the knowledge base using a Google Drive data connector
url = f"{knowledge_base_id}/uploads"
payload = {
    "data_source_config": {
        "source": "GoogleDrive",
        "drive_id": "DRIVE_ID_GOES_HERE",
    },
    "chunking_strategy_config": {
        "strategy": "character",
        "separator": "\n\n",
        "chunk_size": 1000,
        "chunk_overlap": 200,
    },
}
response = requests.post(url, json=payload, headers=headers)


Retrieval

The core retrieval process has three elements: translating the user query into embeddings (question encoding), similarity search, and reranking.

To search the newly stored embeddings in the vector store, we need to translate the question the user has asked in natural language (e.g., “How many litigations did we have in March?”) into embedding space as well. This is usually done using either the same model used to embed the chunks during the pre-processing step, or another model trained alongside that model.

Next, we can query the vector store by running a similarity search between the question embeddings and all the embeddings stored in the vector store. The result is a list of chunk ids that are ordered by similarity score from most to least relevant. While a k nearest neighbor (kNN) search is the most common technique, some vector stores offer more advanced search options for higher relevance results. 

We also have to specify how many results (chunks) we want to get from the vector store. A higher number of results makes it more likely that the correct piece of information is contained in the results, but the LLM context window is limited, which puts a constraint on the number of chunks we can return. Additionally, dealing with a very large number of retrieved chunks can increase the time the full generation pipeline takes due to various downstream pieces such as reranking, filtering, and more. Furthermore, more results make it more difficult to rank the chunks in a way that the most relevant is up top.
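The similarity search and the `top_k` trade-off can be sketched with an exhaustive kNN search over a tiny in-memory store. This is an illustrative sketch only: cosine similarity is assumed as the metric, the three-dimensional vectors are made up, and production vector stores typically use approximate indexes rather than scoring every vector.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_search(query_vec: list[float], store: dict[str, list[float]], top_k: int = 3):
    """Score every stored vector and return the top_k (chunk_id, score) pairs,
    ordered from most to least similar."""
    scored = ((chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in store.items())
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
results = knn_search([1.0, 0.1, 0.0], store, top_k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

Raising `top_k` increases the chance that the right chunk is in the result list, at the cost of more text to rerank and fit into the context window.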

Often, we would like to use a stronger reranking model than kNN vector search, such as a cross-encoder, but it would be too expensive to run on the thousands or millions of chunks present in the vector store. In this case, we employ a two-stage reranking process: first, we narrow down the chunks to the top K (e.g. K=500) using this vector store retrieval, and then we rerank only the retrieved chunks with the second-stage reranking model. This allows us to search the whole vector store while also getting the benefits of the reranker.
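The two-stage pattern can be sketched with pluggable scoring functions. Both scorers below are hypothetical stand-ins: `word_overlap` plays the role of the cheap vector search and `slow_scorer` plays the role of a cross-encoder, recording its calls to show that the expensive model only ever sees the first-stage candidates.

```python
from typing import Callable


def two_stage_retrieve(
    query: str,
    chunks: dict[str, str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    first_stage_k: int = 500,
    final_k: int = 10,
) -> list[str]:
    """Narrow the corpus with a cheap scorer, then reorder the survivors
    with an expensive one. Returns chunk ids, best first."""
    candidates = sorted(
        chunks, key=lambda cid: first_stage_score(query, chunks[cid]), reverse=True
    )[:first_stage_k]
    reranked = sorted(
        candidates, key=lambda cid: rerank_score(query, chunks[cid]), reverse=True
    )
    return reranked[:final_k]


def word_overlap(query: str, text: str) -> float:
    """Cheap stand-in for vector similarity: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


calls = []


def slow_scorer(query: str, text: str) -> float:
    """Stand-in for a cross-encoder; favors chunks focused on the query terms."""
    calls.append(text)
    return word_overlap(query, text) / (len(text.split()) or 1)


corpus = {
    "c1": "tax deed recorded for the property transfer",
    "c2": "the tax deed",
    "c3": "quarterly marketing budget review",
    "c4": "deed of trust and tax schedule appendix with many unrelated boilerplate words",
}
top = two_stage_retrieve("tax deed", corpus, word_overlap, slow_scorer, first_stage_k=3, final_k=2)
print(top, "expensive scorer calls:", len(calls))
```

The first-stage K bounds the cost of the second stage, so it can be set by latency budget rather than by corpus size.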

In our experience, the reranking step is a crucial element of the retrieval process for ensuring a high-quality LLM response. Hence, more and more companies are paying special attention to reranking, and techniques like maximal marginal relevance (MMR) are gaining traction. Cohere recently even published a model specifically for reranking. At Scale, we have also observed that fine-tuning the reranking model on the specific dataset that is used for retrieval can dramatically improve results.
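For readers unfamiliar with MMR, the sketch below shows the greedy selection rule: each step picks the candidate that maximizes `lam * relevance - (1 - lam) * redundancy`, where redundancy is the highest similarity to anything already selected. The vectors, names, and the `lam` value are illustrative assumptions, and cosine similarity is assumed as the metric.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec: list[float], candidates: dict[str, list[float]], k: int = 3, lam: float = 0.7) -> list[str]:
    """Greedy maximal marginal relevance: trade off relevance to the query
    against similarity to chunks that were already selected."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def mmr_score(cid: str) -> float:
            relevance = cosine(query_vec, remaining[cid])
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


candidates = {
    "clause": [0.99, 0.05],
    "clause_near_copy": [1.0, 0.0],
    "other_topic": [0.6, 0.8],
}
query = [1.0, 0.2]
diverse = mmr(query, candidates, k=2, lam=0.5)
relevance_only = mmr(query, candidates, k=2, lam=1.0)
print(diverse, relevance_only)
```

With `lam=1.0` the near-duplicate takes the second slot; lowering `lam` lets a less similar but non-redundant chunk through, which matters when the context window only fits a few chunks.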

In many RAG solutions, the entire retrieval step is abstracted in a single API call to an existing knowledge base, with lower-level APIs available to further customize the individual sub-steps and configurations.

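# query the knowledge base; reuses `headers` and `knowledge_base_id` from the
# knowledge base example and needs `import requests`
# (the HTTP call was garbled in the source, so `requests.post` is an assumption)
url = f"{knowledge_base_id}/query"
payload = {
    "include_embeddings": True,
    "query": "Among all our transaction documents this month, how many mention a tax deed?",
    "top_k": 10,
}
response = requests.post(url, json=payload, headers=headers)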

Another important step that is often performed during retrieval is metadata filtering. This includes matching certain names, keywords, dates, etc. in the user query and using the previously extracted metadata per chunk in order to filter out irrelevant pieces of information. For example, if the user asked for contracts for the month of March, we can filter out all the chunks that are from documents of a different month, no matter their semantic relevance. Performing the filtering on the results of the vector store search is called post-filtering. Another approach is to do pre-filtering, which would mean applying the filter of metadata already during or before the similarity search. This can improve the speed of the similarity search for large knowledge bases; however, not all vector stores support this technique.

At Scale, we observe that metadata filtering, in connection with high-quality metadata extraction during pre-processing, is typically one of the most critical operations that drive retrieval accuracy. It is particularly critical in cases where the RAG system has access to many documents that are very semantically similar, like 1000s of transaction documents in a law firm.
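The difference between post-filtering and pre-filtering can be sketched on a four-chunk toy store. The record layout, the `month` field, and the dot-product scorer (which assumes roughly normalized vectors) are illustrative assumptions.

```python
def similarity(query_vec, vec):
    """Dot product; assumes vectors are roughly normalized."""
    return sum(q * v for q, v in zip(query_vec, vec))


def search(query_vec, records, top_k):
    ranked = sorted(records, key=lambda r: similarity(query_vec, r["embedding"]), reverse=True)
    return ranked[:top_k]


def post_filter_search(query_vec, records, top_k, predicate):
    """Search everything first, then drop hits that fail the metadata predicate."""
    return [r for r in search(query_vec, records, top_k) if predicate(r["metadata"])]


def pre_filter_search(query_vec, records, top_k, predicate):
    """Restrict the candidate set by metadata first, then search only within it."""
    return search(query_vec, [r for r in records if predicate(r["metadata"])], top_k)


def is_march(metadata):
    return metadata["month"] == "2023-03"


records = [
    {"id": "feb-1", "embedding": [1.0, 0.0], "metadata": {"month": "2023-02"}},
    {"id": "feb-2", "embedding": [0.9, 0.1], "metadata": {"month": "2023-02"}},
    {"id": "mar-1", "embedding": [0.8, 0.2], "metadata": {"month": "2023-03"}},
    {"id": "mar-2", "embedding": [0.1, 0.9], "metadata": {"month": "2023-03"}},
]
query = [1.0, 0.0]
post_hits = [r["id"] for r in post_filter_search(query, records, 2, is_march)]
pre_hits = [r["id"] for r in pre_filter_search(query, records, 2, is_march)]
print("post-filter:", post_hits)
print("pre-filter:", pre_hits)
```

In this toy case post-filtering returns nothing because both of the top two hits are from February, so when post-filtering it is common to retrieve more than `top_k` candidates before applying the filter.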


Reasoning

The last step is to fetch the relevant chunks based on the IDs returned by the vector store and compose the final prompt for the language model based on the initial user question and the retrieved content. One of the challenges here is often to ensure that the language model is answering the user question only or mostly based on the information that was found during retrieval. This includes making sure that the model answers “I don’t know” if the relevant information is not found instead of making up a response. For example, when asking a question about a specific litigation case, but the document is not in the knowledge base, we want to ensure that the LLM answers “I don’t have information about this case” instead of making up a response based on similar documents. One way to achieve this is accurate prompt engineering, outlining strict rules for the LLM when answering questions based on retrieved context. However, this is often not enough, especially for smaller open source models that have been trained on significantly less data compared to models like GPT-4.
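One possible way to compose such a prompt is sketched below. The rule wording, the chunk fields (`source`, `text`), and the character budget are assumptions for illustration; a real system would budget in tokens and tune the instructions for its model.

```python
def build_rag_prompt(question: str, chunks: list[dict], max_context_chars: int = 6000) -> str:
    """Compose a grounded prompt: strict rules, numbered context chunks, then the question."""
    rules = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply exactly: "
        "\"I don't have information about this.\" Cite chunk numbers like [1]."
    )
    context_lines, used = [], 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] (source: {chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_context_chars:
            break  # stay within budget; chunks are assumed sorted by relevance
        context_lines.append(entry)
        used += len(entry)
    context = "\n".join(context_lines) if context_lines else "(no relevant context found)"
    return f"{rules}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


retrieved = [
    {"source": "deal-17.pdf", "text": "The closing file includes a recorded tax deed."},
    {"source": "deal-21.pdf", "text": "Financed through a bridge loan."},
]
prompt = build_rag_prompt("Which deals include a tax deed?", retrieved)
print(prompt)
```

Numbering the chunks in the prompt also gives the model a handle for citing its sources.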

An effective way to alleviate this problem is to perform RAG fine-tuning. RAG fine-tuning is a specialized case of instruction fine-tuning, where the model is trained on prompt-response pairs with an added bit of context or data for each pair. This conditions the model to “expect” retrieved context when answering questions and makes it much less likely the model will hallucinate when used in a RAG system.
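A training record for this kind of fine-tuning might look like the sketch below. The `prompt`/`response` field names, the prompt layout, and the JSONL output are assumptions, not a description of Scale's actual data format. Including an example where the context lacks the answer is also an assumption, shown here as one way to teach the refusal behavior described above.

```python
import json


def make_rag_example(question: str, context_chunks: list[str], answer: str) -> dict:
    """Build one prompt-response pair with retrieved context embedded in the prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "response": answer}


examples = [
    # The answer is supported by the context.
    make_rag_example(
        "Which deal includes a tax deed?",
        ["Deal 17 closing file: includes a recorded tax deed."],
        "Deal 17 includes a tax deed.",
    ),
    # The context does not contain the answer, so the target is a refusal.
    make_rag_example(
        "Who signed the lease for Deal 42?",
        ["Deal 17 closing file: includes a recorded tax deed."],
        "I don't have information about this.",
    ),
]
jsonl = "\n".join(json.dumps(example) for example in examples)
print(jsonl)
```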

For example, at Scale we have fine-tuned an MPT-7B model for RAG using this technique and have observed dramatic improvements in both hallucination and “helpful answers”.

In many RAG systems, the reasoning step is followed by a process of generating references or citations for the model response. This means that one or multiple sentences in the model response include a reference to the exact chunk that was used to generate this response. There are multiple techniques to generate these citations like similarity matching or even having another LLM generate these matches.
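Citation by similarity matching can be sketched as follows: each sentence of the response is matched to the chunk with the highest word overlap, and a citation is attached only above a threshold. Jaccard similarity over word sets and the `0.2` threshold are illustrative assumptions; embedding similarity would be a natural alternative.

```python
import re


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def attach_citations(answer: str, chunks: dict[str, str], threshold: float = 0.2):
    """Return (sentence, chunk_id or None) pairs for each sentence in the answer."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    cited = []
    for sentence in sentences:
        words = tokenize(sentence)
        best_id, best_score = None, 0.0
        for chunk_id, text in chunks.items():
            score = jaccard(words, tokenize(text))
            if score > best_score:
                best_id, best_score = chunk_id, score
        cited.append((sentence, best_id if best_score >= threshold else None))
    return cited


source_chunks = {
    "deal-17#3": "The closing file for Deal 17 contains a recorded tax deed.",
    "deal-21#1": "Deal 21 was financed through a bridge loan.",
}
cited = attach_citations(
    "Deal 17 contains a recorded tax deed. Let me know if you need more.", source_chunks
)
for sentence, chunk_id in cited:
    print(f"{sentence} [{chunk_id}]" if chunk_id else sentence)
```

Sentences with no sufficiently similar chunk stay uncited, which is also a useful signal that the sentence may not be grounded in the retrieved context.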

To put everything together, let’s go back to our earlier example. When asking the retrieval augmented LLM the question “How many of our transaction documents last month included a tax deed?”, we get a very cohesive answer, including citations:

Behind the scenes, between prompt and response we can follow all the previously discussed steps: ingestion, embedding, metadata storage, retrieval, reranking and LLM reasoning, working in concert to generate the final answer.

How to Measure Accuracy?

One of the most important questions for RAG systems used in any business context is how accurate they are. However, evaluating RAG accuracy is both novel and multi-dimensional, which is why it is often harder than expected. 

At most companies, the chosen path of evaluation is to manually check, using human judgment, the system's answers to a set of questions for which the ground truth answer is known. In addition, it is often important to evaluate additional aspects, for example, whether the retrieved context was relevant for the answer. At Scale, we are able to evaluate even very large sets of test questions using evaluation APIs and our network of human experts. However, in order to enable rapid experimentation, it is also advisable to use an automated way of evaluating RAG accuracy in addition to human evaluation.

In this blog, we want to look at two techniques Scale is using for internal and external RAG benchmarking: Single Word Benchmark (SWB) and Span Evaluation Benchmark (SEB).

Single Word Benchmark

Here we use the documents in the knowledge base to formulate a set of questions that have definite answers of one or at most two words. For example, “Who is the author of this document?” or “What was the EBITDA in 2022?”. We then record the ground truth answers for these short questions. During evaluation, we can use exact or fuzzy matching (or another LLM) to compare the RAG system's responses to the evaluation questions with the recorded ground truth answers. This technique evaluates the end-to-end performance of the RAG system, which is beneficial for testing the effectiveness of all the components operating together. However, the evaluation does not give any insight into which parts of the RAG system (e.g., pre-processing, retrieval, reasoning) are at fault for wrong answers. Hence, it is often important to pair such end-to-end evaluation with a retrieval-only evaluation benchmark.
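A scoring loop for this kind of benchmark could look like the sketch below. It is not Scale's implementation: the normalization, the containment check used as the "exact" match, the `difflib` ratio used as the fuzzy match, and the `0.85` threshold are all assumptions chosen for illustration, as are the sample questions.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences do not count as errors."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def is_match(prediction: str, truth: str, fuzzy_threshold: float = 0.85) -> bool:
    pred, gold = normalize(prediction), normalize(truth)
    if not gold:
        return False
    if re.search(rf"\b{re.escape(gold)}\b", pred):  # exact answer appears in the response
        return True
    return SequenceMatcher(None, pred, gold).ratio() >= fuzzy_threshold  # tolerate small typos


def single_word_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of benchmark questions whose response matches the recorded answer."""
    hits = sum(is_match(predictions.get(q, ""), a) for q, a in ground_truth.items())
    return hits / len(ground_truth)


ground_truth = {
    "Who is the author of this document?": "Jane Doe",
    "In which state was the company incorporated?": "Delaware",
    "In which year was the contract signed?": "2019",
}
predictions = {
    "Who is the author of this document?": "The author is Jane Doe.",
    "In which state was the company incorporated?": "Delawere",
    "In which year was the contract signed?": "It was signed in 2021.",
}
accuracy = single_word_accuracy(predictions, ground_truth)
print(f"accuracy: {accuracy:.2f}")
```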

Span Evaluation Benchmark

For this benchmark, we also use the set of documents in the database to formulate test questions. These questions can be open-ended and do not need to have short answers. Importantly, when formulating the question, we mark the exact span of words (e.g., page 10, words 40-120) that contains the information required to answer the question. This could be one or multiple spans. We call these spans “ground truth context”. We do not care about the ground truth answer for the question at this point. During evaluation, we run the RAG system on the collected test questions and obtain a set of reranked chunks after the retrieval step. Instead of feeding these into the LLM, we programmatically match these chunks with the ground truth spans we have collected. This allows us to automatically evaluate the accuracy of the pre-processing and retrieval steps with high precision. Because LLM generation plays no part in this evaluation, it helps to isolate a problem within a RAG system.
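The programmatic matching step can be sketched as an interval-overlap check. The representation is an assumption: spans and chunks are `(doc_id, start_word, end_word)` tuples with an exclusive end offset, and a test question counts as a hit when every ground truth span is covered to at least `min_coverage` by the retrieved chunks.

```python
def span_coverage(truth_span, retrieved_chunks) -> float:
    """Fraction of the ground truth span's words covered by any retrieved chunk."""
    doc_id, start, end = truth_span  # word offsets, end exclusive
    covered = set()
    for chunk_doc, chunk_start, chunk_end in retrieved_chunks:
        if chunk_doc == doc_id:
            covered.update(range(max(start, chunk_start), min(end, chunk_end)))
    return len(covered) / (end - start)


def span_recall(test_cases, min_coverage: float = 1.0) -> float:
    """Share of test questions whose ground truth spans were all retrieved."""
    hits = sum(
        all(span_coverage(span, case["retrieved"]) >= min_coverage for span in case["truth_spans"])
        for case in test_cases
    )
    return hits / len(test_cases)


test_cases = [
    {   # span fully covered by two overlapping chunks
        "truth_spans": [("doc-a", 40, 120)],
        "retrieved": [("doc-a", 0, 100), ("doc-a", 80, 180)],
    },
    {   # only the second half of the span was retrieved
        "truth_spans": [("doc-b", 10, 30)],
        "retrieved": [("doc-b", 20, 60), ("doc-a", 0, 100)],
    },
]
strict = span_recall(test_cases)
lenient = span_recall(test_cases, min_coverage=0.5)
print(strict, lenient)
```

Because the metric depends only on chunk boundaries and retrieval results, a change to chunk size or the reranker can be scored without any LLM calls.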

How can we improve accuracy?

Now that we have built the RAG system and an evaluation method, the natural next question is: “How can we improve the accuracy of our RAG system?” Based on what we outlined above, at Scale we typically see three core areas for improving retrieval performance, listed below with practical suggestions.

Chunking and Embedding

  • Vary the size of chunks based on the document type and content
  • Create custom chunking logic based on the known document format (e.g., joining tables across pages, separating chunks based on section headers)
  • Perform more advanced metadata extraction and chunk sanitization during pre-processing
  • Use higher-performance and/or fine-tuned embedding models
  • Use higher-performance vector stores

Reranking and Filtering

  • Implement more advanced reranking algorithms or even fine-tune a cross-encoder to increase the relevance of the top retrieved data
  • Filter the retrieved chunks based on metadata like dates, named entities and cross-references

RAG Fine-Tuning

  • Fine-tune an LLM to always expect a certain type of context when answering questions
  • This is most important when using smaller models and when the context is highly domain-specific

What’s next?

Retrieval Augmented Generation is a fast-moving field, and this overview does not fully capture what is possible with RAG. At Scale, we are helping leading enterprises customize large language models with fine-tuning and RAG using our Enterprise Generative AI Platform to help them unlock their most important use cases. We are building highly effective RAG systems leveraging our comprehensive set of EGP APIs, some of which were shown in this article. To learn more about using these APIs for your project, book a demo below.

In a future series, we will dive deeper into:

  • Structured data retrieval: how to combine document retrieval with information from structured sources like relational databases.
  • RAG Evaluation APIs: how to leverage our easy-to-use experiment framework to give you peace of mind for the accuracy of your RAG system.
  • Agents: how to build more advanced generative AI solutions using RAG by chaining retrieval with other APIs and apps using agents.
