To properly adopt Generative AI for the enterprise, you first need a solid understanding of the Generative AI stack. At the base are foundation models: LLMs such as OpenAI's GPT-4, Google's PaLM 2, Cohere's Command model, or Anthropic's Claude, and image generation models such as Stability AI's Stable Diffusion. These models provide the foundational capabilities for Generative AI applications. Next is the data engine, which provides the data customization and fine-tuning that enable the base model to make proper use of proprietary enterprise data. Finally, a development platform is needed to build LLM apps, compare prompts and model variants, and deploy applications to production.

Enterprise-grade deployments of Generative AI usually involve some degree of internal application development. Companies do this so they can customize and fine-tune models to optimize performance on their specific use cases, improve security and safety, and ensure observability and reliability.

Customize and fine-tune models and apps for peak performance

Enterprises have unique needs that require extensive fine-tuning and prompt engineering of base foundation models. Open-source and commercial models are great generalists but poor specialists, especially for enterprise use cases that require "knowledge" of domain- or company-specific data. Base models are trained on publicly available internet data, not on a law firm's private documents, a wealth manager's research reports, or any company's internal databases. This specific data and context is the key to helping a model go from generic responses to actionable insights for specific use cases.

Small fine-tuned models can be cheaper and faster than base foundation models, and can perform better at specific tasks. For example, Google's Med-PaLM 2 is a language model fine-tuned on a curated corpus of medical information and Q&As. Med-PaLM 2 is reportedly 10 times smaller than GPT-4 but performs better on medical exams.

Source: Towards Expert-Level Medical Question Answering with Large Language Models, https://arxiv.org/abs/2305.09617

Another example is Vicuna-13B, a chatbot trained by fine-tuning Meta’s open-source LLaMA model. Vicuna is a 13B parameter model fine-tuned on approximately 70K shared user conversations. Vicuna is more than 13 times smaller than ChatGPT and, in a preliminary evaluation that used GPT-4 as the judge, achieved more than 90% of ChatGPT's response quality.

Goat is another fine-tuned LLaMA model. It outperforms GPT-4 on arithmetic tasks, achieving state-of-the-art performance on the BIG-bench arithmetic sub-task.

Every organization has business-critical tasks that rely on proprietary data and processes and that will benefit from a fine-tuned model rather than a base foundation model. While ChatGPT can provide general tips for a bank's customer support representative, a model fine-tuned on the transcripts of actual calls from a bank's customers can guide reps on specific actions for callers' concerns while following company policies - like the fastest path to resolve a billing dispute or the bank's best checking account option for a given customer segment.

In addition, enterprises may want greater control and flexibility over which models they use and when. Being able to compare multiple models can improve both performance and cost, rather than being locked into one model or provider for a single use case, or for many use cases across the business.
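A minimal sketch of what comparing models on performance and cost could look like, assuming each candidate model is wrapped in a plain Python callable and that a flat per-call cost is known. The `Candidate` class, the stub models, the costs, and the exact-match metric are all hypothetical placeholders for illustration, not any provider's API or pricing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    name: str                        # hypothetical label for the model
    generate: Callable[[str], str]   # wraps whichever client you actually use
    cost_per_call: float             # assumed flat cost per request, in USD


def compare(candidates: List[Candidate],
            eval_set: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Score each candidate by exact-match accuracy and total cost."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one example")
    report: Dict[str, Dict[str, float]] = {}
    for candidate in candidates:
        correct = sum(
            1 for prompt, expected in eval_set
            if candidate.generate(prompt).strip().lower() == expected.strip().lower()
        )
        report[candidate.name] = {
            "accuracy": correct / len(eval_set),
            "total_cost": candidate.cost_per_call * len(eval_set),
        }
    return report


# Stub models standing in for real API calls (assumptions for this sketch).
def general_stub(prompt: str) -> str:
    return "10" if "days" in prompt else "contact support"


def tuned_stub(prompt: str) -> str:
    answers = {
        "Which form opens a billing dispute?": "Form B-12",
        "How many days does a dispute take to resolve?": "10",
    }
    return answers.get(prompt, "contact support")


eval_set = [
    ("Which form opens a billing dispute?", "form b-12"),
    ("How many days does a dispute take to resolve?", "10"),
]

report = compare(
    [
        Candidate("general-model", general_stub, cost_per_call=0.03),
        Candidate("tuned-model", tuned_stub, cost_per_call=0.002),
    ],
    eval_set,
)
for name, metrics in report.items():
    print(name, metrics)
```

In practice the evaluation set would come from your own use case, and exact match would likely be replaced by a metric suited to the task, such as human ratings or a rubric-based score.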

Improve security and safety

Off-the-shelf applications typically require data to pass through the app provider's cloud. A custom-built application can remain in an enterprise's virtual private cloud - or even on-premises - for cases where data security is critical.

With a purpose-built application, data stays within the existing environment, so access control of any deployed LLM app can mirror existing role-based access controls.
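As a rough illustration of mirroring role-based access controls, the sketch below filters the documents that may be placed in a model's context according to the caller's role. The role names, data classifications, and the `ROLE_PERMISSIONS` mapping are hypothetical stand-ins for whatever policy your organization already maintains.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical mapping of role -> data classifications that role may read.
# In a real deployment this would be read from the existing access-control system.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "support_rep": {"public", "customer"},
    "compliance_officer": {"public", "customer", "restricted"},
}


@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: str
    text: str


def allowed_context(role: str, documents: List[Document]) -> List[Document]:
    """Return only the documents the given role is permitted to see."""
    permitted = ROLE_PERMISSIONS.get(role, set())  # unknown roles get nothing
    return [d for d in documents if d.classification in permitted]


def build_prompt(role: str, question: str, documents: List[Document]) -> str:
    """Assemble a prompt whose context respects the caller's role."""
    context = "\n".join(d.text for d in allowed_context(role, documents))
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    Document("d1", "public", "Branch hours are 9 to 5."),
    Document("d2", "customer", "Billing disputes use the dispute form."),
    Document("d3", "restricted", "Open investigation notes."),
]

prompt = build_prompt("support_rep", "How do I open a dispute?", docs)
print(prompt)
```

Filtering before the prompt is built means restricted text never reaches the model for a user who could not have read it directly.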

Ensure observability and reliability

Without rigorous evaluation and monitoring capabilities, generative models are prone to hallucinations and can provide false, harmful, or unsafe results. Companies face significant risks to their brand, especially when these models are deployed in customer-facing settings or handle sensitive information.

By customizing an enterprise app, your teams can define how they measure the performance of your applications and set up appropriate monitoring processes.

In addition to monitoring traffic and latency, you must consider operational priorities when setting up these monitoring processes. For example, suppose a financial firm has created an insider trading detection app powered by Generative AI. The security and compliance team will need to be immediately alerted about the detection of insider trading and any misclassification. The only way to achieve this is by directly embedding real-time monitoring and logging of prompts and model responses into the custom app. The security and compliance team can then take action from these alerts to prevent further damage.
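The sketch below shows one way such embedded logging and alerting might look: a wrapper records every prompt and response and notifies a callback when the model returns a flagged label. The `monitored_call` wrapper, the stub classifier, the label names, and the alert callback are assumptions made for illustration; a real app would route alerts to whatever paging or ticketing system the compliance team uses.

```python
import json
import logging
import time
from typing import Callable, Dict, List, Tuple

# Handler and level configuration is left to the host application.
logger = logging.getLogger("llm_app.audit")


def monitored_call(model: Callable[[str], str],
                   prompt: str,
                   alert: Callable[[Dict], None],
                   alert_labels: Tuple[str, ...] = ("INSIDER_TRADING",)) -> str:
    """Call the model, log the prompt and response, and alert on flagged labels."""
    start = time.perf_counter()
    response = model(prompt)
    record = {
        "prompt": prompt,
        "response": response,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }
    logger.info(json.dumps(record))      # audit trail of every interaction
    if response.strip().upper() in alert_labels:
        alert(record)                    # notify reviewers right away
    return response


# Stub classifier standing in for the real model (an assumption for this sketch).
def stub_model(prompt: str) -> str:
    return "INSIDER_TRADING" if "before the announcement" in prompt.lower() else "CLEAR"


alerts: List[Dict] = []
monitored_call(stub_model, "Routine portfolio rebalance.", alerts.append)
monitored_call(stub_model, "Buy shares before the announcement.", alerts.append)
print(len(alerts), "alert(s) raised")
```

Catching misclassifications additionally requires a feedback path, for example a reviewer marking a logged record as wrong, which this sketch does not cover.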

As we mentioned, organizations need to consider each layer of the stack: Generative AI applications, a robust development platform, a data engine for customizing and fine-tuning models on proprietary data, and base foundation models.

Below we lay out the key considerations for your build vs. buy decision for each layer of the stack.

Applications

Description: These interfaces allow customers or employees to interact with Generative AI models, such as a chat agent to ask questions or a copilot that makes suggestions based on what the user is working on.

Build: Building your own application is best when performance is dependent on access to proprietary data, and when data, model inputs, and/or outputs are highly sensitive.

Buy: Buying these apps is most appropriate for a fast start on less sensitive or generic use cases where proprietary data is not required.

Development Platform

Description: Companies seeking to build their own apps need the tooling to experiment with, develop, and deploy these apps. This tooling helps teams to compare generative models, fine-tune them, play with prompts, and then deploy apps to production.

Build: Typically, building an in-house development platform is limited to companies that intend to sell it to customers, such as Google with Vertex AI. Some companies build their own internal platforms strictly for their own use, but these platforms are difficult to maintain and costly to keep up to date, particularly in a fast-moving field.

Buy: Buying a development platform frees your resources to focus on core competencies and building valuable applications for your business instead of standing up another piece of infrastructure that needs to be maintained. Many open-source and commercial solutions on the market today are robust, reliable, and cost-effective for building and deploying Generative AI applications, including Scale Spellbook.

Data Engine

Description: A data engine helps teams to collect, curate, and annotate data, so their Generative AI models can produce high-quality outputs using this high-quality data. This typically includes human experts to validate the data and the tooling to help them do so efficiently and effectively.

Build: A substantial investment is required to build in-house tooling to fine-tune models and assemble and train a workforce of human experts to produce and rank data at a very high quality.

Building a data engine is only appropriate for extremely sensitive use cases with strict data privacy requirements.

Buy: To accelerate deployment time, most enterprises should consider buying to access state-of-the-art tooling immediately and leverage vetted human experts to improve their model performance.

There are also options to buy even for the most sensitive use cases, where the data remains within an organization. External partners can be given access to VPCs with the appropriate role-based access controls to perform the annotation, customization, and fine-tuning work needed to improve model performance.

Base Foundation Models

Description: At the core of any Generative AI application is one or more "base" models such as OpenAI's GPT-4, Anthropic's Claude, Cohere's Command model, or open-source models like FLAN-T5, StableLM, or BLOOM.

Build: Companies may choose to train their own base model (for example, BloombergGPT) when the performance of existing models - even with fine-tuning - is insufficient to meet their needs.

Buy: Commercial providers like OpenAI, Anthropic, and Cohere provide API access to their pre-trained models, typically charging based on usage.
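To see how usage-based pricing scales, a back-of-the-envelope estimate can help. The sketch below assumes a provider that bills separately per 1,000 input and output tokens; the prices and traffic figures are hypothetical and should be replaced with a provider's published rates and your own volumes.

```python
def estimate_monthly_cost(requests_per_day: int,
                          prompt_tokens: int,
                          completion_tokens: int,
                          input_price_per_1k: float,
                          output_price_per_1k: float,
                          days: int = 30) -> float:
    """Estimate monthly spend for a token-priced API (all prices in USD)."""
    per_request = (
        (prompt_tokens / 1000) * input_price_per_1k
        + (completion_tokens / 1000) * output_price_per_1k
    )
    return per_request * requests_per_day * days


# Hypothetical workload: 1,000 requests per day, 500 prompt tokens and
# 200 completion tokens per request, at assumed prices of $0.03 and $0.06
# per 1,000 input and output tokens.
monthly = estimate_monthly_cost(
    requests_per_day=1000,
    prompt_tokens=500,
    completion_tokens=200,
    input_price_per_1k=0.03,
    output_price_per_1k=0.06,
)
print(f"Estimated monthly cost: ${monthly:,.2f}")
```

With these assumed numbers each request costs $0.027, which works out to about $810 per month. Running the same estimate for a self-hosted open-source model would substitute hosting costs for per-token prices.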

There are tradeoffs to both commercial and open-source models, so it is essential to carefully consider your use case and the models available before investing. For example, some commercial models, such as GPT-4, do not offer fine-tuning. In contrast, open-source models provide this capability and more flexibility but also require the company to host the model itself.