Diffusion models have the power to generate almost any image you can imagine. This is the guide you need to use them to your advantage, whether you are a creative artist, a software developer, or a business executive.
With the release of Dall-E 2, Google’s Imagen, Stable Diffusion, and Midjourney, diffusion models have taken the world by storm, inspiring creativity and pushing the boundaries of machine learning.
These models can generate a near-infinite variety of images from text prompts, including the photo-realistic, the fantastical, the futuristic, and of course the adorable.
These capabilities redefine what it means for humanity to interact with silicon, giving us superpowers to generate almost any image that we can imagine. Even with their advanced capabilities, diffusion models do have limitations which we will cover later in the guide. But as these models are continuously improved or the next generative paradigm takes over, they will enable humanity to create images, videos, and other immersive experiences with simply a thought.
In this guide, we explore diffusion models, how they work, their practical applications, and what the future may have in store.
Generative models are a class of machine learning models that can generate new data based on their training data. Other generative models include generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based models. Each can produce high-quality images, but each has limitations, in training stability or in the diversity and realism of its samples, that diffusion models largely avoid.
At a high level, diffusion models work by destroying training data through the gradual addition of noise and then learning to recover the data by reversing this noising process. In other words, diffusion models can generate coherent images from pure noise.
During training, diffusion models add noise to images and learn how to remove it. At generation time, the model applies this learned denoising process to pure noise, initialized from a random seed, to produce realistic images.
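To make the forward, noise-adding half of this process concrete, below is a minimal sketch of a DDPM-style noising step in PyTorch. The schedule values and names are illustrative rather than taken from any particular model.

```python
import torch

# Illustrative linear noise schedule over 1,000 diffusion steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise added per step
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retained

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Noise an image batch x0 to diffusion step t in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# During training the model sees x_t and learns to predict the added noise;
# at generation time it runs that learned reversal starting from pure noise.
```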
Combined with text-to-image guidance, these models can create a near-infinite variety of images from text alone by conditioning the image generation process. Text embeddings from models like CLIP can steer the denoising process, providing powerful text-to-image capabilities.
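As a concrete illustration of text-conditioned generation, the sketch below uses the open-source diffusers library with a Stable Diffusion checkpoint. The checkpoint name, prompt, seed, and guidance value are placeholders, and the exact API can differ between library versions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image pipeline (checkpoint name is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Fixing the seed makes the run reproducible: same prompt + same seed -> same image.
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    "a photograph of an astronaut riding a horse on the moon",
    guidance_scale=7.5,   # how strongly the text conditioning steers the denoising
    generator=generator,
).images[0]
image.save("astronaut.png")
```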
Diffusion models can complete various tasks, including image generation, image denoising, inpainting, outpainting, and bit diffusion.
Popular diffusion models include OpenAI’s Dall-E 2, Google’s Imagen, and Stability AI's Stable Diffusion.
Simply put, diffusion models are generative tools that enable users to create almost any image they can imagine.
Diffusion models represent the zenith of generative capabilities today. However, these models stand on the shoulders of giants, owing their success to over a decade of advancements in machine learning techniques, the widespread availability of massive amounts of image data, and improved hardware.
For some context, below is a brief outline of significant machine learning developments.
What makes diffusion models so strikingly different from their predecessors? The most apparent answer is their ability to generate highly realistic imagery and to match the distribution of real images better than GANs. Diffusion models are also more stable to train than GANs, which are subject to mode collapse, where the trained model represents only a few modes of the true data distribution. In the extreme case, mode collapse means a single image would be returned for any prompt, though the issue is rarely that severe in practice. Diffusion models avoid the problem because the diffusion process smooths out the distribution, giving them greater diversity of imagery than GANs.
Diffusion models can also be conditioned on a wide variety of inputs, such as text for text-to-image generation, bounding boxes for layout-to-image generation, masked images for inpainting, and lower-resolution images for super-resolution.
The applications for diffusion models are vast, and the practical uses of these models are still evolving. These models will greatly impact Retail and eCommerce, Entertainment, Social Media, AR/VR, Marketing, and more.
Web applications such as OpenAI’s Dall-E 2 and Stability AI’s DreamStudio make diffusion models readily available. These tools provide a quick and easy way for beginners to get started with diffusion models, allowing you to generate images from prompts and perform inpainting and outpainting. DreamStudio offers more control over the output parameters, while Dall-E 2’s interface is simpler with fewer frills. Each platform provides free credits to new users but charges a usage fee once those credits are depleted.
You can also browse a wide array of curated images on aggregation sites like Lexica.art, which provide an even easier way to get started, get inspired by what the broader community has been creating, and learn how to build better prompts.
Prompts are how you control the outputs of diffusion models. These models take two primary inputs, a seed integer and a text prompt, and translate them into a fixed point in the model’s latent space. The seed integer is generally generated automatically, and the user provides the text prompt. Continuous experimentation via prompt engineering is critical to getting the perfect outcome. We explored Dall-E 2 and Stable Diffusion and have consolidated our best tips and tricks for getting the most out of your prompts, including prompt length, artistic style, and key terms to help you sculpt the images you want to generate.
In general, there are three main components to a prompt:
Frame + Subject + Style + an optional Seed.
1. Frame - The frame of an image is the type of image to be generated. This is combined with the Style later in the prompt to provide an overall look and feel of the image. Examples of Frames include photograph, digital illustration, oil painting, pencil drawing, one-line drawing, and matte painting.
2. Subject - The main subject for generated images can be anything you can dream up.
3. Style - The style of an image has several facets, key ones being the lighting, the theme, the art influence, or the period.
4. Seed - The seed is an optional integer that fixes the starting point of the generation process, so the same prompt and seed will reproduce the same image.
To summarize, a good way to structure your prompts is to include the elements “[frame] [main subject] [style type] [modifiers]” or “A [frame type] of a [main subject], [style example]”, plus an optional seed. The order of these phrases may alter your outcome, so if you are looking for a particular result, it is best to experiment with all of these values until you are satisfied.
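As a small illustration, here is a hypothetical helper that assembles a prompt from the frame, subject, style, and modifiers described above; the structure simply mirrors the template and is not an official API of any tool.

```python
def build_prompt(frame, subject, style, modifiers=None):
    """Assemble an 'A [frame] of [subject], [style], [modifiers]' prompt string."""
    parts = ["A {} of {}".format(frame, subject), style]
    if modifiers:
        parts.extend(modifiers)
    return ", ".join(parts)

prompt = build_prompt(
    frame="matte painting",
    subject="a black velvet couch in a desert",
    style="golden-hour lighting, wide angle",
    modifiers=["ultra-realistic", "octane render"],
)
print(prompt)
# A matte painting of a black velvet couch in a desert, golden-hour lighting,
# wide angle, ultra-realistic, octane render
```

Pairing a prompt like this with a fixed seed, as in the earlier generation example, makes it easy to tweak one element at a time and compare the results.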
5. Prompt Length
Generally, prompts should be just as verbose as you need them to be to get the desired result. It is best to start with a simple prompt to experiment with the results returned and then refine your prompts, extending the length as needed.
However, many well-tuned prompts already exist and can be reused or modified.
Modifiers such as "ultra-realistic," "octane render," and "unreal engine" tend to help refine the quality of images, as you can see in some of the examples below.
6. Additional Tips
A few additional items are worth mentioning.
Placing the primary subject of the image closer to the beginning of the prompt tends to ensure that the subject is included in the image. For instance, compare two otherwise identical prompts, one that leads with the subject and one that buries it at the end.
Some combinations of subject and location tend to yield poor results. For instance, "A black velvet couch on the surface of the moon" yields uneven results, with inconsistent backgrounds and, at times, no couch at all. However, a similar prompt, "A black velvet couch in a desert," tends to reflect the intent of the prompt, capturing the velvet material, the black color, and the characteristics of the scene more accurately. Presumably, the training data contains more desert images than moon images, making the model better at creating coherent desert scenes.
Prompt engineering is an ever-evolving topic, with new tips and tricks being uncovered daily. As more businesses discover the power of diffusion models to help solve their problems, a new type of career, "Prompt Engineer," is likely to emerge.
As powerful as they are, diffusion models do have limitations, some of which we will explore here. Disclaimer: given the rapid pace of development, these limitations are noted as of October 2022.
Diffusion models' flexibility gives them more capabilities than just pure image generation.
1. Inpainting
Inpainting lets you mask a region of an image and regenerate just that region from a text prompt, making it quite powerful for editing images quickly and generating new scenes dynamically.
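For reference, a minimal inpainting sketch using the open-source diffusers library might look like the following; the checkpoint name and file paths are placeholders, and the mask convention shown here (white marks the region to regenerate) and exact arguments can vary by version.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # illustrative checkpoint name
    torch_dtype=torch.float16,
).to("cuda")

# The original image plus a mask whose white pixels mark the region to regenerate.
init_image = Image.open("living_room.png").convert("RGB").resize((512, 512))
mask_image = Image.open("window_mask.png").convert("RGB").resize((512, 512))

edited = pipe(
    prompt="a futuristic cityscape seen through the window",
    image=init_image,
    mask_image=mask_image,
).images[0]
edited.save("living_room_edited.png")
```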
In the future, expect these tools to become even more efficient, without requiring the user to edit a mask at all. By simply describing the desired edit, e.g., “Replace the background with a futuristic cityscape," the image will be edited automatically, with no mouse clicks or keystrokes needed.
2. Outpainting
Outpainting enables users to extend their creativity by continuing an image beyond its original borders, adding visual elements in the same style or taking the story in new directions, simply by using a natural language description.
Starting with a real-world image or a generated image, you can extend that image beyond the original borders until you have a larger, coherent scene.
Outpainting requires an extra layer of prompt refinement to generate coherent scenes, but it enables you to quickly create large images that would take significantly longer to produce with traditional methods.
Outpainting enables impressive amounts of creativity and the ability to build large-scale scenes that remain coherent in theme and style. As with inpainting, there is room to make this capability even simpler, so that a single prompt yields exactly the scene you are looking for.
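One common way to approximate outpainting with today's open-source tooling is to reuse an inpainting pipeline: paste the original image onto a larger canvas and ask the model to fill the blank region. The sketch below follows that approach; the checkpoint, file names, canvas sizes, and argument support are assumptions that may vary by diffusers version.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

src = Image.open("scene.png").convert("RGB").resize((512, 512))

# Place the original on the left half of a wider canvas...
canvas = Image.new("RGB", (768, 512), "black")
canvas.paste(src, (0, 0))

# ...and mark everything outside the original (in white) as the region to generate.
mask = Image.new("RGB", (768, 512), "white")
mask.paste(Image.new("RGB", (512, 512), "black"), (0, 0))

extended = pipe(
    prompt="a matte painting of a desert at sunset, wide coherent landscape",
    image=canvas,
    mask_image=mask,
    height=512,
    width=768,
).images[0]
extended.save("scene_outpainted.png")
```

In practice, large panoramas are usually built up by repeating this step and shifting the window a few hundred pixels at a time, so each new region overlaps the previous one and stays coherent.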
3. Diffusion for Video Generation
As we have seen, generating static images is exciting and offers many practical applications. Several recently announced models take the capabilities of diffusion models and extend them to create videos. These capabilities have not yet made it to the hands of a broader audience but will be coming very soon.
4. Diffusion Model Image Curation Sites
Curation sites like Lexica.art provide highly curated collections of generated images and the prompts that were used to create them. Millions of images are indexed, so there is a good chance that the image you thought you would have to generate already exists and is just a quick search away. Searches are low latency: instead of waiting one to two minutes for a diffusion model to generate images, you get results nearly instantly. This is great for experimenting, exploring prompts, and searching for particular types of images, and Lexica is also a great way to learn how to prompt for the results you are looking for.
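If you prefer to search programmatically, Lexica has exposed a simple public search endpoint; the URL and response fields in this sketch are assumptions based on its documentation at the time of writing, so verify them against the current docs before relying on it.

```python
import requests

# Assumed public search endpoint; confirm against Lexica's current documentation.
resp = requests.get(
    "https://lexica.art/api/v1/search",
    params={"q": "black velvet couch in a desert"},
    timeout=30,
)
resp.raise_for_status()

for result in resp.json().get("images", [])[:5]:
    # Assumed response fields: each result carries its prompt and an image URL.
    print(result.get("prompt"), "->", result.get("src"))
```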
The obvious application for diffusion models is to be integrated into design tools, empowering artists to be even more creative and efficient. In fact, the first wave of these tools has already been announced, including Microsoft Designer, which integrates Dall-E 2 into its tooling. There are significant opportunities in the Retail and eCommerce space, with generative designs for products, fully generated catalogs, alternate-angle generation, and much more.
Product designers will gain powerful new design tools that enhance their creativity and let them see what products look like in the context of homes, offices, and other scenes. With advancements in 3D diffusion, full 3D renders of products can be created from a prompt. Taking this to the extreme, these renders can then be 3D printed and come to life in the real world.
Marketing will be transformed: ad creative can be generated dynamically, providing massive efficiency gains, and the ability to test many different creatives will increase the effectiveness of ads.
The entertainment industry will begin incorporating diffusion models into special effects tooling, which will enable faster and more cost-effective productions. This will lead to more creative and wild entertainment concepts that are limited today due to the high costs of production. Similarly, Augmented and Virtual Reality experiences will be improved with the near-real-time content generation capabilities of the models. Users will be able to alter their world at will, with just the sound of their voice.
A new generation of tooling is being developed around these models, which will unlock a wide range of capabilities.
The vast capabilities of Diffusion models are inspiring, and we don't yet know the true extent of their limitations.
Foundation models are bound to expand their capabilities over time, and progress is accelerating rapidly. As these models improve, the way humanity interacts with machines will change fundamentally. As Roon stated in his blog Text is the Universal Interface, "soon, prompting may not look like 'engineering' at all but a simple dialogue with the machine."
The opportunities for advancing our society, art, and business are plentiful, but the technology needs to be embraced quickly to realize these benefits. Businesses need to take advantage of this new functionality or risk falling dramatically behind. We look forward to a future where humans are a prompt away from creating anything we can imagine, unlocking unlimited productivity and creativity. The best time to get started on this journey is now, and we hope this guide serves as a strong foundation for it.