Diffusion models and the diffusers library from 🤗¶
Diffusion models are at the forefront of research and applications of Gen-AI in Computer Vision, with new papers and innovations published every day. Most open-source models and results are released and implemented in the 🤗 (Hugging Face) libraries: transformers for LLMs and diffusers for Diffusion Models. Both are high-level libraries built on top of PyTorch.
In this exercise we are going to have fun with some of the amazing capabilities of the diffusers library, together with the 🤗 Hub of Models, an online repository where open-source models are hosted and available for use free of charge.
Please note: always check the license of the models before using them in a professional setting because some restrict commercial use.
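If you want to check a model's license programmatically, the huggingface_hub client (installed together with diffusers) can fetch the repository metadata from the Hub. This is a minimal sketch, assuming network access and that the repository exposes its license as a tag:
from huggingface_hub import model_info
# Fetch the metadata of a model repository from the 🤗 Hub
info = model_info("stabilityai/sdxl-turbo")
# The license is usually exposed as a "license:..." tag on the repository;
# always read the full Model Card as well
print([t for t in info.tags if t.startswith("license:")])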
It is worth noting that the 🤗 Hub also hosts freely-available Datasets and Spaces, where people can publish demos of interesting applications of Gen-AI and more.
IMPORTANT: when using this notebook within the Udacity Workspace, you need to restart the notebook when requested, otherwise you will run out of GPU memory. Be on the lookout for the RESTART NOW messages.
NOTE: because we are using a lot of different models, the first time you run them you will see a lot of messages from the diffusers library alerting you that it is downloading files from the internet. Those are expected, and they do NOT constitute an error. Just continue on.
Let's start by importing the elements we are going to use:
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
Now we're going to see a few applications in the field of image generation and editing, as well as video generation.
Unconditional generation¶
In this type of image generation, the generator is free to do whatever it wants, i.e., it is in "slot machine" mode: we pull the lever and the model will generate something related to the training set it has been trained on. The only control we have here is the random seed.
You can see all the available unconditional diffusion models compatible with the diffusers library here. You can substitute the model_name value in the cell below with any of these model names, for example google/ddpm-cifar10-32 or WiNE-iNEFF/Minecraft-Skin-Diffusion-V3:
# Fix the random seed so the result is reproducible
rand_gen = torch.manual_seed(12418351)
# Load the chosen pipeline from the 🤗 Hub and move it to the GPU
model_name = 'google/ddpm-celebahq-256'
model = DiffusionPipeline.from_pretrained(model_name).to("cuda")
# Generate one image and display it
image = model(generator=rand_gen).images[0]
image
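Since the random seed is the only control we have in unconditional generation, a common trick is to generate several images at once and keep the one you like best. The following is a minimal sketch, assuming the model loaded above is still in memory (batch_size is a standard argument of the unconditional pipelines, such as DDPMPipeline; the seed here is just an example):
# Generate a small batch with a new seed and compare the results side by side
rand_gen = torch.manual_seed(77)
images = model(batch_size=4, generator=rand_gen).images
make_image_grid(images, rows=1, cols=4)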
While most of them can be used this way, some require special handling. In that case, the code needed to use them is typically reported in the Model Card, which you can access by simply clicking on the model's name in the list.
Text-to-image¶
This is a class of conditional image generation models. The conditioning happens through text: we provide a text prompt, and the model creates an image following that description.
You can find a list of available text-to-image models here.
For example, here we use Stable Diffusion XL Turbo, a version of Stable Diffusion XL optimized for super-fast inference:
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/sdxl-turbo",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
prompt = "A photo of a wild horse jumping in the desert dramatic sky intricate details National Geographic 8k high details"
rand_gen = torch.manual_seed(423122981)
image = pipe(
prompt=prompt,
num_inference_steps=1, # For this model you can use 1, but for normal Stable Diffusion you should use 25 or 50
guidance_scale=1.0, # For this model 1 is fine, for normal Stable Diffusion you should use 6 or 7, or up to 10 or so
negative_prompt=["overexposed", "underexposed"],
generator=rand_gen
).images[0]
image
RESTART NOW: please restart the notebook, then start running from the next cell and continue on
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
Here is another example, where we use a nice model by Playground AI that generates artistic images rather than photorealistic ones:
pipe = AutoPipelineForText2Image.from_pretrained(
"playgroundai/playground-v2-1024px-aesthetic",
torch_dtype=torch.float16,
use_safetensors=True,
add_watermarker=False,
variant="fp16"
).to("cuda")
prompt = "A scifi astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
rand_gen = torch.manual_seed(42312981)
image = pipe(prompt=prompt, guidance_scale=3.0, generator=rand_gen).images[0]
image
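Most text-to-image pipelines can also return several candidates for the same prompt, which makes it easier to cherry-pick a result. Here is a minimal sketch, assuming the Playground pipeline and the prompt defined above are still in memory (num_images_per_prompt is a standard argument of these pipelines):
# Generate 2 candidates for the same prompt and show them side by side
rand_gen = torch.manual_seed(42312981)
images = pipe(
    prompt=prompt,
    guidance_scale=3.0,
    num_images_per_prompt=2,
    generator=rand_gen
).images
make_image_grid(images, rows=1, cols=2)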
Image-to-image¶
In the image-to-image task we condition the output of the Diffusion Model on an input image. There are many ways of doing this. Here we look at transforming a barebones sketch of a scene into a beautiful, highly-detailed rendering of the same scene.
Let's start by creating a sketch. We could create that manually, but since we're here, let's use SDXL-Turbo instead.
NOTE: it is important for the sketch not to be too detailed and complicated. Flat colors typically work best, although this is not an absolute rule.
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
prompt = "A tree and a house, made by a child with 3 colors"
rand_gen = torch.manual_seed(423121)
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/sdxl-turbo",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
image = pipe(
prompt=prompt,
num_inference_steps=2,
guidance_scale=2,
generator=rand_gen
).images[0]
image
image.save("sketch.png")
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import torch
image = Image.open("sketch.png")
Now we can use the Kandinsky model to generate an image that respects the subjects and their positions in our sketch:
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline
# Kandinsky 2.2 has two parts: a prior, which maps the text prompt to image
# embeddings, and a decoder, which generates the image from those embeddings
# (conditioned here on our sketch)
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
original_image = image.copy().resize((768, 768))
prompt = "A photograph of a house in the fall, high details, broad daylight"
negative_prompt = "low quality, bad quality"
rand_gen = torch.manual_seed(67806801)
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, generator=rand_gen).to_tuple()
new_image = pipeline(
image=original_image,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
height=768,
width=768,
strength=0.35,
generator=rand_gen
).images[0]
fig = make_image_grid([original_image.resize((512, 512)), new_image.resize((512, 512))], rows=1, cols=2)
fig
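The strength parameter controls how far the model is allowed to drift from the input image: low values stay close to the sketch, while high values give the model more creative freedom. Here is a minimal sketch of a comparison, assuming the pipelines and the embeddings computed above are still in memory (the specific strength values are just an example):
# Compare a few strength values side by side
outputs = []
for strength in [0.25, 0.45, 0.65]:
    rand_gen = torch.manual_seed(67806801)
    out = pipeline(
        image=original_image,
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        height=768,
        width=768,
        strength=strength,
        generator=rand_gen
    ).images[0]
    outputs.append(out.resize((512, 512)))
make_image_grid(outputs, rows=1, cols=3)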
Inpainting¶
Diffusion models can also be used to do inpainting, which means filling regions of an image according to a prompt (or just according to the surroundings of the hole to fill).
Typically, we start from an image and a mask. The mask indicates the pixels to be inpainted, i.e., removed from the original image and filled with new content generated by the model.
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
import torch
from diffusers import DiffusionPipeline, AutoPipelineForText2Image, AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
# Offload sub-models to the CPU when they are not in use, to reduce GPU memory usage
pipeline.enable_model_cpu_offload()
init_image = load_image("monalisa.png").resize((512, 512))
mask_image = load_image("monalisa_mask.png").resize((512, 512))
prompt = "oil painting of a woman, sfumato, renaissance, low details, Da Vinci"
negative_prompt = "bad anatomy, deformed, ugly, disfigured"
rand_gen = torch.manual_seed(74294536)
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image, generator=rand_gen, guidance_scale=1.5).images[0]
fig = make_image_grid([init_image, mask_image, image], rows=1, cols=3)
fig
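The monalisa_mask.png file was prepared in advance. In the convention used by the diffusers inpainting pipelines, white pixels in the mask are repainted while black pixels are preserved. If you want to experiment with your own masks, here is a minimal sketch using PIL (the rectangle coordinates are just illustrative):
from PIL import Image, ImageDraw
# Start from an all-black mask (keep everything), then paint the region we
# want the model to fill in white
custom_mask = Image.new("L", init_image.size, 0)
draw = ImageDraw.Draw(custom_mask)
draw.rectangle([150, 100, 360, 300], fill=255)  # (left, top, right, bottom) of the region to repaint
custom_mask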
Beyond images¶
Diffusion models can also be used for video generation. At the time of writing, this field is still in its infancy, but it is progressing fast, so keep an eye on the available models: there might be much better ones by the time you are reading this.
The list of available models for text-to-video is available here.
# get_video is a small helper from the local helpers module that assembles the
# generated frames into an mp4 file we can display in the notebook
from helpers import get_video
from IPython.display import Video
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16").to("cuda")
prompt = "Earth sphere from space"
rand_gen = torch.manual_seed(42312981)
frames = pipe(prompt, generator=rand_gen).frames
Video(get_video(frames, "earth.mp4"))
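If you are running this outside of the workspace and do not have the helpers module, diffusers also ships an export_to_video utility you can use to save the frames to disk instead. A minimal sketch (depending on the diffusers version, the frames may be nested one level deeper, in which case pass frames[0]):
from diffusers.utils import export_to_video
# Save the generated frames as an mp4 file; the function returns the output path
video_path = export_to_video(frames, "earth_exported.mp4")
video_path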
We can also generate a video starting from an image. For example, let's consider the following image (which was generated with Stable Diffusion XL and then outpainted using DALLE-2):
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
from helpers import get_video
from IPython.display import Video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# These settings lower the VRAM usage (the second one is optional: it trades
# speed for memory, which is why it is left commented out here)
pipe.enable_model_cpu_offload()
# pipe.unet.enable_forward_chunking()
# Load the conditioning image
image = load_image("in_the_desert_outpaint.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(999)
res = pipe(
image,
decode_chunk_size=2,
generator=generator,
num_inference_steps=15,
num_videos_per_prompt=1
)
Video(get_video(res.frames[0], "horse2.mp4"))
Overall the animation looks a bit off: the appearance of the legs can definitely be improved, although the overall motion seems correct. Let's try with a different object that has fewer parts:
image = load_image("xwing.jpeg")
image = image.resize((1024, 576))
generator = torch.manual_seed(999)
res = pipe(
image,
decode_chunk_size=2,
generator=generator,
num_inference_steps=25,
num_videos_per_prompt=1
)
Video(get_video(res.frames[0], "xwing.mp4"))
The animation is definitely more realistic here, but it still leaves a lot to be desired. However, feel free to try other images and prompts and see what you get!