Diffusion models and the diffusers library from 🤗¶
Diffusion models are at the forefront of research and applications of Gen-AI in Computer Vision, with new papers and innovations published every day. Most open-source models and results are released and implemented in the 🤗 (Hugging Face) libraries: transformers for LLMs and diffusers for Diffusion Models. Both are high-level libraries built on top of PyTorch.
In this exercise we are going to have fun with some of the amazing capabilities of the diffusers library, together with the 🤗 Hub of Models, an online repository where open-source models are hosted and available for use free of charge.
Please note: always check the license of the models before using them in a professional setting because some restrict commercial use.
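If you want to check a model's license programmatically, the huggingface_hub client (installed together with diffusers) can fetch the repository metadata from the Hub. This is a minimal sketch, assuming network access and that the repository exposes its license as a tag:
from huggingface_hub import model_info
# Fetch the metadata of a model repository from the 🤗 Hub
info = model_info("stabilityai/sdxl-turbo")
# The license is usually exposed as a "license:..." tag on the repository;
# always read the full Model Card as well
print([t for t in info.tags if t.startswith("license:")])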
It is worth noting that the 🤗 Hub also hosts freely-available Datasets and Spaces, where people can publish demos of interesting applications of Gen-AI and more.
IMPORTANT: when using this notebook within the Udacity Workspace, you need to restart the notebook when requested, otherwise you will run out of GPU memory. Be on the lookout for the RESTART NOW messages.
NOTE: because we are using a lot of different models, the first time you run them you will see a lot of messages from the diffusers library alerting you that it is downloading files from the internet. Those are expected, and they do NOT constitute an error. Just continue on.
Let's start by importing the elements we are going to use:
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
Now we're going to see a few applications in the field of image generation and editing, as well as video generation.
Unconditional generation¶
In this type of image generation, the generator is free to do whatever it wants, i.e., it is in "slot machine" mode: we pull the lever and the model will generate something related to the training set it has been trained on. The only control we have here is the random seed.
You can see all the available unconditional diffusion models compatible with the diffusers library here. You can substitute the model_name value in the cell below with any of these model names, for example google/ddpm-cifar10-32 or WiNE-iNEFF/Minecraft-Skin-Diffusion-V3:
# Fix the random seed so the result is reproducible
rand_gen = torch.manual_seed(12418351)
# Load the chosen pipeline from the 🤗 Hub and move it to the GPU
model_name = 'google/ddpm-celebahq-256'
model = DiffusionPipeline.from_pretrained(model_name).to("cuda")
# Generate one image and display it
image = model(generator=rand_gen).images[0]
image
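Since the random seed is the only control we have in unconditional generation, a common trick is to generate several images at once and keep the one you like best. The following is a minimal sketch, assuming the model loaded above is still in memory (batch_size is a standard argument of the unconditional pipelines, such as DDPMPipeline; the seed here is just an example):
# Generate a small batch with a new seed and compare the results side by side
rand_gen = torch.manual_seed(77)
images = model(batch_size=4, generator=rand_gen).images
make_image_grid(images, rows=1, cols=4)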
While most of them can be used this way, some require special handling. In that case, the code needed to use them is typically reported in the Model Card, which you can access by simply clicking on the model's name in the list.
Text-to-image¶
This is a class of conditional image generation models. The conditioning happens through text: we provide a text prompt, and the model creates an image following that description.
You can find a list of available text-to-image models here.
For example, here we use Stable Diffusion XL Turbo, a version of Stable Diffusion XL optimized for super-fast inference:
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/sdxl-turbo",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
prompt = "A photo of a wild horse jumping in the desert dramatic sky intricate details National Geographic 8k high details"
rand_gen = torch.manual_seed(423122981)
image = pipe(
prompt=prompt,
num_inference_steps=1, # For this model you can use 1, but for normal Stable Diffusion you should use 25 or 50
guidance_scale=1.0, # For this model 1 is fine, for normal Stable Diffusion you should use 6 or 7, or up to 10 or so
negative_prompt=["overexposed", "underexposed"],
generator=rand_gen
).images[0]
image
RESTART NOW: please restart the notebook, then start running from the next cell and continue on
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
Here is another example, where we use a nice model by Playground AI that generates artistic images rather than photorealistic ones:
pipe = AutoPipelineForText2Image.from_pretrained(
"playgroundai/playground-v2-1024px-aesthetic",
torch_dtype=torch.float16,
use_safetensors=True,
add_watermarker=False,
variant="fp16"
).to("cuda")
prompt = "A scifi astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
rand_gen = torch.manual_seed(42312981)
image = pipe(prompt=prompt, guidance_scale=3.0, generator=rand_gen).images[0]
image
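Most text-to-image pipelines can also return several candidates for the same prompt, which makes it easier to cherry-pick a result. Here is a minimal sketch, assuming the Playground pipeline and the prompt defined above are still in memory (num_images_per_prompt is a standard argument of these pipelines):
# Generate 2 candidates for the same prompt and show them side by side
rand_gen = torch.manual_seed(42312981)
images = pipe(
    prompt=prompt,
    guidance_scale=3.0,
    num_images_per_prompt=2,
    generator=rand_gen
).images
make_image_grid(images, rows=1, cols=2)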
Image-to-image¶
In the image-to-image task we condition the output of the Diffusion Model on an input image. There are many ways of doing this. Here we look at transforming a barebones sketch of a scene into a beautiful, highly-detailed rendering of the same scene.
Let's start by creating a sketch. We could create that manually, but since we're here, let's use SDXL-Turbo instead.
NOTE: it is important for the sketch not to be too detailed and complicated. Flat colors typically work best, although this is not an absolute rule.
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
import torch
prompt = "A tree and a house, made by a child with 3 colors"
rand_gen = torch.manual_seed(423121)
pipe = AutoPipelineForText2Image.from_pretrained(
"stabilityai/sdxl-turbo",
torch_dtype=torch.float16,
variant="fp16"
).to("cuda")
image = pipe(
prompt=prompt,
num_inference_steps=2,
guidance_scale=2,
generator=rand_gen
).images[0]
image
image.save("sketch.png")
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import torch
image = Image.open("sketch.png")
Now we can use the Kandinsky model to generate an image that respects the subjects and their positions in our sketch:
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyPriorPipeline
# Kandinsky 2.2 has two parts: a prior, which maps the text prompt to image
# embeddings, and a decoder, which generates the image from those embeddings
# (conditioned here on our sketch)
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
original_image = image.copy().resize((768, 768))
prompt = "A photograph of a house in the fall, high details, broad daylight"
negative_prompt = "low quality, bad quality"
rand_gen = torch.manual_seed(67806801)
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, generator=rand_gen).to_tuple()
new_image = pipeline(
image=original_image,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
height=768,
width=768,
strength=0.35,
generator=rand_gen
).images[0]
fig = make_image_grid([original_image.resize((512, 512)), new_image.resize((512, 512))], rows=1, cols=2)
fig
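The strength parameter controls how far the model is allowed to drift from the input image: low values stay close to the sketch, while high values give the model more creative freedom. Here is a minimal sketch of a comparison, assuming the pipelines and the embeddings computed above are still in memory (the specific strength values are just an example):
# Compare a few strength values side by side
outputs = []
for strength in [0.25, 0.45, 0.65]:
    rand_gen = torch.manual_seed(67806801)
    out = pipeline(
        image=original_image,
        image_embeds=image_embeds,
        negative_image_embeds=negative_image_embeds,
        height=768,
        width=768,
        strength=strength,
        generator=rand_gen
    ).images[0]
    outputs.append(out.resize((512, 512)))
make_image_grid(outputs, rows=1, cols=3)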
Inpainting¶
Diffusion models can also be used to do inpainting, which means filling regions of an image according to a prompt (or just according to the surroundings of the hole to fill).
Typically, we start from an image and a mask. The mask indicates the pixels to be inpainted, i.e., removed from the original image and filled with new content generated by the model.
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
import torch
from diffusers import DiffusionPipeline, AutoPipelineForText2Image, AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForInpainting.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
# Offload sub-models to the CPU when they are not in use, to reduce GPU memory usage
pipeline.enable_model_cpu_offload()
init_image = load_image("monalisa.png").resize((512, 512))
mask_image = load_image("monalisa_mask.png").resize((512, 512))
prompt = "oil painting of a woman, sfumato, renaissance, low details, Da Vinci"
negative_prompt = "bad anatomy, deformed, ugly, disfigured"
rand_gen = torch.manual_seed(74294536)
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=init_image, mask_image=mask_image, generator=rand_gen, guidance_scale=1.5).images[0]
fig = make_image_grid([init_image, mask_image, image], rows=1, cols=3)
fig
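The monalisa_mask.png file was prepared in advance. In the convention used by the diffusers inpainting pipelines, white pixels in the mask are repainted while black pixels are preserved. If you want to experiment with your own masks, here is a minimal sketch using PIL (the rectangle coordinates are just illustrative):
from PIL import Image, ImageDraw
# Start from an all-black mask (keep everything), then paint the region we
# want the model to fill in white
custom_mask = Image.new("L", init_image.size, 0)
draw = ImageDraw.Draw(custom_mask)
draw.rectangle([150, 100, 360, 300], fill=255)  # (left, top, right, bottom) of the region to repaint
custom_mask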
Beyond images¶
Diffusion models can also be used for video generation. At the time of writing, this field is still in its infancy, but it is progressing fast, so keep an eye on the available models: there might be much better ones by the time you are reading this.
The list of available models for text-to-video is available here.
# get_video is a small helper from the local helpers module that assembles the
# generated frames into an mp4 file we can display in the notebook
from helpers import get_video
from IPython.display import Video
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16").to("cuda")
prompt = "Earth sphere from space"
rand_gen = torch.manual_seed(42312981)
frames = pipe(prompt, generator=rand_gen).frames
Video(get_video(frames, "earth.mp4"))
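If you are running this outside of the workspace and do not have the helpers module, diffusers also ships an export_to_video utility you can use to save the frames to disk instead. A minimal sketch (depending on the diffusers version, the frames may be nested one level deeper, in which case pass frames[0]):
from diffusers.utils import export_to_video
# Save the generated frames as an mp4 file; the function returns the output path
video_path = export_to_video(frames, "earth_exported.mp4")
video_path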
We can also generate a video starting from an image. For example, let's consider the following image (which was generated with Stable Diffusion XL and then outpainted using DALLE-2):
PLEASE RESTART NOW to free GPU memory, then continue on from the next cell
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
from helpers import get_video
from IPython.display import Video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# These settings lower the VRAM usage (the second one is optional: it trades
# speed for memory, which is why it is left commented out here)
pipe.enable_model_cpu_offload()
# pipe.unet.enable_forward_chunking()
# Load the conditioning image
image = load_image("in_the_desert_outpaint.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(999)
res = pipe(
image,
decode_chunk_size=2,
generator=generator,
num_inference_steps=15,
num_videos_per_prompt=1
)
Video(get_video(res.frames[0], "horse2.mp4"))
Overall the animation looks a bit off: the appearance of the legs can definitely be improved, although the overall motion seems correct. Let's try with a different object that has fewer parts:
image = load_image("xwing.jpeg")
image = image.resize((1024, 576))
generator = torch.manual_seed(999)
res = pipe(
image,
decode_chunk_size=2,
generator=generator,
num_inference_steps=25,
num_videos_per_prompt=1
)
Video(get_video(res.frames[0], "xwing.mp4"))
The animation is definitely more realistic here, but it still leaves a lot to be desired. However, feel free to try other images and prompts and see what you get!