[230831] Graduation Project

I studied how Stable Diffusion works by referring to the following.

Wait, how does this even work?

Unlike what you might expect at this point, StableDiffusion doesn't actually run on magic.
It's a kind of "latent diffusion model". Let's dig into what that means.

You may be familiar with the idea of super-resolution:
it's possible to train a deep learning model to denoise an input image -- and thereby turn it into a higher-resolution
version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution
input -- rather, the model uses its training data distribution to hallucinate the visual details that would be most likely
given the input. To learn more about super-resolution, you can check out the super-resolution tutorials on Keras.io.
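
As a rough illustration of that idea (a made-up toy model, not the actual Keras.io tutorial code), a denoiser can be as small as a few convolutional layers trained to map noisy images back to clean ones:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Toy denoiser: maps a noisy image to a clean one.
# Layer sizes are illustrative, not the Keras.io tutorial models.
def build_denoiser(size=64, channels=3):
    inputs = keras.Input(shape=(size, size, channels))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(channels, 3, padding="same")(x)
    return keras.Model(inputs, outputs)

denoiser = build_denoiser()
# Trained by pairing clean images with artificially noised copies:
# denoiser.compile(optimizer="adam", loss="mse")
# denoiser.fit(noisy_images, clean_images, ...)
```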

When you push this idea to the limit, you may start asking -- what if we just run such a model on pure noise?
The model would then "denoise the noise" and start hallucinating a brand new image. By repeating the process multiple
times, you can turn a small patch of noise into an increasingly clear and high-resolution artificial picture.
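
In code terms, "denoising the noise" is just a loop. A minimal sketch, assuming a trained `denoiser` like the toy one above (real diffusion samplers are more careful, re-injecting a scheduled amount of noise between steps):

```python
import tensorflow as tf

# Start from pure Gaussian noise and repeatedly "denoise" it.
# Conceptual sketch only: each pass hallucinates slightly more structure.
x = tf.random.normal(shape=(1, 64, 64, 3))
num_steps = 50
for step in range(num_steps):
    x = denoiser(x)
```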

This is the key idea of latent diffusion, proposed in
High-Resolution Image Synthesis with Latent Diffusion Models in 2021.
To understand diffusion in depth, you can check the Keras.io tutorial
Denoising Diffusion Implicit Models.

Now, to go from latent diffusion to a text-to-image system,
you still need to add one key feature: the ability to control the generated visual contents via prompt keywords.
This is done via "conditioning", a classic deep learning technique which consists of concatenating a vector
that represents a bit of text to the noise patch, then training the model on a dataset of {image: caption} pairs.
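
A minimal sketch of what that conditioning looks like, with hypothetical shapes (a 768-dimensional text vector, a 64x64x4 latent patch); note that the real Stable Diffusion model also feeds the text embedding in through attention layers:

```python
import tensorflow as tf

# Hypothetical shapes: a 64x64 latent noise patch and a text embedding.
noise = tf.random.normal(shape=(1, 64, 64, 4))
text_vector = tf.random.normal(shape=(1, 768))  # e.g. from a text encoder

# Broadcast the text vector across the spatial grid, then concatenate
# it to the noise patch along the channel axis.
text_map = tf.tile(tf.reshape(text_vector, (1, 1, 1, 768)), (1, 64, 64, 1))
conditioned = tf.concat([noise, text_map], axis=-1)  # shape (1, 64, 64, 772)
```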

This gives rise to the Stable Diffusion architecture. Stable Diffusion consists of three parts:

  • A text encoder, which turns your prompt into a latent vector.
  • A diffusion model, which repeatedly "denoises" a 64x64 latent image patch.
  • A decoder, which turns the final 64x64 latent patch into a higher-resolution 512x512 image.
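
Before looking at each part, here is how the whole thing is driven through the high-level KerasCV API, assuming `keras_cv` is installed (the prompt is the example used in the tutorial this post is based on):

```python
import keras_cv

# All three parts are wired together behind text_to_image().
model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)
images = model.text_to_image(
    "photograph of an astronaut riding a horse",
    batch_size=1,
    num_steps=50,  # number of diffusion "denoising" steps
)
```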

First, your text prompt gets projected into a latent vector space by the text encoder,
which is simply a pretrained, frozen language model. Then that prompt vector is concatenated
to a randomly generated noise patch, which is repeatedly "denoised" by the diffusion model over a series
of "steps" (the more steps you run the clearer and nicer your image will be -- the default value is 50 steps).

Finally, the 64x64 latent image is sent through the decoder to properly render it in high resolution.
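
Putting the three stages together, the whole generation loop can be sketched like this (`text_encoder`, `diffusion_model`, and `decoder` are placeholders standing in for the three pretrained parts, not real KerasCV calls):

```python
import tensorflow as tf

# Conceptual sketch of the three-stage pipeline described above.
def generate(prompt, text_encoder, diffusion_model, decoder, num_steps=50):
    context = text_encoder(prompt)              # prompt -> latent vector
    latent = tf.random.normal((1, 64, 64, 4))   # random noise patch
    for step in range(num_steps):               # more steps -> cleaner image
        latent = diffusion_model(latent, context, step)
    return decoder(latent)                      # 64x64 latent -> 512x512 image
```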

All-in-all, it's a pretty simple system -- the Keras implementation
fits in four files that represent less than 500 lines of code in total.

But this relatively simple system starts looking like magic once you train on billions of pictures and their captions.
As Feynman said about the universe: "It's not complicated, it's just a lot of it!"
