Can AI Replace Human Designers? The Magic of Diffusion Models
From VAE to diffusion models: understanding zero-shot text-to-image generation through the versions of the AIGC software DALL-E
2022 was an incredible year for AIGC (AI-Generated Content). Three outstanding text-to-image AI tools were published online and achieved excellent results:
DALL-E, the AIGC software from OpenAI, went live and reached a million user registrations.
An artist won an art contest with the AIGC image software Midjourney.
The firm Stability AI was valued at one billion dollars; its AIGC software Stable Diffusion was released in September and updated in November.
Text-to-image AI allows the user to input an arbitrary text description and automatically generates images that match it. This type of AI is easy to use and can generate a wide variety of fantastic images without human artists involved.
Although these AI tools are all commercial products, their algorithms and much of their code are open-sourced. This article walks through the historical versions of DALL-E, the AIGC model from OpenAI, as an example to explain the text-to-image architectures and how they have iterated.
DALL-E
DALL-E is an algorithm from OpenAI whose design has been open-sourced. The initial version uses VQ-VAE-2 as its architecture; the design was updated to diffusion models in the second version.
VAE: Variational Autoencoder
VAE has many applications in fields such as prediction and data mining, but the vanilla VAE is seldom applied to image encoding. This is because the vanilla VAE easily overfits or fails to converge when trained on high-dimensional data like images. The reason is that the latent space of the VAE encoder is continuous, for example a 256-dimensional floating-point vector. When we ask a powerful decoder to decode such a continuous latent to fit high-dimensional training data, posterior collapse may happen: the decoder ignores the latent variable and memorizes the original data.
To address this problem, DALL-E uses a dVAE (Discrete Variational Autoencoder) as its main design.
VQ-VAE: Vector Quantised-Variational Autoencoder
VQ-VAE is a type of discrete VAE. Unlike a continuous VAE, VQ-VAE discretizes its latent variable with the following method:
- Encode the input into a latent variable z of dimension D with the original VAE encoder.
- Build a learnable codebook, an embedding space of size K × D, where K is the codebook size.
- Replace z with its closest embedding vector in the codebook (sketched in the code below).
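As a rough sketch of the quantization step (the K and D values are illustrative, and the straight-through gradient trick from the paper is noted in a comment):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Toy VQ layer: replaces each D-dim latent vector with its nearest
    codebook entry. K and D here are placeholders, not the paper's values."""
    def __init__(self, K=512, D=64):
        super().__init__()
        self.codebook = nn.Embedding(K, D)        # learnable K x D embedding space

    def forward(self, z):                         # z: (batch, D) continuous latents
        # Euclidean distance between each latent and every codebook vector.
        dist = torch.cdist(z, self.codebook.weight)   # (batch, K)
        idx = dist.argmin(dim=1)                      # index of the nearest code
        z_q = self.codebook(idx)                      # quantized latent
        # Straight-through estimator: copy gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx
```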
According to the paper, the codebook design allows VQ-VAE to learn more effective latent variables and avoid posterior collapse. Moreover, VQ-VAE reduces manual tuning by replacing the hand-designed prior distribution over the latent variable with a learnable codebook.
The initial DALL-E builds on VQ-VAE-2, whose architecture is almost the same as VQ-VAE but adds a multi-scale hierarchical design that allows it to encode and decode images at higher resolution.
DALL-E: Zero-Shot Text-to-Image
This is the paper of the initial DALL-E. It applies a conditional VQ-VAE design, using the text prompt encoded by CLIP as the conditioning vector. The paper is not especially novel, but it provides practical methods and training techniques that make training more stable. Most importantly, it did generate some impressive images.
DALL-E 2: Diffusion Model
In the DALL-E 2 paper, Hierarchical Text-Conditional Image Generation with CLIP Latents, the overall structure changes. The biggest change is that it drops the VAE and uses diffusion models instead.
The Basic Understanding of Diffusion Model
Diffusion models (DMs) took off in 2020 after the paper Denoising Diffusion Probabilistic Models (DDPM). Diffusion models treat image generation as an image-denoising task: the model is asked to recover a very high-quality image from a white-noise input, where even a tiny change in a single pixel of the noise can change the final output greatly. That is the main idea behind diffusion models.
Diffusion models use randomness as a source of creativity, denoising/decoding breathtaking images from random white noise.
The following picture shows the generation pipeline of diffusion models. It starts from random noise, denoises the image from the previous step, step by step, and outputs a high-resolution final guess of the meaningful image hidden behind the random noise.
The pipeline looks simple and beautiful, and we can easily control its depth by stacking more steps, which in turn controls the final image quality. A noise predictor is learned to predict the noise at every step, and the predicted noise is then removed step by step.
Here are some basic points to understand about diffusion models:
- Training a diffusion model involves two processes: adding noise and removing noise. Each process contains multiple timesteps, and the overall add-and-remove-noise procedure is called the diffusion process (see the training/sampling sketch after this list).
- Only the noise-removal process is applied during image generation. It is called denoising, or the reverse diffusion process in the mathematical derivation.
- The noise predictor shares parameters across all timesteps; it behaves differently at each timestep by taking a timestep indicator as an additional input.
- Most works design the noise predictor as a U-Net.
- It is not necessary to perform the diffusion process in the original data space. For example, Stable Diffusion applies the diffusion process in a latent space and achieves state-of-the-art results.
- The output can be steered with additional input features. This trick is called guidance, or guided diffusion.
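As a rough illustration of these points, here is what a DDPM-style training step and sampling loop might look like. This is a minimal sketch, not the exact DDPM or DALL-E 2 code: the linear schedule, the `noise_predictor` callable, and the 4-D image shapes are placeholder assumptions.

```python
import torch

T = 1000                                          # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule (DDPM-style)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)          # cumulative product of alphas

def training_step(noise_predictor, x0):
    """Forward (noising) process plus the simplified DDPM loss. x0: (B, C, H, W)."""
    t = torch.randint(0, T, (x0.shape[0],))                   # a random timestep per image
    eps = torch.randn_like(x0)                                # the noise we add
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps              # noised image in one shot
    eps_hat = noise_predictor(x_t, t)                         # same network for every t
    return ((eps - eps_hat) ** 2).mean()                      # learn to predict the added noise

@torch.no_grad()
def sample(noise_predictor, shape):
    """Reverse (denoising) process: start from white noise and remove noise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = noise_predictor(x, torch.full((shape[0],), t))
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt()           # mean of p(x_{t-1} | x_t)
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)     # re-inject a little noise
    return x
```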
Read the following articles if you want to know more details about diffusion models. The first one focuses on illustrations, and the second one covers more of the mathematical derivation.
DALL-E 2
Now back to the DALL-E 2 paper. The following picture shows an overview of the work. Although it is built on diffusion models, there are two components that set it apart from the original diffusion setup: the prior and the decoder.
The prior transforms a text embedding into an image embedding. There are two prior designs in the paper. The first is the autoregressive prior (AR prior), which is essentially a transformer. The second is the diffusion prior, which applies guidance to steer random noise toward a meaningful image embedding using the text embedding as input.
The decoder is a diffusion model that generates the high-resolution final image under the guidance of the image embedding produced by the prior. According to the paper, it uses GLIDE as the diffusion model structure and adds training techniques, such as randomly dropping the conditioning information, to boost the final results.
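Putting the two stages together, the generation flow can be sketched as follows. The function names `clip_text_encoder`, `diffusion_prior`, and `diffusion_decoder` are hypothetical placeholders for this illustration, not the paper's actual API:

```python
def generate_image(text_prompt, clip_text_encoder, diffusion_prior, diffusion_decoder):
    """High-level sketch of the DALL-E 2 two-stage pipeline."""
    # 1. Encode the prompt into a CLIP text embedding.
    text_emb = clip_text_encoder(text_prompt)

    # 2. Prior: map the text embedding to a CLIP image embedding
    #    (either an autoregressive transformer or a diffusion model guided by text_emb).
    image_emb = diffusion_prior(text_emb)

    # 3. Decoder: a diffusion model (modified GLIDE) generates the final image,
    #    guided by the predicted image embedding (and optionally the text).
    image = diffusion_decoder(image_emb, text_emb)
    return image
```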
Here’s the GitHub source:
Why Diffusion Models?
Judging from the evolution of DALL-E and other AIGC products, the mainstream technique of AIGC has shifted from VAE and GAN to diffusion models. The paper Diffusion Models Beat GANs on Image Synthesis even declares victory for diffusion models in its title. After some surveying, here are my own viewpoints on why diffusion models are more popular than VAE and GAN today.
Mathematical Proof
Like VAE, diffusion models rely on deriving the ELBO (Evidence Lower Bound) for the probability of the generated images and maximizing it during training. Learning is friendlier for diffusion models than for VAE because the data dimension is not reduced, so the information flowing through the model is not compressed either.
On the other hand, GANs have many weaknesses. The biggest problem is that GANs are hard to train stably: it is not easy to keep the generator and the discriminator near a Nash equilibrium for every training run on every kind of data. Moreover, the adversarial loss does NOT explicitly indicate the output image quality.
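For reference, the variational bound that diffusion models optimize in the DDPM formulation, and the simplified objective actually used for training (predicting the added noise), look like this:

```latex
% Variational bound (negative ELBO) in the DDPM formulation
\mathbb{E}_q\!\Big[
  \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
  + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
  - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}
\Big]

% Simplified training objective: predict the noise that was added
L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}
  \Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
```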
Scalability of the Diffusion Process
The multi-timestep design makes it easy to scale up the model by stacking more timesteps without increasing the parameter count, which also makes the noise predictor easier to learn. According to the experimental results in DDIM, the reconstruction error decreases as more diffusion timesteps are stacked. But this is also the biggest shortcoming of diffusion models, because more diffusion timesteps require more computation time.
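To make the compute/quality trade-off concrete, here is a sketch of DDIM-style deterministic sampling over a configurable subsequence of timesteps. The same trained noise predictor is reused; only the number of steps changes. The schedule and step counts are illustrative assumptions, not the paper's exact settings:

```python
import torch

T = 1000
alpha_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

@torch.no_grad()
def ddim_sample(noise_predictor, shape, num_steps=50):
    """Deterministic DDIM-style sampling on a subsequence of the 1000 training timesteps.
    Fewer steps means less compute but usually a worse reconstruction."""
    steps = torch.linspace(T - 1, 0, num_steps).long()       # e.g. keep 50 of the 1000 steps
    x = torch.randn(shape)
    for i, t in enumerate(steps):
        eps_hat = noise_predictor(x, torch.full((shape[0],), t))
        # Estimate the clean image implied by the current noise prediction.
        x0_hat = (x - (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha_bar[t].sqrt()
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        # Jump directly to the previous kept timestep.
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps_hat
    return x
```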
Easy to Expand
It is easy to extend the diffusion model with other network structures, for example by combining it with a GAN or a VAE, or by working in a latent space as Stable Diffusion does. Incidentally, this kind of work is usually still categorized as a diffusion model, which makes it harder for the other classic methods to attract people's attention.
After Understanding the Algorithm
The diffusion model is an AIGC technique that is more stable and more elegant than other methods. After appreciating the breakthroughs in mathematics and machine learning, we also need to realize that the backbone of all these algorithms is simply turning random noise into a beautiful image. It is data-driven and still far, far away from human creativity.
If you have tried some AIGC services, you will find they are not as stable as you thought, and you might have to try more than 10 times to get a single acceptable result. There are also many variables that affect your result, such as the random seed, the choice of text prompt, and the algorithm parameters.
Try it yourself with the Stable Diffusion 2 web demo:
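If you prefer running it locally instead of the web demo, here is a minimal sketch using the Hugging Face diffusers library, assuming diffusers is installed and a GPU is available. The prompt, seed, step count, and guidance scale are just one illustrative choice:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2.1 weights from the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The seed, prompt wording, step count, and guidance scale all change the result,
# so expect to try several combinations before getting an acceptable image.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    "an astronaut riding a horse on the moon, digital art",
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("result.png")
```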
Finally, if you are an artist or designer, please remember that AI is your friend, not your enemy. You can push these AIGC tools the way your boss pushes you: feed the AI ridiculous, incredible ideas, like something colorful and shining yet black like a rainbow, release your imagination, and maximize your creativity.
If I have seen further, it is by standing on the shoulders of giants. — Isaac Newton