Can AI Replace Designers? What Kind of Black Magic Is the Diffusion Model?

A walk through the algorithmic evolution of the image generation software DALL-E, covering the principles behind VAE, Diffusion Model, and Zero-Shot Text-to-Image

14 min read, Nov 12, 2022

2022 was a landmark year for image-generation AI. Three landmark Text-to-Image services went live this year: DALL-E, MidJourney, and Stable Diffusion, and each achieved notable results:

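OpenAI's image generation service DALL-E reached a million registered users after launch
An artist generated a work with MidJourney and won first place in the digital art category of the Colorado State Fair fine arts competition in the US
The image-generation AI Stable Diffusion was publicly released in August 2022, and the startup behind it, StabilityAI, reached a valuation of 1 billion US dollars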

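Text-to-Image lets a user type in any text description, and the machine automatically generates AI images that match it. The learning cost for users is very low, and it frees the imagination of ordinary people.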

Jason Allen / Midjourney, via Discord. Image source: AI artwork wins art competition and artists are upset

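Interestingly, although these AI systems are commercial products, most of the algorithms and code behind them are open source. This article follows the technical evolution of OpenAI's image generation service DALL-E to explain how the Text-to-Image architecture has developed, and also points to source code.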

DALL-E

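DALL-E is an image-generation algorithm published by OpenAI. Its first version builds on a discrete VAE in the VQ-VAE-2 line of work, and version 2 switched to a Diffusion Model design. The sections below start from VAE and work up step by step to the architecture DALL-E uses today: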

Blog Post Explained- Creating Images from Text using DALL·E

Introduction & Overview

medium.com

VAE: Variational Autoencoder

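An earlier article introduced the method and principles of VAE. It also noted that encoding images with a VAE is especially hard: because images are so high-dimensional, training either fails to converge or overfits.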

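The root cause is that the latent space a VAE encodes into is continuous; for example, the encoder output might be a 256-D vector of continuous floating-point values. When a powerful decoder is attached, the decoder can easily learn to reproduce the images on its own while ignoring the information carried by the latent variable. This problem is called Posterior Collapse.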

To avoid this encoding problem of VAE, DALL-E adopts a dVAE (Discrete Variational Autoencoder) as its main design.

VQ-VAE: Vector Quantised-Variational Autoencoder

VQ-VAE is one kind of dVAE. Unlike a VAE, whose codes are continuous, VQ-VAE produces discrete codes. The discretization works as follows:

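Build an encoder as in a conventional VAE, and let its output code be z with dimension D
Use an embedding space of size K × D as the codebook, where K is the number of codebook entries
Replace z with the embedding vector in the codebook that is closest to it, which completes the discrete encoding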
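
The three steps above can be sketched as a nearest-neighbour lookup. This is only a minimal illustration: the function name `quantize`, the toy codebook values, and the encoder output `z` are all made up for the example, and a real VQ-VAE learns the codebook jointly with the encoder and decoder.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z (Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return index, codebook[index]

# Toy codebook with K = 4 entries of dimension D = 2 (values are made up).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [0.9, 0.2]  # hypothetical continuous encoder output
index, z_q = quantize(z, codebook)
print(index, z_q)  # the discrete code and the vector passed on to the decoder
```

Only the integer `index` needs to be stored, which is what makes the code discrete. The lookup itself has no gradient, so training needs an extra trick to pass gradients back to the encoder; that part is left out of this sketch.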

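According to the paper, the codebook design of VQ-VAE avoids posterior collapse and learns more important feature representations than VAE. In addition, VQ-VAE's learned codebook replaces the hand-picked assumption that VAE makes about the probability distribution of the latent variable, which reduces the amount of manual tuning.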

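When the first DALL-E was published, its design followed the VQ-VAE-2 line of work. VQ-VAE-2 keeps the main architecture of VQ-VAE but adds a multi-level structure (multi-scale hierarchical VQ-VAE), which lets it encode and generate higher-resolution images.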

DALL-E: Zero-Shot Text-to-Image

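This is the main paper for the first DALL-E. In essence it can be viewed as a Conditional VQ-VAE whose condition is the text prompt typed by the user. In the paper, the prompt is encoded as BPE tokens and modeled together with the discrete image tokens by an autoregressive Transformer, and CLIP is used to rerank the generated candidates. Readers unfamiliar with Conditional VAE can refer to the earlier article "Principles and Techniques of Semi-Supervised Learning with the Generative Network VAE in Practice"; for how CLIP works and how to use it, see my earlier article "Build a Live Demo That Recognizes Anything in 10 Minutes".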

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed…

arxiv.org

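Even by the standards of its time, the overall architecture of this paper is not especially novel. But the paper offers a number of practical methods and tricks that make training more stable, and the model did generate strikingly good images.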

DALL-E 2: Diffusion Model

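The DALL-E 2 paper, "Hierarchical Text-Conditional Image Generation with CLIP Latents", came out in 2022, only a year after the previous version, yet the overall architecture had changed a great deal. The biggest change is the move from VAE to the Diffusion Model.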

Differences between VAE (row 2) and Diffusion Model (row 4). In the figure, z always denotes the latent space and x the image. Image source: What are Diffusion Models?

Hierarchical Text-Conditional Image Generation with CLIP Latents

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and…

arxiv.org

Diffusion Model Basics

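Diffusion Models began to shine in image generation with the 2020 paper "Denoising Diffusion Probabilistic Models (DDPM)". A Diffusion Model treats image generation as a denoising task. When a denoising model is asked to restore an image to very high quality and the noise is severe, changing even a single pixel of the input can lead to a very different output.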

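The Diffusion Model builds on exactly this idea. Like GAN, it treats random numbers as the machine's source of creativity, then tries to gradually decode all kinds of high-quality images from pure white-noise images.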

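The figure below shows the generation process of a Diffusion Model. Step 1 starts from pure random noise, the noise is removed little by little, and at Step 50 what remains is the high-quality image that the machine believes was hidden behind the noise.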

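The Diffusion Model architecture looks simple and elegant, and generation quality can be controlled by adjusting the overall depth, that is, the number of denoising steps. During training, noise is added to the data by hand and a noise predictor is trained to predict the noise that was added, which is quite intuitive.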
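
The "add noise by hand" step can be sketched with the closed-form noising used in DDPM, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule (1e-4 to 0.02 over 1000 steps) follows the DDPM paper; the function names and the three-value toy "image" are assumptions made for this illustration only.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps in one shot."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for the noise predictor

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [0.5, -0.25, 1.0]  # a toy 3-"pixel" image
x_t, eps = add_noise(x0, 999, alpha_bars, rng)
print(alpha_bars[0], alpha_bars[-1])  # near 1 at the first step, near 0 at the last
```

A training step would feed `x_t` and `t` to the noise predictor and minimize the squared error between its output and `eps`. At the last timestep almost nothing of `x0` is left, which is why generation can start from pure noise.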

A few key points for understanding the Diffusion Model:

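Training a Diffusion Model involves a noising phase and a restoring phase, each made up of many timesteps. The whole noising-plus-restoring procedure is called the diffusion process.
Image generation uses only the restoring phase. This generation process is called denoising, and in the mathematical derivation it is also called reverse diffusion.
The noise predictor shares its parameters across all timesteps but takes the timestep as an extra input, so it behaves slightly differently at each timestep.
A UNet is the usual choice for the noise predictor.
Denoising does not have to happen in image space. Stable Diffusion, shown in the figure below, reduces the dimensionality and denoises in latent space with very good results.
Other features can be fed in to influence the denoising result and so control the content. This technique is called guidance or guided diffusion.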
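
The restoring phase can be sketched as a loop that calls one shared noise predictor with the current sample and the timestep. The update rule below is the DDPM ancestral sampling step with variance β_t. Everything else is an assumption for illustration: the 50-step linear schedule, the function names, and above all the `oracle` predictor, which stands in for a trained UNet by cheating and looking at a known target.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products abar_t."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def sample(predict_noise, dim, betas, alpha_bars, rng):
    """Start from random noise and denoise one timestep at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps_hat = predict_noise(x, t)  # same predictor every step, different t
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - coef * ei) for xi, ei in zip(x, eps_hat)]
        if t > 0:  # no fresh noise is added on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x

# Stand-in for a trained UNet: an oracle that already knows the target image.
target = [0.5, -0.25, 1.0]
betas, alpha_bars = make_schedule(50)

def oracle(x_t, t):
    a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [(xi - a * x0) / b for xi, x0 in zip(x_t, target)]

result = sample(oracle, len(target), betas, alpha_bars, random.Random(0))
print(result)  # lands on the target because the oracle's noise estimate is exact
```

With a learned predictor there is no known target, and the output depends on the starting noise and on any guidance fed into the predictor.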

Stable Diffusion denoises in latent space (z) and feeds in additional semantic features as guidance, which influence the behavior of the UNet through cross-attention (QKV) to control the generated result. Image source: Stable Diffusion Paper
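
In the Stable Diffusion paper the QKV block is a cross-attention layer: the queries come from the UNet's latent features, and the keys and values come from the encoded condition such as text. The sketch below shows only the attention arithmetic, softmax(QKᵀ/√d)·V, on toy numbers. The learned projection matrices and multiple heads of a real model are omitted, and all names and values are made up for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V with Q from image latents and K, V from text."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two latent positions attend over three text-token features (toy numbers).
latent_queries = [[10.0, 0.0], [0.0, 0.0]]
text_keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
text_values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
attended = cross_attention(latent_queries, text_keys, text_values)
print(attended)
```

The first latent position matches the first text token strongly and copies almost all of that token's value, while the second position has no preference and averages the three values. This is how the text can steer different image regions differently.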

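For readers who want to learn more, the two articles below are recommended. The first focuses on illustrations and the second on mathematical derivation; read together, they give a fuller picture of the Diffusion Model.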

The Illustrated Stable Diffusion

AI image generation is the most recent AI capability blowing people's minds (mine included). The ability to create…

jalammar.github.io

What are Diffusion Models?

Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of…

lilianweng.github.io

DALL-E 2

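Back to the DALL-E 2 paper: the figure below is the overall architecture diagram from the paper. Its decoder, the generator, is already a Diffusion Model. Beyond that, the paper differs from a standard Diffusion Model in two components: the prior and the decoder.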

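The prior is a mapping that estimates an image feature (Image Embedding) from a text feature (Text Embedding). The paper offers two options. The first is the Autoregressive Prior (AR prior), which simply uses a Transformer as the mapping. The second is the Diffusion Prior, which uses the text embedding extracted by CLIP as guidance to steer a randomly initialized image embedding toward the result.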

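The decoder is a diffusion model that uses the image embedding output by the prior as guidance, steering a random-noise image step by step into a high-quality picture. According to the paper, this part largely follows the GLIDE model design, with extra tricks such as randomly dropping the conditioning during training to improve results.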
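
The random-dropping trick is what enables classifier-free guidance: because the model sometimes trains without the condition, it can produce both a conditional and an unconditional noise estimate at sampling time, and the two are blended. The sketch below is a minimal illustration. The function names, the drop probability, the guidance scale, and the example numbers are assumptions, not the values used in the paper.

```python
import random

def drop_condition(cond, null_cond, drop_prob, rng):
    """Training-time trick: sometimes replace the condition with a null one."""
    return null_cond if rng.random() < drop_prob else cond

def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Sampling-time mix: move the estimate away from the unconditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [0.2, -0.4]    # hypothetical predictor output given the embedding
eps_uncond = [0.1, -0.1]  # hypothetical predictor output given the null condition
print(guided_noise(eps_cond, eps_uncond, 1.0))  # scale 1 keeps the conditional estimate
print(guided_noise(eps_cond, eps_uncond, 3.0))  # larger scale exaggerates the condition
```

A scale above 1 typically trades sample diversity for closer agreement with the condition.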

Interested readers can find source code directly on GitHub:

GitHub - lucidrains/DALLE2-pytorch: Implementation of DALL-E 2, OpenAI's updated text-to-image…

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch. Yannic Kilcher summary…

github.com

Why Diffusion Model?

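Judging from the evolution of DALL-E and other mainstream software, the dominant technique in AI generation has been gradually shifting from VAE and GAN to the Diffusion Model. The 2021 paper "Diffusion Models Beat GANs on Image Synthesis" even declares the Diffusion Model's victory right in its title. After reading through some of the arguments, I have summarized a few reasons why the Diffusion Model attracts more attention than GAN and VAE.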

Mathematical Foundations

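The mathematical theory of the Diffusion Model is similar to that of VAE: both work with a lower bound on the log-likelihood of the data (the ELBO, Evidence Lower Bound), and training searches for the parameters that raise this bound. The Diffusion Model even drops the dimensionality-reducing compression step of VAE, which further lowers the learning difficulty.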
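
For reference, the two bounds can be written side by side. The notation is the standard one from the VAE and DDPM literature rather than from this article: q is the encoder or noising distribution, p_θ the generative model, and ε_θ the noise predictor. The last line is the simplified training objective that DDPM derives from its bound.

```latex
% VAE: evidence lower bound with a learned encoder q_\phi(z \mid x)
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

% Diffusion Model: the same kind of bound, with the fixed noising process q
% in place of the encoder and x_1, \dots, x_T as the latent variables
\log p_\theta(x_0) \;\ge\;
  \mathbb{E}_{q(x_{1:T} \mid x_0)}
  \left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]

% DDPM simplified objective: predict the noise that was added
L_{\mathrm{simple}} =
  \mathbb{E}_{t,\,x_0,\,\epsilon}
  \Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0
  + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]
```

Each latent x_t has the same dimensionality as the image x_0, which is the sense in which the compression step of VAE is absent.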

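GAN, by contrast, has quite a few drawbacks. The most criticized is that it is hard to train successfully, because training has to reach the Nash Equilibrium of the game between the Generator and the Discriminator. In addition, the absolute value of the Adversarial Loss cannot be used to judge the quality of the generated images, which makes GAN results even harder to control.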

Diffusion Process

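The multi-timestep design of the Diffusion Process effectively adds network depth without adding model parameters, which makes the prediction at each noise predictor step easier. According to the DDIM experiments, increasing the number of timesteps in the Diffusion Process yields better reconstruction results. This is also the biggest weakness of the Diffusion Model, because computation time grows in proportion to the number of timesteps.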

Good Extensibility

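The Diffusion Process can be plugged into other network structures. For example, one can run a low-dimensional Diffusion Process in latent space and then attach the generator network of a GAN or VAE, as Stable Diffusion does. Papers that bring in a Diffusion Process this way are usually classified as Diffusion Models, however, so GAN and VAE methods now get less exposure in AI generation than before.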

After Understanding the Algorithms

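The Diffusion Model is a more stable and more elegant paradigm for AI generation. While AI generation is in the spotlight, beyond following the technical breakthroughs behind it, keep in mind that at bottom all these algorithms do is turn random variables into a polished image. Behind them is still a data-driven mathematical model, which cannot replace real creativity.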

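Anyone who has actually played with these tools will find that AI-generated results are still fairly unstable; it can take dozens of attempts to get a good one. There are many reasons for this, such as the random noise, the choice of text prompt, and the algorithm parameters.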

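For those who want to try it hands-on, my personal recommendation is MidJourney. It is deployed as a Discord chatbot, so anyone can use it from a phone: no GPU, no code to download, and no programming knowledge needed. The only barrier may be that you need a little English. For a step-by-step guide, see the tutorial video by my talented friend Yu-Han Wu (Rainnie):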

MidJourney tutorial, entirely in Chinese (YouTube)

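Finally, if you are a designer, remember that AI is not your enemy but your ally. You can use these AI generation tools to have the computer propose all sorts of outlandish design ideas, such as a multicolored black or a phone case that gets bigger and smaller at the same time, to spark different ideas. Then, standing on the shoulders of AI, you create a one-of-a-kind work of your own.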

Written by Rice Yang
