https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html

In this post, we are looking into the third type of generative models: flow-based generative models. Different from GAN and VAE, they explicitly learn the probability density function of the input data.

So far, I’ve written about two types of generative models, GAN and VAE. Neither of them explicitly learns the probability density function of real data, p(\mathbf{x}) (where \mathbf{x} \in \mathcal{D}), because it is really hard! Taking the generative model with latent variables as an example, p(\mathbf{x}) = \int p(\mathbf{x}\vert\mathbf{z})p(\mathbf{z})d\mathbf{z} can hardly be calculated as it is intractable to go through all possible values of the latent code \mathbf{z}.
Flow-based deep generative models conquer this hard problem with the help of normalizing flows, a powerful statistics tool for density estimation. A good estimation of p(\mathbf{x}) makes it possible to efficiently complete many downstream tasks: sample unobserved but realistic new data points (data generation), predict the rareness of future events (density estimation), infer latent variables, fill in incomplete data samples, etc.

  1. Generative adversarial networks: GAN provides a smart solution to model the data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples that are produced by the generator model. Two models are trained as they are playing a minimax game.
  2. Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
  3. Flow-based generative models: A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution p(\mathbf{x}) and therefore the loss function is simply the negative log-likelihood.

Fig. 1. Comparison of three categories of generative models.

Linear Algebra Basics Recap

We should understand two key concepts before getting into the flow-based generative model: the Jacobian determinant and the change of variable rule. Pretty basic, so feel free to skip.

Jacobian Matrix and Determinant

Given a function mapping an n-dimensional input vector \mathbf{x} to an m-dimensional output vector, \mathbf{f}: \mathbb{R}^n \mapsto \mathbb{R}^m, the matrix of all first-order partial derivatives of this function is called the Jacobian matrix, \mathbf{J}, where the entry on the i-th row and j-th column is \mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}.

\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
The determinant is one real number computed as a function of all the elements in a square matrix. Note that the determinant only exists for square matrices. The absolute value of the determinant can be thought of as a measure of “how much multiplication by the matrix expands or contracts space”.
The determinant of an n \times n matrix M is:
\det M = \det \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} = \sum_{j_1 j_2 \dots j_n} (-1)^{\tau(j_1 j_2 \dots j_n)} a_{1j_1} a_{2j_2} \dots a_{nj_n}
where the subscripts under the summation, j_1 j_2 \dots j_n, are all permutations of the set {1, 2, …, n}, so there are n! items in total; \tau(.) indicates the signature of a permutation.
The determinant of a square matrix M detects whether it is invertible: if \det(M) = 0 then M is not invertible (a singular matrix with linearly dependent rows or columns; or any row or column is all 0); otherwise, if \det(M) \neq 0, M is invertible.
The determinant of the product is equivalent to the product of the determinants:
\det(AB) = \det(A)\det(B)
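The determinant properties above are easy to check numerically. The following is a minimal NumPy sketch; the matrices and the variable names (`A`, `B`, `S`, `area_scale`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))

# det(AB) equals det(A) * det(B) up to floating-point error.
lhs = np.linalg.det(A @ B)
rhs = np.linalg.det(A) * np.linalg.det(B)

# A matrix with linearly dependent rows is singular: its determinant is 0.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
det_S = np.linalg.det(S)

# |det| measures how much the matrix scales volume:
# diag(2, 3) stretches the unit square into a 2 x 3 rectangle.
area_scale = abs(np.linalg.det(np.diag([2.0, 3.0])))
```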

Change of Variable Theorem

Let’s review the change of variable theorem specifically in the context of probability density estimation, starting with a single variable case.
Given a random variable z and its known probability density function z \sim \pi(z), we would like to construct a new random variable using a 1-1 mapping function x = f(z). The function f is invertible, so z = f^{-1}(x). Now the question is how to infer the unknown probability density function of the new variable, p(x)?
\begin{aligned} & \int p(x)dx = \int \pi(z)dz = 1 \scriptstyle{\text{ ; Definition of probability distribution.}} \\ & p(x) = \pi(z) \left\vert\frac{dz}{dx}\right\vert = \pi(f^{-1}(x)) \left\vert\frac{d f^{-1}}{dx}\right\vert = \pi(f^{-1}(x)) \vert (f^{-1})'(x) \vert \end{aligned}
By definition, the integral \int \pi(z)dz is the sum of an infinite number of rectangles of infinitesimal width \Delta z. The height of such a rectangle at position z is the value of the density function \pi(z). When we substitute the variable, z = f^{-1}(x) yields \Delta z = (f^{-1})'(x) \Delta x. Here \vert (f^{-1})'(x) \vert indicates the ratio between the areas of the rectangles defined in the two coordinate systems of variables z and x respectively.
The multivariable version has a similar format:
\begin{aligned} \mathbf{z} &\sim \pi(\mathbf{z}), \mathbf{x} = f(\mathbf{z}), \mathbf{z} = f^{-1}(\mathbf{x}) \\ p(\mathbf{x}) &= \pi(\mathbf{z}) \left\vert \det \dfrac{d \mathbf{z}}{d \mathbf{x}} \right\vert = \pi(f^{-1}(\mathbf{x})) \left\vert \det \dfrac{d f^{-1}}{d \mathbf{x}} \right\vert \end{aligned}
where \det \frac{d f^{-1}}{d \mathbf{x}} is the Jacobian determinant of the inverse function f^{-1}. The full proof of the multivariate version is out of the scope of this post; ask Google if interested ;)
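As a quick sanity check of the single-variable rule, here is a minimal NumPy sketch. It assumes a standard normal base density and the mapping x = f(z) = exp(z); both are illustrative choices, not something fixed by the theorem. If the rule holds, the transformed density should integrate to roughly 1.

```python
import numpy as np

def base_pdf(z):
    """Standard normal density pi(z) (illustrative choice of base density)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def f_inv(x):
    """Inverse of the illustrative mapping x = f(z) = exp(z)."""
    return np.log(x)

def f_inv_grad(x):
    """Derivative of f^{-1}(x) = log(x)."""
    return 1.0 / x

def transformed_pdf(x):
    """p(x) = pi(f^{-1}(x)) * |d f^{-1} / dx|."""
    return base_pdf(f_inv(x)) * np.abs(f_inv_grad(x))

# Integrate p(x) over x > 0 with the trapezoidal rule.
x = np.linspace(1e-6, 200.0, 200_001)
p = transformed_pdf(x)
total_mass = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(x))
```

Dropping the `np.abs(f_inv_grad(x))` factor would leave a function that no longer integrates to 1, which is exactly the correction the theorem supplies.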

What are Normalizing Flows?

Being able to do good density estimation has direct applications in many machine learning problems, but it is very hard. For example, since we need to run backward propagation in deep learning models, the embedded probability distribution (i.e. the posterior p(\mathbf{z}\vert\mathbf{x})) is expected to be simple enough to calculate the derivative easily and efficiently. That is why the Gaussian distribution is often used in latent variable generative models, even though most real-world distributions are much more complicated than a Gaussian.
Here comes a Normalizing Flow (NF) model for better and more powerful distribution approximation. A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a chain of transformations, we repeatedly substitute the variable for the new one according to the change of variables theorem and eventually obtain a probability distribution of the final target variable.
Fig. 2. Illustration of a normalizing flow model, transforming a simple distribution p_0(\mathbf{z}_0) to a complex one p_K(\mathbf{z}_K) step by step.
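As defined in Fig. 2,
\begin{aligned} \mathbf{z}_{i-1} &\sim p_{i-1}(\mathbf{z}_{i-1}) \\ \mathbf{z}_i &= f_i(\mathbf{z}_{i-1})\text{, thus }\mathbf{z}_{i-1} = f_i^{-1}(\mathbf{z}_i) \\ p_i(\mathbf{z}_i) &= p_{i-1}(f_i^{-1}(\mathbf{z}_i)) \left\vert \det\dfrac{d f_i^{-1}}{d \mathbf{z}_i} \right\vert \end{aligned}
Then let’s rewrite the equation in terms of the forward function f_i and the previous variable \mathbf{z}_{i-1}, so that we can do inference with the base distribution.
\begin{aligned} p_i(\mathbf{z}_i) &= p_{i-1}(f_i^{-1}(\mathbf{z}_i)) \left\vert \det\dfrac{d f_i^{-1}}{d \mathbf{z}_i} \right\vert \\ &= p_{i-1}(\mathbf{z}_{i-1}) \left\vert \det \Big(\dfrac{d f_i}{d\mathbf{z}_{i-1}}\Big)^{-1} \right\vert & \scriptstyle{\text{; According to the inverse function theorem.}} \\ &= p_{i-1}(\mathbf{z}_{i-1}) \left\vert \det \dfrac{d f_i}{d\mathbf{z}_{i-1}} \right\vert^{-1} & \scriptstyle{\text{; According to a property of Jacobians of invertible functions.}} \\ \log p_i(\mathbf{z}_i) &= \log p_{i-1}(\mathbf{z}_{i-1}) - \log \left\vert \det \dfrac{d f_i}{d\mathbf{z}_{i-1}} \right\vert \end{aligned}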
A note on the “inverse function theorem”: If y = f(x) and x = f^{-1}(y), we have:
\dfrac{df^{-1}(y)}{dy} = \dfrac{dx}{dy} = \Big(\dfrac{dy}{dx}\Big)^{-1} = \Big(\dfrac{df(x)}{dx}\Big)^{-1}
A note on “Jacobians of invertible function”: The determinant of the inverse of an invertible matrix is the inverse of the determinant: \det(M^{-1}) = (\det(M))^{-1}, because \det(M)\det(M^{-1}) = \det(M \cdot M^{-1}) = \det(I) = 1.
Given such a chain of probability density functions, we know the relationship between each pair of consecutive variables. We can expand the equation of the output \mathbf{x} step by step until tracing back to the initial distribution p_0(\mathbf{z}_0).
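\begin{aligned} \mathbf{x} = \mathbf{z}_K &= f_K \circ f_{K-1} \circ \dots \circ f_1 (\mathbf{z}_0) \\ \log p(\mathbf{x}) = \log p_K(\mathbf{z}_K) &= \log p_{K-1}(\mathbf{z}_{K-1}) - \log\left\vert\det\dfrac{d f_K}{d \mathbf{z}_{K-1}}\right\vert \\ &= \dots \\ &= \log p_0(\mathbf{z}_0) - \sum_{i=1}^K \log\left\vert\det\dfrac{d f_i}{d\mathbf{z}_{i-1}}\right\vert \end{aligned}
The path traversed by the random variables \mathbf{z}_i = f_i(\mathbf{z}_{i-1}) is the flow and the full chain formed by the successive distributions p_i is called a normalizing flow. Required by the computation in the equation, a transformation function f_i should satisfy two properties: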

  1. It is easily invertible.
  2. Its Jacobian determinant is easy to compute.

Models with Normalizing Flows

With normalizing flows in our toolbox, the exact log-likelihood of input data, \log p(\mathbf{x}), becomes tractable. As a result, the training criterion of a flow-based generative model is simply the negative log-likelihood (NLL) over the training dataset \mathcal{D}:
\mathcal{L}(\mathcal{D}) = - \frac{1}{\vert\mathcal{D}\vert}\sum_{\mathbf{x} \in \mathcal{D}} \log p(\mathbf{x})
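The log-likelihood of a chain of flows and the NLL criterion can be sketched in a few lines of NumPy. This is only a toy under stated assumptions: the base distribution is a standard normal, and each flow is a hypothetical `ElementwiseAffine` layer (z_i = scale * z_{i-1} + shift), chosen because both its inverse and its Jacobian determinant are trivial. The two layers and the three data points are made up.

```python
import numpy as np

def standard_normal_logpdf(z):
    """log p_0(z) for a standard normal base distribution."""
    return -0.5 * np.sum(z**2 + np.log(2.0 * np.pi), axis=-1)

class ElementwiseAffine:
    """Toy invertible layer z_i = scale * z_{i-1} + shift (hypothetical helper)."""

    def __init__(self, scale, shift):
        self.scale = np.asarray(scale, dtype=float)
        self.shift = np.asarray(shift, dtype=float)

    def inverse(self, z_next):
        return (z_next - self.shift) / self.scale

    def log_abs_det(self):
        # The Jacobian is diag(scale), so log|det| = sum(log|scale|).
        return np.sum(np.log(np.abs(self.scale)))

def flow_log_prob(x, flows):
    """log p(x) = log p_0(z_0) - sum_i log|det d f_i / d z_{i-1}|."""
    z = x
    total_log_det = 0.0
    for flow in reversed(flows):  # undo f_K first, then f_{K-1}, ...
        z = flow.inverse(z)
        total_log_det += flow.log_abs_det()
    return standard_normal_logpdf(z) - total_log_det

flows = [ElementwiseAffine([2.0, 0.5], [1.0, -1.0]),
         ElementwiseAffine([3.0, 4.0], [0.0, 2.0])]
dataset = np.array([[0.3, -1.2], [2.0, 0.5], [4.1, -3.0]])

log_px = flow_log_prob(dataset, flows)  # one value per data point
nll = -np.mean(log_px)                  # the training criterion L(D)
```

Because every layer here is affine, the composed flow is just a Gaussian with mean (3, -2) and standard deviation (6, 2), which gives a closed form to compare against.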

RealNVP

The RealNVP (Real-valued Non-Volume Preserving; Dinh et al., 2017) model implements a normalizing flow by stacking a sequence of invertible bijective transformation functions. In each bijection f: \mathbf{x} \mapsto \mathbf{y}, known as an affine coupling layer, the input dimensions are split into two parts:

  • The first d dimensions stay the same;
  • The second part, dimensions d+1 to D, undergoes an affine transformation (“scale-and-shift”) and both the scale and shift parameters are functions of the first d dimensions.

\begin{aligned} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp(s(\mathbf{x}_{1:d})) + t(\mathbf{x}_{1:d}) \end{aligned}
where s(.) and t(.) are scale and translation functions and both map \mathbb{R}^d \mapsto \mathbb{R}^{D-d}. The \odot operation is the element-wise product.
Now let’s check whether this transformation satisfies the two basic properties of a flow transformation.
Condition 1: “It is easily invertible.”
Yes and it is fairly straightforward.
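\begin{cases} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp(s(\mathbf{x}_{1:d})) + t(\mathbf{x}_{1:d}) \end{cases} \Leftrightarrow \begin{cases} \mathbf{x}_{1:d} &= \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} &= (\mathbf{y}_{d+1:D} - t(\mathbf{y}_{1:d})) \odot \exp(-s(\mathbf{y}_{1:d})) \end{cases}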
Condition 2: “Its Jacobian determinant is easy to compute.”
Yes. It is not hard to get the Jacobian matrix and determinant of this transformation. The Jacobian is a lower triangular matrix.
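\mathbf{J} = \begin{bmatrix} \mathbb{I}_d & \mathbf{0}_{d\times(D-d)} \\ \frac{\partial \mathbf{y}_{d+1:D}}{\partial \mathbf{x}_{1:d}} & \text{diag}(\exp(s(\mathbf{x}_{1:d}))) \end{bmatrix}
Hence the determinant is simply the product of terms on the diagonal.
\det(\mathbf{J}) = \prod_{j=1}^{D-d}\exp(s(\mathbf{x}_{1:d}))_j = \exp\Big(\sum_{j=1}^{D-d} s(\mathbf{x}_{1:d})_j\Big)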
So far, the affine coupling layer looks perfect for constructing a normalizing flow :)
Even better, since (i) computing f^{-1} does not require computing the inverse of s or t and (ii) computing the Jacobian determinant does not involve computing the Jacobian of s or t, those functions can be arbitrarily complex; i.e. both s and t can be modeled by deep neural networks.
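Both conditions can be verified on a toy affine coupling layer. In this minimal NumPy sketch, `s` and `t` are hypothetical stand-ins for the scale and translation networks (fixed random linear maps, with a tanh on the scale), and `numerical_jacobian` is a helper written only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2

# Hypothetical stand-ins for the networks s(.) and t(.): R^d -> R^(D-d).
W_s = rng.normal(size=(d, D - d))
W_t = rng.normal(size=(d, D - d))

def s(x1):
    return np.tanh(x1 @ W_s)

def t(x1):
    return x1 @ W_t

def coupling_forward(x):
    """y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    x1, x2 = x[:d], x[d:]
    scale = s(x1)
    y2 = x2 * np.exp(scale) + t(x1)
    log_det = np.sum(scale)  # log of the product of the diagonal terms
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    """Inverts the layer without ever inverting s or t."""
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])

def numerical_jacobian(fn, x, eps=1e-6):
    """Central-difference Jacobian, used here only to verify the formulas."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return J

x = rng.normal(size=D)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
J = numerical_jacobian(lambda v: coupling_forward(v)[0], x)
```

The numerical Jacobian `J` comes out lower triangular, and the log of its absolute determinant matches `log_det`, which was computed from `s` alone.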
In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure all the inputs have a chance to be altered, the model reverses the ordering in each layer so that different components are left unchanged. Following such an alternating pattern, the set of units which remain identical in one transformation layer are always modified in the next. Batch normalization is found to help training models with a very deep stack of coupling layers.
Furthermore, RealNVP can work in a multi-scale architecture to build a more efficient model for large inputs. The multi-scale architecture applies several “sampling” operations to normal affine layers, including spatial checkerboard pattern masking, squeezing operation, and channel-wise masking. Read the paper for more details on the multi-scale architecture.

NICE

The NICE (Non-linear Independent Component Estimation; Dinh, et al. 2015) model is a predecessor of RealNVP. The transformation in NICE is the affine coupling layer without the scale term, known as additive coupling layer.
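\begin{cases} \mathbf{y}_{1:d} &= \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} + m(\mathbf{x}_{1:d}) \end{cases} \Leftrightarrow \begin{cases} \mathbf{x}_{1:d} &= \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} &= \mathbf{y}_{d+1:D} - m(\mathbf{y}_{1:d}) \end{cases}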

Glow

The Glow (Kingma and Dhariwal, 2018) model extends the previous reversible generative models, NICE and RealNVP, and simplifies the architecture by replacing the reverse permutation operation on the channel ordering with invertible 1x1 convolutions.
Fig. 3. One step of flow in the Glow model. (Image source: Kingma and Dhariwal, 2018)
There are three substeps in one step of flow in Glow.
Substep 1: Activation normalization (short for “actnorm”)
It performs an affine transformation using a scale and bias parameter per channel, similar to batch normalization, but works for mini-batch size 1. The parameters are trainable but initialized so that the first minibatch of data have mean 0 and standard deviation 1 after actnorm.
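A minimal NumPy sketch of this data-dependent initialization follows. The function names `actnorm_init` and `actnorm_forward`, the channels-last layout (N, H, W, C), and the `eps` constant are assumptions made for illustration. Because the same per-channel scale is applied at every spatial position, the log-determinant for one sample is H * W * sum(log|scale|).

```python
import numpy as np

def actnorm_init(batch, eps=1e-6):
    """Choose per-channel scale and bias so this batch has mean 0, std 1 afterwards."""
    mean = batch.mean(axis=(0, 1, 2))
    std = batch.std(axis=(0, 1, 2))
    scale = 1.0 / (std + eps)
    bias = -mean * scale
    return scale, bias

def actnorm_forward(h, scale, bias):
    """Per-channel affine map y = h * scale + bias, plus its log-determinant."""
    y = h * scale + bias
    height, width = h.shape[-3], h.shape[-2]
    log_det = height * width * np.sum(np.log(np.abs(scale)))
    return y, log_det

rng = np.random.default_rng(0)
first_batch = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 4, 3))
scale, bias = actnorm_init(first_batch)
y, log_det = actnorm_forward(first_batch, scale, bias)
```

After initialization, `scale` and `bias` would be treated as ordinary trainable parameters, so the layer no longer depends on batch statistics.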
Substep 2: Invertible 1x1 conv
Between layers of the RealNVP flow, the ordering of channels is reversed so that all the data dimensions have a chance to be altered. A 1×1 convolution with equal number of input and output channels is a generalization of any permutation of the channel ordering.
Say, we have an invertible 1x1 convolution of an input h \times w \times c tensor \mathbf{h} with a weight matrix \mathbf{W} of size c \times c. The output is an h \times w \times c tensor, labeled as f. In order to apply the change of variable rule, we need to compute the Jacobian determinant \vert \det\partial f / \partial\mathbf{h}\vert.

Both the input and output of the 1x1 convolution here can be viewed as a matrix of size h \times w. Each entry \mathbf{x}_{ij} (i=1,\dots,h, j=1,\dots,w) in \mathbf{h} is a vector of c channels and each entry is multiplied by the weight matrix \mathbf{W} to obtain the corresponding entry \mathbf{y}_{ij} in the output matrix respectively. The derivative of each entry is \partial \mathbf{x}_{ij}\mathbf{W} / \partial\mathbf{x}_{ij} = \mathbf{W} and there are h \times w such entries in total:
\log \left\vert\det \frac{\partial\texttt{conv2d}(\mathbf{h}; \mathbf{W})}{\partial\mathbf{h}}\right\vert = \log (\vert\det\mathbf{W}\vert^{h \cdot w}) = h \cdot w \cdot \log \vert\det\mathbf{W}\vert
The inverse 1x1 convolution depends on the inverse matrix \mathbf{W}^{-1}. Since the weight matrix is relatively small, the amount of computation for the matrix determinant (tf.linalg.det) and inversion (tf.linalg.inv) is still under control.
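The log-determinant formula can be checked against a brute-force Jacobian on a small tensor. This is a minimal NumPy sketch (using `np.linalg.slogdet` and `np.linalg.inv` in place of the TensorFlow calls mentioned above); the tensor sizes and the random weight matrix are made up, and the weight is simply assumed to be invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c = 4, 5, 3
x = rng.normal(size=(h, w, c))  # input tensor, channels last
W = rng.normal(size=(c, c))     # 1x1 convolution weight (assumed invertible)

def conv1x1(x, W):
    """Multiply the channel vector at every spatial position by W."""
    return x @ W

def conv1x1_inverse(y, W):
    """Undo the convolution with the inverse weight matrix."""
    return y @ np.linalg.inv(W)

y = conv1x1(x, W)
x_rec = conv1x1_inverse(y, W)

# Log-determinant from the formula: h * w * log|det W|.
log_det = h * w * np.linalg.slogdet(W)[1]

# Brute-force check: the Jacobian of the flattened map is block diagonal,
# with one copy of W (transposed) per spatial position.
full_jacobian = np.kron(np.eye(h * w), W.T)
brute_force_log_det = np.linalg.slogdet(full_jacobian)[1]
```

The brute-force Jacobian is (h·w·c) x (h·w·c), while the formula only ever touches the c x c matrix, which is why the cost stays under control.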
Substep 3: Affine coupling layer
The design is the same as in RealNVP.
Fig. 4. Three substeps in one step of flow in Glow. (Image source: Kingma and Dhariwal, 2018)

Models with Autoregressive Flows

The autoregressive constraint is a way to model sequential data, \mathbf{x} = [x_1, \dots, x_D]: each output only depends on the data observed in the past, but not on the future ones. In other words, the probability of observing x_i is conditioned on x_1, \dots, x_{i-1} and the product of these conditional probabilities gives us the probability of observing the full sequence:
p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i\vert x_1, \dots, x_{i-1}) = \prod_{i=1}^{D} p(x_i\vert x_{1:i-1})
How to model the conditional density is of your choice. It can be a univariate Gaussian with mean and standard deviation computed as a function of x_{1:i-1}, or a multilayer neural network with x_{1:i-1} as the input.
If a flow transformation in a normalizing flow is framed as an autoregressive model — each dimension in a vector variable is conditioned on the previous dimensions — this is an autoregressive flow.
This section starts with several classic autoregressive models (MADE, PixelRNN, WaveNet) and then we dive into autoregressive flow models (MAF and IAF).

MADE

MADE (Masked Autoencoder for Distribution Estimation; Germain et al., 2015) is a specially designed architecture to enforce the autoregressive property in the autoencoder efficiently. When using an autoencoder to predict the conditional probabilities, rather than feeding the autoencoder with input of different observation windows D times, MADE removes the contribution from certain hidden units by multiplying binary mask matrices so that each input dimension is reconstructed only from previous dimensions in a given ordering in a single pass.
In a multilayer fully-connected neural network, say, we have L hidden layers with weight matrices \mathbf{W}^1, \dots, \mathbf{W}^L and an output layer with weight matrix \mathbf{V}. The output \hat{\mathbf{x}} has each dimension \hat{x}_i = p(x_i\vert x_{1:i-1}).
Without any mask, the computation through layers looks like the following:
\begin{aligned} \mathbf{h}^0 &= \mathbf{x} \\ \mathbf{h}^l &= \text{activation}^l(\mathbf{W}^l\mathbf{h}^{l-1} + \mathbf{b}^l) \\ \hat{\mathbf{x}} &= \sigma(\mathbf{V}\mathbf{h}^L + \mathbf{c}) \end{aligned}
Fig. 5. Demonstration of how MADE works in a three-layer feed-forward neural network. (Image source: Germain et al., 2015)
To zero out some connections between layers, we can simply element-wise multiply every weight matrix by a binary mask matrix. Each hidden node is assigned a random “connectivity integer” between 1 and D-1; the assigned value for the k-th unit in the l-th layer is denoted by m^l_k. The binary mask matrix is determined by element-wise comparing the values of two nodes in two layers.
\begin{aligned} \mathbf{h}^l &= \text{activation}^l((\mathbf{W}^l \odot \mathbf{M}^{\mathbf{W}^l}) \mathbf{h}^{l-1} + \mathbf{b}^l) \\ \hat{\mathbf{x}} &= \sigma((\mathbf{V} \odot \mathbf{M}^{\mathbf{V}}) \mathbf{h}^L + \mathbf{c}) \\ M^{\mathbf{W}^l}_{k', k} &= \mathbf{1}_{m^l_{k'} \geq m^{l-1}_k} = \begin{cases} 1, & \text{if } m^l_{k'} \geq m^{l-1}_k \\ 0, & \text{otherwise} \end{cases} \\ M^{\mathbf{V}}_{d, k} &= \mathbf{1}_{d > m^L_k} = \begin{cases} 1, & \text{if } d > m^L_k \\ 0, & \text{otherwise} \end{cases} \end{aligned}
A unit in the current layer can only be connected to other units with equal or smaller numbers in the previous layer and this type of dependency easily propagates through the network up to the output layer, where the comparison becomes strict so that output d never sees input d itself. Once the numbers are assigned to all the units and layers, the ordering of input dimensions is fixed and the conditional probability is produced with respect to it. See a great illustration in Fig. 5. To make sure all the hidden units are connected to the input and output layers through some paths, m^l_k is sampled to be equal to or greater than the minimal connectivity integer in the previous layer, \min_{k'} m_{k'}^{l-1}.
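The mask construction can be sketched directly from these rules. This is a minimal NumPy illustration, not the reference implementation: `build_made_masks` is a hypothetical helper, input and output units are assumed to carry the natural ordering 1..D, and the layer sizes are made up. Multiplying the masks together counts the paths from each input to each output, which makes the autoregressive property easy to inspect.

```python
import numpy as np
from functools import reduce

def build_made_masks(D, hidden_sizes, rng):
    """Return one binary mask per weight matrix (hidden layers, then output)."""
    # Input units carry their own index 1..D as connectivity integer.
    degrees = [np.arange(1, D + 1)]
    for size in hidden_sizes:
        low = degrees[-1].min()
        # Sample m^l_k between the previous layer's minimum and D - 1.
        degrees.append(rng.integers(low, D, size=size))
    masks = []
    for prev, cur in zip(degrees[:-1], degrees[1:]):
        # Hidden mask: unit k' may see unit k if m^l_{k'} >= m^{l-1}_k.
        masks.append((cur[:, None] >= prev[None, :]).astype(float))
    # Output mask: output d may see hidden unit k only if d > m^L_k (strict).
    out_degrees = np.arange(1, D + 1)
    masks.append((out_degrees[:, None] > degrees[-1][None, :]).astype(float))
    return masks

rng = np.random.default_rng(0)
D = 5
masks = build_made_masks(D, hidden_sizes=[16, 16], rng=rng)

# connectivity[d, j] > 0 iff some path leads from input j to output d.
connectivity = reduce(lambda acc, mask: mask @ acc, masks)
```

`connectivity` is strictly lower triangular: output d is reachable only from inputs with a smaller index. The first output has no incoming path at all, so it can only be driven by a bias term.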
MADE training can be further facilitated by:

  • Order-agnostic training: shuffle the input dimensions, so that MADE is able to model any arbitrary ordering; this also makes it possible to create an ensemble of autoregressive models at runtime.
  • Connectivity-agnostic training: to avoid the model being tied to one specific connectivity pattern, resample m^l_k for each training minibatch.

PixelRNN

PixelRNN (Oord et al, 2016) is a deep generative model for images. The image is generated one pixel at a time and each new pixel is sampled conditional on the pixels that have been seen before.
Let’s consider an image of size n \times n, \mathbf{x} = \{x_1, \dots, x_{n^2}\}. The model starts generating pixels from the top left corner, from left to right and top to bottom (See Fig. 6).
Fig. 6. The context for generating one pixel in PixelRNN. (Image source: Oord et al, 2016)
Every pixel x_i is sampled from a probability distribution conditional on the past context: pixels above it or on the left of it when in the same row. The definition of such context looks pretty arbitrary, because how visual attention is attended to an image is more flexible. Somehow magically a generative model with such a strong assumption works.
One implementation that could capture the entire context is the Diagonal BiLSTM. First, apply the skewing operation by offsetting each row of the input feature map by one position with respect to the previous row, so that computation for each row can be parallelized. Then the LSTM states are computed with respect to the current pixel and the pixels on the left.
Fig. 7. (a) PixelRNN with diagonal BiLSTM. (b) Skewing operation that offsets each row in the feature map by one with regards to the row above. (Image source: Oord et al, 2016)
\begin{aligned} \lbrack \mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i \rbrack &= \sigma(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i) & \scriptstyle{\text{; }\sigma\text{ is tanh for }\mathbf{g}\text{, but otherwise sigmoid.}} \\ \mathbf{c}_i &= \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i \\ \mathbf{h}_i &= \mathbf{o}_i \odot \tanh(\mathbf{c}_i) \end{aligned}
where \circledast denotes the convolution operation and \odot is the element-wise multiplication. The input-to-state component \mathbf{K}^{is} is a 1x1 convolution, while the state-to-state recurrent component is computed with a column-wise convolution \mathbf{K}^{ss} with a kernel of size 2x1.
The diagonal BiLSTM layers are capable of processing an unbounded context field, but are expensive to compute due to the sequential dependency between states. A faster implementation uses multiple convolutional layers without pooling to define a bounded context box. The convolution kernel is masked so that the future context is not seen, similar to MADE. This convolution version is called PixelCNN.
Fig. 8. PixelCNN with masked convolution constructed by an elementwise product of a mask tensor and the convolution kernel before applying it. (Image source: http://slazebni.cs.illinois.edu/spring17/lec13_advanced.pdf)
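A minimal sketch of such a kernel mask in NumPy, for a single channel. `causal_kernel_mask` is a hypothetical helper; whether the center position is visible is exposed as a flag, on the assumption that a first layer would hide it so that a pixel never sees its own value.

```python
import numpy as np

def causal_kernel_mask(kernel_size, include_center):
    """Binary mask keeping only positions above, or to the left in the same row."""
    mask = np.zeros((kernel_size, kernel_size))
    center = kernel_size // 2
    mask[:center, :] = 1.0        # every row above the current pixel
    mask[center, :center] = 1.0   # same row, strictly to the left
    if include_center:
        mask[center, center] = 1.0
    return mask

mask_first_layer = causal_kernel_mask(3, include_center=False)
mask_later_layer = causal_kernel_mask(3, include_center=True)

# The mask is applied to the kernel by an element-wise product before convolving.
kernel = np.ones((3, 3))
masked_kernel = kernel * mask_first_layer
```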

WaveNet

WaveNet (Van Den Oord, et al. 2016) is very similar to PixelCNN but applied to 1-D audio signals. WaveNet consists of a stack of causal convolutions, which are convolution operations designed to respect the ordering: the prediction at a certain timestep can only consume the data observed in the past, with no dependency on the future. In PixelCNN, the causal convolution is implemented by a masked convolution kernel. The causal convolution in WaveNet simply shifts the output by a number of timesteps to the future so that the output is aligned with the last input element.
One big drawback of a convolution layer is the very limited size of its receptive field. The output can hardly depend on the input hundreds or thousands of timesteps ago, which can be a crucial requirement for modeling long sequences. WaveNet therefore adopts dilated convolution, where the kernel is applied to an evenly-distributed subset of samples in a much larger receptive field of the input.
Fig. 9. Visualization of WaveNet models with a stack of (top) causal convolution layers and (bottom) dilated convolution layers. (Image source: Van Den Oord, et al. 2016)
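The effect of dilation on the receptive field can be seen in a small NumPy sketch. The helpers `causal_dilated_conv1d` and `stack` are illustrative only: a kernel of size 2 with all-ones weights, no nonlinearity, and left zero-padding to keep the output aligned with the input. Feeding an impulse through the stack reveals how far back each output can see.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, weight in enumerate(kernel):
        start = pad - k * dilation
        y += weight * padded[start:start + len(x)]
    return y

def stack(x, dilations, kernel=(1.0, 1.0)):
    """Apply one causal convolution layer per entry in `dilations`."""
    for dilation in dilations:
        x = causal_dilated_conv1d(x, kernel, dilation)
    return x

impulse = np.zeros(32)
impulse[0] = 1.0

# Four plain causal layers versus four layers with doubling dilation.
plain_field = int(np.nonzero(stack(impulse, [1, 1, 1, 1]))[0].max()) + 1
dilated_field = int(np.nonzero(stack(impulse, [1, 2, 4, 8]))[0].max()) + 1

# Causality check: changing the future must not change past outputs.
signal = np.sin(np.arange(32) / 3.0)
changed = signal.copy()
changed[20:] += 5.0
out_a = stack(signal, [1, 2, 4, 8])
out_b = stack(changed, [1, 2, 4, 8])
```

With the same number of layers, the plain stack covers 5 timesteps while the dilated stack covers 16, and `out_a` and `out_b` agree on every timestep before index 20.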
WaveNet uses the gated activation unit as the non-linear layer, as it is found to work significantly better than ReLU for modeling 1-D audio data. The residual connection is applied after the gated activation.
\mathbf{z} = \tanh(\mathbf{W}_{f,k}\circledast\mathbf{x})\odot\sigma(\mathbf{W}_{g,k}\circledast\mathbf{x})
where \mathbf{W}_{f,k} and \mathbf{W}_{g,k} are the convolution filter and gate weight matrices of the k-th layer, respectively; both are learnable.

Masked Autoregressive Flow

Masked Autoregressive Flow (MAF; Papamakarios et al., 2017) is a type of normalizing flow, where the transformation layer is built as an autoregressive neural network. MAF is very similar to Inverse Autoregressive Flow (IAF) introduced later. See more discussion on the relationship between MAF and IAF in the next section.
Given two random variables, \mathbf{z} \sim \pi(\mathbf{z}) and \mathbf{x} \sim p(\mathbf{x}), where the probability density function \pi(\mathbf{z}) is known, MAF aims to learn p(\mathbf{x}). MAF generates each x_i conditioned on the past dimensions \mathbf{x}_{1:i-1}.
Precisely, the conditional probability is an affine transformation of \mathbf{z}, where the scale and shift terms are functions of the observed part of \mathbf{x}.

  • Data generation, producing a new \mathbf{x}:

x_i \sim p(x_i\vert\mathbf{x}_{1:i-1}) = z_i \odot \sigma_i(\mathbf{x}_{1:i-1}) + \mu_i(\mathbf{x}_{1:i-1})\text{, where }\mathbf{z} \sim \pi(\mathbf{z})

  • Density estimation, given a known \mathbf{x}:

p(\mathbf{x}) = \prod_{i=1}^D p(x_i\vert\mathbf{x}_{1:i-1})
The generation procedure is sequential, so it is slow by design, while density estimation only needs one pass through the network using an architecture like MADE. The transformation function is trivial to invert and the Jacobian determinant is easy to compute too.
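The asymmetry between the two directions shows up clearly in code. Below is a minimal NumPy sketch in which `mu` and `sigma` are hypothetical stand-ins for a MADE network: strictly lower-triangular linear maps guarantee that the i-th output depends only on \mathbf{x}_{1:i-1}. The base distribution is assumed to be a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

# Strictly lower-triangular weights: row i only reads entries before i.
L_mu = np.tril(rng.normal(size=(D, D)), k=-1)
L_s = np.tril(rng.normal(size=(D, D)), k=-1)

def mu(x):
    return L_mu @ x

def sigma(x):
    return np.exp(np.tanh(L_s @ x))  # always positive

def maf_sample(z):
    """Data generation: sequential, because x_i needs x_{1:i-1} first."""
    x = np.zeros(D)
    for i in range(D):
        x[i] = z[i] * sigma(x)[i] + mu(x)[i]
    return x

def maf_log_prob(x):
    """Density estimation: all of x is known, so a single pass suffices."""
    s = sigma(x)
    z = (x - mu(x)) / s
    log_base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
    # dz/dx is lower triangular with 1 / sigma_i on the diagonal.
    return log_base - np.sum(np.log(s)), z

z = rng.normal(size=D)
x = maf_sample(z)
log_px, z_rec = maf_log_prob(x)
```

`maf_sample` calls the stand-in network D times, whereas `maf_log_prob` evaluates `mu` and `sigma` once and recovers the same `z`.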

Inverse Autoregressive Flow

Similar to MAF, Inverse Autoregressive Flow (IAF; Kingma et al., 2016) models the conditional probability of the target variable as an autoregressive model too, but with a reversed flow, thus achieving a much more efficient sampling process.
First, let’s reverse the affine transformation in MAF:
z_i = \frac{x_i - \mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})} = -\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})} + x_i \odot \frac{1}{\sigma_i(\mathbf{x}_{1:i-1})}
If we let:
\begin{aligned} & \tilde{\mathbf{x}} = \mathbf{z}\text{, }\tilde{p}(.) = \pi(.)\text{, }\tilde{\mathbf{x}} \sim \tilde{p}(\tilde{\mathbf{x}}) \\ & \tilde{\mathbf{z}} = \mathbf{x} \text{, }\tilde{\pi}(.) = p(.)\text{, }\tilde{\mathbf{z}} \sim \tilde{\pi}(\tilde{\mathbf{z}}) \\ & \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}) = \tilde{\mu}_i(\mathbf{x}_{1:i-1}) = -\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})} \\ & \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) = \tilde{\sigma}_i(\mathbf{x}_{1:i-1}) = \frac{1}{\sigma_i(\mathbf{x}_{1:i-1})} \end{aligned}
Then we would have,
\tilde{x}_i \sim p(\tilde{x}_i\vert\tilde{\mathbf{z}}_{1:i}) = \tilde{z}_i \odot \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) + \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}) \text{, where }\tilde{\mathbf{z}} \sim \tilde{\pi}(\tilde{\mathbf{z}})
IAF intends to estimate the probability density function of \tilde{\mathbf{x}} given that \tilde{\pi}(\tilde{\mathbf{z}}) is already known. The inverse flow is an autoregressive affine transformation too, same as in MAF, but the scale and shift terms are autoregressive functions of observed variables from the known distribution \tilde{\pi}(\tilde{\mathbf{z}}). See the comparison between MAF and IAF in Fig. 10.
Fig. 10. Comparison of MAF and IAF. The variable with known density is in green while the unknown one is in red.
Computations of the individual elements \tilde{x}_i do not depend on each other, so they are easily parallelizable (only one pass using MADE). The density estimation for a known \tilde{\mathbf{x}} is not efficient, because we have to recover the value of \tilde{z}_i in a sequential order, \tilde{z}_i = (\tilde{x}_i - \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1})) / \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}), thus D times in total.

| | Base distribution | Target distribution | Model | Data generation | Density estimation |
|---|---|---|---|---|---|
| MAF | \mathbf{z}\sim\pi(\mathbf{z}) | \mathbf{x}\sim p(\mathbf{x}) | x_i = z_i \odot \sigma_i(\mathbf{x}_{1:i-1}) + \mu_i(\mathbf{x}_{1:i-1}) | Sequential; slow | One pass; fast |
| IAF | \tilde{\mathbf{z}}\sim\tilde{\pi}(\tilde{\mathbf{z}}) | \tilde{\mathbf{x}}\sim\tilde{p}(\tilde{\mathbf{x}}) | \tilde{x}_i = \tilde{z}_i \odot \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) + \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}) | One pass; fast | Sequential; slow |
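The IAF row of this comparison can be sketched the same way. In this minimal NumPy illustration, `mu` and `sigma` are again hypothetical stand-ins for a MADE network (strictly lower-triangular linear maps), but here they read the base variable rather than the data, so sampling becomes a single pass and inversion becomes the sequential direction.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

# Strictly lower-triangular weights: row i only reads entries before i.
L_mu = np.tril(rng.normal(size=(D, D)), k=-1)
L_s = np.tril(rng.normal(size=(D, D)), k=-1)

def mu(z):
    return L_mu @ z

def sigma(z):
    return np.exp(np.tanh(L_s @ z))  # always positive

def iaf_sample(z):
    """Data generation: z is fully known, so every x_i is computed in one pass."""
    s = sigma(z)
    x = z * s + mu(z)
    log_det = np.sum(np.log(s))  # log|det dx/dz|, available for free
    return x, log_det

def iaf_invert(x):
    """Density estimation for an external x: z_i must be recovered one by one."""
    z = np.zeros(D)
    for i in range(D):
        z[i] = (x[i] - mu(z)[i]) / sigma(z)[i]
    return z

z = rng.normal(size=D)
x, log_det = iaf_sample(z)
z_rec = iaf_invert(x)
```

For a point the model sampled itself, the density comes almost for free as log π(z) minus `log_det`; only scoring an externally given `x` needs the D-step loop in `iaf_invert`.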

VAE + Flows

In a Variational Autoencoder, we may want to model the posterior p(\mathbf{z}\vert\mathbf{x}) as a more complicated distribution rather than a simple Gaussian. Intuitively, we can use a normalizing flow to transform the base Gaussian for better density approximation. The encoder would then predict a set of scale and shift terms (\mu_i, \sigma_i) which are all functions of the input \mathbf{x}. Read the paper for more details if interested.