flow-based generative model

{{Short description|Statistical model used in machine learning}}

{{Machine learning bar}}

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging a normalizing flow,{{cite journal |last1=Tabak |first1=Esteban G. |last2=Vanden-Eijnden |first2=Eric |title=Density estimation by dual ascent of the log-likelihood |journal=Communications in Mathematical Sciences |date=2010 |volume=8 |issue=1 |pages=217–233 |doi=10.4310/CMS.2010.v8.n1.a11 |url=https://projecteuclid.org/journals/communications-in-mathematical-sciences/volume-8/issue-1/Density-estimation-by-dual-ascent-of-the-log-likelihood/cms/1266935020.full}}{{cite journal |last1=Tabak |first1=Esteban G. |last2=Turner |first2=Cristina V. |title=A family of nonparametric density estimation algorithms |journal=Communications on Pure and Applied Mathematics |date=2012 |volume=66 |issue=2 |pages=145–164 |doi=10.1002/cpa.21423 |hdl=11336/8930 |s2cid=17820269 |url=https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21423|hdl-access=free }}{{cite journal |first1=George |last1=Papamakarios |first2=Eric |last2=Nalisnick |first3=Danilo |last3=Jimenez Rezende |first4=Shakir |last4=Mohamed |first5=Balaji |last5=Lakshminarayanan |year=2021 |title=Normalizing flows for probabilistic modeling and inference |journal=Journal of Machine Learning Research |volume=22 |issue=1 |pages=2617–2680 |arxiv=1912.02762 |url=https://dl.acm.org/doi/abs/10.5555/3546258.3546315 }} which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

The direct modeling of likelihood provides many advantages. For example, the negative log-likelihood can be directly computed and minimized as the loss function. Additionally, novel samples can be generated by sampling from the initial distribution, and applying the flow transformation.

In contrast, many alternative generative modeling methods such as variational autoencoder (VAE) and generative adversarial network do not explicitly represent the likelihood function.

Method

[[File:Normalizing-flow.svg|thumb]]

Let z_0 be a (possibly multivariate) random variable with distribution p_0(z_0).

For i = 1, ..., K, let z_i = f_i(z_{i-1}) be a sequence of random variables transformed from z_0. The functions f_1, ..., f_K should be invertible, i.e. the inverse function f^{-1}_i exists. The final output z_K models the target distribution.

The log likelihood of z_K is (see derivation):

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left|\det \frac{df_i(z_{i-1})}{dz_{i-1}}\right|

To efficiently compute the log likelihood, the functions f_1, ..., f_K should be easily invertible, and the determinants of their Jacobians should be simple to compute. In practice, the functions f_1, ..., f_K are modeled using deep neural networks, and are trained to minimize the negative log-likelihood of data samples from the target distribution. These architectures are usually designed such that only the forward pass of the neural network is required in both the inverse and the Jacobian determinant calculations. Examples of such architectures include NICE,{{cite arXiv | eprint=1410.8516| last1=Dinh| first1=Laurent| last2=Krueger| first2=David| last3=Bengio| first3=Yoshua| title=NICE: Non-linear Independent Components Estimation| year=2014| class=cs.LG}} RealNVP,{{cite arXiv | eprint=1605.08803| last1=Dinh| first1=Laurent| last2=Sohl-Dickstein| first2=Jascha| last3=Bengio| first3=Samy| title=Density estimation using Real NVP| year=2016| class=cs.LG}} and Glow.{{cite arXiv | eprint=1807.03039| last1=Kingma| first1=Diederik P.| last2=Dhariwal| first2=Prafulla| title=Glow: Generative Flow with Invertible 1x1 Convolutions| year=2018| class=stat.ML}}
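The following sketch (in Python with NumPy) illustrates the formula above by composing a few simple invertible layers; the specific layers (elementwise affine maps followed by an invertible leaky ReLU) are illustrative stand-ins, not any of the published architectures.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(z, a=0.5):
    # invertible elementwise nonlinearity
    return np.where(z > 0, z, a * z)

def leaky_relu_logdet(z, a=0.5):
    # its Jacobian is diagonal with entries 1 (z > 0) or a, so log|det| sums log-derivatives
    return np.where(z > 0, 0.0, np.log(a)).sum(axis=-1)

def forward(z0, scales, shifts):
    """Apply layers f_i(z) = leaky_relu(s_i * z + t_i), accumulating log|det J_i|."""
    z, logdet = z0, 0.0
    for s, t in zip(scales, shifts):
        y = s * z + t
        logdet = logdet + np.log(np.abs(s)).sum() + leaky_relu_logdet(y)
        z = leaky_relu(y)
    return z, logdet

d, K = 2, 3
z0 = rng.standard_normal((4, d))                       # samples from the base p_0 = N(0, I)
scales = [rng.uniform(0.5, 1.5, d) for _ in range(K)]  # arbitrary illustrative parameters
shifts = [rng.normal(0.0, 1.0, d) for _ in range(K)]

zK, logdet = forward(z0, scales, shifts)
log_p0 = -0.5 * (z0 ** 2).sum(axis=-1) - 0.5 * d * np.log(2 * np.pi)
log_pK = log_p0 - logdet                               # log p_K(z_K), per the formula above
</syntaxhighlight>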

= Derivation of log likelihood =

Consider z_1 and z_0. Note that z_0 = f^{-1}_1(z_1).

By the change of variable formula, the distribution of z_1 is:

: p_1(z_1) = p_0(z_0)\left|\det \frac{df_1^{-1}(z_1)}{dz_1}\right|

Where \det \frac{df_1^{-1}(z_1)}{dz_1} is the determinant of the Jacobian matrix of f^{-1}_1.

By the inverse function theorem:

: p_1(z_1) = p_0(z_0)\left|\det \left(\frac{df_1(z_0)}{dz_0}\right)^{-1}\right|

By the identity \det(A^{-1}) = \det(A)^{-1} (where A is an invertible matrix), we have:

: p_1(z_1) = p_0(z_0)\left|\det \frac{df_1(z_0)}{dz_0}\right|^{-1}

The log likelihood is thus:

: \log p_1(z_1) = \log p_0(z_0) - \log \left|\det \frac{df_1(z_0)}{dz_0}\right|

In general, the above applies to any z_i and z_{i-1}. Since \log p_i(z_i) equals \log p_{i-1}(z_{i-1}) minus a term that depends only on f_i and z_{i-1}, induction gives:

: \log p_K(z_K) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left|\det \frac{df_i(z_{i-1})}{dz_{i-1}}\right|
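The single-step case can be checked numerically. In the toy example below (not taken from the cited sources), z_0 is standard normal and f_1 = \exp, so z_1 = f_1(z_0) is log-normally distributed, and the change-of-variables formula reproduces the known log-normal density.

<syntaxhighlight lang="python">
import numpy as np

z0 = 0.7
z1 = np.exp(z0)                                      # f_1(z) = exp(z)

log_p0 = -0.5 * z0 ** 2 - 0.5 * np.log(2 * np.pi)    # log N(z0; 0, 1)
log_p1 = log_p0 - np.log(np.abs(np.exp(z0)))         # log p_1(z_1) = log p_0(z_0) - log|f_1'(z_0)|

# direct evaluation of the log-normal density at z1 gives the same value
log_p1_direct = -0.5 * np.log(z1) ** 2 - 0.5 * np.log(2 * np.pi) - np.log(z1)
assert np.isclose(log_p1, log_p1_direct)
</syntaxhighlight>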

Training method

As is generally done when training a deep learning model, the goal with normalizing flows is to minimize the Kullback–Leibler divergence between the model's likelihood and the target distribution to be estimated. Denoting p_\theta the model's likelihood and p^* the target distribution to learn, the (forward) KL-divergence is:

: D_{\text{KL}}[p^{*}(x) \| p_{\theta}(x)] = -\mathop{\mathbb{E}}_{p^{*}(x)}[\log p_{\theta}(x)] + \mathop{\mathbb{E}}_{p^{*}(x)}[\log p^{*}(x) ]

The second term on the right-hand side of the equation corresponds to the entropy of the target distribution and is independent of the parameter \theta we want the model to learn, which only leaves the expectation of the negative log-likelihood to minimize under the target distribution. This intractable term can be approximated with a Monte Carlo estimate: if we have a dataset \{x_{i}\}_{i=1}^N of samples each independently drawn from the target distribution p^*(x), then this term can be estimated as:

: -\hat{\mathop{\mathbb{E}}}_{p^{*}(x)}[\log p_{\theta}(x)] = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_{i})

Therefore, the learning objective

: \underset{\theta}{\operatorname{arg\,min}}\ D_{\text{KL}}[p^{*}(x) \| p_{\theta}(x)]

is replaced by

: \underset{\theta}{\operatorname{arg\,max}}\ \sum_{i=1}^{N} \log p_{\theta}(x_{i})

In other words, minimizing the Kullback–Leibler divergence between the model's likelihood and the target distribution is equivalent to maximizing the model likelihood under observed samples of the target distribution.{{Cite journal |last1=Papamakarios |first1=George |last2=Nalisnick |first2=Eric |last3=Rezende |first3=Danilo Jimenez |last4=Shakir |first4=Mohamed |last5=Balaji |first5=Lakshminarayanan |date=March 2021 |title=Normalizing Flows for Probabilistic Modeling and Inference |journal=Journal of Machine Learning Research |url=https://jmlr.org/papers/v22/19-1028.html |volume=22 |issue=57 |pages=1–64 |arxiv=1912.02762}}

A pseudocode for training normalizing flows is as follows:{{Cite journal |last1=Kobyzev |first1=Ivan |last2=Prince |first2=Simon J.D. |last3=Brubaker |first3=Marcus A. |date=November 2021 |title=Normalizing Flows: An Introduction and Review of Current Methods |url=https://ieeexplore.ieee.org/document/9089305 |journal=IEEE Transactions on Pattern Analysis and Machine Intelligence |volume=43 |issue=11 |pages=3964–3979 |doi=10.1109/TPAMI.2020.2992934 |pmid=32396070 |arxiv=1908.09257 |s2cid=208910764 |issn=1939-3539}}

  • INPUT. dataset x_{1:n}, normalizing flow model f_\theta(\cdot), p_0 .
  • SOLVE. \max_\theta \sum_j \ln p_\theta(x_j) by gradient descent
  • RETURN. \hat\theta
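The sketch below implements this loop for a deliberately simple flow, a single elementwise affine transformation written in PyTorch; the AffineFlow class, dataset and hyperparameters are illustrative and not taken from the cited references.

<syntaxhighlight lang="python">
import math
import torch

class AffineFlow(torch.nn.Module):
    """x = exp(s) * z + t with a standard normal base distribution p_0."""
    def __init__(self, dim):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(dim))    # log-scales
        self.t = torch.nn.Parameter(torch.zeros(dim))    # shifts

    def log_prob(self, x):
        z = (x - self.t) * torch.exp(-self.s)            # inverse map x -> z
        log_p0 = -0.5 * (z ** 2).sum(-1) - 0.5 * x.shape[-1] * math.log(2 * math.pi)
        return log_p0 - self.s.sum()                     # minus log|det df/dz|

data = 2.0 + 0.5 * torch.randn(1000, 2)                  # toy dataset x_{1:n}
flow = AffineFlow(dim=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)
for step in range(500):
    optimizer.zero_grad()
    loss = -flow.log_prob(data).mean()                   # negative log-likelihood
    loss.backward()
    optimizer.step()
</syntaxhighlight>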

Variants

= Planar Flow =

The planar flow is the earliest example.{{cite arXiv | eprint=1505.05770| author1=Danilo Jimenez Rezende| last2=Mohamed| first2=Shakir| title=Variational Inference with Normalizing Flows| year=2015| class=stat.ML}} Fix some activation function h, and let \theta = (u, w, b) with the appropriate dimensions; then

: x = f_\theta(z) = z + u h(\langle w, z \rangle + b)

The inverse f_\theta^{-1} has no closed-form solution in general.

The Jacobian determinant (in absolute value) is |\det (I + h'(\langle w, z \rangle + b) uw^T)| = |1 + h'(\langle w, z \rangle + b) \langle u, w\rangle|.

For the transform to be invertible everywhere, its Jacobian determinant must be nonzero everywhere. For example, h = \tanh with \langle u, w \rangle > -1 satisfies the requirement.
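A minimal numerical sketch of one planar flow layer with h = \tanh follows; the parameter values are arbitrary illustrations chosen to satisfy \langle u, w \rangle > -1.

<syntaxhighlight lang="python">
import numpy as np

u = np.array([0.5, -0.2, 0.3])       # illustrative parameters with <u, w> = 0.09 > -1
w = np.array([0.4, 0.1, -0.3])
b = 0.1
z = np.array([1.0, -0.5, 0.2])

a = w @ z + b
x = z + u * np.tanh(a)               # forward pass f_theta(z)

# log|det J| from the matrix determinant lemma vs. the explicit Jacobian
h_prime = 1.0 - np.tanh(a) ** 2
logdet = np.log(np.abs(1.0 + h_prime * (u @ w)))
J = np.eye(3) + h_prime * np.outer(u, w)
assert np.isclose(logdet, np.linalg.slogdet(J)[1])
</syntaxhighlight>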

= Nonlinear Independent Components Estimation (NICE) =

Let x, z\in \R^{2n} be even-dimensional, and split them in the middle. Then the normalizing flow functions are

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

where m_\theta is any neural network with weights \theta.

f_\theta^{-1} is simply z_1 = x_1, z_2 = x_2 - m_\theta(x_1), and the Jacobian determinant is 1; that is, the flow is volume-preserving.

When n=1, this can be seen as a curved shearing along the x_2 direction.
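A minimal sketch of one additive coupling layer, with the network m_\theta replaced by a fixed toy function, shows the exact inverse:

<syntaxhighlight lang="python">
import numpy as np

def m_theta(z1):
    # stand-in for a neural network with weights theta
    return np.tanh(z1 @ np.array([[0.5, -1.0], [0.3, 0.8]]))

z = np.array([0.2, -0.7, 1.1, 0.4])          # even-dimensional input, split in the middle
z1, z2 = z[:2], z[2:]
x1, x2 = z1, z2 + m_theta(z1)                # forward: x1 = z1, x2 = z2 + m_theta(z1)
z1_rec, z2_rec = x1, x2 - m_theta(x1)        # inverse; log|det J| = 0 (volume-preserving)
assert np.allclose(np.concatenate([z1_rec, z2_rec]), z)
</syntaxhighlight>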

= Real Non-Volume Preserving (Real NVP) =

The Real Non-Volume Preserving model generalizes the NICE model by:

: x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = f_\theta(z) = \begin{bmatrix} z_1 \\ e^{s_\theta(z_1)} \odot z_2 \end{bmatrix} + \begin{bmatrix} 0 \\ m_\theta(z_1) \end{bmatrix}

Its inverse is z_1 = x_1, z_2 = e^{-s_\theta (x_1)}\odot (x_2 - m_\theta (x_1)), and its Jacobian determinant is \prod^n_{i=1} e^{s_\theta(z_1)_i}. The NICE model is recovered by setting s_\theta = 0.

Since the Real NVP map keeps the first and second halves of the vector x separate, it's usually required to add a permutation (x_1, x_2) \mapsto (x_2, x_1) after every Real NVP layer.
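A corresponding sketch of one affine coupling layer, with toy stand-ins for the networks s_\theta and m_\theta (setting s_\theta = 0 recovers the additive NICE layer above):

<syntaxhighlight lang="python">
import numpy as np

def s_theta(z1):
    return 0.5 * np.tanh(z1 @ np.array([[0.2, -0.4], [0.7, 0.1]]))   # toy scale network
def m_theta(z1):
    return z1 @ np.array([[1.0, 0.3], [-0.5, 0.2]])                  # toy shift network

z = np.array([0.2, -0.7, 1.1, 0.4])
z1, z2 = z[:2], z[2:]
x1, x2 = z1, np.exp(s_theta(z1)) * z2 + m_theta(z1)       # forward
logdet = s_theta(z1).sum()                                 # log|det J| = sum_i s_theta(z1)_i
z2_rec = np.exp(-s_theta(x1)) * (x2 - m_theta(x1))         # inverse
assert np.allclose(z2_rec, z2)
</syntaxhighlight>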

= Generative Flow (Glow) =

In the generative flow model, each layer has three parts:

  • channel-wise affine transform y_{cij} = s_c(x_{cij} + b_c), with Jacobian determinant \prod_c s_c^{HW}.
  • invertible 1x1 convolution z_{cij} = \sum_{c'} K_{cc'} y_{c'ij}, with Jacobian determinant \det(K)^{HW}. Here K is any invertible matrix.
  • Real NVP, with Jacobian as described in Real NVP.

The idea of using the invertible 1x1 convolution is to mix all the channels in general, instead of merely permuting the first and second half, as in Real NVP.
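A numerical sketch of the invertible 1x1 convolution and its log-determinant (toy tensor sizes; K is an arbitrary illustrative invertible matrix):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 3, 2, 2
K = rng.normal(size=(C, C)) + 2 * np.eye(C)        # illustrative, (almost surely) invertible
y = rng.normal(size=(C, H, W))

z = np.einsum('cd,dhw->chw', K, y)                 # z_{cij} = sum_{c'} K_{cc'} y_{c'ij}

# the Jacobian of the full map is block-diagonal with one copy of K per spatial position
big_J = np.kron(np.eye(H * W), K)
assert np.isclose(H * W * np.linalg.slogdet(K)[1], np.linalg.slogdet(big_J)[1])
</syntaxhighlight>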

= Masked Autoregressive Flow (MAF) =

An autoregressive model of a distribution on \R^n is defined as the following stochastic process:{{Cite journal |last1=Papamakarios |first1=George |last2=Pavlakou |first2=Theo |last3=Murray |first3=Iain |date=2017 |title=Masked Autoregressive Flow for Density Estimation |url=https://proceedings.neurips.cc/paper/2017/hash/6c1da886822c67822bcf3679d04369fa-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30|arxiv=1705.07057 }}

: \begin{align}
x_1 \sim& N(\mu_1, \sigma_1^2)\\
x_2 \sim& N(\mu_2(x_1), \sigma_2(x_1)^2)\\
&\cdots \\
x_n \sim& N(\mu_n(x_{1:n-1}), \sigma_n(x_{1:n-1})^2)
\end{align}

where \mu_i: \R^{i-1} \to \R and \sigma_i: \R^{i-1} \to (0, \infty) are fixed functions that define the autoregressive model.

By the reparameterization trick, the autoregressive model is generalized to a normalizing flow:

: \begin{align}
x_1 =& \mu_1 + \sigma_1 z_1\\
x_2 =& \mu_2(x_1) + \sigma_2(x_1) z_2\\
&\cdots \\
x_n =& \mu_n(x_{1:n-1}) + \sigma_n(x_{1:n-1}) z_n
\end{align}

The autoregressive model is recovered by setting z \sim N(0, I_{n}).

The forward mapping z \mapsto x is slow (because it is sequential), but the backward mapping x \mapsto z is fast (because it can be computed in parallel).

The Jacobian matrix is lower-triangular, so the Jacobian determinant is \sigma_1 \sigma_2(x_1)\cdots \sigma_n(x_{1:n-1}).

Reversing the two maps f_\theta and f_\theta^{-1} of MAF results in Inverse Autoregressive Flow (IAF), which has fast forward mapping and slow backward mapping.{{Cite journal |last1=Kingma |first1=Durk P |last2=Salimans |first2=Tim |last3=Jozefowicz |first3=Rafal |last4=Chen |first4=Xi |last5=Sutskever |first5=Ilya |last6=Welling |first6=Max |date=2016 |title=Improved Variational Inference with Inverse Autoregressive Flow |url=https://proceedings.neurips.cc/paper/2016/hash/ddeebdeefdb7e7e7a697e1c3e3d8ef54-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=29|arxiv=1606.04934 }}
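The asymmetry can be sketched with toy \mu_i and \sigma_i (fixed stand-ins for the masked networks used in practice): generating x from z requires a sequential loop, while recovering z (and hence the density) from an observed x has no sequential dependence and can be parallelized.

<syntaxhighlight lang="python">
import numpy as np

def mu(x_prev):    return 0.5 * x_prev.sum()          # toy stand-ins for masked networks
def sigma(x_prev): return np.exp(0.1 * x_prev.sum())

rng = np.random.default_rng(0)
n = 4
z = rng.standard_normal(n)

x = np.zeros(n)
for i in range(n):                                    # forward (sampling): sequential
    x[i] = mu(x[:i]) + sigma(x[:i]) * z[i]

z_rec = np.array([(x[i] - mu(x[:i])) / sigma(x[:i]) for i in range(n)])   # inverse: parallelizable
assert np.allclose(z_rec, z)

log_abs_det = np.log([sigma(x[:i]) for i in range(n)]).sum()   # log of sigma_1 sigma_2(x_1) ...
</syntaxhighlight>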

= Continuous Normalizing Flow (CNF) =

Instead of constructing flow by function composition, another approach is to formulate the flow as a continuous-time dynamical system.{{cite arXiv | eprint=1810.01367| last1=Grathwohl| first1=Will| last2=Chen| first2=Ricky T. Q.| last3=Bettencourt| first3=Jesse| last4=Sutskever| first4=Ilya| last5=Duvenaud| first5=David| title=FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models| year=2018| class=cs.LG}}{{Cite arXiv |last1=Lipman |first1=Yaron |last2=Chen |first2=Ricky T. Q. |last3=Ben-Hamu |first3=Heli |last4=Nickel |first4=Maximilian |last5=Le |first5=Matt |date=2022-10-01 |title=Flow Matching for Generative Modeling |class=cs.LG | eprint=2210.02747}} Let z_0 be the latent variable with distribution p(z_0). Map this latent variable to data space with the following flow function:

: x = F(z_0) = z_T = z_0 + \int_0^T f(z_t, t) dt

where f is an arbitrary function and can be modeled with e.g. neural networks.

The inverse function is then naturally:

: z_0 = F^{-1}(x) = z_T + \int_T^0 f(z_t, t) dt = z_T - \int_0^T f(z_t,t) dt

And the log-likelihood of x can be found as:

: \log(p(x)) = \log(p(z_0)) - \int_0^T \text{Tr}\left[\frac{\partial f}{\partial z_t} \right] dt

Since the trace depends only on the diagonal of the Jacobian \partial_{z_t} f, this allows a "free-form" Jacobian.{{cite arXiv |last1=Grathwohl |first1=Will |last2=Chen |first2=Ricky T. Q. |last3=Bettencourt |first3=Jesse |last4=Sutskever |first4=Ilya |last5=Duvenaud |first5=David |date=2018-10-22 |title=FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models |class=cs.LG |eprint=1810.01367 }} Here, "free-form" means that there is no restriction on the Jacobian's form. It is contrasted with previous discrete models of normalizing flow, where the Jacobian is carefully designed to be upper- or lower-triangular, so that the Jacobian determinant can be evaluated efficiently.

The trace can be estimated by "Hutchinson's trick":{{Cite journal |last1=Finlay |first1=Chris |last2=Jacobsen |first2=Joern-Henrik |last3=Nurbekyan |first3=Levon |last4=Oberman |first4=Adam |date=2020-11-21 |title=How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization |url=https://proceedings.mlr.press/v119/finlay20a.html |journal=International Conference on Machine Learning |language=en |publisher=PMLR |pages=3154–3164|arxiv=2002.02798 }}{{Cite journal |last=Hutchinson |first=M.F. |date=January 1989 |title=A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines |url=http://www.tandfonline.com/doi/abs/10.1080/03610918908812806 |journal=Communications in Statistics - Simulation and Computation |language=en |volume=18 |issue=3 |pages=1059–1076 |doi=10.1080/03610918908812806 |issn=0361-0918|url-access=subscription }}

Given any matrix W\in \R^{n\times n}, and any random u\in \R^n with E[uu^T] = I, we have E[u^T W u] = tr(W). (Proof: expand the expectation directly.)
Usually, the random vector is sampled from N(0, I) (normal distribution) or from \{\pm 1\}^n with independent, uniformly distributed signs (Rademacher distribution).
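A toy check of the estimator (here the matrix W is formed explicitly; in a continuous normalizing flow the products u^T W u would instead be computed with vector-Jacobian products):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, num_probes = 10, 10000
W = rng.normal(size=(n, n))

U = rng.choice([-1.0, 1.0], size=(num_probes, n))     # Rademacher probes, E[u u'] = I
estimate = np.einsum('ki,ij,kj->k', U, W, U).mean()   # average of u' W u
print(estimate, np.trace(W))                          # the two values agree closely
</syntaxhighlight>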

When f is implemented as a neural network, neural ODE methods{{cite conference |last1=Chen |first1=Ricky T. Q. |last2=Rubanova |first2=Yulia |last3=Bettencourt |first3=Jesse |last4=Duvenaud |first4=David K. |year=2018 |editor1-last=Bengio |editor1-first=S. |editor2-last=Wallach |editor2-first=H. |editor3-last=Larochelle |editor3-first=H. |editor4-last=Grauman |editor4-first=K. |editor5-last=Cesa-Bianchi |editor5-first=N. |editor6-last=Garnett |editor6-first=R. |title=Neural Ordinary Differential Equations |url=https://proceedings.neurips.cc/paper_files/paper/2018/file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf |conference= |publisher=Curran Associates, Inc. |volume=31 |arxiv=1806.07366 |book-title=Advances in Neural Information Processing Systems}} would be needed. Indeed, CNF was first proposed in the same paper that proposed neural ODE.
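The following toy sketch integrates a linear vector field (a stand-in for a neural network f) with explicit Euler steps while accumulating the trace term of the log-likelihood formula above; in practice an adaptive ODE solver and automatic differentiation would be used instead.

<syntaxhighlight lang="python">
import numpy as np

A = np.array([[0.1, -0.3], [0.2, 0.05]])   # illustrative linear dynamics
def f(z, t):
    return A @ z                           # df/dz = A, so Tr[df/dz] = np.trace(A)

T, steps = 1.0, 1000
dt = T / steps
z = np.array([0.5, -1.0])                  # z_0 drawn from the latent distribution p(z_0)
delta_logp = 0.0
for k in range(steps):
    delta_logp -= np.trace(A) * dt         # accumulates minus the integral of Tr[df/dz] dt
    z = z + f(z, k * dt) * dt              # Euler step for z_t

log_p0 = -0.5 * (np.array([0.5, -1.0]) ** 2).sum() - np.log(2 * np.pi)   # N(0, I), d = 2
log_px = log_p0 + delta_logp               # log p(x) at x = z_T
# for this linear field the exact value is log_p0 - T * np.trace(A), since z_T = expm(A T) z_0
</syntaxhighlight>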

There are two main deficiencies of CNF. One is that a continuous flow must be a homeomorphism, thus preserving orientation and ambient isotopy (for example, it is impossible to flip a left hand into a right hand by a continuous deformation of space, and it is impossible to turn a sphere inside out or to undo a knot); the other is that the learned flow f might be ill-behaved, due to degeneracy (that is, an infinite number of possible f all solve the same problem).

By adding extra dimensions, the CNF gains enough freedom to reverse orientation and go beyond ambient isotopy (just like how one can pick up a polygon from a desk and flip it around in 3-space, or unknot a knot in 4-space), yielding the "augmented neural ODE".{{Cite journal |last1=Dupont |first1=Emilien |last2=Doucet |first2=Arnaud |last3=Teh |first3=Yee Whye |date=2019 |title=Augmented Neural ODEs |url=https://proceedings.neurips.cc/paper/2019/hash/21be9a4bd4f81549a9d1d241981cec3c-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32}}

Any homeomorphism of \R^n can be approximated by a neural ODE operating on \R^{2n+1}, as can be proved by combining the Whitney embedding theorem for manifolds and the universal approximation theorem for neural networks.{{cite arXiv |last1=Zhang |first1=Han |last2=Gao |first2=Xi |last3=Unterman |first3=Jacob |last4=Arodz |first4=Tom |date=2019-07-30 |title=Approximation Capabilities of Neural ODEs and Invertible Residual Networks |class=cs.LG |language=en |eprint=1907.12998}}

To regularize the flow f, one can impose regularization losses. The paper by Finlay et al. proposed the following regularization loss based on optimal transport theory:

: \lambda_{K} \int_{0}^{T}\left\|f(z_t, t)\right\|^{2} dt +\lambda_{J} \int_{0}^{T}\left\|\nabla_z f(z_t, t)\right\|_F^{2} dt

where \lambda_K, \lambda_J > 0 are hyperparameters. The first term punishes the model for oscillating the flow field over time, and the second term punishes it for oscillating the flow field over space. Both terms together guide the model into a flow that is smooth (not "bumpy") over space and time.

Flows on manifolds

When a probabilistic flow transforms a distribution on an m-dimensional manifold embedded in \R^n, where m < n, and where the transformation is specified as a function \R^n\to\R^n, the scaling factor between the source and transformed PDFs is not given by the naive computation of the determinant of the n\text{-by-}n Jacobian (which is zero), but instead by the determinant(s) of one or more suitably defined m\text{-by-}m matrices. This section is an interpretation of the tutorial in the appendix of Sorrenson et al. (2023),{{Cite arXiv |last1=Sorrenson |first1=Peter |last2=Draxler |first2=Felix |last3=Rousselot |first3=Armand |last4=Hummerich |first4=Sander |last5=Köthe |first5=Ullrich |title=Learning Distributions on Manifolds with Free-Form Flows |eprint=2312.09852 |year=2023}} where the more general case of non-isometrically embedded Riemannian manifolds is also treated. Here we restrict attention to isometrically embedded manifolds.

As running examples of manifolds with smooth, isometric embedding in \R^n we shall use:

  • The unit hypersphere: \mathbb S^{n-1}=\{\mathbf x\in\R^n:\mathbf x'\mathbf x=1\}, where flows can be used to generalize e.g. von Mises–Fisher or uniform spherical distributions.
  • The simplex interior: \Delta^{n-1}=\{\mathbf p=(p_1,\dots,p_n)\in\R^n:p_i>0, \sum_ip_i=1\}, where n-way categorical distributions live; and where flows can be used to generalize e.g. Dirichlet, or uniform simplex distributions.

As a first example of a spherical manifold flow transform, consider the normalized linear transform, which radially projects onto the unit sphere the output of an invertible linear transform, parametrized by the n\text{-by-}n invertible matrix \mathbf M:

: f_\text{lin}(\mathbf x; \mathbf M) = \frac{\mathbf{Mx}}{\lVert\mathbf{Mx}\rVert}

In full Euclidean space, f_\text{lin}:\R^n\to\R^n is not invertible, but if we restrict the domain and co-domain to the unit sphere, then f_\text{lin}:\mathbb S^{n-1}\to\mathbb S^{n-1} is invertible (more specifically, it is a bijection, a homeomorphism and a diffeomorphism), with inverse f_\text{lin}(\cdot\,;\mathbf M^{-1}). The Jacobian of f_\text{lin}:\R^n\to\R^n at \mathbf y=f_\text{lin}(\mathbf x;\mathbf M) is \lVert\mathbf{Mx}\rVert^{-1}(\mathbf I_n -\mathbf{yy}')\mathbf M, which has rank n-1 and determinant zero; the factor relating source and transformed densities (derived in the subsections below) is instead \lVert\mathbf{Mx}\rVert^{-n}\left|\operatorname{det}\mathbf M\right|.

= Differential volume ratio =

For m < n, let \mathcal M\subset\R^n be an m-dimensional manifold with a smooth, isometric embedding in \R^n. Let f:\R^n\to\R^n be a smooth flow transform with range restricted to \mathcal M. Let \mathbf x\in\mathcal M be sampled from a distribution with density P_X. Let \mathbf y=f(\mathbf x), with resultant (pushforward) density P_Y. Let U\subset\mathcal M be a small, convex region containing \mathbf x and let V=f(U) be its image, which contains \mathbf y; then by conservation of probability mass:

: P_X(\mathbf x)\operatorname{volume}(U)\approx P_Y(\mathbf y)\operatorname{volume}(V)

where volume (for very small regions) is given by Lebesgue measure in m-dimensional tangent space. By making the regions infinitesimally small, the factor relating the two densities is the ratio of volumes, which we term the differential volume ratio.

To obtain concrete formulas for volume, we let U be an m-dimensional rectangle, or a linearly transformed rectangle, which is a parallelotope (a multidimensional generalization of a parallelogram). For very small regions, f becomes essentially linear, so that the image, V=f(U), is also a parallelotope. In \R^m, we can represent an m-dimensional parallelotope with an m\text{-by-}m matrix whose column vectors are a set of edges (meeting at a common vertex) that span the parallelotope. The volume is given by the absolute value of the determinant of this matrix. If, more generally, the same parallelotope is embedded in \R^n, it can be represented with a (tall) n\text{-by-}m matrix, say \mathbf V, whose volume is given by the square root of the Gram determinant, \sqrt{\left|\operatorname{det}(\mathbf V'\mathbf V)\right|}. In the sections below, we show various ways to use this volume formula to derive the differential volume ratio.

= Simplex flow =

As a first example, we develop expressions for the differential volume ratio in the simplex case, when \mathcal M=\Delta^{n-1}. Define the embedding function, e:\tilde\mathbf p=(p_1,\dots,p_{n-1})\mapsto\mathbf p=(p_1,\dots,p_{n-1},1-\sum_{i=1}^{n-1}p_i), which maps an arbitrarily chosen, m=(n-1)-dimensional representation, \tilde\mathbf p, to the embedded manifold. The n\text{-by-}m Jacobian is

: \mathbf E = \begin{bmatrix} \mathbf{I}_{n-1} \\ -\boldsymbol{1}' \end{bmatrix}

To define U in \mathbf p-space, start with a rectangle in \tilde\mathbf p-space, having (signed) differential side-lengths dp_1, \dots, dp_{n-1}, and use these to form the (n-1)\text{-by-}(n-1) diagonal matrix \mathbf D, the columns of which span the rectangle. The parallelotope U is spanned by the columns of the matrix \mathbf{ED}.

[[File:Simplex measure pullback.svg|thumb|The 1-simplex example: pulling back measure from tangent space (coinciding with the simplex), via the embedding p_1\mapsto(p_1,1-p_1), with Jacobian \mathbf E=\begin{bmatrix}1&-1\end{bmatrix}', a scaling factor of \sqrt{\mathbf E'\mathbf E}=\sqrt2 results.]]

The Gram determinant volume formula gives:

: \operatorname{volume}(U) = \sqrt{\left|\operatorname{det}(\mathbf{DE}'\mathbf{ED})\right|} = \sqrt{\left|\operatorname{det}(\mathbf E'\mathbf E)\right|}\,\left|\operatorname{det}\mathbf D\right| = \sqrt n\prod_{i=1}^{n-1} \left|dp_i\right|

To understand the geometric interpretation of the factor \sqrt{n}, see the example for the 1-simplex in the diagram at right.

To get the volume of V=f(U), we use \mathbf{F_p} for the n\text{-by-}n Jacobian of f at \mathbf p=e(\tilde\mathbf p) to get:

: \operatorname{volume}(V) = \sqrt{\left|\operatorname{det}(\mathbf{DE}'\mathbf{F_p}'\mathbf{F_pED})\right|} = \sqrt{\left|\operatorname{det}(\mathbf E'\mathbf{F_p}'\mathbf{F_pE})\right|}\,\left|\operatorname{det}\mathbf D\right|

so that the factor \left|\operatorname{det}\mathbf D\right| cancels in the volume ratio, which can now already be numerically evaluated. It can however be rewritten in a sometimes more convenient form by also introducing the representation function, r:\mathbf p\mapsto\tilde\mathbf p, which simply extracts the first (n-1) components. Its Jacobian is \mathbf R=\begin{bmatrix}\mathbf I_{n-1}&\boldsymbol0\end{bmatrix}. Observe that, since e\circ r\circ f=f, the chain rule for function composition gives \mathbf{ERF_p}=\mathbf{F_p}. By plugging this expansion into the above Gram determinant and then refactoring it as a product of determinants of square matrices, we can extract the factor \sqrt{\left|\operatorname{det}(\mathbf E'\mathbf E)\right|}=\sqrt n, which now also cancels in the ratio, which finally simplifies to the determinant of the Jacobian of the "sandwiched" flow transformation, r\circ f\circ e:

: R^\Delta_f(\mathbf p)=\frac{\operatorname{volume}(V)}{\operatorname{volume}(U)} = \left|\operatorname{det}(\mathbf{RF_pE})\right|

which, if \mathbf p\sim P_\mathbf P, can be used to derive the pushforward density after a change of variables, \mathbf q = f(\mathbf p):

: P_{\mathbf Q}(\mathbf q) = \frac{P_{\mathbf P}(\mathbf p)}{R^\Delta_f(\mathbf p)}\,,\;\text{where}\;\;\mathbf p=f^{-1}(\mathbf q)

This formula is valid only because the simplex is flat and the Jacobian \mathbf E is constant. The more general case is discussed below, after we present a concrete example of a simplex transform.

== Simplex calibration transform ==

A calibration transform, which is sometimes used in machine learning for post-processing of the (class posterior) outputs of a probabilistic n-class classifier,{{Cite arXiv |last1=Ferrer |first1=Luciana |last2=Ramos |first2=Daniel |title=Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration |eprint=2408.02841 |year=2024}} uses the softmax function to renormalize categorical distributions after scaling and translation of the input distributions in log-probability space. For \mathbf p, \mathbf q\in\Delta^{n-1} and with parameters a>0 and \mathbf c\in\R^n, the transform can be specified as:

: \mathbf q=f_\text{cal}(\mathbf p; a, \mathbf c) = \operatorname{softmax}(a^{-1}\log\mathbf p+\mathbf c)\;\iff\;\mathbf p=f^{-1}_\text{cal}(\mathbf q; a, \mathbf c) = \operatorname{softmax}(a\log\mathbf q-a\mathbf c)

where the log is applied elementwise. After some algebra the differential volume ratio can be expressed as:

: R^\Delta_\text{cal}(\mathbf p; a, \mathbf c) = \left|\operatorname{det}(\mathbf{RF_pE})\right| = a^{1-n}\prod_{i=1}^n\frac{q_i}{p_i}

While calibration transforms are most often trained as discriminative models, the reinterpretation here as a probabilistic flow allows also the design of generative calibration models.
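The expression can be checked numerically by assembling the Jacobian factors \mathbf R, \mathbf{F_p} and \mathbf E directly; the values of n, a, \mathbf c and \mathbf p below are arbitrary illustrations, and the softmax Jacobian \operatorname{diag}(\mathbf q)-\mathbf{qq}' is standard.

<syntaxhighlight lang="python">
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

n, a = 3, 2.0
c = np.array([0.3, -0.1, 0.4])
p = np.array([0.5, 0.2, 0.3])
q = softmax(np.log(p) / a + c)                         # q = f_cal(p; a, c)

E = np.vstack([np.eye(n - 1), -np.ones(n - 1)])        # Jacobian of the embedding e
R = np.hstack([np.eye(n - 1), np.zeros((n - 1, 1))])   # Jacobian of the representation r
F_p = (np.diag(q) - np.outer(q, q)) @ np.diag(1.0 / (a * p))   # Jacobian of f_cal at p

ratio_det = abs(np.linalg.det(R @ F_p @ E))
ratio_formula = a ** (1 - n) * np.prod(q / p)
assert np.isclose(ratio_det, ratio_formula)
</syntaxhighlight>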

= Differential volume ratio for curved manifolds =

For a flow, \mathbf y=f(\mathbf x), on a curved manifold, for example \mathbb S^{n-1}, equipped with an embedding function e that maps a set of (n-1) angular spherical coordinates to \mathbb S^{n-1}, the Jacobian of e is non-constant and we have to evaluate it at both input (\mathbf {E_x}) and output (\mathbf {E_y}). The same applies to r, the representation function that recovers spherical coordinates from points in \R^n, where we will need the Jacobian at the output (\mathbf{R_y}). The differential volume ratio now generalizes to:

: R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{R_yF_xE_x})\right|\,\frac{\sqrt{\left|\operatorname{det}(\mathbf{E_y}'\mathbf{E_y})\right|}}{\sqrt{\left|\operatorname{det}(\mathbf{E_x}'\mathbf{E_x})\right|}}

For geometric insight, consider \mathbb S^2, where we choose to work with spherical coordinates, with co-latitude \theta\in[0,\pi] and longitude \phi\in[0,2\pi). At \mathbf x = e(\theta,\phi), we get \sqrt{\left|\operatorname{det}(\mathbf{E_x}'\mathbf{E_x})\right|}=\sin\theta, which gives the radius of the circle at that latitude (compare e.g. a polar circle to the equator).

We can, however, simplify this formula more generally for isometrically embedded manifolds by abandoning the fixed coordinate system given by the fixed functions e, r. Instead, at the input and output we choose two different local coordinate systems, whose embedding Jacobians are n\text{-by-}(n-1) matrices (call them \mathbf{T_x}, \mathbf{T_y}) with orthonormal columns that span the local tangent spaces. Then we get \mathbf{E_x}=\mathbf{T_x} and \mathbf{R_y}=\mathbf{T_y}' and, since \mathbf{T_x}'\mathbf{T_x}=\mathbf{T_y}'\mathbf{T_y}=\mathbf I_{n-1}, the square root factors cancel, to give:

: R_f(\mathbf x) = \left|\operatorname{det}(\mathbf{T_y}'\mathbf{F_xT_x})\right|

which agrees with the formula from Sorrenson et al. (2023), who derived it via a different argument. Moreover, that paper gives a general method to construct the tangent matrices and shows how to efficiently compute stochastic gradient approximations for the differential volume log-ratio during learning.

The tangent matrix at a point \mathbf x\in\mathcal M is not unique. If \mathbf{T_x}, of size n\text{-by-}m, spans the local tangent space with orthonormal columns, then so does \mathbf{T_xQ} for any m\text{-by-}m orthogonal matrix \mathbf Q, and it is easy to check that the above determinant formula is invariant to such reparametrization of the tangent space. For our running examples:

  • The tangent space of the simplex, \Delta^{n-1}, is the hyperplane that coincides with the simplex and it is the same everywhere.
  • The local tangent space at \mathbf x\in\mathbb S^{n-1} is the hyperplane perpendicular to the radius \mathbf x, and it is the image (column space) of the orthogonal projection matrix \mathbf I_n-\mathbf{xx}'.
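As a numerical illustration, the tangent-space formula can be checked against the factor \lVert\mathbf{Mx}\rVert^{-n}\left|\operatorname{det}\mathbf M\right| quoted earlier for the normalized linear transform on the unit sphere; the matrix \mathbf M and the point \mathbf x below are arbitrary illustrations, and the tangent bases are built with a QR decomposition.

<syntaxhighlight lang="python">
import numpy as np

def tangent_basis(v):
    """Orthonormal basis (n-by-(n-1)) of the hyperplane orthogonal to the unit vector v."""
    Q, _ = np.linalg.qr(v.reshape(-1, 1), mode='complete')
    return Q[:, 1:]

rng = np.random.default_rng(0)
n = 4
M = rng.normal(size=(n, n)) + 2 * np.eye(n)            # illustrative invertible matrix
x = rng.normal(size=n)
x /= np.linalg.norm(x)                                 # a point on the unit sphere

Mx = M @ x
y = Mx / np.linalg.norm(Mx)                            # y = f_lin(x; M)
F_x = (np.eye(n) - np.outer(y, y)) @ M / np.linalg.norm(Mx)   # Jacobian of f_lin at x

T_x, T_y = tangent_basis(x), tangent_basis(y)
ratio_tangent = abs(np.linalg.det(T_y.T @ F_x @ T_x))
ratio_formula = np.linalg.norm(Mx) ** (-n) * abs(np.linalg.det(M))
assert np.isclose(ratio_tangent, ratio_formula)
</syntaxhighlight>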

Downsides

Despite the success of normalizing flows in estimating high-dimensional densities, some downsides still exist in their design. First of all, their latent space, onto which input data is projected, is not a lower-dimensional space; therefore, flow-based models do not allow for compression of data by default and require a lot of computation. However, it is still possible to perform image compression with them.{{cite arXiv | eprint=2008.10486| last1=Helminger| first1=Leonhard| last2=Djelouah| first2=Abdelaziz| last3=Gross| first3=Markus| last4=Schroers| first4=Christopher| title=Lossy Image Compression with Normalizing Flows| year=2020| class=cs.CV}}

Flow-based models are also notorious for failing to estimate the likelihood of out-of-distribution samples (i.e., samples that were not drawn from the same distribution as the training set).{{cite arXiv | eprint=1810.09136v3| last1=Nalisnick| first1=Eric| last2=Matsukawa| first2=Akihiro| last3=Teh| first3=Yee Whye| last4=Gorur| first4=Dilan| last5=Lakshminarayanan| first5=Balaji| title=Do Deep Generative Models Know What They Don't Know?| year=2018| class=stat.ML}} Some hypotheses have been formulated to explain this phenomenon, among which the typical set hypothesis,{{cite arXiv | eprint=1906.02994| last1=Nalisnick| first1=Eric| last2=Matsukawa| first2=Akihiro| last3=Teh| first3=Yee Whye| last4=Lakshminarayanan| first4=Balaji| title=Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality| year=2019| class=stat.ML}} estimation issues when training models,{{Cite journal |last1=Zhang |first1=Lily |last2=Goldstein |first2=Mark |last3=Ranganath |first3=Rajesh |date=2021 |title=Understanding Failures in Out-of-Distribution Detection with Deep Generative Models |journal=Proceedings of Machine Learning Research |url=https://proceedings.mlr.press/v139/zhang21g.html |volume=139 |pages=12427–12436|pmid=35860036 |pmc=9295254 }} or fundamental issues due to the entropy of the data distributions.{{Cite arXiv|last1=Caterini |first1=Anthony L. |last2=Loaiza-Ganem |first2=Gabriel |date=2022 |title=Entropic Issues in Likelihood-Based OOD Detection |pages=21–26|class=stat.ML |eprint=2109.10794 }}

One of the most interesting properties of normalizing flows is the invertibility of their learned bijective map. This property is ensured by constraints in the design of the models (cf. RealNVP, Glow) which guarantee theoretical invertibility. The integrity of the inverse is important for the applicability of the change-of-variable theorem, for the computation of the Jacobian of the map, and for sampling with the model. However, in practice this invertibility can be violated, and the inverse map can explode because of numerical imprecision.{{cite arXiv | eprint=2006.09347| last1=Behrmann| first1=Jens| last2=Vicol| first2=Paul| last3=Wang| first3=Kuan-Chieh| last4=Grosse| first4=Roger| last5=Jacobsen| first5=Jörn-Henrik| title=Understanding and Mitigating Exploding Inverses in Invertible Neural Networks| year=2020| class=cs.LG}}

Applications

Flow-based generative models have been applied on a variety of modeling tasks, including:

  • Audio generation{{cite arXiv | eprint=1912.01219| last1=Ping| first1=Wei| last2=Peng| first2=Kainan| last3=Gorur| first3=Dilan| last4=Lakshminarayanan| first4=Balaji| title=WaveFlow: A Compact Flow-based Model for Raw Audio| year=2019| class=cs.SD}}
  • Image generation
  • Molecular graph generation{{cite arXiv | eprint=2001.09382| last1=Shi| first1=Chence| last2=Xu| first2=Minkai| last3=Zhu| first3=Zhaocheng| last4=Zhang| first4=Weinan| last5=Zhang| first5=Ming| last6=Tang| first6=Jian| title=GraphAF: A Flow-based Autoregressive Model for Molecular Graph Generation| year=2020| class=cs.LG}}
  • Point-cloud modeling{{cite arXiv | eprint=1906.12320| last1=Yang| first1=Guandao| last2=Huang| first2=Xun| last3=Hao| first3=Zekun| last4=Liu| first4=Ming-Yu| last5=Belongie| first5=Serge| last6=Hariharan| first6=Bharath| title=PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows| year=2019| class=cs.CV}}
  • Video generation{{cite arXiv | eprint=1903.01434| last1=Kumar| first1=Manoj| last2=Babaeizadeh| first2=Mohammad| last3=Erhan| first3=Dumitru| last4=Finn| first4=Chelsea| last5=Levine| first5=Sergey| last6=Dinh| first6=Laurent| last7=Kingma| first7=Durk| title=VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation| year=2019| class=cs.CV}}
  • Lossy image compression
  • Anomaly detection{{cite arXiv | eprint=2008.12577| last1=Rudolph| first1=Marco| last2=Wandt| first2=Bastian| last3=Rosenhahn| first3=Bodo| title=Same Same But DifferNet: Semi-Supervised Defect Detection with Normalizing Flows| year=2021| class=cs.CV}}

References

{{reflist}}