Energy-based model

An energy-based model (EBM) (also called Canonical Ensemble Learning or Learning via Canonical Ensemble – CEL and LCE, respectively) is an application of the canonical ensemble formulation from statistical physics to learning from data. The approach features prominently in generative artificial intelligence.

EBMs provide a unified framework for many probabilistic and non-probabilistic approaches to such learning, particularly for training graphical and other structured models.

An EBM learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution.

Energy-based generative neural networks{{Cite journal|last1=Xie|first1=Jianwen|last2=Lu|first2=Yang|last3=Zhu|first3=Song-Chun|last4=Wu|first4=Ying Nian|title=A theory of generative ConvNet|journal=ICML|date=2016|arxiv=1602.03264|bibcode=2016arXiv160203264X}}{{Cite journal|last1=Xie|first1=Jianwen|last2=Zhu|first2=Song-Chun|last3=Wu|first3=Ying Nian|date=2019|title=Learning Energy-based Spatial-Temporal Generative ConvNets for Dynamic Patterns|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=43 |issue=2 |pages=516–531|doi=10.1109/tpami.2019.2934852|pmid=31425020|arxiv=1909.11975|bibcode=2019arXiv190911975X|s2cid=201098397 |issn=0162-8828}} are a class of generative models that aim to learn explicit probability distributions of data in the form of energy-based models whose energy functions are parameterized by deep neural networks.

Boltzmann machines are a special form of energy-based models with a specific parametrization of the energy. (Yoshua Bengio, Learning Deep Architectures for AI, p. 54, https://books.google.com/books?id=cq5ewg7FniMC&pg=PA54)

Description

For a given input $x$, the model defines an energy $E_\theta(x)$ such that the Boltzmann distribution $P_\theta(x)=\exp(-\beta E_\theta(x))/Z(\theta)$ is a probability (density). Here $\beta$ is an inverse-temperature parameter, typically set to $\beta=1$.

Since the normalization constant:

$Z(\theta):=\int_{x \in X} \exp(-\beta E_\theta(x)) dx$

(also known as the partition function) depends on the Boltzmann factors of all possible inputs $x$, it cannot be easily computed or reliably estimated during training, so standard maximum likelihood estimation cannot be applied directly.
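The cost of the partition function can be illustrated on a small discrete space, where the integral becomes a sum. The following is a minimal sketch, assuming a hypothetical toy energy on binary vectors (a nearest-neighbour coupling chosen only for illustration) and $\beta=1$ by default; the function names are not taken from any library.

```python
import math
from itertools import product

def energy(x, theta):
    """Hypothetical toy energy on a binary vector x: neighbouring bits that
    agree lower the energy by theta, bits that disagree raise it by theta."""
    spins = [2 * b - 1 for b in x]  # map {0, 1} to {-1, +1}
    return -theta * sum(a * b for a, b in zip(spins, spins[1:]))

def partition_function(n, theta, beta=1.0):
    """Brute-force Z(theta): one Boltzmann factor per state, 2**n states."""
    return sum(math.exp(-beta * energy(x, theta))
               for x in product((0, 1), repeat=n))

n, theta = 8, 0.5
Z = partition_function(n, theta)

# Boltzmann probabilities of all 2**n states (beta = 1); they sum to one.
probs = [math.exp(-energy(x, theta)) / Z for x in product((0, 1), repeat=n)]
print(len(probs), sum(probs))
```

The number of terms in the sum doubles with every additional bit, so this brute-force approach is only feasible for very small inputs.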

However, maximizing the likelihood during training only requires the gradient of the log-likelihood, which for a single training example $x$ (taking $\beta=1$) follows from the chain rule:

$\partial_\theta \log\left(P_\theta(x)\right)=\mathbb{E}_{x'\sim P_\theta}[\partial_\theta E_\theta(x')]-\partial_\theta E_\theta(x) \, (*)$
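Equation $(*)$ can be checked in two steps, taking $\beta=1$ and assuming that differentiation and integration can be exchanged. Writing $\log P_\theta(x) = -E_\theta(x) - \log Z(\theta)$, the second term differentiates to

$\partial_\theta \log Z(\theta) = \frac{1}{Z(\theta)}\int_{x' \in X} \partial_\theta \exp(-E_\theta(x'))\, dx' = -\int_{x' \in X} P_\theta(x')\, \partial_\theta E_\theta(x')\, dx' = -\mathbb{E}_{x'\sim P_\theta}[\partial_\theta E_\theta(x')]$

so that $\partial_\theta \log P_\theta(x) = -\partial_\theta E_\theta(x) - \partial_\theta \log Z(\theta)$ reproduces the two terms of $(*)$.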

The expectation in the above formula for the gradient can be approximately estimated by drawing samples $x'$ from the distribution $P_\theta$ using Markov chain Monte Carlo (MCMC).{{cite arXiv|last1=Du|first1=Yilun|last2=Mordatch|first2=Igor|date=2019-03-20|title=Implicit Generation and Generalization in Energy-Based Models|eprint=1903.08689|class=cs.LG}}
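On a space small enough to enumerate, the gradient formula $(*)$ can be verified numerically without any sampling. The sketch below assumes a hypothetical five-state space and the illustrative energy $E_\theta(x)=\theta x^2$ with $\beta=1$; it compares the right-hand side of $(*)$ with a finite-difference derivative of $\log P_\theta(x)$.

```python
import math

STATES = [-2, -1, 0, 1, 2]  # hypothetical finite input space

def energy(x, theta):
    return theta * x * x  # illustrative energy E_theta(x) = theta * x**2

def d_energy_dtheta(x, theta):
    return x * x  # derivative of the energy with respect to theta

def log_prob(x, theta):
    log_z = math.log(sum(math.exp(-energy(s, theta)) for s in STATES))
    return -energy(x, theta) - log_z

def grad_log_prob(x, theta):
    """Right-hand side of (*): E_{x'~P}[dE(x')/dtheta] - dE(x)/dtheta."""
    z = sum(math.exp(-energy(s, theta)) for s in STATES)
    expectation = sum(math.exp(-energy(s, theta)) / z * d_energy_dtheta(s, theta)
                      for s in STATES)
    return expectation - d_energy_dtheta(x, theta)

theta, x, h = 0.7, 1, 1e-6
finite_diff = (log_prob(x, theta + h) - log_prob(x, theta - h)) / (2 * h)
analytic = grad_log_prob(x, theta)
print(finite_diff, analytic)
```

In realistic models the expectation cannot be enumerated like this, which is why it is replaced by an average over MCMC samples.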

Early energy-based models, such as the 2003 Boltzmann machine by Hinton, estimated this expectation via blocked Gibbs sampling. Newer approaches use the more efficient stochastic gradient Langevin dynamics (LD) (Grathwohl, Will, et al. "Your classifier is secretly an energy based model and you should treat it like one." arXiv preprint arXiv:1912.03263 (2019)), drawing samples using:

$x_0' \sim P_0, x_{i+1}' = x_i' - \frac{\alpha}{2}\frac{\partial E_\theta(x_i') }{\partial x_i'} +\epsilon$ ,

where $\epsilon \sim \mathcal{N}(0,\alpha)$. A replay buffer of past samples $x_i'$ is used to initialize the Langevin chains.
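The Langevin update above can be sketched in a few lines. This example assumes a hypothetical one-dimensional energy $E(x)=x^2/2$, whose Boltzmann distribution is the standard normal, and takes $P_0$ to be uniform on $[-1, 1]$; the step size and chain length are illustrative choices, and no replay buffer is used.

```python
import math
import random

random.seed(0)

def grad_energy(x):
    # Gradient of the hypothetical energy E(x) = x**2 / 2.
    return x

def langevin_sample(n_steps=200, alpha=0.1):
    x = random.uniform(-1.0, 1.0)  # x_0' ~ P_0, here uniform on [-1, 1]
    for _ in range(n_steps):
        noise = random.gauss(0.0, math.sqrt(alpha))  # epsilon ~ N(0, alpha)
        x = x - 0.5 * alpha * grad_energy(x) + noise
    return x

# Run independent chains and compare their statistics with N(0, 1).
samples = [langevin_sample() for _ in range(2000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f}, variance={variance:.3f}")
```

The sample mean and variance should be close to 0 and 1. Because the step size is finite, the chain targets the distribution only approximately.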

The parameters $\theta$ of the neural network are therefore trained in a generative manner via MCMC-based maximum likelihood estimation.{{Cite book|last1=Barbu|first1=Adrian|title=Monte Carlo Methods|last2=Zhu|first2=Song-Chun|publisher=Springer|year=2020}}

The learning process follows an "analysis by synthesis" scheme: within each learning iteration, the algorithm samples synthesized examples from the current model by a gradient-based MCMC method (e.g., Langevin dynamics or Hybrid Monte Carlo), and then updates the parameters $\theta$ based on the difference between the training examples and the synthesized ones (see equation $(*)$). This process can be interpreted as an alternating mode-seeking and mode-shifting process, and also has an adversarial interpretation.{{Cite journal|last1=Wu|first1=Ying Nian|last2=Xie|first2=Jianwen|last3=Lu|first3=Yang|last4=Zhu|first4=Song-Chun|date=2018|title=Sparse and deep generalizations of the FRAME model|journal=Annals of Mathematical Sciences and Applications|volume=3|issue=1|pages=211–254|doi=10.4310/amsa.2018.v3.n1.a7|issn=2380-288X}}
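The analysis-by-synthesis loop can be sketched on a one-parameter model. The example assumes a hypothetical energy $E_\theta(x)=(x-\theta)^2/2$, synthetic training data drawn from a normal distribution with mean 3, and persistent Langevin chains carried over between iterations. All hyperparameters are illustrative, and a neural-network energy would replace the hand-written derivatives with automatic differentiation.

```python
import math
import random

random.seed(0)

def d_energy_dtheta(x, theta):
    # E_theta(x) = (x - theta)**2 / 2, so dE/dtheta = theta - x.
    return theta - x

def d_energy_dx(x, theta):
    return x - theta

def langevin_steps(x, theta, alpha=0.1, n_steps=10):
    """Synthesis step: a few Langevin updates under the current model."""
    for _ in range(n_steps):
        x = x - 0.5 * alpha * d_energy_dx(x, theta) + random.gauss(0.0, math.sqrt(alpha))
    return x

# Synthetic training data (an assumption of this sketch).
data = [random.gauss(3.0, 1.0) for _ in range(500)]
data_mean = sum(data) / len(data)

theta = 0.0
chains = [random.uniform(-1.0, 1.0) for _ in range(100)]  # persistent chains
learning_rate = 0.1

for _ in range(300):
    # Synthesis: move the chains toward the current model distribution.
    chains = [langevin_steps(x, theta) for x in chains]
    # Analysis: Monte Carlo estimate of (*), averaged over the training data.
    model_term = sum(d_energy_dtheta(x, theta) for x in chains) / len(chains)
    data_term = sum(d_energy_dtheta(x, theta) for x in data) / len(data)
    theta += learning_rate * (model_term - data_term)  # gradient ascent

print(f"theta={theta:.3f}, data mean={data_mean:.3f}")
```

For this energy the update reduces to the difference between the mean of the training data and the mean of the synthesized samples, so $\theta$ should settle near the data mean.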

Essentially, the model learns a function $E_\theta$ that assigns low energies to correct values and higher energies to incorrect ones.

After training, given a converged energy model $E_\theta$, the Metropolis–Hastings algorithm can be used to draw new samples. For a symmetric proposal distribution, the acceptance probability of a move from $x_i$ to a candidate $x^*$ is:

$P_{acc}(x_i \to x^*)=\min\left(1, \frac{P_\theta(x^*)}{P_\theta(x_i)}\right).$
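Since $P_\theta(x^*)/P_\theta(x_i)=\exp(E_\theta(x_i)-E_\theta(x^*))$ for $\beta=1$, the partition function cancels and the acceptance probability needs only energy differences. The following is a minimal sketch, assuming a symmetric Gaussian random-walk proposal and using the hypothetical energy $E(x)=x^2/2$ as a stand-in for a trained model.

```python
import math
import random

random.seed(0)

def energy(x):
    # Stand-in for a trained energy model (hypothetical): E(x) = x**2 / 2.
    return 0.5 * x * x

def metropolis(energy_fn, x0=0.0, n_steps=50000, step=1.0):
    x, e_x = x0, energy_fn(x0)
    samples, accepted = [], 0
    for _ in range(n_steps):
        candidate = x + random.gauss(0.0, step)  # symmetric proposal
        e_c = energy_fn(candidate)
        # min(1, P(x*)/P(x_i)) = exp(min(0, E(x_i) - E(x*))); Z cancels.
        p_acc = math.exp(min(0.0, e_x - e_c))
        if random.random() < p_acc:
            x, e_x = candidate, e_c
            accepted += 1
        samples.append(x)  # a rejected move repeats the current state
    return samples, accepted / n_steps

samples, acceptance_rate = metropolis(energy)
kept = samples[1000:]  # discard an initial burn-in period
mean = sum(kept) / len(kept)
variance = sum((s - mean) ** 2 for s in kept) / len(kept)
print(f"acceptance={acceptance_rate:.2f}, mean={mean:.3f}, variance={variance:.3f}")
```

With an asymmetric proposal, the ratio would also have to include the proposal densities in both directions.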

History

The term "energy-based models" was coined in a 2003 JMLR paper{{Cite journal|url=https://www.jmlr.org/papers/v4/teh03a.html|title=Energy-Based Models for Sparse Overcomplete Representations|last1=Teh|first1=Yee Whye|last2=Welling|first2=Max|last3=Osindero|first3=Simon|last4=Hinton|first4=Geoffrey E.|journal=JMLR|date=December 2003|volume=4 |issue=Dec |pages=1235–1260 }} in which the authors defined a generalisation of independent components analysis to the overcomplete setting using EBMs.

Other early work on EBMs proposed models that represented energy as a composition of latent and observable variables.

Characteristics

EBMs demonstrate useful properties:

Simplicity and stability–The EBM is the only object that needs to be designed and trained. Separate networks need not be trained to ensure balance.
Adaptive computation time–An EBM can generate sharp, diverse samples or (more quickly) coarse, less diverse samples. Given infinite time, this procedure produces true samples.
Flexibility–In Variational Autoencoders (VAE) and flow-based models, the generator learns a map from a continuous space to a (possibly) discontinuous space containing different data modes. EBMs can learn to assign low energies to disjoint regions (multiple modes).
Adaptive generation–EBM generators are implicitly defined by the probability distribution, and automatically adapt as the distribution changes (without training), allowing EBMs to address domains where generator training is impractical, as well as minimizing mode collapse and avoiding spurious modes from out-of-distribution samples.
Compositionality–Individual models are unnormalized probability distributions, allowing models to be combined through product of experts or other hierarchical techniques.
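The compositionality property listed above rests on a simple identity: multiplying Boltzmann distributions and renormalizing is the same as adding their energies. The sketch below checks this on a hypothetical five-state space with two illustrative quadratic energies; the names are assumptions made for the example.

```python
import math

STATES = range(5)  # hypothetical finite input space {0, 1, 2, 3, 4}

def boltzmann(energy_fn):
    """Normalized Boltzmann distribution over STATES (beta = 1)."""
    weights = [math.exp(-energy_fn(x)) for x in STATES]
    z = sum(weights)
    return [w / z for w in weights]

def energy_a(x):
    return 0.5 * (x - 1) ** 2  # expert A prefers x = 1

def energy_b(x):
    return 0.5 * (x - 3) ** 2  # expert B prefers x = 3

# Product of experts: multiply the two distributions, then renormalize.
p_a, p_b = boltzmann(energy_a), boltzmann(energy_b)
unnormalized = [a * b for a, b in zip(p_a, p_b)]
norm = sum(unnormalized)
product_of_experts = [p / norm for p in unnormalized]

# Same distribution, obtained by summing the energies directly.
combined = boltzmann(lambda x: energy_a(x) + energy_b(x))
print(product_of_experts)
print(combined)
```

The individual partition functions drop out in the renormalization, which is why the experts themselves never need to be normalized.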

Experimental results

On image datasets such as CIFAR-10 and ImageNet 32x32, an EBM generated high-quality images relatively quickly. It supported combining features learned from one type of image to generate other types of images. It generalized to out-of-distribution datasets, outperforming flow-based and autoregressive models. The EBM was also relatively resistant to adversarial perturbations, behaving better than classification models explicitly trained against them.

Applications

Target applications include natural language processing, robotics and computer vision.

The first energy-based generative neural network was the generative ConvNet, proposed in 2016 for image patterns, in which the neural network is a convolutional neural network.{{Cite journal|last1=Lecun|first1=Y.|last2=Bottou|first2=L.|last3=Bengio|first3=Y.|last4=Haffner|first4=P.|date=1998|title=Gradient-based learning applied to document recognition|journal=Proceedings of the IEEE|volume=86|issue=11|pages=2278–2324|doi=10.1109/5.726791|s2cid=14542261 |issn=0018-9219}}{{Cite journal|last1=Krizhevsky|first1=Alex|last2=Sutskever|first2=Ilya|last3=Hinton|first3=Geoffrey|year=2012|title=ImageNet classification with deep convolutional neural networks|url=https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf|journal=NIPS}} The model has been generalized to other domains to learn distributions of videos{{Cite book|last1=Xie|first1=Jianwen|last2=Zhu|first2=Song-Chun|last3=Wu|first3=Ying Nian|title=2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet |date=July 2017|pages=1061–1069|publisher=IEEE|doi=10.1109/cvpr.2017.119|isbn=978-1-5386-0457-1|arxiv=1606.00972|s2cid=763074 }} and 3D voxels.{{Cite book|last1=Xie|first1=Jianwen|last2=Zheng|first2=Zilong|last3=Gao|first3=Ruiqi|last4=Wang|first4=Wenguan|last5=Zhu|first5=Song-Chun|last6=Wu|first6=Ying Nian|title=2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition |chapter=Learning Descriptor Networks for 3D Shape Synthesis and Analysis |date=June 2018|pages=8629–8638|publisher=IEEE|doi=10.1109/cvpr.2018.00900|arxiv=1804.00586|bibcode=2018arXiv180400586X|isbn=978-1-5386-6420-9|s2cid=4564025 }} Later variants have made these models more effective.{{Cite book|last1=Gao|first1=Ruiqi|last2=Lu|first2=Yang|last3=Zhou|first3=Junpei|last4=Zhu|first4=Song-Chun|last5=Wu|first5=Ying Nian|title=2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition |chapter=Learning Generative ConvNets via Multi-grid Modeling and Sampling |date=June 2018|pages=9155–9164|publisher=IEEE|doi=10.1109/cvpr.2018.00954|isbn=978-1-5386-6420-9|arxiv=1709.08868|s2cid=4566195 }}{{Cite book|last1=Nijkamp|first1=Erik|title=On Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model|last2=Hill|first2=Mitch|last3=Zhu|first3=Song-Chun|last4=Wu|first4=Ying Nian|year=2019|location=NeurIPS|oclc=1106340764}}{{Cite journal|last1=Cai|first1=Xu|last2=Wu|first2=Yang|last3=Li|first3=Guanbin|last4=Chen|first4=Ziliang|last5=Lin|first5=Liang|date=2019-07-17|title=FRAME Revisited: An Interpretation View Based on Particle Evolution|journal=Proceedings of the AAAI Conference on Artificial Intelligence|volume=33|pages=3256–3263|doi=10.1609/aaai.v33i01.33013256|issn=2374-3468|doi-access=free|arxiv=1812.01186}}{{Cite journal|last1=Xie|first1=Jianwen|last2=Lu|first2=Yang|last3=Gao|first3=Ruiqi|last4=Zhu|first4=Song-Chun|last5=Wu|first5=Ying Nian|date=2020-01-01|title=Cooperative Training of Descriptor and Generator Networks|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=42|issue=1|pages=27–45|doi=10.1109/tpami.2018.2879081|pmid=30387724|issn=0162-8828|arxiv=1609.09408|s2cid=7759006 }}{{Cite journal|last1=Xie|first1=Jianwen|last2=Lu|first2=Yang|last3=Gao|first3=Ruiqi|last4=Gao|first4=Song-Chun|year=2018|title=Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching|url=https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17432/16737|journal=Thirty-Second AAAI Conference on Artificial Intelligence|volume=32 |doi=10.1609/aaai.v32i1.11834 |s2cid=9212174 |doi-access=free}}{{Cite book|last1=Han|first1=Tian|last2=Nijkamp|first2=Erik|last3=Fang|first3=Xiaolin|last4=Hill|first4=Mitch|last5=Zhu|first5=Song-Chun|last6=Wu|first6=Ying Nian|title=2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |chapter=Divergence Triangle for Joint Training of Generator Model, Energy-Based Model, and Inferential Model |date=June 2019|pages=8662–8671|publisher=IEEE|doi=10.1109/cvpr.2019.00887|isbn=978-1-7281-3293-8|s2cid=57189202 }} They have proven useful for data generation (e.g., image synthesis, video synthesis, 3D shape synthesis), data recovery (e.g., recovering videos with missing pixels or image frames, 3D super-resolution), and data reconstruction (e.g., image reconstruction and linear interpolation).

Alternatives

EBMs compete with techniques such as variational autoencoders (VAEs), generative adversarial networks (GANs) or normalizing flows.

Extensions

Joint energy-based models


Joint energy-based models (JEM), proposed in 2020 by Grathwohl et al., allow any classifier with softmax output to be interpreted as an energy-based model. The key observation is that such a classifier is trained to predict the conditional probability

: $p_\theta(y | x)=\frac{e^{\vec{f}_\theta(x)[y]}}{\sum_{j=1}^K e^{\vec{f}_\theta(x)[j]}} \ \ \text{ for } y = 1, \dotsc, K \text{ and } \vec{f}_\theta = (f_1, \dotsc, f_K) \in \mathbb{R}^K,$

where $\vec{f}_\theta(x)[y]$ is the $y$-th component of the logits $\vec{f}_\theta(x)$, corresponding to class $y$.

The proposal is to reinterpret the same logits, without any modification, as describing a joint probability density:

: $p_\theta(y,x)=\frac{e^{\vec{f}_\theta(x)[y]}}{Z(\theta)},$

with unknown partition function $Z(\theta)$ and energy $E_\theta (x, y)=-\vec{f}_\theta(x)[y]$.

Marginalizing over $y$ gives the density

: $p_\theta(x)=\sum_y p_\theta(y,x)= \frac{\sum_y e^{\vec{f}_\theta(x)[y]}}{Z(\theta)}=:\frac{\exp(-E_\theta(x))}{Z(\theta)},$

therefore,

: $E_\theta(x)=-\log\left(\sum_y e^{\vec{f}_\theta(x)[y]}\right),$

so that any classifier can be used to define an energy function $E_\theta(x)$.
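The reinterpretation can be illustrated with plain arithmetic on a single vector of logits. The sketch below assumes hypothetical logit values, uses zero-based class indices, and computes energies up to the $x$-independent constant $\log Z(\theta)$, which never needs to be evaluated. It checks that the ratio $p_\theta(y,x)/p_\theta(x)$ recovers the softmax output of the classifier.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(logits):
    lse = logsumexp(logits)
    return [math.exp(v - lse) for v in logits]

def joint_energy(logits, y):
    # E_theta(x, y) = -f_theta(x)[y]
    return -logits[y]

def marginal_energy(logits):
    # E_theta(x) = -log sum_y exp(f_theta(x)[y]), up to the constant log Z(theta)
    return -logsumexp(logits)

logits = [2.0, -1.0, 0.5]  # hypothetical classifier output f_theta(x), K = 3
e_x = marginal_energy(logits)

# p(y | x) = p(y, x) / p(x) = exp(-E(x, y) + E(x)); Z(theta) cancels.
conditional = [math.exp(-joint_energy(logits, y) + e_x)
               for y in range(len(logits))]
print(conditional)
print(softmax(logits))
```

Lower marginal energy corresponds to larger logits overall, so the same network that classifies $x$ also scores how likely $x$ is under the model.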

Literature

Implicit Generation and Generalization in Energy-Based Models, Yilun Du, Igor Mordatch, https://arxiv.org/abs/1903.08689
Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One, Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, Kevin Swersky, https://arxiv.org/abs/1912.03263

External links

{{Cite web|url=http://www.cs.toronto.edu/~vnair/ciar/|title=CIAR NCAP Summer School|website=www.cs.toronto.edu|access-date=2019-12-27}}
{{Citation|last1=Dayan|first1=Peter|title=Helmholtz Machine|date=1999|work=Unsupervised Learning|publisher=The MIT Press|isbn=978-0-262-28803-3|last2=Hinton|first2=Geoffrey|last3=Neal|first3=Radford|last4=Zemel|first4=Richard S.|doi=10.7551/mitpress/7011.003.0017}}
{{Cite journal|last=Hinton|first=Geoffrey E.|date=August 2002|title=Training Products of Experts by Minimizing Contrastive Divergence|journal=Neural Computation|volume=14|issue=8|pages=1771–1800|doi=10.1162/089976602760128018|pmid=12180402|s2cid=207596505|issn=0899-7667}}
{{Cite journal|last1=Salakhutdinov|first1=Ruslan|last2=Hinton|first2=Geoffrey|date=2009-04-15|title=Deep Boltzmann Machines|url=http://proceedings.mlr.press/v5/salakhutdinov09a.html|journal=Artificial Intelligence and Statistics|language=en|pages=448–455}}
