Implicit Generation and Generalization Methods for Energy-Based Models

8 minute read

We've made progress towards stable and scalable training of energy-based models (EBMs) resulting in better sample quality and generalization ability than existing models. Generation in EBMs spends more compute to continually refine its answers and doing so can generate samples competitive with GANs at low temperatures[1], while also having mode coverage guarantees of likelihood-based models. We hope these findings stimulate further research into this promising class of models.

Read PaperView Code + Pre-trained Models

Generative modeling is the task of observing data, such as images or text, and learning to model the underlying data distribution. Accomplishing this task leads models to understand high level features in data and synthesize examples that look like real data. Generative models have many applications in natural language, robotics, and computer vision.

Energy-based models represent probability distributions over data by assigning an unnormalized probability scalar (or “energy”) to each input data point. This provides useful modeling flexibility—any arbitrary model that outputs a real number given an input can be used as an energy model. The difficulty however, lies in sampling from these models.


Conditional ImageNet32x32 model samples.

To generate samples from EBMs, we use an iterative refinement process based on Langevin dynamics. Informally, this involves performing noisy gradient descent on the energy function to arrive at low-energy configurations (see paper for more details). Unlike GANs, VAEs, and Flow-based models, this approach does not require an explicit neural network to generate samples - samples are generated implicitly. The combination of EBMs and iterative refinement have the following benefits:

  • Adaptive computation time. We can run sequential refinement for long amount of time to generate sharp, diverse samples or a short amount of time for coarse less diverse samples. In the limit of infinite time, this procedure is known to generate true samples from the energy model.

  • Not restricted by generator network. In both VAEs and Flow based models, the generator must learn a map from a continuous space to a possibly disconnected space containing different data modes, which requires large capacity and may not be possible to learn. In EBMs, by contrast, can easily learn to assign low energies at disjoint regions.

  • Built-in compositionality. Since each model represents an unnormalized probability distribution, models can be naturally combined through product of experts or other hierarchical models.


We found energy-based models are able to generate qualitatively and quantitatively high-quality images, especially when running the refinement process for a longer period at test time. By running iterative optimization on individual images, we can auto-complete images and morph images from one class (such as truck) to another (such as frog).

images 1
images 2
images 1
images 2
Image completions on conditional ImageNet model. Our models exhibit diversity in inpainting. Note that inputs are from test distribution and are not model samples, indicating coverage of test data.
Cross-class implicit sampling on a conditional model. The model is conditioned on a particular class but is initialized with an image from a separate class.

In addition to generating images, we found that energy-based models are able to generate stable robot dynamics trajectories across large number of timesteps. EBMs can generate a diverse set of possible futures, while feedforward models collapse to a mean prediction.

Sample 1
Sample 2
T = 0
T = 20
T = 40
T = 60
T = 80
Top down views of robot hand manipulation trajectories generated unconditionally from the same starting state (1st frame). The FC network predicts a hand that does not move, while the EBM is able to generate distinctively different trajectories that are feasible.


We tested energy-based models on classifying several different out-of-distribution datasets and found that energy-based models outperform other likelihood models such as Flow based and autoregressive models. We also tested classification using conditional energy-based models, and found that the resultant classification exhibited good generalization to adversarial perturbations. Our model—despite never being trained for classification—performed classification better than models explicitly trained against adversarial perturbations.

Lessons learned

We found evidence that suggest the following observations, though in no way are we certain that these observations are correct:

  • We found it difficult to apply vanilla HMC to EBM training as optimal step sizes and leapfrog simulation numbers differ greatly during training, though applying adaptive HMC would be an interesting extension.
  • We found training ensembles of energy functions (sampling and evaluating on ensembles) to help a bit, but was not worth the added complexity.
  • We didn’t find much success adding a gradient penalty term, as it seemed to hurt model capacity and sampling.

More tips, observations and failures from this research can be found in Section A.8 of the paper.

Next steps

We found preliminary indications that we can compose multiple energy-based models via a product of experts model. We trained one model on different size shapes at a set position and another model on same size shape at different positions. By combining the resultant energy-based models, we were able to generate different size shapes at different locations, despite never seeing examples of both being changed.

Energy A
Energy B
Energy A + B
A 2D example of combining energy functions through their summation and the resulting sampling trajectories.

Compositionality is one of the unsolved challenges facing AI systems today, and we are excited about what energy-based models can do here. If you are excited to work on energy-based models please consider applying to OpenAI!


Thanks to Ilya Sutskever, Greg Brockman, Bob McGrew, Johannes Otterbach, Jacob Steinhardt, Harri Edwards, Yura Burda, Jack Clark and Ashley Pilipiszyn for feedback on this blog post and manuscript.


  1. See Equation 2 in this paper. ↩︎