Generative Adversarial Networks GANs

Generative Adversarial Networks

After Deep Autoregressive Models, Deep Generative Modelling and Variational Autoencoders we now continue the discussion with Generative Adversarial Networks (GANs).


So far, in the series of deep generative modellings (DGMs [Yad22a]), we have covered autoregressive modelling, which estimates the exact log likelihood defined by the model and variational autoencoders, which was variational approximations for lower bound optimization. Both of these modelling techniques were explicitly defining density functions and optimizing the likelihood of the training data. However, in this blog, we are going to discuss generative adversarial networks (GANs), which are likelihood-free models and do not define density functions explicitly. GANs follow a game-theoretic approach and learn to generate from the training distribution through a set up of a two-player game.

A two player model of GAN along with the generator and discriminators.

A two player model of GAN along with the generator and discriminators.

GAN tries to learn the distribution of high dimensional training data and generates high-quality synthetic data which has a similar distribution to training data. However, learning the training distribution is a highly complex task therefore GAN utilizes a two-player game approach to overcome the high dimensional complexity problem. GAN has two different neural networks (as shown in Figure ??) the generator and the discriminator. The generator takes a random input z\sim p(z) and produces a sample that has a similar distribution as p_d. To train this network efficiently, there is the other network that is utilized as the second player and known as the discriminator. The generator network (player one) tries to fool the discriminator by generating real looking images. Moreover, the discriminator network tries to distinguish between real (training data x\sim p_d(x)) and fake images effectively. Our main aim is to have an efficiently trained discriminator to be able to distinguish between real and fake images (the generator’s output) and on the other hand, we would like to have a generator, which can easily fool the discriminator by generating real-looking images.

Objective function and training

Objective function

Simultaneous training of these two networks is one of the main challenges in GANs and a minimax loss function is defined for this purpose. To understand this minimax function, firstly, we would like to discuss the concept of two sample testing by Aditya grover [Gro20]. Two sample testing is a method to compute the discrepancy between the training data distribution and the generated data distribution:

(1)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{x\sim p_{\theta_g}} [D_{\theta_d}(G_{\theta_g}(x))], \end{equation*}

where p_{\theta_g} and p_d are the distribution functions of generated and training data respectively. The term F is a set of functions. The \textit{max} part is computing the discrepancies between two distribution using a function D_{\theta_d} \in F and this part is very similar to the term d (discrepancy measure) from our first article (Deep Generative Modelling) and KL-divergence is applied to compute this measure in second article (Deep Autoregressive Models) and third articles (Variational Autoencoders). However, in GANs, for a given set of functions F, we would like compute the distribution p_{\theta_g}, which minimizes the overall discrepancy even for a worse function D_{\theta_d}\in F. The above mentioned objective function does not use any likelihood function and utilizing two different data samples from training and generated data respectively.

By combining Figure ?? and Equation 1, the first term \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] corresponds to the discriminator, which has direct access to the training data and the second term \mathbb{E}_{x\sim p_{\theta_g}}[D_{\theta_d}(G_{\theta_g}(x))] represents the generator part as it relies only on the latent space and produces synthetic data. Therefore, Equation 1 can be rewritten in the form of GAN’s two players as:

(2)   \begin{equation*} \min_{p_{\theta_g}}\: \max_{D_{\theta_d}\in F} \: \mathbb{E}_{x\sim p_d}[D_{\theta_d}(x)] - \mathbb{E}_{z\sim p_z}[D_{\theta_d}(G_{\theta_g}(z))], \end{equation*}

The above equation can be rearranged in the form of log loss:

(3)   \begin{equation*} \min_{\theta_g}\: \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d} (x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]), \end{equation*}

In the above equation, the arguments are modified from p_{\theta_g} and D_{\theta_d} in F to \theta_g and  \theta_d respectively as we would like to approximate the network parameters, which are represented by \theta_g and \theta_d for the both generator and discriminator respectively. The discriminator wants to maximize the above objective for \theta_d such that D_{\theta_d}(x) \approx 1, which indicates that the outcome is close to the real data. Furthermore, D_{\theta_d}(G_{\theta_g}(z)) should be close to zero as it is fake data, therefore, the maximization of the above objective function for \theta_d will ensure that the discriminator is performing efficiently in terms of separating real and fake data. From the generator point of view, we would like to minimize this objective function for \theta_g such that D_{\theta_d}(G_{\theta_g}(z)) \approx 1. If the minimization of the objective function happens effectively for \theta_g then the discriminator will classify a fake data into a real data that means that the generator is producing almost real-looking samples.


The training procedure of GAN can be explained by using the following visualization from Goodfellow et al. [GPAM+14]. In Figure 2(a), z is a random input vector to the generator to produce a synthetic outcome x\sim p_{\theta_g} (green curve). The generated data distribution is not close to the original data distribution p_d (dotted black curve). Therefore, the discriminator classifies this image as a fake image and forces generator to learn the training data distribution (Figure 2(b) and (c)). Finally, the generator produces the image which could not detected as a fake data by discriminator(Figure 2(d)).

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow et al.

GAN’s training visualization: the dotted black, solid green lines represents pd and pθ
respectively. The discriminator distribution is shown in dotted blue. This image taken from Goodfellow
et al. [GPAM+14].

The optimization of the objective function mentioned in Equation 3 is performed in th following two steps repeatedly:
\item Firstly, the gradient ascent is utilized to maximize the objective function for \theta_d for discriminator.

(4)   \begin{equation*} \max_{\theta_d} \: (\mathbb{E}_{x\sim p_d} [log \: D_{\theta_d}(x)] + \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}

\item In the second step, the following function is minimized for the generator using gradient descent.

(5)   \begin{equation*} \min_{\theta_g} \: ( \mathbb{E}_{z\sim p_z}[log(1 - D_{\theta_d}(G_{\theta_g}(z))]) \end{equation*}


However, in practice the minimization for the generator does now work well because when D_{\theta_d}(G_{\theta_g}(z) \approx 1 then the term log \: (1-D_{\theta_d}(G_{\theta_g}(z))) has the dominant gradient and vice versa.

However, we would like to have the gradient behaviour completely opposite because D_{\theta_d}(G_{\theta_g}(z) \approx 1 means the generator is well trained and does not require dominant gradient values. However, in case of D_{\theta_d}(G_{\theta_g}(z) \approx 0, the generator is not well trained and producing low quality outputs therefore, it requires a dominant gradient for an efficient training. To fix this problem, the gradient ascent method is applied to maximize the modified generator’s objective:
In the second step, the following function is minimized for the generator using gradient descent alternatively.

(6)   \begin{equation*} \max_{\theta_g} \: \mathbb{E}_{z\sim p_z}[log \: (D_{\theta_d}(G_{\theta_g}(z))] \end{equation*}

therefore, during the training, Equation 4 and 6 will be maximized using the gradient ascent algorithm until the convergence.


The quality of the generated images using GANs depends on several factors. Firstly, the joint training of GANs is not a stable procedure and that could severely decrease the quality of the outcome. Furthermore, the different neural network architecture will modify the quality of images based on the sophistication of the used network. For example, the vanilla GAN [GPAM+14] uses a fully connected deep neural network and generates a quite decent result. Furthermore, DCGAN [RMC15] utilized deep convolutional networks and enhanced the quality of outcome significantly. Furthermore, different types of loss functions are applied to stabilize the training procedure of GAN and to produce high-quality outcomes. As shown in Figure 3, StyleGAN [KLA19] utilized Wasserstein metric [Yad22b] to generate high-resolution face images. As it can be seen from Figure 3, the quality of the generated images are enhancing with time by applying more sophisticated training techniques and network architectures.

GAN timeline with different variations in terms of network architecture and loss functions.

GAN timeline with different variations in terms of network architecture and loss functions.


This article covered the basics and mathematical concepts of GANs. However, the training of two different networks simultaneously could be complex and unstable. Therefore, researchers are continuously working to create a better and more stable version of GANs, for example, WGAN. Furthermore, different types of network architectures are introduced to improve the quality of outcomes. We will discuss this further in the upcoming blog about these variations.


[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, DavidWarde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in
neural information processing systems, 27, 2014.

[Gro20] Aditya Grover. Generative adversarial networks., 2020.

[KLA19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for
generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 4401–4410, 2019.

[RMC15] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation
learning with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.

[Yad22a] Sunil Yadav. Deep generative modelling. https://data-scienceblog.
com/blog/2022/02/19/deep-generative-modelling/, 2022.

[Yad22b] Sunil Yadav. Necessary probability concepts for deep learning: Part 2.
deep-learning-995560752a53, 2022.

Deep Autoregressive Models

Deep Autoregressive Models

In this blog article, we will discuss about deep autoregressive generative models (AGM). Autoregressive models were originated from economics and social science literature on time-series data where obser- vations from the previous steps are used to predict the value at the current and at future time steps [SS05]. Autoregression models can be expressed as:

    \begin{equation*} x_{t+1}= \sum_i^t \alpha_i x_{t-i} + c_i, \end{equation*}

where the terms \alpha and c are constants to define the contributions of previous samples x_i for the future value prediction. In the other words, autoregressive deep generative models are directed and fully observed models where outcome of the data completely depends on the previous data points as shown in Figure 1.

Autoregressive directed graph.

Figure 1: Autoregressive directed graph.

Let’s consider x \sim X, where X is a set of images and each images is n-dimensional (n pixels). Then the prediction of new data pixel will be depending all the previously predicted pixels (Figure ?? shows the one row of pixels from an image). Referring to our last blog, deep generative models (DGMs) aim to learn the data distribution p_\theta(x) of the given training data and by following the chain rule of the probability, we can express it as:

(1)   \begin{equation*} p_\theta(x) = \prod_{i=1}^n p_\theta(x_i | x_1, x_2, \dots , x_{i-1}) \end{equation*}

The above equation modeling the data distribution explicitly based on the pixel conditionals, which are tractable (exact likelihood estimation). The right hand side of the above equation is a complex distribution and can be represented by any possible distribution of n random variables. On the other hand, these kind of representation can have exponential space complexity. Therefore, in autoregressive generative models (AGM), these conditionals are approximated/parameterized by neural networks.


As AGMs are based on tractable likelihood estimation, during the training process these methods maximize the likelihood of images over the given training data X and it can be expressed as:

(2)   \begin{equation*} \max_{\theta} \sum_{x\sim X} log \: p_\theta (x) = \max_{\theta} \sum_{x\sim X} \sum_{i=1}^n log \: p_\theta (x_i | x_1, x_2, \dots, x_{i-1}) \end{equation*}

The above expression is appearing because of the fact that DGMs try to minimize the distance between the distribution of the training data and the distribution of the generated data (please refer to our last blog). The distance between two distribution can be computed using KL-divergence:

(3)   \begin{equation*} \min_{\theta} d_{KL}(p_d (x),p_\theta (x)) = log\: p_d(x) - log \: p_\theta(x) \end{equation*}

In the above equation the term p_d(x) does not depend on \theta, therefore, whole equation can be shortened to Equation 2, which represents the MLE (maximum likelihood estimation) objective to learn the model parameter \theta by maximizing the log likelihood of the training images X. From implementation point of view, the MLE objective can be optimized using the variations of stochastic gradient (ADAM, RMSProp, etc.) on mini-batches.

Network Architectures

As we are discussing deep generative models, here, we would like to discuss the deep aspect of AGMs. The parameterization of the conditionals mentioned in Equation 1 can be realized by different kind of network architectures. In the literature, several network architectures are proposed to increase their receptive fields and memory, allowing more complex distributions to be learned. Here, we are mentioning a couple of well known architectures, which are widely used in deep AGMs:

  1. Fully-visible sigmoid belief network (FVSBN): FVSBN is the simplest network without any hidden units and it is a linear combination of the input elements followed by a sigmoid function to keep output between 0 and 1. The positive aspects of this network is simple design and the total number of parameters in the model is quadratic which is much smaller compared to exponential [GHCC15].
  2. Neural autoregressive density estimator (NADE): To increase the effectiveness of FVSBN, the simplest idea would be to use one hidden layer neural network instead of logistic regression. NADE is an alternate MLP-based parameterization and more effective compared to FVSBN [LM11].
  3. Masked autoencoder density distribution (MADE): Here, the standard autoencoder neural networks are modified such that it works as an efficient generative models. MADE masks the parameters to follow the autoregressive property, where the current sample is reconstructed using previous samples in a given ordering [GGML15].
  4. PixelRNN/PixelCNN: These architecture are introducced by Google Deepmind in 2016 and utilizing the sequential property of the AGMs with recurrent and convolutional neural networks.
Different autoregressive architectures

Figure 2: Different autoregressive architectures (image source from [LM11]).

Results using different architectures

Results using different architectures (images source

It uses two different RNN architectures (Unidirectional LSTM and Bidirectional LSTM) to generate pixels horizontally and horizontally-vertically respectively. Furthermore, it ulizes residual connection to speed up the convergence and masked convolution to condition the different channels of images. PixelCNN applies several convolutional layers to preserve spatial resolution and increase the receptive fields. Furthermore, masking is applied to use only the previous pixels. PixelCNN is faster in training compared to PixelRNN. However, the outcome quality is better with PixelRNN [vdOKK16].


In this blog article, we discussed about deep autoregressive models in details with the mathematical foundation. Furthermore, we discussed about the training procedure including the summary of different network architectures. We did not discuss network architectures in details, we would continue the discussion of PixelCNN and its variations in upcoming blogs.


[GGML15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for distribution estimation. CoRR, abs/1502.03509, 2015.

[GHCC15] Zhe Gan, Ricardo Henao, David Carlson, and Lawrence Carin. Learning Deep Sigmoid Belief Networks with Data Augmentation. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence
and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 268–276, San Diego, California, USA, 09–12 May 2015. PMLR.

[LM11] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 29–37, Fort Lauderdale, FL, USA, 11–13 Apr 2011.

[SS05] Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications (Springer Texts in Statistics). Springer-Verlag, Berlin, Heidelberg, 2005.

[vdOKK16] A ̈aron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. CoRR, abs/1601.06759, 2016