Inception score
The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN).[1] The score is calculated based on the output of a separate, pretrained Inceptionv3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true:
- The entropy of the label distribution predicted by the Inceptionv3 model for each individual generated image is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum that generated images be "sharp" or "distinct".
- Averaged over all generated images, the predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse".[2]
It has been somewhat superseded by the related Fréchet inception distance (FID).[3] While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").
Definition
Let there be two spaces, the space of images [math]\displaystyle{ \Omega_X }[/math] and the space of labels [math]\displaystyle{ \Omega_Y }[/math]. The space of labels is finite.
Let [math]\displaystyle{ p_{gen} }[/math] be a probability distribution over [math]\displaystyle{ \Omega_X }[/math] that we wish to judge.
Let a discriminator be a function of type [math]\displaystyle{ p_{dis}:\Omega_X \to M(\Omega_Y) }[/math], where [math]\displaystyle{ M(\Omega_Y) }[/math] is the set of all probability distributions on [math]\displaystyle{ \Omega_Y }[/math]. For any image [math]\displaystyle{ x }[/math] and any label [math]\displaystyle{ y }[/math], let [math]\displaystyle{ p_{dis}(y|x) }[/math] be the probability that image [math]\displaystyle{ x }[/math] has label [math]\displaystyle{ y }[/math], according to the discriminator. The discriminator is usually implemented as an Inception-v3 network trained on ImageNet.
The Inception Score of [math]\displaystyle{ p_{gen} }[/math] relative to [math]\displaystyle{ p_{dis} }[/math] is
[math]\displaystyle{ IS(p_{gen}, p_{dis}) := \exp\left( \mathbb E_{x\sim p_{gen}}\left[ D_{KL} \left(p_{dis}(\cdot | x) \| \int p_{dis}(\cdot | x') p_{gen}(x')dx' \right) \right]\right) }[/math]
Equivalent rewrites include
[math]\displaystyle{ \ln IS(p_{gen}, p_{dis}) = \mathbb E_{x\sim p_{gen}}\left[ D_{KL} \left(p_{dis}(\cdot | x) \| \mathbb E_{x'\sim p_{gen}}[p_{dis}(\cdot | x')]\right) \right] }[/math]
[math]\displaystyle{ \ln IS(p_{gen}, p_{dis}) = H[\mathbb E_{x\sim p_{gen}}[p_{dis}(\cdot | x)]] - \mathbb E_{x\sim p_{gen}}[ H[p_{dis}(\cdot | x)]] }[/math]
[math]\displaystyle{ \ln IS }[/math] is nonnegative by Jensen's inequality, since entropy is concave; equivalently, it is nonnegative because each KL divergence is nonnegative.
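The equivalence of the two log-forms can be verified numerically. The following is a minimal sketch, assuming a synthetic matrix of conditional label probabilities (drawn from a Dirichlet distribution) in place of a real Inception-v3 classifier; it computes [math]\displaystyle{ \ln IS }[/math] both as a mean KL divergence and as a difference of entropies.

```python
import numpy as np

rng = np.random.default_rng(0)

# p[i, y] = p_dis(y | x_i) for a hypothetical sample of 1000 images and 10 labels.
p = rng.dirichlet(alpha=np.ones(10), size=1000)

# Marginal label distribution E_{x ~ p_gen}[p_dis(. | x)], estimated by averaging over images.
p_hat = p.mean(axis=0)

# KL form: mean over images of D_KL(p_dis(. | x) || p_hat).
kl_form = np.mean(np.sum(p * (np.log(p) - np.log(p_hat)), axis=1))

# Entropy form: H[p_hat] minus the mean over images of H[p_dis(. | x)].
entropy = lambda q: -np.sum(q * np.log(q), axis=-1)
entropy_form = entropy(p_hat) - np.mean(entropy(p))

print(kl_form, entropy_form)  # the two values agree up to floating-point error
```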
Pseudocode:
INPUT discriminator [math]\displaystyle{ p_{dis} }[/math].
INPUT generator [math]\displaystyle{ g }[/math].
Sample images [math]\displaystyle{ x_i }[/math] from the generator.
Compute [math]\displaystyle{ p_{dis}(\cdot |x_i) }[/math], the probability distribution over labels conditional on image [math]\displaystyle{ x_i }[/math].
Average the results to obtain [math]\displaystyle{ \hat p }[/math], an empirical estimate of [math]\displaystyle{ \int p_{dis}(\cdot | x) p_{gen}(x)dx }[/math].
Sample more images [math]\displaystyle{ x_i }[/math] from the generator, and for each, compute [math]\displaystyle{ D_{KL} \left(p_{dis}(\cdot | x_i) \| \hat p\right) }[/math].
Average these KL divergences, and take the exponential of the result.
RETURN the result.
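The pseudocode translates directly into code. Below is a minimal sketch, assuming two hypothetical callables: generate(n) stands in for sampling n images from the generator, and classify(images) stands in for the pretrained Inception-v3 discriminator, returning one row of label probabilities per image. Neither name refers to a specific library API, and the default sample sizes are placeholders.

```python
import numpy as np

def inception_score(generate, classify, n_marginal=5000, n_eval=5000, eps=1e-12):
    """Estimate IS(p_gen, p_dis) following the pseudocode above.

    generate(n)      -> array of n sampled images (assumed to exist)
    classify(images) -> array of shape (n, N) of label probabilities (assumed to exist)
    """
    # Steps 1-3: classify a sample of generated images and average the
    # conditional distributions to estimate the marginal label distribution p_hat.
    probs = classify(generate(n_marginal))
    p_hat = probs.mean(axis=0)

    # Step 4: sample more images and compute D_KL(p_dis(.|x_i) || p_hat) for each.
    probs = classify(generate(n_eval))
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)

    # Step 5: average the KL divergences and exponentiate.
    return float(np.exp(kl.mean()))
```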
Interpretation
A higher Inception Score is interpreted as "better", as it means that [math]\displaystyle{ p_{gen} }[/math] produces a collection of images that is both "sharp" (each image is confidently classified) and "diverse" (the predicted labels are spread over many classes).
[math]\displaystyle{ \ln IS(p_{gen}, p_{dis}) \in [0, \ln N] }[/math], where [math]\displaystyle{ N }[/math] is the total number of possible labels; equivalently, [math]\displaystyle{ IS(p_{gen}, p_{dis}) \in [1, N] }[/math].
[math]\displaystyle{ \ln IS(p_{gen}, p_{dis}) = 0 }[/math] iff, for almost all [math]\displaystyle{ x\sim p_{gen} }[/math], [math]\displaystyle{ p_{dis}(\cdot | x) = \int p_{dis}(\cdot | x') p_{gen}(x')dx' }[/math]. That means [math]\displaystyle{ p_{gen} }[/math] is completely "indistinct": for any image [math]\displaystyle{ x }[/math] sampled from [math]\displaystyle{ p_{gen} }[/math], the discriminator returns exactly the same label distribution [math]\displaystyle{ p_{dis}(\cdot | x) }[/math].
The highest Inception Score [math]\displaystyle{ N }[/math] is achieved if and only if the following two conditions both hold (both extremes are illustrated numerically after the list):
- For almost all [math]\displaystyle{ x\sim p_{gen} }[/math], the distribution [math]\displaystyle{ p_{dis}(y|x) }[/math] is concentrated on one label, so that [math]\displaystyle{ H_y[p_{dis}(y|x)] = 0 }[/math]. In other words, the discriminator classifies every image sampled from [math]\displaystyle{ p_{gen} }[/math] with complete confidence.
- For every label [math]\displaystyle{ y }[/math], the proportion of generated images labelled as [math]\displaystyle{ y }[/math] is exactly [math]\displaystyle{ \mathbb E_{x\sim p_{gen}}[p_{dis}(y | x)] = \frac 1 N }[/math]. That is, the generated images are equally distributed over all labels.
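Both extremes can be reproduced with small synthetic examples. The sketch below uses hypothetical conditional label distributions over [math]\displaystyle{ N = 4 }[/math] labels (no real generator or classifier is involved): a completely indistinct collection yields [math]\displaystyle{ IS = 1 }[/math], while sharp and evenly spread predictions yield [math]\displaystyle{ IS = N }[/math].

```python
import numpy as np

def inception_score_from_probs(probs, eps=1e-12):
    # probs[i, y] = p_dis(y | x_i); returns exp of the mean KL to the marginal.
    p_hat = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)
    return float(np.exp(kl.mean()))

N = 4

# Indistinct: every image gets the same (here uniform) label distribution -> IS = 1.
indistinct = np.full((100, N), 1.0 / N)
print(inception_score_from_probs(indistinct))    # ~1.0

# Sharp and diverse: each image classified with certainty, labels used equally -> IS = N.
sharp_diverse = np.eye(N)[np.arange(100) % N]
print(inception_score_from_probs(sharp_diverse))  # ~4.0
```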
References
- ↑ Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 29. https://proceedings.neurips.cc/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html.
- ↑ Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks 144: 187–209. doi:10.1016/j.neunet.2021.07.019. PMID 34500257.
- ↑ Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding 215: 103329. doi:10.1016/j.cviu.2021.103329. https://linkinghub.elsevier.com/retrieve/pii/S1077314221001685.