Neural Style Transfer

Short description: Type of software algorithm for image manipulation
Mona Lisa in the style of "The Starry Night" using neural style transfer.
Mona Lisa in the style of "Woman with a Hat" using neural style transfer.
Mona Lisa in the style of "The Great Wave" using neural style transfer.

Neural Style Transfer (NST) refers to a class of software algorithms that manipulate digital images or videos so that they adopt the appearance or visual style of another image. NST algorithms are characterized by their use of deep neural networks for image transformation. A common use of NST is the creation of artificial artwork from photographs, for example by transferring the appearance of famous paintings to user-supplied photographs. Several notable mobile apps use NST techniques for this purpose, including DeepArt and Prisma. Artists and designers around the globe have used this method to develop new artwork based on existing styles.

Earlier style transfer algorithms

NST is an example of image stylization, a problem studied for over two decades within the field of non-photorealistic rendering. The first two example-based style transfer algorithms were Image Analogies[1] and Image Quilting.[2] Both of these methods were based on patch-based texture synthesis algorithms.

Given a training pair of images–a photo and an artwork depicting that photo–a transformation could be learned and then applied to create new artwork from a new photo, by analogy. If no training photo was available, it would need to be produced by processing the input artwork; Image Quilting did not require this processing step, though it was demonstrated on only one style.

NST

NST was first published in the paper "A Neural Algorithm of Artistic Style" by Leon Gatys et al., originally released on arXiv in 2015,[3] and subsequently accepted by the peer-reviewed Computer Vision and Pattern Recognition (CVPR) conference in 2016.[4]

NST is based on histogram-based texture synthesis algorithms, notably the method of Portilla and Simoncelli. NST can be summarized as histogram-based texture synthesis with convolutional neural network (CNN) features for the image analogies problem. The original paper used a VGG-19 architecture[5] that has been pre-trained to perform object recognition using the ImageNet dataset.

In 2017, Google AI introduced a method[6] that allows a single deep convolutional style transfer network to learn multiple styles at the same time. This algorithm permits style interpolation in real-time, even when done on video media.

Formulation

The process of NST assumes an input image [math]\displaystyle{ p }[/math] and an example style image [math]\displaystyle{ a }[/math].

The image [math]\displaystyle{ p }[/math] is fed through the CNN, and network activations are sampled at a late convolution layer of the VGG-19 architecture. Let [math]\displaystyle{ C(p) }[/math] be the resulting output sample, called the 'content' of the input [math]\displaystyle{ p }[/math].

The style image [math]\displaystyle{ a }[/math] is then fed through the same CNN, and network activations are sampled at the early to middle layers of the CNN. These activations are encoded into a Gramian matrix representation, denoted [math]\displaystyle{ S(a) }[/math], which captures the 'style' of [math]\displaystyle{ a }[/math].
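The Gramian encoding of a set of activations can be sketched in a few lines of plain Python. This is only an illustration: the function name `gram_matrix` and the toy activations are assumptions made here, and a real implementation would operate on the feature maps of a pre-trained CNN rather than on hand-written lists.

```python
def gram_matrix(features):
    """Return the Gram matrix of a list of flattened feature maps.

    features[c] holds the activations of channel c at every spatial
    position, so entry (i, j) is the inner product of channels i and j.
    """
    channels = len(features)
    return [
        [sum(u * v for u, v in zip(features[i], features[j]))
         for j in range(channels)]
        for i in range(channels)
    ]


# Hypothetical activations: 2 channels, 3 spatial positions each.
example_features = [
    [1.0, 2.0, 3.0],
    [0.0, 1.0, -1.0],
]
example_gram = gram_matrix(example_features)
```

Because each entry sums over all spatial positions, the matrix records which channels tend to respond together while discarding where in the image they respond, which is why it serves as a description of style rather than of content.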

The goal of NST is to synthesize an output image [math]\displaystyle{ x }[/math] that exhibits the content of [math]\displaystyle{ p }[/math] applied with the style of [math]\displaystyle{ a }[/math], i.e. [math]\displaystyle{ C(x)=C(p) }[/math] and [math]\displaystyle{ S(x)=S(a) }[/math].

An iterative optimization (usually gradient descent) then gradually updates [math]\displaystyle{ x }[/math] to minimize the loss function:

[math]\displaystyle{ \mathcal{L}(x) = | C(x)-C(p) | + k |S(x)-S(a)| }[/math],

where [math]\displaystyle{ |.| }[/math] is the L2 distance. The constant [math]\displaystyle{ k }[/math] controls the level of the stylization effect.
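As a rough sketch of how this loss combines its two terms, the following Python fragment evaluates it for content features and style matrices that have already been computed. The helper names (`l2_distance`, `flatten`, `nst_loss`) and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equally sized flat lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def flatten(matrix):
    """Turn a list of rows into one flat list."""
    return [value for row in matrix for value in row]


def nst_loss(content_x, content_p, style_x, style_a, k):
    """Content distance plus k times the style (Gram matrix) distance."""
    content_term = l2_distance(content_x, content_p)
    style_term = l2_distance(flatten(style_x), flatten(style_a))
    return content_term + k * style_term


# Hypothetical precomputed values for C(x), C(p), S(x) and S(a).
content_x, content_p = [3.0, 0.0], [0.0, 4.0]
style_x = [[1.0, 0.0], [0.0, 1.0]]
style_a = [[1.0, 0.0], [0.0, 3.0]]

weak_stylization = nst_loss(content_x, content_p, style_x, style_a, k=0.0)
strong_stylization = nst_loss(content_x, content_p, style_x, style_a, k=10.0)
```

With `k=0.0` only the content term contributes, while a larger `k` lets the style mismatch dominate the loss, so the optimizer is pushed harder towards matching the style image.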

Training

The image [math]\displaystyle{ x }[/math] is initialized by adding a small amount of white noise to the input image [math]\displaystyle{ p }[/math] and feeding it through the CNN. The loss is then repeatedly backpropagated through the network, with the CNN weights held fixed, in order to update the pixels of [math]\displaystyle{ x }[/math]. After several thousand iterations of this update, an [math]\displaystyle{ x }[/math] (hopefully) emerges that matches the style of [math]\displaystyle{ a }[/math] and the content of [math]\displaystyle{ p }[/math].
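The procedure can be illustrated with a deliberately tiny, dependency-free sketch. Everything here is an assumption made for illustration: the frozen CNN is replaced by two fixed one-dimensional filters (`FILTERS`), the images are six-pixel lists, the loss uses squared distances for smoother gradients, and gradients are estimated by finite differences instead of backpropagation. Only the overall structure (fixed weights, noisy initialization from the content image, gradient descent on the pixels) mirrors the description above.

```python
import random

# Hypothetical stand-in for the frozen CNN: two fixed width-2 circular filters.
FILTERS = [(1.0, -1.0), (0.5, 0.5)]


def features(image):
    """Apply each fixed filter to the 1-D image; one output list per channel."""
    n = len(image)
    return [[w0 * image[i] + w1 * image[(i + 1) % n] for i in range(n)]
            for w0, w1 in FILTERS]


def gram(feats):
    """Gram matrix of the channel activations."""
    return [[sum(u * v for u, v in zip(fi, fj)) for fj in feats] for fi in feats]


def loss(x, content_target, style_target, k):
    """Squared content distance plus k times squared Gram distance."""
    feats = features(x)
    content = sum((u - v) ** 2
                  for fx, fp in zip(feats, content_target)
                  for u, v in zip(fx, fp))
    style = sum((u - v) ** 2
                for gx, ga in zip(gram(feats), style_target)
                for u, v in zip(gx, ga))
    return content + k * style


def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f with respect to the pixels."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def stylize(p, a, k=1.0, lr=0.002, steps=500, seed=0):
    """Gradient descent on the pixels of x; the filters are never changed."""
    rng = random.Random(seed)
    content_target = features(p)
    style_target = gram(features(a))
    # Initialize x as the content image plus a small amount of noise.
    x = [v + rng.gauss(0.0, 0.01) for v in p]

    def objective(img):
        return loss(img, content_target, style_target, k)

    history = [objective(x)]
    for _ in range(steps):
        grad = numerical_gradient(objective, x)
        x = [v - lr * g for v, g in zip(x, grad)]
        history.append(objective(x))
    return x, history


p = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]   # toy "content" image
a = [0.5, 0.5, 0.6, 0.4, 0.5, 0.5]   # toy "style" image
x, history = stylize(p, a)
```

Here `history` records the loss after each update, so comparing `history[0]` with `history[-1]` shows whether the optimization made progress. In practice, automatic differentiation on a GPU replaces the finite-difference gradient, which would be far too slow for real images.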

Algorithms are typically implemented for GPUs, so that training takes a few minutes.[citation needed]

Extensions

NST has also been extended to videos.[7]

Subsequent work improved the speed of NST for images.[clarification needed]

A paper by Fei-Fei Li et al. adopted a different regularized loss metric and an accelerated method for training to produce results in real time (three orders of magnitude faster than Gatys). Their idea was to train the network not with a per-pixel loss but with a 'perceptual loss' measuring the differences between higher-level layers within the CNN. They used a symmetric encoder-decoder CNN. Training uses a loss function similar to that of the basic NST method but also regularizes the output for smoothness using a total variation (TV) loss. Once trained, the network can transform an image into the style used during training with a single feed-forward pass. However, the network is restricted to the single style on which it has been trained.[8]
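To give a sense of what a total variation penalty measures, here is a minimal sketch of the anisotropic form (the sum of absolute differences between neighbouring pixels) for a small grayscale image stored as a list of rows. The function name `total_variation` is chosen here for illustration, and the exact variant used in the cited paper may differ.

```python
def total_variation(image):
    """Anisotropic total variation of a 2-D image given as a list of rows."""
    height = len(image)
    width = len(image[0])
    # Differences between each pixel and the one directly below it.
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(height - 1) for c in range(width))
    # Differences between each pixel and the one directly to its right.
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(height) for c in range(width - 1))
    return vertical + horizontal


flat = [[0.5, 0.5], [0.5, 0.5]]      # perfectly smooth image
checker = [[0.0, 1.0], [1.0, 0.0]]   # maximally "noisy" 2x2 image

tv_flat = total_variation(flat)
tv_checker = total_variation(checker)
```

A smooth image yields a small value and a noisy one a large value, so adding this term to the training loss discourages high-frequency artifacts in the generated output.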

Chen Dongdong et al. explored the fusion of optical flow information into feed-forward networks in order to improve the temporal coherence of the output.[9]

Most recently, feature-transform-based NST methods have been explored for fast stylization. These methods are not coupled to a single specific style and enable user-controllable blending of styles; one example is the Whitening and Coloring Transform (WCT).[10]

References

  1. "Image Analogies". 2001. https://mrl.nyu.edu/publications/image-analogies/analogies-72dpi.pdf. Retrieved 13 February 2019. 
  2. "Image Quilting". 2001. http://graphics.cs.cmu.edu/people/efros/research/quilting.html. Retrieved 4 February 2021. 
  3. Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (26 August 2015). "A Neural Algorithm of Artistic Style". arXiv:1508.06576 [cs.CV].
  4. Bethge, Matthias; Ecker, Alexander S.; Gatys, Leon A. (2016). "Image Style Transfer Using Convolutional Neural Networks". pp. 2414–2423. https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Gatys_Image_Style_Transfer_CVPR_2016_paper.html. Retrieved 13 February 2019. 
  5. "Very Deep CNNS for Large-Scale Visual Recognition". 2014. http://www.robots.ox.ac.uk/~vgg/research/very_deep/. Retrieved 13 February 2019. 
  6. Dumoulin, Vincent; Shlens, Jonathon S.; Kudlur, Manjunath (9 February 2017). "A Learned Representation for Artistic Style". arXiv:1610.07629 [cs.CV].
  7. Ruder, Manuel; Dosovitskiy, Alexey; Brox, Thomas (2016). "Artistic Style Transfer for Videos". Pattern Recognition. Lecture Notes in Computer Science. 9796. pp. 26–36. doi:10.1007/978-3-319-45886-1_3. ISBN 978-3-319-45885-4. 
  8. Johnson, Justin; Alahi, Alexandre; Li, Fei-Fei (2016). "Perceptual Losses for Real-Time Style Transfer and Super-Resolution". arXiv:1603.08155 [cs.CV].
  9. Chen, Dongdong; Liao, Jing; Yuan, Lu; Yu, Nenghai; Hua, Gang (2017). "Coherent Online Video Style Transfer". arXiv:1703.09211 [cs.CV].
  10. Li, Yijun; Fang, Chen; Yang, Jimei; Wang, Zhaowen; Lu, Xin; Yang, Ming-Hsuan (2017). "Universal Style Transfer via Feature Transforms". arXiv:1705.08086 [cs.CV].