This has got to be one of the coolest applications of machine learning. If you don’t know what neural style transfer is, here’s the gist: you take a content image, like a photograph of you and your family, and a style image, most often a famous painting with a distinct style, and combine the two into a single image that keeps the overall objects and layout of the content image while adopting the painting style of the style image.

You can find the code here.

## Why not Tensorflow?

It is true that I prefer using Tensorflow over any other deep learning library, but I’ve been trying to get a Tensorflow model working for almost two weeks and I just can’t seem to get it right. And I think I know where the problem lies – the weights. If you’ve ever dealt with style transfer, then you know that in order to create an image containing the overall objects of the content image and the painting style of the style image, you either have to train your own convolutional model on hundreds, or even thousands, of classes, which might take a while; or you can use one of the pretrained models.

The most popular pretrained image classification model is VGG. VGG is a model trained by researchers at Oxford which achieved very good results on the ImageNet dataset.

The original model was trained using Caffe, and the weights were released for free public use. The problem is that there isn’t an official Tensorflow version of those weights.

Keras, on the other hand, has a class called VGG19 which downloads the officially supported Keras weights.

And I am going to rewrite this using Tensorflow and write a blog post about it in the next couple of days.

**Update:** I’ve written the Tensorflow tutorial, and you can find it here.

### Let’s dig in!

## Dependencies

First of all, ignore that I’m importing Keras through tf.keras.

I’m a lazy idiot and I don’t really want to install an additional dependency. Keras is the official high-level API for Tensorflow, and the standalone Keras, which you get with `pip install keras`, uses one of the three officially supported backends (Tensorflow, Theano or CNTK), so either way you need Tensorflow.

https://gist.github.com/markojerkic/61325fd9c66e3743626c7c4d436ecd78

## Preprocessing

Before you do any machine learning, you need to do some preprocessing. Basically, we need to convert the image, which is three-dimensional, to a four-dimensional array, where the newly added dimension is the batch dimension.

We also need to subtract the mean pixel values (and add them back when converting the output back into an image). It’s really simple:

https://gist.github.com/markojerkic/26d28c33d9a1794ab0e8997ef5c5ece2

The image height and width are predefined and are the same for all three images (content, style and generated). Those values are for you to define based on your CPU or GPU.
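The linked gist has the full code; here is a rough NumPy sketch of the same idea. The mean values are the standard ImageNet channel means (in BGR order) used by the Caffe-trained VGG models; the function names are my own:

```python
import numpy as np

# Standard ImageNet mean pixel values (BGR order), as used by the Caffe-trained VGG
VGG_MEAN = np.array([103.939, 116.779, 123.68])

def preprocess(img):
    """(height, width, 3) RGB image -> (1, height, width, 3) centred batch."""
    img = img.astype("float64")
    img = img[:, :, ::-1]               # RGB -> BGR, the channel order VGG expects
    img -= VGG_MEAN                     # subtract the channel-wise mean
    return np.expand_dims(img, axis=0)  # add the batch dimension

def deprocess(batch):
    """Undo the preprocessing so the generated image can be saved."""
    img = batch[0] + VGG_MEAN
    img = img[:, :, ::-1]               # BGR -> RGB
    return np.clip(img, 0, 255).astype("uint8")
```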

## Build the model

Now we need to initialize the variables. Those are the content image, the style image and the generated image.

https://gist.github.com/markojerkic/a8b458947209b8d7db288bd448512677

Since I’m using Tensorflow as my backend, my image shape is *(batch, height, width, depth)*. If you’re using Theano, then it should be *(batch, depth, height, width)*.

The concatenate() function combines those three arrays into a single array which will be passed to the VGG19 class and the convolutional layers.

After that, we create a dictionary of outputs, where the layer names are the keys and the outputs are the values.
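To make the batching concrete, here is a NumPy stand-in for what the Keras tensors look like (in the script itself this is `K.concatenate()` on Keras tensors, and the result is fed to the `VGG19` class; the sizes here are only illustrative):

```python
import numpy as np

height, width = 32, 32  # illustrative; in the script these are your chosen dimensions

# Stand-ins for the three preprocessed images, each of shape (1, height, width, 3)
content = np.zeros((1, height, width, 3))
style = np.ones((1, height, width, 3))
generated = np.random.rand(1, height, width, 3)

# Stacking along the batch axis gives one (3, height, width, 3) tensor, so a
# single forward pass through VGG19 produces features for all three images.
input_tensor = np.concatenate([content, style, generated], axis=0)

# Index 0 of every layer output then belongs to the content image,
# index 1 to the style image and index 2 to the generated image.
```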

# Losses

The loss is the most important part of this script. We need to look at specific layer outputs of the model for either the content image and the generated image or the style image and the generated image.

## Content loss

The content loss is simply the squared Euclidean distance between the outputs of the model for the content image and the generated image at a specific layer in the network.

https://gist.github.com/markojerkic/7fac3384ac4d9d3f93f4bbe4e1acf707

Where the *content* is the feature map of the content image, and *gen* is the feature map of the generated image.
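In NumPy terms, the whole thing is a one-liner (the gists above show the Keras backend version):

```python
import numpy as np

def content_loss(content, gen):
    """Sum of squared differences between the two feature maps."""
    return np.sum(np.square(gen - content))
```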

https://gist.github.com/markojerkic/54b6f03c29481fde88b5ebd5e0019db2

We initialize the loss to zero. We get the outputs of the model for the layer *block5_conv2*, which is the layer I chose in this case for the content loss. The higher layers are better at recognizing overall shapes, like eyes and faces, while the lower layers are better at recognizing lines and basic shapes, like brush strokes.

Then we add the content loss, multiplied by the content weight, which in this case is 0.025, to the total loss.

## Style loss

Style loss is a bit of a different story. When calculating the style loss, we need to find the squared Euclidean distance between the Gram matrices of the feature maps of the style image and the generated image, multiplied by a constant.

https://gist.github.com/markojerkic/eb45208df73084bdd0e774831016d9c0

The Gram matrix is the dot product of the flattened feature map with the transpose of the flattened feature map.

After we calculate the Gram matrices of the feature maps, we calculate the squared Euclidean distance and divide it by *4 · size² · channels²*.
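As a rough NumPy sketch of both pieces (the script does the same with Keras backend ops; *size* here is height × width of the feature map):

```python
import numpy as np

def gram_matrix(feature_map):
    """Gram matrix of a (height, width, channels) feature map: the dot
    product of the flattened feature map with its own transpose."""
    channels = feature_map.shape[-1]
    flat = feature_map.reshape(-1, channels)  # (height*width, channels)
    return flat.T @ flat                      # (channels, channels)

def style_loss(style, gen):
    h, w, channels = style.shape
    size = h * w
    S = gram_matrix(style)
    G = gram_matrix(gen)
    return np.sum(np.square(S - G)) / (4.0 * size ** 2 * channels ** 2)
```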

https://gist.github.com/markojerkic/53b632c280c7c7aeeb395f5b27da4f62

For the style loss, we use the first convolutional layers for all five blocks. The procedure here is almost the same as with the content loss. We just need to divide the loss by the number of layers we take into account, which in this case is 5. I’ve initialized the style weight to 1.0.

## Total variance loss

We also need to add a third type of loss, the total variance loss.

This loss reduces the amount of noise in the generated image. This means that the generated image will not look all fuzzy, but rather it will look smooth and overall better.

https://gist.github.com/markojerkic/b1c89a186bf4822567ad0693ca6ea25c

We just go over the image pixel by pixel and measure how big the differences between neighbouring pixels are.
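A NumPy sketch of that neighbour-difference idea (the 1.25 exponent follows the common Keras style-transfer recipe; the script's backend version is in the gist above):

```python
import numpy as np

def total_variation_loss(batch):
    """Penalise differences between vertically and horizontally adjacent
    pixels of a (1, height, width, 3) image batch."""
    a = np.square(batch[:, :-1, :-1, :] - batch[:, 1:, :-1, :])  # vertical neighbours
    b = np.square(batch[:, :-1, :-1, :] - batch[:, :-1, 1:, :])  # horizontal neighbours
    return np.sum(np.power(a + b, 1.25))
```

A perfectly flat image has zero total variation; the noisier the image, the larger the penalty.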

https://gist.github.com/markojerkic/b84d5a5cec527c55fc64fe56d05bbc01

I’ve initialized the “*tv*” weight to 1.0.

## Evaluating the gradients and loss

Gradients are very easy to calculate. We just pass the loss and the generated image to the function `K.gradients()`.

We also need to define a Keras function which calculates the loss and the gradients during the optimization.

Also, we need a function which interprets that output, and an evaluator object which keeps track of the loss and gradients.
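The evaluator pattern is worth spelling out: SciPy's L-BFGS asks for the loss and the gradients through two separate callbacks, so we cache both on the first call to avoid running the expensive Keras function twice. A minimal sketch (`eval_fn` stands in for the Keras function; the name is my own):

```python
class Evaluator:
    """Caches loss and gradients so that scipy's L-BFGS, which requests
    them through two separate callbacks, only triggers one (expensive)
    evaluation per optimization step."""

    def __init__(self, eval_fn):
        self.eval_fn = eval_fn  # maps a flat image vector to (loss, gradients)
        self.loss_value = None
        self.grads_values = None

    def loss(self, x):
        loss_value, grads_values = self.eval_fn(x)
        self.loss_value = loss_value
        self.grads_values = grads_values
        return self.loss_value

    def grads(self, x):
        grads_values = self.grads_values  # reuse the cached gradients
        self.loss_value = None
        self.grads_values = None
        return grads_values
```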

https://gist.github.com/markojerkic/6d592c9a3e0364a15d2a9c69c8d735e4

## Optimization time!

Now that we’ve created the computation graph, we need to define an optimization function.

In this case, the best practical optimization method is limited-memory BFGS, which SciPy kindly provides in its `scipy.optimize` package.

https://gist.github.com/markojerkic/e9088eec31e90a377bcc187ae9f5159d

Of course, the more iterations, the better the output image should be – at least in theory. But there is a lot more that goes into tuning this script, like adjusting the weights of individual style layers or trying different content-to-style weight ratios.
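The optimization loop itself boils down to repeated calls to `scipy.optimize.fmin_l_bfgs_b`, feeding it the loss and gradient callbacks. Here is the call pattern with a toy quadratic standing in for the style-transfer loss (in the script, the callbacks are the evaluator's `loss()` and `grads()` methods and `x` is the flattened image):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

# Toy objective: minimum at x = 3 in every dimension
def loss(x):
    return float(np.sum((x - 3.0) ** 2))

def grads(x):
    return 2.0 * (x - 3.0)

x = np.zeros(4)  # in the script, the flattened content image is a common start
for i in range(5):
    # maxfun limits the evaluations per outer iteration, just like the script
    x, min_val, info = fmin_l_bfgs_b(loss, x, fprime=grads, maxfun=20)
    # the script saves an intermediate image here on every iteration
```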

## Results

I didn’t have a lot of time to run this script, so my results are limited, but I hope I will have time to add a couple more output examples when I put up my follow-up post about writing this script using only Tensorflow. But for now, I’ve got these examples:

And one of my favourite Game of Thrones characters – King Robert, First of His Name, King of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms and Protector of the Realm.