How do you distinguish between Cats and Dogs, anyway?

Jan Geske
Jul 1, 2020

Describing my course project for the Zero to GANs course by Jovian.ml and freecodecamp.org.

In this blog post, I’ll describe how I worked with the ‘cats-and-dogs’ dataset from Kaggle to train a neural network that identifies the correct species in a series of images.

So, first, we want to explore the data:

image of a cat

A typical image from the chosen dataset is about 400 by 400 pixels, but the sizes are not uniform: some are larger, some smaller. So the first order of business was to resize the images into a shape that fits into a uniform tensor. My first intuition was to scale them all to 380 by 380, but I eventually had to go down to 64 by 64, as training on images of the larger size simply took too long.

But I also applied a few other transformations. First, everything is scaled down to a 64 by 64 grid. Then I took the lesson from the CNN lecture to heart and added a random crop of the same size, with reflection on a 4-pixel padding, just so there is some variation in the training data. As a final touch, I also normalized the colors using the stats from the ResNet normalization (as discussed in lecture 5), so it would be easier to compare results if I also wanted to train on the data with a transferred ResNet.
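With torchvision, that pipeline could look roughly like the sketch below. I’m assuming the standard ImageNet channel stats that pretrained ResNets expect; the exact values used are in the notebook linked at the end.

```python
import torchvision.transforms as T

# ImageNet mean/std per channel -- the stats pretrained ResNets expect
stats = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))

train_tfms = T.Compose([
    T.Resize((64, 64)),                                    # uniform 64 by 64 grid
    T.RandomCrop(64, padding=4, padding_mode='reflect'),   # random shift via reflected 4px padding
    T.ToTensor(),
    T.Normalize(*stats),                                   # ResNet-style color normalization
])
```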

Normalized and scaled images of cats and dogs from the training data set

So, with that already looking pretty good, I wanted to know how much data I had to work with. A quick glance over the data revealed 8,005 images in the training set and 2,023 images in the validation set. From the validation set, I split off 500 random images to test my different approaches against (with a pre-defined seed, so they are always the same).
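With PyTorch’s random_split, that split might look like the following. The seed value and the dataset variable names here are stand-ins; the notebook pins its own seed.

```python
import torch
from torch.utils.data import random_split

# Fixed seed so the validation/test split is reproducible across runs.
# 43 is a placeholder; the notebook uses its own pre-defined seed.
torch.manual_seed(43)

test_size = 500
val_size = len(val_ds) - test_size            # val_ds: the 2,023-image validation set
val_ds, test_ds = random_split(val_ds, [val_size, test_size])
```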

Next up, I imported the base functions we already used in our CIFAR10 CNN notebook, to have a baseline to build up from. This includes the accuracy, evaluate, and fit functions, and the base class ImageClassificationBase.
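For readers who haven’t seen that notebook, the core of those helpers looks roughly like this; it’s a condensed sketch of the course code, not a verbatim copy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def accuracy(outputs, labels):
    # Fraction of predictions whose argmax matches the label
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class ImageClassificationBase(nn.Module):
    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        return {'val_loss': F.cross_entropy(out, labels).detach(),
                'val_acc': accuracy(out, labels)}

@torch.no_grad()
def evaluate(model, loader):
    model.eval()
    outputs = [model.validation_step(batch) for batch in loader]
    return {'val_loss': torch.stack([x['val_loss'] for x in outputs]).mean().item(),
            'val_acc': torch.stack([x['val_acc'] for x in outputs]).mean().item()}
```

The fit function simply loops over epochs, calls training_step on each batch, steps an SGD optimizer, and evaluates at the end of every epoch.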

So, I tried a few different approaches, to see how well they perform against each other and whether I could pull it off with a slightly different dataset than the one from the lectures.

Number 1: The Feed-Forward Neural Network

The first hunch I had was to try a FFNN that maps my image (3 color channels of 64 by 64 pixels, flattened into a 12,288-element vector) to 128 neurons, then to 64 neurons, and finally to indicators for the two output classes, with a ReLU activation between each pair of layers. The initial randomized weights and biases performed as expected and predicted the right classification with about 50% accuracy.
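In PyTorch, that architecture could be written like this. The class name is my own label, and it builds on the ImageClassificationBase sketched above:

```python
import torch.nn as nn

class CatsDogsFFNN(ImageClassificationBase):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Flatten(),                  # 3 x 64 x 64 -> 12288
            nn.Linear(3 * 64 * 64, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2),              # logits for 'cat' and 'dog'
        )

    def forward(self, xb):
        return self.network(xb)
```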

I trained that model with 3 different learning rates for 10 epochs each, for a total of 30 epochs. Let’s look at the losses and the accuracy:

loss for the first FFNN approach
accuracy of the first FFNN approach

We can easily see how the loss actually increases after the full 10 epochs at the initial learning rate. And the accuracy climbs only minimally, with heavy setbacks (especially during the second set of epochs).

The accuracy against the test set, however, is at the higher end of this graph, at 64.4%. Not bad for a first draft, but I can do better.

Number 2: Adjusting the number of epochs and learning rate

loss of the second FFNN approach
accuracy of the second FFNN approach

This time, I trained the model with steadily shrinking learning rates for 4, 3, and 3 epochs respectively, totaling a third of the epochs of the last run. While the final accuracy (61% against the test set) isn’t quite as good as before, we get a very similar result with much less effort.
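As a concrete sketch, the training calls look something like this. The learning-rate values here are illustrative, and the fit signature is the one from the course notebook:

```python
model = CatsDogsFFNN()

# Steadily shrinking learning rates, fewer epochs per stage
history  = fit(4, 1e-1, model, train_dl, val_dl)
history += fit(3, 1e-2, model, train_dl, val_dl)
history += fit(3, 1e-3, model, train_dl, val_dl)
```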

Number 3: Trying to get somewhere with a Convolutional Neural Network

loss for initial CNN
accuracy for initial CNN

Once again, I looked at how we trained a model for the CIFAR10 dataset, and once again, I took the lessons and tried to apply them to the cats-and-dogs problem.

I created a CNN with 3 (double) convolutional blocks, in which the image is ‘folded’ by a pair of convolutions and then halved in size with a max pool layer before being folded again.

After three such rounds, the result is flattened into a single vector and passed through a three-layer FFNN (like in the previous attempts) for classification.
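A sketch of such an architecture is below. The channel counts are my guesses, modeled on the CIFAR10 notebook; the exact ones are in the linked notebook.

```python
import torch.nn as nn

class CatsDogsCNN(ImageClassificationBase):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            # Block 1: 3 x 64 x 64 -> 32 x 32 x 32
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Block 2: 32 x 32 x 32 -> 64 x 16 x 16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Block 3: 64 x 16 x 16 -> 128 x 8 x 8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),
            # Flatten and classify with a three-layer FFNN head
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, xb):
        return self.network(xb)
```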

As displayed in the images, it didn’t end well. The accuracy never got noticeably better than 52% and ended up at about the same rate it started at with random weights. In disbelief, I started my search for the cause of such a bad result from a technique that should have improved my results.

Number 4: It’s the learning rate, stupid!

loss for CNN with improved learning rate

So I changed a single thing, the learning rate, and the result is dramatically better, with accuracy well above 80%. Now that’s something: apparently my learning rate was simply too high, and that’s a lesson I’ll take home.

As you can see in the loss curve, I might have trained it a bit longer than needed, but that’s totally fine in my book.

accuracy for CNN with improved learning rate

That level of accuracy, especially at this low image resolution, isn’t much worse than a human, and is therefore something I’m quite happy with.

I did a final experiment with multiple rounds of different learning rates, but it didn’t perform much better than the previous (plotted) experiment: 86% versus 84% accuracy, at the cost of about 20% more training (more epochs).

Summary: I’ve had a lot of fun with the course (http://zerotogans.com) and learned a lot. This final project showed me that I had all the tools I needed to perform a simple analysis of which ML techniques I wanted to use, and to evaluate and improve upon my approaches.

If you want to read through the notebook, you can find it at: https://jovian.ml/jages/course-project-cats-and-dogs/

Make sure to check out the versioning tab, as the main model, as well as some of the surrounding helper methods, changed a bit over the course of writing this article.

Shout out to the amazing team at http://jovian.ml and http://freecodecamp.org for making this possible.
