Image Classification with Transfer Learning in Keras - Create Cutting Edge CNN Models

Introduction

Deep Learning models are very versatile and powerful - they're routinely outperforming humans in narrow tasks, and their generalization power is increasing at a rapid rate. New models are being released and benchmarked against community-accepted datasets frequently, and keeping up with all of them is getting harder.

Most of these models are open source, and you can implement them yourself as well.

This means that the average enthusiast can load in and play around with the cutting edge models in their home, on very average machines, not only to gain a deeper understanding and appreciation of the craft, but also to contribute to the scientific discourse and publish their own improvements whenever they're made.

In this guide, you'll learn how to use pre-trained, cutting edge Deep Learning models for Image Classification and repurpose them for your own specific application. This way, you're leveraging their high performance, ingenious architectures and someone else's training time - while applying these models to your own domain.
All of the code written in the guide is also available on GitHub.
Transfer Learning for Computer Vision and Convolutional Neural Networks (CNNs)

Knowledge and knowledge representations are very universal. A computer vision model trained on one dataset learns to recognize patterns that might be very prevalent in many other datasets. Notably, in "Deep Learning for the Life Sciences", by Bharath Ramsundar, Peter Eastman, Patrick Walters and Vijay Pande, it's noted that:

"There have been multiple studies looking into the use of recommendation system algorithms for use in molecular binding prediction. Machine learning architectures used in one field tend to carry over to other fields, so it’s important to retain the flexibility needed for innovative work."

For instance, straight and curved lines, which are typically learned at a lower level of a CNN hierarchy, are bound to be present in practically all datasets. Some high-level features, such as the ones that distinguish a bee from an ant, are going to be represented and learned much higher in the hierarchy:

feature hierarchies for convolutional neural networks

The "fine line" between these is what you can reuse! Depending on the level of similarity between your dataset and the one a model's been pre-trained on, you may be able to reuse a small or large portion of it.

A model that classifies human-made structures (trained on a dataset such as Places365) and a model that classifies animals (trained on a dataset such as ImageNet) are bound to have some shared patterns, although not many.

You might want to train a model to distinguish, say, buses and cars for a self-driving car's vision system. You may reasonably choose a very performant architecture that has proven to work well on datasets similar to yours, go through the long process of training, and end up with a performant model of your own. However, if another model is likely to have learned similar representations on the lower and higher levels of abstraction, there's no need to re-train a model from scratch. You can use some of the already pre-trained weights, which are just as applicable to your own application of the model as they were to the creator of the original architecture. You'd be transferring some of the knowledge from an existing model to a new one, and this is known as Transfer Learning.

The closer the dataset of a pre-trained model is to your own, the more you can transfer. The more you can transfer, the more of your own time and computation you save. It's worth remembering that training neural networks does have a carbon footprint, so you're not only saving time!

Typically, Transfer Learning is done by loading a pre-trained model and freezing its layers. In many cases, you can just cut off the classification layers (the final layers, or head) and re-train only those, while keeping all of the other abstraction layers intact. In other cases, you may decide to re-train several layers in the hierarchy, which is typically done when the datasets differ enough that re-training multiple layers is warranted. You may also decide to re-train the entirety of the model to fine-tune all of the layers. These two approaches can be summarized as:

  • Using the Convolutional Network as a Feature Extractor
  • Fine-Tuning the Convolutional Network

In the former, you use the underlying entropic capacity of the model as a fixed feature extractor and just train a dense network on top to discern between the extracted features. In the latter, you fine-tune the entire convolutional network (or a portion of it), if it doesn't already have representative feature maps for your more specific dataset, while still relying on the already trained feature maps as a starting point. Here's a visual representation of how Transfer Learning works:

how does transfer learning work?
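To make these two options concrete before diving into the details, here's a minimal, hypothetical sketch in Keras - the names and layer choices are illustrative only, and a full, working version is built step by step later in this guide:

from tensorflow import keras

# Load a pre-trained convolutional base without its classification head
base = keras.applications.ResNet50(weights='imagenet', include_top=False, pooling='avg')

# Option 1: feature extraction - freeze the base and train only a new classification head
base.trainable = False
model = keras.Sequential([
    base,
    keras.layers.Dense(10, activation='softmax')
])

# Option 2: fine-tuning - after training the head, unfreeze (part of) the base
# and continue training with a low learning rate
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])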

Established and Cutting Edge Image Classification Models

Many models exist out there, and for well-known datasets, you're likely to find hundreds of well-performing models published in online repositories and papers. A good holistic view of models trained on the ImageNet dataset can be seen at PapersWithCode. Some of the well-known published architectures that have subsequently been ported into many Deep Learning frameworks include:

  • EfficientNet
  • SENet
  • Xception
  • ResNet
  • VGGNet
  • AlexNet
  • LeNet-5

The list of models on PapersWithCode is constantly being updated, and you shouldn't get hung up on the exact positions of these models there. Many of them are outperformed by other models as of writing, and many of the newer models are actually based on the ones outlined in the list above. It's worth noting that Transfer Learning itself played an important role in the newer, higher-accuracy models! The downside is that a lot of the newest models aren't available as pre-trained models within frameworks such as TensorFlow and PyTorch. You won't be losing out on much performance, though, so going with any of the well-established architectures isn't a bad choice at all.

Transfer Learning with Keras - Adapting Existing Models

With Keras, the pre-trained models are available under the tensorflow.keras.applications module. Each model has its own sub-module and class. When loading a model in, you can set a couple of optional arguments to control how the models are being loaded in. For instance, the weights argument, if present, defines the pre-trained weights. If omitted, only the architecture (untrained network) will be loaded in. If you supply the name of a dataset - a pre-trained network will be returned for that dataset. Additionally, since you'll most likely be removing the top layer(s) for Transfer Learning, the include_top argument is used to define whether the top layer(s) should be present or not!

import tensorflow.keras.applications as models

# 98 MB
resnet = models.resnet50.ResNet50(weights='imagenet', include_top=False)
# 528MB
vgg16 = models.vgg16.VGG16(weights='imagenet', include_top=False)
# 23MB
nnm = models.NASNetMobile(weights='imagenet', include_top=False)
# etc...
Note: If you've never loaded pre-trained models before, they'll be downloaded over an internet connection. This may take anywhere between a few seconds and a couple of minutes, depending on your internet speed and the size of the models. The size of pre-trained models spans from as little as 14MB (typically lower for Mobile models) to as high as 549MB.
EfficientNet is a family of networks that are quite performant, scalable and, well, efficient. They were made with reducing the number of learnable parameters in mind - EfficientNet-B0, the smallest member of the family, has only around 3.6M parameters without its classification top (about 5.3M with it, as we'll see below). While that's still a large number, consider that VGG16, for instance, has around 138M. On a home setup, this also helps with training times significantly!

Let's load in one of the members of the EfficientNet family - EfficientNet-B0:

from tensorflow import keras

effnet = keras.applications.EfficientNetB0(weights='imagenet', include_top=False)
effnet.summary()

This results in:

Model: "efficientnetb0"
________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
========================================================================================
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
________________________________________________________________________________________
rescaling (Rescaling)           (None, 224, 224, 3)  0           input_1[0][0]                    
________________________________________________________________________________________
...
...
block7a_project_conv (Conv2D)   (None, 7, 7, 320)    368640      block7a_se_excite[0][0]          
________________________________________________________________________________________
block7a_project_bn (BatchNormal (None, 7, 7, 320)    1280        block7a_project_conv[0][0]                    
========================================================================================
Total params: 3,634,851
Trainable params: 3,592,828
Non-trainable params: 42,023
________________________________________________________________________________________

On the other hand, if we were to load in EfficientNet-B0 with the top included, we'd also have a few extra layers at the end that were trained to classify ImageNet data. This is the part of the model that we'll be replacing and training ourselves for our own application:

effnet = keras.applications.EfficientNetB0(weights='imagenet', include_top=True)
effnet.summary()

This would include the final convolution, pooling, dropout and Dense classification layers, which prop up the parameter count significantly:

Model: "efficientnetb0"
_________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
=========================================================================================
input_11 (InputLayer)           [(None, 224, 224, 3) 0                                            
_________________________________________________________________________________________
...
...
_________________________________________________________________________________________
top_conv (Conv2D)               (None, 7, 7, 1280)   409600      block7a_project_bn[0][0]         
_________________________________________________________________________________________
top_bn (BatchNormalization)     (None, 7, 7, 1280)   5120        top_conv[0][0]                   
_________________________________________________________________________________________
top_activation (Activation)     (None, 7, 7, 1280)   0           top_bn[0][0]                     
_________________________________________________________________________________________
avg_pool (GlobalAveragePooling2 (None, 1280)         0           top_activation[0][0]             
_________________________________________________________________________________________
top_dropout (Dropout)           (None, 1280)         0           avg_pool[0][0]                   
_________________________________________________________________________________________
predictions (Dense)             (None, 1000)         1281000     top_dropout[0][0]                
=========================================================================================
Total params: 5,330,571
Trainable params: 5,288,548
Non-trainable params: 42,023
_________________________________________________________________________________________

Again, we won't be using the top layers, as we'll be adding our own top to the EfficientNet model and re-training only the layers we add. It is worth noting what the original top is built with, though! It uses a Conv2D layer, followed by BatchNormalization, GlobalAveragePooling2D and Dropout before the final Dense classification layer. While we don't have to strictly follow this approach (and other approaches may prove to be better for another dataset), it's reasonable to remember what the original top looked like.
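If you'd like to inspect just those final layers without scrolling through the full summary, one quick (optional) way is to slice the model's layers list - a small sketch, assuming the effnet model loaded with include_top=True above:

# Print the names and output shapes of the last few layers of the full model
for layer in effnet.layers[-6:]:
    print(layer.name, layer.output_shape)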

Note: Data preprocessing plays a crucial role in model training, and most models will have different preprocessing pipelines. You don't have to perform guesswork here! Where applicable, a model comes with its own preprocess_input() function, which applies the same preprocessing steps to the input as were applied during training. You can import the function from the respective module of the model, if the model resides in its own module. For instance, VGG16 has its own preprocess_input function:

from keras.applications.vgg16 import preprocess_input

That being said, loading in a model, preprocessing input for it and predicting a result in Keras is as easy as:

import numpy as np
import tensorflow.keras.applications as models
from keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

vgg16 = models.vgg16.VGG16(weights='imagenet', include_top=True)

# Load an image (placeholder path) and turn it into a batch of shape (1, 224, 224, 3)
img = image.img_to_array(image.load_img('image.jpg', target_size=(224, 224)))
img = np.expand_dims(img, 0)

img = preprocess_input(img)
pred = vgg16.predict(img)
Note: Not all models have a dedicated preprocess_input() function, because the preprocessing is done within the model itself. For instance, EfficientNet, which we'll be using, doesn't have its own dedicated preprocessing function, as the Rescaling layer takes care of that.

That's it! Now, since the pred array doesn't really contain human-readable data, you can also import the decode_predictions() function alongside the preprocess_input() function from a module. Alternatively, you can import the generic decode_predictions() function that also applies to models that don't have their own dedicated modules:
from keras.applications.model_name import preprocess_input, decode_predictions
# OR
from keras.applications.imagenet_utils import decode_predictions
# ...
print(decode_predictions(pred))

Tying this together, let's get an image of a black bear via urllib, resize it to the target size EfficientNet expects (the input layer takes a shape of (None, 224, 224, 3)) and classify it with the pre-trained model:

from tensorflow import keras
from keras.applications.vgg16 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

import urllib.request
import matplotlib.pyplot as plt
import numpy as np

# Public domain image
url = 'https://upload.wikimedia.org/wikipedia/commons/0/02/Black_bear_large.jpg'
urllib.request.urlretrieve(url, 'bear.jpg')

# Load image and resize (doesn't keep aspect ratio)
img = image.load_img('bear.jpg', target_size=(224, 224))
# Turn to array of shape (224, 224, 3)
img = image.img_to_array(img)
# Expand array into (1, 224, 224, 3)
img = np.expand_dims(img, 0)
# Preprocess for models that have specific preprocess_input() function
# img_preprocessed = preprocess_input(img)

# Load model and run prediction
effnet = keras.applications.EfficientNetB0(weights='imagenet', include_top=True)
pred = effnet.predict(img)
print(decode_predictions(pred))

This results in:

[[
('n02133161', 'American_black_bear', 0.6024658),
('n02132136', 'brown_bear', 0.1457715),
('n02134418', 'sloth_bear', 0.09819221),
('n02510455', 'giant_panda', 0.0069221947),
('n02509815', 'lesser_panda', 0.005077324)
]]

The model is fairly certain that the image is an image of an American black bear, which is right! When preprocessed with a model-specific preprocessing function, the image may change significantly. For instance, VGG16's preprocessing function would change the color of the bear's fur:

preprocessing image for VGG16 CNN

It looks a lot more brown now! If we were to feed this image into EfficientNet, it'd think it's a brown bear:

[[
('n02132136', 'brown_bear', 0.7152758), 
('n02133161', 'American_black_bear', 0.15667434), 
('n02134418', 'sloth_bear', 0.012813852), 
('n02134084', 'ice_bear', 0.0067828503), ('n02117135', 'hyena', 0.0050422684)
]]
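If you want to reproduce this comparison yourself, a minimal sketch (reusing img, effnet and the VGG16 imports from the snippet above) could look like this - note that exactly how the preprocessed array renders on screen depends on how you clip and cast it for plotting:

# VGG16-style preprocessing (mean subtraction, channel reordering) on a copy of the batch
img_vgg = preprocess_input(img.copy())

# Visualize the preprocessed image (clipped back into the displayable 0-255 range)
plt.imshow(np.clip(img_vgg[0], 0, 255).astype('uint8'))
plt.show()

# Classify the VGG16-preprocessed image with EfficientNet
pred_vgg = effnet.predict(img_vgg)
print(decode_predictions(pred_vgg))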

Awesome! The model works. Now, let's add a new top to it and re-train the top to perform classification for something outside of the ImageNet set.

Adding a New Top to a Pre-trained Model

When performing transfer learning, you'll either load models without their tops, or remove the tops manually:

# Load without top
# When adding new layers, we also need to define the input_shape
# so that  the new Dense layers have a fixed input_shape as well
effnet_base = keras.applications.EfficientNetB0(weights='imagenet', 
                                                include_top=False, 
                                                input_shape=(224, 224, 3))

# Or load the full model
full_effnet = keras.applications.EfficientNetB0(weights='imagenet', 
                                                include_top=True, 
                                                input_shape=(224, 224, 3))

# And then remove X layers from the top
trimmed_effnet = keras.Model(inputs=full_effnet.input, outputs=full_effnet.layers[-3].output)

We'll be going with the first option since it's more convenient. Depending on whether you'd like to fine-tune the convolutional blocks or not, you'll either freeze them or leave them trainable. Say we want to use the underlying pre-trained feature maps as-is and freeze the layers, so that we only re-train the new classification layers at the top:

effnet_base.trainable = False

You don't need to iterate through the model and set each layer to be trainable or not, though you can. If you'd like to freeze the first n layers and allow some higher-level feature maps to be fine-tuned, while leaving the lower-level ones untouched, you can:

for layer in effnet_base.layers[:-2]:
    layer.trainable = False

Here, we've set all layers in the base model to be untrainable, except for the last two. If we check the model, there are only ~2.5K trainable parameters now:

effnet_base.summary()
# ...                
=========================================================================================
Total params: 4,049,571
Trainable params: 2,560
Non-trainable params: 4,047,011
_________________________________________________________________________________________

Now, let's define a new top that'll be put on top of this effnet_base. Fortunately, chaining models in Keras is as easy as making a new model and putting it on top of another one! You can leverage the Functional API and just chain a few new layers on top of a model. Let's add a Conv2D layer, a BatchNormalization layer, a GlobalAveragePooling2D layer, some Dropout, a Flatten and a couple of fully connected layers:

conv2d = keras.layers.Conv2D(7, 7)(effnet_base.output)
bn = keras.layers.BatchNormalization()(conv2d)
gap = keras.layers.GlobalAveragePooling2D()(bn)
do = keras.layers.Dropout(0.2)(gap)
flatten = keras.layers.Flatten()(do)
fc1 = keras.layers.Dense(512, activation='relu')(flatten)
output = keras.layers.Dense(10, activation='softmax')(fc1)

new_model = keras.Model(inputs=effnet_base.input, outputs=output)
Note: The training argument (passed when calling a layer or model) is different from the trainable attribute we set to False earlier - it puts layers in inference mode instead of training mode. This matters for BatchNormalization, which keeps moving statistics: if those statistics keep updating during fine-tuning, they'll "undo" the training done before it. Since TF 2.0, setting a BatchNormalization layer's (or its parent model's) trainable to False also makes it run in inference mode, so freezing effnet_base already takes care of this here. If you instead wrap the base model by calling it on an Input tensor, pass training=False in that call - this becomes a crucial step if you wish to unfreeze layers later on.
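For reference, here's a minimal sketch of that wrapping pattern (hypothetical, and not used in the rest of this guide), where the frozen base is called as a single layer with training=False:

inputs = keras.Input(shape=(224, 224, 3))
x = effnet_base(inputs, training=False)  # run the frozen base in inference mode
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(10, activation='softmax')(x)
wrapped_model = keras.Model(inputs, outputs)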
Alternatively, you can use the Sequential API and call the add() method multiple times:
new_model = keras.Sequential()
new_model.add(effnet_base) # Add entire model
new_model.add(keras.layers.Conv2D(7,7))
new_model.add(keras.layers.BatchNormalization())
new_model.add(keras.layers.GlobalAveragePooling2D())
new_model.add(keras.layers.Dropout(0.2))
new_model.add(keras.layers.Flatten())
new_model.add(keras.layers.Dense(512, activation='relu'))
new_model.add(keras.layers.Dense(10, activation='softmax'))

This adds the entire model as a layer itself, so it's treated as one entity:

Layer: 0, Trainable: False # Entire EfficientNet model
Layer: 1, Trainable: True
Layer: 2, Trainable: True
...

On the other hand, if you want the base model's layers to appear as separate entities in the new model (as in the layer listing below), you can't simply add them one by one to a Sequential model - EfficientNet's blocks branch and merge, so its topology isn't sequential, and Sequential's add() expects layers rather than tensors such as effnet_base.output. In that case, stick with the Functional API approach from before, which builds the new model directly from the base model's input and the new output tensor:

new_model = keras.Model(inputs=effnet_base.input, outputs=output)

In any of these cases, we've added 10 output classes, since we'll be using the CIFAR10 dataset later on, which has 10 classes! Let's take a look at the trainable layers in the unwrapped (Functional) network:

for index, layer in enumerate(new_model.layers):
    print("Layer: {}, Trainable: {}".format(index, layer.trainable))

This results in:

Layer: 0, Trainable: False
Layer: 1, Trainable: False
Layer: 2, Trainable: False
...
Layer: 235, Trainable: False
Layer: 236, Trainable: False
Layer: 237, Trainable: True
Layer: 238, Trainable: True
Layer: 239, Trainable: True
Layer: 240, Trainable: True
Layer: 241, Trainable: True

Awesome! Let's load in the dataset, preprocess it and re-train the classification layers on it.

Loading and Preprocessing Data

We'll be working with the CIFAR10 dataset. This is a dataset that's not too hard to classify, since it only has 10 classes, and we'll be leveraging a well-received architecture to help us in that endeavor. Its "older brother", CIFAR100, is a genuinely hard one to work with. It has 50,000 training images with 100 labels, meaning each class has only 500 samples. This is extremely hard to get right with so few samples per class, and almost all well-performing models on that dataset use heavy data augmentation.

Data augmentation is an art and science in and of itself, and is out of scope for this guide - so we'll only be diversifying the dataset with a couple of random transformations.
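For reference, if you do want a reusable augmentation pipeline rather than the ad-hoc transformations we'll apply later, Keras ships preprocessing layers for this. A minimal sketch (assuming a reasonably recent TF 2.x where these layers live under keras.layers; it's not used in the rest of this guide):

from tensorflow import keras

augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),  # rotate by up to +/-10% of a full circle
    keras.layers.RandomZoom(0.1),
])

# These layers are only active during training, and can be placed right after
# a model's Input layer or mapped over a tf.data pipeline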

For brevity's sake, we'll stick to CIFAR10, to emulate the dataset you'll be working with yourself!

Note: Keras' datasets module contains a few datasets, but these are mainly meant for benchmarking and learning. We can use tensorflow_datasets to get access to a much larger corpus of datasets! Alternatively, you can use any other source, such as Kaggle or academic repositories.

We'll be using tensorflow_datasets to download the CIFAR10 dataset and get the labels and number of classes:
import tensorflow_datasets as tfds
import tensorflow as tf

dataset, info = tfds.load("cifar10", as_supervised=True, with_info=True)
# Save class_names and n_classes for later
class_names = info.features["label"].names
n_classes = info.features["label"].num_classes

print(class_names) # ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
print(n_classes) # 10

You can explore the dataset further through the info object, but we won't be diving into that right now. Let's split it into a train_set, valid_set and test_set instead:

test_set, valid_set, train_set = tfds.load("cifar10", 
                                           split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
                                           as_supervised=True)

print("Train set size: ", len(train_set)) # Train set size:  37500
print("Test set size: ", len(test_set)) # Test set size:  5000
print("Valid set size: ", len(valid_set)) # Valid set size:  7500
Note: The split argument expects train and test keywords, and there's no valid keyword that can be used to extract a validation set. Because of this, we need to perform the slightly awkward and clunky split as we have - with a 10/15/75 split.
Now, the CIFAR10 images are significantly different from the ImageNet images! Namely, CIFAR10 images are just 32x32, while our EfficientNet model expects 224x224 images, so we'll want to resize them in any case. We might also want to apply some transformation functions to duplicate images to artificially expand the sample size per class, if the dataset doesn't have enough of them. In the case of CIFAR10, this isn't an issue, as there are enough images per class, but with CIFAR100 it's a different story. It's worth noting that, when upscaling images this small, even humans have significant difficulty discerning what's in some of the images.

For instance, here are a few images:

cifar100 image examples

Can you tell what's on these with confidence? Consider the lifelong amount of context you have for these images, as well, which the model doesn't have. It's worth keeping this in mind when you train it and observe the accuracy. Let's define a preprocessing function for each image and its associated label:

def preprocess_image(image, label):
    # Resize to EfficientNet size
    resized_image = tf.image.resize(image, [224, 224])
    # Random flips, plus a fixed 90-degree rotation (fully optional - since we transform
    # images in place rather than adding new ones, no real dataset expansion takes place)
    # If we run this function multiple times, it'll net different results
    img = tf.image.random_flip_left_right(resized_image)
    img = tf.image.random_flip_up_down(img)
    img = tf.image.rot90(img)
    # Preprocess image with model-specific function if it has one
    # img = preprocess_input(img)
    return img, label

And finally, we'll want to apply this function to each image in the sets! Note that we haven't expanded the sets here - for brevity's sake, we're skipping real data augmentation, though you could add it. Applying the function is easily done via map(). Since the network also expects batches ((None, 224, 224, 3) instead of (224, 224, 3)), we'll batch() the datasets after mapping:

train_set = train_set.map(preprocess_image).batch(32).prefetch(1)
test_set = test_set.map(preprocess_image).batch(32).prefetch(1)
valid_set = valid_set.map(preprocess_image).batch(32).prefetch(1)
Note: The prefetch() function is optional but helps with efficiency. As the model is training on a single batch, the prefetch() function pre-fetches the next batch so it's not waited upon when the training step is finished.
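If you'd rather let TensorFlow tune these knobs itself, recent TF versions expose tf.data.AUTOTUNE, which can (optionally) be used for both the mapping parallelism and the prefetch buffer - a small variant of the lines above:

train_set = train_set.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(tf.data.AUTOTUNE)
test_set = test_set.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(tf.data.AUTOTUNE)
valid_set = valid_set.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(tf.data.AUTOTUNE)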
Finally, we can train the model!

Training a Model

With the data loaded, preprocessed and split into adequate sets, we can train the model on it. The optimizer and its hyperparameters, the loss function and the metrics generally depend on your specific task. Since we're doing sparse classification, a sparse_categorical_crossentropy loss should work well, and the Adam optimizer is a reasonable default. Let's compile the model and train it for a few epochs. It's worth remembering that most of the layers in the network are frozen - we're only training the new classifier on top of the extracted feature maps.

Only once we've trained the top layers may we decide to unfreeze the feature extraction layers and let them fine-tune a bit more. This step is optional, and in many cases you won't unfreeze them (mainly when working with really large networks). A good rule of thumb is to compare the datasets and estimate which levels of the hierarchy you can reuse without re-training. If they're really different, you probably chose a network pre-trained on the wrong dataset. It wouldn't be efficient to use the feature extraction of Places365 (man-made objects) for classifying animals. However, it does make sense to use a network trained on ImageNet (which contains various objects, animals, plants and humans) and then apply it to a different dataset with relatively similar categories, such as CIFAR10.

Note: Depending on the architecture you're using, unfreezing the layers might be a bad idea, due to their size. There's a good chance that your local machine will run out of memory when trying to tackle a 20M parameter model and loading a training step into the RAM/VRAM. When possible, try to find an architecture pre-trained on a dataset that's sufficiently similar to yours that you don't have to change the feature extractors. If you have to, it's not impossible but does make the process much slower. We'll cover that later.
Let's train the new network (really, only the top of it) for 10 epochs:
optimizer = keras.optimizers.Adam(learning_rate=2e-5)

new_model.compile(loss="sparse_categorical_crossentropy", 
                  optimizer=optimizer, 
                  metrics=["accuracy"])

history = new_model.fit(train_set, 
                        epochs=10,
                        validation_data=valid_set)
Note: This may take some time and is ideally done on a GPU, depending on how large the model is and the dataset being fed into it. If you don't have access to a GPU, it's advised to run this code on any of the cloud providers that give you access to a free GPU, such as Google Colab, Kaggle Notebooks, etc. Each epoch can take anywhere from 60 seconds on stronger GPUs to 10 minutes on weaker ones.

This is the point at which you sit back and go grab a coffee (or tea)! After 10 epochs, the training and validation accuracy are looking good:
Epoch 1/10
1172/1172 [==============================] - 99s 80ms/step - loss: 1.7795 - accuracy: 0.5503 - val_loss: 1.2776 - val_accuracy: 0.8055
...
Epoch 10/10
1172/1172 [==============================] - 94s 80ms/step - loss: 0.2760 - accuracy: 0.9088 - val_loss: 0.3419 - val_accuracy: 0.8896

~90% on the training set and ~89% on the validation set - the model didn't really overfit much, and a 1% difference might as well be random sampling at work. Let's evaluate the model and take a look at the learning curves!

Testing a Model

Let's first test this model out, before trying to unfreeze all of the layers and seeing if we can fine-tune it then:

new_model.evaluate(test_set)
# 157/157 [==============================] - 10s 66ms/step - loss: 0.3312 - accuracy: 0.8888
# [0.3312007784843445, 0.8888000249862671]

~89% on the testing set, and extremely close to the accuracy on the validation set! Looks like our model is generalizing well, but there's still room for improvement. Let's take a look at the learning curves.

The training curves are to be expected - they're pretty short since we only trained for 10 epochs, but they quickly plateaued, so we probably wouldn't have gotten much better performance with more epochs. The validation loss and accuracy reached the training loss and accuracy fairly quickly, so we would've started overfitting the model if we kept training it as-is. While oscillations do occur and the accuracy could very well rise in epoch 11, it's not too likely, so we won't chase that chance:

transfer learning training curves
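The curves themselves aren't produced by any of the snippets above - they come from the History object returned by fit(). A minimal sketch for plotting them, assuming the history variable from the training call earlier:

import pandas as pd
import matplotlib.pyplot as plt

# history.history holds per-epoch loss/accuracy for the training and validation sets
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.xlabel('Epoch')
plt.show()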

Can we fine-tune this network further? We've replaced and re-trained the top layers concerned with classification of feature maps, but the feature maps themselves might not be ideal! While they are pretty good, these images are simply different from ImageNet, so it's worth taking the time to update the feature extraction layers as well. Let's try unfreezing the convolutional layers and fine-tuning them as well!

Unfreezing Layers - Fine-Tuning a Network Trained with Transfer Learning

Once you've finished re-training the top layers, you can close the deal and be happy with your model. For instance, suppose you got a 95% accuracy - you might not feel the need to go any further. However, why not? If you can squeeze out an additional 1% of accuracy, it might not sound like a lot, but consider the other end of the trade. If your model has 95% accuracy on 100 samples, it misclassified 5 samples. If you up that to 96% accuracy, it misclassifies 4 samples.

That 1% of accuracy translates to a 20% decrease in false classifications ((5 - 4) / 5 = 0.2).

Whatever you can further squeeze out of your model can actually make a significant difference in the number of incorrect classifications. We have a pretty satisfactory ~89% accuracy with our model, but we can most probably squeeze more out of it if we just slightly re-train the feature extractors. Again, the images in CIFAR10 are much smaller than ImageNet images - it's almost as if someone with great eyesight suddenly gained a huge prescription and only saw the world through blurry eyes. The feature maps have to be at least somewhat different!

Let's save the model into a file so we don't lose the progress, and unfreeze/fine-tune a loaded copy, so we don't accidentally mess up the weights of the original one:

new_model.save('effnet_transfer_learning.h5')
loaded_model = keras.models.load_model('effnet_transfer_learning.h5')

Now, we can fiddle around with and change the loaded_model without impacting new_model! To start out, we'll want to make the loaded_model's layers trainable again - i.e. unfreeze them.

Note: Again, if a network uses BatchNormalization (and most do), you'll want to keep it frozen while fine-tuning a network. Since we're not freezing the entire base network anymore, we'll just freeze the BatchNormalization layers instead and allow other layers to be altered.
Let's turn off the BatchNormalization layers so our training doesn't go down the drain:
for layer in loaded_model.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False
    else:
        layer.trainable = True

for index, layer in enumerate(loaded_model.layers):
    print("Layer: {}, Trainable: {}".format(index, layer.trainable))

Let's check if that worked:

Layer: 0, Trainable: True
Layer: 1, Trainable: True
Layer: 2, Trainable: True
Layer: 3, Trainable: True
Layer: 4, Trainable: True
Layer: 5, Trainable: False
Layer: 6, Trainable: True
Layer: 7, Trainable: True
Layer: 8, Trainable: False
...

Awesome! Before we can do anything with the model, to "solidify" the trainability, we have to recompile it. This time around, we'll be using a smaller learning_rate, since we don't want to alter the network much at all, and just want to fine-tune some of the feature extracting capabilities and the new classification layer on top:

optimizer = keras.optimizers.Adam(learning_rate=1e-6, decay=(1e-6/50))

# Recompile after turning to trainable
loaded_model.compile(loss="sparse_categorical_crossentropy", 
                  optimizer=optimizer, 
                  metrics=["accuracy"])

history = loaded_model.fit(train_set, 
                        epochs=50,
                        validation_data=valid_set)

Again, this may take some time - so sip on another hot beverage of your choice while this runs in the background. Once it finishes, it should reach around 93-94% validation accuracy, with a similar result on the test set:

Epoch 1/50
1172/1172 [==============================] - 389s 327ms/step - loss: 0.2031 - accuracy: 0.9316 - val_loss: 0.2916 - val_accuracy: 0.9075
...
Epoch 50/50
1172/1172 [==============================] - 380s 324ms/step - loss: 0.0741 - accuracy: 0.9722 - val_loss: 0.2429 - val_accuracy: 0.9363

We've gotten up to ~94%! This is a huge jump from the perspective of the proportion of misclassifications. Additionally, if you take a look at the learning curves, they don't appear to have plateaued, so we could probably have increased the performance of the model further just by training it for longer:

efficientnetb0 - transfer learning 94% accuracy on CIFAR10

Note: We probably could've seen further performance increases through more training. We ran a training loop for an additional 100 epochs and achieved ~95% accuracy, but note that training for this long takes time. While comparatively cheap next to many other architectures, the 100 epochs took over 10h to train on a home, non-specialized GPU - a GeForce GTX 1060 Super.

Let's evaluate the model and visualize some of the predictions:
loaded_model.evaluate(test_set)

# 157/157 [==============================] - 10s 61ms/step - loss: 0.2149 - accuracy: 0.9384
# [0.21492990851402283, 0.9383999705314636]


fig = plt.figure(figsize=(15, 10))

i = 1
for entry in test_set.take(25):
    # Predict, get the raw Numpy prediction probabilities
    # Reshape entry to the model's expected input shape
    pred = np.argmax(loaded_model.predict(entry[0].numpy()[0].reshape(1, 224, 224, 3)))

    # Get sample image as numpy array
    sample_image = entry[0].numpy()[0]
    # Get associated label
    sample_label = class_names[entry[1].numpy()[0]]
    # Get human label based on the prediction
    prediction_label = class_names[pred]
    ax = fig.add_subplot(5, 5, i)

    # Plot image and sample_label alongside prediction_label
    ax.imshow(np.array(sample_image, np.int32))
    ax.set_title("Actual: %s\nPred: %s" % (sample_label, prediction_label))
    i = i+1

plt.tight_layout()
plt.show()

transfer learning efficientnet-b0 model predictions - 94% accuracy

Awesome, no misclassifications in the first 25 images! We could've just gotten lucky with this batch being classified so well - the model doesn't really have 100% accuracy. Still, take the lack of context into consideration as well - for instance, take image 16 (truck). It's in a forest, brown and elongated, which also fits the description of a horse to a degree, so it's not too surprising that in a blurry, small (224x224) image, a truck could be misclassified as a horse. Another thing that definitely doesn't help is that the truck's tailgate appears to be open, which may look like the neck of a horse as it feeds on grass.

Conclusion

Transfer Learning is the process of transferring already learned knowledge representations from one model to another, when applicable.

This concludes this guide to Transfer Learning for Image Classification with Keras and TensorFlow. We started out by taking a look at what Transfer Learning is and how knowledge representations can be shared between models and architectures. Then, we took a look at some of the most popular and cutting edge publicly released Image Classification models, and piggy-backed on one of them - EfficientNet - to help us classify some of our own data. We looked at how to load and examine pre-trained models, how to work with their layers, how to predict with them and decode the results, as well as how to define your own layers and intertwine them with the existing architecture. Finally, we loaded and preprocessed a dataset, trained our new classification top layers on it, and then unfroze the layers and fine-tuned the model further through several additional epochs.

Reference: stackabuse.com