Finetuning a pretrained model architecture vs. training from scratch

I recall a case where I helped with supervised model training. The input was a 32x32 image and the output was one of 7 classes. Our dataset had only around 140 images, so we used augmentation heavily.

When we took a resnet32 architecture and trained it from scratch, we got 1.41 test loss and 0.97 training loss.

When we started from a resnet32 pretrained on CIFAR-10 and finetuned it, we got 0.51 test loss and 0.09 training loss. This is a huge improvement. It is worth mentioning that at this stage we kept the last fully connected layer of the CIFAR-10 resnet32 intact; our dataset had only 7 labels rather than 10, but since those labels still map to valid output indices, this did not matter. In addition, finetuning converged twice as fast as training from scratch. CIFAR-10 has 60,000 images, far larger than our 140-image dataset. So when the dataset is small, it is well worth trying a pretrained model.
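A minimal PyTorch sketch of this setup, keeping the pretrained 10-way head as-is. The names `cifar_resnet32`, `cifar10_resnet32.pth`, and `small_dataset` are placeholders for whatever model builder, checkpoint, and dataset you actually have; the hyperparameters are illustrative, not the ones we used.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Hypothetical helpers: cifar_resnet32() builds the CIFAR-style resnet32,
# and small_dataset is the ~140-image, 7-class dataset (placeholders).
model = cifar_resnet32()
model.load_state_dict(torch.load("cifar10_resnet32.pth"))  # CIFAR-10 pretrained weights

# Keep the original 10-way head: our labels (0..6) are still valid indices,
# so the extra 3 output neurons simply never win.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
loader = DataLoader(small_dataset, batch_size=32, shuffle=True)

model.train()
for epoch in range(20):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```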

Replacing the last linear layer of a classification model

Often, when taking a pretrained model architecture, such as a resnet, the best practice is to replace the last linear layer with a new one that has the number of output classes your task needs. However, when you simply replace the layer, you lose that layer's pretrained weights. Does it matter? Is keeping the last layer's weights as a starting point important?
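For reference, the standard replacement looks something like the sketch below, assuming a torchvision-style resnet where the final linear layer is named `model.fc` (adjust the attribute name if your implementation calls it something else).

```python
import torch.nn as nn

# Swap the pretrained 10-way head for a fresh, randomly initialized 7-way head.
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, 7)
```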

Let’s take the problem and dataset from the previous section and see what happens.

When we used the pretrained CIFAR-10 resnet32 with its 10-class output on our 7-class dataset, we got 0.51 test loss (0.09 train loss).

When we replaced the last layer with a freshly initialized linear layer with 7 output classes, we got 0.58 test loss (0.25 train loss).

So you can see that the performance is lower.

When we replaced the last layer with a linear layer with 7 output classes while preserving the weights of the relevant output neurons, we got 0.49 test loss (0.09 train loss). So we even improved the performance a bit, and the model has slightly fewer parameters.
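A sketch of that weight-preserving replacement, again assuming the head is named `model.fc`. The index list is an assumption: it presumes the 7 dataset labels correspond to the first 7 CIFAR-10 class indices, so adjust it to whatever label mapping you actually use.

```python
import torch
import torch.nn as nn

old_fc = model.fc                                  # pretrained 10-way head
new_fc = nn.Linear(old_fc.in_features, 7)          # new 7-way head

# Copy the rows (and biases) of the output neurons we keep, so the new head
# starts from the pretrained weights instead of a random init.
keep = [0, 1, 2, 3, 4, 5, 6]                       # placeholder label mapping
with torch.no_grad():
    new_fc.weight.copy_(old_fc.weight[keep])
    new_fc.bias.copy_(old_fc.bias[keep])

model.fc = new_fc
```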

To conclude, in this case, keeping the pretrained weights is important even when we need to change the last layer.