Fast.ai Lesson 5 of 7: Backprop, Neural net from scratch

Notes from Practical Deep Learning for Coders 2019 Lesson 5 (Part 1)

Julia Wu
13 min read · Aug 12, 2019

Other lessons: Lesson 1 / Lesson 2 / Lesson 3 / Lesson 4 / Lesson 6 / Lesson 7

Quick links: Fast.ai course page / Lecture / Jupyter Notebooks

Reviewing some concepts from last lecture — remember that activation functions are element-wise. The function is applied to each element in the input.

So if the input to an activation function is a 20-element long vector, the output will be of the same size. ReLU is the main one we’ve looked at.
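For example, a quick PyTorch illustration (my own, not from the lecture notebook):

import torch
import torch.nn.functional as F

x = torch.randn(20)      # a 20-element activation vector
out = F.relu(x)          # ReLU is applied to each element independently
print(out.shape)         # torch.Size([20]) -- same size as the input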

Universal Approximation Theorem: a network with big enough weight matrices (and a nonlinearity between them) can approximate any arbitrarily complex function to any arbitrary level of accuracy.

The piece where we take the gradients of the loss function (computed between the final activations and the targets) and propagate them back through the layers to update the weights is called backpropagation.

Fine-tuning

What happens when we take a resnet34 and do transfer learning?

The resnet34 from ImageNet has a final weight matrix of 1000 columns, because the ImageNet task was to classify images into one of 1000 classes (a probability for each class).

When you do transfer learning, you usually don't need 1000 classes, and they probably aren't the same classes anyway. So we throw out that weight matrix: create_cnn() actually deletes it and instead puts in two new weight matrices, with a ReLU in between.

The second matrix is as big as you want it to be. If you're doing classification, it's how many classes you have.

We need to train the new weight matrices because initially they'll be full of random numbers. But the other layers are not new; they're already good at something thanks to the network's previous training.

So we'll apply freeze() on all the other layers. We're asking fast.ai and pytorch to NOT backpropagate the gradients into those layers (parameters = parameters − learning rate × gradient). Only update the newer layers. It will make things faster due to fewer calculations and take up less memory, but most importantly it's not going to change the weights that are already better than random.

AFTER training the new layers, we unfreeze() and train the whole thing. But the newest layers will still need more training than the ones at the start. So we split the model into a few sections and give different parts of the model different learning rates. One part (earlier) might have 1e-5, another part (later) might have 1e-3. Another thing to note is that if the model is already doing pretty well, a high learning rate could make it less accurate. This process is called discriminative learning rates.

Any time you have a fit() function, you can pass in a learning rate. It can be a single number like 1e-3 (all layers get the same learning rate), a slice with a single number like slice(1e-3) (the final layer group gets 1e-3 and all the other layers get 1e-3/3), or a slice with 2 numbers like slice(1e-5, 1e-3) (the final layers get 1e-3, the first layers get 1e-5, and the layers in between get learning rates spread equally between the two). We give a different learning rate to each layer group.
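For example, with a hypothetical fine-tuning run (the epoch counts here are arbitrary, and learn is assumed to be a create_cnn() learner):

learn.fit_one_cycle(4, 1e-3)               # one number: every layer group gets 1e-3
learn.fit_one_cycle(4, slice(1e-3))        # final group gets 1e-3, earlier groups get 1e-3/3
learn.unfreeze()
learn.fit_one_cycle(4, slice(1e-5, 1e-3))  # first group 1e-5, last group 1e-3, rest spread in between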

Going back to the excel sheet from last lesson, these are the outputs after running the solver:

The mean squared error is 0.39, which for rating predictions on a 0 to 5 scale is a fairly small average error.

Embedding matrices

Let’s put the earlier worksheet aside and look at another one. We copy over the weight matrices from the earlier worksheet.

One-hot encoding

For each user, there's the index, the user id and a weight vector of 5 weights.

Same with movies:

The original data was organized like this, where each rating had the userId, movieId, user index, movie index

Now we’re going to replace user id 1 with this vector. We have 15 users. User #1 will have a 1 in the first column and 0s in the remaining 14. User #2 will have a 1 in the second column and 0s in all the others.

Same with movies. Movie #14 will have a 1 in the 14th column and 0 elsewhere. The overall data looks like this:

So the first row is showing user #1 gave a rating for movie #14, second row showing user #2 gave a rating for movie #14, etc.

This is a form of input pre-processing.

Now, to get the user activations in the middle: we’ll take the input user matrix and multiply by the weight matrix. This works because the input user matrix has 15 columns, and the weight matrix has 15 rows and 5 columns (1x15 by 15x5). The resulting matrix is 1x5, which is each row in the user activations column.

We do the same for movies:

Finally, for each rating we take the dot product of the user activations and the movie activations, which gives the predicted rating.

We can then compute the squared error for each prediction and the average loss, which is the 0.39 we saw earlier.

The final version:

It’s the same weight matrices, same userId, movieId and rating mapping.

But this time we have the user embedding, which is the activation looked up by the corresponding user index (i.e. user index 1 always maps to the embedding [0.21, 1.61, 2.89, -1.26, 0.82]), without the one-hot vector of one 1 and 14 zeros. This approach uses an array lookup instead of one-hot encoding, because in the one-hot case the matrix multiply is sparse (mostly 0s) and wasteful.

Looking something up in an array is mathematically identical to doing a matrix product by a one-hot encoded matrix.
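Here's a small sketch (mine, not from the worksheet) showing that equivalence with a 15-user, 5-factor weight matrix:

import torch

weights = torch.randn(15, 5)       # user weight matrix: 15 users, 5 factors

one_hot = torch.zeros(15)          # one-hot vector for user #1
one_hot[0] = 1.

via_matmul = one_hot @ weights     # 1x15 times 15x5 -> 5 activations
via_lookup = weights[0]            # plain array lookup of row 0

print(torch.allclose(via_matmul, via_lookup))  # True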

Bias

We'll be able to add more information about the data by including bias. Example given in the lecture:

No one's going to like Battlefield Earth. It's not a good movie even though it has John Travolta in it. So how are we going to deal with that? Because there's this feature called "I like John Travolta movies" and this feature called "this movie has John Travolta", so now the model thinks you're going to like the movie. But we need some way to say "unless it's Battlefield Earth" (or unless you're a Scientologist, either one). So how do we do that? We need to add in bias.

We have the same data, but we’re going to tack on an additional row which represents the bias. Now, each movie can have an overall “this is a great movie” or “this is not a great movie”. So in the field for the dot product there will also be a bias.
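Conceptually, the prediction for a single (user, movie) pair now looks like this (a sketch of the idea, assuming both a user and a movie bias term; not the spreadsheet's exact formulas):

import torch

user_emb, movie_emb = torch.randn(5), torch.randn(5)    # 5 factors each
user_bias, movie_bias = torch.randn(1), torch.randn(1)

# dot product of the two embeddings, plus a per-user and a per-movie bias
pred = (user_emb * movie_emb).sum() + user_bias + movie_bias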

The resulting MSE is 0.32, which is less than the previous 0.39. This is a slightly better model (gives us more flexibility) that yields a better result.

Movielens 100k

Data setup for this jupyter notebook section: Had to download the dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip to the folder /home/jupyter/.fastai/data/

You can do that through a terminal ssh'd into the GCP VM.
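If you'd rather do it from Python than from the terminal, a minimal sketch using only the standard library (same URL and folder as above):

from pathlib import Path
import urllib.request, zipfile

data_dir = Path('/home/jupyter/.fastai/data')
zip_path = data_dir/'ml-100k.zip'
urllib.request.urlretrieve('http://files.grouplens.org/datasets/movielens/ml-100k.zip', zip_path)
zipfile.ZipFile(zip_path).extractall(data_dir)   # unpacks into data_dir/ml-100k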

The pd.read_csv() call takes parameters like delimiter, encoding, etc. for this particular dataset.

We want the movie title directly in our ratings, so we can use ratings.merge() which is a pandas function.
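Roughly what that looks like (a sketch; the exact column names in the lesson notebook may differ, and path is assumed to point at the extracted ml-100k folder):

import pandas as pd

user, item, title = 'userId', 'movieId', 'title'
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=[user, item, 'rating', 'timestamp'])
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', header=None)
movies = movies[[0, 1]].rename(columns={0: item, 1: title})   # keep only id and title
rating_movie = ratings.merge(movies)                          # merge on movieId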

We use a CollabDataBunch for the dataset. DataBunch objects support show_batch() so you can inspect the data after loading

data = CollabDataBunch.from_df(rating_movie, seed=42, valid_pct=0.1, item_name=title)

Setting the y_range is a trick we can use to control the range of the output, and we want that to be from 0 to 5.5. This can help the neural network make predictions in the right range. Because sigmoids have an asymptote on either end of the range, we want the minimum to be slightly less than the actual minimum and the maximum to be slightly more. Hence 0–5.5
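In code that's just:

y_range = [0, 5.5]   # a bit wider than the true rating range, to leave room for the sigmoid's asymptotes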

The wd or weight decay is another trick to improve accuracy.

The n_factors parameter is the width of the embedding matrix.

learn = collab_learner(data, n_factors=40, y_range=y_range, wd=1e-1)

As usual, use the learning rate finding process and use that for fit_one_cycle:

learn.lr_find()
learn.recorder.plot(skip_end=15)

The first parameter for fit_one_cycle is the number of epochs. The second one means we're using a learning rate of 5e-3 for all layers.

learn.fit_one_cycle(5, 5e-3)

We're getting an MSE of 0.81, which is pretty good compared to the benchmark value of 0.83.

Save the model with learn.save('dotprod')

How do we make the predictions less biased?

Let’s pick out some popular movies based on rating counts:

g = rating_movie.groupby(title)['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_movies[:10]

We can then take our learner that we trained and ask it for the bias of the items listed here.

Movie bias

We can ask the learner to provide the bias of the top movies. The is_item parameter means we want the bias on the movie items, not the users

movie_bias = learn.bias(top_movies, is_item=True)
movie_bias.shape

In collaborative filtering, most things are users or items

We can also compute the average rating for each title. Then we can zip through each movie along with its bias, grab the rating, bias and movie title, and sort them by the bias:
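Roughly, a sketch (the notebook's exact variable names may differ):

mean_ratings = rating_movie.groupby(title)['rating'].mean()
movie_ratings = [(bias, movie, mean_ratings.loc[movie])
                 for movie, bias in zip(top_movies, movie_bias)]
sorted(movie_ratings, key=lambda o: o[0])[:15]   # the 15 movies with the most negative bias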

The movies above are the lowest rated movies. If we do reverse=True, we can get the most highly rated movies.

We can also grab the weights in addition to the biases.

movie_w = learn.weight(top_movies, is_item=True)
movie_w.shape

We’re going to grab the weights for the items (aka movies). We asked for a width of 40 back when we defined n_factors

40 is a bit large, so we’ll narrow it down to 3.

movie_pca = movie_w.pca(3)
movie_pca.shape

pca stands for principal components analysis. It’s a simple linear transformation that takes an input matrix and tries to find a smaller number of columns that cover a lot of the space of the original matrix.

Taking the layers of a neural net and running them through PCA is often a good idea, because you frequently have more activations than you need, and reducing them makes the layer easier to interpret.

So let’s look at the movies sorted by factor 0 (fac0)
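Approximately (a sketch; variable names may not exactly match the notebook):

fac0, fac1, fac2 = movie_pca.t()                            # the three principal components
movie_comp = [(f, m) for f, m in zip(fac0, top_movies)]
sorted(movie_comp, key=lambda o: o[0], reverse=True)[:10]   # movies highest on factor 0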

The movies that score highest on this factor seem to be connoisseur-level films.

By factor 1 (fac1):

These seem to be big hits that you can watch with the family.

Hence these are all ways to extract features and interpret what the model has learned for specific factors.

There’s one more collab_learner parameter to discuss: wd or weight decay

learn = collab_learner(data, n_factors=40, y_range=y_range, wd=1e-1)

Weight decay is a type of regularization:

Models with lots of parameters tend to overfit. But we still want to be able to use many parameters because it could lead to a better representation of real data. The solution for this is to penalize irregularity.

Let's sum up the squares of the parameters and add that sum to the loss function. To control how much this penalty matters, we'll multiply it by some number we choose. That number is wd. So we take our loss function and add to it the sum of the squares of the parameters multiplied by wd. Generally, it should be 0.1.

How weights are calculated: Weight at time t is the weight at time t-1 minus learning rate multiplied by derivative of loss function with respect to weights at time t-1

What’s our loss? Our loss is some function of our independent variables x and our weights. We’re using MSE loss function which gets the difference between predictions (y_hat) and labels (y)

And our predictions y_hat are generated from running some model m on the inputs (x) and weights (w)

Now we’re going to add a weight decay wd (0.1) times the sum of weights squared
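Putting that together as formulas (my notation, just restating the description above):

L(x, w) = \mathrm{MSE}(m(x, w), y) + wd \cdot \sum_i w_i^2

w_t = w_{t-1} - \mathrm{lr} \cdot \frac{\partial L(x, w_{t-1})}{\partial w_{t-1}}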

MNIST SGD

Again, we manually download the pickled MNIST dataset and load it into the right path.

Show the image and the shape:

There are 50,000 rows and 784 columns. Each row is a flattened 28x28 pixel image. So if we reshape one of them and plot it, we can see the digit.

Currently they are numpy arrays but we need them to be tensors so we just use map(torch.tensor)

x_train,y_train,x_valid,y_valid = map(torch.tensor, (x_train,y_train,x_valid,y_valid))
n,c = x_train.shape
x_train.shape, y_train.min(), y_train.max()

We get: (torch.Size([50000, 784]), tensor(0), tensor(9))

In lesson2-sgd, we created a column of ones to add bias, but we don't have to do that this time; we'll have pytorch handle that. We also wrote our own mse() function and matrix multiplication procedure, but now we'll have pytorch handle all of that, along with the mini-batches.
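The data side of that might look roughly like this (a sketch; the exact call in the lesson notebook may differ slightly):

from torch.utils.data import TensorDataset

bs = 64
train_ds = TensorDataset(x_train, y_train)
valid_ds = TensorDataset(x_valid, y_valid)
data = DataBunch.create(train_ds, valid_ds, bs=bs)   # fastai serves the mini-batches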

We’ll create a logistic regression model that subclasses nn.Module

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10, bias=True)

    def forward(self, xb): return self.lin(xb)

A one-layer neural net with no hidden layers and no nonlinearities, i.e. a logistic regression. We want to put the weight matrices on the GPU, which is done with .cuda() (model = Mnist_Logistic().cuda()).

Our model has been created! We can get the shape of all parameters of our model with

[p.shape for p in model.parameters()]

So what are these two parameters?

The [10, 784] matrix is the thing that's going to take in a 784-dimensional input and spit out a 10-dimensional output. Our input is 784-dimensional and we need something that can give us probabilities for 10 outputs.

Then we need 10 activations which we want to add bias to. So we have this second vector of length 10.

The model has exactly the stuff we need to do our ax+b.

We’ll grab a learning rate of lr=2e-2 and a loss function of CrossEntropyLoss
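In code:

lr = 2e-2
loss_func = nn.CrossEntropyLoss()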

In our update function, we'll call our model(x) as if it were a function (instead of a@x from lesson 2) to get our y_hat.

def update(x,y,lr):
    wd = 1e-5
    y_hat = model(x)
    # weight decay
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    # add to regular loss
    loss = loss_func(y_hat, y) + w2*wd
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)
            p.grad.zero_()
    return loss.item()

We call our loss_func() to get our loss, and we can loop through the parameters.

We also have a w2. For each p in model.parameters we add to w2 the sum of p**2, which is the sum of squared weights, and we multiply that by wd which is 1e-5

So weight decay is really just a simple value.

Run the update function with list comprehension on the data:

losses = [update(x,y,lr) for x,y in data.train_dl]

Generalizing, the gradient of wd*(w**2) with respect to w is just 2wd*w. We can drop the 2 without losing generality

All that wd does is it subtracts some constant times the weights every time we do a batch. That’s why it’s called weight decay!

L2 regularization (adding wd * w² to the loss) and weight decay (subtracting wd * w directly in the weight update) are pretty much mathematically identical.

We can replace Mnist_Logistic with Mnist_NN and build a neural net from scratch.

class Mnist_NN(nn.Module):
    def __init__(self):
        super().__init__()
        # use 2 linear layers
        # first layer has output of 50
        self.lin1 = nn.Linear(784, 50, bias=True)
        # second layer has input of 50 and output of 10 (since it's the number of classes we're predicting)
        self.lin2 = nn.Linear(50, 10, bias=True)

    def forward(self, xb):
        # first layer
        x = self.lin1(xb)
        # calculate relu
        x = F.relu(x)
        return self.lin2(x) # second layer

Once you have something that can do gradient descent, you can try different models. You can start to add more pytorch stuff

def update(x,y,lr):
    # take model.parameters() and optimize them using Adam (can also use SGD)
    opt = optim.Adam(model.parameters(), lr) # you can also pass in a wd
    y_hat = model(x)
    loss = loss_func(y_hat, y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

If you change the optimizer, the loss curve will look different (Adam, for example, converges much faster than plain SGD).

Optimizers: Adam, SGD, RMSProp

In the spreadsheet, the x's and y's are randomly generated.

y = ax + b where a is 2 and b is 30

Start by picking an intercept (b) and slope (a) kind of arbitrarily.

So gradient descent is just taking our current values of the slope and intercept and subtracting the learning rate times the derivative. That gives us a new (a) and (b).

And then we copy that intercept and that slope to the next row, and do it again. And do it lots of times, and at the end we’ve done one epoch.
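The same loop in PyTorch terms might look something like this (my sketch in the spirit of lesson2-sgd, not the Excel formulas):

import torch

x = torch.rand(100) * 10
y = 2 * x + 30 + torch.randn(100)         # true a = 2, true b = 30, plus noise

a = torch.tensor(1., requires_grad=True)  # arbitrary starting slope
b = torch.tensor(1., requires_grad=True)  # arbitrary starting intercept
lr = 1e-2

for epoch in range(100):
    loss = ((a * x + b - y) ** 2).mean()  # MSE
    loss.backward()
    with torch.no_grad():
        a -= lr * a.grad; a.grad.zero_()
        b -= lr * b.grad; b.grad.zero_()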

# regular SGD
opt = optim.SGD(model.parameters(), lr)
# with momentum
opt = optim.SGD(model.parameters(), lr, momentum=0.9)

We can use Adam or SGD which allows you to apply momentum (take derivative, multiply by 0.1 then take previous update and multiply by 0.9 and add them together)

Momentum of 0.9 is very common

Exponentially Weighted Moving Average: an average in which recent observations get the most weight and older observations count less and less, decaying exponentially.

The step at time t is S_t = alpha * gradient_t + (1 - alpha) * S_{t-1}, i.e. some number alpha times the actual gradient plus (1 - alpha) times whatever you had last time, S_{t-1}.

RMSProp: very similar to momentum, but instead it's an exponentially weighted moving average not of the gradient updates but of the gradient squared (in the spreadsheet, that's cell F8 squared).

Adam keeps track of the exponentially weighted moving average of the gradient squared (RMSProp) and also keeps track of the exponentially weighted moving average of the steps (momentum).
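Written out (simplified, skipping Adam's bias-correction terms; the notation is mine):

\text{momentum:} \quad v_t = \beta\, v_{t-1} + (1-\beta)\, g_t, \qquad w_t = w_{t-1} - \mathrm{lr} \cdot v_t

\text{RMSProp:} \quad s_t = \alpha\, s_{t-1} + (1-\alpha)\, g_t^2, \qquad w_t = w_{t-1} - \mathrm{lr} \cdot g_t / \sqrt{s_t + \epsilon}

\text{Adam:} \quad w_t = w_{t-1} - \mathrm{lr} \cdot v_t / \sqrt{s_t + \epsilon}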
