We continue to look at NLP.
NLP
The IMDB dataset contains 25,000 labeled movie reviews. That's not enough information on its own, so the trick is to use transfer learning!
The idea is that we use a pre-trained model that has been trained to do something different from what we're trying to do. With ImageNet, the model was trained to recognize 1,000 different types of objects, and people have since fine-tuned it for all sorts of different use cases.
We’re going to start with a pre-trained language model, which has learned to predict the next word in the sentence. This model needs to know quite a bit about English and the world.
Previous approaches to NLP used n-grams, which are poor at predicting what the next word will be; a neural net can do it well. And if you train a neural net to predict the next word of a sentence, it has to learn a great deal of information, far more than the single bit (positive or negative) a sentiment label provides. This neural net doesn't have to be specific to movie data, either.
Wikitext-103 is a subset of the largest articles on Wikipedia (approx. 1 billion tokens), so we start by taking Wikipedia and building a language model on it.
When we then fine-tune this model on movie review data, it will learn how movie reviews are written, pick up the names of popular movies, and so on.
Steps
- Take the wikitext103 language model
- Train the wikitext103 language model with imdb data
- Transform the language model into a classifier for imdb
Preparing the data:
# what kind of data, and where to find it
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_from_df(col=2)      # validation flag
        .label_from_df(cols=0)     # column of labels (positive, negative)
        .databunch())
We don’t have to train the Wikitext103 language model. We can use the pre-trained version. So no need to start with random weights.
data_lm = (TextList.from_folder(path)                              # inputs: all the text files in path
           .filter_by_folder(include=['train', 'test', 'unsup'])   # keep only these folders (there may be other temp folders with text files)
           .split_by_rand_pct(0.1)                                  # randomly keep 10% (~10,000 reviews) for validation
           .label_for_lm()                                          # we want a language model, so label accordingly
           .databunch(bs=bs))
data_lm.save('data_lm.pkl')
We randomly split it by 10%. Why (rather than using the predefined split given to us)?
This is specific to transfer learning: Even though our test set has to be held aside, it’s only the labels we have to set aside. We can use the text in the test set to train the language model. Concatenate training and test set together, then just split a smaller validation set.
If you’re doing NLP stuff on Kaggle, you can use all the text you have to train your LM because there’s no reason not to.
Model was saved. We can then load the model:
data_lm = load_data(path, 'data_lm.pkl', bs=bs)
data_lm.show_batch()
We'll put this data in a learner object with a model (wikitext103) loaded with pretrained weights. The following will download the model and store it in ~/.fastai/models.
Instead of a cnn_learner, we're creating a language_model_learner, which is an RNN. They share the same basic structure.
We pass in the data (data_lm), the pretrained model (AWD_LSTM, which was trained on wikitext103), and the amount of regularization (drop_mult). We use a value lower than 1 to reduce dropout and avoid underfitting.
Run learn.lr_find() and then learn.fit_one_cycle() to fine-tune the last few layers. As usual, we then unfreeze() and re-train the whole thing; this takes about 3 hours.
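Roughly, those steps look like this (a sketch; the learning rates, drop_mult value and epoch counts are illustrative, along the lines of the lesson notebook):

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

learn.lr_find()                                  # find a good learning rate
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))    # fine-tune the last few layers

learn.unfreeze()
learn.fit_one_cycle(10, 1e-3, moms=(0.8, 0.7))   # re-train the whole model (the slow part)
learn.save('fine_tuned')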
You can now run learn.predict() on an input text and see what the model spits out as a completion for your input:
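For example (the seed text and word count here are made up):

learn.predict("I liked this movie because", n_words=40, temperature=0.75)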
Save the model and its encoder, the part responsible for creating and updating the hidden state, i.e. the bit that understands the sentence so far. We don't care about the part that actually tries to guess the next word.
learn.save_encoder('fine_tuned_enc')
Now, onto the actual classifier:
We want to make sure the classifier uses the exact same vocab as the language model, via vocab=data_lm.vocab. This is an important step; otherwise the pretrained weights would be meaningless, because the word-to-id mapping would no longer match.
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)   # grab all the text files in path, using the LM vocab
             .split_by_folder(valid='test')                    # split by the train and test folders (this only keeps 'train' and 'test', so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])        # label them with their folder names
             .databunch(bs=bs))
data_clas.save('data_clas.pkl')
We won’t split randomly this time. We’ll split by folder.
We want to label it not for the language model but with classes. And finally create a databunch.
And now, instead of creating a language_model_learner, we'll create a text_classifier_learner().
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
# we want to load in the encoder that was saved earlier
learn.load_encoder('fine_tuned_enc')
Then freeze, run lr_find(), and fit_one_cycle().
So the only time-consuming part is re-training the language model on your domain-specific data (such as IMDB). After that, everything finishes in a few minutes and you can get really good results.
After running fit_one_cycle(), you can use learn.freeze_to(-2) to unfreeze just the last two layers rather than the whole thing. Unfreezing one or two layers, training a bit more, then unfreezing one more layer and training a bit more tends to give good results. Then unfreeze the whole thing and train it a little more.
moms is momentum (a tuple of momentums passed to fit_one_cycle).
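Put together, the gradual unfreezing looks something like this (a sketch; the specific learning rates and the 2.6**4 divisor follow the pattern from the lesson notebook and are illustrative):

learn.fit_one_cycle(1, 2e-2, moms=(0.8, 0.7))                          # train just the head first
learn.freeze_to(-2)                                                    # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))
learn.freeze_to(-3)                                                    # unfreeze one more layer group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))
learn.unfreeze()                                                       # finally unfreeze everything
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3), moms=(0.8, 0.7))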
Accuracy of 94% in just 15 minutes of training!
Finally, you can call learn.predict() on a piece of text to classify it.
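For example (the review text is made up; the call returns the predicted class, its index, and the class probabilities):

learn.predict("I really loved that movie, it was awesome!")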
Tabular data
Neural nets can be useful for this too.
For tabular data, use the fastai.tabular library. Pandas can read data from almost anywhere. We'll use a TabularList DataBlock.
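For example, with the small adult census sample used in the lesson (the path and file name come from that dataset):

from fastai.tabular import *
import pandas as pd

path = untar_data(URLs.ADULT_SAMPLE)   # small census dataset with a salary label
df = pd.read_csv(path/'adult.csv')
df.head()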
The independent variables are what we use to make predictions: cont_names are the continuous variables and cat_names are the categorical variables.
The dependent variable is what we’re trying to predict (salary)
dep_var = 'salary'
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [FillMissing, Categorify, Normalize]
We have something very similar to transforms in computer vision. In tabular data, instead of transforms we have processors (procs) that pre-process the dataframe. FillMissing looks for missing values and deals with them: it replaces each one with the median and adds a column saying whether the value was missing or not.
Categorify finds the categorical variables and turns them into pandas categories.
Normalize takes the continuous variables, subtracts the mean and divides by the standard deviation, so they end up with mean 0 and standard deviation 1. Whatever you do to the training set, you need to do the same to the test set.
Initialize the test set:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
Initialize the databunch from the dataframe
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
.split_by_idx(list(range(800,1000)))
.label_from_df(cols=dep_var)
.add_test(test)
.databunch())
split_by_idx will split the data into a training and a validation set. Then with label_from_df we add labels by specifying the dependent variable.
Inspect the data with data.show_batch()
The learner we'll use is tabular_learner. Similar to what we've seen before, we give it the data, some architecture parameters, and the metrics to log.
learn = tabular_learner(data, layers=[200,100], metrics=accuracy)
Fit the learner
learn.fit(1, 1e-2)
Inference
Select a row in the data and ask the learner to make a prediction:
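For example (row 0 is an arbitrary choice):

row = df.iloc[0]
learn.predict(row)   # returns the predicted class, its index, and the class probabilities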
Collab Filtering
i.e. having two columns: users and the movies they watched. A good place for datasets is MovieLens. We'll use a small dataset.
Now we can create a CollabDataBunch using the DataBlock API, and a collab_learner. Then we just run learn.fit_one_cycle().
We need to tell it n_factors (architecture information) and the range of scores (y_range).
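A minimal sketch with the small MovieLens sample (the n_factors and y_range values are typical of the lesson; the learning rate is illustrative):

from fastai.collab import *
import pandas as pd

path = untar_data(URLs.ML_SAMPLE)            # small MovieLens sample
ratings = pd.read_csv(path/'ratings.csv')

data = CollabDataBunch.from_df(ratings, seed=42)
learn = collab_learner(data, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(3, 5e-3)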
Now we can take a user id and movie id and guess whether the user will like that movie
Embeddings
Using an Excel sheet, we created a matrix of random numbers associated with each userId and another associated with each movieId: 5 random numbers for each user, and 5 for each movie. You can then take the dot product of a user's vector and a movie's vector (the green box in the spreadsheet).
This is the basic starting point of a neural net: you take the matrix multiplication of 2 matrices and that’s what your first layer always is. So you just have to come up with the 2 matrices you can multiply. You need a vector for a user (matrix for all users) and a vector for movieId (matrix for all movies). Now we can use gradient descent to try and make the random numbers give us results that are closer to what we want.
We've set this up as a linear model; now we need a loss function. We have the actual ratings from the labeled data, so we can calculate the difference between the computed rating and the actual rating, and capture the MSE, or the square root of the MSE (RMSE).
Now that we have loss values, we can use gradient descent to modify our weight matrices to make our loss smaller.
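In code, the spreadsheet version amounts to something like this (the sizes and seed are just for illustration):

import numpy as np

rng = np.random.default_rng(42)
n_users, n_movies, n_factors = 15, 15, 5

user_factors  = rng.standard_normal((n_users, n_factors))    # 5 random numbers per user
movie_factors = rng.standard_normal((n_movies, n_factors))   # 5 random numbers per movie

def predict(user, movie):
    # the "green box": dot product of a user's vector and a movie's vector
    return user_factors[user] @ movie_factors[movie]

def mse(preds, targets):
    # loss: mean squared difference between computed and actual ratings
    return ((preds - targets) ** 2).mean()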
Use excel solver — you put in which cell represents the loss function, which cells contain your variables and hit Solve
This is a very simple way of creating a neural network — just a single layer — using gradient descent to solve a collaborative filtering problem.
Notice that earlier we just called collab_learner(); this spreadsheet exercise is a glimpse at its underlying implementation.
An embedding is a matrix of weights, specifically one you can index into as an array and grab a single vector out of. We have one embedding matrix for users and one for movies.
Prediction is the dot product. Plus a bias term for movies and a bias term for user id.
When we set up the model, we set up the embedding matrix for the users and the embedding matrix for the items. We also set up the bias vector for users and items.
We take the dot product, add the biases, map the result into the range between the min and max score, and return it.
But when you get the dot product and add the two biases, that can give you a large range — but we know that we want a number between 0 and 5 (ratings). So what if we mapped that number line to a function?
The shape of this function is a sigmoid. The sigmoid itself always outputs a number between 0 and 1; scaled as described next, the result always stays in the 0 to 5 range.
The last tweak: we take the result of the dot product plus the biases, put it through the sigmoid, then multiply by (max - min) and add min. That gives you something between the min score and the max score. Applying this sigmoid makes a big difference.
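Here is a minimal PyTorch sketch of that model: two embedding matrices, two bias vectors, a dot product, and a scaled sigmoid (the class and argument names are illustrative, not fastai's exact implementation):

import torch
import torch.nn as nn

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.u_weight = nn.Embedding(n_users, n_factors)    # embedding matrix for users
        self.m_weight = nn.Embedding(n_movies, n_factors)   # embedding matrix for movies
        self.u_bias = nn.Embedding(n_users, 1)               # bias term per user
        self.m_bias = nn.Embedding(n_movies, 1)              # bias term per movie
        self.y_range = y_range

    def forward(self, users, movies):
        dot = (self.u_weight(users) * self.m_weight(movies)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.m_bias(movies).squeeze(1)
        lo, hi = self.y_range
        # sigmoid squashes to (0, 1); multiply by (max - min) and add min
        return torch.sigmoid(res) * (hi - lo) + lo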
Essential concepts
Imagine you have a vector of size 3 (with values 10, 20, 30 which are pixels). You then have a weight matrix. You decide how many columns you want in that weight matrix. Say you have 5 columns, so 3x5 matrix. Initially, this weight matrix will contain random numbers.
When you multiply the vector of size 3 by the weight matrix of 3x5, you get a 1x5 matrix.
Next, this vector of size 5 (1x5 matrix) goes through an activation function such as ReLU, which is just max(0, x) and spits out a new vector which is the same size.
Next, we multiply the vector by another matrix. Doesn’t matter how many columns but needs to be 5 rows because the vector has 5 items. Say it’s 8 columns, so 5x8.
The resulting matrix of 1x5 times 5x8 will be a 1x8 matrix (vector of size 8)
We put the 1x8 matrix (vector of size 8) through ReLU again.
If you’re doing digit recognition, you want the final output to be 10 in size because there are 10 digits. So we’ll multiply the vector of size 8 by a 8x10 matrix to get a vector of size 10 (matrix of 1x10)
If the number we're trying to predict is 3, then the target is a vector with a 1 in the position for 3 and 0s elsewhere. Our neural net runs along, going weight matrix → ReLU, weight matrix → ReLU, until the final output. We compare the final output (a score for each digit) with the labeled data to see how close they are, using some loss function such as MSE.
Next we update the weight matrices. It’s common to have an activation function like sigmoid (not ReLU) as your last layer.
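A numpy sketch of exactly that sequence (the sizes match the example above; random weights stand in for parameters that would normally be learned):

import numpy as np

rng = np.random.default_rng(0)

x  = np.array([10., 20., 30.])      # input vector of size 3 (e.g. pixel values)
w1 = rng.standard_normal((3, 5))    # first weight matrix, random to start
w2 = rng.standard_normal((5, 8))    # second weight matrix
w3 = rng.standard_normal((8, 10))   # final weight matrix: 10 outputs, one per digit

def relu(a):
    return np.maximum(0., a)        # the activation function: max(0, x)

a1  = relu(x @ w1)    # activations: vector of size 5
a2  = relu(a1 @ w2)   # activations: vector of size 8
out = a2 @ w3         # final output: vector of size 10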
Terminology
The matrices are known as parameters. We sometimes refer to them as weights, though strictly speaking parameters include biases too.
Each matrix product gives us a vector of numbers. The vectors calculated by the weight-matrix multiplications, as well as the vectors coming out of ReLU (an activation function), are called activations.
Parameters are numbers that are stored; activations are results that are calculated (by a matrix multiplication or by an activation function).
All the steps that do a calculation are called layers; every layer produces a set of activations. The very first input is the input layer, and the outputs of the neural net are just the activations of the final layer.