Elegant implementation of Naive Bayes in just 10 lines of code

A tutorial on how to use a few tricks to implement a Naive Bayes model in just a few lines of code.
machine learning, nlp

Author: Augustas Macijauskas

Published: February 1, 2021

TLDR

This article is adapted from fast.ai’s Machine Learning for Coders course, specifically lesson 10. I would highly recommend checking out this and other fast.ai courses; they are full of tips on how to do practical machine learning and deep learning.

We will be building a naive Bayes classifier in just 10 lines of code that will get over 98% accuracy on a spam message filtering task.

We will take a top-down approach: first we will build the model, and then we will dig deeper into the theory of how it works.

Imports

Toggle the cells below if you want to see which imports are being made.

Code
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

Data

We are going to use a dataset with 5572 messages in it. Unfortunately, I am unable to provide full access to this dataset, so you’ll have to take my word for what the contents of the dataset are.

A label of 0 means the message is not spam, and a label of 1 means it is spam.

message label
0 Go until jurong point, crazy.. Available only ... 0
1 Ok lar... Joking wif u oni... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... 1
3 U dun say so early hor... U c already then say... 0
4 Nah I don't think he goes to usf, he lives aro... 0
... ... ...
5567 This is the 2nd time we have tried 2 contact u... 1
5568 Will Ì_ b going to esplanade fr home? 0
5569 Pity, * was in mood for that. So...any other s... 0
5570 The guy did some bitching but I acted like i'd... 0
5571 Rofl. Its true to its name 0

5572 rows × 2 columns

The numbers of non-spam and spam messages are respectively:

(4825, 747)
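If you want to follow along with a similar public dataset (for example, the well-known SMS spam collection), a minimal loading sketch might look like the one below. The file name and column names are assumptions on my part, so adjust them to whatever your copy of the data uses; it relies on the pandas import from above.

# Hypothetical file name and column layout -- adjust to your own copy of the data.
df = pd.read_csv("spam.csv", encoding="latin-1")[["message", "label"]]

print(df.shape)                    # should be something like (5572, 2)
print(df["label"].value_counts())  # number of non-spam (0) and spam (1) messages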

An example spam message is:

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

Let’s split the data into train and validation sets, with 20% of the data going into the validation set:

train_ratio = 0.8
train_length = int(len(df) * train_ratio)

# The first 80% of the messages go into the training set, the rest into the validation set.
X_train = np.squeeze(df.drop(columns=["label"])[:train_length].values)
y_train = df["label"][:train_length].values
X_val = np.squeeze(df.drop(columns=["label"])[train_length:].values)
y_val = df["label"][train_length:].values
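As a quick sanity check (not part of the original 10 lines, just a sketch), it can be worth confirming the split sizes and the fraction of spam in each split. Note that the split above is sequential rather than shuffled, which works fine here but is something to keep in mind with your own data.

# Sizes of the sequential 80/20 split and the spam fraction in each part.
print(len(X_train), len(X_val))      # roughly 4457 and 1115 for this dataset
print(y_train.mean(), y_val.mean())  # fraction of spam messages in each split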

Building the model

Below I am going to provide the code for the model without too much explanation, and then in the next section I’ll discuss what is happening under the hood.

vectorizer = CountVectorizer(max_df=0.1)
train_term_doc = vectorizer.fit_transform(X_train)
val_term_doc = vectorizer.transform(X_val)

Let’s see what our example spam message got turned into:

X_train[2]
"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

has become:

'free entry wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry question std txt rate apply 08452810075over18'

As you can see, some words did not make it into the vocabulary because they were too common (e.g. the word “in”), and punctuation and apostrophes were also left out. This is not ideal, and it would probably be a good idea to try a better tokenizer. Although we won’t do that here, you can try finding one yourself; a good place to start would be here.
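By the way, this is not necessarily how the cleaned-up string above was produced, but one way to approximate it yourself is to run the fitted vectorizer’s analyzer over the raw message and keep only the tokens that made it into the vocabulary:

# Tokenize the raw message the same way the vectorizer does, then keep
# only tokens that are in the learned vocabulary (i.e. not dropped by max_df).
analyzer = vectorizer.build_analyzer()
kept = [tok for tok in analyzer(X_train[2]) if tok in vectorizer.vocabulary_]
print(" ".join(kept))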

Since we’ll be using Bayes’ theorem, note that below p stands for the probabilities of the features given spam, q stands for the probabilities of the features given non-spam, and b encodes the log of the ratio of the class probabilities (what Bayesians call the priors).

p = train_term_doc[y_train == 1].sum(0) + 1
q = train_term_doc[y_train == 0].sum(0) + 1
ratio = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y_train == 1).sum() / (y_train == 0).sum())

Let’s check what the accuracy is:

pre_preds = val_term_doc @ ratio.T + b
preds = pre_preds.T > 0
(preds == y_val).mean()
0.9856502242152466

Great! We got over 98% accuracy, and even though this dataset is not particularly hard, I think it is cool that we managed to do it in just 10 lines of code. Let’s check confusion matrices too:

array([[962,   8],
       [  8, 137]])
array([[0.99175258, 0.00824742],
       [0.05517241, 0.94482759]])
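The two arrays above are the raw and the row-normalized confusion matrices; they can be computed with sklearn’s confusion_matrix, which is used again in the n-grams section below:

confusion_matrix(y_val, preds.T, normalize=None)     # raw counts
confusion_matrix(y_val, preds.T, normalize="true")   # normalized over the true labels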

We see that the model is slightly less accurate on spam messages, just over 94%, which means that if we used this model in real life, it would let some spam messages slip through. Regardless, it’s doing a great job overall. Let’s now try to understand its inner workings.

What just happened?!?

As you can see, our naive Bayes classifier got over 98% accuracy on this dataset in just 10 lines of code. But how does it actually work? Let’s look at everything piece by piece.

1. The CountVectorizer

A naive Bayes classifier uses what is called a bag-of-words approach. This simply means that we disregard any relationships between words and just look at how often each word appears in the text that we want to classify.

Note: I am not saying that bag of words is the best approach to NLP; it is usually quite the opposite, since nowadays we have tools like RNNs and Transformers that perform much better on NLP tasks. However, we use it here because it is a really simple approach that sometimes still gives reasonable results, as it did in this case!

This is exactly what we use a CountVectorizer for: it produces a term document matrix with frequencies of each word for each message.

Let’s look at an example of what sklearn’s CountVectorizer is doing. Suppose that our messages are:

| message | label |
|---|---|
| This is a good message | 0 |
| A good message | 0 |
| This message is bad | 1 |
| The message is bad | 1 |

Then, what a CountVectorizer is going to do for us is produce the following matrix:

| message | label | this | is | a | good | message | the | bad |
|---|---|---|---|---|---|---|---|---|
| This is a good message | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| A good message | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| This message is bad | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| The message is bad | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |

Important: CountVectorizer just finds the vocabulary (list of all the unique words in our data) and then counts how often each word from the vocabulary is found in each message.
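If you want to check this for yourself, here is a small sketch that reproduces the toy table with sklearn. The toy_* names are just for illustration, and the custom token_pattern is only needed because, by default, CountVectorizer drops single-character tokens such as “a”:

toy_messages = [
    "This is a good message",
    "A good message",
    "This message is bad",
    "The message is bad",
]

# Keep single-character tokens (like "a") so the vocabulary matches the table.
toy_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
toy_term_doc = toy_vectorizer.fit_transform(toy_messages)

# The vocabulary comes out in alphabetical order, so the columns are ordered
# differently than in the table above, but the counts are the same.
print(toy_vectorizer.get_feature_names_out())  # use get_feature_names() on older sklearn
print(toy_term_doc.toarray())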

Notice that for a larger dataset our vocabulary might become very large, which would mean that most of the cells in our term document matrix would be 0. For this reason, the sklearn implementation actually produces a sparse matrix which, instead of storing all the entries, stores only the locations and values of the non-zero entries, and since there are relatively few of those, this saves huge amounts of memory. How neat!
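A quick way to see just how sparse the matrix is (the exact numbers will depend on your data and vectorizer settings):

# Shape, number of stored non-zero entries and overall density of the term document matrix.
print(train_term_doc.shape)
print(train_term_doc.nnz)
density = train_term_doc.nnz / (train_term_doc.shape[0] * train_term_doc.shape[1])
print(f"density: {density:.4%}")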

You must have noticed that I used the max_df=0.1 parameter for my CountVectorizer. This tells the vectorizer to ignore any words that appear in more than 10% of the documents, as we can safely say that they are too common. I came up with this number by trying different values and looking at vectorizer.stop_words_ to check how many and which words were ignored until I was satisfied. When you build your own model, make sure to play around with this and other parameters, such as min_df (the opposite of max_df, used for very rare words), to find what works best for you! You can find more information on which CountVectorizer parameters can be tweaked in the official docs.
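This is roughly the kind of check I mean, in case you want to replicate it (the exact contents of stop_words_ will of course depend on your data and the threshold you pick):

# Words the vectorizer ignored because of max_df (and min_df, if you set it).
print(len(vectorizer.stop_words_))
print(sorted(vectorizer.stop_words_)[:10])  # peek at a few of them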

One more trick before moving on: later we will want to calculate the probability of each word appearing in spam and in non-spam messages. It might happen that certain words do not appear in a particular class at all, and then we would run into trouble because we would get feature probabilities of zero, which we don’t want. To counter that, we add a row of ones (also known as add-one, or Laplace, smoothing), like so:

| message | label | this | is | a | good | message | the | bad |
|---|---|---|---|---|---|---|---|---|
| This is a good message | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| A good message | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| This message is bad | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| The message is bad | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| row of ones | | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

If we think about it a bit more, adding a row of ones is not that counter-intuitive at all: the messages we have only carry information up to this point in time, and even if a word has not appeared in any of the messages of a class so far, it is by no means impossible that it will appear in one in the future, so adding the extra one takes care of that for us.

Our feature probabilities will then be:

| message | label | this | is | a | good | message | the | bad |
|---|---|---|---|---|---|---|---|---|
| This is a good message | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| A good message | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| This message is bad | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| The message is bad | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| row of ones | | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| P(feature \| 0) | | 0.67 | 0.67 | 1 | 1 | 1 | 0.33 | 0.33 |
| P(feature \| 1) | | 0.67 | 1 | 0.33 | 0.33 | 1 | 0.67 | 1 |
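To make the numbers concrete, here is a small numpy sketch that reproduces the two probability rows of the toy table. Note that in the toy table each smoothed count is divided by the number of messages in that class plus one; the actual model code below normalizes slightly differently (over the total counts per class), but the idea is the same.

# Term counts for the toy example, one row per message, columns in the
# order: this, is, a, good, message, the, bad.
class_0 = np.array([[1, 1, 1, 1, 1, 0, 0],
                    [0, 0, 1, 1, 1, 0, 0]])
class_1 = np.array([[1, 1, 0, 0, 1, 0, 1],
                    [0, 1, 0, 0, 1, 1, 1]])

# The "+ 1" plays the role of the row of ones above (add-one smoothing).
p_feature_given_0 = (class_0.sum(0) + 1) / (class_0.shape[0] + 1)
p_feature_given_1 = (class_1.sum(0) + 1) / (class_1.shape[0] + 1)

print(p_feature_given_0.round(2))  # approximately [0.67 0.67 1. 1. 1. 0.33 0.33]
print(p_feature_given_1.round(2))  # approximately [0.67 1. 0.33 0.33 1. 0.67 1.]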

In code:


# Create a vectorizer object that will ignore any
# words that appear in more than 10% of messages.
vectorizer = CountVectorizer(max_df=0.1)

# Use the fit_transform method of the vectorizer to get
# the term document matrix for the training set.
train_term_doc = vectorizer.fit_transform(X_train)

# Use the transform method of the vectorizer to get
# the term document matrix for the validation set.
# We do it this way so that the train and validation sets
# have the same vocabulary, so that we can make predictions.
val_term_doc = vectorizer.transform(X_val)

2. Bayes’ theorem

Now we can get to the essence of the model, which is to apply Bayes’ theorem. I am not going to give the usual form of the formula, but instead one that illustrates how we will be using it. Our goal is, given a particular message, to figure out whether it is spam or not. Hence, for us the formula is going to take the form:

\(P(\text{spam} \mid \text{message}) = \frac{P(\text{message} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{message})}\)

But we can use a trick: instead of trying to predict directly whether a message is spam or not, let’s look at which class the message is more likely to belong to, i.e. \(\frac{P(\text{spam} \mid \text{message})}{P(\text{non-spam} \mid \text{message})}\). The nice thing about this ratio is that the \(P(\text{message})\) denominators cancel, so the formula becomes:

\(\text{ratio} = \frac{P(\text{spam} \mid \text{message})}{P(\text{non-spam} \mid \text{message})} = \frac{P(\text{message} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{message} \mid \text{non-spam}) \cdot P(\text{non-spam})}\)

Referring to our previous example, the ratios would then be:

| message | label | this | is | a | good | message | the | bad |
|---|---|---|---|---|---|---|---|---|
| This is a good message | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| A good message | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| This message is bad | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| The message is bad | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| row of ones | | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| P(feature \| 0) | | 0.67 | 0.67 | 1 | 1 | 1 | 0.33 | 0.33 |
| P(feature \| 1) | | 0.67 | 1 | 0.33 | 0.33 | 1 | 0.67 | 1 |
| ratio | | 1 | 1.5 | 0.33 | 0.33 | 1 | 2 | 3 |

Important: As you can see, the ratios are greater than 1 for features that are more likely to appear in spam messages and lower than 1 otherwise.
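Continuing the toy sketch from the previous section, the ratio row is just the element-wise division of the two probability rows:

# Smoothed per-class probabilities from the toy table, columns in the
# order: this, is, a, good, message, the, bad.
p_feature_given_0 = np.array([2, 2, 3, 3, 3, 1, 1]) / 3
p_feature_given_1 = np.array([2, 3, 1, 1, 3, 2, 3]) / 3

ratio = p_feature_given_1 / p_feature_given_0
print(ratio.round(2))  # approximately [1. 1.5 0.33 0.33 1. 2. 3.]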

Then, to simplify things further, we make the naive Bayes assumption, which says that the probability of any word appearing in a message is independent of the probabilities of the other words appearing in that same message.

Note: Obviously, this is a very naive assumption and that is most certainly not the case, but it turns out to work quite well.

Under the naive assumption, the probabilities like \(P(\text{message} \mid \text{spam})\) can be factorized into a product of probabilities of individual features appearing in a message, so that:

\(P(\text{message} \mid \text{spam}) = \prod_{i=1}^{n}{P(\text{features[i]} \mid \text{spam})}\)

and similarly for \(P(\text{message} \mid \text{non-spam})\). Then, our big formula becomes:

\(\text{ratio} = \frac{P(\text{spam} \mid \text{message})}{P(\text{non-spam} \mid \text{message})} = \frac{\prod_{i=1}^{n}{P(\text{features[i]} \mid \text{spam})}}{\prod_{i=1}^{n}{P(\text{features[i]} \mid \text{non-spam})}} \cdot \frac{P(\text{spam})}{P(\text{non-spam})}\)

3. A few final tricks

We are almost done; now we just want to apply a few tricks to make our calculations easier. First, notice that multiplying lots of probabilities together is going to result in a very small number and we might run out of floating point precision, so we take the natural logarithm instead to handle this. Note that in this case we compare the ratios not with 1 but with 0 (because log(1) = 0), and that by the properties of logarithms all the products turn into sums, which makes everything even simpler!
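As a tiny worked example (a sketch using the toy ratios from the tables above): for the message “This message is bad”, the relevant ratios are 1 (this), 1.5 (is), 1 (message) and 3 (bad), and the prior ratio is 2/2 = 1, so in log space the score is about 1.5, which is greater than 0, meaning the message gets classified as spam, matching its label.

# Log-space score for the toy message "This message is bad":
# feature log-ratios for this, is, message, bad, plus the log prior ratio log(2/2) = 0.
score = np.log([1, 1.5, 1, 3]).sum() + np.log(2 / 2)
print(score)  # ~1.5 > 0, so the message is classified as spam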

Finally, we notice that to make predictions, we can just multiply the validation term document matrix by our derived vector of log ratios and add (remember, we are in log space!) the log of the ratio of the priors.

Putting it all together:


# Calculate P(feature|1) and P(feature|0).
# The plus ones play the role of the
# row of ones discussed above.
p = train_term_doc[y_train == 1].sum(0) + 1
q = train_term_doc[y_train == 0].sum(0) + 1

# Calculate the ratios according to our derived formulae
ratio = np.log((p / p.sum()) / (q / q.sum()))

# Calculate the log of ratio of priors
b = np.log((y_train == 1).sum() / (y_train == 0).sum())

# Make some predictions on the validation set
pre_preds = val_term_doc @ ratio.T + b
preds = pre_preds.T > 0 # Greater than 0 because we are working in log space
(preds == y_val).mean() # Accuracy
0.9856502242152466

How could I use this?

With this newly acquired knowledge, you can go ahead and try using this model on your own NLP data! Though I have to warn you: the dataset used in this article was quite easy, and a naive Bayes classifier might not work as well on your data. But it is well worth trying out, especially when you can build one in just 10 lines of code!

To better understand the material in this article, make sure to play around with anything you want to dig deeper into, e.g. the different parameters and attributes of CountVectorizer. You can also check the full version of this post on GitHub to see which cells I ran myself to better understand the inner workings of the model.

If you want to train your own model, a few suggestions for improving its performance are presented below.

Try n-grams

vectorizer = CountVectorizer(ngram_range=(1, 3), max_df=0.1)
train_term_doc = vectorizer.fit_transform(X_train)
val_term_doc = vectorizer.transform(X_val)
p = train_term_doc[y_train == 1].sum(0) + 1
q = train_term_doc[y_train == 0].sum(0) + 1
ratio = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y_train == 1).sum() / (y_train == 0).sum())
pre_preds = val_term_doc @ ratio.T + b
preds = pre_preds.T > 0
(preds == y_val).mean()
0.9874439461883409

It turns out the model gives even better accuracy with bigrams and trigrams included. But watch out! Checking the confusion matrices, we see that the model is now nearly perfect on non-spam messages, but the error on spam messages has increased. This might not be what we want, so we have to be careful when interpreting our models!

confusion_matrix(y_val, preds.T, normalize=None)
array([[968,   2],
       [ 12, 133]])
confusion_matrix(y_val, preds.T, normalize="true")
array([[0.99793814, 0.00206186],
       [0.08275862, 0.91724138]])

Binarized version

You can also try the binarized version of the term document matrix (i.e. instead of frequencies, we only look at whether a word is present or not):

pre_preds = val_term_doc.sign() @ ratio.T + b
preds = pre_preds.T > 0
(preds == y_val).mean()
0.9874439461883409

Learning the parameters with logistic regression

Finally, you might try taking it to the next level and learning the parameters with a logistic regression instead of using the theoretical ones. Check out the parameter C for regularization, and dual=True for when your term document matrix is much wider than it is tall.

m = LogisticRegression(C=1e1, dual=False)
m.fit(train_term_doc, y_train)
preds = m.predict(val_term_doc)
(preds == y_val).mean()
0.9811659192825112

# Binarized version
m = LogisticRegression(C=1e1, dual=False)
m.fit(train_term_doc.sign(), y_train)
preds = m.predict(val_term_doc.sign())
(preds == y_val).mean()
0.9802690582959641

Conclusion

We’ve managed to build a fast and reasonably accurate naive Bayes model in just 10 lines of code. We’ve also discussed some ways in which it could be improved and adapted to your own problems, e.g. using n-grams, trying the binarized version, or employing a logistic regression to learn the parameters.

It turns out that the best linear model we can build (i.e. not involving RNNs or transformers) is actually a combination of naive Bayes and logistic regression, but that is something for another time, although you can check this lesson from fast.ai if you want to find out more yourself.
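If you are curious, here is a rough sketch of the general idea (often referred to as NB-LR or NB-SVM): instead of feeding the raw counts to the logistic regression, scale each feature by the naive Bayes log-count ratio first. This is only a sketch under my own assumptions, not the exact recipe from the fast.ai lesson, and on such an easy dataset it will not necessarily improve accuracy, but it is a fun experiment to run.

from scipy.sparse import csr_matrix

# Scale each column of the term document matrices by the naive Bayes
# log-count ratio, then let logistic regression learn the weights on top.
x_nb_train = csr_matrix(train_term_doc.multiply(ratio))
x_nb_val = csr_matrix(val_term_doc.multiply(ratio))

m = LogisticRegression(C=1e1, max_iter=1000)
m.fit(x_nb_train, y_train)
preds = m.predict(x_nb_val)
print((preds == y_val).mean())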