Fake News Classification — Use Machine Learning to Fight Against Fake News

Frost Tianjian Xu
6 min read · May 7, 2021

Introduction

Millions of news stories are posted online every day, and many of them are inaccurate, manipulated, fabricated, or, in one word, fake. Having humans review every published news story is impossible: not only does it require an enormous amount of work, but humans are also slow. By the time one person has read a single story and decided whether it is fake, hundreds more have already been published. In this article, I will show how to leverage machine learning and NLP to classify fake news quickly and accurately.

The complete codebase can be found on my GitHub.

Kickoff to Fake News Dataset

I found a dataset of ~37,000 real and fake news stories on Kaggle. You can download it here:

Before we dive into any fancy ML algorithms, let's take a brief look at the data. The dataset contains slightly more fake news stories than real ones: if we simply guessed that every story is fake, we would be right about 53% of the time. Therefore, 53% accuracy is the baseline that any model we build has to beat.

We can also see that the distribution of article length is roughly the same for fake and real news. This is exactly what we want: our model should not be able to naively judge a news story just by looking at its length.
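A quick sketch of this exploration with pandas, assuming the Kaggle download provides Fake.csv and True.csv files with a text column (adjust the file names to your local copy):

import pandas as pd

# Assumed file names from the Kaggle download; adjust to your local paths.
fake = pd.read_csv("Fake.csv")
real = pd.read_csv("True.csv")

fake["label"] = 1  # 1 = fake
real["label"] = 0  # 0 = real
df = pd.concat([fake, real], ignore_index=True)

# Class balance: always guessing the majority class gives the ~53% baseline.
print(df["label"].value_counts(normalize=True))

# Compare the article length distributions of fake and real stories.
df["length"] = df["text"].str.split().str.len()
print(df.groupby("label")["length"].describe())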

Classification Approaches

Now, let's make some ML happen! I will use two approaches to detect fake news. One is a random forest, a classical classification method; the other is a dense neural network built on top of BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art NLP model developed by researchers at Google.

Random Forest Approach

Before we build any random forest model, we need to figure out how to convert human language (English in our case) into numbers that a machine can read. In our approach, we will use the TF-IDF algorithm to extract keywords from the news stories and convert each story into a vector of TF-IDF features.

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

TF-IDF Formula, source: Digital Marketing Chef
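In its most common form, the TF-IDF score of a term t in a document d, drawn from a corpus of N documents, is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) counts how often t appears in d and df(t) is the number of documents that contain t. (scikit-learn's implementation uses a slightly smoothed variant of this formula.)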

By using the TfidfVectorizer from scikit-learn, we can convert natural language into a numerical matrix. The next step is to build an ML model that fits the data. A decision tree may be the first idea that comes to mind, because it resembles the human decision-making process: "if word A and word B show up very frequently in the text, then this text is very likely to be fake news; if they do not, then I will look at the frequency of word C…"
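A minimal sketch of this vectorization step, assuming the combined df DataFrame from the exploration sketch above (the vectorizer parameters here are illustrative, not tuned):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hold out roughly 10,000 of the ~37,000 articles for testing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.27, random_state=10
)

# Illustrative settings: drop English stop words and keep the 5,000 strongest terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)

# Learn the vocabulary and IDF weights on the training set only,
# then transform both splits into sparse TF-IDF feature matrices.
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)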

However, a single decision tree tends to overfit the training data, so we take it a step further and group multiple decision trees to work on the same classification problem, each trained on a subset of the features (looking for different keywords). The result is a random forest classifier. In our code, I build the random forest classifier like this:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=5,
    max_features="sqrt",
    max_depth=10,
    random_state=10
)

An analogy: imagine a committee of five judges (the decision trees). Each judge reads a different part of a news article, and then they all vote on whether the article is real or fake. The majority vote is our model's prediction.

After fitting the random forest classifier on the training data, it achieves about 90% accuracy on the test data, which is pretty nice. I also plot the confusion matrix of the classifier's performance on the test data. The false positives and false negatives are roughly equal, meaning the model has no particular tendency to predict stories as either fake or real.

Confusion Matrix for Random Forest Model
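Fitting and scoring the classifier might look like this; a sketch that reuses the TF-IDF matrices from above (plotting the confusion matrix as a heatmap is left out):

import time
from sklearn.metrics import accuracy_score, confusion_matrix

start = time.time()
rf.fit(X_train, y_train)
print("Training Time (in seconds) =", round(time.time() - start, 6))

start = time.time()
y_pred = rf.predict(X_test)
print("Testing Time (in seconds) =", round(time.time() - start, 6))

print("Accuracy:", accuracy_score(y_test, y_pred))
# Rows are the true labels, columns are the predicted labels.
print(confusion_matrix(y_test, y_pred))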

Besides the decent accuracy, a random forest model is also very fast to train and test. In my implementation, it takes 0.37 seconds to fit the model on ~27,000 training articles and 0.049 seconds to make predictions for ~10,000 test articles.

Training Time (in seconds) = 0.371672 
Testing Time (in seconds) = 0.049181
=============Evaluation Result==============
Random forest classifier accuracy: 0.899332
Random forest classifier precision: 0.893516
Random forest classifier recall: 0.896427
Random forest classifier f1_score: 0.894969

BERT + DNN Approach

To get better classification accuracy, I will build a BERT + DNN classifier. BERT is an NLP technique that tokenizes and transforms text so that it can be fed into a neural network. We will not go deep into the theory behind BERT. The biggest takeaway is that BERT, like TF-IDF, converts text into numbers, and it is more powerful than TF-IDF because it takes the context of each word into account.
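To make the "text into numbers" idea concrete, here is what the first step looks like: the BERT tokenizer (the same one initialized in the code further down) turns a sentence into token IDs plus an attention mask. The example sentence is just an illustration:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "Scientists say they have discovered water on Mars.",
    padding="max_length", max_length=16, truncation=True
)
print(encoded["input_ids"])       # integer IDs for [CLS], the word pieces, [SEP], and padding
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding positions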

BERT enables the machine to "understand" natural language, and we then pass the vectors it generates to a neural network to do the classification. This neural network could be very deep and complicated, but here a three-layer dense neural network is good enough for our problem.

BERT + DNN Model Design

Here is how we design the BERT + DNN classification model in code:

import torch
from transformers import (
    BertTokenizer,
    BertForSequenceClassification
)

# Use GPU to accelerate calculation.
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Initialize the BERT tokenizer and BERT classifier.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
bert_model.config.num_labels = 1

# Freeze the pre-trained parameters.
for param in bert_model.parameters():
    param.requires_grad = False

# Replace the classification head with a dense neural network,
# then define the loss function and optimizer.
bert_model.classifier = torch.nn.Sequential(
    torch.nn.Linear(768, 256),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.2),
    torch.nn.Linear(256, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
    torch.nn.Softmax(dim=1)
)
bert_model = bert_model.to(device)
criterion = torch.nn.MSELoss().to(device)
optimizer = torch.optim.SGD(bert_model.classifier.parameters(), lr=0.01)

We then train this BERT + DNN model on the training data and evaluate it on the test data. It achieves ~96% accuracy, a huge improvement over our initial random forest model.
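For completeness, here is a simplified sketch of what that training loop might look like. It assumes the training articles and labels live in hypothetical train_texts and train_labels lists, one-hot encodes the labels to match the two softmax outputs and the MSE loss, and uses an illustrative batch size and epoch count:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Tokenize the articles, truncating/padding to BERT's 512-token limit.
encodings = tokenizer(
    train_texts, truncation=True, padding="max_length",
    max_length=512, return_tensors="pt"
)
# One-hot targets of shape (n, 2) to match the classifier's softmax output.
targets = torch.nn.functional.one_hot(
    torch.tensor(train_labels), num_classes=2
).float()

dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], targets)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

bert_model.train()
for epoch in range(3):  # illustrative number of epochs
    for input_ids, attention_mask, target in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        target = target.to(device)

        optimizer.zero_grad()
        outputs = bert_model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits, target)  # softmax outputs vs. one-hot targets
        loss.backward()  # only the unfrozen classifier head gets gradient updates
        optimizer.step()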

Of course, the high accuracy comes with a tradeoff: it takes about 41 minutes to train the neural network on the ~27,000 training articles, and another 11 minutes to predict real or fake for the ~10,000 test articles. Altogether, that is over 7,000 times longer than training and testing the random forest model.

Training Time (in minutes) = 41.600925
Testing Time (in minutes) = 11.135320
===============Evaluation End===============
BERT classifier accuracy: 0.962918
BERT classifier precision: 0.958473
BERT classifier recall: 0.964273
BERT classifier f1_score: 0.961364

Confusion Matrix for BERT + DNN

Summary

Now let's look at what we have built: a random forest classifier that is moderately accurate and very fast, and a BERT + DNN classifier that is very accurate but slow. Both are valuable, and there is no clear winner. We should also note that in this example the model designs are simple and the models are trained on a relatively small dataset (~37,000 fake/real news articles, compared to the millions of news stories posted online every day in real-world practice). Still, it is a promising sign that machine learning models can perform quite well at fake news detection. I hope this article brings you some inspiration, and that you will be able to design your own machine learning models to fight against fake news.

Model Performance Comparison
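For reference, here are the two models' metrics side by side, compiled from the numbers reported above:

Metric            Random Forest    BERT + DNN
Accuracy          0.899            0.963
Precision         0.894            0.958
Recall            0.896            0.964
F1 score          0.895            0.961
Training time     0.37 s           41.6 min
Testing time      0.05 s           11.1 min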
