Nick Tran

Can We Predict Fake News?

Machine Learning · 7 min read

TLDR: No, not with the dataset that was used :(

If you need a quick brush up on machine learning, you can learn more here: Machine Learning Post

Introduction

Fake news has become a major issue today, but identifying what is true and what is not remains difficult. This report details how Support Vector Machines (SVM) and Logistic Regression can be used to classify statements made by different politicians as either fake or true. The dataset used is the 'Liar, Liar Pants on Fire' (LIAR) dataset, provided by William Wang (william@cs.ucsb.edu). There are 14 fields and 10,240 examples in the training set. We hypothesize that Support Vector Machines and Logistic Regression will be able to classify fake and real news with at least 70% accuracy. Identifying fake news has business applications for social media companies that need to monitor the content shared on their sites.

Presentation

Dataset

Here is a row from the training dataset:

ID: 2635.json
Label: false
Statement: When did the decline of coal start?
Subject: energy, history, job-accomplishments
Speaker: scott-surovell
Speaker's Job Title: State delegate
State: Virginia
Party: democrat
Barely-true counts: 0
False counts: 0
Half-true counts: 1
Mostly-true counts: 1
Pants-on-fire counts: 0
Context: a floor speech.

There are 10,240 examples in the training set and 1,267 examples in the test set. Side note: this roughly 90:10 train/test split was fixed before we received the data, and it led to the testing accuracy coming out higher than the training accuracy. This is an indication that the training and test sets should have been more balanced in size.

Pre-Processing the Dataset

For Logistic Regression (LR) and Support Vector Machines (SVM), we used the scikit-learn library. Both LR and SVM are classification algorithms, meaning they work on discrete label sets, not continuous values. Logistic regression estimates the probability of an event's occurrence based on previously seen data; the label being predicted is binary. In our case, we want to predict whether a statement is true (1) or false (0). Support Vector Machines work similarly in the sense that the algorithm finds a hyperplane in N-dimensional space (N = 8 features in our case) that distinctly separates the data points.
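The two scikit-learn models can be sketched as follows. The feature matrix here is synthetic stand-in data (the real pipeline used the encoded LIAR features described below), but the fit/predict flow is the same:

```python
# Sketch: binary classification with scikit-learn's LogisticRegression and SVC.
# X is synthetic stand-in data with 8 features, matching the report's N = 8.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 8 features per statement
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary label: true (1) / false (0)

lr = LogisticRegression().fit(X, y)       # probabilistic linear classifier
svm = SVC(kernel="linear").fit(X, y)      # max-margin hyperplane

print(lr.predict(X[:5]), svm.predict(X[:5]))
```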

Below is a table that lists the feature encodings applied to the dataset. The columns ID, statement, subject(s), speaker, speaker-title, and context were excluded from the feature vector. Instead of predicting six labels (pants-on-fire, false, barely-true, half-true, mostly-true, and true), each label is collapsed to true (1) or false (0); this is referred to as binary classification throughout this report. The labels 'half-true', 'mostly-true', 'barely-true', and 'true' are changed to 1, and the labels 'false' and 'pants-on-fire' are changed to 0. We also created a new feature called the truth-score: the probability that the speaker will tell the truth, based on their previous statements. We decided to use only these 9 features, out of the 14 available, because it reduces the noise the models would pick up on. For instance, we did not know how to encode the statement, subject, and context columns without a natural language processor, so leaving those features out reduced the risk of adding noise to the model. We learned that raw data is not pretty and needs to be massaged and processed before being fed into a machine learning algorithm.
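A minimal sketch of these two pre-processing steps. The exact weighting of the count columns in the truth-score is an assumption on my part; the report only says the score is the probability of the speaker telling the truth given their past statements:

```python
# Sketch: collapse the six LIAR labels to a binary label, and compute a
# per-speaker truth-score from the five historical count columns.
TRUE_LABELS = {"half-true", "mostly-true", "barely-true", "true"}

def binarize(label: str) -> int:
    # 'false' and 'pants-on-fire' map to 0, everything else to 1
    return 1 if label in TRUE_LABELS else 0

def truth_score(barely, false, half, mostly, pants):
    # Assumed weighting: fraction of past statements that were at least
    # barely true. With no history, fall back to 0.5 (a coin flip).
    total = barely + false + half + mostly + pants
    if total == 0:
        return 0.5
    return (barely + half + mostly) / total

print(binarize("pants-on-fire"), truth_score(0, 0, 1, 1, 0))
```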

Feature Set that was Used
#   Feature         Encoding
1   Label           Binary and LabelEncoder
2   State           LabelEncoder
3   Party           LabelEncoder
4   Barely-True     -
5   False           -
6   Half-True       -
7   Mostly-True     -
8   Pants-on-Fire   -
9   Truth-Score     See below
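The LabelEncoder step in the table above turns categorical columns like State and Party into integer codes. A small sketch with made-up values:

```python
# Sketch: encoding a categorical column (e.g. State) with scikit-learn's
# LabelEncoder. Classes are assigned codes in sorted order.
from sklearn.preprocessing import LabelEncoder

states = ["Virginia", "Texas", "Virginia"]  # toy example values
enc = LabelEncoder()
codes = enc.fit_transform(states)           # Texas -> 0, Virginia -> 1

print(codes, enc.inverse_transform(codes))
```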

Logistic Regression Results

Using Binary Classification:

These test results came from classifying all of the input labels as either true or false, so the algorithm only has to label whether a statement is True (1) or False (0).

Regularization   Feature Transformation   Training Features           Training Accuracy   Testing Accuracy
L1 (Lasso)       Polynomial Degree 3      State, Party, Truth-Score   80.00%              81.16%
L2 (Ridge)       Polynomial Degree 3      State, Party, Truth-Score   79.96%              81.77%
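The configuration behind these rows, a degree-3 polynomial feature transform followed by L1- or L2-penalized logistic regression, can be sketched as a scikit-learn pipeline. The data here is synthetic; the three columns stand in for the State, Party, and Truth-Score features:

```python
# Sketch: degree-3 polynomial features + logistic regression with L1 (Lasso)
# or L2 (Ridge) regularization, on synthetic stand-in data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))        # stand-ins for state, party, truth-score
y = (X[:, 2] > 0).astype(int)        # toy label driven by the truth-score

lasso = make_pipeline(PolynomialFeatures(degree=3),
                      LogisticRegression(penalty="l1", solver="liblinear"))
ridge = make_pipeline(PolynomialFeatures(degree=3),
                      LogisticRegression(penalty="l2", max_iter=1000))

lasso.fit(X, y)
ridge.fit(X, y)
print(lasso.score(X, y), ridge.score(X, y))
```

Note that L1 requires a solver that supports it (e.g. liblinear), while L2 works with the default lbfgs solver.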
Without Binary Classification

Predicting 5 different labels.

Regularization   Feature Transformation   Training Features   Training Accuracy   Testing Accuracy
L1 (Lasso)       Polynomial Degree 3      Features [2-9]      34.46%              36.23%
L2 (Ridge)       Polynomial Degree 3      Features [2-9]      31.96%              32.99%

How we did compared to the State-of-the-Art

We can compare these results (without binary classification) to work done by Arjun Roy, Kingshuk Basak, Asif Ekbal, and Pushpak Bhattacharyya from the Indian Institute of Technology Patna, India. Instead of logistic regression, they used deep learning architectures, a Bi-LSTM (Bidirectional Long Short-Term Memory network) and a Convolutional Neural Network (CNN), to achieve state-of-the-art testing accuracy. Their work can be found here: FakeNews.

Their Results (State of the Art)
Algorithm(s)    Testing Accuracy
Bi-LSTM         42.65%
CNN             42.89%
CNN + Bi-LSTM   44.87%

We also used Support Vector Machines (with three different kernels) to test the hypothesis. We ended up with slightly lower testing accuracy than Logistic Regression.

Support Vector Machine Results

Kernel       Training Accuracy   Testing Accuracy
Linear       79.76%              81.37%
RBF          72.32%              73.08%
Polynomial   72.32%              73.08%
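The kernel comparison amounts to fitting the same SVC with three kernel choices. A sketch on synthetic stand-in data (the real runs used the encoded LIAR features):

```python
# Sketch: fit one SVC per kernel and compare training accuracy.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(int)

# "poly" is scikit-learn's name for the polynomial kernel (degree 3 default)
scores = {k: SVC(kernel=k).fit(X, y).score(X, y)
          for k in ("linear", "rbf", "poly")}
print(scores)
```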

Conclusion

In a world with increasingly unreliable media and statements that are harder to verify, we think that fake news is a growing plague upon the internet and the world in general. It lets actual truths slip through the cracks while fake ideas and information are used as talking points that set dangerous precedents. In this project we wanted to create a tool to filter through these ideas so that the truth gets recognized and what is fake can be knowingly dismissed.

The dataset contained statements from various political and public figures, along with the level of truth assigned to each statement by PolitiFact. We wanted to use this to assess new statements made by these individuals and categorize how true or false they may be. To process the data, we first normalized the labels into a binary True/False classification, then gave each data entry a "Truth Score" based on the speaker's previous record of telling the truth.

Using these features we built several machine learning models on the training data. As a dry test of our hypothesis, we created a baseline that simply returned the median of the dataset labels, which shows exactly how much better our predictions do than guessing. We used logistic regression with L1 and L2 regularization and polynomial feature transforms, and support vector machines with linear, RBF, and polynomial kernels. The logistic regression and support vector machine models gave a consistent average of roughly 70 percent accuracy on the data, beating the median baseline by roughly 30 percent. The loss of accuracy with increasing iterations could be attributed to overfitting and losing some of the general trends seen in the training data.
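The median-label baseline mentioned above can be sketched with scikit-learn's DummyClassifier; with binary labels, the median label and the most frequent class coincide:

```python
# Sketch: a baseline that always predicts the majority (median) label,
# ignoring the features entirely.
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array([1, 1, 1, 0, 0])            # toy labels, mostly "true"
X = np.zeros((len(y), 1))                # features are ignored by the dummy

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.predict(X), baseline.score(X, y))
```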
We also looked at using XGBoost alongside Logistic Regression and Support Vector Machines, but ran into time constraints while trying to implement it. It may otherwise have provided better results than those shown by LR and SVM.

Can we accurately predict fake news? - No. Not better than flipping a coin.

Commercial Applications

Showing that tools like this can reliably filter fake news could have a profound impact on politics, for example by automatically filtering out malicious bots that try to inject misinformation into the public eye. Social media websites like Twitter and Facebook could use these systems to mitigate the legal risks that arise from malicious accounts generating and sharing fake news on their platforms.

Code

View my code (Jupyter Notebook) here: Notebook

Thank you for reading!