Can We Predict Fake News?
— Machine Learning — 7 min read
TLDR: No, not with the dataset that was used :(
If you need a quick brush up on machine learning, you can learn more here: Machine Learning Post
Introduction
Fake news has become a major issue today, and it is increasingly difficult to identify what is true and what is not. This report details how Support Vector Machines (SVM) and Logistic Regression can be used to classify statements made by different politicians as either fake or true. The dataset used is the 'Liar, Liar Pants on Fire' dataset, provided by William Wang, william@cs.ucsb.edu. Each example has 14 fields, and there are 10,240 examples in the training set and 1,267 in the test set. We hypothesize that Support Vector Machines and Logistic Regression will be able to classify fake and real news with at least 70% accuracy. Identifying fake news has business applications for social media companies that need to monitor the content shared on their sites.
Presentation
Dataset
Here is a row from the training dataset:
ID | Label | Statement | Subject | Speaker | Speaker's Job Title | State | Party | barely-true counts | false counts | half-true counts | mostly-true counts | pants-on-fire counts | context |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2635.json | false | When did the decline of coal start? | energy, history, job-accomplishments | scott-surovell | State delegate | Virginia | democrat | 0 | 0 | 1 | 1 | 0 | a floor speech. |
There are 10,240 examples in the training set and 1,267 examples in the test set. Side note: this roughly 90:10 split between training and test sets was already done beforehand, and it led to the testing accuracy being higher than the training accuracy. This is an indication that the training and test sets should have been more balanced in size.
Pre-Processing the Dataset
For Logistic Regression (LR) and Support Vector Machines (SVM), we used the scikit-learn library. Both LR and SVM are classification algorithms, meaning they predict discrete labels rather than continuous values. Logistic regression estimates the probability of an event occurring based on the data it was trained on, and the label being predicted is binary. In our case, we want to predict whether a statement is true (1) or false (0). Support Vector Machines work similarly, in the sense that the algorithm finds a hyperplane in N-dimensional space (we used N = 8 here, one dimension per feature) that distinctly separates the classes.
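To make the distinction concrete, here is a minimal sketch (not the notebook's code) of how the two classifiers expose their decisions in scikit-learn: LR returns a probability per class, while a linear SVM returns a signed distance from its separating hyperplane. The toy data here simply stands in for our encoded features.

```python
# Minimal sketch, not the notebook's code: toy data stands in for the 8 encoded features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # 8 numeric features, as in our feature vector
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # binary label: 1 = true, 0 = false

lr = LogisticRegression().fit(X, y)
print(lr.predict_proba(X[:3]))                 # per-statement probability of each class

svm = SVC(kernel="linear").fit(X, y)
print(svm.decision_function(X[:3]))            # signed distance from the separating hyperplane
```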
Below is a table that lists the feature encodings that were applied to the dataset. The columns ID, statement, subject(s), speaker, speaker-title, and context were excluded from the feature vector. Instead of having to predict six labels (pants-on-fire, false, barely-true, half-true, mostly-true, and true), each label is mapped to true (1) or false (0); this is referred to as binary classification throughout this report. The labels 'half-true', 'mostly-true', 'barely-true', and 'true' are changed to 1, and the labels 'false' and 'pants-on-fire' are changed to 0. We also created a new feature called the truth-score: the probability that the speaker will tell the truth, based on their previous statements. We decided to use only these nine features, out of the 14 available, because it would reduce the noise that the models would pick up on. For instance, we did not know how to encode the statement, subject, and context columns without using a natural language processor; by leaving out those features, we reduced the risk of adding noise to the model. We learned that raw data is not pretty and needs to be massaged and processed before feeding it into a machine learning algorithm. A sketch of this encoding appears after the table below.
Feature Set that was Used
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
Label | State | Party | Barely-True | False | Half-True | Mostly-True | Pants-on-Fire | Truth-Score |
Binary and Label Encoder | LabelEncoder | LabelEncoder | - | - | - | - | - | See below |
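Here is a rough sketch of the encoding described above, assuming the dataset's tab-separated files and a pandas DataFrame. The column names and the truth-score formula are my own plausible reading ("share of past statements rated at least half true"); the notebook may compute it differently.

```python
# Sketch of the feature encoding, assuming the LIAR TSV files; the truth-score
# formula below is an assumed interpretation, not necessarily the notebook's.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cols = ["id", "label", "statement", "subject", "speaker", "job", "state", "party",
        "barely_true", "false", "half_true", "mostly_true", "pants_on_fire", "context"]
df_train = pd.read_csv("train.tsv", sep="\t", header=None, names=cols)
# the test file ("test.tsv") is encoded the same way to give df_test

# Binary label: half-true / mostly-true / barely-true / true -> 1, false / pants-on-fire -> 0
truthy = {"half-true", "mostly-true", "barely-true", "true"}
df_train["label_bin"] = df_train["label"].isin(truthy).astype(int)

# LabelEncoder for the categorical columns we kept
for col in ["state", "party"]:
    df_train[col] = LabelEncoder().fit_transform(df_train[col].astype(str))

# Truth-score: share of the speaker's past statements rated at least half true (assumed formula)
history = ["barely_true", "false", "half_true", "mostly_true", "pants_on_fire"]
total = df_train[history].sum(axis=1).replace(0, 1)   # avoid division by zero
df_train["truth_score"] = (df_train["half_true"] + df_train["mostly_true"]) / total
```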
Logistic Regression Results
Using Binary Classification:
These test results came from classifying all of the input labels as either true or false, so the algorithm only had to label whether a statement is true (1) or false (0).
Regularization | Feature Transformation | Training Features | Training Accuracy | Testing Accuracy |
---|---|---|---|---|
L1 (Lasso) | Polynomial Degree 3 | State, Party, Truth-Score | 80.0% | 81.16% |
L2 (Ridge) | Polynomial Degree 3 | State, Party, Truth-Score | 79.96% | 81.77% |
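A hedged sketch of these two runs, continuing from the preprocessing sketch above (df_train/df_test and the encoded columns are assumed): degree-3 polynomial features over state, party, and truth-score, with an L1 and then an L2 penalty. The notebook's exact hyperparameters may differ.

```python
# Continues from the preprocessing sketch; df_train / df_test are assumed to exist.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

features = ["state", "party", "truth_score"]
X_train, y_train = df_train[features], df_train["label_bin"]
X_test, y_test = df_test[features], df_test["label_bin"]

for penalty, solver in [("l1", "liblinear"), ("l2", "lbfgs")]:
    model = make_pipeline(
        PolynomialFeatures(degree=3),   # feature transformation from the table above
        StandardScaler(),
        LogisticRegression(penalty=penalty, solver=solver, max_iter=1000),
    )
    model.fit(X_train, y_train)
    print(penalty, model.score(X_train, y_train), model.score(X_test, y_test))
```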
Without Binary Classification
Predicting all six original labels instead of the binary true/false.
Regularization | Feature Transformation | Training Features | Training Accuracy | Testing Accuracy |
---|---|---|---|---|
L1 (Lasso) | Polynomial Degree 3 | Features[2-9] | 34.46% | 36.23% |
L2 (Ridge) | Polynomial Degree 3 | Features[2-9] | 31.96% | 32.99% |
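The multi-class runs can be sketched the same way, assuming the same pipeline and DataFrames as above: keep all six original labels and use the wider feature set (features 2-9 of the table).

```python
# Multi-class variant: all six original labels, features 2-9 of the table.
# Reuses the imports and DataFrames from the previous sketches.
feature_cols = ["state", "party", "barely_true", "false", "half_true",
                "mostly_true", "pants_on_fire", "truth_score"]
label_enc = LabelEncoder().fit(df_train["label"])

model = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
)
model.fit(df_train[feature_cols], label_enc.transform(df_train["label"]))
print(model.score(df_test[feature_cols], label_enc.transform(df_test["label"])))
```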
How we did compared to the State-of-the-Art
We can compare these results (without binary classification) to work done by Arjun Roy, Kingshuk Basak, Asif Ekbal, and Pushpak Bhattacharyya from the Indian Institute of Technology Patna, India. Instead of logistic regression, they used deep learning models such as Bi-LSTM (Bidirectional Long Short-Term Memory) networks and Convolutional Neural Networks (CNN) to achieve considerably better testing accuracy. Their work can be found here: FakeNews.
Their Results (State of the Art)
Algorithm(s) | Testing Accuracy |
---|---|
Bi-LSTM | 42.65% |
CNN | 42.89% |
CNN+Bi-LSTM | 44.87% |
We also used Support Vector Machines (with three different kernels) to test the hypothesis. We ended up with a slightly lower testing accuracy than with Logistic Regression.
Support Vector Machine Results
Kernel | Training Accuracy | Testing Accuracy |
---|---|---|
Linear | 79.76% | 81.37% |
RBF | 72.32% | 73.08% |
Polynomial | 72.32% | 73.08% |
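A sketch of the SVM runs, assuming the same X_train/X_test as the logistic regression sketch; kernel hyperparameters (C, gamma, degree) are left at scikit-learn defaults, so the notebook's settings may differ.

```python
# One SVM per kernel, reusing X_train / y_train / X_test / y_test from the LR sketch.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

for kernel in ["linear", "rbf", "poly"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    print(kernel, model.score(X_train, y_train), model.score(X_test, y_test))
```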
Conclusion
In a world with increasingly unreliable media and statements that are harder to verify, we think that fake news is a growing plague on the internet and the world in general. It lets actual truths slip through the cracks while fake ideas and information are used as talking points that set dangerous precedents. In this report we wanted to create a tool to filter through these ideas so that the truth gets recognized and what is fake can be knowingly dismissed.

The dataset we used contained statements from various political and public figures, along with the level of truth assigned to each statement by PolitiFact. We wanted to use this information to evaluate new statements made by these individuals and categorize how true or false they might be. To process the data, we first normalized the labels into a binary true/untrue classification. We then normalized the features, giving each speaker a "truth score" based on their previous record of telling the truth.

Using these features, we trained various machine learning models on the training data. For a dry test of our hypothesis, we created a baseline that simply returned the median of the dataset labels; this lets us see exactly how much better our predictions do on the dataset (a minimal sketch of this baseline appears below). We used logistic regression with L1 regularization, L2 regularization, and polynomial feature transformations, and support vector machines with linear, RBF, and polynomial kernels. The logistic regression and support vector machine models gave a consistent average of roughly 70 percent accuracy on the data, beating the median score by roughly 30 percent. The loss of accuracy with increasing iterations could be attributed to overfitting and losing some of the general trends seen in the training data.

We looked at using XGBoost alongside Logistic Regression and Support Vector Machines, but we ran into time constraints when trying to implement it. It may have provided better results than those shown by LR and SVM.
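The median-label baseline mentioned above can be sketched in a few lines, again assuming the y_train/y_test labels from the earlier sketches:

```python
# "Dry test" baseline: predict the median training label for every test example.
import numpy as np

baseline = int(np.median(y_train))               # with binary labels this is the majority class
baseline_acc = float(np.mean(y_test == baseline))
print(f"median-label baseline accuracy: {baseline_acc:.2%}")
```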
Can we accurately predict fake news? - No. Not better than flipping a coin.
Commercial Applications
Showing that tools like this can reliably filter fake news could have a profound impact on politics, for example by automatically filtering malicious bots that attempt to inject misinformation into the public eye. Social media websites like Twitter and Facebook could use these systems to mitigate the legal risks that arise from malicious accounts generating and sharing fake news on their platforms.
Code
View my code (Jupyter Notebook) here: Notebook
Thank you for reading!