Fake news detection¶
The purpose of this project is to build a machine learning model that classifies news articles as real or fake based on their content. A model capable of recognizing fake news could power a browser extension that tells the user whether the article they are viewing seems credible, be integrated into social media services, or run as a standalone service where the user pastes a link and is told whether the article appears fake. Implementations like this could help hinder the spread of misinformation.
There are seven parts to the project.
1. Introduction¶
The data for the project is from Kaggle. At the time the data was downloaded for this project, the fake news detection dataset contained 21,417 rows of real news and 23,481 rows of fake news. The real news articles were collected from Reuters, one of the largest news agencies in the world. The fake news articles were collected from multiple sources that had been flagged by PolitiFact, a fact-checking organization in the United States. The majority of the articles cover politics and world news and were collected in 2016 and 2017.
The Python libraries and modules used in this project:
- Pandas is used for reading and manipulating data; the pandas DataFrame provides a structure for handling and cleaning data.
- NumPy is used internally by pandas and can also be used explicitly for numerical array operations.
- re is the regular-expression module from the Python standard library. It is very handy for matching and modifying string data.
- Seaborn and matplotlib are both visualization libraries, used for exploratory data analysis as well as model evaluation and interpretation.
- scikit-learn and xgboost provide the machine learning models used in the project.
- nltk is a library of text processing tools for natural language processing.
- wordcloud is used to visualize the most frequent words in the dataset.
- shap is used for understanding which features influence the predictions made by the machine learning models.
- kagglehub is used to download the data directly from Kaggle.
!pip install pandas numpy seaborn matplotlib scikit-learn xgboost nltk wordcloud shap kagglehub
import pandas as pd
import numpy as np
import kagglehub
from pathlib import Path
import re
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
import shap
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\anni_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\anni_\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
2. Data preprocessing¶
In this section we read the data and combine it into one dataframe.
The title and text of each article are run through a natural language processing pipeline, in which the text is turned to lowercase and all non-alphabetical characters as well as stopwords are removed. Additionally, lemmatization removes inflections from a word and returns its base form, the lemma. For instance, the word "walks" is lemmatized to "walk", as the short demo below shows.
The purpose of this process is to make the language easier for algorithms to process.
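A minimal sketch of how the lemmatizer behaves (illustrative only; note that WordNetLemmatizer is part-of-speech sensitive and treats words as nouns by default, so verb forms like "walking" are only reduced when pos="v" is passed):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("walks"))             # 'walk' (plural noun to singular)
print(lemmatizer.lemmatize("walking"))           # 'walking' (default POS is noun, so no change)
print(lemmatizer.lemmatize("walking", pos="v"))  # 'walk' (verb form reduced to base form)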
The data is shuffled after the true and fake news data is combined into one dataframe. This is important because in the combined dataframe all the fake news articles precede all the true news articles; if the data were never shuffled, the model could make predictions based on position in the dataset.
In addition to the regular preprocessing of the text data, the word "Reuters" is removed from the dataset. Because the true news articles are all sourced from Reuters, the presence of this word would let the predictive model use it as a shortcut in its predictions. Additionally, words like "via", "image", "video" and the names of weekdays, which refer to the metadata of the article, are removed. Ideally, the model should learn deeper semantic patterns instead of using words like these as surface-level cues.
import os
# Download the files via kagglehub
path = kagglehub.dataset_download("emineyetm/fake-news-detection-datasets")
fake_path = os.path.join(path, "News _dataset", "Fake.csv")
true_path = os.path.join(path, "News _dataset", "True.csv")
# Read both files
df_fake = pd.read_csv(fake_path)
df_true = pd.read_csv(true_path)
# Add a column to indicate true/fake. 0 indicates fake, 1 indicates true.
df_fake["label"] = 0
df_true["label"] = 1
# Combine the two datasets into one
df = pd.concat([df_fake, df_true], ignore_index=True)
# Shuffle the combined dataframe
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
# Check the resulting dataframe
print(df["label"].value_counts())
print("\nRows in the original csv files")
print("Fake rows", len(df_fake), "\nTrue rows", len(df_true))
label
0    23481
1    21417
Name: count, dtype: int64

Rows in the original csv files
Fake rows 23481 
True rows 21417
# Check for any null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44898 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44898 non-null  object
 1   text     44898 non-null  object
 2   subject  44898 non-null  object
 3   date     44898 non-null  object
 4   label    44898 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.7+ MB
# Take a look at the first 5 rows of the dataframe
df.head()
| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Ben Stein Calls Out 9th Circuit Court: Committ... | 21st Century Wire says Ben Stein, reputable pr... | US_News | February 13, 2017 | 0 |
| 1 | Trump drops Steve Bannon from National Securit... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | April 5, 2017 | 1 |
| 2 | Puerto Rico expects U.S. to lift Jones Act shi... | (Reuters) - Puerto Rico Governor Ricardo Rosse... | politicsNews | September 27, 2017 | 1 |
| 3 | OOPS: Trump Just Accidentally Confirmed He Le... | On Monday, Donald Trump once again embarrassed... | News | May 22, 2017 | 0 |
| 4 | Donald Trump heads for Scotland to reopen a go... | GLASGOW, Scotland (Reuters) - Most U.S. presid... | politicsNews | June 24, 2016 | 1 |
# Processing the text data, starting with turning everything to lowercase
df["title"] = df["title"].str.lower()
df["text"] = df["text"].str.lower()
# Remove punctuation and special characters using regex
df["title"] = df["title"].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
df["text"] = df["text"].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
# Remove stopwords
stop_words = set(stopwords.words("english"))
df["title"] = df["title"].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
df["text"] = df["text"].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
# Lemmatization: reduce each word to its base form
lemmatizer = WordNetLemmatizer()
df["title"] = df["title"].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))
df["text"] = df["text"].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))
# Remove source-words like Reuters and words referring to metadata of the article
source_words = ["reuters", "via", "image", "video", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
df["title"] = df["title"].apply(lambda x: ' '.join(word for word in x.split() if word not in source_words))
df["text"] = df["text"].apply(lambda x: ' '.join(word for word in x.split() if word not in source_words))
# Check the first five rows after processing
df.head()
| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | ben stein call th circuit court committed coup... | st century wire say ben stein reputable profes... | US_News | February 13, 2017 | 0 |
| 1 | trump drop steve bannon national security council | washington u president donald trump removed ch... | politicsNews | April 5, 2017 | 1 |
| 2 | puerto rico expects u lift jones act shipping ... | puerto rico governor ricardo rossello said exp... | politicsNews | September 27, 2017 | 1 |
| 3 | oops trump accidentally confirmed leaked israe... | donald trump embarrassed country accidentally ... | News | May 22, 2017 | 0 |
| 4 | donald trump head scotland reopen golf resort | glasgow scotland u presidential candidate go a... | politicsNews | June 24, 2016 | 1 |
# Drop rows that have duplicate text
df.drop_duplicates(subset="text", inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38510 entries, 0 to 38509
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    38510 non-null  object
 1   text     38510 non-null  object
 2   subject  38510 non-null  object
 3   date     38510 non-null  object
 4   label    38510 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.5+ MB
# Check the number of fake and true news rows in the dataset after dropping duplicates
print(df["label"].value_counts())
label
1    21063
0    17447
Name: count, dtype: int64
Originally the dataset had 21,417 rows of true news (1) and 23,481 rows of fake news (0). After dropping duplicates, there are significantly fewer fake news articles than before. One possible reason is that many sources post the same fake articles, and this behaviour is not as prevalent with true news.
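As a quick check of this idea (a hedged sketch: it must be run before the drop_duplicates call above, since the duplicates are gone afterwards), the duplicated texts can be counted per label:

# Count duplicated article texts per label (0 = fake, 1 = true).
# Note: run this before drop_duplicates() to see the dropped rows.
dup_mask = df.duplicated(subset="text", keep="first")
print(df.loc[dup_mask, "label"].value_counts())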
3. Exploratory data analysis¶
In this section we explore the data more to gain a better understanding of it.
# Visualize the distribution of true vs. fake news in the dataset
sns.countplot(x="label", data=df)
plt.title("Distribution of True vs. Fake News")
plt.xticks([0, 1], ["Fake", "True"])
plt.show()
# Calculate text length
df["text_length"] = df["text"].apply(lambda x: len(x.split()))
# Drop all rows where text length is 0
print("Before:", len(df))
df = df[df["text_length"] > 0]
print("After:", len(df))
Before: 38510
After: 38509
# Visualize difference in length between true and fake news articles
sns.boxplot(x="label", y="text_length", data=df, hue="label")
plt.xticks([0, 1], ["Fake", "True"])
plt.title("Boxplot of Text Length by News Type")
plt.xlabel("News Type")
plt.ylabel("Number of Words")
plt.show()
# Create descriptive statistics for text length by news type
df.groupby("label")["text_length"].describe()
| label | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 17446.0 | 227.886622 | 198.134997 | 1.0 | 148.0 | 199.0 | 267.0 | 4823.0 |
| 1 | 21063.0 | 223.315245 | 158.076110 | 13.0 | 86.0 | 207.0 | 302.0 | 2426.0 |
From the boxplot and the descriptive statistics we can see that the distributions of text length do not differ much between true and fake news. The fake news set has some very long outlier articles, but the mean and median are quite similar between the two categories.
# Discover most common words in fake vs. true articles
def get_top_n_words(corpus, n=None):
    all_words = ' '.join(corpus).split()
    return Counter(all_words).most_common(n)
top_fake_words = get_top_n_words(df[df["label"] == 0]["text"], 10)
top_true_words = get_top_n_words(df[df["label"] == 1]["text"], 10)
print(f"{'Top Fake Words':<25} {'Top True Words'}")
print("-" * 45)
for i in range(10):
    fake_word, fake_count = top_fake_words[i]
    true_word, true_count = top_true_words[i]
    print(f"{fake_word:<15} ({fake_count:>5}) {true_word:<15} ({true_count:>5})")
Top Fake Words            Top True Words
---------------------------------------------
trump           (64069)   said            (97426)
said            (22894)   trump           (53446)
people          (20748)   u               (40375)
president       (19629)   state           (35754)
one             (18259)   would           (31025)
would           (18156)   president       (26516)
state           (15922)   republican      (21926)
donald          (14891)   government      (19140)
u               (14846)   year            (18989)
like            (14456)   house           (16695)
# Create a word cloud visualization of the words in either category
fake_text = ' '.join(df[df["label"] == 0]["text"])
true_text = ' '.join(df[df["label"] == 1]["text"])
# Fake news word cloud
WordCloud(width=800, height=400, background_color="white").generate(fake_text).to_image()
# True news word cloud
WordCloud(width=800, height=400, background_color="white").generate(true_text).to_image()
No obvious differences between the categories stand out in the top-10 words or the word clouds. As stated in the dataset description on Kaggle, politics and world news topics are prevalent in the data. This makes sense, since political influence is one of the goals of fake news.
# Investigate the values of the subject column by category
df.groupby("label")["subject"].value_counts()
label  subject        
0      News                9050
       politics            4335
       left-news           2409
       Government News      869
       Middle-east          400
       US_News              383
1      politicsNews       11139
       worldnews            9924
Name: count, dtype: int64
We can see that although the topics are quite similar, different subject labels have been used for fake news and for true news. This creates the problem that the predictive model could use the subject as a shortcut when making predictions. Therefore, the subject column will not be included in the training data. Another option would be to discard these subject labels and create new ones using a topic modeling technique like BERTopic.
# Drop subject column
df = df.drop("subject", axis=1)
df.head()
| | title | text | date | label | text_length |
|---|---|---|---|---|---|
| 0 | ben stein call th circuit court committed coup... | st century wire say ben stein reputable profes... | February 13, 2017 | 0 | 99 |
| 1 | trump drop steve bannon national security council | washington u president donald trump removed ch... | April 5, 2017 | 1 | 461 |
| 2 | puerto rico expects u lift jones act shipping ... | puerto rico governor ricardo rossello said exp... | September 27, 2017 | 1 | 169 |
| 3 | oops trump accidentally confirmed leaked israe... | donald trump embarrassed country accidentally ... | May 22, 2017 | 0 | 103 |
| 4 | donald trump head scotland reopen golf resort | glasgow scotland u presidential candidate go a... | June 24, 2016 | 1 | 297 |
4. Feature engineering¶
In this section the data is converted to numerical format to make it ready for model building. The title and text of the article are combined into one text before the text is vectorized.
The TF-IDF method is used for converting the text. For each word, its frequency within each news article is calculated; this is the TF, or term frequency. IDF stands for inverse document frequency, and its value grows the rarer a word is across all the documents. The final numerical value for each word is the product of these two terms.
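As a toy illustration (separate from the actual pipeline below), vectorizing three short documents shows the effect: "news" occurs in every document and therefore gets a low weight, while words unique to one document get high weights there:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_docs = ["fake news spreads fast",
            "real news spreads slowly",
            "news travels far"]
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
# Rows are documents, columns are words; note the low weights for "news"
print(pd.DataFrame(toy_matrix.toarray(),
                   columns=toy_vectorizer.get_feature_names_out()).round(2))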
# Combine title and text of each article
df["combined_text"] = df["title"] + " " + df["text"]
# Create vectorizer and vectors
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = tfidf.fit_transform(df["combined_text"])
# Separate labels as the target variable
y = df["label"]
The data is split into training and test sets using train_test_split with an 80/20 divide. The stratify=y parameter ensures that the proportions of fake and true news in the training and test sets match those in the full dataset.
# Split data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y)
5. Model building¶
In this section, several models are built, to be compared in the next section.
- Logistic regression is a simple, interpretable linear model.
- Random Forest Classifier uses a combination of many decision trees and is able to handle non-linear relationships.
- XGBoost (Extreme Gradient Boosting) uses sequentially-built shallow decision trees.
- Multinomial Naive Bayes is a probabilistic model based on Bayes' theorem that assumes the features (words) are independent.
models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("Random Forest", RandomForestClassifier(n_estimators=100)),
    ("XGBoost", XGBClassifier(eval_metric='logloss')),
    ("Multinomial Naive Bayes", MultinomialNB())
]
trained_models = {}
for name, model in models:
    print(f"Training: {name}")
    model.fit(X_train, y_train)
    trained_models[name] = model
Training: Logistic Regression
Training: Random Forest
Training: XGBoost
Training: Multinomial Naive Bayes
6. Model evaluation¶
In this section the predictive powers of the models are compared to one another.
The models will be evaluated using an accuracy score, a classification report and a confusion matrix. Accuracy is simply the proportion of the model's predictions that were labeled correctly. The classification report goes into more detail, showing statistics separated by label. Precision is the accuracy of the predictions for each category: a precision of 0.98 for true news (1) means that 98 % of the articles labeled true by the model were actually true. Recall measures the sensitivity of the model: it shows the percentage of correctly labeled items out of all the items in that class, so a recall of 0.98 for true news (1) means that 98 % of all true news items were labeled as true by the model. Finally, the F1 score combines precision and recall into a single score. The formulas after the list below make these definitions precise.
A confusion matrix is another way to examine the performance of a model. The matrix visualizes the results in four categories:
- True negatives: cases where the model correctly predicted the fake news class (upper-left in the matrix)
- False positives: cases where the model incorrectly predicted the real news class, when it should have been fake news (upper-right)
- False negatives: cases where the model incorrectly predicted the fake news class, when it should have been real news (lower-left)
- True positives: cases where the model correctly predicted the real news class (lower-right)
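Written in terms of these counts (TP, TN, FP, FN), the metrics above are:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$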
# Evaluate each trained model's accuracy one at a time
for name, model in trained_models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n=== {name} ===")
    print(f"Accuracy: {acc:.4f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))
    # Visualize a confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
=== Logistic Regression ===
Accuracy: 0.9809
Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.97      0.98      3489
           1       0.98      0.99      0.98      4213

    accuracy                           0.98      7702
   macro avg       0.98      0.98      0.98      7702
weighted avg       0.98      0.98      0.98      7702

=== Random Forest ===
Accuracy: 0.9756
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97      3489
           1       0.97      0.98      0.98      4213

    accuracy                           0.98      7702
   macro avg       0.98      0.97      0.98      7702
weighted avg       0.98      0.98      0.98      7702

=== XGBoost ===
Accuracy: 0.9816
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98      3489
           1       0.98      0.99      0.98      4213

    accuracy                           0.98      7702
   macro avg       0.98      0.98      0.98      7702
weighted avg       0.98      0.98      0.98      7702

=== Multinomial Naive Bayes ===
Accuracy: 0.9365
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.94      0.93      3489
           1       0.95      0.94      0.94      4213

    accuracy                           0.94      7702
   macro avg       0.94      0.94      0.94      7702
weighted avg       0.94      0.94      0.94      7702
Let's cross-validate the models to make sure there are no overfitting problems. Rather than doing a single train/test split as before, cross-validation splits the data multiple times, so that every part of the dataset serves as test and training data in turn.
# Cross-validation with five folds
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"\n=== {name} ===")
    print("Cross-validation scores:", scores)
    print("Average accuracy:", np.mean(scores))
=== Logistic Regression ===
Cross-validation scores: [0.97727863 0.98130356 0.97818748 0.98169307 0.98259966]
Average accuracy: 0.9802124798665901

=== Random Forest ===
Cross-validation scores: [0.96935861 0.97429239 0.96987795 0.97494157 0.97649656]
Average accuracy: 0.972993417204853

=== XGBoost ===
Cross-validation scores: [0.9789665  0.9832511  0.98104388 0.98143339 0.98376834]
Average accuracy: 0.9816926452438788

=== Multinomial Naive Bayes ===
Cross-validation scores: [0.93001818 0.93183589 0.93767853 0.93663983 0.94039735]
Average accuracy: 0.9353139547481432
Based on the accuracy scores, classification reports, confusion matrices and cross-validation results, XGBoost is the strongest model.
7. Model interpretation¶
In this section the SHAP (Shapley additive explanations) approach is used to take a peek inside the XGBoost model.
xgboost_model = trained_models["XGBoost"]
explainer = shap.Explainer(xgboost_model)
# Compute SHAP values for the test set
shap_values = explainer(X_test)
# The summary plot needs a dense array, so convert the sparse matrix
X_dense = X_test.toarray()
shap.summary_plot(shap_values, X_dense, feature_names=tfidf.get_feature_names_out())
plt.show()
The colored dots in the graph are news articles (rows) in the test data. The twenty words and phrases listed on the left are the features with the greatest impact on the model's predictions, the most important at the top. The SHAP value on the x-axis indicates how much a feature pushed the model's prediction for that article: the further left a dot lies, the more it pushes the prediction towards fake news, and vice versa. The color scale from blue to red indicates the feature's value, i.e. the word's TF-IDF weight in that document, from low to high.
Based on this, the word "said" is the most influential in the model. The impact of political words such as "obama", "donald trump" and "hillary" depends on frequency and context.
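The same information can also be pulled out as a plain ranked list; a small sketch, assuming the shap_values and tfidf objects from the cells above:

# Rank features by their mean absolute SHAP value across the test set
mean_abs_shap = np.abs(shap_values.values).mean(axis=0)
feature_names = tfidf.get_feature_names_out()
top_features = np.argsort(mean_abs_shap)[::-1][:10]
for i in top_features:
    print(f"{feature_names[i]:<20} {mean_abs_shap[i]:.4f}")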
On the whole, the model performs very well on this data. Its performance would likely be weaker on news stories outside of politics and world news. More data on more diverse subjects would be required for the model to generalize across a variety of news topics.