News Headlines Dataset For Sarcasm Detection¶
• DOMAIN: Social media analytics
• Dataset: Dataset : News Headlines Dataset For Sarcasm Detection .
• CONTEXT: Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag based supervision but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets and detecting sarcasm in these requires the availability of contextual tweets.In this hands-on project, the goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.
• DATA DESCRIPTION: The dataset is collected from two news websites, theonion.com and huffingtonpost.com. This new dataset has the following advantages over the existing Twitter datasets: Since news headlines are written by professionals in a formal manner, there are no spelling mistakes and informal usage. This reduces the sparsity and also increases the chance of finding pre-trained embeddings. Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise as compared to Twitter datasets. Unlike tweets that reply to other tweets, the news headlines obtained are self-contained. This would help us in teasing apart the real sarcastic elements
Content: Each record consists of three attributes: is_sarcastic: 1 if the record is sarcastic otherwise 0 headline: the headline of the news article
article_link: link to the original news article. Useful in collecting supplementary data Reference: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#importing the packages.
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from tensorflow import keras
%matplotlib inline
import random, re
import nltk
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
from bs4 import BeautifulSoup
from wordcloud import WordCloud
from nltk.stem import WordNetLemmatizer
# Models
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import re
import os
import tensorflow as tf
!pip install text-preprocessing
from text_preprocessing import *
# Set random state
random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)
tf.__version__
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Package stopwords is already up-to-date!
Requirement already satisfied: text-preprocessing in /usr/local/lib/python3.10/dist-packages (0.1.1) Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (3.8.1) Requirement already satisfied: pyspellchecker in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (0.7.2) Requirement already satisfied: contractions in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (0.1.73) Requirement already satisfied: names-dataset==2.1 in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (2.1.0) Requirement already satisfied: unittest-xml-reporting in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (3.2.0) Requirement already satisfied: textsearch>=0.0.21 in /usr/local/lib/python3.10/dist-packages (from contractions->text-preprocessing) (0.0.24) Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (8.1.7) Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (1.3.2) Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (2023.6.3) Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (4.66.1) Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from unittest-xml-reporting->text-preprocessing) (4.9.3) Requirement already satisfied: anyascii in /usr/local/lib/python3.10/dist-packages (from textsearch>=0.0.21->contractions->text-preprocessing) (0.3.2) Requirement already satisfied: pyahocorasick in /usr/local/lib/python3.10/dist-packages (from textsearch>=0.0.21->contractions->text-preprocessing) (2.0.0)
'2.14.0'
#Set your project path
project_path = '/content/drive/MyDrive/GL_files/GL_Dataset/'
DF = pd.read_json(os.path.join(project_path,'Sarcasm_Headlines_Dataset.json'),lines=True)
##DF = pd.read_json('/content/drive/MyDrive/GL_files/GL_Dataset/Sarcasm_Headlines_Dataset.json',lines = True)
#Printing the dataset
DF
article_link | headline | is_sarcastic | |
---|---|---|---|
0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |
... | ... | ... | ... |
26704 | https://www.huffingtonpost.com/entry/american-... | american politics in moral free-fall | 0 |
26705 | https://www.huffingtonpost.com/entry/americas-... | america's best 20 hikes | 0 |
26706 | https://www.huffingtonpost.com/entry/reparatio... | reparations and obama | 0 |
26707 | https://www.huffingtonpost.com/entry/israeli-b... | israeli ban targeting boycott supporters raise... | 0 |
26708 | https://www.huffingtonpost.com/entry/gourmet-g... | gourmet gifts for the foodie 2014 | 0 |
26709 rows × 3 columns
DF.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 26709 entries, 0 to 26708 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 article_link 26709 non-null object 1 headline 26709 non-null object 2 is_sarcastic 26709 non-null int64 dtypes: int64(1), object(2) memory usage: 626.1+ KB
#Visualizing the stats of predicted data
sns.countplot(x='is_sarcastic',data=DF)
<Axes: xlabel='is_sarcastic', ylabel='count'>
print('--'*50); print('Value Counts for `is_sarcastic` label'); print('--'*50)
print(f'Is Sarcastic count: {DF[DF.is_sarcastic == 1].shape[0]} i.e. {round(DF[DF.is_sarcastic == 1].shape[0]/DF.shape[0]*100, 0)}%')
print(f'Isn\'t Sarcastic count: {DF[DF.is_sarcastic == 0].shape[0]} i.e. {round(DF[DF.is_sarcastic == 0].shape[0]/DF.shape[0]*100, 0)}%')
print('--'*50)
---------------------------------------------------------------------------------------------------- Value Counts for `is_sarcastic` label ---------------------------------------------------------------------------------------------------- Is Sarcastic count: 11724 i.e. 44.0% Isn't Sarcastic count: 14985 i.e. 56.0% ----------------------------------------------------------------------------------------------------
print('Analysis of `is_sarcastic` label by news website'); print('--'*50)
hf = DF[DF['article_link'].str.contains('huffingtonpost.com')].shape[0]
op = DF[DF['article_link'].str.contains('theonion.com')].shape[0]
is_sarcastic_hf = DF.loc[(DF['article_link'].str.contains('huffingtonpost.com')) & (DF['is_sarcastic'] == 1)].shape[0]
not_sarcastic_hf = DF.loc[(DF['article_link'].str.contains('huffingtonpost.com')) & (DF['is_sarcastic'] == 0)].shape[0]
is_sarcastic_op = DF.loc[(DF['article_link'].str.contains('theonion.com')) & (DF['is_sarcastic'] == 1)].shape[0]
not_sarcastic_op = DF.loc[(DF['article_link'].str.contains('theonion.com')) & (DF['is_sarcastic'] == 0)].shape[0]
display(pd.DataFrame([[is_sarcastic_hf, is_sarcastic_op], [not_sarcastic_hf, not_sarcastic_op]],
columns = ['huffingtonpost', 'theonion'], index = ['Sarcastic', 'Non-sarcastic']))
print('--'*50)
Analysis of `is_sarcastic` label by news website ----------------------------------------------------------------------------------------------------
huffingtonpost | theonion | |
---|---|---|
Sarcastic | 0 | 11724 |
Non-sarcastic | 14985 | 1 |
----------------------------------------------------------------------------------------------------
# Checking 5 random headlines and labels from the data where the length of headline is > 100
print('--'*30); print('Checking 5 random headlines and labels from the data where the length of headline is > 100'); print('--'*30)
indexes = list(DF.loc[DF['headline'].str.len() > 100, 'headline'].index)
rands = random.sample(indexes, 5)
headlines, labels = list(DF.loc[rands, 'headline']), list(DF.loc[rands, 'is_sarcastic'])
_ = [print(f'Headline: {head}\nlabel: {label}\n') for head, label in zip(headlines, labels)]
print('--'*30); print('Distributon of label where the length of headline is > 100'); print('--'*30)
_ = DF.loc[indexes, 'is_sarcastic'].value_counts().plot(kind = 'pie', autopct = '%.0f%%', labels = ['Sarcastic', 'Non-sarcastic'], figsize = (10, 6))
------------------------------------------------------------ Checking 5 random headlines and labels from the data where the length of headline is > 100 ------------------------------------------------------------ Headline: 'this women's strike won't accomplish anything,' reports man who will boycott upcoming 'avengers' movie label: 1 Headline: bernie sanders asks trump's education nominee if she's only getting the job because she's a billionaire label: 0 Headline: study finds controlled washington, d.c. wildfires crucial for restoring healthy political environment label: 1 Headline: trump's prefrontal cortex admits it can't possibly filter all impulsive comments coming from rest of brain label: 1 Headline: motion picture academy releases complete list of films that can be enjoyed without supporting sexual predator label: 1 ------------------------------------------------------------ Distributon of label where the length of headline is > 100 ------------------------------------------------------------
# Define regex patterns for URL patterns to remove
url_patterns = [
r"http[s]?://\S+",
r"www\.\S+",
r"\S+\.com",
r"\S+\.org",
# Will add more specific extensions if needed
]
# Combine the regex patterns into a single pattern
combined_pattern = "|".join(url_patterns)
# Remove URLs and specific patterns from the 'text' column
DF['headline'] = DF['headline'].apply(lambda x: re.sub(combined_pattern, '', x))
# Convert text to lowercase and remove non-alphabetic characters
DF['headline'] = DF['headline'].str.lower()
DF['headline'] = DF['headline'].apply(lambda x: re.sub("[^a-z\s]", " ", x))
# Remove extra whitespaces
DF['headline'] = DF['headline'].str.replace('\s+', ' ')
<ipython-input-63-0a1e37562e6c>:6: FutureWarning: The default value of regex will change from True to False in a future version. DF['headline'] = DF['headline'].str.replace('\s+', ' ')
stop = set(stopwords.words('english'))
punctuation = list(string.punctuation)
stop.update(punctuation)
def strip_html(text):
soup = BeautifulSoup(text, "html.parser")
return soup.get_text()
#Removing the square brackets
def remove_between_square_brackets(text):
return re.sub('\[[^]]*\]', '', text)
# Removing URL's
def remove_between_square_brackets(text):
return re.sub(r'http\S+', '', text)
#Removing the stopwords from text
def remove_stopwords(text):
final_text = []
for i in text.split():
if i.strip().lower() not in stop:
final_text.append(i.strip())
return " ".join(final_text)
# Create a WordNetLemmatizer instance
lemmatizer = WordNetLemmatizer()
# Define a function to lemmatize a string
def lemmatize_text(text):
words = text.split() # Split the text into words
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
return ' '.join(lemmatized_words) # Join the lemmatized words back into a string
#Removing the noisy text
def denoise_text(text):
text = strip_html(text)
text = remove_between_square_brackets(text)
text = remove_stopwords(text)
text = lemmatize_text(text)
return text
DF['headline']=DF['headline'].apply(denoise_text)
DF['headline']
0 former versace store clerk sue secret black co... 1 roseanne revival catch thorny political mood b... 2 mom starting fear son web series closest thing... 3 boehner want wife listen come alternative debt... 4 j k rowling wish snape happy birthday magical way ... 26704 american politics moral free fall 26705 america best hike 26706 reparation obama 26707 israeli ban targeting boycott supporter raise ... 26708 gourmet gift foodie Name: headline, Length: 26709, dtype: object
DF['headline']
0 former versace store clerk sue secret black co... 1 roseanne revival catch thorny political mood b... 2 mom starting fear son web series closest thing... 3 boehner want wife listen come alternative debt... 4 j k rowling wish snape happy birthday magical way ... 26704 american politics moral free fall 26705 america best hike 26706 reparation obama 26707 israeli ban targeting boycott supporter raise ... 26708 gourmet gift foodie Name: headline, Length: 26709, dtype: object
Word Cloud of headline¶
plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(DF[DF.is_sarcastic == 0].headline))
plt.imshow(wc , interpolation = 'bilinear')
<matplotlib.image.AxesImage at 0x7b22144dbdf0>
plt.figure(figsize = (20,20)) # Text that is Sarcastic
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(DF[DF.is_sarcastic == 1].headline))
plt.imshow(wc , interpolation = 'bilinear')
<matplotlib.image.AxesImage at 0x7b2215083610>
Visualization of characters in text¶
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
text_len=DF[DF['is_sarcastic']==1]['headline'].str.len()
ax1.hist(text_len)
ax1.set_title('Sarcastic text')
text_len=DF[DF['is_sarcastic']==0]['headline'].str.len()
ax2.hist(text_len, color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Characters in texts')
plt.show()
Visualization of words in texts¶
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=DF[DF['is_sarcastic']==1]['headline'].str.split().map(lambda x: len(x))
ax1.hist(text_len)
ax1.set_title('Sarcastic text')
text_len=DF[DF['is_sarcastic']==0]['headline'].str.split().map(lambda x: len(x))
ax2.hist(text_len,color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Words in texts')
plt.show()
Visualization of average word length in each text¶
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
word=DF[DF['is_sarcastic']==1]['headline'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1)
ax1.set_title('Sarcastic text')
word=DF[DF['is_sarcastic']==0]['headline'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Average word length in each text')
<ipython-input-72-a285dda25c60>:3: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1) /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice. return _methods._mean(a, axis=axis, dtype=dtype, /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) <ipython-input-72-a285dda25c60>:6: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green')
Text(0.5, 0.98, 'Average word length in each text')
Determine the length of each line and identify the maximum length. Since lines vary in length, it is necessary to pad our sequences using the maximum length.
Modelling¶
Determine Line Lengths and Identify the Maximum Length¶
Since lines vary in length, it is necessary to pad our sequences using the maximum length.
print('--'*40); print('Get the length of each line, find the maximum length and print the maximum length line');
print('Length of line ranges from 3 to 247.'); print('--'*40)
# Get length of each line
DF['line_length'] = DF['headline'].str.len()
print('Minimum line length: {}'.format(DF['line_length'].min()))
print('Maximum line length: {}'.format(DF['line_length'].max()))
print('Line with maximum length: {}'.format(DF[DF['line_length'] == DF['line_length'].max()]['headline'].values[0]))
-------------------------------------------------------------------------------- Get the length of each line, find the maximum length and print the maximum length line Length of line ranges from 3 to 247. -------------------------------------------------------------------------------- Minimum line length: 0 Maximum line length: 227 Line with maximum length: maya angelou poet author civil right activist holy cow tony award nominated actress college professor magazine editor streetcar conductor really streetcar conductor wow calypso singer nightclub performer foreign journalist dead
print('--'*40); print('Get the number of words, find the maximum number of words and print the maximum number of words');
print('Number of words ranges from 1 to 27.'); print('--'*40)
# Get length of each line
DF['nb_words'] = DF['headline'].apply(lambda x: len(x.split(' ')))
print('Minimum number of words: {}'.format(DF['nb_words'].min()))
print('Maximum number of words: {}'.format(DF['nb_words'].max()))
print('Line with maximum number of words: {}'.format(DF[DF['nb_words'] == DF['nb_words'].max()]['headline'].values[0]))
-------------------------------------------------------------------------------- Get the number of words, find the maximum number of words and print the maximum number of words Number of words ranges from 1 to 27. -------------------------------------------------------------------------------- Minimum number of words: 1 Maximum number of words: 30 Line with maximum number of words: maya angelou poet author civil right activist holy cow tony award nominated actress college professor magazine editor streetcar conductor really streetcar conductor wow calypso singer nightclub performer foreign journalist dead
print('--'*30); print('Five point summary for number of words')
display(DF['nb_words'].describe().round(0).astype(int));
print('99% quantilie: {}'.format(DF['nb_words'].quantile(0.99)));print('--'*30)
------------------------------------------------------------ Five point summary for number of words
count 26709 mean 7 std 2 min 1 25% 5 50% 7 75% 9 max 30 Name: nb_words, dtype: int64
99% quantilie: 13.0 ------------------------------------------------------------
Set Different Parameters for the model¶
max_features = 10000
maxlen = DF['nb_words'].max()
embedding_size = 200
trunc_type='post'
padding_type='post'
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(DF['headline']))
Define X and y for your model.
X = tokenizer.texts_to_sequences(DF['headline'])
X = pad_sequences(X, maxlen = maxlen, padding=padding_type, truncating=trunc_type)
y = np.asarray(DF['is_sarcastic'])
print(f'Number of Samples: {len(X)}')
print(f'Number of Labels: {len(y)}')
print(f'\nFirst headline:\n{X[0]}\n\nLabel of the first headline: {y[0]}')
Number of Samples: 26709 Number of Labels: 26709 First headline: [ 240 492 2699 1174 207 44 1766 1942 3527 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] Label of the first headline: 0
Get the Vocabulary size¶
# Reserve padding (indexed zero)
w2i = tokenizer.word_index
vocab_size = len(w2i) + 1
print(f'Number of unique tokens: {vocab_size}')
Number of unique tokens: 21719
glove_file = '/content/drive/MyDrive/GL_files/GL_Dataset/archive.zip'
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
z.extractall()
EMBEDDING_FILE = './glove.6B.200d.txt'
embeddings = {}
for o in open(EMBEDDING_FILE):
word = o.split(' ')[0]
embd = o.split(' ')[1:]
embd = np.asarray(embd, dtype = 'float32')
embeddings[word] = embd
# Getting the minimum number of words
num_words = min(max_features, vocab_size) + 1
embedding_matrix = np.zeros((num_words, embedding_size))
for word, i in tokenizer.word_index.items():
if i > max_features: continue
embedding_vector = embeddings.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector
len(embeddings.values())
400000
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, shuffle = True)
print('---'*20, f'\nNumber of rows in training dataset: {x_train.shape[0]}')
print(f'Number of columns in training dataset: {x_train.shape[1]}')
print(f'Number of unique words in training dataset: {len(np.unique(np.hstack(x_train)))}')
print('---'*20, f'\nNumber of rows in test dataset: {x_test.shape[0]}')
print(f'Number of columns in test dataset: {x_test.shape[1]}')
print(f'Number of unique words in test dataset: {len(np.unique(np.hstack(x_test)))}')
------------------------------------------------------------ Number of rows in training dataset: 21367 Number of columns in training dataset: 30 Number of unique words in training dataset: 9950 ------------------------------------------------------------ Number of rows in test dataset: 5342 Number of columns in test dataset: 30 Number of unique words in test dataset: 7297
model = Sequential()
model.add(Embedding(num_words, embedding_size, embeddings_initializer = Constant(embedding_matrix), input_length = maxlen, trainable = False))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5, input_shape = (256,)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5, input_shape = (128,)))
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.5, input_shape = (64,)))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 10)
mc = ModelCheckpoint('sarcasm_detector.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)
lr_r = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.1, patience = 5),
logdir = 'log'; tb = TensorBoard(logdir, histogram_freq = 1)
callbacks = [es, mc, lr_r, tb]
print(model.summary())
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 30, 200) 2000200 bidirectional_1 (Bidirecti (None, 30, 256) 336896 onal) global_max_pooling1d_1 (Gl (None, 256) 0 obalMaxPooling1D) dropout_3 (Dropout) (None, 256) 0 dense_3 (Dense) (None, 128) 32896 dropout_4 (Dropout) (None, 128) 0 dense_4 (Dense) (None, 64) 8256 dropout_5 (Dropout) (None, 64) 0 dense_5 (Dense) (None, 1) 65 ================================================================= Total params: 2378313 (9.07 MB) Trainable params: 378113 (1.44 MB) Non-trainable params: 2000200 (7.63 MB) _________________________________________________________________ None
tf.keras.utils.plot_model(model, show_shapes = True)
batch_size = 100
epochs = 6
h = model.fit(x_train, y_train, epochs = epochs, validation_split = 0.2, batch_size = batch_size, verbose = 2, callbacks = callbacks)
Epoch 1/6 Epoch 1: val_loss improved from inf to 0.52759, saving model to sarcasm_detector.h5
/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py:3079: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`. saving_api.save_model(
171/171 - 44s - loss: 0.6198 - accuracy: 0.6481 - val_loss: 0.5276 - val_accuracy: 0.7387 - lr: 0.0010 - 44s/epoch - 256ms/step Epoch 2/6 Epoch 2: val_loss improved from 0.52759 to 0.47907, saving model to sarcasm_detector.h5 171/171 - 38s - loss: 0.5212 - accuracy: 0.7464 - val_loss: 0.4791 - val_accuracy: 0.7637 - lr: 0.0010 - 38s/epoch - 223ms/step Epoch 3/6 Epoch 3: val_loss improved from 0.47907 to 0.47037, saving model to sarcasm_detector.h5 171/171 - 37s - loss: 0.4697 - accuracy: 0.7775 - val_loss: 0.4704 - val_accuracy: 0.7693 - lr: 0.0010 - 37s/epoch - 214ms/step Epoch 4/6 Epoch 4: val_loss improved from 0.47037 to 0.44223, saving model to sarcasm_detector.h5 171/171 - 38s - loss: 0.4236 - accuracy: 0.8079 - val_loss: 0.4422 - val_accuracy: 0.7894 - lr: 0.0010 - 38s/epoch - 225ms/step Epoch 5/6 Epoch 5: val_loss did not improve from 0.44223 171/171 - 31s - loss: 0.3851 - accuracy: 0.8265 - val_loss: 0.4454 - val_accuracy: 0.7857 - lr: 0.0010 - 31s/epoch - 183ms/step Epoch 6/6 Epoch 6: val_loss improved from 0.44223 to 0.43770, saving model to sarcasm_detector.h5 171/171 - 39s - loss: 0.3523 - accuracy: 0.8437 - val_loss: 0.4377 - val_accuracy: 0.7974 - lr: 0.0010 - 39s/epoch - 230ms/step
%load_ext tensorboard
%tensorboard --logdir log/
The tensorboard extension is already loaded. To reload it, use: %reload_ext tensorboard
Reusing TensorBoard on port 6006 (pid 14360), started 0:22:20 ago. (Use '!kill 14360' to kill it.)
f, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 7.2))
f.suptitle('Monitoring the performance of the model')
ax1.plot(h.history['loss'], label = 'Train')
ax1.plot(h.history['val_loss'], label = 'Test')
ax1.set_title('Model Loss')
ax1.legend(['Train', 'Test'])
ax2.plot(h.history['accuracy'], label = 'Train')
ax2.plot(h.history['val_accuracy'], label = 'Test')
ax2.set_title('Model Accuracy')
ax2.legend(['Train', 'Test'])
plt.show()
# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, verbose = 0)
print('Overall Accuracy: {}'.format(round(accuracy * 100, 0)))
Overall Accuracy: 81.0
y_pred = (model.predict(x_test) > 0.5).astype('int32')
print(f'Classification Report:\n{classification_report(y_pred, y_test)}')
167/167 [==============================] - 7s 37ms/step Classification Report: precision recall f1-score support 0 0.83 0.83 0.83 2991 1 0.78 0.78 0.78 2351 accuracy 0.81 5342 macro avg 0.80 0.80 0.80 5342 weighted avg 0.81 0.81 0.81 5342
print('--'*30); print('Confusion Matrix')
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm , index = ['Non-sarcastic', 'Sarcastic'] , columns = ['Non-sarcastic','Sarcastic'])
display(cm); print('--'*30)
plt.figure(figsize = (8, 5))
_ = sns.heatmap(cm, cmap= 'Blues', linecolor = 'black' , linewidth = 1 , annot = True,
fmt = '' , xticklabels = ['Non-sarcastic', 'Sarcastic'],
yticklabels = ['Non-sarcastic', 'Sarcastic']).set_title('Confusion Matrix')
------------------------------------------------------------ Confusion Matrix
Non-sarcastic | Sarcastic | |
---|---|---|
Non-sarcastic | 2480 | 516 |
Sarcastic | 511 | 1835 |
------------------------------------------------------------
print('Evaluate model on sample sarcastic lines'); print('--'*30)
statements = ['Are you always so stupid or is today a special ocassion?', #Sarcasm
'I feel so miserable without you, it\'s almost like having you here.', #Sarcasm
'If you find me offensive. Then I suggest you quit finding me.', #Sarcasm
'If I wanted to kill myself I would climb your ego and jump to your IQ.', #Sarcasm
'Amphibious pitcher makes debut', #Sarcasm
'It\'s okay if you don\'t like me. Not everyone has good taste.' #Sarcasm
]
# Process the statements
for statement in statements:
statement = denoise_text(statement)
headline = tokenizer.texts_to_sequences(statement)
headline = pad_sequences(headline, maxlen = maxlen, dtype = 'int32', value = 0)
sentiment = (model.predict(headline) > 0.5).astype('int32')
if(np.argmax(sentiment) == 0):
print(f'`{statement}` is a Non-sarcastic statement.')
elif (np.argmax(sentiment) == 1):
print(f'`{statement}` is a Sarcastic statement.')
Evaluate model on sample sarcastic lines ------------------------------------------------------------ 2/2 [==============================] - 0s 21ms/step `always stupid today special ocassion?` is a Sarcastic statement. 2/2 [==============================] - 0s 46ms/step `feel miserable without you, almost like here.` is a Sarcastic statement. 2/2 [==============================] - 0s 33ms/step 2/2 [==============================] - 0s 19ms/step `wanted kill would climb ego jump IQ.` is a Non-sarcastic statement. 1/1 [==============================] - 0s 141ms/step 2/2 [==============================] - 0s 23ms/step
The model achieved an accuracy of 81.0%.