News Headlines Dataset For Sarcasm Detection¶

• DOMAIN: Social media analytics

• Dataset: Dataset : News Headlines Dataset For Sarcasm Detection .

• CONTEXT: Past studies in Sarcasm Detection mostly make use of Twitter datasets collected using hashtag based supervision but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets and detecting sarcasm in these requires the availability of contextual tweets.In this hands-on project, the goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.

• DATA DESCRIPTION: The dataset is collected from two news websites, theonion.com and huffingtonpost.com. This new dataset has the following advantages over the existing Twitter datasets: Since news headlines are written by professionals in a formal manner, there are no spelling mistakes and informal usage. This reduces the sparsity and also increases the chance of finding pre-trained embeddings. Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise as compared to Twitter datasets. Unlike tweets that reply to other tweets, the news headlines obtained are self-contained. This would help us in teasing apart the real sarcastic elements

Content: Each record consists of three attributes: is_sarcastic: 1 if the record is sarcastic otherwise 0 headline: the headline of the news article

article_link: link to the original news article. Useful in collecting supplementary data Reference: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

In [ ]:

# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [ ]:

#importing the packages.
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from tensorflow import keras
%matplotlib inline

import random, re
import nltk
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
from bs4 import BeautifulSoup
from wordcloud import WordCloud
from nltk.stem import WordNetLemmatizer

# Models
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout, Flatten, Bidirectional, GlobalMaxPool1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import re
import os
import tensorflow as tf
!pip install text-preprocessing
from text_preprocessing import *
# Set random state
random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)

tf.__version__

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Requirement already satisfied: text-preprocessing in /usr/local/lib/python3.10/dist-packages (0.1.1)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (3.8.1)
Requirement already satisfied: pyspellchecker in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (0.7.2)
Requirement already satisfied: contractions in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (0.1.73)
Requirement already satisfied: names-dataset==2.1 in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (2.1.0)
Requirement already satisfied: unittest-xml-reporting in /usr/local/lib/python3.10/dist-packages (from text-preprocessing) (3.2.0)
Requirement already satisfied: textsearch>=0.0.21 in /usr/local/lib/python3.10/dist-packages (from contractions->text-preprocessing) (0.0.24)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk->text-preprocessing) (4.66.1)
Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from unittest-xml-reporting->text-preprocessing) (4.9.3)
Requirement already satisfied: anyascii in /usr/local/lib/python3.10/dist-packages (from textsearch>=0.0.21->contractions->text-preprocessing) (0.3.2)
Requirement already satisfied: pyahocorasick in /usr/local/lib/python3.10/dist-packages (from textsearch>=0.0.21->contractions->text-preprocessing) (2.0.0)

Out[ ]:

'2.14.0'

In [ ]:

#Set your project path
project_path =  '/content/drive/MyDrive/GL_files/GL_Dataset/'
DF = pd.read_json(os.path.join(project_path,'Sarcasm_Headlines_Dataset.json'),lines=True)
##DF = pd.read_json('/content/drive/MyDrive/GL_files/GL_Dataset/Sarcasm_Headlines_Dataset.json',lines = True)

In [ ]:

#Printing the dataset
DF

Out[ ]:

	article_link	headline	is_sarcastic
0	https://www.huffingtonpost.com/entry/versace-b...	former versace store clerk sues over secret 'b...	0
1	https://www.huffingtonpost.com/entry/roseanne-...	the 'roseanne' revival catches up to our thorn...	0
2	https://local.theonion.com/mom-starting-to-fea...	mom starting to fear son's web series closest ...	1
3	https://politics.theonion.com/boehner-just-wan...	boehner just wants wife to listen, not come up...	1
4	https://www.huffingtonpost.com/entry/jk-rowlin...	j.k. rowling wishes snape happy birthday in th...	0
...	...	...	...
26704	https://www.huffingtonpost.com/entry/american-...	american politics in moral free-fall	0
26705	https://www.huffingtonpost.com/entry/americas-...	america's best 20 hikes	0
26706	https://www.huffingtonpost.com/entry/reparatio...	reparations and obama	0
26707	https://www.huffingtonpost.com/entry/israeli-b...	israeli ban targeting boycott supporters raise...	0
26708	https://www.huffingtonpost.com/entry/gourmet-g...	gourmet gifts for the foodie 2014	0

26709 rows × 3 columns

In [ ]:

DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26709 entries, 0 to 26708
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   article_link  26709 non-null  object
 1   headline      26709 non-null  object
 2   is_sarcastic  26709 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 626.1+ KB

In [ ]:

#Visualizing the stats of predicted data
sns.countplot(x='is_sarcastic',data=DF)

Out[ ]:

<Axes: xlabel='is_sarcastic', ylabel='count'>

No description has been provided for this image

In [ ]:

print('--'*50); print('Value Counts for `is_sarcastic` label'); print('--'*50)
print(f'Is Sarcastic count: {DF[DF.is_sarcastic == 1].shape[0]} i.e. {round(DF[DF.is_sarcastic == 1].shape[0]/DF.shape[0]*100, 0)}%')
print(f'Isn\'t Sarcastic count: {DF[DF.is_sarcastic == 0].shape[0]} i.e. {round(DF[DF.is_sarcastic == 0].shape[0]/DF.shape[0]*100, 0)}%')
print('--'*50)

----------------------------------------------------------------------------------------------------
Value Counts for `is_sarcastic` label
----------------------------------------------------------------------------------------------------
Is Sarcastic count: 11724 i.e. 44.0%
Isn't Sarcastic count: 14985 i.e. 56.0%
----------------------------------------------------------------------------------------------------

In [ ]:

print('Analysis of `is_sarcastic` label by news website'); print('--'*50)

hf = DF[DF['article_link'].str.contains('huffingtonpost.com')].shape[0]
op = DF[DF['article_link'].str.contains('theonion.com')].shape[0]

is_sarcastic_hf = DF.loc[(DF['article_link'].str.contains('huffingtonpost.com')) & (DF['is_sarcastic'] == 1)].shape[0]
not_sarcastic_hf = DF.loc[(DF['article_link'].str.contains('huffingtonpost.com')) & (DF['is_sarcastic'] == 0)].shape[0]

is_sarcastic_op = DF.loc[(DF['article_link'].str.contains('theonion.com')) & (DF['is_sarcastic'] == 1)].shape[0]
not_sarcastic_op = DF.loc[(DF['article_link'].str.contains('theonion.com')) & (DF['is_sarcastic'] == 0)].shape[0]

display(pd.DataFrame([[is_sarcastic_hf, is_sarcastic_op], [not_sarcastic_hf, not_sarcastic_op]],
                     columns = ['huffingtonpost', 'theonion'], index = ['Sarcastic', 'Non-sarcastic']))
print('--'*50)

Analysis of `is_sarcastic` label by news website
----------------------------------------------------------------------------------------------------

	huffingtonpost	theonion
Sarcastic	0	11724
Non-sarcastic	14985	1

----------------------------------------------------------------------------------------------------

In [ ]:

# Checking 5 random headlines and labels from the data where the length of headline is > 100
print('--'*30); print('Checking 5 random headlines and labels from the data where the length of headline is > 100'); print('--'*30)
indexes = list(DF.loc[DF['headline'].str.len() > 100, 'headline'].index)
rands = random.sample(indexes, 5)
headlines, labels = list(DF.loc[rands, 'headline']), list(DF.loc[rands, 'is_sarcastic'])

_ = [print(f'Headline: {head}\nlabel: {label}\n') for head, label in zip(headlines, labels)]

print('--'*30); print('Distributon of label where the length of headline is > 100'); print('--'*30)
_ = DF.loc[indexes, 'is_sarcastic'].value_counts().plot(kind = 'pie', autopct = '%.0f%%', labels = ['Sarcastic', 'Non-sarcastic'], figsize = (10, 6))

------------------------------------------------------------
Checking 5 random headlines and labels from the data where the length of headline is > 100
------------------------------------------------------------
Headline: 'this women's strike won't accomplish anything,' reports man who will boycott upcoming 'avengers' movie
label: 1

Headline: bernie sanders asks trump's education nominee if she's only getting the job because she's a billionaire
label: 0

Headline: study finds controlled washington, d.c. wildfires crucial for restoring healthy political environment
label: 1

Headline: trump's prefrontal cortex admits it can't possibly filter all impulsive comments coming from rest of brain
label: 1

Headline: motion picture academy releases complete list of films that can be enjoyed without supporting sexual predator
label: 1

------------------------------------------------------------
Distributon of label where the length of headline is > 100
------------------------------------------------------------

In [ ]:

# Define regex patterns for URL patterns to remove
url_patterns = [
    r"http[s]?://\S+",
    r"www\.\S+",
    r"\S+\.com",
    r"\S+\.org",
    # Will add more specific extensions if needed
]

# Combine the regex patterns into a single pattern
combined_pattern = "|".join(url_patterns)

# Remove URLs and specific patterns from the 'text' column
DF['headline'] = DF['headline'].apply(lambda x: re.sub(combined_pattern, '', x))

In [ ]:

# Convert text to lowercase and remove non-alphabetic characters
DF['headline'] = DF['headline'].str.lower()
DF['headline'] = DF['headline'].apply(lambda x: re.sub("[^a-z\s]", " ", x))

# Remove extra whitespaces
DF['headline'] = DF['headline'].str.replace('\s+', ' ')

<ipython-input-63-0a1e37562e6c>:6: FutureWarning: The default value of regex will change from True to False in a future version.
  DF['headline'] = DF['headline'].str.replace('\s+', ' ')

In [ ]:

stop = set(stopwords.words('english'))
punctuation = list(string.punctuation)
stop.update(punctuation)
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)
# Removing URL's
def remove_between_square_brackets(text):
    return re.sub(r'http\S+', '', text)
#Removing the stopwords from text
def remove_stopwords(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)
# Create a WordNetLemmatizer instance
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize a string
def lemmatize_text(text):
    words = text.split()  # Split the text into words
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)  # Join the lemmatized words back into a string

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_stopwords(text)
    text = lemmatize_text(text)

    return text
DF['headline']=DF['headline'].apply(denoise_text)

In [ ]:

DF['headline']

Out[ ]:

0        former versace store clerk sue secret black co...
1        roseanne revival catch thorny political mood b...
2        mom starting fear son web series closest thing...
3        boehner want wife listen come alternative debt...
4        j k rowling wish snape happy birthday magical way
                               ...                        
26704                    american politics moral free fall
26705                                    america best hike
26706                                     reparation obama
26707    israeli ban targeting boycott supporter raise ...
26708                                  gourmet gift foodie
Name: headline, Length: 26709, dtype: object

In [ ]:

DF['headline']

Out[ ]:

0        former versace store clerk sue secret black co...
1        roseanne revival catch thorny political mood b...
2        mom starting fear son web series closest thing...
3        boehner want wife listen come alternative debt...
4        j k rowling wish snape happy birthday magical way
                               ...                        
26704                    american politics moral free fall
26705                                    america best hike
26706                                     reparation obama
26707    israeli ban targeting boycott supporter raise ...
26708                                  gourmet gift foodie
Name: headline, Length: 26709, dtype: object

Word Cloud of headline¶

In [ ]:

plt.figure(figsize = (20,20))
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(DF[DF.is_sarcastic == 0].headline))
plt.imshow(wc , interpolation = 'bilinear')

Out[ ]:

<matplotlib.image.AxesImage at 0x7b22144dbdf0>

In [ ]:

plt.figure(figsize = (20,20)) # Text that is Sarcastic
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate(" ".join(DF[DF.is_sarcastic == 1].headline))
plt.imshow(wc , interpolation = 'bilinear')

Out[ ]:

<matplotlib.image.AxesImage at 0x7b2215083610>

Visualization of characters in text¶

In [ ]:

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
text_len=DF[DF['is_sarcastic']==1]['headline'].str.len()
ax1.hist(text_len)
ax1.set_title('Sarcastic text')
text_len=DF[DF['is_sarcastic']==0]['headline'].str.len()
ax2.hist(text_len, color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Characters in texts')
plt.show()

Visualization of words in texts¶

In [ ]:

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
text_len=DF[DF['is_sarcastic']==1]['headline'].str.split().map(lambda x: len(x))
ax1.hist(text_len)
ax1.set_title('Sarcastic text')
text_len=DF[DF['is_sarcastic']==0]['headline'].str.split().map(lambda x: len(x))
ax2.hist(text_len,color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Words in texts')
plt.show()

Visualization of average word length in each text¶

In [ ]:

fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
word=DF[DF['is_sarcastic']==1]['headline'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1)
ax1.set_title('Sarcastic text')
word=DF[DF['is_sarcastic']==0]['headline'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green')
ax2.set_title('Not Sarcastic text')
fig.suptitle('Average word length in each text')

<ipython-input-72-a285dda25c60>:3: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1)
/usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
<ipython-input-72-a285dda25c60>:6: UserWarning: 

`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green')

Out[ ]:

Text(0.5, 0.98, 'Average word length in each text')

Determine the length of each line and identify the maximum length. Since lines vary in length, it is necessary to pad our sequences using the maximum length.

Modelling¶

Determine Line Lengths and Identify the Maximum Length¶

Since lines vary in length, it is necessary to pad our sequences using the maximum length.

In [ ]:

print('--'*40); print('Get the length of each line, find the maximum length and print the maximum length line');
print('Length of line ranges from 3 to 247.'); print('--'*40)
# Get length of each line
DF['line_length'] = DF['headline'].str.len()

print('Minimum line length: {}'.format(DF['line_length'].min()))
print('Maximum line length: {}'.format(DF['line_length'].max()))
print('Line with maximum length: {}'.format(DF[DF['line_length'] == DF['line_length'].max()]['headline'].values[0]))

--------------------------------------------------------------------------------
Get the length of each line, find the maximum length and print the maximum length line
Length of line ranges from 3 to 247.
--------------------------------------------------------------------------------
Minimum line length: 0
Maximum line length: 227
Line with maximum length: maya angelou poet author civil right activist holy cow tony award nominated actress college professor magazine editor streetcar conductor really streetcar conductor wow calypso singer nightclub performer foreign journalist dead

In [ ]:

print('--'*40); print('Get the number of words, find the maximum number of words and print the maximum number of words');
print('Number of words ranges from 1 to 27.'); print('--'*40)
# Get length of each line
DF['nb_words'] = DF['headline'].apply(lambda x: len(x.split(' ')))

print('Minimum number of words: {}'.format(DF['nb_words'].min()))
print('Maximum number of words: {}'.format(DF['nb_words'].max()))
print('Line with maximum number of words: {}'.format(DF[DF['nb_words'] == DF['nb_words'].max()]['headline'].values[0]))

--------------------------------------------------------------------------------
Get the number of words, find the maximum number of words and print the maximum number of words
Number of words ranges from 1 to 27.
--------------------------------------------------------------------------------
Minimum number of words: 1
Maximum number of words: 30
Line with maximum number of words: maya angelou poet author civil right activist holy cow tony award nominated actress college professor magazine editor streetcar conductor really streetcar conductor wow calypso singer nightclub performer foreign journalist dead

In [ ]:

print('--'*30); print('Five point summary for number of words')
display(DF['nb_words'].describe().round(0).astype(int));

print('99% quantilie: {}'.format(DF['nb_words'].quantile(0.99)));print('--'*30)

------------------------------------------------------------
Five point summary for number of words

count    26709
mean         7
std          2
min          1
25%          5
50%          7
75%          9
max         30
Name: nb_words, dtype: int64

99% quantilie: 13.0
------------------------------------------------------------

Set Different Parameters for the model¶

In [ ]:

max_features = 10000
maxlen = DF['nb_words'].max()
embedding_size = 200
trunc_type='post'
padding_type='post'

Preprocessing the dataset¶

Applying Keras Tokenizer¶

In [ ]:

tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(DF['headline']))

Define X and y for your model.

In [ ]:

X = tokenizer.texts_to_sequences(DF['headline'])
X = pad_sequences(X, maxlen = maxlen, padding=padding_type, truncating=trunc_type)
y = np.asarray(DF['is_sarcastic'])

print(f'Number of Samples: {len(X)}')
print(f'Number of Labels: {len(y)}')
print(f'\nFirst headline:\n{X[0]}\n\nLabel of the first headline: {y[0]}')

Number of Samples: 26709
Number of Labels: 26709

First headline:
[ 240  492 2699 1174  207   44 1766 1942 3527    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]

Label of the first headline: 0

Get the Vocabulary size¶

In [ ]:

# Reserve padding (indexed zero)
w2i = tokenizer.word_index
vocab_size = len(w2i) + 1
print(f'Number of unique tokens: {vocab_size}')

Number of unique tokens: 21719

In [ ]:

glove_file = '/content/drive/MyDrive/GL_files/GL_Dataset/archive.zip'

In [ ]:

#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
  z.extractall()

In [ ]:

EMBEDDING_FILE = './glove.6B.200d.txt'

embeddings = {}
for o in open(EMBEDDING_FILE):
    word = o.split(' ')[0]
    embd = o.split(' ')[1:]
    embd = np.asarray(embd, dtype = 'float32')
    embeddings[word] = embd

In [ ]:

# Getting the minimum number of words
num_words = min(max_features, vocab_size) + 1

embedding_matrix = np.zeros((num_words, embedding_size))
for word, i in tokenizer.word_index.items():
    if i > max_features: continue
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

len(embeddings.values())

Out[ ]:

In [ ]:

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, shuffle = True)

print('---'*20, f'\nNumber of rows in training dataset: {x_train.shape[0]}')
print(f'Number of columns in training dataset: {x_train.shape[1]}')
print(f'Number of unique words in training dataset: {len(np.unique(np.hstack(x_train)))}')

print('---'*20, f'\nNumber of rows in test dataset: {x_test.shape[0]}')
print(f'Number of columns in test dataset: {x_test.shape[1]}')
print(f'Number of unique words in test dataset: {len(np.unique(np.hstack(x_test)))}')

------------------------------------------------------------ 
Number of rows in training dataset: 21367
Number of columns in training dataset: 30
Number of unique words in training dataset: 9950
------------------------------------------------------------ 
Number of rows in test dataset: 5342
Number of columns in test dataset: 30
Number of unique words in test dataset: 7297

In [ ]:

model = Sequential()
model.add(Embedding(num_words, embedding_size, embeddings_initializer = Constant(embedding_matrix), input_length = maxlen, trainable = False))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.5, input_shape = (256,)))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.5, input_shape = (128,)))
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.5, input_shape = (64,)))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 10)
mc = ModelCheckpoint('sarcasm_detector.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)
lr_r = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.1, patience = 5),
logdir = 'log'; tb = TensorBoard(logdir, histogram_freq = 1)
callbacks = [es, mc, lr_r, tb]

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, 30, 200)           2000200   
                                                                 
 bidirectional_1 (Bidirecti  (None, 30, 256)           336896    
 onal)                                                           
                                                                 
 global_max_pooling1d_1 (Gl  (None, 256)               0         
 obalMaxPooling1D)                                               
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 dense_3 (Dense)             (None, 128)               32896     
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 64)                8256      
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 2378313 (9.07 MB)
Trainable params: 378113 (1.44 MB)
Non-trainable params: 2000200 (7.63 MB)
_________________________________________________________________
None

In [ ]:

tf.keras.utils.plot_model(model, show_shapes = True)

Out[ ]:

In [ ]:

batch_size = 100
epochs = 6

h = model.fit(x_train, y_train, epochs = epochs, validation_split = 0.2, batch_size = batch_size, verbose = 2, callbacks = callbacks)

Epoch 1/6

Epoch 1: val_loss improved from inf to 0.52759, saving model to sarcasm_detector.h5

/usr/local/lib/python3.10/dist-packages/keras/src/engine/training.py:3079: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
  saving_api.save_model(

171/171 - 44s - loss: 0.6198 - accuracy: 0.6481 - val_loss: 0.5276 - val_accuracy: 0.7387 - lr: 0.0010 - 44s/epoch - 256ms/step
Epoch 2/6

Epoch 2: val_loss improved from 0.52759 to 0.47907, saving model to sarcasm_detector.h5
171/171 - 38s - loss: 0.5212 - accuracy: 0.7464 - val_loss: 0.4791 - val_accuracy: 0.7637 - lr: 0.0010 - 38s/epoch - 223ms/step
Epoch 3/6

Epoch 3: val_loss improved from 0.47907 to 0.47037, saving model to sarcasm_detector.h5
171/171 - 37s - loss: 0.4697 - accuracy: 0.7775 - val_loss: 0.4704 - val_accuracy: 0.7693 - lr: 0.0010 - 37s/epoch - 214ms/step
Epoch 4/6

Epoch 4: val_loss improved from 0.47037 to 0.44223, saving model to sarcasm_detector.h5
171/171 - 38s - loss: 0.4236 - accuracy: 0.8079 - val_loss: 0.4422 - val_accuracy: 0.7894 - lr: 0.0010 - 38s/epoch - 225ms/step
Epoch 5/6

Epoch 5: val_loss did not improve from 0.44223
171/171 - 31s - loss: 0.3851 - accuracy: 0.8265 - val_loss: 0.4454 - val_accuracy: 0.7857 - lr: 0.0010 - 31s/epoch - 183ms/step
Epoch 6/6

Epoch 6: val_loss improved from 0.44223 to 0.43770, saving model to sarcasm_detector.h5
171/171 - 39s - loss: 0.3523 - accuracy: 0.8437 - val_loss: 0.4377 - val_accuracy: 0.7974 - lr: 0.0010 - 39s/epoch - 230ms/step

In [ ]:

%load_ext tensorboard
%tensorboard --logdir log/

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard

Reusing TensorBoard on port 6006 (pid 14360), started 0:22:20 ago. (Use '!kill 14360' to kill it.)

In [ ]:

f, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 7.2))
f.suptitle('Monitoring the performance of the model')

ax1.plot(h.history['loss'], label = 'Train')
ax1.plot(h.history['val_loss'], label = 'Test')
ax1.set_title('Model Loss')
ax1.legend(['Train', 'Test'])

ax2.plot(h.history['accuracy'], label = 'Train')
ax2.plot(h.history['val_accuracy'], label = 'Test')
ax2.set_title('Model Accuracy')
ax2.legend(['Train', 'Test'])

plt.show()

In [ ]:

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, verbose = 0)
print('Overall Accuracy: {}'.format(round(accuracy * 100, 0)))

Overall Accuracy: 81.0

In [ ]:

y_pred = (model.predict(x_test) > 0.5).astype('int32')
print(f'Classification Report:\n{classification_report(y_pred, y_test)}')

167/167 [==============================] - 7s 37ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.83      0.83      2991
           1       0.78      0.78      0.78      2351

    accuracy                           0.81      5342
   macro avg       0.80      0.80      0.80      5342
weighted avg       0.81      0.81      0.81      5342

In [ ]:

print('--'*30); print('Confusion Matrix')
cm = confusion_matrix(y_test, y_pred)
cm = pd.DataFrame(cm , index = ['Non-sarcastic', 'Sarcastic'] , columns = ['Non-sarcastic','Sarcastic'])
display(cm); print('--'*30)

plt.figure(figsize = (8, 5))
_ = sns.heatmap(cm, cmap= 'Blues', linecolor = 'black' , linewidth = 1 , annot = True,
            fmt = '' , xticklabels = ['Non-sarcastic', 'Sarcastic'],
            yticklabels = ['Non-sarcastic', 'Sarcastic']).set_title('Confusion Matrix')

------------------------------------------------------------
Confusion Matrix

	Non-sarcastic	Sarcastic
Non-sarcastic	2480	516
Sarcastic	511	1835

------------------------------------------------------------

In [ ]:

print('Evaluate model on sample sarcastic lines'); print('--'*30)
statements = ['Are you always so stupid or is today a special ocassion?', #Sarcasm
              'I feel so miserable without you, it\'s almost like having you here.', #Sarcasm
              'If you find me offensive. Then I suggest you quit finding me.', #Sarcasm
              'If I wanted to kill myself I would climb your ego and jump to your IQ.', #Sarcasm
              'Amphibious pitcher makes debut', #Sarcasm
              'It\'s okay if you don\'t like me. Not everyone has good taste.' #Sarcasm
              ]
# Process the statements
for statement in statements:
  statement = denoise_text(statement)

  headline = tokenizer.texts_to_sequences(statement)
  headline = pad_sequences(headline, maxlen = maxlen, dtype = 'int32', value = 0)

  sentiment = (model.predict(headline) > 0.5).astype('int32')
  if(np.argmax(sentiment) == 0):
      print(f'`{statement}` is a Non-sarcastic statement.')
  elif (np.argmax(sentiment) == 1):
      print(f'`{statement}` is a Sarcastic statement.')

Evaluate model on sample sarcastic lines
------------------------------------------------------------
2/2 [==============================] - 0s 21ms/step
`always stupid today special ocassion?` is a Sarcastic statement.
2/2 [==============================] - 0s 46ms/step
`feel miserable without you, almost like here.` is a Sarcastic statement.
2/2 [==============================] - 0s 33ms/step
2/2 [==============================] - 0s 19ms/step
`wanted kill would climb ego jump IQ.` is a Non-sarcastic statement.
1/1 [==============================] - 0s 141ms/step
2/2 [==============================] - 0s 23ms/step

The model achieved an accuracy of 81.0%.