Electronics and Telecommunication¶
- Jayabharathi Hari(https://www.jayabharathi-hari.com/)¶
DataSet:
Dataset : Singal Processing Dataset .• CONTEXT: A communications equipment manufacturing company has a product which is responsible for emitting informative signals. Company wants to build a machine learning model which can help the company to predict the equipment’s signal quality using various parameters.
• DATA DESCRIPTION: The data set contains information on various signal tests performed:
- Parameters: Various measurable signal parameters.
- Signal_Quality: Final signal strength or quality
• PROJECT OBJECTIVE: To build a classifier which can use the given parameters to determine the signal strength or quality.
1. Exploratory data analysis
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from tensorflow import keras
%matplotlib inline
import tensorflow as tf
tf.__version__
'2.13.0'
import h5py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix
from keras import backend as k
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense,Dropout,ReLU,Reshape,BatchNormalization
from tensorflow.keras import regularizers, optimizers
from sklearn.metrics import r2_score
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import ModelCheckpoint
import math
# intialize random no: generator
import random
random.seed(7)
import warnings
warnings.filterwarnings("ignore")
df= pd.read_csv('/content/drive/MyDrive/GL_files/GL_Dataset/Signal.csv')
df.sample(4)
print("Shape:Columns&Rows",df.shape)
print("\nSize:",df.size)
print("\nUnique Values in Signal Strength:",df['Signal_Strength'].unique())
Shape:Columns&Rows (1599, 12) Size: 19188 Unique Values in Signal Strength: [5 6 7 4 8 3]
#Since it has only 6 classes converting the values from 0 to 5 inplace of 3 to 8
to_replace = {3:0,4:1,5:2,6:3,7:4,8:5}
df=df.replace({'Signal_Strength': to_replace})
df['Signal_Strength'].unique()
array([2, 3, 4, 1, 5, 0])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1599 entries, 0 to 1598 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Parameter 1 1599 non-null float64 1 Parameter 2 1599 non-null float64 2 Parameter 3 1599 non-null float64 3 Parameter 4 1599 non-null float64 4 Parameter 5 1599 non-null float64 5 Parameter 6 1599 non-null float64 6 Parameter 7 1599 non-null float64 7 Parameter 8 1599 non-null float64 8 Parameter 9 1599 non-null float64 9 Parameter 10 1599 non-null float64 10 Parameter 11 1599 non-null float64 11 Signal_Strength 1599 non-null int64 dtypes: float64(11), int64(1) memory usage: 150.0 KB
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Parameter 1 | 1599.0 | 8.319637 | 1.741096 | 4.60000 | 7.1000 | 7.90000 | 9.200000 | 15.90000 |
Parameter 2 | 1599.0 | 0.527821 | 0.179060 | 0.12000 | 0.3900 | 0.52000 | 0.640000 | 1.58000 |
Parameter 3 | 1599.0 | 0.270976 | 0.194801 | 0.00000 | 0.0900 | 0.26000 | 0.420000 | 1.00000 |
Parameter 4 | 1599.0 | 2.538806 | 1.409928 | 0.90000 | 1.9000 | 2.20000 | 2.600000 | 15.50000 |
Parameter 5 | 1599.0 | 0.087467 | 0.047065 | 0.01200 | 0.0700 | 0.07900 | 0.090000 | 0.61100 |
Parameter 6 | 1599.0 | 15.874922 | 10.460157 | 1.00000 | 7.0000 | 14.00000 | 21.000000 | 72.00000 |
Parameter 7 | 1599.0 | 46.467792 | 32.895324 | 6.00000 | 22.0000 | 38.00000 | 62.000000 | 289.00000 |
Parameter 8 | 1599.0 | 0.996747 | 0.001887 | 0.99007 | 0.9956 | 0.99675 | 0.997835 | 1.00369 |
Parameter 9 | 1599.0 | 3.311113 | 0.154386 | 2.74000 | 3.2100 | 3.31000 | 3.400000 | 4.01000 |
Parameter 10 | 1599.0 | 0.658149 | 0.169507 | 0.33000 | 0.5500 | 0.62000 | 0.730000 | 2.00000 |
Parameter 11 | 1599.0 | 10.422983 | 1.065668 | 8.40000 | 9.5000 | 10.20000 | 11.100000 | 14.90000 |
Signal_Strength | 1599.0 | 2.636023 | 0.807569 | 0.00000 | 2.0000 | 3.00000 | 3.000000 | 5.00000 |
df['Signal_Strength'].value_counts().sort_values()
0 10 5 18 1 53 4 199 3 638 2 681 Name: Signal_Strength, dtype: int64
duplicates = df.duplicated()
duplicates
0 False 1 False 2 False 3 False 4 True ... 1594 False 1595 False 1596 True 1597 False 1598 False Length: 1599, dtype: bool
num_duplicates =duplicates.sum()
print(f"No: of duplicate : {num_duplicates}")
No: of duplicate : 240
df.drop_duplicates(inplace=True)
duplicates=df.duplicated()
duplicates
0 False 1 False 2 False 3 False 5 False ... 1593 False 1594 False 1595 False 1597 False 1598 False Length: 1359, dtype: bool
num_duplicates_postimputation =duplicates.sum()
print(f"No: of duplicate : {num_duplicates_postimputation}") #We have removed the duplicates
No: of duplicate : 0
df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Parameter 1 | 1359.0 | 8.310596 | 1.736990 | 4.60000 | 7.1000 | 7.9000 | 9.20000 | 15.90000 |
Parameter 2 | 1359.0 | 0.529478 | 0.183031 | 0.12000 | 0.3900 | 0.5200 | 0.64000 | 1.58000 |
Parameter 3 | 1359.0 | 0.272333 | 0.195537 | 0.00000 | 0.0900 | 0.2600 | 0.43000 | 1.00000 |
Parameter 4 | 1359.0 | 2.523400 | 1.352314 | 0.90000 | 1.9000 | 2.2000 | 2.60000 | 15.50000 |
Parameter 5 | 1359.0 | 0.088124 | 0.049377 | 0.01200 | 0.0700 | 0.0790 | 0.09100 | 0.61100 |
Parameter 6 | 1359.0 | 15.893304 | 10.447270 | 1.00000 | 7.0000 | 14.0000 | 21.00000 | 72.00000 |
Parameter 7 | 1359.0 | 46.825975 | 33.408946 | 6.00000 | 22.0000 | 38.0000 | 63.00000 | 289.00000 |
Parameter 8 | 1359.0 | 0.996709 | 0.001869 | 0.99007 | 0.9956 | 0.9967 | 0.99782 | 1.00369 |
Parameter 9 | 1359.0 | 3.309787 | 0.155036 | 2.74000 | 3.2100 | 3.3100 | 3.40000 | 4.01000 |
Parameter 10 | 1359.0 | 0.658705 | 0.170667 | 0.33000 | 0.5500 | 0.6200 | 0.73000 | 2.00000 |
Parameter 11 | 1359.0 | 10.432315 | 1.082065 | 8.40000 | 9.5000 | 10.2000 | 11.10000 | 14.90000 |
Signal_Strength | 1359.0 | 2.623252 | 0.823578 | 0.00000 | 2.0000 | 3.0000 | 3.00000 | 5.00000 |
Correlation=df.corr()
Correlation
Parameter 1 | Parameter 2 | Parameter 3 | Parameter 4 | Parameter 5 | Parameter 6 | Parameter 7 | Parameter 8 | Parameter 9 | Parameter 10 | Parameter 11 | Signal_Strength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Parameter 1 | 1.000000 | -0.255124 | 0.667437 | 0.111025 | 0.085886 | -0.140580 | -0.103777 | 0.670195 | -0.686685 | 0.190269 | -0.061596 | 0.119024 |
Parameter 2 | -0.255124 | 1.000000 | -0.551248 | -0.002449 | 0.055154 | -0.020945 | 0.071701 | 0.023943 | 0.247111 | -0.256948 | -0.197812 | -0.395214 |
Parameter 3 | 0.667437 | -0.551248 | 1.000000 | 0.143892 | 0.210195 | -0.048004 | 0.047358 | 0.357962 | -0.550310 | 0.326062 | 0.105108 | 0.228057 |
Parameter 4 | 0.111025 | -0.002449 | 0.143892 | 1.000000 | 0.026656 | 0.160527 | 0.201038 | 0.324522 | -0.083143 | -0.011837 | 0.063281 | 0.013640 |
Parameter 5 | 0.085886 | 0.055154 | 0.210195 | 0.026656 | 1.000000 | 0.000749 | 0.045773 | 0.193592 | -0.270893 | 0.394557 | -0.223824 | -0.130988 |
Parameter 6 | -0.140580 | -0.020945 | -0.048004 | 0.160527 | 0.000749 | 1.000000 | 0.667246 | -0.018071 | 0.056631 | 0.054126 | -0.080125 | -0.050463 |
Parameter 7 | -0.103777 | 0.071701 | 0.047358 | 0.201038 | 0.045773 | 0.667246 | 1.000000 | 0.078141 | -0.079257 | 0.035291 | -0.217829 | -0.177855 |
Parameter 8 | 0.670195 | 0.023943 | 0.357962 | 0.324522 | 0.193592 | -0.018071 | 0.078141 | 1.000000 | -0.355617 | 0.146036 | -0.504995 | -0.184252 |
Parameter 9 | -0.686685 | 0.247111 | -0.550310 | -0.083143 | -0.270893 | 0.056631 | -0.079257 | -0.355617 | 1.000000 | -0.214134 | 0.213418 | -0.055245 |
Parameter 10 | 0.190269 | -0.256948 | 0.326062 | -0.011837 | 0.394557 | 0.054126 | 0.035291 | 0.146036 | -0.214134 | 1.000000 | 0.091621 | 0.248835 |
Parameter 11 | -0.061596 | -0.197812 | 0.105108 | 0.063281 | -0.223824 | -0.080125 | -0.217829 | -0.504995 | 0.213418 | 0.091621 | 1.000000 | 0.480343 |
Signal_Strength | 0.119024 | -0.395214 | 0.228057 | 0.013640 | -0.130988 | -0.050463 | -0.177855 | -0.184252 | -0.055245 | 0.248835 | 0.480343 | 1.000000 |
import seaborn as sns
import matplotlib.pyplot as plt
plt.subplots(figsize=(20,8))
sns.heatmap(Correlation)
plt.show()
sns.pairplot(df,diag_kind ='kde')
plt.show()
#Checking on the corelation
df.plot(kind='box')
plt.show()
##Performing univariate, bivariate and multivariate analysis
# Feature Importance
# Independent variables
X=df.drop('Signal_Strength',axis=1)
# Target variable
Y=df['Signal_Strength']
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,Y)
#using inbuilt class "feature_importances" of tree based classifiers
print(model.feature_importances_)
#plotting graph of feature importances
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
#Observation: Most Effective - Parameter 11
[0.07673646 0.0960267 0.08050729 0.07999156 0.07986296 0.07372861 0.10095128 0.084817 0.07564383 0.09979721 0.15193711]
out_df = df.drop('Signal_Strength',axis=1)
out_df.shape
(1359, 11)
# function to treat outliers
def detect_treat_outliers(df,operation):
cols=[]
IQR_list=[]
lower_boundary_list=[]
upper_boundary_list=[]
outliers_count=[]
for col in df.columns:
print('col',col)
if((df[col].dtype =='int64' or df[col].dtype =='float64') and (col != 'HR')):
#print('Inside if')
IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
lower_boundary = df[col].quantile(0.25) - (1.5 * IQR)
upper_boundary = df[col].quantile(0.75) + (1.5 * IQR)
up_cnt = df[df[col]>upper_boundary][col].shape[0]
#print('Upper count=',up_cnt)
lw_cnt = df[df[col]<lower_boundary][col].shape[0]
#print('lower count=',lw_cnt)
if(up_cnt+lw_cnt) > 0:
cols.append(col)
IQR_list.append(IQR)
lower_boundary_list.append(lower_boundary)
upper_boundary_list.append(upper_boundary)
outliers_count.append(up_cnt+lw_cnt)
if operation == 'update':
df.loc[df[col] > upper_boundary,col] = upper_boundary
df.loc[df[col] < lower_boundary,col] = lower_boundary
else:
pass
else:
pass
newdf = pd.DataFrame(list(zip(cols,IQR_list,lower_boundary_list,upper_boundary_list,outliers_count)),columns=['Features','IQR','Lower Boundary','Upper Boundary','Outlier Count'])
#print('Df=',newdf)
#print('Columns having outliers=',cols)
if operation == 'update':
return (len(cols),df)
else:
return (len(cols),newdf)
#Removing outliers by replacing the data below lower and above upper whisker
count,new_out_df=detect_treat_outliers(out_df,'update')
if count>0:
print('Updated dataset')
newdf=df
col Parameter 1 col Parameter 2 col Parameter 3 col Parameter 4 col Parameter 5 col Parameter 6 col Parameter 7 col Parameter 8 col Parameter 9 col Parameter 10 col Parameter 11 Updated dataset
new_out_df.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Parameter 1 | 1359.0 | 8.284069 | 1.658319 | 4.60000 | 7.1000 | 7.9000 | 9.20000 | 12.35000 |
Parameter 2 | 1359.0 | 0.527840 | 0.177262 | 0.12000 | 0.3900 | 0.5200 | 0.64000 | 1.01500 |
Parameter 3 | 1359.0 | 0.272288 | 0.195379 | 0.00000 | 0.0900 | 0.2600 | 0.43000 | 0.94000 |
Parameter 4 | 1359.0 | 2.324099 | 0.607558 | 0.90000 | 1.9000 | 2.2000 | 2.60000 | 3.65000 |
Parameter 5 | 1359.0 | 0.081323 | 0.018486 | 0.03850 | 0.0700 | 0.0790 | 0.09100 | 0.12250 |
Parameter 6 | 1359.0 | 15.714496 | 9.852641 | 1.00000 | 7.0000 | 14.0000 | 21.00000 | 42.00000 |
Parameter 7 | 1359.0 | 46.092715 | 30.877994 | 6.00000 | 22.0000 | 38.0000 | 63.00000 | 124.50000 |
Parameter 8 | 1359.0 | 0.996707 | 0.001798 | 0.99227 | 0.9956 | 0.9967 | 0.99782 | 1.00115 |
Parameter 9 | 1359.0 | 3.308889 | 0.149982 | 2.92500 | 3.2100 | 3.3100 | 3.40000 | 3.68500 |
Parameter 10 | 1359.0 | 0.649963 | 0.137403 | 0.33000 | 0.5500 | 0.6200 | 0.73000 | 1.00000 |
Parameter 11 | 1359.0 | 10.428734 | 1.070647 | 8.40000 | 9.5000 | 10.2000 | 11.10000 | 13.50000 |
- Df has 1599 rows and 12 columns
- All parameters are floating point and
signal strength is an integer
3. STD higher for parmeter 7 (32.8953)and
lower for parameter
4. There are no null values in the data
5. Duplicates were there and removed
6. there is high correlation between Parameter 6 and 7 and have no correlation with other parameters.
7. Parameter 3 ranges between 0 and 1.
8. Maximum value of Parameter 5 is 0.6
9. Removed the Outliers.
10. Parameter 1 is positively correlated to Parameter 3 and Parameter 8 and negatively correlated to Parameter 2 and Parameter 9. Parameter 4 has very low correlation with other Parameters.
2. Data preprocessing¶
#Split the data into X & Y.
X = new_out_df
y = df['Signal_Strength']
#Split the data into train & test with 70:30 proportion.[
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=4)
#shape of all the 4 variables
print('Shape of X_train', X_train.shape)
print('Shape of X_test', X_test.shape)
print('Shape of y_test', y_test.shape)
print('Shape of y_train', y_train.shape)
Shape of X_train (951, 11) Shape of X_test (408, 11) Shape of y_test (408,) Shape of y_train (951,)
#Verifying if train and test are in sync
print('Are test and train data in sync?',(X_train.index==y_train.index).all() and (X_test.index== y_test.index).all())
Are test and train data in sync? True
##Normalise the train test
#Scaling training data
X_train = StandardScaler().fit_transform(X_train)
# Scaling testing data
X_test = StandardScaler().fit_transform(X_test)
#Transform Labels into format acceptable by Neural Network / Converting y data into categorical (one-hot encoding)
trainY = tf.keras.utils.to_categorical(y_train,num_classes=6)
testY = tf.keras.utils.to_categorical(y_test,num_classes=6)
Model Training & Evaluation using Neural Network¶
num_features =11
num_classes = 6
#model= tf.keras.models.Sequential()
#creates a sequential model
model = Sequential()
model.add(Dense(11, activation='relu',input_shape = (11,)))
model.add(Dense(11, activation='relu'))
model.add(Dense(6, activation='softmax'))
model.summary()
Model: "sequential_8" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_31 (Dense) (None, 11) 132 dense_32 (Dense) (None, 11) 132 dense_33 (Dense) (None, 6) 72 ================================================================= Total params: 336 (1.31 KB) Trainable params: 336 (1.31 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________
#model.compile(optimizer= 'sgd', loss= 'mse', metrics=['accuracy'])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, trainY,validation_split=.3, epochs=20, batch_size=25)
Epoch 1/20 27/27 [==============================] - 1s 16ms/step - loss: 1.7453 - accuracy: 0.2526 - val_loss: 1.6909 - val_accuracy: 0.2797 Epoch 2/20 27/27 [==============================] - 0s 5ms/step - loss: 1.6151 - accuracy: 0.3098 - val_loss: 1.5860 - val_accuracy: 0.3497 Epoch 3/20 27/27 [==============================] - 0s 5ms/step - loss: 1.5063 - accuracy: 0.3504 - val_loss: 1.4896 - val_accuracy: 0.4231 Epoch 4/20 27/27 [==============================] - 0s 5ms/step - loss: 1.4065 - accuracy: 0.4241 - val_loss: 1.4000 - val_accuracy: 0.4580 Epoch 5/20 27/27 [==============================] - 0s 5ms/step - loss: 1.3157 - accuracy: 0.4692 - val_loss: 1.3231 - val_accuracy: 0.4965 Epoch 6/20 27/27 [==============================] - 0s 6ms/step - loss: 1.2393 - accuracy: 0.5008 - val_loss: 1.2642 - val_accuracy: 0.5245 Epoch 7/20 27/27 [==============================] - 0s 5ms/step - loss: 1.1788 - accuracy: 0.5233 - val_loss: 1.2141 - val_accuracy: 0.5315 Epoch 8/20 27/27 [==============================] - 0s 6ms/step - loss: 1.1315 - accuracy: 0.5383 - val_loss: 1.1742 - val_accuracy: 0.5420 Epoch 9/20 27/27 [==============================] - 0s 6ms/step - loss: 1.0941 - accuracy: 0.5398 - val_loss: 1.1482 - val_accuracy: 0.5385 Epoch 10/20 27/27 [==============================] - 0s 6ms/step - loss: 1.0653 - accuracy: 0.5519 - val_loss: 1.1307 - val_accuracy: 0.5420 Epoch 11/20 27/27 [==============================] - 0s 7ms/step - loss: 1.0434 - accuracy: 0.5714 - val_loss: 1.1177 - val_accuracy: 0.5455 Epoch 12/20 27/27 [==============================] - 0s 5ms/step - loss: 1.0264 - accuracy: 0.5744 - val_loss: 1.1101 - val_accuracy: 0.5420 Epoch 13/20 27/27 [==============================] - 0s 5ms/step - loss: 1.0122 - accuracy: 0.5789 - val_loss: 1.1043 - val_accuracy: 0.5420 Epoch 14/20 27/27 [==============================] - 0s 5ms/step - loss: 1.0012 - accuracy: 0.5759 - val_loss: 1.1009 - val_accuracy: 0.5350 Epoch 15/20 27/27 [==============================] - 0s 7ms/step - loss: 0.9918 - accuracy: 0.5955 - val_loss: 1.0984 - val_accuracy: 0.5350 Epoch 16/20 27/27 [==============================] - 0s 7ms/step - loss: 0.9829 - accuracy: 0.5805 - val_loss: 1.0954 - val_accuracy: 0.5350 Epoch 17/20 27/27 [==============================] - 0s 5ms/step - loss: 0.9749 - accuracy: 0.5820 - val_loss: 1.0936 - val_accuracy: 0.5420 Epoch 18/20 27/27 [==============================] - 0s 5ms/step - loss: 0.9687 - accuracy: 0.5925 - val_loss: 1.0904 - val_accuracy: 0.5420 Epoch 19/20 27/27 [==============================] - 0s 5ms/step - loss: 0.9635 - accuracy: 0.5880 - val_loss: 1.0882 - val_accuracy: 0.5455 Epoch 20/20 27/27 [==============================] - 0s 5ms/step - loss: 0.9567 - accuracy: 0.5925 - val_loss: 1.0861 - val_accuracy: 0.5385
#The model is slightly overfitting since the training accuracy is higher than the validation accuracy
# Evaluate the model
loss, accuracy = model.evaluate(X_test, testY)
print("Test loss:", loss)
print("Test accuracy:", accuracy)
13/13 [==============================] - 0s 2ms/step - loss: 1.0177 - accuracy: 0.5784 Test loss: 1.0176830291748047 Test accuracy: 0.5784313678741455
loss_train = Network_Classifier.history['loss']
loss_val = Network_Classifier.history['val_loss']
epochs = range(1,EPOCH+1)
plt.plot(epochs, loss_train, 'g', label='Training loss')
plt.plot(epochs, loss_val, 'b', label='validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
Acc_train = Network_Classifier.history['accuracy']
Acc_val = Network_Classifier.history['val_accuracy']
epochs = range(1,EPOCH+1)
plt.plot(epochs, Acc_train, 'g', label='Training accuracy')
plt.plot(epochs, Acc_val, 'b', label='validation accuracy')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()