Determining Trade Union Status

This project covers the implementation of several models and the preprocessing of the data, so that we can compare the results and performance of the different models. We applied statistical techniques to see which model performs best. In this project we build a binary classifier that predicts whether or not a data scientist will remain a USDU member.

Reading data for preprocessing

TRAIN.csv
[First five rows of TRAIN.csv — feature columns: gender, Management, USAcitizen, Married, MonthsInUnion, ContinuingEd, FeatureA, Connectivity, FeatureC, FeatureD, FeatureE, FeatureF, FeatureG, FeatureB, DuesFrequency, PaperlessBilling, PaymentMethod, MonthlyDues, TotalDues]
Test_Data
[First five rows of Test_Data, indexed by DS_ID (10000–10004), with the same feature columns as TRAIN.csv]
TRAIN.csv “LeftUnion”
   LeftUnion
0         No
1         No
2         No
3        Yes
4        Yes
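
A minimal sketch of how these previews could have been produced; the file names and variable names below are assumptions, since the loading code itself is not shown.

import pandas as pd

df = pd.read_csv('TRAIN.csv')                                    # labelled training data (name assumed)
df_test_data = pd.read_csv('Test_Data.csv', index_col='DS_ID')   # unlabelled test data (name assumed)

print(df.drop(columns=['LeftUnion']).head())   # feature preview
print(df_test_data.head())
print(df[['LeftUnion']].head())                # target preview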

Train and Test Split

Now we do a train and test split of the data. This involves importing a function from the scikit-learn library which performs the task very easily. We apply all preprocessing to the training data but not to the test data; otherwise our model becomes prone to data leakage and will perform worse in production when new data arrives.
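
A minimal sketch of the split, assuming the label column is LeftUnion; the test_size and random_state values are assumptions chosen to roughly match the 669/330 split seen later.

from sklearn.model_selection import train_test_split

X = df.drop(columns=['LeftUnion'])
y = df['LeftUnion']

# test_size=0.33 approximates the 669/330 split; random_state is assumed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)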

[First five rows of X_train — the same 19 feature columns as above]
y_train
   LeftUnion
0         No
1        Yes
2        Yes
3         No
4        Yes

Merge Data by Label

X_train and y_train
[First five rows of X_train joined with y_train — 19 feature columns plus LeftUnion]
X_test and y_test
[First five rows of X_test joined with y_test — 19 feature columns plus LeftUnion]
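
A minimal sketch of how these merged previews could be built, assuming the split above produced pandas objects with matching indices.

# Re-attach the label to each split so the frames can be inspected together.
train_merged = pd.concat([X_train, y_train], axis=1)
test_merged = pd.concat([X_test, y_test], axis=1)
print(train_merged.head())
print(test_merged.head())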

NaN values Check

Checking for NaN values in the dataset, column by column, because NaN values have to be removed before fitting the ML model to the data. For this purpose we create a function named check_nan(), which takes a DataFrame as an argument and outputs the number of NaN values in each column.
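
The function body is not included here; a minimal sketch of what check_nan() might look like, applied to the three frames whose outputs follow (implementation details are assumptions).

def check_nan(frame):
    # Count NaN values per column and report them.
    nan_counts = frame.isna().sum()
    print(nan_counts)
    return nan_counts

check_nan(train_merged)    # training split (includes LeftUnion)
check_nan(test_merged)     # held-out test split (includes LeftUnion)
check_nan(df_test_data)    # unlabelled Test_Data (no LeftUnion column)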

gender              0
Management          0
USAcitizen          0
Married             0
MonthsInUnion       0
ContinuingEd        0
FeatureA            0
Connectivity        0
FeatureC            0
FeatureD            0
FeatureE            0
FeatureF            0
FeatureG            0
FeatureB            0
DuesFrequency       0
PaperlessBilling    0
PaymentMethod       0
MonthlyDues         0
TotalDues           0
LeftUnion           0
dtype: int64

gender              0
Management          0
USAcitizen          0
Married             0
MonthsInUnion       0
ContinuingEd        0
FeatureA            0
Connectivity        0
FeatureC            0
FeatureD            0
FeatureE            0
FeatureF            0
FeatureG            0
FeatureB            0
DuesFrequency       0
PaperlessBilling    0
PaymentMethod       0
MonthlyDues         0
TotalDues           0
LeftUnion           0
dtype: int64

gender              0
Management          0
USAcitizen          0
Married             0
MonthsInUnion       0
ContinuingEd        0
FeatureA            0
Connectivity        0
FeatureC            0
FeatureD            0
FeatureE            0
FeatureF            0
FeatureG            0
FeatureB            0
DuesFrequency       0
PaperlessBilling    0
PaymentMethod       0
MonthlyDues         0
TotalDues           0
dtype: int64

Counting unique values:

Here we count the unique values for every column in the dataset. For this purpose we again created a function, named count_unique(), which takes a DataFrame column as input.
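
The helper itself is not shown; a plausible sketch of count_unique() and of the binary columns checked below (details are assumptions).

def count_unique(column):
    # Print and return the value counts of a single column.
    counts = column.value_counts()
    print(counts)
    return counts

for col in ['USAcitizen', 'Married', 'ContinuingEd', 'PaperlessBilling']:
    count_unique(train_merged[col])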

Checking dataset columns
Yes    338
No     331
Name: USAcitizen, dtype: int64
Binary unique values
No     464
Yes    205
Name: Married, dtype: int64
Yes    602
No      67
Name: ContinuingEd, dtype: int64
Yes    379
No     290
Name: PaperlessBilling, dtype: int64

Encoding

Here we also encode our binary categorical values (Yes/No, Male/Female) as 0/1 so that the machine learning model does not raise errors while fitting on the data.
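
A minimal sketch of this binary encoding, assuming a simple replacement map; the column list is inferred from the preview that follows, and the original helper is not shown.

# Map binary categories to integers (Male/Yes -> 1, Female/No -> 0, matching the preview below).
binary_map = {'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0}
binary_cols = ['gender', 'USAcitizen', 'Married', 'ContinuingEd', 'PaperlessBilling']

df_encoded = train_merged.copy()
df_encoded[binary_cols] = df_encoded[binary_cols].replace(binary_map)
print(df_encoded.head())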

[Training-set preview after binary encoding — gender, Management, USAcitizen, Married, ContinuingEd and PaperlessBilling are now 0/1, while FeatureA–G, Connectivity, DuesFrequency and PaymentMethod are still strings]

Plotting Histogram

Below we use matplotlib to plot histograms, which show the frequency distribution of the values inside a column or feature. Each column is a distinct feature for our model. As the output shows, these features have three labels: Yes, No and Maryville. We plot FeatureA and FeatureB.
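
A minimal sketch of the two plots, drawn as bar charts of the category counts (the exact plotting calls used originally are not shown).

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ['FeatureA', 'FeatureB']):
    df_encoded[col].value_counts().plot(kind='bar', ax=ax, title=col)
    ax.set_ylabel('count')
plt.tight_layout()
plt.show()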

[Figure: histograms of FeatureA and FeatureB]

Non binary unique values
No           326
Yes          276
Maryville     67
Name: FeatureA, dtype: int64
No           270
Yes          257
Maryville    142
Name: FeatureB, dtype: int64
No           308
Yes          219
Maryville    142
Name: FeatureC, dtype: int64
No           290
Yes          237
Maryville    142
Name: FeatureD, dtype: int64
No           285
Yes          242
Maryville    142
Name: FeatureE, dtype: int64
No           336
Yes          191
Maryville    142
Name: FeatureF, dtype: int64
Yes          274
No           253
Maryville    142
Name: FeatureG, dtype: int64
Fiber optic    300
DSL            227
other           92
Dial-in         50
Name: Connectivity, dtype: int64
Month-to-month    355
Two year          160
One year          154
Name: DuesFrequency, dtype: int64
Electronic check             231
Mailed check                 157
Credit card (automatic)      145
Bank transfer (automatic)    136
Name: PaymentMethod, dtype: int64

One-hot Encoding:

We apply one-hot encoding to the columns that contain non-binary values. One-hot encoding turns each category into its own 0/1 indicator column (e.g. 0000001). We use it to convert the categorical feature columns into numeric columns so that the model can learn from them easily. For this purpose we created a function named encode_nb() which takes three arguments: the DataFrame, the column name, and the prefix we want in the name of every new column.
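
The function body is not included here; a plausible sketch using pandas get_dummies, with the prefixes inferred from the column names in the preview below (implementation details are assumptions).

def encode_nb(frame, column, prefix):
    # One-hot encode a non-binary column, drop the original column, and return the new frame.
    dummies = pd.get_dummies(frame[column], prefix=prefix)
    return pd.concat([frame.drop(columns=[column]), dummies], axis=1)

df3 = df_encoded.copy()
for col, pref in [('FeatureA', 'A'), ('FeatureB', 'B'), ('FeatureC', 'C'),
                  ('FeatureD', 'D'), ('FeatureE', 'E'), ('FeatureF', 'F'),
                  ('FeatureG', 'G'), ('Connectivity', 'conn'),
                  ('DuesFrequency', 'dues_F'), ('PaymentMethod', 'pay_M')]:
    df3 = encode_nb(df3, col, pref)
print(df3.head())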

[Training-set preview after one-hot encoding — the binary columns plus indicator columns such as A_Yes, conn_Fiber optic, dues_F_Month-to-month and pay_M_Electronic check]

5 rows × 42 columns

NaN values in every column in the dataset
NaN values in every column in the test set
NaN values in every row:
0      0
1      0
2      0
3      0
4      0
      ..
664    0
665    0
666    0
667    0
668    0
Length: 669, dtype: int64

Plotting and Visualization

Box and whisker plot:

We draw a box-and-whisker plot to check for outliers in the data. Outliers are simply unwanted values that can introduce bias if they are not removed. We use the seaborn library for the box-and-whisker plot. Below we also check the number of unique values for the MonthlyDues and TotalDues features.
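
A minimal sketch of the plot, assuming the one-hot-encoded frame df3 from above; at this point MonthlyDues is numeric while TotalDues is still stored as strings.

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df3['MonthlyDues'])   # box-and-whisker plot to spot outliers
plt.show()

print(df3['MonthlyDues'].unique())
print(df3['TotalDues'].unique())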

Detecting outliers:

[Figure: box-and-whisker plot of MonthlyDues]

There are certainly outliers. Unique MonthlyDues:

array([   70,    54,    74,    86,    45,    90,    25,    20,    75,
          94,    60,    79,   111,   100,    69,    85,    76,   104,
         103,    71,    46,    84,   114,    78,   110,   106,    89,
          95,    81,    97,    44,    55,    80,   115,    91,    50,
         109,   108,   112,    67,    56,    53,    24,    64,    99,
          77,    29,    31,   116,   101,    88,    72,    19,    83,
          36,    30,    92,    93,    41,   105,    82,    66,    49,
          26,    21,    58,    98,    51,    68,    96,    34,   113,
          73,   102, 10878,   107,    48,    40, 15453,    61,    65,
          87,    35,    18,    62, 10938,    57,   119,    42])

Unique TotalDues

array(['144', '834', '545', '2147', '75', '145', '248', '25', '952',
       '1129', '1608', '3036', '171', '5565', '70', '7512', '5201', '69',
       '1350', '152', '3467', '4108', '20', '5538', '1975', '1993', '46',
       '5982', '7939', '2840', '68', '855', '1832', '7535', '3650',
       '4513', '2258', '7041', '4614', '3106', '400', '303', '5879',
       '143', '2684', '52', '2018', '573', '563', '2861', '5657', '457',
       '93', '4246', '2614', '4307', '605', '320', '271', '7334', '169',
       '311', '2920', '267', '6938', '470', '7931', '4915', '369', '7796',
       '832', '5000', '2387', '202', '1150', '1208', '1733', '863',
       '1391', '5648', '906', '6442', '3369', '1464', '2708', '2866',
       '8004', '1204', '302', '73', '3632', '196', '3777', '1759', '265',
       '227', '926', '7159', '8425', '4113', '220', '6521', '3173', '19',
       '5213', '1799', '831', '#VALUE!', '261', '2296', '2352', '244',
       '6414', '1169', '476', '7509', '1929', '4698', '1648', '1009',
       '4179', '321', '481', '6083', '1134', '2549', '3211', '255',
       '1381', '3230', '454', '3674', '463', '5013', '3415', '988', '339',
       '4052', '590', '7298', '4965', '6683', '7083', '429', '1212',
       '163', '372', '4689', '621', '5064', '4641', '161', '1314', '1017',
       '3822', '119', '530', '1291', '78', '1521', '1306', '6633', '1238',
       '368', '5031', '29', '1206', '8405', '1527', '81', '134', '744',
       '7554', '3942', '256', '5038', '425', '1395', '1601', '6069',
       '1368', '5731', '1126', '2897', '2821', '7752', '801', '4520',
       '5914', '1171', '1269', '3773', '2931', '2570', '4117', '6591',
       '129', '541', '2263', '3545', '296', '1682', '3858', '1346', '111',
       '451', '3029', '498', '4925', '2245', '6373', '232', '4805',
       '7904', '80', '3883', '1741', '1523', '6735', '2094', '5222',
       '204', '4946', '1597', '2964', '5045', '2398', '1994', '7406',
       '867', '42', '3605', '857', '59', '2110', '245', '865', '4017',
       '1626', '1664', '406', '899', '7031', '5958', '37', '1475', '1264',
       '6046', '219', '5481', '5154', '6393', '6510', '738', '1862',
       '1734', '773', '1125', '4484', '6945', '4009', '1399', '3379',
       '1284', '5770', '332', '2910', '250', '36', '5318', '3510', '7099',
       '4054', '5436', '2388', '1752', '1724', '4304', '464', '1622',
       '756', '3626', '4738', '1410', '434', '5175', '4751', '2509',
       '382', '43', '1629', '4349', '8110', '5681', '92', '1173', '5151',
       '390', '3870', '4542', '7062', '4084', '4754', '94', '6230',
       '4684', '3243', '297', '35', '71', '6980', '564', '48', '158',
       '4665', '617', '294', '453', '5265', '7895', '5826', '947', '5937',
       '44', '6688', '3638', '2724', '79', '7962', '403', '5818', '4056',
       '535', '30', '34', '2239', '4859', '3205', '2443', '3266', '579',
       '7726', '1398', '3067', '5720', '1743', '4992', '6589', '279',
       '6558', '2511', '1461', '1802', '299', '341', '2596', '1725', '74',
       '503', '4677', '696', '91', '522', '1821', '1818', '24', '497',
       '243', '132', '7338', '1320', '32', '663', '3442', '1188', '5043',
       '2586', '1425', '2453', '1531', '123', '1505', '897', '2169',
       '815', '542', '1718', '2722', '6033', '2235', '230', '8313', '521',
       '3046', '235', '8093', '1146', '4817', '679', '3682', '2680',
       '6126', '280', '3646', '1327', '5981', '7807', '659', '5011',
       '1441', '6056', '247', '4109', '1559', '967', '1428', '200',
       '6741', '1139', '334', '2313', '4370', '4854', '6142', '990',
       '483', '1463', '1557', '1580', '5986', '2200', '423', '6669',
       '866', '1813', '2413', '4973', '4708', '229', '6841', '307', '435',
       '3976', '3872', '2095', '858', '7267', '4131', '5684', '3133',
       '1375', '3474', '3724', '2275', '5941', '511', '5459', '115',
       '566', '3077', '1013', '1157', '7887', '3166', '387', '779',
       '6383', '1715', '317', '6431', '7016', '5215', '875', '4429',
       '3140', '4145', '1075', '1777', '1554', '989', '2975', '3090',
       '2656', '1653', '284', '415', '167', '1209', '344', '306', '238',
       '5315', '6654', '593', '1530', '39', '3268', '7366', '718', '798',
       '95', '49', '1874', '3606', '139', '3572', '2188', '1863', '195',
       '2754', '1034', '135', '21', '5515', '99', '835', '4507', '330',
       '203', '1424', '5310', '449', '1415', '4132', '1172', '4733',
       '107', '4158', '90', '346', '6283', '5501', '388', '1274', '7139',
       '474', '1305', '55', '1672', '6300', '618', '327', '772', '1210',
       '2348', '3188', '1939', '1131', '2847', '946', '54', '861', '1199',
       '3948', '995', '51', '1111', '1380', '4652', '273', '5484', '915',
       '4424', '3635', '516', '6707', '3994', '587', '940', '4539', '66',
       '4457', '2697', '6293', '4448', '7550', '819', '904', '389',
       '6302', '5254', '1600', '147', '3326', '6362', '4905', '680',
       '700', '40', '754', '1187', '2013', '1435', '2641', '6081', '540',
       '205', '5163', '153', '1716', '4220', '292', '6585', '392'],
      dtype=object)

Converting the TotalDues column in the training and test sets from strings to integers/floats. Entries that cannot be parsed as numbers (such as '#VALUE!') become NaN during the conversion.
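
A minimal sketch of the conversion with pandas, using errors='coerce' so unparseable strings become NaN (the exact call is not shown; df3, test_set and df_test are the training, held-out test and Test_Data frames used in the fill step further down).

for frame in (df3, test_set, df_test):
    frame['TotalDues'] = pd.to_numeric(frame['TotalDues'], errors='coerce')

print(df3['TotalDues'].head())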

0       144.0
1       834.0
2       545.0
3      2147.0
4        75.0
        ...  
664     292.0
665    6585.0
666      74.0
667    1327.0
668     392.0
Name: TotalDues, Length: 669, dtype: float64

0      1110
1      2551
2        78
3      5594
4       140
       ... 
325    4495
326    4534
327     443
328      44
329    6474
Name: TotalDues, Length: 330, dtype: int64

0         30.0
1       1890.0
2        108.0
3       1841.0
4        152.0
         ...  
4995     553.0
4996    3496.0
4997      94.0
4998    7053.0
4999     302.0
Name: TotalDues, Length: 5000, dtype: float64

Check NaN for Specific Columns:

Checking for the rows that contain NaN values. NaN values must be removed or filled before fitting the model, otherwise the code will throw an error. (The outliers are removed separately, by providing a threshold value for the column so that the outlier rows are dropped; see the outlier-removal step further down.) Below we print the DataFrame row that contains a NaN value, and then take the mean of that column to fill the NaN.

Checking NaN counts for the training set, the held-out test set and the unlabelled Test_Data:

Number of NaN values in training set: 1
Number of NaN values in test set: 0
Number of NaN values in Test_Data: 8

Finding the row that contains the NaN value:

[Row 112 of the training set, in which TotalDues is NaN]

1 row × 42 columns

Filling NaN values in the TotalDues column

df3['TotalDues'] = round(df3['TotalDues'].fillna((df3['TotalDues'].mean())),0)
test_set['TotalDues'] = round(test_set['TotalDues'].fillna((test_set['TotalDues'].mean())),0)
df_test['TotalDues'] = round(df_test['TotalDues'].fillna((df_test['TotalDues'].mean())),0)

Checking NaN values again

df3["TotalDues"].isna().sum(axis = 0) 
0

Box Plot:

Plotting a box plot to check for outliers in the other columns. As we can see, there is no outlier left in our data; we removed the outliers previously. We could also plot a scatter plot to detect outliers.

[Figure: box plot of the remaining columns]

As we can see, there are no outliers in this data.

Scatter Plot:

Checking for outliers again, but this time with a scatter plot. Here we found three outliers in TotalDues, and we again removed them by taking the mean of the available values. There are certainly other ways, but this works best for our problem.

[Figure: scatter plot showing three outliers in TotalDues]

Removing Outlier:

Here we remove the outliers by simply providing a threshold value. The values above that threshold are removed, and the values below it are kept in our DataFrame and used as input later on.
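
A minimal sketch of threshold-based removal. The column and the threshold value below are assumptions; the extreme MonthlyDues values seen earlier (10878, 10938, 15453) are the kind of rows such a rule would drop, but the actual cut-off used originally is not shown.

THRESHOLD = 9000                                    # assumed cut-off
df3 = df3[df3['MonthlyDues'] < THRESHOLD].copy()    # keep only rows below the threshold
print(df3.head())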

[Training-set preview after outlier removal — the same 42 columns as before, with TotalDues now a float column]

5 rows × 42 columns

Scatter and Violin Plot:

We plot scatter plots again to confirm that the outliers have been removed, and as we can see the values look good now. Below we also plot a violin plot, which shows the density of the distribution of the values in our data.

[Figure: scatter plot and violin plot after outlier removal]

Scree plot:

Below we plot a scree plot for the MonthlyDues column to see how its values are distributed. It is another way of visualizing the data; we use the matplotlib library for the scree plot.

[Figure: scree plot of MonthlyDues]

Bivariate plot:

Below we plot a bivariate plot of MonthlyDues against MonthsInUnion to see how the values of the two columns relate to each other.

[Figure: bivariate plot of MonthlyDues vs. MonthsInUnion]

Normalization

After plotting, we rescale our numeric columns. The code below uses StandardScaler, which standardizes each column to zero mean and unit variance (normalization in the strict 0–1 sense would instead use a min–max scaler).

   MonthsInUnion  MonthlyDues  TotalDues
0              2           70      144.0
1             16           54      834.0
2              7           74      545.0
3             26           86     2147.0
4              2           45       75.0
sc = StandardScaler()
df_train_new_num = sc.fit_transform(df_train_new_num)
(np.mean(df_train_new_num), np.std(df_train_new_num))
(7.1125398974985e-18, 1.0)
[Preview of the categorical part of the training frame (numeric columns dropped) — the binary and one-hot columns plus LeftUnion]

5 rows × 39 columns

   MonthsInUnion  MonthlyDues  TotalDues
0      -1.246752     0.151339  -0.960675
1      -0.681128    -0.376021  -0.664369
2      -1.044743     0.283179  -0.788474
3      -0.277110     0.678699  -0.100527
4      -1.246752    -0.672661  -0.990306
[Preview of the recombined training frame — standardized numeric columns followed by the categorical/one-hot columns and LeftUnion]

5 rows × 42 columns

   MonthsInUnion  MonthlyDues  TotalDues
0              1           30       30.0
1             34           57     1890.0
2              2           54      108.0
3             45           42     1841.0
4              2           71      152.0
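# Note: fit_transform below refits the scaler on the test data; reusing the scaler fitted
# on the training data (sc.transform(df_test_new_num)) would avoid test-set leakage,
# in line with the leakage concern mentioned earlier.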
df_test_new_num = sc.fit_transform(df_test_new_num)
(np.mean(df_test_new_num), np.std(df_test_new_num))
(6.963318810448982e-17, 1.0)
df_test_new_cat = df_test_new.drop(['MonthsInUnion','MonthlyDues','TotalDues'] , axis = 1)
df_test_new_cat.head()
[Preview of the categorical part of the test frame — the same columns as the training version, without LeftUnion]

5 rows × 38 columns

df_test_new_num = pd.DataFrame(df_test_new_num, columns = ['MonthsInUnion','MonthlyDues','TotalDues'])  
df_test_new_num.head()
   MonthsInUnion  MonthlyDues  TotalDues
0      -1.268931    -1.154574  -0.990430
1       0.070734    -0.258882  -0.169917
2      -1.228335    -0.358403  -0.956021
3       0.517289    -0.756488  -0.191533
4      -1.228335     0.205551  -0.936611
df_test_final = pd.concat([df_test_new_num, df_test_new_cat], axis = 1)
df_test_final.head()
[Preview of the recombined test frame — standardized numeric columns followed by the categorical/one-hot columns]

5 rows × 41 columns

Perform a PCA

Separating Features and Labels:

Here we separate the train and test data from their labels so that we can perform training. We use the drop method to remove the label column from the DataFrame; the same process is applied to both the train and the test DataFrame.

X_train = df_train_final.drop(['LeftUnion'], axis = 1)
y_train = df_train_final['LeftUnion']
componentsWanted = len(X_train.columns)
print(f'Components wanted = {componentsWanted}')
componentList = ['component'+ str(n) for n in range(componentsWanted)]
Components wanted = 41
X_train.isnull().sum()
MonthsInUnion                      3
MonthlyDues                        3
TotalDues                          3
gender                             3
Management                         3
USAcitizen                         3
Married                            3
ContinuingEd                       3
PaperlessBilling                   3
A_Maryville                        3
A_No                               3
A_Yes                              3
B_Maryville                        3
B_No                               3
B_Yes                              3
C_Maryville                        3
C_No                               3
C_Yes                              3
D_Maryville                        3
D_No                               3
D_Yes                              3
E_Maryville                        3
E_No                               3
E_Yes                              3
F_Maryville                        3
F_No                               3
F_Yes                              3
G_Maryville                        3
G_No                               3
G_Yes                              3
conn_DSL                           3
conn_Dial-in                       3
conn_Fiber optic                   3
conn_other                         3
dues_F_Month-to-month              3
dues_F_One year                    3
dues_F_Two year                    3
pay_M_Bank transfer (automatic)    3
pay_M_Credit card (automatic)      3
pay_M_Electronic check             3
pay_M_Mailed check                 3
dtype: int64
X_train = X_train.dropna()
y_train = y_train.dropna()
pca = PCA(n_components=6)
pca.fit(X_train)
x_pca = pca.transform(X_train)
pca = PCA(n_components=6)
principalComponents_train_data = pca.fit_transform(X_train)
print(principalComponents_train_data.shape)
(663, 6)
principalComponents_train_data_Df = pd.DataFrame(data = principalComponents_train_data, 
                                                 columns = ['p_c_1', 'p_c_2','p_c_3','p_c_4','p_c_5','p_c_6'])
principalComponents_train_data_Df.head()
      p_c_1     p_c_2     p_c_3     p_c_4     p_c_5     p_c_6
0 -1.123156 -1.765350 -0.568427 -0.202989 -0.736615 -0.304088
1 -0.952456 -1.128504  0.248433  0.710367 -1.043984 -0.181601
2 -0.733406 -1.910995 -0.745070 -0.098707 -0.033341 -0.329689
3  0.376765 -1.248122 -1.049732 -0.300304 -0.040315 -0.259779
4 -1.725237 -1.782674 -0.129725  0.761185 -0.857132  0.426444
X_train.head()
[X_train preview — standardized numeric columns plus the categorical/one-hot columns, 41 columns in total]

5 rows × 41 columns

df_comp = pd.DataFrame(pca.components_,index=list(['component 0', 'component 1', 'component 2',
                                                  'component 3','component 4', 'component 5']))
components = df_comp.sort_values(by ='component 0', axis=1,ascending=False).round(decimals=6)
components.transpose()
[PCA loadings table: 41 feature columns × 6 components, with features sorted by their loading on component 0. The numeric columns (index 2 TotalDues, 1 MonthlyDues, 0 MonthsInUnion) carry the largest loadings on component 0, while indices 12, 15, 18, 21, 24 and 27 (the B–G *_Maryville indicator columns) sit at the bottom of the ranking with identical loadings.]
pca.explained_variance_ratio_
array([0.26047333, 0.16587216, 0.08584148, 0.06586519, 0.05341772,
       0.03435168])
X_train.iloc[:, [12, 15, 18, 21, 24, 27]].head()
   B_Maryville  C_Maryville  D_Maryville  E_Maryville  F_Maryville  G_Maryville
0          0.0          0.0          0.0          0.0          0.0          0.0
1          0.0          0.0          0.0          0.0          0.0          0.0
2          0.0          0.0          0.0          0.0          0.0          0.0
3          0.0          0.0          0.0          0.0          0.0          0.0
4          0.0          0.0          0.0          0.0          0.0          0.0
X_train_final = X_train.drop(['C_Maryville', 'D_Maryville', 'E_Maryville', 'F_Maryville', 'G_Maryville'], axis = 1)
X_train_final.head()
[X_train_final preview after dropping the redundant *_Maryville columns — 36 columns]

5 rows × 36 columns

X_train = df2.drop(['LeftUnion'], axis=1)
table1 = X_train.head()   # Check
# For test set
X_test = df_test.drop(['LeftUnion'], axis=1)
table2 = X_test.head()  # Check
display(table1)
display(table2)
[X_train preview (df2 without LeftUnion) — 41 columns]

5 rows × 41 columns

[X_test preview (df_test without LeftUnion) — 41 columns]

5 rows × 41 columns

For the training set

  • Convert the series to a DataFrame.
  • Encode the target values into 1 and 0 (a sketch of a possible encode() helper follows below).
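The encode() helper applied below is not defined anywhere above; a minimal sketch of what it might look like (an assumption, mapping 'Yes' to 1 and 'No' to 0).

def encode(column):
    # Assumed helper: map a Yes/No column to 1/0.
    return column.map({'Yes': 1, 'No': 0})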
y_train = df2["LeftUnion"]
y_train = y_train.to_frame()
table1 = y_train.head()
y_train = y_train.astype(str).apply(encode)
table2 = y_train.head()
display(table1)
display(table2)
  LeftUnion
0        No
1       Yes
2       Yes
3        No
4       Yes

  LeftUnion
0         0
1         1
2         1
3         0
4         1

For the testing set

  • Convert the series to a DataFrame.
  • Encode the target values into 1 and 0.
y_test = df_test["LeftUnion"]
y_test = y_test.to_frame()
table1 = y_test.head()
y_test = y_test.apply(encode)
table2 = y_test.head()
display(table1)
display(table2)
  LeftUnion
0        No
1       Yes
2       Yes
3        No
4        No

  LeftUnion
0         0
1         1
2         1
3         0
4         0

Fitting models

Logistic regression model

With this model we achieved fairly high accuracy (about 0.78 on the held-out test set).

logisticRegr = LogisticRegression(solver='lbfgs',max_iter=1000)
logisticRegr.fit(X_train.values, y_train.values.ravel())
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_pred = logisticRegr.predict(X_test)
print(y_pred)
[0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 1
 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0
 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 1 0 0 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 0 0 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 0
 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix

[Figure: confusion matrix for the logistic regression model]
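
The plotting helper used for these confusion-matrix figures is not shown; a minimal sketch using scikit-learn and a seaborn heatmap as an assumed substitute.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test.values.ravel(), y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()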

Printing the Accuracy Score

Accuracy Score : 0.78

Display Classification Report as Data Frame

              precision    recall  f1-score     support
0              0.845528  0.863071  0.854209  241.000000
1              0.607143  0.573034  0.589595   89.000000
accuracy       0.784848  0.784848  0.784848    0.784848
macro avg      0.726336  0.718052  0.721902  330.000000
weighted avg   0.781237  0.784848  0.782844  330.000000
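
A minimal sketch of how the accuracy score and the report-as-DataFrame could be produced (output_dict=True is an assumed approach; the original call is not shown).

from sklearn.metrics import accuracy_score, classification_report

y_true = y_test.values.ravel()
print('Accuracy Score :', round(accuracy_score(y_true, y_pred), 2))
report = classification_report(y_true, y_pred, output_dict=True)
print(pd.DataFrame(report).transpose())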

Testing with new dataset

pred = logisticRegr.predict(test_set[0:100])
print(pred)
[1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1
 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1]

Decision tree model
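
No code for this model appears above; a minimal sketch of a scikit-learn decision tree fitted on the same features (the hyperparameters are assumptions).

from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=0)   # default settings assumed
tree_clf.fit(X_train.values, y_train.values.ravel())
y_pred = tree_clf.predict(X_test)
print(y_pred)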

[0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1
 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0
 1 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 0 0 0 0
 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 1 0 0 1 0 1 1 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix

[Figure: confusion matrix for the decision tree model]

Printing the Accuracy Score

Accuracy Score : 0.69

Display Classification Report as Data Frame

              precision    recall  f1-score     support
0              0.787500  0.784232  0.785863  241.000000
1              0.422222  0.426966  0.424581   89.000000
accuracy       0.687879  0.687879  0.687879    0.687879
macro avg      0.604861  0.605599  0.605222  330.000000
weighted avg   0.688986  0.687879  0.688426  330.000000

Support Vector Machine

Now we run our support vector machine model. The accuracy on the test set looks fair (0.73), but the predictions below show that the model assigns every sample to the majority class ('No'), so it never identifies a member who actually leaves.
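
No code for the SVM appears above; a minimal sketch of a scikit-learn fit (the kernel and other hyperparameters are assumptions). Setting class_weight='balanced' would be one way to counter the all-majority-class behaviour noted above.

from sklearn.svm import SVC

svm_clf = SVC(kernel='rbf', gamma='scale')   # assumed settings
svm_clf.fit(X_train.values, y_train.values.ravel())
y_pred = svm_clf.predict(X_test)
print(y_pred)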

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Plot Confusion Matrix

[Figure: confusion matrix for the SVM model]

Printing the Accuracy Score

Accuracy Score : 0.73

Display Classification Report as Data Frame

              precision    recall  f1-score     support
0              0.730303  1.000000  0.844133  241.000000
1              0.000000  0.000000  0.000000   89.000000
accuracy       0.730303  0.730303  0.730303    0.730303
macro avg      0.365152  0.500000  0.422067  330.000000
weighted avg   0.533343  0.730303  0.616473  330.000000

Random Forest

Time to try a random forest model. It is an ensemble technique that combines many decision trees in order to learn the most useful features and perform well on the test set, and it is a widely used machine learning model.
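
No code for this model appears above; a minimal sketch of a scikit-learn random forest (the number of trees and other settings are assumptions).

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)   # assumed settings
rf_clf.fit(X_train.values, y_train.values.ravel())
y_pred = rf_clf.predict(X_test)
print(y_pred)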

[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1
 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0
 0 1 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix

[Figure: confusion matrix for the random forest model]

Printing the Accuracy Score

Accuracy Score : 0.79

Display Classification Report as Data Frame

              precision    recall  f1-score     support
0              0.830116  0.892116  0.860000  241.000000
1              0.633803  0.505618  0.562500   89.000000
accuracy       0.787879  0.787879  0.787879    0.787879
macro avg      0.731959  0.698867  0.711250  330.000000
weighted avg   0.777171  0.787879  0.779765  330.000000

Neural Network

Now we train a neural network to see how well a simple dense (fully connected) network performs on this data.
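
A minimal sketch that matches the model summary below (two hidden layers of 12 and 8 units, one sigmoid output, and 41 input features); the optimizer, loss, activations and batch size are assumptions consistent with a binary classifier.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(12, activation='relu', input_shape=(41,)),   # 12 * (41 + 1) = 504 parameters
    layers.Dense(8, activation='relu'),                       # 8 * (12 + 1)  = 104 parameters
    layers.Dense(1, activation='sigmoid'),                    # 1 * (8 + 1)   =   9 parameters
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
model.fit(X_train.values, y_train.values, epochs=150, batch_size=10)   # batch size assumed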

WARNING:tensorflow:From /opt/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense (Dense)                (None, 12)                504       
_________________________________________________________________
dense_1 (Dense)              (None, 8)                 104       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 9         
=================================================================
Total params: 617
Trainable params: 617
Non-trainable params: 0
_________________________________________________________________

Compile and fit the keras model

WARNING:tensorflow:From /opt/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/150
666/666 [==============================] - 0s 362us/sample - loss: 8.2351 - acc: 0.7402
Epoch 2/150
666/666 [==============================] - 0s 152us/sample - loss: 0.7725 - acc: 0.6877
Epoch 3/150
666/666 [==============================] - 0s 146us/sample - loss: 0.5178 - acc: 0.7387
Epoch 4/150
666/666 [==============================] - 0s 148us/sample - loss: 0.5769 - acc: 0.7462
Epoch 5/150
666/666 [==============================] - 0s 163us/sample - loss: 0.6633 - acc: 0.7342
Epoch 6/150
666/666 [==============================] - 0s 148us/sample - loss: 0.5178 - acc: 0.7553
Epoch 7/150
666/666 [==============================] - 0s 149us/sample - loss: 0.5436 - acc: 0.7598
Epoch 8/150
666/666 [==============================] - 0s 149us/sample - loss: 0.5062 - acc: 0.7838
Epoch 9/150
666/666 [==============================] - 0s 145us/sample - loss: 0.5271 - acc: 0.7853
Epoch 10/150
666/666 [==============================] - 0s 149us/sample - loss: 0.4971 - acc: 0.7748
Epoch 11/150
666/666 [==============================] - 0s 145us/sample - loss: 0.4631 - acc: 0.7973
Epoch 12/150
666/666 [==============================] - 0s 146us/sample - loss: 0.6097 - acc: 0.7538
Epoch 13/150
666/666 [==============================] - 0s 147us/sample - loss: 0.4964 - acc: 0.7838
Epoch 14/150
666/666 [==============================] - 0s 145us/sample - loss: 0.4721 - acc: 0.7748
Epoch 15/150
666/666 [==============================] - 0s 151us/sample - loss: 0.4736 - acc: 0.7853
Epoch 16/150
666/666 [==============================] - 0s 142us/sample - loss: 0.5045 - acc: 0.8018
Epoch 17/150
666/666 [==============================] - 0s 141us/sample - loss: 0.5331 - acc: 0.7838
Epoch 18/150
666/666 [==============================] - 0s 141us/sample - loss: 0.5281 - acc: 0.7823
Epoch 19/150
666/666 [==============================] - 0s 145us/sample - loss: 0.4544 - acc: 0.8018
Epoch 20/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4513 - acc: 0.8063
Epoch 21/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4966 - acc: 0.7883
Epoch 22/150
666/666 [==============================] - 0s 152us/sample - loss: 0.5678 - acc: 0.7763
Epoch 23/150
666/666 [==============================] - 0s 148us/sample - loss: 0.4587 - acc: 0.8123
Epoch 24/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4463 - acc: 0.8003
Epoch 25/150
666/666 [==============================] - 0s 153us/sample - loss: 0.4676 - acc: 0.7988
Epoch 26/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4695 - acc: 0.8003
Epoch 27/150
666/666 [==============================] - 0s 141us/sample - loss: 0.5317 - acc: 0.7868
Epoch 28/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4589 - acc: 0.8033
Epoch 29/150
666/666 [==============================] - 0s 145us/sample - loss: 0.4937 - acc: 0.7913
Epoch 30/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4410 - acc: 0.8018
Epoch 31/150
666/666 [==============================] - 0s 147us/sample - loss: 0.4422 - acc: 0.8078
Epoch 32/150
666/666 [==============================] - 0s 147us/sample - loss: 0.4714 - acc: 0.8093
Epoch 33/150
666/666 [==============================] - 0s 149us/sample - loss: 0.4257 - acc: 0.8093
Epoch 34/150
666/666 [==============================] - 0s 150us/sample - loss: 0.4649 - acc: 0.8018
Epoch 35/150
666/666 [==============================] - 0s 147us/sample - loss: 0.4341 - acc: 0.8123
Epoch 36/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4822 - acc: 0.7868
Epoch 37/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4445 - acc: 0.7988
Epoch 38/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4484 - acc: 0.8078
Epoch 39/150
666/666 [==============================] - 0s 161us/sample - loss: 0.4193 - acc: 0.8138
Epoch 40/150
666/666 [==============================] - 0s 154us/sample - loss: 0.4386 - acc: 0.8108
Epoch 41/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4781 - acc: 0.7808
Epoch 42/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4706 - acc: 0.7868
Epoch 43/150
666/666 [==============================] - 0s 142us/sample - loss: 0.4562 - acc: 0.8078
Epoch 44/150
666/666 [==============================] - 0s 142us/sample - loss: 0.5031 - acc: 0.7958
Epoch 45/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4337 - acc: 0.8108
Epoch 46/150
666/666 [==============================] - 0s 150us/sample - loss: 0.4187 - acc: 0.8093
Epoch 47/150
666/666 [==============================] - 0s 136us/sample - loss: 0.4294 - acc: 0.8183
Epoch 48/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4279 - acc: 0.8063
Epoch 49/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4331 - acc: 0.8078
Epoch 50/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4395 - acc: 0.8168
Epoch 51/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4346 - acc: 0.8018
Epoch 52/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4352 - acc: 0.8138
Epoch 53/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4642 - acc: 0.8063
Epoch 54/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4376 - acc: 0.8153
Epoch 55/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4242 - acc: 0.8153
Epoch 56/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4332 - acc: 0.8153
Epoch 57/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4424 - acc: 0.8183
Epoch 58/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4302 - acc: 0.8033
Epoch 59/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4294 - acc: 0.8093
Epoch 60/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4540 - acc: 0.8048
Epoch 61/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4212 - acc: 0.8168
Epoch 62/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4194 - acc: 0.8138
Epoch 63/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4234 - acc: 0.8228
Epoch 64/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4249 - acc: 0.8213
Epoch 65/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4191 - acc: 0.8138
Epoch 66/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4366 - acc: 0.8108
Epoch 67/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4382 - acc: 0.8228
Epoch 68/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4391 - acc: 0.8078
Epoch 69/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4370 - acc: 0.8108
Epoch 70/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4195 - acc: 0.8198
Epoch 71/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4275 - acc: 0.8228
Epoch 72/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4212 - acc: 0.8153
Epoch 73/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4196 - acc: 0.8258
Epoch 74/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4067 - acc: 0.8198
Epoch 75/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4134 - acc: 0.8108
Epoch 76/150
666/666 [==============================] - 0s 143us/sample - loss: 0.4258 - acc: 0.8048
Epoch 77/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4170 - acc: 0.8243
Epoch 78/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4303 - acc: 0.8138
Epoch 79/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4320 - acc: 0.8138
Epoch 80/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4277 - acc: 0.8258
Epoch 81/150
666/666 [==============================] - 0s 139us/sample - loss: 0.3979 - acc: 0.8288
Epoch 82/150
666/666 [==============================] - 0s 141us/sample - loss: 0.4183 - acc: 0.8303
Epoch 83/150
666/666 [==============================] - 0s 149us/sample - loss: 0.4195 - acc: 0.8108
Epoch 84/150
666/666 [==============================] - 0s 149us/sample - loss: 0.4228 - acc: 0.8243
Epoch 85/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4082 - acc: 0.8288
Epoch 86/150
666/666 [==============================] - 0s 154us/sample - loss: 0.4287 - acc: 0.8108
Epoch 87/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4302 - acc: 0.8213
Epoch 88/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4091 - acc: 0.8138
Epoch 89/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4217 - acc: 0.8288
Epoch 90/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4184 - acc: 0.8213
Epoch 91/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4188 - acc: 0.8408
Epoch 92/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4051 - acc: 0.8333
Epoch 93/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4059 - acc: 0.8258
Epoch 94/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4019 - acc: 0.8273
Epoch 95/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4329 - acc: 0.8108
Epoch 96/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4018 - acc: 0.8393
Epoch 97/150
666/666 [==============================] - 0s 146us/sample - loss: 0.4083 - acc: 0.8348
Epoch 98/150
666/666 [==============================] - 0s 142us/sample - loss: 0.4087 - acc: 0.8243
Epoch 99/150
666/666 [==============================] - 0s 142us/sample - loss: 0.4468 - acc: 0.8228
Epoch 100/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4129 - acc: 0.8198
Epoch 101/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4327 - acc: 0.8258
Epoch 102/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4063 - acc: 0.8318
Epoch 103/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4068 - acc: 0.8318
Epoch 104/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4083 - acc: 0.8303
Epoch 105/150
666/666 [==============================] - 0s 143us/sample - loss: 0.3960 - acc: 0.8213
Epoch 106/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4094 - acc: 0.8258
Epoch 107/150
666/666 [==============================] - 0s 144us/sample - loss: 0.4313 - acc: 0.8243
Epoch 108/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4163 - acc: 0.8363
Epoch 109/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4072 - acc: 0.8273
Epoch 110/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4113 - acc: 0.8183
Epoch 111/150
666/666 [==============================] - 0s 137us/sample - loss: 0.3955 - acc: 0.8378
Epoch 112/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4096 - acc: 0.8408
Epoch 113/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4099 - acc: 0.8348
Epoch 114/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4026 - acc: 0.8198
Epoch 115/150
666/666 [==============================] - 0s 139us/sample - loss: 0.3947 - acc: 0.8333
Epoch 116/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4073 - acc: 0.8333
Epoch 117/150
666/666 [==============================] - 0s 137us/sample - loss: 0.4155 - acc: 0.8243
Epoch 118/150
666/666 [==============================] - 0s 142us/sample - loss: 0.4025 - acc: 0.8378
Epoch 119/150
666/666 [==============================] - 0s 140us/sample - loss: 0.3951 - acc: 0.8333
Epoch 120/150
666/666 [==============================] - 0s 143us/sample - loss: 0.3934 - acc: 0.8333
Epoch 121/150
666/666 [==============================] - 0s 140us/sample - loss: 0.3919 - acc: 0.8348
Epoch 122/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4007 - acc: 0.8318
Epoch 123/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4120 - acc: 0.8153
Epoch 124/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4494 - acc: 0.7883
Epoch 125/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4067 - acc: 0.8108
Epoch 126/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4027 - acc: 0.8048
Epoch 127/150
666/666 [==============================] - 0s 139us/sample - loss: 0.3961 - acc: 0.8408
Epoch 128/150
666/666 [==============================] - 0s 146us/sample - loss: 0.3914 - acc: 0.8333
Epoch 129/150
666/666 [==============================] - 0s 140us/sample - loss: 0.3871 - acc: 0.8498
Epoch 130/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4430 - acc: 0.7868
Epoch 131/150
666/666 [==============================] - 0s 139us/sample - loss: 0.3887 - acc: 0.8378
Epoch 132/150
666/666 [==============================] - 0s 140us/sample - loss: 0.3985 - acc: 0.8243
Epoch 133/150
666/666 [==============================] - 0s 136us/sample - loss: 0.4054 - acc: 0.8213
Epoch 134/150
666/666 [==============================] - 0s 138us/sample - loss: 0.4085 - acc: 0.8243
Epoch 135/150
666/666 [==============================] - 0s 137us/sample - loss: 0.3927 - acc: 0.8333
Epoch 136/150
666/666 [==============================] - 0s 138us/sample - loss: 0.3950 - acc: 0.8333
Epoch 137/150
666/666 [==============================] - 0s 138us/sample - loss: 0.3905 - acc: 0.8348
Epoch 138/150
666/666 [==============================] - 0s 137us/sample - loss: 0.3867 - acc: 0.8438
Epoch 139/150
666/666 [==============================] - 0s 140us/sample - loss: 0.3806 - acc: 0.8348
Epoch 140/150
666/666 [==============================] - 0s 138us/sample - loss: 0.3922 - acc: 0.8348
Epoch 141/150
666/666 [==============================] - 0s 139us/sample - loss: 0.4074 - acc: 0.8153
Epoch 142/150
666/666 [==============================] - 0s 140us/sample - loss: 0.4033 - acc: 0.8303
Epoch 143/150
666/666 [==============================] - 0s 139us/sample - loss: 0.3936 - acc: 0.8228
Epoch 144/150
666/666 [==============================] - 0s 141us/sample - loss: 0.3825 - acc: 0.8468
Epoch 145/150
666/666 [==============================] - 0s 149us/sample - loss: 0.3862 - acc: 0.8258
Epoch 146/150
666/666 [==============================] - 0s 153us/sample - loss: 0.3839 - acc: 0.8423
Epoch 147/150
666/666 [==============================] - 0s 142us/sample - loss: 0.3831 - acc: 0.8363
Epoch 148/150
666/666 [==============================] - 0s 145us/sample - loss: 0.3973 - acc: 0.8243
Epoch 149/150
666/666 [==============================] - 0s 153us/sample - loss: 0.3890 - acc: 0.8423
Epoch 150/150
666/666 [==============================] - 0s 155us/sample - loss: 0.3871 - acc: 0.8438

Evaluate the keras model
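
A minimal sketch of the evaluation calls that would produce the outputs below (assumed; the original cell is not shown).

train_loss, train_acc = model.evaluate(X_train.values, y_train.values)
print('Training Accuracy: %.2f' % (train_acc * 100))
test_loss, test_acc = model.evaluate(X_test.values, y_test.values)
print('Testing Accuracy: %.2f' % (test_acc * 100))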

666/666 [==============================] - 0s 71us/sample - loss: 0.3759 - acc: 0.8453
Training Accuracy: 84.53
330/330 [==============================] - 0s 30us/sample - loss: 0.4749 - acc: 0.7515
Testing Accuracy: 75.15

Model Prediction
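
A minimal sketch of how the 0/1 predictions below can be obtained from the sigmoid outputs (thresholding at 0.5 is an assumed choice).

y_pred = (model.predict(X_test.values) > 0.5).astype(int).ravel()
print(y_pred)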

[0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1
 1 1 0 0 0 1 0 1 1 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0
 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0
 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0
 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0
 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1
 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 0 0 0 0 1 1 0
 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0]

Plot Confusion Matrix

[Figure: confusion matrix for the neural network model]

Printing the Accuracy Score

Accuracy Score : 0.75

Display Classification Report as Data Frame

              precision    recall  f1-score     support
0              0.838298  0.817427  0.827731  241.000000
1              0.536842  0.573034  0.554348   89.000000
accuracy       0.751515  0.751515  0.751515    0.751515
macro avg      0.687570  0.695231  0.691039  330.000000
weighted avg   0.756996  0.751515  0.754000  330.000000

Explain why you think the results differed

With blind guesses the model is not trained on any data; you just make a random prediction, and there is no statistical calculation behind the answer. That is why the results differ after training the model: before training the model has not learnt anything from the data, whereas after training it has learnt the weights and can perform better on data similar to what it has learned.

How would you improve your project if you had more time?

I would apply more advanced statistical techniques for removing outliers and assign higher weights to the minority class. I would also like to do fine-tuning using a pre-trained deep learning model, and apply more data-cleaning techniques to clean out redundant values.


© 2020. Zakaria Alsahfi. All rights reserved.