Determining Trade Union Status

This project implements and compares several models for predicting trade union membership, with preprocessing applied to the data beforehand. We applied statistical techniques to see which model performs best. The goal is a binary classifier that predicts whether a data scientist will remain a USDU member or not.

Reading data for preprocessing

TRAIN.csv “LeftUnion”

Train and Test Split

We split the data into training and test sets using a function from the scikit-learn library, which makes this task very easy. The split comes first so that preprocessing decisions are made on the training data and not on the test data. Otherwise the model becomes prone to data leakage and will perform worse in production when new data arrives.
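The split described above can be sketched as follows. This is a minimal illustration on a toy dataframe, not the project's actual code: only the LeftUnion column name comes from the text, while the toy values, the 30% test size, the random seed, and the use of stratification are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for TRAIN.csv; only "LeftUnion" is a column name taken from the text.
df = pd.DataFrame({
    "MonthlyDues": [20, 35, 50, 65, 80, 95, 110, 125, 140, 155],
    "LeftUnion": ["No", "Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

X = df.drop(columns=["LeftUnion"])  # features
y = df["LeftUnion"]                 # label

# Hold out part of the rows (assumed 30%); stratify keeps the Yes/No ratio
# similar in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Any statistic used later for preprocessing (means, scaling parameters) would then be computed from X_train only.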

Merge data By label

X_train and y_train
X_test And y_test
NaN values Check

We check for NaN values column by column, because NaN values must be removed before fitting the ML model on the data. For this we created a function named check_nan() that takes a dataframe as its argument and reports the number of NaN values.
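The body of check_nan() is not shown in this report. A minimal sketch of what such a helper might look like, assuming it simply counts missing values per column, is:

```python
import numpy as np
import pandas as pd

def check_nan(df: pd.DataFrame) -> pd.Series:
    """Return the number of NaN values in each column of df (assumed behaviour)."""
    return df.isna().sum()

# Toy data for illustration only.
toy = pd.DataFrame({"TotalDues": [10.0, np.nan, 30.0], "Married": [1, 0, 1]})
nan_counts = check_nan(toy)
print(nan_counts)
```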

Counting unique values:

Here we count the unique values in every column of the dataset. For this we created another function named count_unique() that takes a dataframe column name as input.
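The original count_unique() is not reproduced here. One plausible sketch, assuming it takes the dataframe together with a column name, is:

```python
import pandas as pd

def count_unique(df: pd.DataFrame, column: str) -> int:
    """Return the number of distinct non-NaN values in one column (assumed behaviour)."""
    return df[column].nunique()

# Toy data for illustration only.
toy = pd.DataFrame({"gender": ["F", "M", "F", "M"], "MonthlyDues": [20, 35, 35, 50]})
unique_counts = {col: count_unique(toy, col) for col in toy.columns}
print(unique_counts)
```

Columns with exactly two unique values are binary; those with more are candidates for one-hot encoding.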

Checking dataset columns
One hot Encoding:

We one-hot encode the columns that contain non-binary categorical values. One-hot encoding replaces a categorical column with one 0/1 indicator column per category, so each row has a single 1 among them (e.g. 0000001). This turns categorical features into numeric columns that the model can learn from. For this purpose we created a function named encode_nb(), which takes 3 arguments: the dataframe, the column name, and the prefix to use in the name of every new column.
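The implementation of encode_nb() is not shown. A minimal sketch with the same three arguments, assuming it is built on pandas.get_dummies, could look like this. The "Connection" column name is hypothetical; the "conn" prefix and category values mirror the column names in the output further down.

```python
import pandas as pd

def encode_nb(df: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    """One-hot encode `column` and name the new columns `<prefix>_<category>`."""
    dummies = pd.get_dummies(df[column], prefix=prefix)
    # Replace the original categorical column with its indicator columns.
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Toy data; "Connection" is a hypothetical source column name.
toy = pd.DataFrame({
    "Connection": ["Dial-in", "Fiber optic", "other"],
    "Married": [1, 0, 1],
})
encoded = encode_nb(toy, "Connection", "conn")
print(list(encoded.columns))
```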

gender | Management | USAcitizen | Married | MonthsInUnion | ContinuingEd | PaperlessBilling | MonthlyDues | TotalDues | LeftUnion | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 42 columns

Nan values in every column in The Dataset
Nan values in every column in The Testset
Nan values in every row.
0      0
1      0
2      0
3      0
4      0
664    0
665    0
666    0
667    0
668    0
Length: 669, dtype: int64

Plotting and Visualization

Box and whisker plot:

We draw box-and-whisker plots to check for outliers in the data. Outliers are extreme values that can introduce bias if they are not handled. We use the seaborn library for the box-and-whisker plots. Below we also check the number of unique values for the MonthlyDues and TotalDues features.
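A box plot with default settings draws as individual points the values that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically, which is sometimes easier to inspect than a figure. The sketch below uses made-up dues values and a hypothetical helper name; it is not the project's code.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Made-up MonthlyDues values with one obvious extreme.
monthly_dues = [20, 25, 30, 35, 40, 45, 50, 400]
flagged = iqr_outliers(monthly_dues)
print(flagged)
```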

detecting outlier


There are clearly outliers.

Unique MonthlyDues

Converting the TotalDues column in the training and test sets from strings to numeric values (integer/float)
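The conversion code is not shown. One common way to do it, assumed here, is pandas.to_numeric with errors="coerce", which turns any string that cannot be parsed (such as a blank) into NaN. That would also explain why a NaN appears in TotalDues after this step.

```python
import pandas as pd

# Toy data: one entry is blank and cannot be parsed as a number.
toy = pd.DataFrame({"TotalDues": ["120.5", "89", " ", "42.0"]})

# Unparseable strings become NaN instead of raising an error.
toy["TotalDues"] = pd.to_numeric(toy["TotalDues"], errors="coerce")
print(toy["TotalDues"].dtype, toy["TotalDues"].isna().sum())
```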

Check NaN for specific Columns:

We look for the rows that contain NaN values. NaN values must be removed before fitting the model, otherwise the code will throw an error. We will remove the outliers by applying a threshold value to the column, which drops the outlier rows. Below we print the dataframe row that contains a NaN value, then fill the NaN with the mean of that column.

Checking nan for training set and test set

Number of nan value in training set: 1
Number of nan value in test set: 0
Number of nan value in test set: 8

Finding the row which contains Nan value

gender | Management | USAcitizen | Married | MonthsInUnion | ContinuingEd | PaperlessBilling | MonthlyDues | TotalDues | LeftUnion | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

1 rows × 42 columns

Filling Nan values in TotalDues Column

df3['TotalDues'] = round(df3['TotalDues'].fillna((df3['TotalDues'].mean())),0)
test_set['TotalDues'] = round(test_set['TotalDues'].fillna((test_set['TotalDues'].mean())),0)
df_test['TotalDues'] = round(df_test['TotalDues'].fillna((df_test['TotalDues'].mean())),0)

Checking Nan values again

df3["TotalDues"].isna().sum(axis = 0) 


We plot box plots to check the other columns for outliers. As we can see, there are no outliers in this data, since the outliers were removed previously. A scatter plot can also be used to detect outliers.


As we can see there is no outlier in this data

Scatter Plot:

We check for outliers again, this time with a scatter plot. Here we found 3 outliers in TotalDues. We handled them by taking the mean of the available values. There are certainly other ways too, but this works best for our problem.


Removing Outlier:

Here we remove the outliers by applying a threshold value. Rows with values above the threshold are removed; rows with values below it are kept in the dataframe and later used as input to the model.

gender | Management | USAcitizen | Married | MonthsInUnion | ContinuingEd | PaperlessBilling | MonthlyDues | TotalDues | LeftUnion | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 42 columns
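The threshold-based removal described in this section can be sketched with a boolean mask. The threshold value and the toy data below are hypothetical; the report does not state the cut-off that was actually used.

```python
import pandas as pd

# Toy data with one extreme TotalDues value.
toy = pd.DataFrame({"TotalDues": [120.0, 950.0, 15000.0, 400.0]})

threshold = 10000  # hypothetical cut-off, not the value used in the project

# Keep only the rows at or below the threshold, then renumber the index.
filtered = toy[toy["TotalDues"] <= threshold].reset_index(drop=True)
print(filtered)
```

Resetting the index avoids gaps that could misalign rows when dataframes are concatenated later.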

Scatter and violin Plot:

We plot scatter plots again to confirm that the outliers have been removed, and as we can see the values look good now. Below we plot a scatter plot and a violin plot. The violin plot shows the density of the values, that is, how they are distributed.

[Figures: scatter plot and violin plot]

Scree plot:

Below we plot the scree plot for the MonthlyDues column to see how its values are distributed. It is another way of visualizing the data. We use the matplotlib library for the scree plot.

Bivariate plot:

Below we plot a bivariate plot of MonthlyDues against MonthsInUnion to see how the values of the two columns relate.


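After plotting, we standardize the numeric columns. StandardScaler does not map values into the range 0 to 1; it rescales each column to zero mean and unit variance, which the printed mean and standard deviation confirm.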

sc = StandardScaler()
df_train_new_num = sc.fit_transform(df_train_new_num)
(np.mean(df_train_new_num), np.std(df_train_new_num))
(7.1125398974985e-18, 1.0)
gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | LeftUnion | A_Maryville | A_No | A_Yes | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 39 columns

MonthsInUnion | MonthlyDues | TotalDues | gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | LeftUnion | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 42 columns

df_test_new_num = sc.fit_transform(df_test_new_num)
(np.mean(df_test_new_num), np.std(df_test_new_num))
(6.963318810448982e-17, 1.0)
df_test_new_cat = df_test_new.drop(['MonthsInUnion','MonthlyDues','TotalDues'] , axis = 1)
gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | A_Maryville | A_No | A_Yes | B_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 38 columns

df_test_new_num = pd.DataFrame(df_test_new_num, columns = ['MonthsInUnion','MonthlyDues','TotalDues'])  
df_test_final = pd.concat([df_test_new_num, df_test_new_cat], axis = 1)
MonthsInUnion | MonthlyDues | TotalDues | gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | A_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 41 columns
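The scaling code in this section calls fit_transform on the test data as well, which re-estimates the mean and standard deviation from the test set. A commonly recommended alternative, sketched here on toy arrays rather than the project's data, is to fit the scaler on the training data only and reuse those statistics for the test data, in line with the leakage concern raised earlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features (two columns) for illustration only.
train_num = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test_num = np.array([[2.5, 25.0], [5.0, 50.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train_num)  # learn mean/std from training data
test_scaled = sc.transform(test_num)        # reuse the training mean/std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

With this approach the test set will generally not have exactly zero mean and unit variance, which is expected.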

Perform a PCA

Train and Test Split:

Here we separate the features from the labels for the train and test data so that we can perform training. We use the drop method to remove the label column from the dataframe. The same process applies to both the train and test dataframes.

X_train = df_train_final.drop(['LeftUnion'], axis = 1)
y_train = df_train_final['LeftUnion']
componentsWanted = len(X_train.columns)
print(f'Components wanted = {componentsWanted}')
componentList = ['component'+ str(n) for n in range(componentsWanted)]
Components wanted = 41
X_train = X_train.dropna()
y_train = y_train.dropna()
pca = PCA(n_components=6)
# fit_transform fits the PCA on X_train and projects it in one step;
# calling transform on an unfitted PCA would raise NotFittedError.
principalComponents_train_data = pca.fit_transform(X_train)
(663, 6)
principalComponents_train_data_Df = pd.DataFrame(data = principalComponents_train_data, 
                                                 columns = ['p_c_1', 'p_c_2','p_c_3','p_c_4','p_c_5','p_c_6'])
MonthsInUnion | MonthlyDues | TotalDues | gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | A_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 41 columns

df_comp = pd.DataFrame(pca.components_,index=list(['component 0', 'component 1', 'component 2',
                                                  'component 3','component 4', 'component 5']))
components = df_comp.sort_values(by ='component 0', axis=1,ascending=False).round(decimals=6)
component 0 | component 1 | component 2 | component 3 | component 4 | component 5
array([0.26047333, 0.16587216, 0.08584148, 0.06586519, 0.05341772,
X_train.iloc[:, [12, 15, 18, 21, 24, 27]].head()
X_train_final = X_train.drop(['C_Maryville', 'D_Maryville', 'E_Maryville', 'F_Maryville', 'G_Maryville'], axis = 1)
MonthsInUnion | MonthlyDues | TotalDues | gender | Management | USAcitizen | Married | ContinuingEd | PaperlessBilling | A_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 36 columns
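The PCA workflow used in this section can be summarized in a small, self-contained sketch. It runs on random toy data, not on the project's features, and only illustrates the order of operations: fit first, then read the projected data, the loadings, and the share of variance each component explains.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy matrix standing in for the preprocessed feature table.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=6)
projected = pca.fit_transform(X)        # shape: (n_samples, 6)
loadings = pca.components_              # shape: (6, n_features)
ratios = pca.explained_variance_ratio_  # sorted from largest to smallest

print(projected.shape, loadings.shape)
print(ratios.round(3))
```

Features with consistently small loadings on the leading components contribute little to them, which is one possible basis for dropping columns.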

X_train = df2.drop(['LeftUnion'], axis=1)
table1 = X_train.head()   # Check
# For test set
X_test = df_test.drop(['LeftUnion'], axis=1)
table2 = X_test.head()  # Check
gender | Management | USAcitizen | Married | MonthsInUnion | ContinuingEd | PaperlessBilling | MonthlyDues | TotalDues | A_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 41 columns

gender | Management | USAcitizen | Married | MonthsInUnion | ContinuingEd | PaperlessBilling | MonthlyDues | TotalDues | A_Maryville | ... | conn_Dial-in | conn_Fiber optic | conn_other | dues_F_Month-to-month | dues_F_One year | dues_F_Two year | pay_M_Bank transfer (automatic) | pay_M_Credit card (automatic) | pay_M_Electronic check | pay_M_Mailed check

5 rows × 41 columns

For training set

  • Convert series to DataFrame.
  • Encode target values as 1 and 0.
y_train = df2["LeftUnion"]
y_train = y_train.to_frame()
table1 = y_train.head()
y_train = y_train.astype(str).apply(encode)
table2 = y_train.head()

For testing set

  • Convert series to DataFrame.
  • Encode target values as 1 and 0.
y_test = df_test["LeftUnion"]
y_test = y_test.to_frame()
table1 = y_test.head()
y_test = y_test.apply(encode)
table2 = y_test.head()
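The encode function passed to apply above is not defined anywhere in this report. A minimal sketch of such a helper follows, under the assumption that the LeftUnion labels are the strings "Yes" and "No"; the actual label values are not shown in the source.

```python
import pandas as pd

def encode(column: pd.Series) -> pd.Series:
    """Map "Yes"/"No" labels to 1/0 (assumed label values)."""
    return column.map({"Yes": 1, "No": 0})

# Toy labels for illustration only.
y = pd.Series(["No", "Yes", "No"], name="LeftUnion").to_frame()
y_encoded = y.apply(encode)  # DataFrame.apply passes each column as a Series
print(y_encoded)
```

Any label not present in the mapping would become NaN, so it is worth checking the unique values first.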

Fitting models

Logistic regression model

The logistic regression model reached an accuracy of 0.78 on the test set.

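logisticRegr = LogisticRegression(solver='lbfgs', max_iter=1000)
# ravel() flattens the single-column label frame into the 1-D array that fit expects
logisticRegr.fit(X_train, y_train.values.ravel())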
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
y_pred = logisticRegr.predict(X_test)
[0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 1 0 1
 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix


Printing the Accuracy Score

Accuracy Score : 0.78

Display Classification report as Data Frame

              precision    recall  f1-score     support
macro avg      0.726336  0.718052  0.721902  330.000000
weighted avg   0.781237  0.784848  0.782844  330.000000
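The evaluation steps used for every model in this report (confusion matrix, accuracy score, classification report as a dataframe) can be reproduced with scikit-learn's metrics module. The sketch below uses short made-up label vectors rather than the project's predictions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)

# output_dict=True returns a nested dict that converts cleanly to a DataFrame.
report_df = pd.DataFrame(
    classification_report(y_true, y_pred, output_dict=True)
).transpose()

print(cm)
print(f"Accuracy Score : {acc:.2f}")
print(report_df)
```

The macro average weights both classes equally, while the weighted average weights them by support, so the two diverge when the classes are imbalanced.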

Testing with new dataset

pred = logisticRegr.predict(test_set[0:100])
[1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1

Decision tree model

[0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1
 ... (truncated for brevity)]
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix


Printing the Accuracy Score

Accuracy Score : 0.69

Display Classification report as Data Frame

              precision    recall  f1-score     support
macro avg      0.604861  0.605599  0.605222  330.000000
weighted avg   0.688986  0.687879  0.688426  330.000000

Support Vector Machine

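Here we run our support vector machine model, which reaches an accuracy of 0.73 on the test set. The predictions shown are all 0, and the macro-average recall is 0.50, so the model appears to predict only the majority class; the accuracy then simply reflects the share of that class in the test set.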

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 ... (truncated for brevity)]
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Plot Confusion Matrix


Printing the Accuracy Score

Accuracy Score : 0.73

Display Classification report as Data Frame

              precision    recall  f1-score     support
macro avg      0.365152  0.500000  0.422067  330.000000
weighted avg   0.533343  0.730303  0.616473  330.000000

Random Forest

Next we try a random forest model. It is an ensemble technique that combines multiple decision trees in order to learn the best features and perform well on the test set. It is a widely used machine learning model.

[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1
 ... (truncated for brevity)]
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0]

Plot Confusion Matrix


Printing the Accuracy Score

Accuracy Score : 0.79

Display Classification report as Data Frame

              precision    recall  f1-score     support
macro avg      0.731959  0.698867  0.711250  330.000000
weighted avg   0.777171  0.787879  0.779765  330.000000
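The random forest training code is not included in this report. A minimal sketch of fitting and predicting with scikit-learn's RandomForestClassifier follows; the synthetic data, the number of trees, and the random seeds are assumptions for illustration, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the union dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Each tree is trained on a bootstrap sample; predictions are a majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

print(rf_pred[:10])
print(rf.feature_importances_.round(3))
```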

Neural Network

Finally we trained a neural network to see how well a simple DNN performs on this data.

Evaluate the keras model

666/666 [==============================] - 0s 71us/sample - loss: 0.3759 - acc: 0.8453
Training Accuracy: 84.53
330/330 [==============================] - 0s 30us/sample - loss: 0.4749 - acc: 0.7515
Testing Accuracy: 75.15

Model Prediction

[0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1
 ... (truncated for brevity)]
 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0]

Plot Confusion Matrix


Printing the Accuracy Score

Accuracy Score : 0.75

Display Classification report as Data Frame

              precision    recall  f1-score     support
macro avg      0.687570  0.695231  0.691039  330.000000
weighted avg   0.756996  0.751515  0.754000  330.000000

Explain why you think the results differed

With blind guesses, the model is not trained on any data; it just gives a random prediction, and there is no statistical calculation behind the answer. The results therefore differ after training. Before training, the model has not learnt anything from the data; after training, it has learnt the weights and can perform better on the learned data.

How you would improve your project if you had more time?

I would apply more advanced statistical techniques for removing outliers and assign more weight to the minority class. I would also like to fine-tune a pre-trained deep learning model, and apply more data cleaning techniques to clean out redundant values.
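One way the minority-class weighting mentioned above could be approached in scikit-learn is the class_weight="balanced" option, which weights each class inversely to its frequency. The sketch below shows the weights this option would produce for a toy label vector; it is a possible direction, not something that was done in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: six majority-class rows (0) and two minority-class rows (1).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# "balanced" weight for a class = n_samples / (n_classes * count of that class)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))

# The same weighting can be requested directly when building a model.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```

Here the minority class receives a weight of 2.0 against roughly 0.667 for the majority class, so errors on minority rows cost about three times as much during training.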

