Dr. Ateendra Jha
Telecom Churn Prediction - Machine Learning Algorithm
A churn-prediction ML model helps the telecom industry change its strategy towards customers who are likely to churn. One telecom giant noticed a huge drop in subscribers, which led to a consolidated loss of ₹7,218.2 crore. Studies have found that retention is 50% less costly than acquiring a new customer, which makes flagging churning customers all the more crucial.
ML Model
October 3, 2021
Telecom Churn Prediction - Machine Learning Algorithm
Introduction :
The telecom industry is a dynamic one in which every customer action carries a signal the service provider can act on: the recharge amount has dropped, no calls are being made, or internet usage has stopped. The reason for such odd behaviour can be either the quality of service or the cost of recharge. A loss of subscribers translates directly into a loss for the company, and the risk is greater when most subscribers are prepaid customers. In such a case it becomes important to flag churning behaviour and approach the customer to keep him or her on the service. A churn-prediction ML model helps the telecom industry change its strategy towards customers who are likely to churn. One telecom giant noticed a huge drop in subscribers, which led to a consolidated loss of ₹7,218.2 crore. Studies have found that retention is 50% less costly than acquiring a new customer, which makes flagging churning customers all the more crucial.
Impact :
The telecom industry spends about 15% of its revenue on boosting infrastructure and IT, but about 20% on retention and acquisition of customers. A study has found that retention is 50% less costly than acquiring a new customer. This makes retention an important factor in preventing avoidable losses.
Problem Statement :
To build a model that gives an early warning about customers who may churn.
To develop a strategy to retain customers by analysing the services they use.
Approach :
Data understanding and exploration
Data cleaning
Exploratory Data analysis
Data preparation
Model building and evaluation
Summary
Conclusion
Solution :
Among all the models, the decision tree is able to give 94% accuracy, while the sensitivity and specificity of logistic regression are also about 80%, so these models can be a good choice. There is also nothing wrong with getting in touch with customers who are falsely marked as churned; it will reinforce their loyalty and keep them on the network.
Code :
#Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, f1_score
from sklearn.metrics import precision_score, auc, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import precision_recall_curve
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestRegressor
from IPython.display import Image
from six import StringIO
from sklearn.tree import export_graphviz
import pydotplus, graphviz
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
get_ipython().run_line_magic('matplotlib', 'inline')
# %% [markdown]
# # 1. Data understanding and exploration
# %%
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# %%
#Load the data file
tcd= pd.read_csv('telecom_churn_data.csv')
tcd.head()
# %% [markdown]
# #### Check the various attributes of data like shape (rows and cols), Columns, datatypes
# %%
tcd.shape
# %%
tcd.columns.values
# %%
tcd.info(verbose=True)
# %%
# Check the descriptive statistics of numeric variables
tcd.describe()
# %% [markdown]
# # 2. Data cleaning
# - Null values
# - Drop unnecessary rows and columns
# - Impute missing values where appropriate
# %%
#checking percentage of null values in each column
# Checking the percentage of null values in each column
null_pct = round(100 * tcd.isna().sum() / len(tcd.index), 2)
null_pct[null_pct > 0].sort_values(ascending=False).head(40)
# %% [markdown]
# * It is usually advisable to drop columns having more than 70% null values, but we will not drop them at this stage, to prevent any loss of data that would help us identify the high-value customers.
# * The first aim is to check whether the null values in the recharge amount and the recharge date occur at the same index; if so, we can simply impute 0 there.
# %%
# Checking and imputing for the 6th month
Null_rech_6_index = tcd['total_rech_data_6'].isnull()
date_Null_rech_6_index = tcd['date_of_last_rech_data_6'].isnull()
if Null_rech_6_index.equals(date_Null_rech_6_index):
    tcd['total_rech_data_6'].fillna(0, inplace=True)
    tcd['av_rech_amt_data_6'].fillna(0, inplace=True)
# Checking and imputing for the 7th month
Null_rech_7_index = tcd['total_rech_data_7'].isnull()
date_Null_rech_7_index = tcd['date_of_last_rech_data_7'].isnull()
if Null_rech_7_index.equals(date_Null_rech_7_index):
    tcd['total_rech_data_7'].fillna(0, inplace=True)
    tcd['av_rech_amt_data_7'].fillna(0, inplace=True)
# Checking and imputing for the 8th month
Null_rech_8_index = tcd['total_rech_data_8'].isnull()
date_Null_rech_8_index = tcd['date_of_last_rech_data_8'].isnull()
if Null_rech_8_index.equals(date_Null_rech_8_index):
    tcd['total_rech_data_8'].fillna(0, inplace=True)
    tcd['av_rech_amt_data_8'].fillna(0, inplace=True)
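# %% [markdown]
# The three month-wise blocks above can also be written as a single loop over the month suffixes; a compact, equivalent sketch (same logic, just refactored) is shown below.
# %%
# Equivalent loop over the month suffixes (refactoring sketch; re-running it is harmless)
for month in ['6', '7', '8']:
    rech_null = tcd['total_rech_data_' + month].isnull()
    date_null = tcd['date_of_last_rech_data_' + month].isnull()
    if rech_null.equals(date_null):
        tcd['total_rech_data_' + month].fillna(0, inplace=True)
        tcd['av_rech_amt_data_' + month].fillna(0, inplace=True)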
# %% [markdown]
# * Now we can drop the columns with more than 70% NA values
# %%
# Recheck the number of columns having more than 70% null values
(((tcd.isnull().sum()/ len(tcd)) * 100) >= 70).sum()
# %%
# We can derive a new variable for the analysis.
# The total amount spent on data recharge can be calculated by multiplying the average data-recharge amount by the number of data recharges
tcd['total_data_amt_6'] = tcd['total_rech_data_6'] * tcd['av_rech_amt_data_6']
tcd['total_data_amt_7'] = tcd['total_rech_data_7'] * tcd['av_rech_amt_data_7']
tcd['total_data_amt_8'] = tcd['total_rech_data_8'] * tcd['av_rech_amt_data_8']
tcd['total_data_amt_9'] = tcd['total_rech_data_9'] * tcd['av_rech_amt_data_9']
# %%
# On the basis of the good phase (the 6th and 7th months), define the high-value customers.
Combine_amount_6_7 = tcd[['total_data_amt_6','total_data_amt_7','total_rech_amt_6',
'total_rech_amt_7']].mean(axis = 1)
HV_70th_percentile = np.percentile(Combine_amount_6_7, 70)
print("70th percentile is - ", HV_70th_percentile)
# %% [markdown]
# ### High-value customers
# %%
# As per the business goals, high-value customers should be targeted.
# Finding the high-value customers
tcd = tcd[Combine_amount_6_7 >= HV_70th_percentile]
# %%
# check the data left with us for analysis
tcd.shape
# %%
tcd.head()
# %%
# Resetting the index
tcd = tcd.reset_index(drop=True)
# %%
tcd.head()
# %%
# The volume-based cost columns are named in a different format.
# It is better to have those columns in the same month format, i.e. 6, 7, 8 and 9
tcd.rename(columns = {'jun_vbc_3g':'vbc_3g_6', 'jul_vbc_3g':'vbc_3g_7', 'aug_vbc_3g':'vbc_3g_8', 'sep_vbc_3g':'vbc_3g_9'}, inplace=True)
# %% [markdown]
# ### Now we will mark the churn customers
# ##### Churn : 1
# ##### Not Churn : 0
#
# * This will be on the basis of the 9th-month data: if incoming calls, outgoing calls and data usage are all zero, then those customers will be considered as churned.
#
# %%
# Marking churn
tcd['churn'] = tcd.apply(lambda x: 1 if((x.total_ic_mou_9 == 0) and (x.total_og_mou_9 == 0) and
(x.vol_2g_mb_9 == 0) and (x.vol_3g_mb_9 == 0)) else 0, axis=1)
# %%
tcd['churn'].head()
# %%
# Creating dataframe for the feature used to decide churn
churn_df = tcd[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']]
churn_df.head()
# %% [markdown]
# ### If we compare the first five rows we can say that our mapping is correct
# %%
# As the mapping has been done, it is now time to remove the 9th-month data.
tcd.drop([col for col in tcd.columns if '_9' in col], axis=1, inplace = True)
# %%
# Let's have a look at the shape again
tcd.shape
# %%
# Checking the percentage of missing values now
null_pct = round(100 * tcd.isna().sum() / len(tcd.index), 2)
null_pct[null_pct > 0].sort_values(ascending=False).head(40)
# %%
# Columns with at least 40% missing data
missing_pct = (tcd.isnull().sum() / len(tcd)) * 100
cols_40_percent_missing_data = missing_pct[missing_pct >= 40].index
cols_40_percent_missing_data
# %%
# As about 50% of the values in these date columns are missing and imputing dates adds little value, it is better to drop them
tcd.drop(['date_of_last_rech_data_6', 'date_of_last_rech_data_7', 'date_of_last_rech_data_8'], axis=1, inplace = True)
# %%
# For the remaining columns max_rech_data_6, max_rech_data_7 and max_rech_data_8, let's look at the statistical parameters
for i in ['max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8']:
    print(i)
    print("-------------------------------")
    print(tcd[i].describe())
    print('-------------------------------------')
    print("NULL values : ", tcd[i].isnull().sum())
    print('-------------------------------------')
    print('-------------------------------------')
# %%
# As all of these have a minimum value of 1, we can say the null values belong to customers who have not recharged, which can be taken as a 0 amount
# Impute the missing data with 0
for i in ['max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8']:
    tcd[i].fillna(0, inplace=True)
# %%
# Checking the other columns in the list
for i in ['_6', '_7', '_8']:
    print('For the month : ', i)
    for j in ['count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_3g_6', 'count_rech_3g_7',
              'count_rech_3g_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8',
              'night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8']:
        if i in j:
            print(tcd[[j]].isna().sum())
    print('--------------------------')
# %% [markdown]
# ### From the above output we can see that, for a given month, all the null counts are the same, so the values are probably missing at the same indices for that month
# ### We can also see that the 8th month has more missing values.
# ### We can further infer that a missing value in these columns likely means the customer has stopped the service, so the value can be imputed with zero
# %%
columns_to_impute = ['count_rech_2g_6', 'count_rech_2g_7',
'count_rech_2g_8', 'count_rech_3g_6', 'count_rech_3g_7',
'count_rech_3g_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6',
'arpu_2g_7', 'arpu_2g_8', 'night_pck_user_6', 'night_pck_user_7',
'night_pck_user_8', 'fb_user_6', 'fb_user_7', 'fb_user_8']
tcd[columns_to_impute] = tcd[columns_to_impute].fillna(0)
# %% [markdown]
# ### Variance and Uniqueness
# %%
# Now let's have a look at the columns where every value is the same, i.e. zero variance, as these columns will be of no use for the analysis.
columns_0_variance = tcd.var(numeric_only=True) == 0
column_name_0_variance = columns_0_variance[columns_0_variance == 1].index
print(column_name_0_variance)
print("Total columns with no variance : ", columns_0_variance.sum())
print("------------------------------------------------------------------------------")
columns_1_unique = tcd.nunique() == 1
column_name_1_unique = columns_1_unique[columns_1_unique == 1].index
print(column_name_1_unique)
print("Total columns with only ONE unique value : ", columns_1_unique.sum())
# %%
# Dropping the zero-variance columns
tcd.drop(column_name_0_variance, axis=1, inplace = True)
# %%
Any_NA_columns = tcd.columns[tcd.isna().any()].tolist()
Any_NA_columns
# %%
# Checking the above columns month-wise
for i in ['_6', '_7', '_8']:
    print('-------------------------')
    print('For the month : ', i)
    col_list = []
    for j in Any_NA_columns:
        if i in j:
            col_list.append(j)
    print(tcd[col_list].info())
# %% [markdown]
# #### Here also we can see that by the 8th month the number of missing values has increased. That suggests the service has been stopped, so we can impute 0
# %%
# Impute the missing values in the above columns with 0.
# We do not want to impute date objects with 0, so we exclude them for now.
for column in Any_NA_columns:
    if "date_of_last_rech" not in column:
        tcd[column].fillna(0, inplace=True)
# %%
date_columns = ['date_of_last_rech_6', 'date_of_last_rech_7', 'date_of_last_rech_8',
'last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8']
tcd[date_columns].info()
# %%
# The date columns are of no use for the analysis.
tcd.drop(date_columns, axis=1, inplace = True)
# %%
# Check missing values again
null_pct = round(100 * tcd.isna().sum() / len(tcd.index), 2)
null_pct[null_pct > 0].sort_values(ascending=False).head(40)
# %% [markdown]
# ## No missing values left
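# %% [markdown]
# A one-line sanity check (a small sketch, not part of the original flow): assert that no NaN values remain anywhere in tcd.
# %%
# Raise if any missing value is still present
assert tcd.isna().sum().sum() == 0, "Unexpected missing values remain"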
# %%
pd.set_option('display.max_columns', None)
tcd.head()
# %%
for i in ['arpu_6', 'arpu_7', 'arpu_8']:
    print(tcd[i].describe())
# %%
tcd = tcd[(tcd.arpu_6 > 0) & (tcd.arpu_7 > 0) & (tcd.arpu_8 > 0)]
tcd.shape
# %%
# Let's drop individual columns whose totals are available as a different attribute
individual_cols = ['loc_ic_t2t_mou_6', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8',
'loc_ic_t2m_mou_6', 'loc_ic_t2m_mou_7', 'loc_ic_t2m_mou_8',
'loc_ic_t2f_mou_6', 'loc_ic_t2f_mou_7', 'loc_ic_t2f_mou_8',
'std_ic_t2t_mou_6', 'std_ic_t2t_mou_7', 'std_ic_t2t_mou_8',
'std_ic_t2m_mou_6', 'std_ic_t2m_mou_7', 'std_ic_t2m_mou_8',
'std_ic_t2f_mou_6', 'std_ic_t2f_mou_7', 'std_ic_t2f_mou_8',
'loc_og_t2t_mou_6', 'loc_og_t2t_mou_7', 'loc_og_t2t_mou_8',
'loc_og_t2m_mou_6', 'loc_og_t2m_mou_7', 'loc_og_t2m_mou_8',
'loc_og_t2f_mou_6', 'loc_og_t2f_mou_7', 'loc_og_t2f_mou_8',
'loc_og_t2c_mou_6', 'loc_og_t2c_mou_7', 'loc_og_t2c_mou_8',
'std_og_t2t_mou_6', 'std_og_t2t_mou_7', 'std_og_t2t_mou_8',
'std_og_t2m_mou_6', 'std_og_t2m_mou_7', 'std_og_t2m_mou_8',
'std_og_t2f_mou_6', 'std_og_t2f_mou_7', 'std_og_t2f_mou_8',
'last_day_rch_amt_6', 'last_day_rch_amt_7', 'last_day_rch_amt_8',
'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8',
'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8',
'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8']
tcd.drop(individual_cols, axis = 1, inplace = True)
# %%
tcd.shape
# %% [markdown]
# # 3. Exploratory Data Analysis
# %%
# Statistical summary
tcd.describe()
# %% [markdown]
# ### Checking the uniqueness of the customer list
# %%
len(tcd.mobile_number.unique()) == len(tcd)
# %%
# The mobile_number column is of no further use, so we can drop it now
tcd.drop(['mobile_number'], axis=1, inplace = True)
# %%
tcd.shape
# %% [markdown]
# ## It is good to check the class balance in the data
# %%
sns.countplot(tcd.churn)
plt.title('Churned Customers')
plt.show()
print ('Churned customer : ',len(tcd[(tcd.churn ==1)]))
print ('Not Churned customer : ',len(tcd[(tcd.churn == 0)]))
print('Percentage Difference : ', round(abs((100*len(tcd[(tcd.churn ==1)])/len(tcd)) - 100*len(tcd[(tcd.churn ==0)])/len(tcd)),2),'%')
print (f'Proportion of Churned to Not Churned = {100* len(tcd[(tcd.churn ==1)])/len(tcd[(tcd.churn ==0)]):.2f}%' )
# %% [markdown]
# #### We can observe that the dataset is imbalanced. We need to handle this, and will do so at a later stage.
#
# %% [markdown]
# ## Univariate analysis of categorical data
#
# %%
# Categorical features list
cat_features = ['night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8', 'fb_user_6', 'fb_user_7', 'fb_user_8']
subplot_loc = 1
plt.figure(figsize = (18,15))
for i in cat_features:
    Col_value = round(((tcd[i].value_counts(dropna = False))/(len(tcd[i])) * 100), 2)
    plt.subplot(2, 3, subplot_loc)
    ax = sns.barplot(x = Col_value.index, y = Col_value.values, order = Col_value.sort_index().index)
    plt.xlabel(i, labelpad = 15)
    plt.ylabel('Percentage Rate', labelpad = 10)
    plt.title('Percentage of '+i)
    subplot_loc += 1
# %% [markdown]
# ### From the univariate analysis we can interpret that by the 8th month the number of night-pack users as well as fb_users has reduced. This might be an indication of disloyalty to the service.
# %% [markdown]
# # Bivariate Analysis
# %%
# Bivariate Analysis
# Defining the plotting function and plotting
def bivariate(y_axis):
    plt.figure(figsize = (18, 5))
    xlabel = "Churn Status"
    x_axis = "churn"
    plot_title_1 = "Month 6 - Churn Vs " + y_axis
    plot_title_2 = "Month 7 - Churn Vs " + y_axis
    plot_title_3 = "Month 8 - Churn Vs " + y_axis
    plt.subplot(1, 3, 1)
    sns.boxplot(x = x_axis, y = y_axis + "_6", hue = "churn", data = tcd, showfliers = False)
    plt.title(plot_title_1)
    plt.xlabel(xlabel, labelpad = 15)
    plt.ylabel((y_axis + "_6"), labelpad = 10)
    plt.subplot(1, 3, 2)
    sns.boxplot(x = x_axis, y = y_axis + "_7", hue = "churn", data = tcd, showfliers = False)
    plt.title(plot_title_2)
    plt.xlabel(xlabel, labelpad = 15)
    plt.ylabel((y_axis + "_7"), labelpad = 10)
    plt.subplot(1, 3, 3)
    sns.boxplot(x = x_axis, y = y_axis + "_8", hue = "churn", data = tcd, showfliers = False)
    plt.title(plot_title_3)
    plt.xlabel(xlabel, labelpad = 15)
    plt.ylabel((y_axis + "_8"), labelpad = 10)
    plt.subplots_adjust(wspace = 0.4)
    plt.show()
for i in ["arpu", "onnet_mou", "offnet_mou", "total_og_mou", "total_ic_mou", "total_rech_num", "total_rech_amt",
          "total_rech_data", "vol_2g_mb", "vol_3g_mb", "vbc_3g", "total_data_amt"]:
    bivariate(i)
# %% [markdown]
# #### We can clearly see the huge drop in usage in the 8th month.
# %%
tcd.shape
# %%
# Plotting the correlation heatmap
plt.figure(figsize=[20,20])
sns.heatmap(tcd.corr(), mask=np.triu(np.ones_like(tcd.corr(), dtype=bool)))
plt.show()
# Highly correlated features list: any feature involved in a pair with correlation above 0.8
corr_full = tcd.corr()
upper_tri = corr_full.where(np.triu(np.ones(corr_full.shape), k=1).astype(bool))
High_corr_list = [col for col in upper_tri.columns if (upper_tri[col] > 0.8).any()]
print('Highly correlated Features , >80%')
print(High_corr_list)
# %%
# Creating the good-phase and action-phase data
def good_action_phase(df, var):
    col_6 = var + "_6"
    col_7 = var + "_7"
    col_8 = var + "_8"
    good_phase_col = var + "_good_phase"
    action_phase_col = var + "_action_phase"
    df[good_phase_col] = (df[col_6] + df[col_7]) / 2
    df[action_phase_col] = df[col_8] - df[good_phase_col]
    df.drop([col_6, col_7, col_8], axis = 1, inplace = True)
    return df
# %%
# Derive the good-phase and action-phase variables
for i in ["arpu", "onnet_mou", "offnet_mou", "roam_ic_mou", "roam_og_mou", "loc_og_mou", "std_og_mou", "isd_og_mou", "spl_og_mou", "og_others",
          "total_og_mou", "loc_ic_mou", "std_ic_mou", "spl_ic_mou", "isd_ic_mou", "ic_others", "total_ic_mou", "total_rech_num",
          "total_rech_amt", "max_rech_amt", "total_rech_data", "max_rech_data", "count_rech_2g", "count_rech_3g", "vol_2g_mb",
          "vol_3g_mb", "monthly_2g", "sachet_2g", "monthly_3g", "sachet_3g", "vbc_3g", "total_data_amt"]:
    tcd = good_action_phase(tcd, i)
tcd.head()
# %% [markdown]
# ## Checking outliers
# %%
Good_nume_var = [i for i in tcd.columns if 'good' in i]
Act_NUM_var = [i for i in tcd.columns if 'action' in i ]
# %%
# Identifying outliers
for i, j in zip(Good_nume_var, Act_NUM_var):
    plt.figure(figsize=[18,5])
    plt.subplot(1,2,1)
    sns.boxplot(tcd[i])
    plt.title(i)
    plt.subplot(1,2,2)
    sns.boxplot(tcd[j])
    plt.title(j)
    plt.show()
# %% [markdown]
# ## Huge outliers are clearly visible
#
# #### We will handle them later; a quick count of the outliers is sketched below, and then we first split the data into train and test
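# %% [markdown]
# A rough sketch of how the visual impression above can be quantified: count the values falling outside the usual 1.5 * IQR Tukey fences for each good-phase and action-phase column.
# %%
# Count of IQR outliers per derived column (sketch)
for col in Good_nume_var + Act_NUM_var:
    q1, q3 = tcd[col].quantile(0.25), tcd[col].quantile(0.75)
    iqr = q3 - q1
    n_outliers = ((tcd[col] < q1 - 1.5 * iqr) | (tcd[col] > q3 + 1.5 * iqr)).sum()
    print(f'{col}: {n_outliers} outliers')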
# %% [markdown]
# # 4. Data Preparation
# ## Splitting the data
# %%
X_df = tcd.drop('churn', axis=1)
y_df = tcd['churn']
X_train, X_test, y_train, y_test = train_test_split(X_df,y_df, train_size=0.7, random_state=100)
# %% [markdown]
# ## Handling Outliers and Rescaling
# %%
# Applying RobustScaler to reduce the influence of outliers
# Note: the scaler is fitted on the training data only and applied to the test data
rob = RobustScaler()
X_train = pd.DataFrame(data = rob.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(data = rob.transform(X_test), columns=X_test.columns, index=X_test.index)
# %%
# Re-scaling to the [0, 1] range
scalar = MinMaxScaler()
X_train[:] = scalar.fit_transform(X_train[:])
X_test[:] = scalar.transform(X_test[:])
# %%
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# %%
X_train.head()
# %%
# Balancing the data
sampling = SMOTE(random_state = 100)
X_train_smo, y_train_smo = sampling.fit_resample(X_train, y_train)
# %%
# Size of data
print(X_train_smo.shape)
print(y_train_smo.shape)
# %%
# Churn ratio after resampling
sns.distplot(y_train_smo)
# %% [markdown]
# ### Data is now balanced
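# %% [markdown]
# A quick numeric confirmation of the plot above (sketch): the class proportions after SMOTE should now be roughly 50/50.
# %%
print(y_train_smo.value_counts(normalize=True))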
# %%
plt.figure(figsize=[20,10])
sns.heatmap(X_train_smo.corr())
# %%
# Removing Highly correlated factors
# Create correlation matrix
corr_matrix = X_train_smo.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.7
to_drop = [column for column in upper.columns if any(upper[column] > 0.7)]
X_train_smo.drop(to_drop, axis=1, inplace=True)
# %% [markdown]
# # 5. Model Building and Evaluation
# %% [markdown]
# ## Logistic Regression
# %%
logm1 = sm.GLM(y_train_smo,(sm.add_constant(X_train_smo)), family = sm.families.Binomial())
logm1.fit().summary()
# %% [markdown]
# ### Applying RFE
# %%
logreg = LogisticRegression()
# Running RFE to select 10 variables
rfe = RFE(logreg, n_features_to_select=10)
rfe = rfe.fit(X_train_smo, y_train_smo)
# %%
rfe_columns=X_train_smo.columns[rfe.support_]
# %%
X_train_SM = sm.add_constant(X_train_smo[rfe_columns])
logm = sm.GLM(y_train_smo,X_train_SM, family = sm.families.Binomial())
res = logm.fit()
res.summary()
# %%
# Generating the prediction table
y_train_pred = res.predict(X_train_SM)
y_train_pred = y_train_pred.values.reshape(-1)
# %%
y_train_firnal_pred = pd.DataFrame({'Churn':y_train_smo.values, 'Churn_Prob':y_train_pred})
y_train_firnal_pred.head()
# %%
# Deriving predictions at different cut-off points
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_firnal_pred[i] = y_train_firnal_pred.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_firnal_pred.head()
# %%
# Accuracy, sensitivity and specificity for various cut-offs
cutoff_df = pd.DataFrame(columns = ['prob','accuracy','sensi','speci'])
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_firnal_pred.Churn, y_train_firnal_pred[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
# %%
# Plot accuracy, sensitivity and specificity for various cut-offs
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'], figsize=[18,7])
plt.xticks(np.arange(0,1,0.01), rotation=90)
plt.yticks(np.arange(0,1,0.05))
plt.grid()
plt.show()
# %% [markdown]
# ## We can see that 0.54 is the optimum point to take as the cutoff probability.
#
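# %% [markdown]
# The cutoff above is read off the plot; as a small sketch, it can also be picked programmatically as the probability in cutoff_df where sensitivity and specificity are closest. Note that cutoff_df uses a 0.1-step grid, so this lands on the nearest grid point; on a finer grid it would land near the 0.54 chosen above.
# %%
# Probability with the smallest gap between sensitivity and specificity (sketch)
best_cutoff = cutoff_df.loc[(cutoff_df['sensi'] - cutoff_df['speci']).abs().idxmin(), 'prob']
print('Cutoff with the smallest sensitivity-specificity gap :', best_cutoff)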
# %%
# predicting
y_train_firnal_pred['Churn_pred'] = y_train_firnal_pred.Churn_Prob.map(lambda x: 1 if x > 0.54 else 0)
y_train_firnal_pred.head()
# %% [markdown]
# ### <b>Evaluating the model</b>
# ### Evaluating on train data
#
# %%
confusion = metrics.confusion_matrix(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_pred )
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print(confusion)
print('sensitivity :', (TP / float(TP+FN)))
print('Specificity : ', TN/(TN+FP))
print('Predictive Positive Value :', (TP / float(TP+FP)))
print('Negative Predictive Value : ', (TN / float(TN+ FN)))
# %%
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate = False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Prediction Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()
# %%
# Plotting the curve for the obtained metrics
fpr, tpr, thresholds = metrics.roc_curve(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob, drop_intermediate = False )
draw_roc(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob )
# %%
# Plotting the precision and recall curve
p, r, thresholds = precision_recall_curve(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
# %% [markdown]
# ### Evaluating on test data
#
# %%
# As we have already rescaled X_test, we will pull out the features used in the model
X_test_rfe=X_test[rfe_columns]
X_test_rfe.head()
# %%
X_test_SM = sm.add_constant(X_test_rfe)
# %%
y_test_pred = res.predict(X_test_SM)
# %%
y_test_firnal_pred = pd.DataFrame({'Churn':y_test.values, 'Churn_Prob':y_test_pred})
y_test_firnal_pred.head()
# %%
y_test_firnal_pred['Churn_pred'] = y_test_firnal_pred.Churn_Prob.map(lambda x: 1 if x > 0.54 else 0)
y_test_firnal_pred.head()
# %%
confusion = metrics.confusion_matrix(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_pred )
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print(confusion)
print('sensitivity :', (TP / float(TP+FN)))
print('Specificity : ', TN/(TN+FP))
print('Predictive Positive Value :', (TP / float(TP+FP)))
print('Negative Predictive Value : ', (TN / float(TN+ FN)))
# %%
# Plotting the curve for the test
fpr, tpr, thresholds = metrics.roc_curve(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob, drop_intermediate = False )
draw_roc(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob )
# %%
# Plotting the precision and recall curve
p, r, thresholds = precision_recall_curve(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
# %% [markdown]
# # <b>Logistic Regression using PCA</b>
# %%
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, train_size=0.7, random_state=1)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
# scaling dataset
scaler = MinMaxScaler()
X_train[:] = scaler.fit_transform(X_train[:])
X_test[:] = scaler.transform(X_test[:])
# Applying SMOTE - imbalance correction
sm = SMOTE(random_state=42)
X_train_sm,y_train_sm = sm.fit_resample(X_train,y_train)
print("X_train_sm Shape:", X_train_sm.shape)
print("y_train_sm Shape:", y_train_sm.shape)
# %%
# Applying PCA
pca = PCA(random_state=85)
pca.fit(X_train_sm)
# %%
# Fitting and transforming
X_train_sm_pca=pca.fit_transform(X_train_sm)
X_test_pca=pca.transform(X_test)
# %%
pca.components_
# %%
# Applying logistic regression
lr_pca = LogisticRegression()
lr_pca.fit(X_train_sm_pca, y_train_sm)
# making the predictions
y_pred_pca = lr_pca.predict(X_test_pca)
# converting the prediction into a dataframe
y_pred_df = pd.DataFrame(y_pred_pca)
# %% [markdown]
# ### <b>Evaluating the model</b>
# %%
confusion_matrix(y_test,y_pred_pca)
# %%
accuracy_score(y_test,y_pred_pca)
# %%
plt.bar(range(1,len(pca.explained_variance_ratio_)+1),pca.explained_variance_ratio_)
plt.show()
# %%
# Getting the components
# plotting explained variance ratio
cum_var = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12,7])
plt.plot(cum_var)
plt.xlabel('Principal components count')
plt.ylabel('Cumulative Explained Variance')
plt.xticks(np.arange(0,75,2))
plt.yticks(np.arange(0.5,1,0.1))
plt.grid()
plt.show()
# %% [markdown]
# ### More than 90% of the variance can be explained with 5 PCA components
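# %% [markdown]
# A quick programmatic check of the statement above (sketch), using the cum_var array computed in the previous cell: find the smallest number of components whose cumulative explained variance reaches 90%.
# %%
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1
print('Components needed for 90% explained variance :', n_components_90)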
# %%
# Apply 5 PCA components
pca_final = IncrementalPCA(n_components=5)
train_pca_final = pca_final.fit_transform(X_train_sm)
test_pca_final = pca_final.transform(X_test)
# %%
# Applying logistic Regression
logreg_pca_final = LogisticRegression()
logreg_pca_final.fit(train_pca_final, y_train_sm)
# predictions
y_pred_final = logreg_pca_final.predict(test_pca_final)
# prediction into a dataframe
y_pred_final = pd.DataFrame(y_pred_final)
# %% [markdown]
# ### <b>Evaluating the model</b>
# %%
print( confusion_matrix(y_test,y_pred_final) )
print('Accuracy :', accuracy_score(y_test,y_pred_final) )
# %% [markdown]
# # <b>Decision Tree</b>
# %%
# splitting the data
X = tcd.drop('churn', axis=1)
y = tcd['churn']
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=100)
# %%
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
# %%
plt.figure(figsize=[18,10])
plot_tree(dt, filled=True, rounded=True, feature_names=X.columns, class_names=['Not Churned', 'Churned'])
plt.show()
# %% [markdown]
# ### <b>Evaluating the model</b>
# %%
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)
# %%
print('Train Accuracy :', accuracy_score(y_train, y_train_pred))
print('Test Accuracy : ', accuracy_score(y_test, y_test_pred))
# %%
print('Train Confusion Matrix')
print(confusion_matrix(y_train, y_train_pred))
# %%
print('Test Confusion Matrix')
print(confusion_matrix(y_test, y_test_pred))
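# %% [markdown]
# Accuracy alone can be misleading on imbalanced data, so here is a small additional sketch of sensitivity (recall) and precision for the decision tree on the test set, using the metric imports already loaded at the top.
# %%
print('Test sensitivity (recall) :', recall_score(y_test, y_test_pred))
print('Test precision :', precision_score(y_test, y_test_pred))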
# %% [markdown]
# # <b>Random Forest</b>
#
# %%
rf = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, min_samples_leaf=10)
# %%
rf.fit(X_train, y_train)
# %% [markdown]
# ### <b>Evaluating the model</b>
# %%
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
# %%
r2_score(y_train, y_train_pred)
# %%
r2_score(y_test, y_test_pred)
# %%
rf.feature_importances_
# %%
feature_df = pd.DataFrame({
"Var_name": X_train.columns,
"Imp": rf.feature_importances_})
# %%
feature_df.sort_values(by="Imp", ascending=False).head(10)
# %% [markdown]
# # 6. Summary
# %% [markdown]
# ## Models Applied
# * Logistic Regression (with RFE)
# * Logistic Regression (With PCA)
# * Decision Tree Model
# * Random Forest Model
# %% [markdown]
# ## Logistic Regression Evaluation Sheet
# * On Train Data
#
# >* sensitivity : 0.797
# >* Specificity : 0.786
# >* ROC Curve area : 0.87
#
# * On Test Data
#
# >* sensitivity : 0.740
# >* Specificity : 0.794
# >* ROC Curve area : 0.84
# %% [markdown]
# ## Logistic Regression with PCA Evaluation Sheet
# * PCA Accuracy Score : 0.799
# * 5 PCA components explained 90%
# * IncrementalPCA Accuracy Score with 5 PCA : 0.634
# %% [markdown]
# ## Decision Tree
# * Train Accuracy : 0.943
# * Test Accuracy : 0.941
#
# ## Random Forest
# * R<sup>2</sup> Train : 0.26
# * R<sup>2</sup> Test : 0.2
#
# %% [markdown]
# # 7. Conclusion and Suggestion
# %% [markdown]
# * Among all the models, the decision tree is able to give 94% accuracy, while the sensitivity and specificity of logistic regression are also about 80%, so these models can be a good choice.
# * There is nothing wrong with getting in touch with customers who are falsely marked as churned (see the sketch below); it will reinforce their loyalty and keep them on the network.
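# %% [markdown]
# A small usage sketch of the suggestion above: pull the "falsely churned" (false positive) customers out of the logistic regression test predictions, assuming y_test_firnal_pred from the evaluation cells is still in memory. These are natural candidates for a retention call.
# %%
# Customers predicted as churned (Churn_pred = 1) but actually still active (Churn = 0)
false_positives = y_test_firnal_pred[(y_test_firnal_pred.Churn_pred == 1) & (y_test_firnal_pred.Churn == 0)]
print('Customers flagged as churned but still active :', len(false_positives))
false_positives.head()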
# %% [markdown]
# ### From the analysis we found that the following features can be good indicators for flagging churning customers
# * aon
# * loc_ic_mou_action_phase
# * arpu_action_phase
# * std_ic_mou_action_phase
# * max_rech_amt_action_phase
# * loc_ic_mou_good_phase
# * total_ic_mou_good_phase
#
# > Here, the good phase covers the 6th and 7th months, while the action phase is the 8th month.