Telecom Churn Prediction - Machine Learning Algorithm

Industry

Telecom

Type

ML Model

Language

Python

Introduction

The telecom industry is dynamic: every customer action is a signal the service provider can act on, whether the recharge amount has dropped, calls have stopped, or internet usage has ceased. Such changes in behavior are usually driven either by the quality of the provider's service or by the cost of recharging.

Losing subscribers translates directly into lost revenue for the company. The risk is higher when most subscribers are prepaid customers. In that case it becomes important to flag churning behavior early and approach those customers in order to retain them.

A churn prediction ML model helps a telecom company adapt its strategy towards customers who are likely to churn. One telecom giant saw a large drop in subscribers, which led to a consolidated loss of ₹ 7,218.2 crore. A study found that retaining a customer is 50% less costly than acquiring a new one, which makes flagging churning customers all the more crucial.

Impact

The telecom industry spends about 15% of its revenue on infrastructure and IT, and about 20% on customer retention and acquisition. A study has found that retaining a customer is 50% less costly than acquiring a new one, which makes retention an important lever for preventing avoidable losses.

Problem Statement

Build a model that gives early warning of customers who are likely to churn.

Develop a strategy to retain those customers by analyzing the services they use.

Approach

1. Data understanding and exploration

2. Data cleaning

3. Exploratory data analysis

4. Data preparation

5. Model building and evaluation

6. Summary

7. Conclusion


See the last section of the notebook below for the summary and conclusion.

Solution

Among all the models, the decision tree reaches about 94% accuracy, while logistic regression achieves sensitivity and specificity of roughly 74% to 80%, so either model can be a good choice.

There is no harm in contacting customers who are falsely flagged as churners: the outreach helps secure their loyalty and keeps them on the network.

Related Data

#Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix, f1_score
from sklearn.metrics import precision_score, auc, roc_auc_score, roc_curve, precision_recall_curve
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import precision_recall_curve
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestRegressor
from IPython.display import Image  
from six import StringIO
from sklearn.tree import export_graphviz
import pydotplus, graphviz
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
get_ipython().run_line_magic('matplotlib', 'inline')

# %% [markdown]
# # 1. Data understanding and exploration

# %%
# Ignore  warnings
import warnings
warnings.filterwarnings('ignore')


# %%
#Load the data file
tcd= pd.read_csv('telecom_churn_data.csv')
tcd.head()

# %% [markdown]
# #### Check the various attributes of data like shape (rows and cols), Columns, datatypes

# %%
tcd.shape


# %%
tcd.columns.values


# %%
tcd.info(verbose=True)


# %%
# Check the descriptive statistics of numeric variables
tcd.describe()

# %% [markdown]
# # 2. Data cleaning
#  - Null values
#  - Drop unnecessary rows and columns
#  - Impute missing values

# %%
#checking percentage of null values in each column

null_pct = 100 * tcd.isna().mean()
null_pct[null_pct > 0].round(2).sort_values(ascending=False).head(40)

# %% [markdown]
# * It is usually advisable to drop columns with more than 70% null values, but we will not drop them at this stage, to avoid losing data that will assist us in identifying the high-value customers.
# * The first aim is to check whether the recharge amount and the recharge date are null at the same indices. If so, we can simply impute 0 for the recharge amount.

# %%
# Checking and imputing for 6th month
# (values are assigned back rather than filled with inplace=True on a column selection,
#  which newer pandas versions may not propagate to the DataFrame)
Null_rech_6_index = tcd['total_rech_data_6'].isnull()
date_Null_rech_6_index = tcd['date_of_last_rech_data_6'].isnull()
if Null_rech_6_index.equals(date_Null_rech_6_index):
   tcd['total_rech_data_6'] = tcd['total_rech_data_6'].fillna(0)
   tcd['av_rech_amt_data_6'] = tcd['av_rech_amt_data_6'].fillna(0)
# Checking and imputing for 7th month
Null_rech_7_index = tcd['total_rech_data_7'].isnull()
date_Null_rech_7_index = tcd['date_of_last_rech_data_7'].isnull()
if Null_rech_7_index.equals(date_Null_rech_7_index):
   tcd['total_rech_data_7'] = tcd['total_rech_data_7'].fillna(0)
   tcd['av_rech_amt_data_7'] = tcd['av_rech_amt_data_7'].fillna(0)
# Checking and imputing for 8th month
Null_rech_8_index = tcd['total_rech_data_8'].isnull()
date_Null_rech_8_index = tcd['date_of_last_rech_data_8'].isnull()
if Null_rech_8_index.equals(date_Null_rech_8_index):
   tcd['total_rech_data_8'] = tcd['total_rech_data_8'].fillna(0)
   tcd['av_rech_amt_data_8'] = tcd['av_rech_amt_data_8'].fillna(0)
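
# %% [markdown]
# The month-by-month check above could also be written as a loop. The sketch below is only an illustration on a toy frame: the helper name `impute_data_recharge` and the sample values are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def impute_data_recharge(df, months=(6, 7, 8)):
    """Fill data-recharge count and amount with 0 where no recharge date exists."""
    df = df.copy()
    for m in months:
        count_col = f"total_rech_data_{m}"
        amt_col = f"av_rech_amt_data_{m}"
        date_col = f"date_of_last_rech_data_{m}"
        # Impute only when the count and the date are missing on exactly the same rows.
        if df[count_col].isnull().equals(df[date_col].isnull()):
            df[[count_col, amt_col]] = df[[count_col, amt_col]].fillna(0)
    return df


# Toy frame (made-up values): the second customer never recharged data in month 6.
toy = pd.DataFrame({
    "total_rech_data_6": [2.0, np.nan],
    "av_rech_amt_data_6": [150.0, np.nan],
    "date_of_last_rech_data_6": ["6/21/2014", np.nan],
})
imputed = impute_data_recharge(toy, months=(6,))
print(imputed)
```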

# %% [markdown]
# * Now we can recheck how many columns still have 70% or more null values before dropping any.

# %%
# Count the columns that still have 70% or more null values

(((tcd.isnull().sum()/ len(tcd)) * 100) >= 70).sum()


# %%
# We can derive a new variable for the analysis.
# The total amount spent on data recharges is the number of data recharges multiplied by the average amount per data recharge
tcd['total_data_amt_6'] = tcd['total_rech_data_6'] * tcd['av_rech_amt_data_6']
tcd['total_data_amt_7'] = tcd['total_rech_data_7'] * tcd['av_rech_amt_data_7']
tcd['total_data_amt_8'] = tcd['total_rech_data_8'] * tcd['av_rech_amt_data_8']
tcd['total_data_amt_9'] = tcd['total_rech_data_9'] * tcd['av_rech_amt_data_9']


# %%
# Define high-value customers on the basis of the good phase, i.e. the 6th and 7th months.
Combine_amount_6_7 = tcd[['total_data_amt_6','total_data_amt_7','total_rech_amt_6',
                                            'total_rech_amt_7']].mean(axis = 1)

HV_70th_percentile = np.percentile(Combine_amount_6_7, 70)

print("70th percentile is - ", HV_70th_percentile)
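
# %% [markdown]
# A small worked example of the 70th-percentile filter, assuming a toy frame with made-up spend values (the column names follow the dataset, the numbers do not):

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): recharge and data spend for months 6 and 7.
toy = pd.DataFrame({
    "total_rech_amt_6": [100, 500, 50, 900, 300],
    "total_rech_amt_7": [120, 450, 60, 800, 310],
    "total_data_amt_6": [0, 200, 0, 400, 100],
    "total_data_amt_7": [0, 250, 0, 500, 90],
})

# Row-wise mean of the four good-phase spend columns.
avg_spend = toy.mean(axis=1)
# Threshold at the 70th percentile of that average.
cutoff = np.percentile(avg_spend, 70)
# Keep only customers at or above the threshold.
high_value = toy[avg_spend >= cutoff].reset_index(drop=True)
print(cutoff, len(high_value))
```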

# %% [markdown]
# ### High-value customers

# %%
# As per the business goal, high-value customers should be the target.
# Keep only the high-value customers
tcd = tcd[Combine_amount_6_7 >= HV_70th_percentile]


# %%
# check the data left with us for analysis
tcd.shape


# %%
tcd.head()


# %%
# Resetting the index
tcd = tcd.reset_index(drop=True)


# %%
tcd.head()


# %%
# The volume-based cost columns use a different naming format.
# It is better to name them by month number (6, 7, 8 and 9) like the other columns
tcd.rename(columns = {'jun_vbc_3g':'vbc_3g_6', 'jul_vbc_3g':'vbc_3g_7', 'aug_vbc_3g':'vbc_3g_8', 'sep_vbc_3g':'vbc_3g_9'}, inplace=True)

# %% [markdown]
# ### Now we will mark the churn customers
# ##### Churn : 1
# ##### Not Churn : 0
#
# * This is based on the 9th-month data: if incoming minutes, outgoing minutes and data usage are all zero, the customer is considered churned.
#

# %%
# Marking churn

tcd['churn'] = tcd.apply(lambda x: 1 if((x.total_ic_mou_9 == 0) and (x.total_og_mou_9 == 0) and
                       (x.vol_2g_mb_9 == 0) and (x.vol_3g_mb_9 == 0)) else 0, axis=1)
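
# %% [markdown]
# The same labelling rule can be expressed without a row-wise `apply`. This is a hedged sketch on a toy frame with made-up usage values; it is an alternative formulation, not the code used above.

```python
import pandas as pd

usage_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# Toy month-9 usage (made-up values) for three customers.
toy = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 0.0, 0.0],
    "vol_2g_mb_9": [0.0, 0.0, 3.2],
    "vol_3g_mb_9": [0.0, 0.0, 0.0],
})

# churn = 1 only when every month-9 usage column is exactly zero.
toy["churn"] = (toy[usage_cols] == 0).all(axis=1).astype(int)
print(toy["churn"].tolist())
```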


# %%
tcd['churn'].head()


# %%
# Creating dataframe for the feature used to decide churn
churn_df = tcd[['total_ic_mou_9', 'total_og_mou_9', 'vol_2g_mb_9', 'vol_3g_mb_9']]
churn_df.head()

# %% [markdown]
# ### Comparing the first five rows, we can see that our mapping is correct

# %%
# Now that the mapping is done, it is time to remove the 9th-month data.
tcd.drop([col for col in tcd.columns if '_9' in col], axis=1, inplace = True)


# %%
# Let's have a look at the shape again
tcd.shape


# %%
# Checking the percentage of missing values now
null_pct = 100 * tcd.isna().mean()
null_pct[null_pct > 0].round(2).sort_values(ascending=False).head(40)


# %%
# Columns with at least 40% missing data
missing_pct = (tcd.isnull().sum() / len(tcd)) * 100
cols_40_percent_missing_data = missing_pct[missing_pct >= 40].index
cols_40_percent_missing_data
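
# %% [markdown]
# To make the threshold logic concrete, here is a minimal sketch on a toy frame. The column names `a`, `b`, `c` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 5.0],  # 60% missing
    "b": [1.0, 2.0, np.nan, 4.0, 5.0],        # 20% missing
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],           # 0% missing
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = toy.isnull().mean() * 100
# Select the columns at or above the 40% threshold.
cols_over_40 = missing_pct[missing_pct >= 40].index.tolist()
print(missing_pct.to_dict(), cols_over_40)
```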


# %%
# Dates cannot be sensibly imputed and about 50% of these values are missing, so it is better to drop these columns
tcd.drop(['date_of_last_rech_data_6', 'date_of_last_rech_data_7', 'date_of_last_rech_data_8'], axis=1, inplace = True)


# %%
# For the remaining columns max_rech_data_6, max_rech_data_7 and max_rech_data_8, let's look at the statistical parameters
for i in ['max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8']:
   print(i)
   print("-------------------------------")
   print(tcd[i].describe())
   print('-------------------------------------')
   print("NULL values : ", tcd[i].isnull().sum())
   print('-------------------------------------')
   print('-------------------------------------')


# %%
# As the minimum value in all three columns is around 1, null values most likely belong to customers who did not recharge, so they can be treated as an amount of 0
# Impute the missing data with 0
for i in ['max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8']:
   tcd[i] = tcd[i].fillna(0)


# %%
# Checking the other columns in list
for i in ['_6', '_7','_8']:
   print ('For the month : ', i)
   for j in ['count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 'count_rech_3g_6', 'count_rech_3g_7',
       'count_rech_3g_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8',
       'night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8']:
       if i in j :
           print(tcd[[j]].isna().sum())
   print('--------------------------')

# %% [markdown]
# ### From the above output we can see that, within a month, all these columns have the same number of null values, so the values are probably missing for the same rows in that month.
# ### We can also see that the 8th month has the most missing values.
# ### We can also infer that a missing value in these columns means the customer may have stopped using the service, so it can be imputed with zero.

# %%
columns_to_impute = ['count_rech_2g_6', 'count_rech_2g_7',
      'count_rech_2g_8', 'count_rech_3g_6', 'count_rech_3g_7',
      'count_rech_3g_8', 'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8', 'arpu_2g_6',
      'arpu_2g_7', 'arpu_2g_8', 'night_pck_user_6', 'night_pck_user_7',
      'night_pck_user_8', 'fb_user_6', 'fb_user_7', 'fb_user_8']

tcd[columns_to_impute] = tcd[columns_to_impute].fillna(0)

# %% [markdown]
# ### Variance and Uniqueness

# %%
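# Now let's look at the columns where every value is the same, i.e. columns with zero variance. These columns are of no use for the analysis.
# numeric_only=True skips the date/object columns that are still present at this point
columns_0_variance = tcd.var(numeric_only=True) == 0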
column_name_0_variance = columns_0_variance[columns_0_variance == 1].index
print(column_name_0_variance)
print("Total columns with no variance : ", columns_0_variance.sum())

print("------------------------------------------------------------------------------")
columns_1_unique = tcd.nunique() == 1
column_name_1_unique = columns_1_unique[columns_1_unique == 1].index
print(column_name_1_unique)
print("Total columns with only ONE unique value : ", columns_1_unique.sum())


# %%
# Dropping the zero-variance (non-date) columns
tcd.drop(column_name_0_variance, axis=1, inplace = True)


# %%
Any_NA_columns = tcd.columns[tcd.isna().any()].tolist()
Any_NA_columns


# %%
# Checking for above columns
for i in ['_6', '_7','_8']:
   print('-------------------------')
   print ('For the month : ', i)
   col_list = []
   for j in Any_NA_columns:
       if i in j:
           col_list.append(j)
   print(tcd[col_list].info())

# %% [markdown]
# #### Here too, the number of missing values has increased by the 8th month. This suggests that the service was no longer being used, so we can impute 0

# %%
# Impute the missing values in the above columns with 0.
# We do not want to impute date objects with 0, so they are excluded for now.

for column in Any_NA_columns:
   if "date_of_last_rech" not in column:
       tcd[column] = tcd[column].fillna(0)


# %%
date_columns = ['date_of_last_rech_6', 'date_of_last_rech_7', 'date_of_last_rech_8',
               'last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8']

tcd[date_columns].info()


# %%
# The date columns are not used in the analysis, so drop them.
tcd.drop(date_columns, axis=1, inplace = True)


# %%
# Check missing values again
null_pct = 100 * tcd.isna().mean()
null_pct[null_pct > 0].round(2).sort_values(ascending=False).head(40)

# %% [markdown]
# ## No missing values left

# %%
pd.set_option('display.max_columns', None)
tcd.head()


# %%
for i in ['arpu_6', 'arpu_7',   'arpu_8']:
   print(tcd[i].describe())


# %%
tcd = tcd[(tcd.arpu_6 > 0) & (tcd.arpu_7 > 0) & (tcd.arpu_8 > 0)]
tcd.shape


# %%
# Let's drop individual columns whose totals are available as a different attribute

individual_cols = ['loc_ic_t2t_mou_6', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8',
                  'loc_ic_t2m_mou_6', 'loc_ic_t2m_mou_7', 'loc_ic_t2m_mou_8',
                  'loc_ic_t2f_mou_6', 'loc_ic_t2f_mou_7', 'loc_ic_t2f_mou_8',
                  'std_ic_t2t_mou_6', 'std_ic_t2t_mou_7', 'std_ic_t2t_mou_8',
                  'std_ic_t2m_mou_6', 'std_ic_t2m_mou_7', 'std_ic_t2m_mou_8',
                  'std_ic_t2f_mou_6', 'std_ic_t2f_mou_7', 'std_ic_t2f_mou_8',
                  'loc_og_t2t_mou_6', 'loc_og_t2t_mou_7', 'loc_og_t2t_mou_8',
                  'loc_og_t2m_mou_6', 'loc_og_t2m_mou_7', 'loc_og_t2m_mou_8',
                  'loc_og_t2f_mou_6', 'loc_og_t2f_mou_7', 'loc_og_t2f_mou_8',
                  'loc_og_t2c_mou_6', 'loc_og_t2c_mou_7', 'loc_og_t2c_mou_8',
                  'std_og_t2t_mou_6', 'std_og_t2t_mou_7', 'std_og_t2t_mou_8',
                  'std_og_t2m_mou_6', 'std_og_t2m_mou_7', 'std_og_t2m_mou_8',
                  'std_og_t2f_mou_6', 'std_og_t2f_mou_7', 'std_og_t2f_mou_8',
                  'last_day_rch_amt_6', 'last_day_rch_amt_7', 'last_day_rch_amt_8',
                  'arpu_3g_6', 'arpu_3g_7', 'arpu_3g_8',
                  'arpu_2g_6', 'arpu_2g_7', 'arpu_2g_8',
                  'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8']

tcd.drop(individual_cols, axis = 1, inplace = True)


# %%
tcd.shape

# %% [markdown]
# # 3. Exploratory Data Analysis

# %%
#  Statistical summary

tcd.describe()

# %% [markdown]
# ### Checking the uniqueness of the customer list

# %%
len(tcd.mobile_number.unique()) == len(tcd)


# %%
# The mobile_number column is of no further use, so we can drop it now
tcd.drop(['mobile_number'], axis=1, inplace = True)


# %%
tcd.shape

# %% [markdown]
# ## It is good to check the class balance in the data

# %%
sns.countplot(x=tcd.churn)
plt.title('Churned Customers')
plt.show()
print ('Churned customer : ',len(tcd[(tcd.churn ==1)]))
print ('Not Churned customer : ',len(tcd[(tcd.churn == 0)]))
print('Percentage Difference : ', round(abs((100*len(tcd[(tcd.churn ==1)])/len(tcd)) - 100*len(tcd[(tcd.churn ==0)])/len(tcd)),2),'%')
print (f'Proportion of Churned to Not Churned = {100* len(tcd[(tcd.churn ==1)])/len(tcd[(tcd.churn ==0)]):.2f}%' )

# %% [markdown]
# #### We can observe that the dataset is imbalanced. We need to handle this, and will do so at a later stage.
#
# %% [markdown]
# ## Univariate analysis of categorical data
#

# %%
# Categorical features list
cat_features = ['night_pck_user_6', 'night_pck_user_7', 'night_pck_user_8', 'fb_user_6', 'fb_user_7', 'fb_user_8']
subplot_loc = 1
plt.figure(figsize = (18,15))
for i in cat_features:
   Col_value = round(((tcd[i].value_counts(dropna = False))/(len(tcd[i])) * 100), 2)
   plt.subplot(2, 3, subplot_loc)
   ax = sns.barplot(x = Col_value.index, y = Col_value.values, order = Col_value.sort_index().index)
   plt.xlabel(i, labelpad = 15)
   plt.ylabel('Percentage Rate', labelpad = 10)
   plt.title('Percentage of '+i)
   subplot_loc += 1

# %% [markdown]
# ### From the univariate analysis we can interpret that by the 8th month the share of night_pck users as well as fb_user customers has reduced. This might be an indication of disloyalty to the service.
# %% [markdown]
# # Bivariate Analysis

# %%
# Bivariate Analysis
# Defining and plotting
def bivariate(y_axis):
   plt.figure(figsize = (18, 5))
   xlabel = "Churn Status"
   x_axis = "churn"
   plot_title_1 = "Month 6 - Churn Vs " + y_axis
   plot_title_2 = "Month 7 - Churn Vs " + y_axis
   plot_title_3 = "Month 8 - Churn Vs " + y_axis
   plt.subplot(1, 3, 1)
   sns.boxplot(x = x_axis, y = y_axis + "_6", hue = "churn", data = tcd, showfliers = False)
   plt.title(plot_title_1)
   plt.xlabel(xlabel, labelpad = 15)
   plt.ylabel(( y_axis + "_6"), labelpad = 10)
   plt.subplot(1, 3, 2)
   sns.boxplot(x = x_axis, y = y_axis + "_7", hue = "churn", data = tcd, showfliers = False)
   plt.title(plot_title_2)
   plt.xlabel(xlabel, labelpad = 15)
   plt.ylabel(( y_axis + "_7"), labelpad = 10)
   plt.subplot(1, 3, 3)
   sns.boxplot(x = x_axis, y = y_axis + "_8", hue = "churn", data = tcd, showfliers = False)
   plt.title(plot_title_3)
   plt.xlabel(xlabel, labelpad = 15)
   plt.ylabel(( y_axis + "_8"), labelpad = 10)
   plt.subplots_adjust(wspace = 0.4)
   plt.show()
for i in ["arpu" ,"onnet_mou","offnet_mou","total_og_mou","total_ic_mou","total_rech_num","total_rech_amt",
               "total_rech_data","vol_2g_mb","vol_3g_mb","vbc_3g","total_data_amt"]:
               bivariate(i)

# %% [markdown]
# #### We can clearly see the huge drop in usage in the 8th month.

# %%
tcd.shape


# %%
# Plotting Correlation Heatmap
plt.figure(figsize=[20,20])
sns.heatmap(tcd.corr(), mask= np.triu(tcd.corr()))
plt.show()
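# Highly correlated features list
# Only the upper triangle is inspected, so the diagonal (always 1) and mirrored pairs are skipped
corr_all = tcd.corr()
upper_all = corr_all.where(np.triu(np.ones(corr_all.shape, dtype=bool), k=1))
High_corr_list = [col for col in upper_all.columns if (upper_all[col] > 0.8).any()]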
print('Highly correlated Features , >80%')
print(High_corr_list)


# %%
# Creating Good Phase and Action Phase data
def good_action_phase(df, var):
   col_6 = var + "_6"
   col_7 = var + "_7"
   col_8 = var + "_8"
   good_phase_col = var + "_good_phase"
   action_phase_col = var + "_action_phase"
   df[good_phase_col] = (df[col_6] + df[col_7])/2
   df[action_phase_col] = df[col_8] - df[good_phase_col]
   df.drop([col_6, col_7, col_8], axis = 1, inplace = True)
   return df
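
# %% [markdown]
# A worked arithmetic example of the two derived features, using made-up ARPU values for two customers: the good-phase value is the mean of months 6 and 7, and the action-phase value is month 8 minus that mean.

```python
import pandas as pd

# Toy ARPU values (made up) for two customers.
toy = pd.DataFrame({
    "arpu_6": [200.0, 300.0],
    "arpu_7": [400.0, 300.0],
    "arpu_8": [100.0, 350.0],
})

# Good phase: average of months 6 and 7.
toy["arpu_good_phase"] = (toy["arpu_6"] + toy["arpu_7"]) / 2
# Action phase: month 8 relative to the good-phase average (negative means a drop).
toy["arpu_action_phase"] = toy["arpu_8"] - toy["arpu_good_phase"]
print(toy[["arpu_good_phase", "arpu_action_phase"]])
```

# A negative action-phase value, as for the first toy customer, marks a fall in usage or spend relative to the good phase.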


# %%
# Derive Good and Action Phase Variables

for i in ["arpu","onnet_mou","offnet_mou",'roam_ic_mou', "roam_og_mou", "loc_og_mou", "std_og_mou", "isd_og_mou", "spl_og_mou", "og_others",
                       "total_og_mou", "loc_ic_mou", "std_ic_mou", "spl_ic_mou", "isd_ic_mou", "ic_others", "total_ic_mou", "total_rech_num",
                       "total_rech_amt", "max_rech_amt", "total_rech_data", "max_rech_data", "count_rech_2g", "count_rech_3g", "vol_2g_mb",
                       "vol_3g_mb", "monthly_2g", "sachet_2g", "monthly_3g", "sachet_3g", "vbc_3g", "total_data_amt"]:
                       tcd= good_action_phase(tcd, i)
tcd.head()

# %% [markdown]
# ## Checking outliers

# %%
Good_nume_var = [i for i in tcd.columns if 'good' in i]
Act_NUM_var = [i for i in tcd.columns if 'action' in i ]


# %%
# Identifying Outliers
for i , j in zip (Good_nume_var, Act_NUM_var) :
   plt.figure(figsize=[18,5])
   plt.subplot(1,2,1)
   sns.boxplot(x=tcd[i])
   plt.title(i)
   plt.subplot(1,2,2)
   sns.boxplot(x=tcd[j])
   plt.title(j)
   plt.show()


# %% [markdown]
# ## Huge outliers are clearly visible
#
# #### We will handle them later. Let's first split the data into train and test sets
# %% [markdown]
# # 4. Data Preparation
# ## Splitting the data

# %%
X_df = tcd.drop('churn', axis=1)
y_df = tcd['churn']
X_train, X_test, y_train, y_test = train_test_split(X_df,y_df, train_size=0.7, random_state=100)

# %% [markdown]
# ## Handling Outliers and Rescaling

# %%
# Applying RobustScaler to reduce the influence of outliers
# Fit on the training data only, then apply the same transformation to the test data
rob = RobustScaler()
X_train = pd.DataFrame(data = rob.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(data= rob.transform(X_test), columns=X_test.columns , index=X_test.index )


# %%
# Rescaling to the [0, 1] range
scalar = MinMaxScaler()
X_train[:] = scalar.fit_transform(X_train[:])
X_test[:] = scalar.transform(X_test[:])
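
# %% [markdown]
# The two scaling steps can also be chained so that both scalers are fitted on the training split only and then reused on the test split. The sketch below uses synthetic data and a scikit-learn `Pipeline`; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Synthetic stand-ins for the train and test feature matrices.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(size=(100, 3))
X_test_toy = rng.normal(size=(40, 3))

scaling = Pipeline([
    ("robust", RobustScaler()),   # centre on the median, scale by the IQR
    ("minmax", MinMaxScaler()),   # then map the training range to [0, 1]
])

# Statistics are learned from the training data only.
X_train_scaled = scaling.fit_transform(X_train_toy)
# The test data is transformed with the training statistics.
X_test_scaled = scaling.transform(X_test_toy)
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

# Test values may fall slightly outside [0, 1], which is expected because the range was learned from the training split.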


# %%
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# %%
X_train.head()


# %%
# Balancing the data
sampling = SMOTE(random_state = 100)
X_train_smo, y_train_smo = sampling.fit_resample(X_train, y_train)


# %%
# Size of data
print(X_train_smo.shape)
print(y_train_smo.shape)


# %%
# Churn ratio after resampling
sns.countplot(x=y_train_smo)

# %% [markdown]
# ### Data is now balanced

# %%
plt.figure(figsize=[20,10])
sns.heatmap(X_train_smo.corr())


# %%
# Removing Highly correlated factors
# Create correlation matrix
corr_matrix = X_train_smo.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find features with correlation greater than 0.7
to_drop = [column for column in upper.columns if any(upper[column] > 0.7)]
X_train_smo.drop(to_drop, axis=1, inplace=True)
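
# %% [markdown]
# A minimal sketch of the upper-triangle filter on synthetic data, where `x_copy` is constructed to be nearly collinear with `x`. The column names and values are assumptions for illustration. Like the code above, it only looks at positive correlations; comparing `upper.abs()` would also catch strong negative ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "x_copy": x * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with x
    "noise": rng.normal(size=200),                        # independent of x
})

corr = toy.corr()
# Keep only the entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates above 0.7 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```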

# %% [markdown]
# # 5. Model Building and Evaluation
# %% [markdown]
# ## Logistic Regression

# %%
logm1 = sm.GLM(y_train_smo,(sm.add_constant(X_train_smo)), family = sm.families.Binomial())
logm1.fit().summary()

# %% [markdown]
# ### Applying RFE

# %%
logreg = LogisticRegression()
# running RFE with 10 variables as output
rfe = RFE(logreg, n_features_to_select=10)
rfe = rfe.fit(X_train_smo, y_train_smo)


# %%
rfe_columns=X_train_smo.columns[rfe.support_]


# %%
X_train_SM = sm.add_constant(X_train_smo[rfe_columns])
logm = sm.GLM(y_train_smo,X_train_SM, family = sm.families.Binomial())
res = logm.fit()
res.summary()


# %%
# Generating the prediction table
y_train_pred = res.predict(X_train_SM)
y_train_pred = y_train_pred.values.reshape(-1)


# %%
y_train_firnal_pred = pd.DataFrame({'Churn':y_train_smo.values, 'Churn_Prob':y_train_pred})
y_train_firnal_pred.head()


# %%
# Deriving predictions at different cutoff points
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
   y_train_firnal_pred[i]= y_train_firnal_pred.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_firnal_pred.head()


# %%
# Accuracy, sensitivity and specificity for various cutoffs
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
   cm1 = metrics.confusion_matrix(y_train_firnal_pred.Churn, y_train_firnal_pred[i] )
   total1=sum(sum(cm1))
   accuracy = (cm1[0,0]+cm1[1,1])/total1
   
   speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
   sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
   cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)
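
# %% [markdown]
# To show what each row of the cutoff table contains, here is a small self-contained sketch with made-up labels and probabilities. The helper name `sens_spec` is an assumption for illustration.

```python
import numpy as np


def sens_spec(y_true, y_prob, cutoff):
    """Return (accuracy, sensitivity, specificity) at a probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > cutoff).astype(int)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)   # share of churners that are caught
    specificity = tn / (tn + fp)   # share of non-churners left alone
    return accuracy, sensitivity, specificity


# Made-up labels and predicted churn probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

for cutoff in (0.25, 0.5, 0.75):
    print(cutoff, sens_spec(y_true, y_prob, cutoff))
```

# Lowering the cutoff raises sensitivity at the cost of specificity, which is why the crossing point of the curves is a common choice.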


# %%
# Plot accuracy, sensitivity and specificity for various cutoffs
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'], figsize=[18,7])
plt.xticks(np.arange(0,1,0.01), rotation=90)
plt.yticks(np.arange(0,1,0.05))
plt.grid()
plt.show()

# %% [markdown]
# ## We can see that 0.54 is the optimum cutoff probability.
#

# %%
# predicting
y_train_firnal_pred['Churn_pred'] = y_train_firnal_pred.Churn_Prob.map(lambda x: 1 if x > 0.54 else 0)
y_train_firnal_pred.head()

# %% [markdown]
# ### <b>Evaluating the model</b>
# ### Evaluating on train data
#

# %%
confusion = metrics.confusion_matrix(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_pred )
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print(confusion)
print('sensitivity :', (TP / float(TP+FN)))
print('Specificity : ', TN/(TN+FP))
print('Positive Predictive Value :', (TP / float(TP+FP)))
print('Negative Predictive Value : ', (TN / float(TN+ FN)))


# %%
def draw_roc( actual, probs ):
   fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                             drop_intermediate = False )
   auc_score = metrics.roc_auc_score( actual, probs )
   plt.figure(figsize=(5, 5))
   plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
   plt.plot([0, 1], [0, 1], 'k--')
   plt.xlim([0.0, 1.0])
   plt.ylim([0.0, 1.05])
   plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
   plt.ylabel('True Positive Rate')
   plt.title('Receiver operating characteristic example')
   plt.legend(loc="lower right")
   plt.show()


# %%
# Plotting the curve for the obtained metrics
fpr, tpr, thresholds = metrics.roc_curve(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob, drop_intermediate = False )
draw_roc(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob )


# %%
# Plotting the precision and recall curves
p, r, thresholds = precision_recall_curve(y_train_firnal_pred.Churn, y_train_firnal_pred.Churn_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

# %% [markdown]
# ### Evaluating on test data
#

# %%
# As we have already rescaled X_test, we will pull out the features used in the model
X_test_rfe=X_test[rfe_columns]
X_test_rfe.head()


# %%
X_test_SM = sm.add_constant(X_test_rfe)


# %%
y_test_pred = res.predict(X_test_SM)


# %%
y_test_firnal_pred = pd.DataFrame({'Churn':y_test.values, 'Churn_Prob':y_test_pred})
y_test_firnal_pred.head()


# %%
y_test_firnal_pred['Churn_pred'] = y_test_firnal_pred.Churn_Prob.map(lambda x: 1 if x > 0.54 else 0)
y_test_firnal_pred.head()


# %%
confusion = metrics.confusion_matrix(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_pred )
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print(confusion)
print('sensitivity :', (TP / float(TP+FN)))
print('Specificity : ', TN/(TN+FP))
print('Positive Predictive Value :', (TP / float(TP+FP)))
print('Negative Predictive Value : ', (TN / float(TN+ FN)))


# %%
# Plotting the curve for the test
fpr, tpr, thresholds = metrics.roc_curve(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob, drop_intermediate = False )
draw_roc(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob )


# %%
# Plotting the precision and recall curves
p, r, thresholds = precision_recall_curve(y_test_firnal_pred.Churn, y_test_firnal_pred.Churn_Prob)
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()

# %% [markdown]
# # <b>Logistic Regression By using PCA</b>

# %%
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3, train_size=0.7, random_state=1)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
# scaling dataset
scaler = MinMaxScaler()
X_train[:] = scaler.fit_transform(X_train[:])
X_test[:] = scaler.transform(X_test[:])
# Applying SMOTE - imbalance correction
# Named `smote` so that it does not shadow the statsmodels alias `sm`
smote = SMOTE(random_state=42)
X_train_sm,y_train_sm = smote.fit_resample(X_train,y_train)
print("X_train_sm Shape:", X_train_sm.shape)
print("y_train_sm Shape:", y_train_sm.shape)


# %%
# Applying PCA
pca = PCA(random_state=85)
pca.fit(X_train_sm)


# %%
# Fitting and transforming
X_train_sm_pca=pca.fit_transform(X_train_sm)
X_test_pca=pca.transform(X_test)


# %%
pca.components_


# %%
# Applying logistic regression
lr_pca = LogisticRegression()
lr_pca.fit(X_train_sm_pca, y_train_sm)
# making the predictions
y_pred_pca = lr_pca.predict(X_test_pca)
# converting the prediction into a dataframe
y_pred_df = pd.DataFrame(y_pred_pca)

# %% [markdown]
# ### <b>Evaluating the model</b>

# %%
confusion_matrix(y_test,y_pred_pca)


# %%
accuracy_score(y_test,y_pred_pca)


# %%
plt.bar(range(1,len(pca.explained_variance_ratio_)+1),pca.explained_variance_ratio_)
plt.show()


# %%
# Getting the components
# plotting explained variance ratio
cum_var = np.cumsum(pca.explained_variance_ratio_)
fig = plt.figure(figsize=[12,7])
plt.plot(cum_var)
plt.xlabel('Principal components count')
plt.ylabel('Cumulative Explained Variance')
plt.xticks(np.arange(0,75,2))
plt.yticks(np.arange(0.5,1,0.1))
plt.grid()
plt.show()
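
# %% [markdown]
# Instead of reading the number of components off the plot, it can be computed from the cumulative explained variance. The ratios below are made-up values for illustration, not the ratios of the fitted `pca` object.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order as PCA returns them.
explained_variance_ratio = np.array([0.50, 0.20, 0.12, 0.06, 0.05, 0.04, 0.03])

cum_var = np.cumsum(explained_variance_ratio)
# argmax returns the first index where the condition is True; add 1 to get a count.
n_components_90 = int(np.argmax(cum_var >= 0.90) + 1)
print(n_components_90)
```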

# %% [markdown]
# ### More than 90% of the variance in the data is explained by 5 principal components

# %%
# Apply 5 PCA components
pca_final = IncrementalPCA(n_components=5)
train_pca_final = pca_final.fit_transform(X_train_sm)
test_pca_final = pca_final.transform(X_test)


# %%
# Applying logistic Regression
logreg_pca_final = LogisticRegression()
logreg_pca_final.fit(train_pca_final, y_train_sm)
# predictions
y_pred_final = logreg_pca_final.predict(test_pca_final)
# prediction into a dataframe
y_pred_final = pd.DataFrame(y_pred_final)

# %% [markdown]
# ### <b>Evaluating the model</b>

# %%
print( confusion_matrix(y_test,y_pred_final) )
print('Accuracy :', accuracy_score(y_test,y_pred_final) )

# %% [markdown]
# # <b>Decision Tree</b>

# %%
# splitting the data
X = tcd.drop('churn', axis=1)
y = tcd['churn']
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=100)


# %%
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)


# %%
plt.figure(figsize=[18,10])
plot_tree(dt, filled=True, rounded=True, feature_names=X.columns, class_names=['Not Churned', 'Churned'])
plt.show()

# %% [markdown]
# ### <b>Evaluating the model</b>

# %%
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)


# %%
print('Train Accuracy :', accuracy_score(y_train, y_train_pred))
print('Test Accuracy : ', accuracy_score(y_test, y_test_pred))


# %%
print('Train Confusion Matrix')
print(confusion_matrix(y_train, y_train_pred))


# %%
print('Test Confusion Matrix')

print(confusion_matrix(y_test, y_test_pred))

# %% [markdown]
# # <b>Random Forest</b>
#

# %%
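# Note: a RandomForestRegressor is fitted on the binary churn target, so its predictions are
# continuous scores and the model is evaluated with R² below rather than with accuracy
rf = RandomForestRegressor(random_state=42, n_jobs=-1, max_depth=5, min_samples_leaf=10)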


# %%
rf.fit(X_train, y_train)

# %% [markdown]
# ### <b>Evaluating the model</b>

# %%
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)


# %%
r2_score(y_train, y_train_pred)


# %%
r2_score(y_test, y_test_pred)


# %%
rf.feature_importances_


# %%
feature_df = pd.DataFrame({
   "Var_name": X_train.columns,
   "Imp": rf.feature_importances_})


# %%
feature_df.sort_values(by="Imp", ascending=False).head(10)
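
# %% [markdown]
# Because churn is a binary label, a classifier can be scored with the same confusion-matrix metrics as the other models. The sketch below is an alternative, not the method used above: it fits a `RandomForestClassifier` on synthetic data from `make_classification`, and `class_weight="balanced"` is an assumed choice for handling the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the churn dataset.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=100
)

clf = RandomForestClassifier(
    random_state=42, max_depth=5, min_samples_leaf=10, class_weight="balanced"
)
clf.fit(X_tr, y_tr)
y_te_pred = clf.predict(X_te)

cm = confusion_matrix(y_te, y_te_pred)
sensitivity = recall_score(y_te, y_te_pred)               # recall of the churn class
specificity = recall_score(y_te, y_te_pred, pos_label=0)  # recall of the non-churn class
print(cm)
print("sensitivity:", sensitivity, "specificity:", specificity)
```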

# %% [markdown]
# # 6. Summary
# %% [markdown]
# ## Models Applied
# * Logistic Regression (with RFE)
# * Logistic Regression (With PCA)
# * Decision Tree Model
# * Random Forest Model
# %% [markdown]
# ## Logistic Regression Evaluation Sheet
# * On Train Data
#
# >* Sensitivity : 0.797
# >* Specificity : 0.786
# >* ROC curve area : 0.87
#
# * On Test Data
#
# >* Sensitivity : 0.740
# >* Specificity : 0.794
# >* ROC curve area : 0.84
# %% [markdown]
# ## Logistic Regression with PCA Evaluation Sheet
# * Accuracy score with all PCA components : 0.799
# * 5 PCA components explain 90% of the variance
# * Accuracy score with IncrementalPCA and 5 components : 0.634
# %% [markdown]
# ## Decision Tree
# * Train Accuracy : 0.943
# * Test Accuracy :  0.941
#
# ## Random Forest
# * R<sup>2</sup> Train : 0.26
# * R<sup>2</sup> Test : 0.2
#
# %% [markdown]
# # 7. Conclusion and Suggestion
# %% [markdown]
# * Among all the models, the decision tree reaches about 94% accuracy, while the sensitivity and specificity of logistic regression are roughly 74% to 80%, so these models can be a good choice.
# * There is no harm in contacting customers who are falsely flagged as churners: it helps secure their loyalty and keeps them on the network.
# %% [markdown]
# ### From the analysis, we found that the following features can be good indicators for flagging churning customers
# * aon
# * loc_ic_mou_action_phase
# * arpu_action_phase
# * std_ic_mou_action_phase
# * max_rech_amt_action_phase
# * loc_ic_mou_good_phase
# * total_ic_mou_good_phase
#
# > Here the good phase covers the 6th and 7th months, while the action phase is the 8th month.