X Education, an online education company, faces the problem of poor lead conversion despite generating many leads every day. They aim to improve efficiency by identifying "Hot Leads," those most likely to convert into paying customers, thereby increasing their lead conversion rate to around 80%. Leads are acquired through various channels, and once obtained, the sales team engages in communication efforts to convert them. By assigning lead scores to prioritize potential leads, the company hopes to focus its resources effectively and improve conversion rates.

Business Objectives

Build a logistic regression model to assign a lead score between 0 and 100 to each of the leads, which can be used by the company to target potential leads. A higher score would mean that the lead is hot, i.e., most likely to convert, whereas a lower score would mean that the lead is cold and will mostly not get converted.
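As a minimal illustration of the scoring scheme described above (the probability value here is hypothetical; the actual model is built later in this analysis), a predicted conversion probability maps directly to a 0–100 lead score:

```python
# Hypothetical example: map a model's predicted conversion probability
# (a value between 0 and 1) to a lead score between 0 and 100.
def to_lead_score(prob: float) -> int:
    return round(prob * 100)

print(to_lead_score(0.83))  # a lead with an 83% predicted conversion probability
```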

Dataset

Find the link for the dataset here: Leads_Data

Find the data dictionary here: Leads_Data_Dictionary

Inspecting the Data

`import numpy as np`

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm

from sklearn.linear_model import LogisticRegression

from sklearn.feature_selection import RFE

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn import metrics

# Update settings to display all columns of the dataset

pd.options.display.max_columns = None

Read the Dataset:

`leads_df = pd.read_csv('Leads.csv')`

leads_df.head()

#Statistical description of the numerical data

leads_df.describe()

#Info regarding null values and datatypes

leads_df.info()

Data Preparation

`#Checking for duplicate values`

print(leads_df['Prospect ID'].duplicated().sum())

print(leads_df['Lead Number'].duplicated().sum())

`Prospect ID` and `Lead Number` hold unique identifiers that are not necessary for our analysis. We can choose to drop these columns.

`#Dropping "Prospect ID" and "Lead Number" columns`

cols_to_drop = ['Prospect ID', 'Lead Number']

leads_df = leads_df.drop(cols_to_drop, axis=1)

#check the % of null values in each column

round(100 * leads_df.isnull().sum()/len(leads_df), 2)

Some columns have the value "Select". These are likely blank fields where the respondent did not fill in the details. We can treat these values as NULL and replace "Select" with NaN.

`#Treatment of the "Select" value in the entire dataset`

leads_df = leads_df.replace('Select', np.nan)

#recalculating null values in the columns

round(100 * leads_df.isnull().sum()/len(leads_df), 2)

We observe a significant rise in the null values of these columns: "Specialization", "How did you hear about X Education", "Lead Profile", and "City".

`#dropping columns with more than 40% null values`

cols_to_drop = ['How did you hear about X Education', 'Lead Quality', 'Lead Profile', 'Asymmetrique Activity Index',
                'Asymmetrique Profile Index', 'Asymmetrique Activity Score', 'Asymmetrique Profile Score']

leads_df = leads_df.drop(cols_to_drop, axis=1)

`#Checking if there are variables with only one unique value, because such variables will not provide any useful insight in our analysis`

leads_df.nunique()

The following columns have only one unique value and will therefore have no influence on the analysis, so we can drop them: "Magazine", "Receive More Updates About Our Courses", "Update me on Supply Chain Content", "Get updates on DM Content", and "I agree to pay the amount through cheque".

`#dropping the above-mentioned columns`

cols_to_drop = ['Magazine', 'Receive More Updates About Our Courses', 'Update me on Supply Chain Content',
                'Get updates on DM Content', 'I agree to pay the amount through cheque']

leads_df = leads_df.drop(cols_to_drop, axis=1)

**Univariate Analysis:**

`#plotting the spread of the Country column`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df.Country, hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#dropping the "Country" column`

leads_df.drop('Country', axis=1, inplace=True)

`#counting city values`

leads_df['City'].value_counts(dropna=False)

`#replacing "NaN" with "Not Specified"`

leads_df['City'].fillna('Not Specified', inplace=True)

leads_df['City'].value_counts(dropna=False)

`#Plotting the spread of the feature after replacing NaN values`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df.City, hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking value counts in the "Specialization" field`

leads_df['Specialization'].value_counts(dropna=False)

`#Replacing NaN values with "Not Specified"`

leads_df['Specialization'] = leads_df['Specialization'].replace(np.nan, 'Not Specified')

`#Plotting the spread of the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df.Specialization, hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

**Insights:**

- Specializations with "Management" in their name have a higher number of leads, and their conversion rate is also good. This is definitely a significant variable and should not be dropped.
- Let us also merge all Management specializations into one value.

`#Combining Management specializations because they show similar trends`

leads_df['Specialization'] = leads_df['Specialization'].replace(['Finance Management', 'Human Resource Management',
                                                                 'Marketing Management', 'Operations Management',
                                                                 'IT Projects Management', 'Supply Chain Management',
                                                                 'Healthcare Management', 'Hospitality Management',
                                                                 'Retail Management'],
                                                                'Management Specializations')

`#Replotting the spread of the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df.Specialization, hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking value counts for the 'What is your current occupation' field`

leads_df['What is your current occupation'].value_counts(dropna=False)

`#Replacing NaN values with "Not Specified"`

leads_df['What is your current occupation'] = leads_df['What is your current occupation'].replace(np.nan, 'Not Specified')

`#Plotting the spread of the feature`

s1 = sns.countplot(x=leads_df['What is your current occupation'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

**Insights:**

- Though their lead count is low, Working Professionals have the highest conversion rate.
- The lead count for the Unemployed segment is the highest, and its conversion rate is also good.
- While the lead count in the Not Specified category is quite high, its conversion rate is comparatively low against the previous two.

`#Checking value counts for the 'What matters most to you in choosing a course' field`

leads_df['What matters most to you in choosing a course'].value_counts(dropna=False)

`#Replacing NaN values with "Not Specified"`

leads_df['What matters most to you in choosing a course'] = leads_df['What matters most to you in choosing a course'].replace(np.nan, 'Not Specified')

`#Plotting the spread of the feature`

s1 = sns.countplot(x=leads_df['What matters most to you in choosing a course'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking value counts in 'Tags'`

leads_df['Tags'].value_counts(dropna=False)

`#Replacing NaN values with "Not Specified" and checking value counts again`

leads_df['Tags'].fillna('Not Specified', inplace=True)

leads_df['Tags'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df['Tags'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Combining values with low frequency under "Other Tags"`

leads_df['Tags'] = leads_df['Tags'].replace(['In confusion whether part time or DLP', 'in touch with EINS',
                                             'Diploma holder (Not Eligible)', 'Approached upfront', 'Graduation in progress',
                                             'number not provided', 'opp hangup', 'Still Thinking', 'Lost to Others',
                                             'Shall take in the next coming month', 'Lateral student',
                                             'Interested in Next batch', 'Recognition issue (DEC approval)',
                                             'Want to take admission but has financial problems',
                                             'University not recognized', 'switched off', 'Already a student',
                                             'Not doing further education', 'invalid number', 'wrong number given',
                                             'Interested in full time MBA'],
                                            'Other Tags')

`#Replotting the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df['Tags'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking the percentage of missing values again for the entire dataset`

round(100*(leads_df.isnull().sum()/len(leads_df)), 2)

`#Checking value counts of 'Lead Source'`

leads_df['Lead Source'].value_counts(dropna=False)

The Lead Source column seems to be skewed towards certain values, and there are many low-frequency values as well. Let us fix them by changing them to more meaningful values.

`#Replacing certain values in 'Lead Source'`

leads_df['Lead Source'] = leads_df['Lead Source'].replace(np.nan, 'Others')

leads_df['Lead Source'] = leads_df['Lead Source'].replace('google', 'Google')

leads_df['Lead Source'] = leads_df['Lead Source'].replace('Facebook', 'Social Media')

leads_df['Lead Source'] = leads_df['Lead Source'].replace(['bing', 'Click2call', 'Press_Release', 'youtubechannel',
                                                           'welearnblog_Home', 'WeLearn', 'blog', 'Pay per Click Ads',
                                                           'testone', 'NC_EDM', 'Live Chat'],
                                                          'Other Sources')

leads_df['Lead Source'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df['Lead Source'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

**Insights:**

- Most leads are generated from Google and Direct Traffic, and the conversion rate is also good
- The conversion rate is very high for leads generated from the Welingak Website and Reference
- The company should work on increasing its leads from the Welingak Website and Reference
- While the lead count from Organic Search is comparatively low, its conversion rate is very good
- The company should work on introducing better strategies for lead conversion from Olark Chat, Organic Search, Direct Traffic, and Google

`#Checking value counts in the 'Last Activity' field`

leads_df['Last Activity'].value_counts(dropna=False)

`#Replacing the NULL values with "Others" and combining variables with low frequency`

leads_df['Last Activity'] = leads_df['Last Activity'].replace(np.nan, 'Others')

leads_df['Last Activity'] = leads_df['Last Activity'].replace(['Unreachable', 'Unsubscribed', 'Had a Phone Conversation',
                                                               'Approached upfront', 'View in browser link Clicked',
                                                               'Email Marked Spam', 'Email Received',
                                                               'Resubscribed to emails', 'Visited Booth in Tradeshow'],
                                                              'Others')

leads_df['Last Activity'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(15,5))

s1 = sns.countplot(x=leads_df['Last Activity'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking null values in all columns`

round(100*(leads_df.isnull().sum()/len(leads_df)), 2)

**Since the missing values are less than 2%, we can choose to drop the NaN rows.**

`#dropping NaN values and checking null values again`

leads_df = leads_df.dropna()

round(100*(leads_df.isnull().sum()/len(leads_df)), 2)

`#Checking value counts in the 'Lead Origin' field`

leads_df['Lead Origin'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(8,5))

s1 = sns.countplot(x=leads_df['Lead Origin'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

**Insights:**

- `Landing Page Submission` and `API` have the highest number of leads as well as conversions
- `Lead Add Form` has a very high conversion rate, but the count of leads generated is not very high
- `Lead Import` has negligible leads
- In order to improve the overall lead conversion rate, we have to improve the lead conversion of the `API` and `Landing Page Submission` origins and generate more leads from `Lead Add Form`

`#checking value counts of 'Last Notable Activity'`

leads_df['Last Notable Activity'].value_counts()

`#Plotting the spread of the feature`

plt.figure(figsize=(15,5))

ax1 = sns.countplot(x="Last Notable Activity", hue="Converted", data=leads_df)

ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)

plt.show()

`#Clubbing lower-frequency values and checking the value counts`

leads_df['Last Notable Activity'] = leads_df['Last Notable Activity'].replace(['Had a Phone Conversation', 'Email Marked Spam',
                                                                               'Unreachable', 'Unsubscribed',
                                                                               'Email Bounced', 'Resubscribed to emails',
                                                                               'View in browser link Clicked', 'Approached upfront',
                                                                               'Form Submitted on Website', 'Email Received'],
                                                                              'Other Activity')

leads_df['Last Notable Activity'].value_counts()

`#Replotting the spread of the feature`

plt.figure(figsize=(15,5))

ax1 = sns.countplot(x="Last Notable Activity", hue="Converted", data=leads_df)

ax1.set_xticklabels(ax1.get_xticklabels(), rotation=90)

plt.show()

There are 9 features whose values are in binary form (Yes/No). Let's evaluate the value split to assess the usability of these features in our analysis.

`#Let us check the columns having variable imbalance`

cols_chk_imbal = ['A free copy of Mastering The Interview', 'Do Not Email', 'Do Not Call', 'Search',
                  'Newspaper Article', 'X Education Forums', 'Newspaper', 'Digital Advertisement',
                  'Through Recommendations']

for c in cols_chk_imbal:

    print(leads_df[c].value_counts(dropna=False))

    print('='*30)

**Except for the variables "A free copy of Mastering The Interview" and "Do Not Email", all others have nearly one-sided answers (mostly No), so we don't think they would give us much insight.**

`#Since the values in these columns are imbalanced, we can choose to drop them`

cols_chk_imbal = ['Do Not Call', 'Search', 'Newspaper Article', 'X Education Forums', 'Newspaper',
                  'Digital Advertisement', 'Through Recommendations']

#dropping columns with imbalance

leads_df.drop(cols_chk_imbal, axis=1, inplace=True)

`#Checking the value counts of 'A free copy of Mastering The Interview'`

leads_df['A free copy of Mastering The Interview'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(8,5))

s1 = sns.countplot(x=leads_df['A free copy of Mastering The Interview'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking value counts of 'Do Not Email'`

leads_df['Do Not Email'].value_counts(dropna=False)

`#Plotting the spread of the feature`

plt.figure(figsize=(8,5))

s1 = sns.countplot(x=leads_df['Do Not Email'], hue=leads_df.Converted)

s1.set_xticklabels(s1.get_xticklabels(), rotation=90)

plt.show()

`#Checking the dataset`

leads_df.info()

## We can see that the dataset now looks cleaner than before, with 0 null values.

Univariate Analysis for Numerical Variables

`# Check the correlation`

sns.heatmap(leads_df.corr(numeric_only=True), annot=True, cmap='RdYlGn')

plt.show()

`# Check the percentage of records that have Converted = 1`

conv_perc = (sum(leads_df['Converted'])/len(leads_df))*100

conv_perc

**Ans:** 38.02043282434362

## The dataset shows that the current lead conversion rate is 38%.

Handling Outliers

`#creating a function for outlier treatment to remove the top and bottom 1% of records`

def outlier_treatment(df, col):

    Q3 = df[col].quantile(0.99)

    df = df[(df[col] <= Q3)]

    Q1 = df[col].quantile(0.01)

    df = df[(df[col] >= Q1)]

    plt.figure(figsize=(6,4))

    sns.boxplot(y=df[col])

    plt.show()

    return df

`#Plotting a box plot to check outliers for 'TotalVisits'`

plt.figure(figsize=(6,4))

sns.boxplot(y=leads_df['TotalVisits'])

plt.show()

`#Treating outliers`

leads_df['TotalVisits'].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.90, 0.95, 0.99])

## We see a lot of variation in the data between the 95th, 99th, and 100th percentiles, so we will remove the top 1% of outlier values.

`#Creating the box plot again to check outliers for 'TotalVisits'`

leads_df = outlier_treatment(leads_df, 'TotalVisits')

`#Plotting a box plot to check outliers for 'Total Time Spent on Website'`

plt.figure(figsize=(6,4))

sns.boxplot(y=leads_df['Total Time Spent on Website'])

plt.show()

`#Treating outliers`

leads_df['Total Time Spent on Website'].describe(percentiles=[0.05, .25, .5, .75, .90, .95, .99])

## Though more than 75% of the values are below 1000, there are no outliers in the Total Time Spent on Website column.

`#Plotting a box plot to check outliers for 'Page Views Per Visit'`

plt.figure(figsize=(6,4))

sns.boxplot(y=leads_df['Page Views Per Visit'])

plt.show()

`leads_df['Page Views Per Visit'].describe(percentiles=[0.05, .25, .5, .75, .90, .95, .99])`

## We see a lot of variation in the data between the 99th and 100th percentiles, so we will remove the top 1% of outlier values.

`#Creating the box plot again to check`

leads_df = outlier_treatment(leads_df, 'Page Views Per Visit')

Dummy Variable Creation

`#Creating a list of categorical variables`

cat_cols = leads_df.select_dtypes(include=['object']).columns

## We will convert the binary columns to 0/1 values using the map function.

`#Creating a function`

def binary_map(x):

    return x.map({'Yes': 1, 'No': 0})

`binary_cols = ['A free copy of Mastering The Interview', 'Do Not Email']`

leads_df[binary_cols] = leads_df[binary_cols].apply(binary_map)

leads_df[binary_cols].head()

`#Creating a function to generate dummy variables`

def create_dummy_cols(df, col, prefix, drop_col):

    if drop_col == '':

        dummy_df = pd.get_dummies(df[col], dtype='int', prefix=prefix, drop_first=True)

    else:

        dummy_df = pd.get_dummies(df[col], dtype='int', prefix=prefix)

        dummy_df = dummy_df.drop(drop_col, axis=1)

    df = pd.concat([df, dummy_df], axis=1)

    return df

`#Using the function to create dummy variables`

leads_df = create_dummy_cols(leads_df, 'Lead Origin', 'Lead_Origin', '')

leads_df = create_dummy_cols(leads_df, 'Lead Source', 'Lead_Source', 'Lead_Source_Others')

leads_df = create_dummy_cols(leads_df, 'Last Activity', 'Last_Activity', 'Last_Activity_Others')

leads_df = create_dummy_cols(leads_df, 'Specialization', 'Specialization', 'Specialization_Not Specified')

leads_df = create_dummy_cols(leads_df, 'What is your current occupation', 'Current_Occupation', '')

leads_df = create_dummy_cols(leads_df, 'What matters most to you in choosing a course', 'Choosing_Course', 'Choosing_Course_Other')

leads_df = create_dummy_cols(leads_df, 'Tags', 'Tags', 'Tags_Not Specified')

leads_df = create_dummy_cols(leads_df, 'City', 'City', 'City_Not Specified')

leads_df = create_dummy_cols(leads_df, 'Last Notable Activity', 'Last_Notable_Activity', 'Last_Notable_Activity_Other Activity')

## Remove the binary columns from the categorical columns list and then drop the original categorical columns from the dataframe.

`#Removing columns`

cat_cols = list(cat_cols)

cat_cols.remove('A free copy of Mastering The Interview')

cat_cols.remove('Do Not Email')

leads_df = leads_df.drop(cat_cols, axis=1)

Train-Test Split

`#Putting the 'Converted' variable into y`

y = leads_df['Converted']

#Putting the feature variables into X

X = leads_df.drop('Converted', axis=1)

`#Splitting the data into train and test datasets, taking random_state = 42`

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)

`#Verifying that the dataset has been split correctly`

print(X_train.shape)

print(X_test.shape)

Feature Scaling

`#Scaling of non-binary columns in the train dataset`

scaler = StandardScaler()

cols_to_scale = ['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']

X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])

`#Checking the "Conversion Rate"`

Converted = (sum(leads_df['Converted'])/len(leads_df['Converted'].index))*100

Converted

**Ans:** 37.92025019546521 (approx. 38%)

Model Building

`#Feature Selection Using RFE`

log_reg = LogisticRegression()

rfe = RFE(log_reg, n_features_to_select=15)

rfe = rfe.fit(X_train, y_train)

`list(zip(X_train.columns, rfe.support_, rfe.ranking_))`
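The `rfe_cols` list used by the models below is not defined anywhere in this extract; presumably it holds the 15 columns RFE selected. A minimal sketch of that step on synthetic data (the column names and sizes here are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (the real frame has far more dummy columns)
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.normal(size=(300, 8)),
                       columns=[f'feat_{i}' for i in range(8)])
y_train = (X_train['feat_0'] + X_train['feat_3'] > 0).astype(int)

rfe = RFE(LogisticRegression(), n_features_to_select=4).fit(X_train, y_train)

# Keep only the columns RFE flagged in support_; this is the list the
# statsmodels models below index X_train with
rfe_cols = list(X_train.columns[rfe.support_])
print(rfe_cols)
```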


**Model 1:**

X_train_sm = sm.add_constant(X_train[rfe_cols])

log1 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())

res = log1.fit()

res.summary()

## We can see that the p-values for some variables are not in the acceptable range. We will calculate the VIF to understand the multicollinearity among the variables and then decide which columns to drop.

`#VIF Calculation`

vif = pd.DataFrame()

vif['Features'] = X_train[rfe_cols].columns

vif['VIF'] = [variance_inflation_factor(X_train[rfe_cols].values, i) for i in range(X_train[rfe_cols].shape[1])]

vif['VIF'] = round(vif['VIF'], 2)

vif = vif.sort_values(by="VIF", ascending=False)

vif

The VIF for `Choosing_Course_Better Career Prospects` is way above the accepted values. Let us drop the column and recalculate the values.

`#Removing the feature with high VIF and a high p-value`

rfe_cols.remove('Choosing_Course_Better Career Prospects')

rfe_cols

**Model 2:**

`X_train_sm = sm.add_constant(X_train[rfe_cols])`

log2 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())

res = log2.fit()

res.summary()

`#VIF Calculation`

vif = pd.DataFrame()

vif['Features'] = X_train[rfe_cols].columns

vif['VIF'] = [variance_inflation_factor(X_train[rfe_cols].values, i) for i in range(X_train[rfe_cols].shape[1])]

vif['VIF'] = round(vif['VIF'], 2)

vif = vif.sort_values(by="VIF", ascending=False)

vif

## The VIF values look good, but the p-values for some variables are still unacceptable.

`#Removing the feature with a high p-value`

rfe_cols.remove('Lead_Origin_Lead Add Form')

rfe_cols

**Model 3:**

`X_train_sm = sm.add_constant(X_train[rfe_cols])`

log3 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())

res = log3.fit()

res.summary()

`#VIF Calculation`

vif = pd.DataFrame()

vif['Features'] = X_train[rfe_cols].columns

vif['VIF'] = [variance_inflation_factor(X_train[rfe_cols].values, i) for i in range(X_train[rfe_cols].shape[1])]

vif['VIF'] = round(vif['VIF'], 2)

vif = vif.sort_values(by="VIF", ascending=False)

vif

Let us drop `Lead_Source_Olark Chat` since its p-value is not acceptable.

`#Removing the feature with a high p-value`

rfe_cols.remove('Lead_Source_Olark Chat')

rfe_cols

**Model 4:**

`#VIF Calculation`

vif = pd.DataFrame()

vif['Features'] = X_train[rfe_cols].columns

vif['VIF'] = [variance_inflation_factor(X_train[rfe_cols].values, i) for i in range(X_train[rfe_cols].shape[1])]

vif['VIF'] = round(vif['VIF'], 2)

vif = vif.sort_values(by="VIF", ascending=False)

vif

The p-values and VIFs both now seem to be in an acceptable range. We can proceed with our model prediction.

`#Getting the predicted values on the train set`

y_train_pred = res.predict(X_train_sm)

y_train_pred

`#Reshaping y_train_pred to a 1D array`

y_train_pred = y_train_pred.values.reshape(-1)

y_train_pred

`#Data frame with the actual conversion flag and the predicted probability`

y_train_pred_df = pd.DataFrame({'Converted': y_train.values, 'Converted_Prob': y_train_pred})

y_train_pred_df['Prospect ID'] = y_train.index

y_train_pred_df.head()

**Let us assume an initial cut-off of 0.5.**

`# Substituting 0 or 1 with the cut-off as 0.5`

y_train_pred_df['Predicted'] = y_train_pred_df.Converted_Prob.map(lambda x: 1 if x > 0.5 else 0)

y_train_pred_df.head()

Model Evaluation

`# Creating the confusion matrix`

confusion = metrics.confusion_matrix(y_train_pred_df.Converted, y_train_pred_df.Predicted)

confusion

`# Predicted       not_converted  converted`

# Actual

# not_converted        3636          216

# converted             227         2188

`# Checking the overall accuracy`

print('Model accuracy is', round(metrics.accuracy_score(y_train_pred_df.Converted, y_train_pred_df.Predicted), 4) * 100, '%')

Model accuracy is 92.93 %

**Accuracy is almost 93%, which is quite good.**

`# Substituting the value of true positives`

TP = confusion[1,1]

# Substituting the value of true negatives

TN = confusion[0,0]

# Substituting the value of false positives

FP = confusion[0,1]

# Substituting the value of false negatives

FN = confusion[1,0]

`# Calculating the sensitivity`

sensitivity = round(TP/(TP+FN), 4) * 100

print('Model sensitivity is', sensitivity, '%')

Model sensitivity is 90.6 %

`# Calculating the specificity`

specificity = round(TN/(TN+FP), 4) * 100

print('Model specificity is', specificity, '%')

Model specificity is 94.39 %

**With the current cut-off at 0.5, we get the following results:**

- Model Accuracy = 92.93%
- Model Sensitivity = 90.6%
- Model Specificity = 94.39%

Plotting the ROC Curve

`#Creating the ROC function`

def draw_roc(actual, probs):

    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)

    auc_score = metrics.roc_auc_score(actual, probs)

    plt.figure(figsize=(5, 5))

    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)

    plt.plot([0, 1], [0, 1], 'k--')

    plt.xlim([0.0, 1.0])

    plt.ylim([0.0, 1.05])

    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')

    plt.ylabel('True Positive Rate')

    plt.title('Receiver operating characteristic example')

    plt.legend(loc="lower right")

    plt.show()

    return None

`fpr, tpr, thresholds = metrics.roc_curve(y_train_pred_df.Converted, y_train_pred_df.Converted_Prob, drop_intermediate=False)`

`# Call the ROC function`

draw_roc(y_train_pred_df.Converted, y_train_pred_df.Converted_Prob)

The area under the ROC curve is 0.97, which is a very good value.

**Finding the Optimal Cut-off Point:**

`# Creating columns with different probability cutoffs`

numbers = [float(x)/10 for x in range(10)]

for i in numbers:

    y_train_pred_df[i] = y_train_pred_df.Converted_Prob.map(lambda x: 1 if x > i else 0)

y_train_pred_df.head()

`# Creating a dataframe to see the values of accuracy, sensitivity, and specificity at different probability cutoffs`

cutoff_df = pd.DataFrame(columns=['probability', 'accuracy', 'sensitivity', 'specificity'])

# Building a confusion matrix to find the sensitivity, accuracy, and specificity at each probability level

num = [float(x)/10 for x in range(10)]

for i in num:

    cm1 = metrics.confusion_matrix(y_train_pred_df.Converted, y_train_pred_df[i])

    total1 = sum(sum(cm1))

    accuracy = (cm1[0,0]+cm1[1,1])/total1

    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])

    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])

    cutoff_df.loc[i] = [i, accuracy, sensi, speci]

cutoff_df

`custom_bins = [0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]`

cutoff_df.plot.line(x='probability', y=['accuracy', 'sensitivity', 'specificity'])

plt.xticks(custom_bins, rotation=90)

plt.yticks(custom_bins)

plt.grid(True)

plt.show()

From the graph, we can see that the optimal cut-off value is 0.25.

`y_train_pred_df['final_predicted'] = y_train_pred_df.Converted_Prob.map(lambda x: 1 if x > 0.25 else 0)`

y_train_pred_df.head()

`#Assigning a Lead Score to the training data`

y_train_pred_df['Lead_Score'] = y_train_pred_df.Converted_Prob.map(lambda x: round(x*100))

y_train_pred_df.head()

`#Checking the overall accuracy`

print('Overall model accuracy is', round(metrics.accuracy_score(y_train_pred_df.Converted, y_train_pred_df.final_predicted), 4) * 100, '%')

Overall model accuracy is 92.32 %

`#Creating the confusion matrix`

confusion = metrics.confusion_matrix(y_train_pred_df.Converted, y_train_pred_df.final_predicted)

confusion

`# Substituting the value of true positives`

TP = confusion[1,1]

# Substituting the value of true negatives

TN = confusion[0,0]

# Substituting the value of false positives

FP = confusion[0,1]

# Substituting the value of false negatives

FN = confusion[1,0]

`#Calculating the sensitivity`

sensitivity = round(TP/(TP+FN), 4) * 100

print('Overall model sensitivity is', sensitivity, '%')

Overall model sensitivity is 93.13 %

`#Calculating the specificity`

specificity = round(TN/(TN+FP), 4)*100

print('Overall model specificity is', specificity, '%')

Overall model specificity is 91.82 %

`#Calculating the False Positive rate`

False_Positive = round(FP/float(TN+FP), 4)*100

print("False Positive rate is", False_Positive, '%')

False Positive rate is 8.18 %

`#Calculating the Positive Predictive value`

Positive_predictive = round(TP/float(TP+FP), 4)*100

print("Positive predictive rate is", Positive_predictive, '%')

Positive predictive rate is 87.71 %

`#Calculating the Negative Predictive value`

Negative_predictive = round(TN/float(TN+FN), 4)*100

print("Negative predictive rate is", Negative_predictive, '%')

Negative predictive rate is 95.52 %

**With the cut-off of 0.25 derived from the ROC analysis, we get the following results:**

- Model Accuracy = 92.32%
- Model Sensitivity = 93.13%
- Model Specificity = 91.82%

**Calculating Precision**

`#TP / (TP + FP)`

Precision_rate = confusion[1,1]/(confusion[0,1]+confusion[1,1])*100

print("Precision rate is", Precision_rate, '%')

Precision rate is 87.7145085803432 %

**Calculating Recall**

`#TP / (TP + FN)`

Recall_rate = confusion[1,1]/(confusion[1,0]+confusion[1,1])*100

print("Recall rate is", Recall_rate, "%")

Recall rate is 93.12629399585921 %

**Using sklearn utilities to calculate and verify the precision and recall values**

`from sklearn.metrics import precision_score`

precision = precision_score(y_train_pred_df.Converted, y_train_pred_df.final_predicted)*100

print("Precision score is", precision, '%')

Precision score is 87.7145085803432 %

`from sklearn.metrics import recall_score`

recall = recall_score(y_train_pred_df.Converted, y_train_pred_df.final_predicted)*100

print("Recall score is", recall, '%')

Recall score is 93.12629399585921 %

Precision and Recall Trade-off

`from sklearn.metrics import precision_recall_curve`

y_train_pred_df.Converted, y_train_pred_df.final_predicted

`p, r, thresholds = precision_recall_curve(y_train_pred_df.Converted, y_train_pred_df.Converted_Prob)`

`plt.plot(thresholds, p[:-1], "g-")`

plt.plot(thresholds, r[:-1], "r-")

plt.show()
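The precision–recall plot above can also yield a numeric cut-off: the threshold at which the two curves cross. A sketch of that computation on synthetic scores (not the notebook's actual data; the analysis itself keeps the ROC-based 0.25 cut-off):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and probabilities standing in for Converted / Converted_Prob
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.6 * y_true + rng.normal(0.2, 0.15, size=1000), 0, 1)

p, r, thresholds = precision_recall_curve(y_true, y_prob)

# precision/recall have one more entry than thresholds; drop the last point,
# then pick the threshold where the two curves are closest
ix = int(np.argmin(np.abs(p[:-1] - r[:-1])))
crossover = float(thresholds[ix])
print(round(crossover, 3))
```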

`#Making predictions on the test set`

X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

X_test.head()

`#Adding a constant to X_test`

X_test_sm = sm.add_constant(X_test[rfe_cols])

X_test_sm.head()

`# Storing the predictions on the test set in the variable 'y_test_pred'`

y_test_pred = res.predict(X_test_sm)

y_test_pred[:10]

`# Converting the prediction array to a dataframe`

y_pred_df = pd.DataFrame(y_test_pred)

y_pred_df.head()

`# Converting y_test to a dataframe`

y_test_df = pd.DataFrame(y_test)

# Putting Prospect ID into a column

y_test_df['Prospect ID'] = y_test_df.index

# Resetting the index on both dataframes to append them side by side

y_pred_df.reset_index(drop=True, inplace=True)

y_test_df.reset_index(drop=True, inplace=True)

# Appending y_test_df and y_pred_df

y_pred_final = pd.concat([y_test_df, y_pred_df], axis=1)

# Renaming the column

y_pred_final = y_pred_final.rename(columns={0: 'Converted_Prob'})

y_pred_final.head()

`# Rearranging the columns`

y_pred_final = y_pred_final.reindex(columns=['Prospect ID', 'Converted', 'Converted_Prob'])

# Let's check the head of y_pred_final

y_pred_final.head()

`# Making predictions using the cut-off of 0.25`

y_pred_final['final_predicted'] = y_pred_final.Converted_Prob.map(lambda x: 1 if x > 0.25 else 0)

y_pred_final.head()

`# Checking the overall accuracy`

print('Test model accuracy is', round(metrics.accuracy_score(y_pred_final.Converted, y_pred_final.final_predicted), 4) * 100, '%')

Test model accuracy is 93.11 %

`# Creating the confusion matrix`

confusion = metrics.confusion_matrix(y_pred_final.Converted, y_pred_final.final_predicted)

confusion

`# Substituting the value of true positives`

TP = confusion[1,1]

# Substituting the value of true negatives

TN = confusion[0,0]

# Substituting the value of false positives

FP = confusion[0,1]

# Substituting the value of false negatives

FN = confusion[1,0]

`# Calculating the sensitivity`

sensitivity = round(TP/(TP+FN), 4) * 100

print('Test model sensitivity is', sensitivity, '%')

Test model sensitivity is 94.18 %

`# Calculating the specificity`

specificity = round(TN/(TN+FP), 4) * 100

print('Test model specificity is', specificity, '%')

Test model specificity is 92.5 %

**Applying the cut-off of 0.25 to the test set, we get the following results:**

- Model Accuracy = 93.11%
- Model Sensitivity = 94.18%
- Model Specificity = 92.5%

Assigning a Lead Score to the Test Data

`y_pred_final['Lead_Score'] = y_pred_final.Converted_Prob.map(lambda x: round(x*100))`

y_pred_final.head()

**Comparing the performance between the train and test data:**

We are getting good results on both the train and test data, confirming that the model is stable and generalizes well.

Customers with a lead score greater than or equal to 85 can be treated as "Hot Leads" and should be contacted for positive conversion.

`hot_leads = y_pred_final.loc[y_pred_final["Lead_Score"] >= 85]`

hot_leads

So we have identified 814 "Hot Leads."

`#Reshaping`

print("The Prospect IDs of the Hot Leads are as below:")

hot_leads_ids = hot_leads["Prospect ID"].values.reshape(-1)

hot_leads_ids

**Let's find the important features of our final model:**

`res.params.sort_values(ascending=False)`

Conclusion

The top 3 variables that contribute most towards the probability of a lead getting converted are:

- Tags_Lost to EINS
- Tags_Closed by Horizon
- Lead_Source_Welingak Website