Ujjawal Madan
October 8th 2020
We are interested in “resubscribe-eligible” users: Duolingo users who sign up for Duolingo Plus (our subscription service) and then let their subscription expire. Specifically, we are interested in whether, and why, these resubscribe-eligible users sign up for a second subscription.
I will be tackling this problem in multiple stages:
Currently, the data we are working with is App usage data on resubscribe-eligible users who were offered a promotional second free trial between July 8, 2020 and August 9, 2020.
Let's start by importing the data:
### Import libraries
import plotly.io as pio
pio.renderers.default = "notebook"
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
### Read in Data
# Relative path; please modify as needed
df = pd.read_csv('usage_data.csv')
Let's take some time to visually explore our variables to get a better understanding of them. For these visualizations, I will be using the Plotly library.
### Create a copy to be used for our visualization
df_viz = df.copy()
### Removing Nulls
df_viz = df_viz[~df_viz.isnull().any(axis=1)]
df_viz = df_viz.reset_index(drop=True)
### Changing the initial tier variables so that we can read it as a categorical variable
d = {'1': 'One', '6': 'Six', '12': 'Twelve', 'referral': 'Referral'}
df_viz['initial_tier_cat'] = df_viz['initial_tier'].map(d)
### initial_tier variable
fig = px.histogram(df_viz, x="initial_tier_cat", facet_col="did_resubscribe", title="Initial Tier Variable")
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('What tier the user signed up for on their first Plus free trial (1-month, 6-month, 12-month, or a "referral" tier)')
As we can see, there does seem to be a difference between the histograms of those who resubscribe and those who don't. Of those who subscribed through the referral process, a smaller proportion end up resubscribing. For those who had 12-month subscriptions, a higher proportion end up resubscribing.
### initial_payment variable
fig = px.histogram(df_viz, x="initial_payment", facet_col="did_resubscribe", title = 'Initial Payment Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('Whether the user paid money on their first Plus subscription. Users pay when they reach the end of the free trial period without cancelling. Users do not pay when they cancel before the free trial ends, or if they subscribe to Plus via referral')
There does not seem to be a significant difference in the proportions of those who resubscribe between those who initially paid and those who didn't.
### initial_duration variable
fig = px.histogram(df_viz, x="initial_duration", facet_col="did_resubscribe", title = 'Initial Duration Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days the user's first Plus subscription lasted, including the free trial")
The initial_duration histograms are quite similar to each other. Interestingly, though, for those whose trial lasted between 10 and 14 days, the proportion of people resubscribing was significantly higher than for people with other initial_durations.
### plus challenges attempted
fig = px.histogram(df_viz, x="plus_challenges_attempted", facet_col="did_resubscribe", title = 'Plus Challenges Attempted Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered on Duolingo during their first Plus subscription')
After zooming in on the above plot, we can see that as the number of plus_challenges_attempted increases, the proportion of those individuals resubscribing also increases. This suggests a correlation between the number of plus_challenges_attempted and the probability that an individual will resubscribe. There may be an underlying causal relationship as well, but we would need to investigate further to draw any reasonable conclusions.
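As a quick numerical sanity check of that visual impression, we could compare the group means of plus_challenges_attempted by did_resubscribe. The snippet below is only a sketch on a hypothetical miniature DataFrame (the real check would run on df_viz):

```python
import pandas as pd

# Hypothetical miniature stand-in for df_viz (real data not shown here)
toy = pd.DataFrame({
    "plus_challenges_attempted": [10, 250, 40, 900, 5, 600],
    "did_resubscribe": [False, True, False, True, False, True],
})

# Mean challenges attempted within each outcome group
group_means = toy.groupby("did_resubscribe")["plus_challenges_attempted"].mean()
print(group_means)
```

If the pattern in the plot is real, the mean for the True group should be noticeably higher than for the False group.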
### plus_challenges_correct variable
fig = px.histogram(df_viz, x="plus_challenges_correct", facet_col="did_resubscribe", title = 'Plus Challenges Correct Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered correctly on Duolingo during their first Plus subscription')
The above visualization is very similar to the plus_challenges_attempted visualization and I think we can use similar reasoning to say that there may be a positive correlation between plus_challenges_correct and the probability an individual will resubscribe.
### offer_delay variable
fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days between when the user's first Plus subscription expired and when they were offered the second free trial")
The above visualization is quite interesting to me. The distributions of the time between the expiration of the first subscription and the offer of the second do appear to differ between the two groups (resubscribe = False, resubscribe = True). Minimizing the delay may indeed be linked to the probability that one will resubscribe, given that almost all individuals who resubscribe have an offer_delay of less than 20 days. This is particularly interesting because it is the one variable that may be within the organization's control. We will definitely come back to this.
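To quantify that 20-day observation, we could compute, among resubscribers, the fraction whose offer_delay is under 20 days. The snippet below is a sketch on hypothetical stand-in data, not the actual dataset:

```python
import pandas as pd

# Hypothetical stand-in for df_viz; the real check would run on the actual data
toy = pd.DataFrame({
    "offer_delay": [3, 45, 12, 70, 8, 19, 60, 2],
    "did_resubscribe": [True, False, True, False, True, True, False, True],
})

# Among resubscribers, what fraction received the offer within 20 days?
resub = toy[toy["did_resubscribe"]]
frac_fast_offer = (resub["offer_delay"] < 20).mean()
print(frac_fast_offer)
```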
### activity label variable
fig = px.histogram(df_viz, x="activity_label", facet_col="did_resubscribe", title = 'Activity Label Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('A measure of the user\'s activity at the point they are offered a second free trial. "Current users" were active at least once in the previous week, "reactivated users" were inactive for more than a week but less than a month, and "resurrected users" were inactive for more than a month')
Unsurprisingly, for those individuals who are current users, a higher proportion of them will resubscribe in comparison to the proportions of resurrected users and reactivated users who resubscribe. Intuitively this makes sense but it is definitely helpful to validate our intuition with these findings.
### health_empty_ads variable
fig = px.histogram(df_viz, x="health_empty_ads", facet_col="did_resubscribe", title = 'Health Empty Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "health empty" ads a user saw for Plus per active day during the resubscribe period. A "health empty" ad is an ad that advertises the unlimited Hearts feature of Plus, which triggers whenever a user runs out of hearts')
Based on the above plot, there does not seem to be a meaningful relationship between the number of health_empty ads seen and the proportion of users resubscribing.
### mastery_quiz_ads variable
fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')
There does appear to be a weak link between the number of mastery quiz ads a user saw and the probability they will resubscribe.
### Learning Time Variable
fig = px.histogram(df_viz, x="learning_time", facet_col="did_resubscribe", title = 'Learning Time Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of minutes per active day a user spent learning on Duolingo during the resubscribe period")
Given the above plot, it appears that learning time may be very influential in helping us predict whether a person will resubscribe. The proportion of resubscribers is higher among individuals who spend more time on the app per day than among those who spend less. Again, this makes intuitive sense...
### New language variable
fig = px.histogram(df_viz, x="new_language", facet_col="did_resubscribe", title = 'New Language Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("Whether a user started learning a language during the resubscribe period they had never before learned on Duolingo or not")
Whether a user started a new language or not does not appear to be a very relevant factor in explaining the difference between those who resubscribe and those who don't.
### post_activity variable
fig = px.histogram(df_viz, x="post_activity", facet_col="did_resubscribe", title = 'Post Activity Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('If a user started a second free trial, the number of days they were active on Duolingo during the first week of the second free trial')
Let's dig a bit deeper into our post_activity variable. It makes sense that those who haven't signed up for a second trial have not spent any time on the app during their hypothetical second trial. For those who did sign up, the data indicates that the distribution is roughly uniform: the number of individuals who were active for x days is relatively equal for values of x from 0 to 7.
We should note at this point that it would not make sense to include this variable in any predictive model, given that we would not have this data at the time of implementation: there is no way to know post_activity before knowing whether an individual has resubscribed. It could serve as the target of a regression analysis, but it is not applicable in our case, where we are trying to predict whether or not an individual will resubscribe.
### Resubscribed or Not
fig = px.histogram(df_viz, x="did_resubscribe", title = 'Resubscribed or Not')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
df_viz['did_resubscribe'].mean()
Last but not least, let's look at the proportion of individuals who resubscribe. The above proportion tells us that there is a class imbalance and this is crucial information for when we later build our predictive model.
Given the data, it makes sense for us to use the did_resubscribe variable in our analysis. As our goal is to predict whether or not a individual will resubscribe given what we know about them, this variable provides us with the information of whether or not they resubscribed.
Out of curiosity, I wanted to see what a problems_per_day variable (average attempted problems per day) and a problems_ratio_correct variable (correctly answered : total answered) might look like.
df_viz['problems_per_day'] = df_viz['plus_challenges_attempted']/df_viz['initial_duration']
df_viz['problems_ratio_correct'] = df_viz['plus_challenges_correct']/df_viz['plus_challenges_attempted']
### problems_per_day
fig = px.histogram(df_viz, x="problems_per_day", facet_col="did_resubscribe", title = 'Average Problems Attempted / Day')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
### problems_ratio_correct
fig = px.histogram(df_viz, x="problems_ratio_correct", facet_col="did_resubscribe", title = 'Correctly Answered / All Answered')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
How can there be values greater than 1?
### Questions Correct/Questions Attempted
sum(df_viz['plus_challenges_correct']>df_viz['plus_challenges_attempted'])
It seems that there may be issues with our plus_challenges_correct and plus_challenges_attempted data. I don't think anyone can get more questions correct than they have answered. Interestingly, I would not have caught this if I hadn't created my own variables.
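The consistency check above can be packaged as a small filter that flags the logically impossible rows. This is a sketch on hypothetical stand-in data; on the real df_viz we would apply the same comparison:

```python
import pandas as pd

# Hypothetical stand-in for df_viz illustrating the consistency check
toy = pd.DataFrame({
    "plus_challenges_attempted": [100, 50, 30],
    "plus_challenges_correct":   [80, 55, 10],  # middle row is impossible: 55 > 50
})

# Rows where "correct" exceeds "attempted" are logically impossible
bad = toy[toy["plus_challenges_correct"] > toy["plus_challenges_attempted"]]
print(len(bad))
```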
Let's get to it!
### Create a copy to be used during the model building phase
app_df = df.copy()
app_df.head()
Our data looks correct. Let's keep exploring!
app_df.shape
app_df.describe()
We should first check to see if there are any NA (Missing) values.
# Are there nan values?
app_df[app_df.isnull().any(axis=1)]
It seems that there are numerous rows that contain missing values. There are a number of methods we can use to resolve this situation. The first step, however, requires us to look into whether there is any bias in the missing data. That is, do the data points with missing feature information form a sample that is representative of the entire population? Or is it possible that there is some pattern in the missing data? Finding the reason why the data is missing can provide us with a lot of information. At this time, I cannot conclude whether or not there is indeed a pattern. However, a cursory analysis can at least tell us whether the proportion of points labeled True differs greatly from the non-missing data points.
df[df.isnull().any(axis=1)]['did_resubscribe'].mean()
It seems that it does not differ greatly. Let's discuss possible strategies to deal with the missing data. Some of the more popular methods include removing missing values, using categorical variables to indicate missing data, or estimating the missing data and imputing the missing fields with our own values (possibly also with perturbation). If we assume that there is no bias in the missing data, then it may be best to simply remove data points that contain missing values. As the ratio of incomplete to complete data points is quite small (~0.015), this method avoids introducing errors while not shrinking our dataset significantly. If it turns out there is bias in the missing data, we may need to approach this differently.
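The missingness-bias check described above amounts to comparing the resubscribe rate of incomplete rows against that of complete rows. A sketch on hypothetical stand-in data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in illustrating the missingness comparison
toy = pd.DataFrame({
    "learning_time": [5.0, np.nan, 12.0, np.nan, 3.0, 8.0],
    "did_resubscribe": [True, False, True, True, False, False],
})

has_na = toy["learning_time"].isnull()
# Compare the resubscribe rate of incomplete vs complete rows
rate_missing = toy.loc[has_na, "did_resubscribe"].mean()
rate_complete = toy.loc[~has_na, "did_resubscribe"].mean()
print(rate_missing, rate_complete)
```

If the two rates are close (as they are in our real data), dropping the incomplete rows is a reasonable first approach.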
### Remove missing values
app_df = app_df[~app_df.isnull().any(axis=1)]
app_df = app_df.reset_index(drop=True)
Let's examine the datatypes of our variables.
app_df.dtypes
We should convert our variables into the proper format.
app_df['initial_tier'] = app_df['initial_tier'].astype('category')
app_df['initial_payment'] = app_df['initial_payment'].astype('bool')
app_df['initial_duration'] = app_df['initial_duration'].astype('Int32')
app_df['plus_challenges_attempted'] = app_df['plus_challenges_attempted'].astype('Int32')
app_df['plus_challenges_correct'] = app_df['plus_challenges_correct'].astype('Int32')
app_df['offer_delay'] = app_df['offer_delay'].astype('Int32')
app_df['activity_label'] = app_df['activity_label'].astype('category')
app_df['new_language'] = app_df['new_language'].astype('bool')
app_df['post_activity'] = app_df['post_activity'].astype('Int32')
Given that a number of the algorithms we will be running do not accept categorical data, we should explore solutions to this issue.
One-Hot Encoding is a popular approach to converting categorical data into a numeric format, and given the relatively few categories in both of the categorical variables in our data, I think it is an appropriate choice in this instance.
If it turns out that our best performing model is one that can handle categorical data, we can try running it on data that has not been one-hot encoded.
app_df[['initial_tier_1', 'initial_tier_12','initial_tier_6','initial_tier_referral']] = pd.get_dummies(app_df['initial_tier'])
app_df[['activity_label_current_user', 'activity_label_reactivated_user', 'activity_label_resurrected_user']] = pd.get_dummies(app_df['activity_label'])
app_df['label'] = app_df['did_resubscribe'].copy()
Let's drop the original categorical columns now that we have created the dummy variables as well as the post_activity variable which we no longer need.
app_df = app_df.drop(['user_id', 'initial_tier', 'activity_label', 'did_resubscribe'], axis = 1)
app_df = app_df.drop('post_activity', axis = 1)
One of our primary goals in this analysis is to use quantitative methods to determine how well we can predict whether a user will start a second free trial.
We can accomplish this by building predictive models that will classify whether a user will start a second free trial or not based on the given feature information (input data).
It is clear that this is a supervised learning problem where our goal is to try and map a specific output based on an input. The question now is, what kind of model do we want to build?
It seems to me that there are two approaches that we can take. One solution is to build a classification algorithm which, if given an input, would classify the data point as either a True or False in this case. Our goal, as clearly stated, is to determine how well we can predict whether a user will resubscribe or not.
However, we can potentially improve on this approach by providing probabilities; that is, how likely the user is to resubscribe based on the input data. One advantage is that we could fine-tune our model later (by adjusting the threshold) to better meet the needs of the business problem. For example, if the goal is to produce a model that identifies at least 80% of all individuals who will resubscribe, then we can adjust the thresholds of our model(s) to meet that requirement while minimizing False Positives and False Negatives. Additionally, numeric probabilities provide us with much more information than binary class labels. For example, we could later create customized multi-class labels and more precisely distinguish between individuals who are very likely to resubscribe, individuals who are just on the cusp of resubscribing, and individuals who are not at all likely to resubscribe.
Although our objective does not require us to predict probabilities, given these inherent advantages I will create both classification and probability models.
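The threshold-tuning idea above can be sketched as follows. The probabilities and labels here are hypothetical; with a real model we would use predict_proba on held-out data:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels; sketches how a
# decision threshold could be lowered until at least 80% of true
# resubscribers are captured (recall >= 0.8)
probs  = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,   0,   0,   0])

def recall_at(threshold):
    preds = probs >= threshold
    return preds[labels == 1].mean()

# Scan candidate thresholds from strict to lenient, stopping at the first
# one that meets the recall target
candidates = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
threshold = next(t for t in candidates if recall_at(t) >= 0.8)
print(threshold)
```

A stricter threshold would miss too many true resubscribers; a looser one would inflate false positives, so in practice we would also track precision alongside recall.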
As we now know, the dataset we are working with is not balanced between the two classes, True and False. This could lead to a variety of problems when building our model. Therefore, we should consider ahead of time how best to approach this situation.
One solution involves adding class weights to our models so that they account and correct for the imbalance.
Another solution involves undersampling, whereby examples from the training dataset that belong to the majority class are removed in order to better balance the class distribution.
Oversampling is the opposite of undersampling: examples from the training dataset that belong to the minority class are duplicated in order to better balance the class distribution.
These are all valid solutions, and I don't think there is any way of knowing beforehand which one is best, so why don't we try all three!
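For reference, the class-weight solution works by up-weighting the minority class. The "balanced" heuristic that scikit-learn uses computes n_samples / (n_classes * class_count); here is a sketch of that formula on a hypothetical label vector:

```python
import numpy as np

# Hypothetical imbalanced label vector (90% negative, 10% positive)
y = np.array([0] * 90 + [1] * 10)

# "balanced" class weights: n_samples / (n_classes * class_count)
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
print(weights)
```

The rare class gets a proportionally larger weight, so each misclassified minority example costs the model more during training.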
At this stage, it is important to think about what metric we should be maximizing for in our models as well as the metric we will be using to compare the performance of our models.
There are a number of different metrics we could maximize in our algorithms. Accuracy is the default metric that machine learning algorithms optimize for, but given the class imbalance, it is probably more appropriate to use the F1-score, which is the harmonic mean of precision and recall.
As for the metric we will use to compare models, I believe the AUC score from a TPR-FPR (ROC) curve should allow us to accurately assess and compare model performance.
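As a small illustration of the two metrics on hypothetical labels and model outputs (F1 needs hard class predictions, while ROC AUC needs scores or probabilities):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical ground truth and model outputs
y_true   = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred   = [1, 1, 0, 0, 0, 0, 1, 0]                   # hard predictions -> F1
y_scores = [0.9, 0.7, 0.4, 0.3, 0.2, 0.6, 0.8, 0.1]   # probabilities -> AUC

print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_scores))
```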
In this analysis, I will only be covering a few algorithms due to the time constraint. However, it may be worth considering other models as well at some point.
The four models that I would like to run are Random Forest, Logistic Regression, SVM and Gradient Boosting. These are models that I am personally familiar with and are also quite commonly used for classification purposes. I will be providing a short description of each model during the model-building phase.
Feature selection is a process, performed before inputting our data into any models, that helps us determine which combination of features may be most helpful in predicting our target variable. It also helps ensure that our model is not more complicated than it needs to be; simplicity is key. There are a variety of methods we could employ, and for the purposes of being thorough, I will examine the appropriateness and the advantages/disadvantages of some of the methods I am currently considering. There may be other methods not included here that could also be seriously considered.
A very simple method for feature selection is manually removing features that are correlated with one another: in a pair of highly correlated features, one ends up being redundant, so the solution is to drop it. However, I think we can and should explore more sophisticated techniques.
Greedy algorithms are commonly used for feature selection: at each step, the algorithm does whatever looks best without taking future options into consideration. RFE (Recursive Feature Elimination) is one of the more popular greedy algorithms used for feature selection and is one we can consider. It resembles backward elimination: the least important predictors are removed and the model is rebuilt. The disadvantages of RFE are that it is computationally expensive and does not make its decisions globally (looking at all options at the same time).
L1 regularization (Lasso) and Elastic Net are global optimization techniques that add a penalty term to our objective function. Depending on the regularization term, they will zero out factors that have little predictive power, thus reducing the number of features.
PCA, or Principal Component Analysis, is a method for compressing the number of dimensions of our data. We could then potentially feed this compressed (linearly transformed) data into our model. To reiterate, this is not a feature selection method per se. Dimensionality reduction does not seem to be necessary in this case, given that our ratio of variables to data points is appropriate and our overall number of dimensions is relatively low. I do not think PCA is necessary here.
Typically, Lasso and Elastic Net are thought to be more accurate and sophisticated than greedy algorithms like RFE, and as such, we will prefer Lasso and Elastic Net. However, it might still be interesting to interpret and visualize the results of an RFE algorithm regardless.
Luckily, the functions we will use to build our machine learning models often offer both Lasso and Elastic Net regularization, which makes our job much easier. For the Random Forest, let's try the RFE method and see what comes of it.
Before running any models, we should consider whether our data needs to be standardized or normalized. Both Random Forest and Gradient Boosting require no standardization or normalization given the nature of the algorithms. However, we should preprocess our data for both Logistic Regression and SVM. Luckily, we can add a normalizer or standardizer into our model pipelines as required. Given the nature of this data, I believe normalizing the data, which rescales our numeric features to the range 0 to 1, is more appropriate than standardizing it.
I am currently building models with data that includes outliers and am assuming that any outliers are not due to error. However, I think it would be helpful to create a training set that does not include any outliers and to see how it might perform on our test set. Given the time constraint, I will not be doing that in this analysis although ordinarily I would like to compare models trained on data including outliers with models trained on data excluding outliers.
Let's dive in!
### Importing libraries and functions
from collections import Counter

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, cross_validate,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
Although I was not planning on it, I ended up computing a correlation matrix anyway, just out of curiosity.
### Correlation Matrix
correlation_matrix = app_df.copy().drop('label', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
### Find correlated features
correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
### List of correlated_features
correlated_features
The approach I have chosen to take is to split our data into a training and test set and to use 10-fold cross validation during the training of our models. The train/test split ratio largely depends on what number of data points is sufficient for us to be able to accurately assess our model(s) performance. ~2000 points seems to be a very reasonable number to me for our test set so I will use an 80/20 train/test split ratio.
X = app_df.iloc[:, 0:16]
y = app_df.iloc[:, 16]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
As discussed earlier, we are going to explore both undersampling and oversampling techniques. The following code will help us with resampling our data.
### Undersampling
rus = RandomUnderSampler(random_state=42, replacement=True)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_rus))
### Oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_ros))
The four models we are building are Random Forest, Logistic Regression, SVM, and Gradient Boosting. I will create a function for each model and run each function three times (4 × 3 = 12 runs in total). The first run of each uses our imbalanced data with the algorithm configured to account for the imbalance; the second uses the undersampled data; the third uses the oversampled data. Let's get started!
Random Forests aggregate the results of numerous independent decision trees and belong to the family of algorithms called bagging (bootstrap aggregating): each tree is trained on a bootstrap sample of the data, with a random subset of features considered at each split, and the trees' predictions are combined. This is an excellent algorithm, and its class_weight option makes it particularly appropriate for this data given the imbalance.
### Random Forest Function
def rf(x, y):
    rfc = RandomForestClassifier(random_state=101, class_weight='balanced')
    rf_clf = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(5), scoring='f1_macro')
    rf_clf.fit(x, y)
    print('Optimal number of features: {}'.format(rf_clf.n_features_))
    plot_confusion_matrix(rf_clf, X_test, y_test, normalize='true')
    return rf_clf
### Imbalanced Data
rf_clf = rf(X_train, y_train)
### Undersampling
rf_clf1 = rf(X_rus, y_rus)
### Oversampling
rf_clf2 = rf(X_ros, y_ros)
### RFE Visualization
plt.figure(figsize=(16, 9))
plt.title('Recursive Feature Elimination with Cross-Validation', fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Number of features selected', fontsize=14, labelpad=20)
plt.ylabel('% Correct Classification', fontsize=14, labelpad=20)
plt.plot(range(1, len(rf_clf.grid_scores_) + 1), rf_clf.grid_scores_, color='#303F9F', linewidth=3)
plt.show()
### List of variables that were eliminated
X.columns[rf_clf.support_ == False]
### Feature Importance Based on RFE Algorithm
dset = pd.DataFrame()
dset['attr'] = X.columns[rf_clf.support_ == True]
dset['importance'] = rf_clf.estimator_.feature_importances_
dset = dset.sort_values(by='importance', ascending=False)
plt.figure(figsize=(16, 14))
plt.barh(y=dset['attr'], width=dset['importance'], color='#1976D2')
plt.title('RFECV - Feature Importances', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()
### TPR vs FPR Plot
disp = metrics.plot_roc_curve(rf_clf, X_test, y_test, name='RF Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(rf_clf1, X_test, y_test, ax=ax, alpha=0.8, name='RF RUS')
disp = metrics.plot_roc_curve(rf_clf2, X_test, y_test, ax=ax, alpha=0.8, name='RF ROS')
Logistic Regression is similar to linear regression except in a few key ways, one being that it outputs probabilities. Logistic Regression models the log-odds of the outcome and is fitted with an iterative maximum likelihood method rather than least squares.
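To make the log-odds point concrete: the logistic model maps a linear score (the log-odds) through the sigmoid function to obtain a probability. A minimal sketch:

```python
import math

# The sigmoid maps a log-odds score to a probability in (0, 1);
# a log-odds of 0 corresponds to probability 0.5
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # log-odds 0 -> probability 0.5
print(sigmoid(2.0))   # positive log-odds -> probability above 0.5
```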
### Logistic Regression Function
def log(x, y):
    from sklearn.model_selection import GridSearchCV
    parameters = {
        'sgdclassifier__loss': ['log'],
        'sgdclassifier__penalty': ['elasticnet', 'l1'],
        'sgdclassifier__class_weight': ['balanced'],
    }
    log_clf = make_pipeline(
        MinMaxScaler(),
        SGDClassifier(),
    )
    log_clf = GridSearchCV(log_clf, parameters, cv=10)
    log_clf.fit(x, y)
    plot_confusion_matrix(log_clf, X_test, y_test, normalize='true')
    return log_clf
### Imbalanced Data
log_clf = log(X_train, y_train)
### Undersampling
log_clf1 = log(X_rus, y_rus)
### Oversampling
log_clf2 = log(X_ros, y_ros)
### TPR vs FPR Curve
disp = metrics.plot_roc_curve(log_clf, X_test, y_test, name='LR Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(log_clf1, X_test, y_test, ax=ax, alpha=0.8, name='LR RUS')
disp = metrics.plot_roc_curve(log_clf2, X_test, y_test, ax=ax, alpha=0.8, name='LR ROS')
A Support Vector Machine (SVM), in this binary classification setting, is based on the idea of finding the hyperplane that best divides the dataset into two classes.
### SVM Algorithm
def svm(x, y):
    from sklearn.model_selection import GridSearchCV
    parameters = {
        'sgdclassifier__loss': ['hinge'],
        'sgdclassifier__penalty': ['l1'],
        'sgdclassifier__class_weight': ['balanced'],
    }
    svm_clf = make_pipeline(
        MinMaxScaler(),
        SGDClassifier(),
    )
    svm_clf = GridSearchCV(svm_clf, parameters, cv=10)
    svm_clf.fit(x, y)
    plot_confusion_matrix(svm_clf, X_test, y_test, normalize='true')
    return svm_clf
### Imbalanced Data
svm_clf = svm(X_train, y_train)
### Undersampling
svm_clf1 = svm(X_rus, y_rus)
### Oversampling
svm_clf2 = svm(X_ros, y_ros)
### TPR vs FPR Curves
from pylab import rcParams
rcParams['figure.figsize'] = 13, 7
disp = metrics.plot_roc_curve(svm_clf, X_test, y_test, name='SVM Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(svm_clf1, X_test, y_test, ax=ax, alpha=0.8, name='SVM RUS')
disp = metrics.plot_roc_curve(svm_clf2, X_test, y_test, ax=ax, alpha=0.8, name='SVM ROS')
Gradient Boosting belongs to the family of boosting algorithms. It is similar to Random Forest except in a few key ways, the primary one being that its trees are built sequentially, with each new decision tree attempting to correct the errors made by the previous trees.
### Gradient Boosting Algorithm
def gb(x, y):
    from sklearn.model_selection import GridSearchCV
    parameters = {
        'gradientboostingclassifier__loss': ['deviance', 'exponential'],
        'gradientboostingclassifier__max_features': ['auto', 'sqrt', 'log2'],
    }
    gb_clf = make_pipeline(
        GradientBoostingClassifier(random_state=0),
    )
    gb_clf = GridSearchCV(gb_clf, parameters, cv=10)
    gb_clf.fit(x, y)
    plot_confusion_matrix(gb_clf, X_test, y_test, normalize='true')
    # y_pred = gb_clf.predict_proba(X_test)
    # precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:, 1])
    # return gb_clf, (precision, recall, thresholds)
    return gb_clf
### Imbalanced Data
gb_clf = gb(X_train, y_train)
### Undersampling
gb_clf1 = gb(X_rus, y_rus)
### Oversampling
gb_clf2 = gb(X_ros, y_ros)
### TPR vs FPR Curve
disp = metrics.plot_roc_curve(gb_clf, X_test, y_test, name = 'GB Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(gb_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'GB RUS')
disp = metrics.plot_roc_curve(gb_clf2, X_test, y_test, ax=ax, alpha=0.8, name = 'GB ROS')
from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
a = 0.7
disp = metrics.plot_roc_curve(rf_clf, X_test, y_test, name = 'RF Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(rf_clf1, X_test, y_test, ax=ax, alpha=a, name = 'RF RUS')
disp = metrics.plot_roc_curve(rf_clf2, X_test, y_test, ax=ax, alpha=a, name = 'RF ROS')
disp = metrics.plot_roc_curve(log_clf, X_test, y_test, ax=ax, alpha=a, name = 'LR Balanced')
disp = metrics.plot_roc_curve(log_clf1, X_test, y_test, ax=ax, alpha=a, name = 'LR RUS')
disp = metrics.plot_roc_curve(log_clf2, X_test, y_test, ax=ax, alpha=a, name = 'LR ROS')
disp = metrics.plot_roc_curve(svm_clf, X_test, y_test, ax=ax, alpha=a, name = 'SVM Balanced')
disp = metrics.plot_roc_curve(svm_clf1, X_test, y_test, ax=ax, alpha=a, name = 'SVM RUS')
disp = metrics.plot_roc_curve(svm_clf2, X_test, y_test, ax=ax, alpha=a, name = 'SVM ROS')
disp = metrics.plot_roc_curve(gb_clf, X_test, y_test, ax=ax, alpha=a, name = 'GB Balanced')
disp = metrics.plot_roc_curve(gb_clf1, X_test, y_test, ax=ax, alpha=a, name = 'GB RUS')
disp = metrics.plot_roc_curve(gb_clf2, X_test, y_test, ax=ax, alpha=a, name = 'GB ROS')
plt.title('True Positive Rate VS False Positive Rate')
It seems that, comparing the original imbalanced data against undersampling and oversampling, none of the three approaches was conclusively better than the others.
However, one algorithm clearly outperformed the rest. As we can see above, Gradient Boosting achieved a higher AUC than all the other algorithms, and at every point on its TPR vs. FPR curve it performs at least as well as they do.
We could likely improve the Gradient Boosting model further with more extensive hyperparameter tuning.
This is an interesting question. So far, we have been using the AUC score under the TPR vs. FPR curve to compare models. It is a relatively simple and fairly straightforward metric.
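For intuition, the AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half). A minimal sketch with invented labels and scores:

```python
import numpy as np

# Made-up labels and model scores for illustration
y_true  = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

def auc(y_true, y_score):
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc(y_true, y_score))  # -> 0.888..., i.e. 8 of the 9 pos/neg pairs are ranked correctly
```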
However, our model threshold has not been configured to minimize False Positive and False Negative costs. So before we can describe how well our model performs, we must take into account the costs associated with both False Positive and False Negative outcomes. For example, we could potentially configure our model so that it correctly predicts every single resubscriber, but it would likely come at the cost of misclassifying individuals who have left the platform as resubscribers. This could be thought of as a great model if the False Positive figure does not matter, but for most practical applications it would not be a good model, because there would be some cost associated with each False Positive. Quantifying those costs and then configuring our models to minimize total cost is crucial.
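A minimal sketch of that idea, with invented probabilities, labels, and costs: sweep candidate thresholds and keep the one that minimizes total misclassification cost.

```python
import numpy as np

# Hypothetical labels, predicted probabilities, and per-error costs
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
p_hat  = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.7, 0.4, 0.1, 0.5, 0.6])

COST_FP = 1.0   # e.g. cost of a wasted promotion
COST_FN = 5.0   # e.g. lost resubscription revenue

def total_cost(threshold):
    pred = (p_hat >= threshold).astype(int)
    fp = ((pred == 1) & (y_true == 0)).sum()
    fn = ((pred == 0) & (y_true == 1)).sum()
    return fp * COST_FP + fn * COST_FN

# Sweep thresholds and keep the cheapest one
thresholds = np.linspace(0, 1, 101)
costs = np.array([total_cost(t) for t in thresholds])
best = thresholds[costs.argmin()]
print(best, costs.min())
```

Because false negatives are five times as costly here, the cost-minimizing threshold sits well below the default 0.5, trading extra false positives for fewer missed resubscribers.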
Practically, another way of describing the effectiveness of our models is to hold TPR, FPR, or a similar metric constant so as to define our threshold. We would then configure our model to meet that requirement and provide other, more tangible metrics that describe its effectiveness. For example, if a Product Manager were to ask, "Describe for me the effectiveness of your model given that I want to correctly identify at least 80% of resubscribers," I would first configure my model so that it met that requirement, and then use Precision (the share of positive predictions that are correct), FPR (False Positive Rate), or a similar measure to explain the drawbacks or costs. Continuing with this example, I might say that I can indeed correctly predict approximately 80 percent of all resubscribers, but that out of all my positive predictions, only 25% of them will be accurate. In fact, I run through this exact example in the cell below.
### Precision, Recall, Threshold given that Recall > 0.8 and model is gb_clf
# Compute the precision-recall curve for gb_clf (the commented-out lines in
# gb() compute the same tuple)
scores = precision_recall_curve(y_test, gb_clf.predict_proba(X_test)[:, 1])
for i in range(len(scores[1])):
    if scores[1][i] <= 0.8:
        print('Precision is ' + str(scores[0][i-1].round(4)))
        print('Recall is ' + str(scores[1][i-1].round(4)))
        print('Threshold is ' + str(scores[2][i-1].round(4)))
        break
What variables are most useful for such a prediction?
One of our objectives is to review which variables are most useful for our prediction. Luckily, there are some simple methods we can use to determine variable importance. With the Random Forest and Gradient Boosting algorithms, one can visualize feature importance based on how much each feature contributes to decreasing the weighted impurity. I have provided the variable importance based on both a Random Forest and Gradient Boosting as a compare and contrast.
### Feature Importance Visualize Function
def feature_importance(forest):
    forest.fit(X_train, y_train)
    importances = forest.feature_importances_
    try:
        std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    except AttributeError:
        # Not every ensemble exposes per-tree importances this way
        std = None
    indices = np.argsort(importances)[::-1]
    l1, l2, l3 = [], [], []
    for f in range(X_train.shape[1]):
        l1.append(f + 1)
        l2.append("feature %d (%f)" % (indices[f], importances[indices[f]]))
        l3.append(X_train.columns[indices[f]])
    # Plot the impurity-based feature importances of the forest
    plt.figure(figsize=(15, 8), dpi=300, facecolor='w', edgecolor='k')
    plt.title("Feature importances")
    if std is not None:
        plt.bar(range(X_train.shape[1]), importances[indices], color="r", align="center", yerr=std[indices])
    else:
        plt.bar(range(X_train.shape[1]), importances[indices], color="r", align="center")
    plt.xticks(range(X_train.shape[1]), l3, rotation=40)
    plt.xlim([-1, X_train.shape[1]])
    plt.show()
    table = pd.DataFrame([l1, l2, l3]).T
    table.columns = ['Ranking', 'Feature Number', 'Feature Name']
    return table
### Random Forest Feature Importance
feature_importance(RandomForestRegressor(n_estimators=250,random_state=0))
### Gradient Boosting Feature Importance
feature_importance(GradientBoostingClassifier(random_state=0))
If we take a look at the top 7 most influential features, the three key features that are within the company's control (or somewhat within it) are offer_delay, mastery_quiz_ads, and health_ads.
These features can be altered by Duolingo directly. For example, Duolingo can decide exactly when for each user to offer a second trial. They can also decide exactly how many ads of what kind to show to each user.
Let's start by examining the offer_delay feature more closely.
### offer_delay variable
fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
If we go back to our visualization, we can see that offer_delay is negatively correlated with the probability that a user resubscribes. This is crucial because it is in Duolingo's control when the second subscription is offered. Think of the predicted probability as a function of x, where x is offer_delay and all other feature values are held constant. If we are trying to maximize the probability that an individual resubscribes, what if we used our algorithm to find the optimal value of offer_delay? It is a simplistic approach, and there are likely better ways (perhaps using optimization techniques) to find the optimal value, but it could offer a window into what that value might be. Let's try it!
def optimize_offer(i):
    # Repeat one row 300 times, varying only offer_delay from 0 to 299
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(300)]
    df1['offer_delay'] = list(range(300))
    fig = px.scatter(x=list(range(300)), y=gb_clf.predict_proba(df1)[:, 1])
    fig.show()
    print('Sample Data')
    return df1.head()
### Optimize for sample row 8 in our training set
optimize_offer(8)
This reveals that the optimal offer_delay value is 1, and that it can make a relatively large difference (25%) in our probability estimate. While 1 is likely the optimal value for most data points, we could use this kind of method to explain how much offer_delay influences our final prediction.
Intuitively, it makes sense: offering a second free week immediately after the first free trial ends increases the chance that the user signs up for the second trial as well.
### mastery_quiz_ads variable
fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')
Two other variables that seem to have some influence on our prediction are the number of Mastery Quiz ads and the number of Health Empty ads (the former more so). We should be able to use an approach similar to the one above to see how mastery quiz ads affect our prediction.
def optimize_ads(i):
    # Repeat one row 20 times, varying only mastery_quiz_ads from 0 to 2
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(20)]
    values = [np.round(v, 1) for v in np.linspace(0, 2, 20)]
    df1['mastery_quiz_ads'] = values
    fig = px.scatter(x=values, y=gb_clf.predict_proba(df1)[:, 1])
    fig.show()
    print('Sample Data')
    return df1.head()
### Optimize for sample row 9 in our training set.
optimize_ads(9)
In this example, there may not be a single optimal number of ads that holds for all data points; based on some experimentation, it seems to vary from point to point. We can perform the same analysis for health_ads as well.
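A generic version of the two sweep functions above might look like the sketch below. The DummyModel (and its invented response curve) stands in for gb_clf so the example is self-contained; in the notebook, the fitted classifier would be passed in instead.

```python
import numpy as np
import pandas as pd

class DummyModel:
    # Stand-in for a fitted classifier; the probability curve is invented
    def predict_proba(self, df):
        p = 1 / (1 + np.exp(df['health_ads'].to_numpy() - 1))
        return np.column_stack([1 - p, p])

def sweep_feature(model, row, feature, values):
    # Repeat a single row, vary one feature, and record the predicted probability
    df = pd.concat([row] * len(values), ignore_index=True)
    df[feature] = values
    probs = model.predict_proba(df)[:, 1]
    return pd.DataFrame({feature: values, 'p_resubscribe': probs})

# Hypothetical single-user row
row = pd.DataFrame({'health_ads': [0.5], 'offer_delay': [3]})
curve = sweep_feature(DummyModel(), row, 'health_ads', np.linspace(0, 2, 21))
print(curve.loc[curve['p_resubscribe'].idxmax()])  # the value with the highest predicted probability
```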
There are several features that are strongly predictive of our target variable but that we do not have direct control over or cannot immediately change. However, being aware that these are some of the more predictive features in our model will help us understand which aspects of the product to prioritize.
Learning_time is the most influential feature based on our analysis. The more time customers spend on average per day, the more likely they are to resubscribe. According to our GB feature importance plot, this was given a 30% feature importance.
The greater the number of questions answered or answered correctly, the higher the probability the customer will resubscribe. According to our RF Feature Importance Plot, this was given 10% Feature Importance.
The longer the initial subscription, the higher the probability the customer will resubscribe. According to our GB Feature Importance plot, this was given 20% feature importance.
There are three variables that I have identified which influence our prediction and which, as an organization, we may have some control over and can therefore optimize: offer_delay, mastery_quiz_ads, and health_ads.
If our goal is solely to maximize the number of individuals who sign up for the second free trial, then we could potentially use our model to predict the optimal number of days for each user. Using A/B experimentation, we could see whether our predictions are indeed increasing the number of resubscriptions. To extend this into an ongoing experiment, we could employ a hybrid Multi-Armed Bandit approach in which, after set periods of time, we evaluate our predictions and adjust our models to better optimize offer_delay.
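As a sketch of the bandit component (with invented resubscribe rates and candidate offer_delay arms), an epsilon-greedy strategy trades off exploring arms against exploiting the current best estimate:

```python
import random

random.seed(0)
arms = [1, 7, 30]                      # hypothetical candidate offer_delay values (days)
true_rate = {1: 0.30, 7: 0.15, 30: 0.05}   # invented true resubscribe rates
counts = {a: 0 for a in arms}
successes = {a: 0 for a in arms}

def estimate(a):
    # Observed resubscribe rate for arm a so far
    return successes[a] / counts[a] if counts[a] else 0.0

epsilon = 0.1
for _ in range(5000):
    if random.random() < epsilon or not all(counts.values()):
        arm = random.choice(arms)          # explore
    else:
        arm = max(arms, key=estimate)      # exploit the current best arm
    counts[arm] += 1
    successes[arm] += random.random() < true_rate[arm]

print(counts)                              # the best arm comes to dominate
print({a: round(estimate(a), 3) for a in arms})
```

In practice the "reward" would be an observed resubscription, and the hybrid element the author describes would periodically refit the model on the data the bandit collects.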
We can utilize an approach similar to the A/B Testing and Hybrid Multi-Armed Bandit approach I described above. However, there are a different set of constraints to take into account.
If the cost of running an ad for a user exceeds the potential value of gaining a resubscription, then it may not be cost-effective. A cost-benefit analysis is crucial to this experiment.
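A back-of-the-envelope sketch of that cost-benefit rule, with invented numbers: show the ad only when its expected incremental value exceeds its cost.

```python
# Hypothetical figures: cost of serving one ad and value of one resubscription
AD_COST = 0.02
SUB_VALUE = 60.00

def should_show_ad(p_with_ad, p_without_ad):
    # Expected incremental value of this impression = lift x subscription value
    lift = p_with_ad - p_without_ad
    return lift * SUB_VALUE > AD_COST

# A 0.1-percentage-point lift is worth about $0.06 per impression, above the $0.02 cost
print(should_show_ad(0.051, 0.050))
# An ad with essentially no lift is not worth serving
print(should_show_ad(0.0502, 0.050))
```

The hard part, of course, is estimating the lift itself, which is exactly what the A/B test above would measure.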