Duolingo Take Home Project

Ujjawal Madan
October 8th 2020

Problem Statement

We are interested in “resubscribe-eligible” users: Duolingo users who sign up for Duolingo Plus (our subscription service) and then expire from their subscription. Specifically, we are interested in whether, and why, these resubscribe-eligible users sign up for a second subscription.

I will be tackling this problem in multiple stages:

  • Exploratory Data Analysis and Visualization
  • Preprocessing
  • Model-Building
  • Evaluating our models
  • Evaluating Variable Importance
  • Insights and Next Steps

The data we are working with is app usage data on resubscribe-eligible users who were offered a promotional second free trial between July 8, 2020 and August 9, 2020.

Let's start by importing the data:

In [1548]:
### Import libraries

import plotly.io as pio
pio.renderers.default = "notebook"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
In [1549]:
### Read in Data

# Relative path; please modify as needed
df = pd.read_csv('usage_data.csv')

Exploring the Data

Let's take some time to visually explore our variables to get a better understanding of them. For these visualizations, I will be using the Plotly library.

In [1550]:
### Create a copy to be used for our visualization

df_viz = df.copy()
In [1551]:
### Removing Nulls

df_viz = df_viz[~df_viz.isnull().any(axis=1)]
df_viz = df_viz.reset_index(drop=True)
In [1552]:
### Changing the initial tier variables so that we can read it as a categorical variable

d = {'6': 'Six', '12': 'Twelve', '1': 'One', 'referral': 'Referral'}
df_viz['initial_tier_cat'] = df_viz['initial_tier'].apply(lambda x: d[x])
In [1553]:
### initial_tier variable

fig = px.histogram(df_viz, x="initial_tier_cat", facet_col="did_resubscribe", title="Initial Tier Variable")
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('What tier the user signed up for on their first Plus free trial (1-month, 6-month, 12-month, or a "referral" tier)')
What tier the user signed up for on their first Plus free trial (1-month, 6-month, 12-month, or a "referral" tier)

As we can see, there does seem to be a difference in the histograms between those who resubscribe and those who don't. Of those who subscribed through the referral process, a smaller proportion end up resubscribing. For those who had 12-month subscriptions, a higher proportion end up resubscribing.

In [1486]:
### initial_payment variable

fig = px.histogram(df_viz, x="initial_payment", facet_col="did_resubscribe", title = 'Initial Payment Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('Whether the user paid money on their first Plus subscription. Users pay when they reach the end of the free trial period without cancelling. Users do not pay when they cancel before the free trial ends, or if they subscribe to Plus via referral')
Whether the user paid money on their first Plus subscription. Users pay when they reach the end of the free trial period without cancelling. Users do not pay when they cancel before the free trial ends, or if they subscribe to Plus via referral

There does not seem to be a significant difference in the proportions of those who resubscribe between those who initially paid and those who didn't.

In [1488]:
### initial_duration variable

fig = px.histogram(df_viz, x="initial_duration", facet_col="did_resubscribe", title = 'Initial Duration Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days the user's first Plus subscription lasted, including the free trial")
The number of days the user's first Plus subscription lasted, including the free trial

The initial_duration histograms are quite similar to each other. However, it is interesting that for those whose subscription lasted between 10 and 14 days, the proportion of people resubscribing was noticeably higher than for people with other initial_duration values.

In [1489]:
### plus challenges attempted

fig = px.histogram(df_viz, x="plus_challenges_attempted", facet_col="did_resubscribe", title = 'Plus Challenges Attempted Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered on Duolingo during their first Plus subscription')
The number of questions the user answered on Duolingo during their first Plus subscription

After zooming in on the above plot, we can see that as the number of plus_challenges_attempted increases, the proportion of those individuals resubscribing also increases. This suggests that there is a correlation between the number of plus_challenges_attempted and the probability an individual will resubscribe. There may be an underlying causal relationship as well, but we would need to investigate further to make any reasonable conclusions.

In [1490]:
### plus_challenges_correct variable

fig = px.histogram(df_viz, x="plus_challenges_correct", facet_col="did_resubscribe", title = 'Plus Challenges Correct Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered correctly on Duolingo during their first Plus subscription')
The number of questions the user answered correctly on Duolingo during their first Plus subscription

The above visualization is very similar to the plus_challenges_attempted visualization and I think we can use similar reasoning to say that there may be a positive correlation between plus_challenges_correct and the probability an individual will resubscribe.

In [1491]:
### offer_delay variable

fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days between when the user's first Plus subscription expired and when they were offered the second free trial")
The number of days between when the user's first Plus subscription expired and when they were offered the second free trial

The above visualization is quite interesting to me. The two distributions, which show the time between the expiration of the first subscription and the offer of the second, do seem to differ between the two groups (resubscribe = False, resubscribe = True). It appears that minimizing the delay may indeed be linked to the probability one will resubscribe, given that almost all individuals who resubscribe have an offer_delay of less than 20 days. This is particularly interesting because it is the one variable that may be within the organization's control. We will definitely come back to this.

In [1492]:
### activity label variable

fig = px.histogram(df_viz, x="activity_label", facet_col="did_resubscribe", title = 'Activity Label Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('A measure of the user\'s activity at the point they are offered a second free trial. "Current users" were active at least once in the previous week, "reactivated users" were inactive for more than a week but less than a month, and "resurrected users" were inactive for more than a month')
A measure of the user's activity at the point they are offered a second free trial. "Current users" were active at least once in the previous week, "reactivated users" were inactive for more than a week but less than a month, and "resurrected users" were inactive for more than a month

Unsurprisingly, for those individuals who are current users, a higher proportion of them will resubscribe in comparison to the proportions of resurrected users and reactivated users who resubscribe. Intuitively this makes sense but it is definitely helpful to validate our intuition with these findings.

In [1494]:
### health_empty_ads variable

fig = px.histogram(df_viz, x="health_empty_ads", facet_col="did_resubscribe", title = 'Health Empty Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "health empty" ads a user saw for Plus per active day during the resubscribe period. A "health empty" ad is an ad that advertises the unlimited Hearts feature of Plus, which triggers whenever a user runs out of hearts')
The number of "health empty" ads a user saw for Plus per active day during the resubscribe period. A "health empty" ad is an ad that advertises the unlimited Hearts feature of Plus, which triggers whenever a user runs out of hearts

Based on the above plot, there does not seem to be a significant relationship between the number of health_empty ads seen and the proportion of users resubscribing.

In [1493]:
### mastery_quiz_ads variable

fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')
The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus

There does appear to be a weak link between the number of mastery quiz ads a user saw and the probability they will resubscribe.

In [1495]:
### Learning Time Variable

fig = px.histogram(df_viz, x="learning_time", facet_col="did_resubscribe", title = 'Learning Time Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of minutes per active day a user spent learning on Duolingo during the resubscribe period")
The number of minutes per active day a user spent learning on Duolingo during the resubscribe period

Given the above plot, it appears that learning time may be very influential in helping us predict whether a person will or will not resubscribe. Individuals who spend more time on the app per active day resubscribe at a higher rate than those who do not. Again, this makes intuitive sense.

In [1496]:
### New language variable

fig = px.histogram(df_viz, x="new_language", facet_col="did_resubscribe", title = 'New Language Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("Whether a user started learning a language during the resubscribe period they had never before learned on Duolingo or not")
Whether a user started learning a language during the resubscribe period they had never before learned on Duolingo or not

Whether or not a user started a new language does not appear to be a very relevant factor in explaining the difference between those who resubscribe and those who don't.

In [1499]:
### post_activity variable

fig = px.histogram(df_viz, x="post_activity", facet_col="did_resubscribe", title = 'Post Activity Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('If a user started a second free trial, the number of days they were active on Duolingo during the first week of the second free trial')
If a user started a second free trial, the number of days they were active on Duolingo during the first week of the second free trial

Let's dig a bit deeper into our post_activity variable. It makes sense that those who haven't signed up for a second trial have not spent any time on the app during their hypothetical second trial. For those who did sign up, the data indicates that the distribution is roughly uniform: the number of individuals who were active for x days is relatively equal for values of x from 0 to 7.

We should however note at this point that it would not make sense to include this in any predictive model given that we would not have this data at the time of implementation. There is no way we can know post_activity if we do not know whether an individual has resubscribed or not. This could be used as the target variable for a regression analysis but is not applicable in our case where we are trying to predict whether or not an individual will resubscribe.

In [1501]:
### Resubscribed or Not

fig = px.histogram(df_viz, x="did_resubscribe", title = 'Resubscribed or Not')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

df_viz['did_resubscribe'].mean()
Out[1501]:
0.12477341389728097

Last but not least, let's look at the proportion of individuals who resubscribe. The above proportion tells us that there is a class imbalance and this is crucial information for when we later build our predictive model.

Given the data, it makes sense for us to use the did_resubscribe variable in our analysis. As our goal is to predict whether or not an individual will resubscribe given what we know about them, this variable tells us whether or not they resubscribed.

Visualizing my own variables

Out of curiosity, I wanted to see what the problems_per_day variable (average attempted problems per day) and the problems_ratio_correct variable (correctly answered : total answered) might look like.

In [1555]:
df_viz['problems_per_day'] = df_viz['plus_challenges_attempted']/df_viz['initial_duration']
df_viz['problems_ratio_correct'] = df_viz['plus_challenges_correct']/df_viz['plus_challenges_attempted']
In [1556]:
### problems_per_day

fig = px.histogram(df_viz, x="problems_per_day", facet_col="did_resubscribe", title = 'Average Problems Attempted / Day')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
In [1557]:
### problems_ratio_correct

fig = px.histogram(df_viz, x="problems_ratio_correct", facet_col="did_resubscribe", title = 'Correctly Answered / All Answered')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

How can there be values greater than 1?

In [1558]:
### Questions Correct/Questions Attempted

sum(df_viz['plus_challenges_correct']>df_viz['plus_challenges_attempted'])
Out[1558]:
111

It seems that there may be issues with our plus_challenges_correct and plus_challenges_attempted data. I don't think anyone can answer more questions correctly than they attempted. Interestingly, I would not have caught this if I hadn't created my own variables.
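
As a quick check, here is a minimal sketch of how we might flag (or conservatively cap) these inconsistent rows; I am not applying the cap to the modeling data in this analysis.

### Sketch: flagging rows where "correct" exceeds "attempted" (cap not applied here)

inconsistent = df_viz['plus_challenges_correct'] > df_viz['plus_challenges_attempted']
print(f"{inconsistent.sum()} inconsistent rows ({inconsistent.mean():.1%} of the data)")

# A conservative fix would be to cap "correct" at "attempted":
# df_viz['plus_challenges_correct'] = df_viz[['plus_challenges_correct',
#                                             'plus_challenges_attempted']].min(axis=1)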

Let's get to it!

In [1595]:
### Create a copy to be used during the model building phase

app_df = df.copy()

Preparing our Data

In [1596]:
app_df.head()
Out[1596]:
user_id initial_tier initial_payment initial_duration plus_challenges_attempted plus_challenges_correct offer_delay activity_label health_empty_ads mastery_quiz_ads learning_time new_language post_activity did_resubscribe
0 91cfb886c01a3b69ff29a8444620d8eb 6 False 7.0 1255.0 999.0 78.0 resurrected_user 0.0 0.000 0.00 False 0.0 False
1 bc354f81f63043619115792b52d14d32 referral False 7.0 58.0 55.0 152.0 resurrected_user 0.0 0.000 2.32 False 0.0 False
2 a28cd2fc4e7fbce59a91a16f42daad87 referral False 7.0 543.0 504.0 42.0 resurrected_user 0.0 0.000 0.00 False 0.0 False
3 e8b7ef7465b5f87cab8bac0345cae332 1 False 7.0 168.0 162.0 2.0 current_user 0.0 0.333 5.81 False 0.0 False
4 89e71f674719f9ba93cf88a191e20104 1 False 7.0 582.0 510.0 115.0 resurrected_user 0.0 0.000 0.62 False 0.0 False

Our data looks correct. Let's keep exploring!

In [1597]:
app_df.shape
Out[1597]:
(10077, 14)
In [1598]:
app_df.describe()
Out[1598]:
initial_duration plus_challenges_attempted plus_challenges_correct offer_delay health_empty_ads mastery_quiz_ads learning_time post_activity
count 10062.000000 10063.000000 10067.000000 10066.000000 10060.000000 10068.000000 10057.000000 10065.000000
mean 35.324588 1161.908973 1034.511175 139.991357 0.203973 0.211489 28.512222 0.436960
std 73.182226 2910.942785 2577.811596 137.462982 0.447673 0.451675 213.981577 1.409789
min 3.000000 0.000000 0.000000 0.000000 -2.009000 0.000000 0.000000 0.000000
25% 7.000000 172.000000 153.000000 17.000000 0.000000 0.000000 1.290000 0.000000
50% 7.000000 446.000000 399.000000 96.000000 0.032000 0.138500 4.660000 0.000000
75% 14.000000 1048.000000 942.000000 238.000000 0.250000 0.250000 10.510000 0.000000
max 464.000000 61472.000000 57089.000000 503.000000 15.000000 11.750000 2852.560000 7.000000

NA Values

We should first check to see if there are any NA (Missing) values.

In [1599]:
# Are there nan values?

app_df[app_df.isnull().any(axis=1)]
Out[1599]:
user_id initial_tier initial_payment initial_duration plus_challenges_attempted plus_challenges_correct offer_delay activity_label health_empty_ads mastery_quiz_ads learning_time new_language post_activity did_resubscribe
15 a21d57c2a4be900872f46940f6fbc6d6 1 False 7.0 142.0 136.0 73.0 reactivated_user 0.105 NaN 14.99 True 0.0 False
108 101b051d20bc8aeec77f81a3390403d4 12 False 7.0 0.0 0.0 NaN resurrected_user 0.000 0.500 3.26 True 0.0 False
163 d7cbd9142dec663c96eec22c9d664944 12 False 14.0 52.0 51.0 5.0 reactivated_user 0.333 0.583 34.11 True NaN False
240 d2cc41b4c6e6e6940c1599a49af5787f 6 False NaN 410.0 357.0 98.0 resurrected_user 0.000 0.000 7.39 True 0.0 False
268 1d3bdee23fb8b550817d00ee2962acf5 referral False 7.0 661.0 601.0 360.0 resurrected_user 0.053 0.184 4.86 False NaN False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9701 77d39526981651916a4f57f42338f446 12 False 7.0 394.0 345.0 454.0 resurrected_user NaN 0.333 1.37 False 0.0 False
9720 f91bf686e3275bce9891b8c7a73b2d01 12 False 14.0 0.0 0.0 2.0 reactivated_user NaN 0.000 1.41 False 0.0 False
9772 29e24656f98e98a5208c02141bc87d54 NaN False 7.0 1162.0 1080.0 88.0 resurrected_user 0.200 0.300 11.59 False 0.0 False
9890 a24b48b844c69f3d7aca0732cd89013b 1 False 7.0 189.0 178.0 388.0 NaN 0.000 0.391 13.32 True 0.0 False
10012 d0871fd5a726d27363a54b69882be9c9 12 False 7.0 NaN 48.0 253.0 resurrected_user 0.155 1.741 9.53 True 0.0 False

147 rows × 14 columns

It seems that there are numerous rows that contain missing values. There are a number of methods we can use to resolve this situation. The first step, however, requires us to look into whether there is any bias in the missing data. That is, do the data points with missing feature information form a sample that is representative of the entire population, or is there some pattern in the missing data? Finding the reason why data is missing can provide us with a lot of information. At this time, I cannot conclude whether or not there is indeed a pattern. However, a cursory analysis can at least tell us whether the proportion of points labeled True differs greatly from the non-missing data points.

In [1600]:
df[df.isnull().any(axis=1)]['did_resubscribe'].mean()
Out[1600]:
0.1360544217687075

It seems that it does not differ greatly. Let's discuss possible strategies to deal with the missing data. Some of the more popular methods include removing missing values, using categorical variables to indicate missing data, or estimating the missing data and imputing the missing fields with our own values (possibly also with perturbation). If we assume that there is no bias in the missing data, then it may be best to simply remove data points that contain missing values. As the ratio of incomplete data points to complete data points is quite small (~0.015), this method avoids introducing errors while not shrinking our dataset significantly. If it turns out there is bias in the missing data, we may need to approach this differently.
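
If the missingness did turn out to be biased, imputation would be a reasonable alternative to dropping rows. As a minimal sketch (not used below), median imputation of the numeric columns with scikit-learn's SimpleImputer would look like this:

### Sketch: median imputation as an alternative to dropping rows (not used in this analysis)

# from sklearn.impute import SimpleImputer
# numeric_cols = app_df.select_dtypes(include='number').columns
# imputer = SimpleImputer(strategy='median')
# app_df[numeric_cols] = imputer.fit_transform(app_df[numeric_cols])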

In [1601]:
### Remove missing values

app_df = app_df[~app_df.isnull().any(axis=1)]
app_df = app_df.reset_index(drop=True)

Let's examine the datatypes of our variables.

In [1602]:
app_df.dtypes
Out[1602]:
user_id                       object
initial_tier                  object
initial_payment               object
initial_duration             float64
plus_challenges_attempted    float64
plus_challenges_correct      float64
offer_delay                  float64
activity_label                object
health_empty_ads             float64
mastery_quiz_ads             float64
learning_time                float64
new_language                  object
post_activity                float64
did_resubscribe                 bool
dtype: object

We should convert our variables into the proper format.

In [1603]:
app_df['initial_tier'] = app_df['initial_tier'].astype('category')
app_df['initial_payment'] = app_df['initial_payment'].astype('bool')
app_df['initial_duration'] = app_df['initial_duration'].astype('Int32')
app_df['plus_challenges_attempted'] = app_df['plus_challenges_attempted'].astype('Int32')
app_df['plus_challenges_correct'] = app_df['plus_challenges_correct'].astype('Int32')
app_df['offer_delay'] = app_df['offer_delay'].astype('Int32')
app_df['activity_label'] = app_df['activity_label'].astype('category')
app_df['new_language'] = app_df['new_language'].astype('bool')
app_df['post_activity'] = app_df['post_activity'].astype('Int32')

Handling Categorical Data

Given that there will be a number of algorithms we will be running which do not accept categorical data, we should explore solutions related to this issue.

One-Hot Encoding is a popular approach to converting categorical data into a numeric format, and given the relatively few categories in both of the categorical variables in our data, I think it is an appropriate choice in this instance.

If it turns out that our best performing model is one that can handle categorical data, we can try running it on data that has not been one-hot encoded.

In [1604]:
app_df[['initial_tier_1', 'initial_tier_12','initial_tier_6','initial_tier_referral']] = pd.get_dummies(app_df['initial_tier'])
In [1605]:
app_df[['activity_label_current_user', 'activity_label_reactivated_user', 'activity_label_ressurected_user']] = pd.get_dummies(app_df['activity_label'])
In [1606]:
app_df['label'] = app_df['did_resubscribe'].copy()

Let's drop the original categorical columns now that we have created the dummy variables, along with the user_id and post_activity columns, which we no longer need.

In [1607]:
app_df = app_df.drop(['user_id', 'initial_tier', 'activity_label', 'did_resubscribe'], axis = 1)
In [1608]:
app_df = app_df.drop('post_activity', axis = 1)

Model-Building

One of our primary goals during this analysis is to use quantitative methods to determine how well we can predict whether a user will start a second free trial.

We can accomplish this by building predictive models that will classify whether a user will start a second free trial or not based on the given feature information (input data).

What kind of model are we building?

It is clear that this is a supervised learning problem where our goal is to try and map a specific output based on an input. The question now is, what kind of model do we want to build?

It seems to me that there are two approaches we can take. One is to build a classification algorithm which, given an input, classifies the data point as either True or False. Our goal, as stated, is to determine how well we can predict whether a user will resubscribe or not.

However, we can potentially improve on this approach by providing probabilities - that is, how likely the user is to resubscribe based on the input data. One advantage is that we could fine-tune our model later on (by adjusting the threshold) to better meet the needs of the business problem. For example, if the goal is to produce a model that will identify at least 80% of all individuals who will resubscribe, then we can adjust the thresholds of our model(s) to meet that requirement while minimizing False Positives and False Negatives. Additionally, numeric probabilities provide us with much more information than binary class labels. For example, we could later create customized multi-class labels and more precisely distinguish between individuals who are very likely to resubscribe, individuals who are on the cusp of resubscribing, and individuals who are not at all likely to resubscribe.

Although our objective does not require us to predict probabilities, given these inherent advantages I will create both classification and probability models. A minimal sketch of the probability-plus-threshold idea follows.
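
To make the threshold idea concrete, here is a minimal sketch, assuming a fitted classifier `clf` that exposes predict_proba and a held-out test set X_test (both are built later in the notebook):

### Sketch: class labels from predicted probabilities with a custom threshold

# proba = clf.predict_proba(X_test)[:, 1]   # estimated P(did_resubscribe = True)
# threshold = 0.30                          # chosen from business requirements, not fixed at 0.5
# predictions = proba >= threshold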

Potential Approaches to Imbalanced Classes

As we now know, the dataset we are working with is not balanced between the two classes, True and False. This could lead to a variety of problems when building our models. Therefore, we should consider ahead of time how best to approach this situation.

Class Weights:

One solution involves adding class weights to our models so that they account and correct for the imbalance.

Undersampling:

Another solution involves undersampling techniques, whereby examples from the training dataset that belong to the majority class are removed in order to better balance the class distribution.

Oversampling:

Oversampling is the opposite of undersampling: examples from the training dataset that belong to the minority class are duplicated in order to better balance the class distribution.

These are all valid solutions, and I don't think there is any way of knowing beforehand which one is better than the others, so why don't we try all three!
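
For reference, here is roughly what each option looks like in code; the full versions (applied to the actual training split) appear in the model-building section below.

### Sketch: the three imbalance strategies in code (full versions appear later)

# 1. Class weights - let the estimator reweight the minority class:
#    RandomForestClassifier(class_weight='balanced')

# 2. Undersampling - remove majority-class examples from the training set:
#    from imblearn.under_sampling import RandomUnderSampler
#    X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# 3. Oversampling - duplicate minority-class examples:
#    from imblearn.over_sampling import RandomOverSampler
#    X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)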

What Metrics do we use?

At this stage, it is important to think about what metric we should be maximizing for in our models as well as the metric we will be using to compare the performance of our models.

There are a number of different metrics our algorithms could optimize for. Accuracy is the default metric that machine learning algorithms optimize for, but given the class imbalance, it is probably more appropriate to use the F1-score, which is the harmonic mean of Precision and Recall.

As for the metric we will use to compare models, I believe the AUC score from a TPR/FPR (ROC) curve should allow us to accurately assess and compare model performance.
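
For concreteness, this is how both metrics would be computed with scikit-learn, assuming a fitted classifier `clf` and a held-out test set (both built later in the notebook):

### Sketch: computing the F1-score and ROC AUC with scikit-learn

# from sklearn.metrics import f1_score, roc_auc_score
# f1 = f1_score(y_test, clf.predict(X_test))                    # harmonic mean of precision and recall
# auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])  # area under the TPR/FPR (ROC) curve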

What specific models should we use?

In this analysis, I will only be covering a few algorithms due to the time constraint. However, it may be worth considering other models as well at some point.

The four models that I would like to run are Random Forest, Logistic Regression, SVM and Gradient Boosting. These are models that I am personally familiar with and are also quite commonly used for classification purposes. I will be providing a short description of each model during the model-building phase.

Feature Selection

Feature selection is a process, performed before feeding our data into any models, that helps us determine which combination of features may be most helpful in predicting our target variable. It also helps ensure that our model is not more complicated than it needs to be; simplicity is key. There are a variety of methods we could employ, and for the purposes of being thorough, I will examine the appropriateness and the advantages/disadvantages of some of the methods I am currently considering. There may be other methods that I have not included but that could be seriously considered.

Manually Remove Features using Correlation Matrix

A very simple method for feature selection is to manually remove features that are highly correlated with one another: in a pair of correlated features, one is largely redundant, so we can drop it. However, I think we can and should explore more sophisticated techniques as well.

Greedy Algorithms

Greedy algorithms are commonly used for feature selection: at each step the algorithm does whatever looks best without taking future options into consideration. RFE (Recursive Feature Elimination) is one of the more popular greedy algorithms for feature selection and is one we can consider. It is similar to backward elimination: the least important predictors are removed and the model is rebuilt. The disadvantages of RFE are that it is computationally expensive and that it does not make its decisions globally (looking at all the options at the same time).

Lasso or Elastic Net Regularization

L1 regularization (Lasso) and Elastic Net are global optimization techniques that add a penalty term to our objective function. Depending on the strength of the regularization term, they shrink the coefficients of features with little predictive power toward zero, thus reducing the number of features.
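
As a concrete illustration, the SGDClassifier used later in this notebook exposes both penalties directly; the alpha and l1_ratio values below are illustrative defaults, not tuned choices.

### Sketch: Lasso-style and Elastic Net-style logistic models via SGDClassifier

# from sklearn.linear_model import SGDClassifier
# lasso_like = SGDClassifier(loss='log', penalty='l1', alpha=0.0001)
# enet_like = SGDClassifier(loss='log', penalty='elasticnet', l1_ratio=0.15, alpha=0.0001)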

PCA (Used for Dimensionality Reduction, not Feature Selection)

PCA, or Principal Component Analysis, is a method for compressing the number of dimensions of our data. We could then potentially feed this compressed (linearly transformed) data into our model. To reiterate, this is not a feature selection method per se. Dimensionality reduction does not seem necessary in this case given that our ratio of variables to data points is reasonable and our overall number of dimensions is relatively low. I do not think PCA is necessary here.

Action Plan:

Typically, Lasso and Elastic Net are considered more accurate and sophisticated than greedy algorithms like RFE, and as such we will prefer Lasso and Elastic Net. However, it might still be interesting to interpret and visualize the results of an RFE algorithm regardless.

Luckily, the estimators we will be using often offer Lasso and Elastic Net regularization as built-in options, which makes our job much easier. For the Random Forest, let's try the RFE method and see what comes of it.

Standardizing VS Normalizing

Before running any models, we should consider whether our data needs to be standardized or normalized. Both Random Forest and Gradient Boosting do not require any standardization or normalization given the nature of the algorithms. However, we should preprocess our data for both Logistic Regression and SVM. Luckily, we can add a normalizer or standardizer to our model pipelines as required. Given the nature of this data, I believe normalizing the data, which rescales our numeric features to the range 0 to 1, will be more appropriate than standardizing it.

Outliers

I am currently building models with data that includes outliers and am assuming that any outliers are not due to error. However, I think it would be helpful to also create a training set that excludes outliers and see how a model trained on it performs on our test set. Given the time constraint, I will not be doing that in this analysis, although ordinarily I would like to compare models trained on data including outliers with models trained on data excluding them.
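
For completeness, here is a minimal sketch of how I would build that outlier-free training set, using a standard 1.5 x IQR rule on the numeric columns (not run in this analysis; X_train and y_train are created further below).

### Sketch: an outlier-free training set via a 1.5 x IQR rule (not used in this analysis)

# numeric = X_train.select_dtypes(include='number')
# q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
# iqr = q3 - q1
# mask = ~((numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))).any(axis=1)
# X_train_no_outliers, y_train_no_outliers = X_train[mask], y_train[mask]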

Let's dive in!

In [1609]:
### Importing libraries and functions

import numpy as np
from collections import Counter

from sklearn import metrics
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor,
                              GradientBoostingClassifier)
from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, precision_recall_curve
from sklearn.model_selection import (GridSearchCV, StratifiedKFold, train_test_split,
                                     cross_validate, cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

Although I was not planning on it, I ended up computing a correlation matrix anyway, just out of curiosity.

In [1610]:
### Correlation Matrix

correlation_matrix = app_df.copy().drop('label', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
In [1611]:
### Find correlated features

correlated_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
In [1612]:
### List of correlated_features

correlated_features
Out[1612]:
{'plus_challenges_correct'}

Train/Test Split

The approach I have chosen is to split our data into a training and test set and to use 10-fold cross-validation during the training of our models. The train/test split ratio largely depends on how many data points are needed to accurately assess model performance. Roughly 2,000 points seems like a reasonable number for our test set, so I will use an 80/20 train/test split.

In [1613]:
X = app_df.iloc[:, 0:16]
y = app_df.iloc[:, 16]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

As discussed earlier, we are going to explore both undersampling and oversampling techniques. The following code will help us with resampling our data.

In [1175]:
### Undersampling

rus = RandomUnderSampler(random_state=42, replacement=True)
# Resample the training predictors and target
X_rus, y_rus = rus.fit_resample(X_train, y_train)

print('original dataset shape:', Counter(y))
print('Resample dataset shape', Counter(y_rus))
original dataset shape: Counter({False: 8691, True: 1239})
Resample dataset shape Counter({False: 980, True: 980})
In [920]:
### Oversampling

ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))
Original dataset shape Counter({False: 8691, True: 1239})
Resample dataset shape Counter({False: 6964, True: 6964})

Running Our Models

The four models that we are building are Random Forest, Logistic Regression, SVM, and Gradient Boosting. I will be creating a function for each model and running it 3 times (4 x 3 = 12 runs in total). The first run of each function uses our imbalanced data, with the algorithm configured to account for the imbalance; the second uses the undersampled data; and the third uses the oversampled data. Let's get started!

Random Forest

Random Forests aggregate the results of numerous independent decision trees and belong to the family of ensemble methods called bagging (bootstrap aggregation): many decorrelated trees, each trained on a bootstrap sample of the data, are combined into a stronger learner. This is an appropriate algorithm for this data, particularly when combined with class weights to address the imbalance.

In [950]:
### Random Forest Function

def rf(x, y):
    rfc = RandomForestClassifier(random_state=101, class_weight='balanced')
    rf_clf = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(5), scoring='f1_macro')
    rf_clf.fit(x, y)
    print('Optimal number of features: {}'.format(rf_clf.n_features_))
    plot_confusion_matrix(rf_clf, X_test, y_test, normalize='true')
    return rf_clf
In [951]:
### Imbalanced Data

rf_clf = rf(X_train, y_train)
Optimal number of features: 10
In [1264]:
### Undersampling

rf_clf1 = rf(X_rus, y_rus)
Optimal number of features: 8
In [949]:
### Oversampling

rf_clf2 = rf(X_ros, y_ros)
Optimal number of features: 15
In [1265]:
### RFE Visualization

plt.figure(figsize=(16, 9))
plt.title('Recursive Feature Elimination with Cross-Validation', fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Number of features selected', fontsize=14, labelpad=20)
plt.ylabel('% Correct Classification', fontsize=14, labelpad=20)
plt.plot(range(1, len(rf_clf.grid_scores_) + 1), rf_clf.grid_scores_, color='#303F9F', linewidth=3)

plt.show()
In [1266]:
### List of variables that were eliminated

X.columns[rf_clf.support_ == False]
Out[1266]:
Index(['initial_payment', 'new_language', 'initial_tier_1', 'initial_tier_6',
       'activity_label_reactivated_user', 'activity_label_ressurected_user'],
      dtype='object')
In [1267]:
### Feature Importance Based on RFE Algorithm

dset = pd.DataFrame()
dset['attr'] = X.columns[rf_clf.support_ == True]
dset['importance'] = rf_clf.estimator_.feature_importances_
dset = dset.sort_values(by='importance', ascending=False)

plt.figure(figsize=(16, 14))
plt.barh(y=dset['attr'], width=dset['importance'], color='#1976D2')
plt.title('RFECV - Feature Importances', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()
In [1390]:
### TPR vs FPR Plot

disp = metrics.plot_roc_curve(rf_clf, X_test, y_test,name = 'RF Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(rf_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'RF RUS')
ax1 = plt.gca()
disp = metrics.plot_roc_curve(rf_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'RF ROS')

Logistic Regression

Logistic Regression is similar to linear regression except in a few key ways, one being that it outputs probabilities. Logistic Regression models the log-odds and uses an iterative maximum likelihood method rather than least squares to fit the final model.

In [1354]:
### Logistic Regression Function

def log(x, y):
    from sklearn.model_selection import GridSearchCV
    
    parameters = {
                    'sgdclassifier__loss': ['log'],
                    'sgdclassifier__penalty': ['elasticnet', 'l1'],
                    'sgdclassifier__class_weight': ['balanced'],
                    }

    log_clf = make_pipeline(
                    MinMaxScaler(),
                    SGDClassifier(),
                            )

    log_clf = GridSearchCV(log_clf, parameters, cv = 10)

    log_clf.fit(x, y)
    plot_confusion_matrix(log_clf, X_test, y_test, normalize = 'true')
    return log_clf
In [1355]:
### Imbalanced Data

log_clf = log(X_train, y_train)
In [1356]:
### Undersampling

log_clf1 = log(X_rus, y_rus)
In [1357]:
### Oversampling

log_clf2 = log(X_ros, y_ros)
In [1389]:
### TPR vs FPR Curve

disp = metrics.plot_roc_curve(log_clf, X_test, y_test, name = 'LR Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(log_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'LR RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(log_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'LR ROS')

SVM

A Support Vector Machine (SVM), in the binary classification setting used here, is based on the idea of finding the hyperplane that best separates the dataset into two classes.

In [1363]:
### SVM Algorithm

def svm(x,y):
    from sklearn.model_selection import GridSearchCV

    parameters = {
                    'sgdclassifier__loss': ['hinge'],
                    'sgdclassifier__penalty': ['l1'],
                    'sgdclassifier__class_weight': ['balanced'],
                    }

    svm_clf = make_pipeline(
                    MinMaxScaler(), 
                    SGDClassifier(),
                            )

    svm_clf = GridSearchCV(svm_clf, parameters, cv = 10)

    svm_clf.fit(x, y)
    plot_confusion_matrix(svm_clf, X_test, y_test, normalize = 'true')
    return svm_clf
In [1364]:
### Imbalanced Data

svm_clf = svm(X_train, y_train)
In [1365]:
### Undersampling

svm_clf1 = svm(X_rus, y_rus)
In [1366]:
### Oversampling

svm_clf2 = svm(X_ros, y_ros)
In [1388]:
### TPR vs FPR Curves
from pylab import rcParams
rcParams['figure.figsize'] = 13, 7

disp = metrics.plot_roc_curve(svm_clf, X_test, y_test, name = 'SVM Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(svm_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'SVM RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(svm_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'SVM ROS')

Gradient Boosting

Gradient Boosting belongs to the family of boosting algorithms. It is similar to Random Forest in that it is an ensemble of decision trees, but with a key difference: trees are built sequentially, and each new tree attempts to correct the errors made by the previous ones.

In [1614]:
### Gradient Boosting Algorithm

def gb(x,y):

    parameters = {'gradientboostingclassifier__loss': ['deviance', 'exponential'],
                  'gradientboostingclassifier__max_features': ['auto', 'sqrt', 'log2']
                 }

    gb_clf = make_pipeline(
                    GradientBoostingClassifier(random_state=0),
                            )
    
    gb_clf = GridSearchCV(gb_clf, parameters, cv = 10)

    gb_clf.fit(x, y)
    plot_confusion_matrix(gb_clf, X_test, y_test, normalize = 'true')
    
#     y_pred = gb_clf.predict_proba(X_test)
#     precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])
#     return gb_clf, (precision, recall, thresholds)
    return gb_clf
In [1615]:
### Imbalanced Data

gb_clf = gb(X_train, y_train)
In [1616]:
### Undersampling

gb_clf1 = gb(X_rus, y_rus)
In [1617]:
### Oversampling

gb_clf2 = gb(X_ros, y_ros)
In [1619]:
### TPR vs FPR Curve

disp = metrics.plot_roc_curve(gb_clf, X_test, y_test, name = 'GB Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(gb_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'GB RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(gb_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'GB ROS')
In [1405]:
from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
a = 0.7

# Plot all twelve fitted models on a single set of axes
models = [(rf_clf, 'RF Balanced'), (rf_clf1, 'RF RUS'), (rf_clf2, 'RF ROS'),
          (log_clf, 'LR Balanced'), (log_clf1, 'LR RUS'), (log_clf2, 'LR ROS'),
          (svm_clf, 'SVM Balanced'), (svm_clf1, 'SVM RUS'), (svm_clf2, 'SVM ROS'),
          (gb_clf, 'GB Balanced'), (gb_clf1, 'GB RUS'), (gb_clf2, 'GB ROS')]

disp = metrics.plot_roc_curve(models[0][0], X_test, y_test, name=models[0][1])
ax = plt.gca()
for clf, name in models[1:]:
    metrics.plot_roc_curve(clf, X_test, y_test, ax=ax, alpha=a, name=name)

plt.title('True Positive Rate VS False Positive Rate')
Out[1405]:
Text(0.5, 1.0, 'True Positive Rate VS False Positive Rate')

It seems that, with regard to working with the imbalanced data (using class weights) versus undersampling versus oversampling, no approach was conclusively better than the others.

However, it is clear that one algorithm performed better than all the others. As we can see above, Gradient Boosting performed much better than the other algorithms in terms of AUC, and at essentially every point its TPR/FPR curve sits above those of the other algorithms.

We can likely improve the Gradient Boosting model further by incorporating hyperparameter tuning; a sketch of the kind of search I would run is below.
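
The grid below is illustrative (the learning_rate, n_estimators, max_depth, and subsample values are not tuned choices); the fit line is left commented out because a full 10-fold search over this grid is expensive.

### Sketch: a wider hyperparameter search for the Gradient Boosting model (illustrative grid)

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
    'max_depth': [2, 3, 4],
    'subsample': [0.7, 1.0],
}
gb_search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                         cv=10, scoring='roc_auc')
# gb_search.fit(X_train, y_train)   # expensive; run when compute time allows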

How well can our model predict whether an individual will resubscribe?

This is an interesting question. So far, we have been using the AUC under the TPR/FPR curve to compare models. It is a relatively simple and straightforward metric.

However, our model's threshold has not been configured to minimize False Positive and False Negative costs. So before we can describe how well our model performs, we must take into account the costs associated with both False Positive and False Negative outcomes. For example, we could configure our model so that it correctly predicts every single resubscriber, but likely at the cost of misclassifying many individuals who have left the platform as resubscribers. This could be thought of as a great model if the False Positive figure does not matter, but for most practical applications it would not be a good model, because there would be costs associated with False Positives. Quantifying those costs and then configuring our models to minimize total cost is crucial.

Practically, another way of describing the effectiveness of our models is to hold TPR, FPR, or a similar metric constant so as to define our threshold. We would then configure our model to meet that requirement and report other, more tangible metrics that describe its effectiveness. For example, if a Product Manager were to ask, "Describe for me the effectiveness of your model given that I want to correctly identify at least 80% of resubscribers," then I would first configure my model to meet that requirement. Then I might use Precision (the fraction of predicted resubscribers who actually resubscribe), the False Positive Rate, or a similar measure to explain the drawbacks or costs. Continuing with this example, I might say that I can indeed correctly identify approximately 80 percent of all resubscribers, but that only about 25% of my positive predictions will be correct. I run through this exact example in the cell below.

In [1452]:
### Precision, Recall, Threshold given that Recall > 0.8 and model is gb_clf

# Precision-recall curve for the Gradient Boosting model on the test set
scores = precision_recall_curve(y_test, gb_clf.predict_proba(X_test)[:, 1])

for i in range(len(scores[1])):
    if scores[1][i] <= 0.8:
        print('Precision is ' + str(scores[0][i-1].round(4)))
        print('Recall is ' + str(scores[1][i-1].round(4)))
        print('Threshold is ' + str(scores[2][i-1].round(4)))
        break
Precision is 0.2533
Recall is 0.8031
Threshold is 0.0944

What variables are most useful for such a prediction?

One of our objectives is to review which variables are most useful for our prediction. Luckily, there are some simple methods we can use to determine variable importance. With the Random Forest and Gradient Boosting algorithms, we can visualize feature importance based on how much each feature contributes to decreasing the weighted impurity. I have provided the variable importances from both a Random Forest and Gradient Boosting as a compare and contrast.

In [1167]:
### Feature Importance Visualize Function

def feature_importance(forest): 
    
    forest.fit(X_train, y_train)
    importances = forest.feature_importances_
    try:
        std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    except:
        pass
    indices = np.argsort(importances)[::-1]
    
    l1 = []
    l2 = []
    l3 = []
    
    for f in range(X_train.shape[1]):
        l1.append(f+1)
        l2.append("feature %d (%f)" % (indices[f], importances[indices[f]]))
        l3.append(X_train.columns[indices[f]])

    # Plot the impurity-based feature importances of the forest
    plt.figure(figsize=(15, 8), dpi= 300, facecolor='w', edgecolor='k')
    plt.title("Feature importances")
    try:
        plt.bar(range(X_train.shape[1]), importances[indices],color="r", align="center", yerr=std[indices])
    except:
        plt.bar(range(X_train.shape[1]), importances[indices],color="r", align="center")
    plt.xticks(range(X_train.shape[1]), l3)
    plt.xticks(rotation=40)
    plt.xlim([-1, X_train.shape[1]])
    plt.show()
    
    table = pd.DataFrame([l1,l2,l3]).T
    table.columns = ['Ranking', 'Feature Number', 'Feature Name']
    return table
In [1169]:
### Random Forest Feature Importance

feature_importance(RandomForestRegressor(n_estimators=250,random_state=0))  
Feature ranking:
Out[1169]:
Ranking Feature Number Feature Name
0 1 feature 7 (0.238237) learning_time
1 2 feature 4 (0.164576) offer_delay
2 3 feature 6 (0.121570) mastery_quiz_ads
3 4 feature 3 (0.102175) plus_challenges_correct
4 5 feature 2 (0.101353) plus_challenges_attempted
5 6 feature 5 (0.091485) health_empty_ads
6 7 feature 1 (0.079264) initial_duration
7 8 feature 12 (0.029966) initial_tier_referral
8 9 feature 8 (0.015301) new_language
9 10 feature 10 (0.012134) initial_tier_12
10 11 feature 9 (0.011623) initial_tier_1
11 12 feature 13 (0.007197) activity_label_current_user
12 13 feature 14 (0.006700) activity_label_reactivated_user
13 14 feature 15 (0.006528) activity_label_ressurected_user
14 15 feature 0 (0.006195) initial_payment
15 16 feature 11 (0.005696) initial_tier_6
In [1170]:
### Gradient Boosting Feature Importance

feature_importance(GradientBoostingClassifier(random_state=0))  
Feature ranking:
Out[1170]:
Ranking Feature Number Feature Name
0 1 feature 7 (0.336907) learning_time
1 2 feature 1 (0.195514) initial_duration
2 3 feature 4 (0.156589) offer_delay
3 4 feature 12 (0.109228) initial_tier_referral
4 5 feature 6 (0.103009) mastery_quiz_ads
5 6 feature 5 (0.036880) health_empty_ads
6 7 feature 2 (0.021992) plus_challenges_attempted
7 8 feature 3 (0.019940) plus_challenges_correct
8 9 feature 15 (0.010515) activity_label_ressurected_user
9 10 feature 10 (0.002621) initial_tier_12
10 11 feature 14 (0.001803) activity_label_reactivated_user
11 12 feature 8 (0.001549) new_language
12 13 feature 13 (0.001525) activity_label_current_user
13 14 feature 0 (0.001104) initial_payment
14 15 feature 11 (0.000635) initial_tier_6
15 16 feature 9 (0.000189) initial_tier_1

If we take a look at the top seven most influential features, the three that are (at least partially) within the company's control are:

  • offer_delay
  • mastery_quiz_ads
  • health_empty_ads

These features can be altered by Duolingo directly. For example, Duolingo can decide exactly when for each user to offer a second trial. They can also decide exactly how many ads of what kind to show to each user.

Let's start by examining the offer_delay feature more closely.

Offer_Delay
In [1453]:
### offer_delay variable

fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

If we go back to our visualization, we can see that offer_delay is negatively correlated with the probability that one will resubscribe. This is crucial because it is in Duolingo's control when they offer the second subscription. So if we think of the predicted probability as a function of x, where x is offer_delay and the other feature information is held constant, and we are trying to maximize the probability that an individual resubscribes, then what if we tried using our model to find the optimal value of offer_delay? It may be a simplistic approach, and there are likely better ways (perhaps proper optimization techniques) to find the optimal value, but it could offer a window into what that value might be. Let's try it!

In [1565]:
def optimize_offer(i):
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(300)]
    df1['offer_delay'] = [i for i in range(0,300)]
    fig = px.scatter(x=[i for i in range(0,300)], y=gb_clf.predict_proba(df1)[:,1])
    fig.show()
    print('Sample Data')
    return df1.head()
In [1566]:
### Optimize for sample row 8 in our training set

optimize_offer(8)
Sample Data
Out[1566]:
initial_payment initial_duration plus_challenges_attempted plus_challenges_correct offer_delay health_empty_ads mastery_quiz_ads learning_time new_language initial_tier_1 initial_tier_12 initial_tier_6 initial_tier_referral activity_label_current_user activity_label_reactivated_user activity_label_ressurected_user
3991 False 14 432 369 0 0.333 0.111 14.51 False 1 0 0 0 1 0 0
3991 False 14 432 369 1 0.333 0.111 14.51 False 1 0 0 0 1 0 0
3991 False 14 432 369 2 0.333 0.111 14.51 False 1 0 0 0 1 0 0
3991 False 14 432 369 3 0.333 0.111 14.51 False 1 0 0 0 1 0 0
3991 False 14 432 369 4 0.333 0.111 14.51 False 1 0 0 0 1 0 0

This reveals that, for this sample point, the optimal offer_delay value is 1, and that it can make a relatively large difference (~25%) in our probability estimate. While it's likely that 1 is the optimal value for most data points, we could potentially use this kind of method to explain how much offer_delay influences our final prediction.

Intuitively, it makes sense that being offered a second free trial immediately after the first one ends increases the chance that one will sign up for it.

Mastery Quiz Ads, Health Empty Ads
In [1540]:
### mastery_quiz_ads variable

fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')
The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus

Two other variables that seem to have some influence on our prediction are the number of Mastery Quiz and Health Empty ads (Mastery Quiz ads more so). We should be able to use an approach similar to the one above to see how mastery quiz ads affect our prediction.

In [1568]:
def optimize_ads(i):
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(20)]
    df1['mastery_quiz_ads'] = [np.round(i, 1) for i in np.linspace(0, 2, 20)]
    fig = px.scatter(x=[np.round(i, 1) for i in np.linspace(0, 2, 20)], y=gb_clf.predict_proba(df1)[:,1])
    fig.show()
    print('Sample Data')
    return df1.head()
    
In [1569]:
### Optimize for sample row 9 in our training set

optimize_ads(9)
Sample Data
Out[1569]:
initial_payment initial_duration plus_challenges_attempted plus_challenges_correct offer_delay health_empty_ads mastery_quiz_ads learning_time new_language initial_tier_1 initial_tier_12 initial_tier_6 initial_tier_referral activity_label_current_user activity_label_reactivated_user activity_label_ressurected_user
4201 False 14 2107 1951 1 0.0 0.0 5.98 False 1 0 0 0 1 0 0
4201 False 14 2107 1951 1 0.0 0.1 5.98 False 1 0 0 0 1 0 0
4201 False 14 2107 1951 1 0.0 0.2 5.98 False 1 0 0 0 1 0 0
4201 False 14 2107 1951 1 0.0 0.3 5.98 False 1 0 0 0 1 0 0
4201 False 14 2107 1951 1 0.0 0.4 5.98 False 1 0 0 0 1 0 0

In this example, there may not be a single optimal number of ads that applies to all points; based on some experimentation, it seems to vary by data point. We can perform the same exercise for health_empty_ads as well, as sketched below.
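
A hypothetical helper that mirrors optimize_ads above, simply swapping in the health_empty_ads column, would look like this:

### Sketch: the analogous probe for health_empty_ads (mirrors optimize_ads above)

def optimize_health_ads(i):
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(20)]
    df1['health_empty_ads'] = [np.round(v, 1) for v in np.linspace(0, 2, 20)]
    fig = px.scatter(x=[np.round(v, 1) for v in np.linspace(0, 2, 20)],
                     y=gb_clf.predict_proba(df1)[:, 1])
    fig.show()
    return df1.head()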

Product Insights:

There are several features that are strongly predictive of our target variable but that we do not have direct control over or cannot immediately change. However, being aware that these are some of the more predictive features in our model will help us understand which aspects of the product to prioritize.

learning_time:

Learning_time is the most influential feature based on our analysis. The more time customers spend on average per day, the more likely they are to resubscribe. According to our GB feature importance plot, this was given a 30% feature importance.

plus_challenges_attempted/plus_challenges_correct

The greater the number of questions answered or answered correctly, the higher the probability the customer will resubscribe. According to our RF Feature Importance Plot, this was given 10% Feature Importance.

initial_duration:

The longer the initial subscription, the higher the probability the customer will resubscribe. According to our GB Feature Importance plot, this was given 20% feature importance.

Variables that we can optimize for:

There are three variables that I have identified which influence our prediction and which as an organization we may have some control over and can therefore optimize for:

Offer_Delay:

If our goal is solely to maximize the number of individuals who sign up for the second free trial, then we could potentially use our model to predict the optimal delay (in days) for each user. Using A/B experimentation, we could then see whether our predictions are indeed increasing the number of resubscriptions. To extend this into an ongoing experiment, we could employ a hybrid Multi-Armed Bandit approach in which, after set periods of time, we evaluate our predictions and adjust our models to better optimize offer_delay.
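
To make the bandit idea concrete, here is a minimal epsilon-greedy sketch; the candidate delays, the epsilon value, and the reward signal (whether the user resubscribed) are all hypothetical choices for illustration.

### Sketch: an epsilon-greedy bandit over candidate offer delays (hypothetical arms and reward)

candidate_delays = [1, 7, 14, 30]            # hypothetical arms, in days
counts = np.zeros(len(candidate_delays))     # times each delay has been offered
successes = np.zeros(len(candidate_delays))  # resubscriptions observed for each delay
epsilon = 0.1                                # exploration rate (hypothetical)
rng = np.random.default_rng(0)

def choose_delay():
    # Explore with probability epsilon, otherwise exploit the best-performing delay so far
    if rng.random() < epsilon or counts.sum() == 0:
        return int(rng.integers(len(candidate_delays)))
    return int(np.argmax(successes / np.maximum(counts, 1)))

def record_outcome(arm, did_resubscribe):
    counts[arm] += 1
    successes[arm] += int(did_resubscribe)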

Mastery Quiz Ads/Health Empty Ads:

We can utilize an approach similar to the A/B Testing and Hybrid Multi-Armed Bandit approach I described above. However, there are a different set of constraints to take into account.

If the cost of showing a user an ad exceeds the potential value of gaining a resubscription, then it may not be cost effective. A cost-benefit analysis is therefore crucial for this experiment.
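
As a final back-of-the-envelope illustration of that trade-off (every number below is a hypothetical placeholder, not a measured figure):

### Sketch: expected-value check for showing one more ad (all figures are hypothetical placeholders)

cost_per_extra_ad = 0.002        # hypothetical cost (e.g., engagement / churn risk) of one more ad
value_of_resubscription = 30.0   # hypothetical expected revenue from one resubscription
uplift_per_extra_ad = 0.0005     # hypothetical increase in P(resubscribe) from one more ad

expected_gain = uplift_per_extra_ad * value_of_resubscription
print(f"Expected gain per extra ad: ${expected_gain:.4f} vs. cost: ${cost_per_extra_ad:.4f}")
# Show the extra ad only while the expected gain exceeds the cost.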