Duolingo Take Home Project¶

Ujjawal Madan
October 8th 2020 alt text

Problem Statement¶

We are interested in “resubscribe-eligible” users; Duolingo users that sign up for Duolingo Plus (our subscription service) and then expire from their subscription. Specifically, we are interested in whether, and why, these resubscribe-eligible users sign up for a second subscription.

I will be tackling this problem in multiple stages:

Exploratory Data Analysis and Visualization
Preprocessing
Model-Building
Evaluating our models
Evaluating Variable Importance
Insights and Next Steps

Currently, the data we are working with is App usage data on resubscribe-eligible users who were offered a promotional second free trial between July 8, 2020 and August 9, 2020.

Let's start by importing the data:

### Import libraries

import plotly.io as pio
pio.renderers.default = "notebook"

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Read in Data

#Relative path. please modify
df = pd.read_csv('usage_data.csv')

Exploratory Data Analysis and Visualization

Exploring the Data¶

Let's take some time to visually explore our variables to get a better understanding of them. For these visualizations, I will be using the Plotly library.

### Create a copy to be used for our visualization

df_viz = df.copy()

### Removing Nulls

df_viz = df_viz[-df_viz.isnull().any(axis=1)]
df_viz = df_viz.reset_index().drop(['index'], axis = 1)

### Changing the initial tier variables so that we can read it as a categorical variable

d = {'6': 'Six', '12': 'Twelve', '1': 'One', 'referral': 'Referral'}
df_viz['initial_tier_cat'] = df_viz['initial_tier'].apply(lambda x: d[x])

### initial_tier variable

fig = px.histogram(df_viz, x="initial_tier_cat", facet_col="did_resubscribe", title="Initial Tier Variable")
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('What tier the user signed up for on their first Plus free trial (1-month, 6-month, 12-month, or a "referral" tier)')

What tier the user signed up for on their first Plus free trial (1-month, 6-month, 12-month, or a "referral" tier)

As we can see, there does seem to be a difference in our histograms between those who resubscribe and those who don't. Of those who seem to subscribe through the referall process, it seems that a smaller proportion of them end up resubscribing. For those who had 12 month subscriptions, there is a higher proportion that end up resubscribing as well.

### intial_payment variable

fig = px.histogram(df_viz, x="initial_payment", facet_col="did_resubscribe", title = 'Initial Payment Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('Whether the user paid money on their first Plus subscription. Users pay when they reach the end of the free trial period without cancelling. Users do not pay when they cancel before the free trial ends, or if they subscribe to Plus via referral')

Whether the user paid money on their first Plus subscription. Users pay when they reach the end of the free trial period without cancelling. Users do not pay when they cancel before the free trial ends, or if they subscribe to Plus via referral

There does not seem to be a significant difference in the proportions of those who resubscribe between those who initially paid and those who didn't.

### intial_duration variable

fig = px.histogram(df_viz, x="initial_duration", facet_col="did_resubscribe", title = 'Initial Duration Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days the user's first Plus subscription lasted, including the free trial")

The number of days the user's first Plus subscription lasted, including the free trial

The initial_duration histograms are quite similar to each other. However, it does seem interesting that for those of whom whose trial lasted between 10 and 14 days, the proportion of people resubscribing was significantly higher compared to people who had different intial_durations.

### plus challenges attempted

fig = px.histogram(df_viz, x="plus_challenges_attempted", facet_col="did_resubscribe", title = 'Plus Challenges Attempted Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered on Duolingo during their first Plus subscription')

The number of questions the user answered on Duolingo during their first Plus subscription

After zooming in on the above plot, we can see that as the number of plus_challenges_attempted increases, the proportion of those individuals resubscribing also increases. This suggests that there is a correlation between the number of plus_challenges attempted and the probability an individual will resubscribe. There may be an underlying causal relationship as well but we would need to investigate further to make any reasonable conclusions.

### plus_challenges_correct variable

fig = px.histogram(df_viz, x="plus_challenges_correct", facet_col="did_resubscribe", title = 'Plus Challenges Correct Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of questions the user answered correctly on Duolingo during their first Plus subscription')

The number of questions the user answered correctly on Duolingo during their first Plus subscription

The above visualization is very similar to the plus_challenges_attempted visualization and I think we can use similar reasoning to say that there may be a positive correlation between plus_challenges_correct and the probability an individual will resubscribe.

### offer_delay variable

fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of days between when the user's first Plus subscription expired and when they were offered the second free trial")

The number of days between when the user's first Plus subscription expired and when they were offered the second free trial

The above visualization is quite interesting to me. The two distributions which refer to the time in between the expiration of the first subscription and the offer for the second does seem to be different for both groups (resubscribe = False, resubscribe = True). It appears that minimizing the delay may indeed be linked to the probability one will resubscribe given that almost all individuals who resubscribe have an offer_delay of less than 20 days. This is particulary interesting because it is the one variable that may be within the organization's control. We will definitely come back to this.

### activity label variable

fig = px.histogram(df_viz, x="activity_label", facet_col="did_resubscribe", title = 'Activity Label Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('A measure of the user\'s activity at the point they are offered a second free trial. "Current users" were active at least once in the previous week, "reactivated users" were inactive for more than a week but less than a month, and "resurrected users" were inactive for more than a month')

A measure of the user's activity at the point they are offered a second free trial. "Current users" were active at least once in the previous week, "reactivated users" were inactive for more than a week but less than a month, and "resurrected users" were inactive for more than a month

Unsurprisingly, for those individuals who are current users, a higher proportion of them will resubscribe in comparison to the proportions of resurrected users and reactivated users who resubscribe. Intuitively this makes sense but it is definitely helpful to validate our intuition with these findings.

### mastery_quiz_ads variable

fig = px.histogram(df_viz, x="health_empty_ads", facet_col="did_resubscribe", title = 'Health Empty Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "health empty" ads a user saw for Plus per active day during the resubscribe period. A "health empty" ad is an ad that advertises the unlimited Hearts feature of Plus, which triggers whenever a user runs out of hearts')

The number of "health empty" ads a user saw for Plus per active day during the resubscribe period. A "health empty" ad is an ad that advertises the unlimited Hearts feature of Plus, which triggers whenever a user runs out of hearts

Based on the above plot, there does not seem to be a significant difference between the number of health_empty ads and the proportions of those resubscribing.

### mastery_quiz_ads variable

fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')

The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus

There does appear to be a weak link between the number of master quiz ads a user saw and the probability they will resubscribe.

### Learning Time Variable

fig = px.histogram(df_viz, x="learning_time", facet_col="did_resubscribe", title = 'Learning Time Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("The number of minutes per active day a user spent learning on Duolingo during the resubscribe period")

The number of minutes per active day a user spent learning on Duolingo during the resubscribe period

Given the above plot, it appears that learning time may be very influential in helping us to predict whether a person will or will not resubscribe. The proportion of those individuals who spend more time on the app per day and who resubscribe is higher than those do not spend as much time on the app. Again, this makes intuitive sense...

### New language variable

fig = px.histogram(df_viz, x="new_language", facet_col="did_resubscribe", title = 'New Language Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print("Whether a user started learning a language during the resubscribe period they had never before learned on Duolingo or not")

Whether a user started learning a language during the resubscribe period they had never before learned on Duolingo or not

Whether a user started a new langauge or not does not appear to be a very relavent factor in explaining the difference between those who resubscribe and those who don't.

### post_ativity variable

fig = px.histogram(df_viz, x="post_activity", facet_col="did_resubscribe", title = 'Post Activity Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('If a user started a second free trial, the number of days they were active on Duolingo during the first week of the second free trial')

If a user started a second free trial, the number of days they were active on Duolingo during the first week of the second free trial

Let's dig a bit deeper into our post_activity variable. It makes a lot of sense that for those of whom that haven't signed up for a second trial, they have not spent any time on the app during their hypothetical second trial. For those of whom who did sign up, the data indicates that the distribution is uniform. The number of individuals who were active for x days seems to be relatively equal for values of x from 0 to 7.

We should however note at this point that it would not make sense to include this in any predictive model given that we would not have this data at the time of implementation. There is no way we can know post_activity if we do not know whether an individual has resubscribed or not. This could be used as the target variable for a regression analysis but is not applicable in our case where we are trying to predict whether or not an individual will resubscribe.

### Resubscribed or Not

fig = px.histogram(df_viz, x="did_resubscribe", title = 'Resubscribed or Not')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

df_viz['did_resubscribe'].mean()

0.12477341389728097

Last but not least, let's look at the proportion of individuals who resubscribe. The above proportion tells us that there is a class imbalance and this is crucial information for when we later build our predictive model.

Given the data, it makes sense for us to use the did_resubscribe variable in our analysis. As our goal is to predict whether or not a individual will resubscribe given what we know about them, this variable provides us with the information of whether or not they resubscribed.

Visualizing my own variables¶

Just out of curiosity, I was curious to see what the problems_per_day variable might look like (Average Attempted Problems per day) and also the problems_ratio_correct variable (Correctly Answered:Total Answered)

df_viz['problems_per_day'] = df_viz['plus_challenges_attempted']/df_viz['initial_duration']
df_viz['problems_ratio_correct'] = df_viz['plus_challenges_correct']/df_viz['plus_challenges_attempted']

### problems_per_day

fig = px.histogram(df_viz, x="problems_per_day", facet_col="did_resubscribe", title = 'Average Problems Attempted / Day')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

### problems_ratio_correct

fig = px.histogram(df_viz, x="problems_ratio_correct", facet_col="did_resubscribe", title = 'Correctly Answered / All Answered')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

How can there be values greater than 1?

### Questions Correct/Questions Attempted

sum(df_viz['plus_challenges_correct']>df_viz['plus_challenges_attempted'])

111

It seems that there may be issues without or plus_challenges_correct and plug_challenges_attempted data. I don't think anyone can get more questions correct than they have answered. Interestingly, I would not have made this insight if I hadn't created my own variables.

Preprocessing

Let's get to it!

### Create a copy to be used during the model building phase

app_df = df.copy()

Preparing our Data¶

app_df.head()

Our data looks correct. Let's keep exploring!

app_df.shape

(10077, 14)

app_df.describe()

NA Values¶

We should first check to see if there are any NA (Missing) values.

# Are there nan values?

app_df[app_df.isnull().any(axis=1)]

It seems that are a numerous rows that contain missing rows. There are a number of methods we can use to resolve this situation. The first step however requires us to look into whether there was an bias in the missing data. That is, do the data points with missing feature information form a sample that is represenative of the entire population? Or is it possible that there is some pattern in the missing data? Finding the reason for why we are missing data can provide us with a lot of information. At this time, I cannot conclude whether or not there is indeed a pattern. However, a cursory analysis can at least tell us at least whether the proportion of points labeled True differ greatly from the non-missing data points.

df[df.isnull().any(axis=1)]['did_resubscribe'].mean()

0.1360544217687075

It seems that it does not differ greatly. Let's discuss possible strategies to deal with the missing data. Some of the more popular methods include removing missing values, using categorical variables to indicate missing data or estimating the missing data and imputing the missing fields with our own values (possibly also with perturbation). If we assume that there is no bias in the missing data, then it may be best to actually remove data points that contain missing values. As the # of incomplete data points: # of complete data points ration is quite small (~0.015), this method would avoid introducing errors whilst not shrinking our dataset significantly. If it turns out there is bias in the missing data, we may need to approach this differently.

### Remove missing values

app_df = app_df[-app_df.isnull().any(axis=1)]
app_df = app_df.reset_index().drop(['index'], axis = 1)

Let's examine the datatypes of our variables.

app_df.dtypes

user_id                       object
initial_tier                  object
initial_payment               object
initial_duration             float64
plus_challenges_attempted    float64
plus_challenges_correct      float64
offer_delay                  float64
activity_label                object
health_empty_ads             float64
mastery_quiz_ads             float64
learning_time                float64
new_language                  object
post_activity                float64
did_resubscribe                 bool
dtype: object

We should convert our variables into the proper format.

app_df['initial_tier'] = app_df['initial_tier'].astype('category')
app_df['initial_payment'] = app_df['initial_payment'].astype('bool')
app_df['initial_duration'] = app_df['initial_duration'].astype('Int32')
app_df['plus_challenges_attempted'] = app_df['plus_challenges_attempted'].astype('Int32')
app_df['plus_challenges_correct'] = app_df['plus_challenges_correct'].astype('Int32')
app_df['offer_delay'] = app_df['offer_delay'].astype('Int32')
app_df['activity_label'] = app_df['activity_label'].astype('category')
app_df['new_language'] = app_df['new_language'].astype('bool')
app_df['post_activity'] = app_df['post_activity'].astype('Int32')

Handling Categorical Data¶

Given that there will be a number of algorithms we will be running which do not accept categorical data, we should explore solutions related to this issue.

One-Hot Encoding is a popular approach to convering categorical data into a numeric format and given the relatively few categories in both of the categorical variables in our data, I think this is an appropriate choice in this instance.

If it turns out that our best performing model is one that can handle categorical data, we can try running it on data that has not been one-hot encoded.

app_df[['initial_tier_1', 'initial_tier_12','initial_tier_6','initial_tier_referral']] = pd.get_dummies(app_df['initial_tier'])

app_df[['activity_label_current_user', 'activity_label_reactivated_user', 'activity_label_ressurected_user']] = pd.get_dummies(app_df['activity_label'])

app_df['label'] = app_df['did_resubscribe'].copy()

Let's drop the original categorical columns now that we have created the dummy variables as well as the post_activity variable which we no longer need.

app_df = app_df.drop(['user_id', 'initial_tier', 'activity_label', 'did_resubscribe'], axis = 1)

app_df = app_df.drop('post_activity', axis = 1)

Model-Building

One of our primarily goals during this analysis is to use quantitative methods to determine how well we can predict whether a user will start a second free trial.

We can accomplish this by building predictive models that will classify whether a user will start a second free trial or not based on the given feature information (input data).

What kind of model are we building?¶

It is clear that this is a supervised learning problem where our goal is to try and map a specific output based on an input. The question now is, what kind of model do we want to build?

It seems to me that there are two approaches that we can take. One solution is to build a classification algorithm which, if given an input, would classify the data point as either a True or False in this case. Our goal, as clearly stated, is to determine how well we can predict whether a user will resubscribe or not.

However, we can potentially improve on this approach by providing probabilities - that is how likely is the user to resubscribe based on the input data. One advantage is we could potentially fine tune our model later on (by adjusting the threshold) to better meet the needs of the business problem. For example, if the goal is to produce a model that will identify at least 80% of all individuals who will resubscribe, then we can adjust the thresholds of our model(s) to meet that requirement while minimizing False Positive and False Negatives. Additionally, numeric probabilities provide us with much more information than just binary class labels. In one example, we could potentially create customized multi-class labels later and more precisely distinguish between individuals who are very likely to resubscribe to individuals who are just on the cusp of resubscribing to individuals who are not at all likely to resubscribe.

Although we are not required to predict probabiltiies based on our objective, based on the inherent advantages I will create both classification and probability models.

Potential Approaches to Imbalanced Classes¶

As we now know, the dataset we are working with is not balanced among the two classes - True and False. This could lead to a variety of problems when building our model. Therefore, we should consider ahead of time how to best approach with this situation.

Class Weights:¶

One solution involved adding class-weights to our models so that it accounts and corrects for the imbalance.

Undersampling:¶

Another solutions involve using undesampling techniques whereby examples from the training dataset that belong to the majority class are removed in order to better balance the class distribution.

Oversampling:¶

This is sort of opposite solution to undersampling where examples from the training dataset that belong to the minority class are duplicated in order to better balance the class distribution.

These are all valid solutions and I don't think there is any way of knowing which one is better than the other before hand so why don't we try all three!

What Metrics do we use?¶

At this stage, it is important to think about what metric we should be maximizing for in our models as well as the metric we will be using to compare the performance of our models.

There are a number of different metrics we could maximize for in our algorithms. Accuracy is the default metric that machine learning algorithms will optimize for but given the class imbalance, it is probably more appropriate to utilize the F1-Score which is the harmonic mean of both Precision and Recall.

In regards to the metric we will be using to compare models, I believe the AUC score from a TPR FPR Curve should allow us to accurately assess the performance of and compare models.

What specific models should we use?¶

In this analysis, I will only be covering a few algorithms due to the time constraint. However, it may be worth considering other models as well at some point.

The four models that I would like to run are Random Forest, Logistic Regression, SVM and Gradient Boosting. These are models that I am personally familiar with and are also quite commonly used for classification purposes. I will be providing a short description of each model during the model-building phase.

Feature Selection¶

Feature Selection is a process before inputting our data into any models which help us determine which combination of features may be most helpful in predicting our target variable. It will also help ensure that our model is not more complicated than it needs to be - simplicity is key. There are a variety of methods which we could employ and for the purposes of being thorough, I will examine the appropriateness and the advantages/disadvantages of some of the methods that I am currently considering. There may be other methods that I have not included but could be seriously considered.

Manually Remove Features using Correlation Matrix¶

A very simple and elementary method for feature selection is by manually removing features that are correlated with one another. Given that in a pair of correlated features one will end up being redundant, the solution is to drop one. However, I think we can and should explore more sophisticated techniques for feature selection.

Greedy Algorithms¶

Greedy algorithms are commonly used for feature selection, whereby at each step the algorithm does the one thing that looks best wihtout taking future options into consideration. RFE (Recursive Feature Elimination) is one of the more popular greedy algorithms that is used for feature selection and is one that we can consider. It is similar to a backward elimination greedy algorithm where the least import predictors are removed and the model is rebuilt. The disadvantages of RFE is that it is computationally expensive as well and does not make its decisions globally (looking at all the options at the same time).

Lasso or Elastic Net Regularization¶

L1 Regularization (Lasso) and Elastic Net is a Global Optimization Technique which adds a constraint to our objective function. Depending our regularization term, it will remove factors that have little predictive power, thus reducing the number of features.

PCA (Used for Dimensionality Reduction, not Feature Selection)¶

PCA or Principal Component Analysis is a method for compressing the number of dimensions of our data. We could then potentially feed this compressed data (which has just been linearly transformed) into our model. To reiterate, this is not a feature selection method per se. Fimensionality reduction does not seems to be necessary in this case given that our variable:# of data points ratio is appropriate and our overall number of dimensions is relatively low. I do not think PCA is necessary in this case.

Action Plan:¶

Typically, Lasso and Elastic Net are thought to be more accurate and sophisticated algorithms in comparison to greedy algorithms like RFE and as such, we will prefer Lasso and Elastic Net However, it might still be interesting to interpret and visualize the results of an RFE algorithm regardless.

Luckily, the functions we will be using to build our machine learning algorithms often have both the option for Lasso and Elastic Net Regularization. This makes our job much easier. For the Random Forest, let's try using the RFE method and see what comes of it.

Standardizing VS Normalizing¶

Before running any models, we should consider whether our data may need to be standardized or normalized. Both RandomForest and XGBoost do no require any standardization or normalization given the nature of the algorithms. However, we should preprocess our data for both Logistic Regression and SVM. Luckily, we can add a normalizer or standardizer into our model pipelines as required. Given the nature of this data, I believe normalizing the data will be more appropriate than standardizing it, which will change the range of our numeric data to be between 0 and 1.

Outliers¶

I am currently building models with data that includes outliers and am assuming that any outliers are not due to error. However, I think it would be helpful to create a training set that does not include any outliers and to see how it might perform on our test set. Given the time constraint, I will not be doing that in this analysis although ordinarily I would like to compare models trained on data including outliers with models trained on data excluding outliers.

Let's dive in!

### Importing libraries and functions

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_roc_curve
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_roc_curve
from sklearn.preprocessing import MinMaxScaler 

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

Although I was not planning on, I ended up doing a correlation matrix anyway just out of curiousity.

### Correlation Matrix

correlation_matrix = app_df.copy().drop('label', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()

### Find correlated features

correlated_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

### List of correlated_features

correlated_features

{'plus_challenges_correct'}

Train/Test Split¶

The approach I have chosen to take is to split our data into a training and test set and to use 10-fold cross validation during the training of our models. The train/test split ratio largely depends on what number of data points is sufficient for us to be able to accurately assess our model(s) performance. ~2000 points seems to be a very reasonable number to me for our test set so I will use an 80/20 train/test split ratio.

X = app_df.iloc[:, 0:16]
y = app_df.iloc[:, 16]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

As discussed earlier, we are going to explore both undersampling and oversampling techniques. The following code will help us with resampling our data.

### Undersampling

rus = RandomUnderSampler(random_state=42, replacement=True)# fit predictor and target variable
X_rus, y_rus = rus.fit_resample(X_train, y_train)

print('original dataset shape:', Counter(y))
print('Resample dataset shape', Counter(y_rus))

original dataset shape: Counter({False: 8691, True: 1239})
Resample dataset shape Counter({False: 980, True: 980})

### Oversampling

ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

Original dataset shape Counter({False: 8691, True: 1239})
Resample dataset shape Counter({False: 6964, True: 6964})

Running Our Models¶

The four models that we are buidling are: Random Forest, Logistic Regression, SVM and Gradient Boosting. I will be creating a function for each model that I will be running 3 times each (4 X 3 = 12 times in total). The first time I run each function will be using our imbalanced data and running the algorithm to account for such imbalances. The second time will be with the undersampled data and the third time will be with the oversampled data. Let's get started!

Random Forest¶

Random Forests are simply the aggregated results of numerous independent decision trees and belong in the family of algorithms called boosting: combining many weak learners into strong learners. This process is done via bootstrapping and branching. This is an excellent algorithm particularly appropriate for this data given the imbalance.

### Random Forest Function

def rf(x, y):
    rfc = RandomForestClassifier(random_state=101, class_weight = 'balanced')
    rf_clf = RFECV(estimator=rfc1, step=1, cv=StratifiedKFold(5), scoring='f1_macro')
    rf_clf.fit(x, y)
    print('Optimal number of features: {}'.format(rf_clf.n_features_))
    plot_confusion_matrix(rf_clf, X_test, y_test, normalize = 'true')
    return rf_clf

### Imbalanced Data

rf_clf = rf(X_train, y_train)

Optimal number of features: 10

### Undersampling

rf_clf1 = rf(X_rus, y_rus)

Optimal number of features: 8

### Oversampling

rf_clf2 = rf(X_ros, y_ros)

Optimal number of features: 15

### RFE Visualization

plt.figure(figsize=(16, 9))
plt.title('Recursive Feature Elimination with Cross-Validation', fontsize=18, fontweight='bold', pad=20)
plt.xlabel('Number of features selected', fontsize=14, labelpad=20)
plt.ylabel('% Correct Classification', fontsize=14, labelpad=20)
plt.plot(range(1, len(rf_clf.grid_scores_) + 1), rf_clf.grid_scores_, color='#303F9F', linewidth=3)

plt.show()

### List of variables that were eliminated

X.columns[rfecv.support_ == False]

Index(['initial_payment', 'new_language', 'initial_tier_1', 'initial_tier_6',
       'activity_label_reactivated_user', 'activity_label_ressurected_user'],
      dtype='object')

### Feature Importance Based on RFE Algorithm

dset = pd.DataFrame()
dset['attr'] = X.columns[rf_clf.support_ == True]
dset['importance'] = rf_clf.estimator_.feature_importances_
dset = dset.sort_values(by='importance', ascending=False)

plt.figure(figsize=(16, 14))
plt.barh(y=dset['attr'], width=dset['importance'], color='#1976D2')
plt.title('RFECV - Feature Importances', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Importance', fontsize=14, labelpad=20)
plt.show()

### TPR vs FPR Plot

disp = metrics.plot_roc_curve(rf_clf, X_test, y_test,name = 'RF Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(rf_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'RF RUS')
ax1 = plt.gca()
disp = metrics.plot_roc_curve(rf_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'RF ROS')

Logistic Regression¶

Logistic Regression is similar to linear regression except in a few key ways, one being that it ouputs probabilities. Logistic Regression uses the log odds ratio and an iterative maximum likelihood method rather than a least squares to fit the final model.

### Logistic Regression Function

def log(x, y):
    from sklearn.model_selection import GridSearchCV
    
    parameters = {
                    'sgdclassifier__loss': ['log'],
                    'sgdclassifier__penalty': ['elasticnet', 'l1'],
                    'sgdclassifier__class_weight': ['balanced'],
                    }

    log_clf = make_pipeline(
                    MinMaxScaler(),
                    SGDClassifier(),
                            )

    log_clf = GridSearchCV(log_clf, parameters, cv = 10)

    log_clf.fit(x, y)
    plot_confusion_matrix(log_clf, X_test, y_test, normalize = 'true')
    return log_clf

### Imbalanced Data

log_clf = log(X_train, y_train)

### Undersampling

log_clf1 = log(X_rus, y_rus)

### Oversampling

log_clf2 = log(X_ros, y_ros)

### TPR vs FPR Curve

disp = metrics.plot_roc_curve(log_clf, X_test, y_test, name = 'LR Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(log_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'LR RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(log_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'LR ROS')

SVM¶

Support Vector Machine (SVM), specifically in this binary classification is based on the idea of finding a hyperplane that best divides a dataset into two classes.

### SVM Algorithm

def svm(x,y):
    from sklearn.model_selection import GridSearchCV

    parameters = {
                    'sgdclassifier__loss': ['hinge'],
                    'sgdclassifier__penalty': ['l1'],
                    'sgdclassifier__class_weight': ['balanced'],
                    }

    svm_clf = make_pipeline(
                    MinMaxScaler(), 
                    SGDClassifier(),
                            )

    svm_clf = GridSearchCV(svm_clf, parameters, cv = 10)

    svm_clf.fit(x, y)
    plot_confusion_matrix(svm_clf, X_test, y_test, normalize = 'true')
    return svm_clf

### Imbalanced Data

svm_clf = svm(X_train, y_train)

### Undersampling

svm_clf1 = svm(X_rus, y_rus)

### Oversampling

svm_clf2 = svm(X_ros, y_ros)

### TPR vs FPR Curves
from pylab import rcParams
rcParams['figure.figsize'] = 13, 7

disp = metrics.plot_roc_curve(svm_clf, X_test, y_test, name = 'SVM Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(svm_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'SVM RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(svm_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'SVM ROS')

Gradient Boosting¶

Gradient Boosting is an algorithm that belongs in the family of boosting algorithms. It's similar to Random Forest except in a few key ways- the primary one being that each new decision tree attempts to correct errors made by the previous decision tree

### Gradient Boosting Algorithm

def gb(x,y):

    parameters = {'gradientboostingclassifier__loss': ['deviance', 'exponential'],
                  'gradientboostingclassifier__max_features': ['auto', 'sqrt', 'log2']
                 }

    gb_clf = make_pipeline(
                    GradientBoostingClassifier(random_state=0),
                            )
    
    gb_clf = GridSearchCV(gb_clf, parameters, cv = 10)

    gb_clf.fit(x, y)
    plot_confusion_matrix(gb_clf, X_test, y_test, normalize = 'true')
    
#     y_pred = gb_clf.predict_proba(X_test)
#     precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])
#     return gb_clf, (precision, recall, thresholds)
    return gb_clf

### Imbalanced Data

gb_clf = gb(X_train, y_train)

### Undersampling

gb_clf1 = gb(X_rus, y_rus)

### Oversampling

gb_clf2 = gb(X_ros, y_ros)

### TPR vs FPR Curve

disp = metrics.plot_roc_curve(gb_clf, X_test, y_test, name = 'GB Balanced')

ax = plt.gca()
disp = metrics.plot_roc_curve(gb_clf1, X_test, y_test, ax=ax, alpha=0.8, name = 'GB RUS')

ax1 = plt.gca()
disp = metrics.plot_roc_curve(gb_clf2, X_test, y_test, ax=ax1, alpha=0.8, name = 'GB ROS')

Evaluating Our Models

from pylab import rcParams
rcParams['figure.figsize'] = 16, 8
a = 0.7

disp = metrics.plot_roc_curve(rf_clf, X_test, y_test,name = 'RF Balanced')
ax = plt.gca()
disp = metrics.plot_roc_curve(rf_clf1, X_test, y_test, ax=ax, alpha=a, name = 'RF RUS')
ax1 = plt.gca()
disp = metrics.plot_roc_curve(rf_clf2, X_test, y_test, ax=ax1, alpha=a, name = 'RF ROS')
ax2 = plt.gca()
disp = metrics.plot_roc_curve(log_clf, X_test, y_test, ax=ax2, alpha = a, name = 'LR Balanced')
ax3 = plt.gca()
disp = metrics.plot_roc_curve(log_clf1, X_test, y_test, ax=ax3, alpha=a, name = 'LR RUS')
ax4 = plt.gca()
disp = metrics.plot_roc_curve(log_clf2, X_test, y_test, ax=ax4, alpha=a, name = 'LR ROS')
ax5 = plt.gca()
disp = metrics.plot_roc_curve(svm_clf, X_test, y_test, ax = ax5, alpha = a, name = 'SVM Balanced')
ax6 = plt.gca()
disp = metrics.plot_roc_curve(svm_clf1, X_test, y_test, ax=ax6, alpha=a, name = 'SVM RUS')
ax7 = plt.gca()
disp = metrics.plot_roc_curve(svm_clf2, X_test, y_test, ax=ax7, alpha=a, name = 'SVM ROS')
ax8 = plt.gca()
disp = metrics.plot_roc_curve(gb_clf, X_test, y_test, ax = ax8, alpha = a, name = 'GB Balanced')
ax9 = plt.gca()
disp = metrics.plot_roc_curve(gb_clf1, X_test, y_test, ax=ax9, alpha=a, name = 'GB RUS')
ax10 = plt.gca()
disp = metrics.plot_roc_curve(gb_clf2, X_test, y_test, ax=ax10, alpha=a, name = 'GB ROS')

plt.title('True Positive Rate VS False Positive Rate')

Text(0.5, 1.0, 'True Positive Rate VS False Positive Rate')

It seems that in regards to working with imbalanced data VS undersampling VS Oversampling, neither method was conclusively better than the other.

However, it is clear that one algorithm performed better than all the others. As we can see above, Gradient Boosting performed much better than all the other algorithms in terms of AUC and that at any point on its TPR FPR curve, it performs better than the other algorithms.

We can definitely improve on the Gradient Boosting Model by incorporating hyperparameter tuning.

How well can our model predict whether an individual will resubscribe?¶

This is an interesting question. Currently, we have been using the AUC score under the TPR FPR curve to compare models. It is a relatively simple metric and is fairly straighforward.

However, our model threshold has not been configured to minimize False Positive and False Negative Costs. So before we can describe how well our model performs, we must take into account both the costs associated with both the False Positive and False Negative Outcomes. For example, we could potentially configure out model so that it will be able to correctly predict every single resubscriber. But it will likely be at the cost of misclassifying individuals that have left the platform as resubscribers. This could be thought of as a great model if the FP figure does not matter. But for most practical applications this would not be a good model because there would some costs associated with False Positive. Quantifying those costs and then configuring our models to minimize for total cost is crucial.

Practically, another way of describing the effectiveness our models is to hold TPR, FPR or a similar metric as a constant so as to define our threshold. We would then configure our model to meet that requirement and provide other more tangible metrics that can describe the effectiveness of our model. For example, if a Product Manager were to ask - "Describe for me the effectiveness of your model given that I want to correctly identify at least 80% of Resubscribers", then I would first configure my model so that it met that requirement. And then I might use Precision (Total Correctly Predicted), FPR (False Positive Rate) or a similar measure to explain the drawbacks or costs. Continuing with this example, I might say that I can indeed correctly predict approximately 80 percent of all resubscribers but out of all my predictions, only be 25% of them will be accurate. In fact, in the cell below, I actually run through this exact example if faced with such a question in the cell below.

### Precision, Recall, Threshold given that Recall > 0.8 and model is gb_clf

for i in range(len(scores[1])):
    if scores[1][i] <= 0.8:
        print('Precision is ' + str(scores[0][i-1].round(4)))
        print('Recall is ' + str(scores[1][i-1].round(4)))
        print('Threshold is ' + str(scores[2][i-1].round(4)))
        break

Precision is 0.2533
Recall is 0.8031
Threshold is 0.0944

What variables are most useful for such a prediction?

One of our objectives is to review which variables are most useful our prediction. Luckily, there are some simple methods we can utilize to determine variable importance. With the RandomForest and Gradient Boosting algorithms, one can visualize feature importance based on how much each feature contributes to decreasing the weighted impurity. I have provided both the variable importance based on a Random Forest as well as XGBoost just as a compare and contrast.

### Feature Importance Visualize Function

def feature_importance(forest): 
    
    forest.fit(X_train, y_train)
    importances = forest.feature_importances_
    try:
        std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    except:
        pass
    indices = np.argsort(importances)[::-1]
    
    l1 = []
    l2 = []
    l3 = []
    
    for f in range(X_train.shape[1]):
        l1.append(f+1)
        l2.append("feature %d (%f)" % (indices[f], importances[indices[f]]))
        l3.append(X_train.columns[indices[f]])

    # Plot the impurity-based feature importances of the forest
    plt.figure(figsize=(15, 8), dpi= 300, facecolor='w', edgecolor='k')
    plt.title("Feature importances")
    try:
        plt.bar(range(X_train.shape[1]), importances[indices],color="r", align="center", yerr=std[indices])
    except:
        plt.bar(range(X_train.shape[1]), importances[indices],color="r", align="center")
    plt.xticks(range(X_train.shape[1]), l3)
    plt.xticks(rotation=40)
    plt.xlim([-1, X_train.shape[1]])
    plt.show()
    
    table = pd.DataFrame([l1,l2,l3]).T
    table.columns = ['Ranking', 'Feature Number', 'Feature Name']
    return table

### Random Forest Feature Importance

feature_importance(RandomForestRegressor(n_estimators=250,random_state=0))

Feature ranking:

### Gradient Boosting Feature Importance

feature_importance(GradientBoostingClassifier(random_state=0))

Feature ranking:

If we take a look at the top 7 most influential features, the three key features that are within the company's control (or somewhat) include:

offer_delay
mastery_quiz_ads
health_empty ads

These features can be altered by Duolingo directly. For example, Duolingo can decide exactly when for each user to offer a second trial. They can also decide exactly how many ads of what kind to show to each user.

Let's start by examining the offer_delay feature more closely.

Offer_Delay¶

### offer_delay variable

fig = px.histogram(df_viz, x="offer_delay", facet_col="did_resubscribe", title = 'Offer Delay Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()

If we go back to our visualization, we can see that offer_delay is negatively correlated with the probability one might resubscribe. And this is crucial because it is in Duolingo's control when they offer the second subscription. So if we were to think of our probability as a function of x, where x is the offer_delay and the other feature information is held constant, and we are trying to maximize the probability an individual resubscribed, then what if tried using our algorithm to find the optimal value for offer_delay. It may be a simplistic approach and there are likely better ways, perhaps using optimization techniques, to find the optimal value for offer_delay, but this could offer a window into what the optimal value might be. Let's try it!

def optimize_offer(i):
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(300)]
    df1['offer_delay'] = [i for i in range(0,300)]
    fig = px.scatter(x=[i for i in range(0,300)], y=gb_clf.predict_proba(df1)[:,1])
    fig.show()
    print('Sample Data')
    return df1.head()

### Optimize for sample row 8 in our test set

optimize_offer(8)

Sample Data

This reveals that the optimal offer_delay value is 1. And it can make a relatively large difference (25%) in our probability estimation. While it's likely that 1 is the optimal value for most data points, we could potentially use this kind of method to explain how much offer_delay influences our final prediction.

Intuitively, it makes sense that after finishing a free trial, immediately being offered a second free week following the first will increase the chance that one would sign up for the second trial as well.

Mastery Quiz Ads, Health Empty Ads¶

### mastery_quiz_ads variable

fig = px.histogram(df_viz, x="mastery_quiz_ads", facet_col="did_resubscribe", title = 'Mastery Quiz Ads Variable')
fig.update_yaxes(matches=None)
fig.update_yaxes(showticklabels=True, col=2)
fig.show()
print('The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus')

The number of "mastery quiz" ads a user saw for Plus per active day during the resubscribe period. A "mastery quiz" ad is an ad that advertises the Mastery Quiz feature of Plus

Two of other variables that seem to have some influence in our prediction is the number of Mastery Quiz and Health Empty Ads (Mastery Quiz Ads more so). We should be able to use a similar approach like we used above to see how how mastery quiz ads affect our prediction.

def optimize_ads(i):
    df1 = X_train.iloc[i:i+1]
    df1 = df1.iloc[np.arange(len(df1)).repeat(20)]
    df1['mastery_quiz_ads'] = [np.round(i, 1) for i in np.linspace(0, 2, 20)]
    fig = px.scatter(x=[np.round(i, 1) for i in np.linspace(0, 2, 20)], y=gb_clf.predict_proba(df1)[:,1])
    fig.show()
    print('Sample Data')
    return df1.head()

### Optimize for sample row 9 in our test set.

optimize_ads(9)

Sample Data

In this example, there may not be an optimal solution for all points in terms of what number of ads is appropriate. Based on some experimentation, it seems that it can vary depending on the data point. We can perform this for health_ads as well.

Insights and Recommendations

Product Insights:¶

There are several features that are strongly predictive of our target variable but we do not have indirect control over or cannot immediately change. However, being aware that these are some of the more predictive features in our model will help us understand what aspects of the product to prioritize.

learning_time:¶

Learning_time is the most influential feature based on our analysis. The more time customers spend on average per day, the more likely they are to resubscribe. According to our GB feature importance plot, this was given a 30% feature importance.

plus_challenges_attempted/plus_challenges_correct¶

The greater the number of questions answered or answered correctly, the higher the probability the customer will resubscribe. According to our RF Feature Importance Plot, this was given 10% Feature Importance.

initial_duration:¶

The longer the initial subscription, the higher the probability the customer will resubscribe. According to our GB Feature Importance plot, this was given 20% feature importance.

Variables that we can optimize for:¶

There are three variables that I have identified which influence our prediction and which as an organization we may have some control over and can therefore optimize for:

Offer_Delay:¶

If our goal is solely to maximize the number of individuals who sign up for the second free trial, then we could potentially use our model to predict the optimal number of days for each user. Using A/B Experimentation, we could see whether our predictions our indeed increasing the number of resubscriptions. To further extend this if we want to make this an ongoing experiment, we can employ a hybrid Multi-Armed Bandit Model in which after set periods of time, we evaluate our predictions and adjust our models to better optimize for Offer_Delay.

Mastery Quiz Ads/Health Empty Ads:¶

We can utilize an approach similar to the A/B Testing and Hybrid Multi-Armed Bandit approach I described above. However, there are a different set of constraints to take into account.

If the cost of running an ad for the user exceeds the potential value of gaining a resubscription, then it may not be cost effective. A cost benefit analysis is crucial in this experiment.

	initial_duration	plus_challenges_attempted	plus_challenges_correct	offer_delay	health_empty_ads	mastery_quiz_ads	learning_time	post_activity
count	10062.000000	10063.000000	10067.000000	10066.000000	10060.000000	10068.000000	10057.000000	10065.000000
mean	35.324588	1161.908973	1034.511175	139.991357	0.203973	0.211489	28.512222	0.436960
std	73.182226	2910.942785	2577.811596	137.462982	0.447673	0.451675	213.981577	1.409789
min	3.000000	0.000000	0.000000	0.000000	-2.009000	0.000000	0.000000	0.000000
25%	7.000000	172.000000	153.000000	17.000000	0.000000	0.000000	1.290000	0.000000
50%	7.000000	446.000000	399.000000	96.000000	0.032000	0.138500	4.660000	0.000000
75%	14.000000	1048.000000	942.000000	238.000000	0.250000	0.250000	10.510000	0.000000
max	464.000000	61472.000000	57089.000000	503.000000	15.000000	11.750000	2852.560000	7.000000

	user_id	initial_tier	initial_payment	initial_duration	plus_challenges_attempted	plus_challenges_correct	offer_delay	activity_label	health_empty_ads	mastery_quiz_ads	learning_time	new_language	post_activity	did_resubscribe
15	a21d57c2a4be900872f46940f6fbc6d6	1	False	7.0	142.0	136.0	73.0	reactivated_user	0.105	NaN	14.99	True	0.0	False
108	101b051d20bc8aeec77f81a3390403d4	12	False	7.0	0.0	0.0	NaN	resurrected_user	0.000	0.500	3.26	True	0.0	False
163	d7cbd9142dec663c96eec22c9d664944	12	False	14.0	52.0	51.0	5.0	reactivated_user	0.333	0.583	34.11	True	NaN	False
240	d2cc41b4c6e6e6940c1599a49af5787f	6	False	NaN	410.0	357.0	98.0	resurrected_user	0.000	0.000	7.39	True	0.0	False
268	1d3bdee23fb8b550817d00ee2962acf5	referral	False	7.0	661.0	601.0	360.0	resurrected_user	0.053	0.184	4.86	False	NaN	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9701	77d39526981651916a4f57f42338f446	12	False	7.0	394.0	345.0	454.0	resurrected_user	NaN	0.333	1.37	False	0.0	False
9720	f91bf686e3275bce9891b8c7a73b2d01	12	False	14.0	0.0	0.0	2.0	reactivated_user	NaN	0.000	1.41	False	0.0	False
9772	29e24656f98e98a5208c02141bc87d54	NaN	False	7.0	1162.0	1080.0	88.0	resurrected_user	0.200	0.300	11.59	False	0.0	False
9890	a24b48b844c69f3d7aca0732cd89013b	1	False	7.0	189.0	178.0	388.0	NaN	0.000	0.391	13.32	True	0.0	False
10012	d0871fd5a726d27363a54b69882be9c9	12	False	7.0	NaN	48.0	253.0	resurrected_user	0.155	1.741	9.53	True	0.0	False

	user_id	initial_tier	initial_payment	initial_duration	plus_challenges_attempted	plus_challenges_correct	offer_delay	activity_label	mastery_quiz_ads	learning_time	new_language	did_resubscribe
0	91cfb886c01a3b69ff29a8444620d8eb	6	False	7.0	1255.0	999.0	78.0	resurrected_user	0.000	0.00	False	False
1	bc354f81f63043619115792b52d14d32	referral	False	7.0	58.0	55.0	152.0	resurrected_user	0.000	2.32	False	False
2	a28cd2fc4e7fbce59a91a16f42daad87	referral	False	7.0	543.0	504.0	42.0	resurrected_user	0.000	0.00	False	False
3	e8b7ef7465b5f87cab8bac0345cae332	1	False	7.0	168.0	162.0	2.0	current_user	0.333	5.81	False	False
4	89e71f674719f9ba93cf88a191e20104	1	False	7.0	582.0	510.0	115.0	resurrected_user	0.000	0.62	False	False

	Ranking	Feature Number	Feature Name
0	1	feature 7 (0.238237)	learning_time
1	2	feature 4 (0.164576)	offer_delay
2	3	feature 6 (0.121570)	mastery_quiz_ads
3	4	feature 3 (0.102175)	plus_challenges_correct
4	5	feature 2 (0.101353)	plus_challenges_attempted
5	6	feature 5 (0.091485)	health_empty_ads
6	7	feature 1 (0.079264)	initial_duration
7	8	feature 12 (0.029966)	initial_tier_referral
8	9	feature 8 (0.015301)	new_language
9	10	feature 10 (0.012134)	initial_tier_12
10	11	feature 9 (0.011623)	initial_tier_1
11	12	feature 13 (0.007197)	activity_label_current_user
12	13	feature 14 (0.006700)	activity_label_reactivated_user
13	14	feature 15 (0.006528)	activity_label_ressurected_user
14	15	feature 0 (0.006195)	initial_payment
15	16	feature 11 (0.005696)	initial_tier_6

	Ranking	Feature Number	Feature Name
0	1	feature 7 (0.336907)	learning_time
1	2	feature 1 (0.195514)	initial_duration
2	3	feature 4 (0.156589)	offer_delay
3	4	feature 12 (0.109228)	initial_tier_referral
4	5	feature 6 (0.103009)	mastery_quiz_ads
5	6	feature 5 (0.036880)	health_empty_ads
6	7	feature 2 (0.021992)	plus_challenges_attempted
7	8	feature 3 (0.019940)	plus_challenges_correct
8	9	feature 15 (0.010515)	activity_label_ressurected_user
9	10	feature 10 (0.002621)	initial_tier_12
10	11	feature 14 (0.001803)	activity_label_reactivated_user
11	12	feature 8 (0.001549)	new_language
12	13	feature 13 (0.001525)	activity_label_current_user
13	14	feature 0 (0.001104)	initial_payment
14	15	feature 11 (0.000635)	initial_tier_6
15	16	feature 9 (0.000189)	initial_tier_1

	initial_payment	initial_duration	plus_challenges_attempted	plus_challenges_correct	offer_delay	health_empty_ads	mastery_quiz_ads	learning_time	new_language	initial_tier_1	activity_label_current_user
3991	False	14	432	369	0	0.333	0.111	14.51	False	1	1
3991	False	14	432	369	1	0.333	0.111	14.51	False	1	1
3991	False	14	432	369	2	0.333	0.111	14.51	False	1	1
3991	False	14	432	369	3	0.333	0.111	14.51	False	1	1
3991	False	14	432	369	4	0.333	0.111	14.51	False	1	1

	initial_payment	initial_duration	plus_challenges_attempted	plus_challenges_correct	offer_delay	mastery_quiz_ads	learning_time	new_language	initial_tier_1	activity_label_current_user
4201	False	14	2107	1951	1	0.0	5.98	False	1	1
4201	False	14	2107	1951	1	0.1	5.98	False	1	1
4201	False	14	2107	1951	1	0.2	5.98	False	1	1
4201	False	14	2107	1951	1	0.3	5.98	False	1	1
4201	False	14	2107	1951	1	0.4	5.98	False	1	1