Analysis

Please see the data preparation tutorial to understand how to prepare data for these functions.

Load data

import pandas as pd
df = pd.read_csv('example_datasets/train.csv')
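
A quick look at the prepared data never hurts. The functions below rely on (at least) a user identifier column (user_pseudo_id) and an event name column (event_name); other columns, such as the event timestamp, are covered in the data preparation tutorial:

# peek at the prepared clickstream
df.head()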

First steps

In this tutorial, we will describe the onboarding lost-passed case.

Goal

Our goal is to detect the interface elements/screens of the app at which users’ engagement drops significantly, inducing them to leave the app without registering an account.

Tasks

  1. Collect data

  2. Prepare data

  3. Analyze data

    1. Build pivot tables

    2. Visualize users’ paths in the app

    3. Build the classifier

      1. The classifier helps you pick out specific user paths
      2. The classifier estimates the probability that a user will leave the app based on their current path; you can use this information to dynamically change the content of the app to prevent churn

Expected results

  1. You will identify the most “problematic” elements of the app
  2. You will get a classifier that predicts whether a user will leave the app based on their current behavior

Import retentioneering framework

First of all, we need to import the analysis module:

from retentioneering import analysis

After that, we should set the export folder in the config (you can leave it empty and the script will create a folder named with the current timestamp). Note: if you don’t want to leave this field empty, you have to specify an existing directory, because the script will not create it.

settings = {
    'export_folder': './experiments/new_experiment'
}

or

settings = {}
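
Since the script will not create a custom export folder for you, create it up front, for example:

import os

# create the export folder before running the analysis;
# './experiments/new_experiment' mirrors the settings example above
os.makedirs('./experiments/new_experiment', exist_ok=True)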

Events’ probability dynamics

desc = analysis.get_desc_table(df,
                               settings=settings,
                               plot=True,
                               target_event_list=['lost',
                                                  'passed'])

[Figure: Probability dynamics]

Each column of the table corresponds to a step number in the user’s path, and each row corresponds to an event name.

The values show the probability that a user triggers the given event at the given step.
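
The same idea can be sketched directly in pandas (an illustration only; get_desc_table may differ in details such as padding and target handling):

# number each user's events in path order (assumes df is already
# sorted in event order within each user)
step = df.groupby('user_pseudo_id').cumcount()

# rows: event names, columns: step numbers,
# values: share of users triggering that event at that step
manual_table = pd.crosstab(df.event_name, step, normalize='columns')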

It is difficult to do deeper analysis on this table alone, so it is better to split our users into those who left the app and those who passed.

Difference in passed and lost users’ behaviour

# find users who get lost
lost_users_list = df[df.event_name == 'lost'].user_pseudo_id.unique()

# create filter for lost users
filt = df.user_pseudo_id.isin(lost_users_list)

# filter data for lost users trajectories
df_lost = df[filt]

# filter data for passed users trajectories
df_passed = df[~filt]
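
A quick sanity check on the split:

# number of users in each group
print(df_lost.user_pseudo_id.nunique(), df_passed.user_pseudo_id.nunique())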

Now plot the dynamics for each group separately.

Plot for the group of users who have the lost event:

desc_loss = analysis.get_desc_table(df_lost,
                                    settings=settings,
                                    plot=True,
                                    target_event_list=['lost',
                                                       'passed'])

[Figure: Probability dynamics for lost users]

Plot for the group of users who have the passed event:

desc_passed = analysis.get_desc_table(df_passed,
                                      settings=settings,
                                      plot=True,
                                      target_event_list=['lost',
                                                         'passed'])

[Figure: Probability dynamics for passed users]

diff_df = analysis.get_diff(desc_loss,
                            desc_passed,
                            settings=settings,
                            precalc=True)

[Figure: Difference of probability dynamics over lost and passed users]
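
Conceptually, the difference table is close to a cell-wise subtraction of the two tables (a sketch; get_diff additionally aligns the tables, plots and exports the result):

# positive values: the event/step is more typical for lost users
manual_diff = desc_loss.sub(desc_passed, fill_value=0)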

Aggregates over user transitions

Let’s aggregate our data over user transitions:

agg_list = ['trans_count', 'dt_mean', 'dt_median', 'dt_min', 'dt_max']
df_agg = analysis.get_all_agg(df, agg_list)
df_agg.head()

Out:

                    event_name                      next_event  trans_count    ...
0  onboarding__chooseLoginType                            lost            1    ...
1  onboarding__chooseLoginType          onboarding_login_Type1          414    ...
2  onboarding__chooseLoginType          onboarding_login_Type2          159    ...
3  onboarding__chooseLoginType  onboarding_privacy_policyShown         2133    ...
4     onboarding__loginFailure                            lost            1    ...

Now we can see which transitions take the most time and how often users make each transition.
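
For example, to see which transitions take the most time on average, sort by the dt_mean aggregate:

# 10 transitions with the longest average duration
df_agg.sort_values('dt_mean', ascending=False).head(10)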

We can also select the 10 most frequent transitions:

df_agg.sort_values('trans_count', ascending=False).head(10)

Out:

                           event_name                         next_event  trans_count    ...
84          onboarding_welcome_screen          onboarding_welcome_screen         5021    ...
85          onboarding_welcome_screen                             passed         2330    ...
3         onboarding__chooseLoginType     onboarding_privacy_policyShown         2133    ...
79          onboarding_welcome_screen        onboarding__chooseLoginType         1938    ...
67     onboarding_privacy_policyShown             onboarding_login_Type1         1675    ...
11             onboarding_login_Type1  onboarding_privacy_policyAccepted         1666    ...
82          onboarding_welcome_screen        onboarding_otherLogin__show         1601    ...
62  onboarding_privacy_policyAccepted          onboarding_welcome_screen         1189    ...
78          onboarding_welcome_screen                               lost         1043    ...
47        onboarding_otherLogin__show          onboarding_welcome_screen          876    ...

You can see the transitions users make most often. It seems reasonable to analyze only popular events to get stable results.
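
For example, you could drop rare transitions before further analysis (the threshold of 100 is an arbitrary illustration):

# keep only transitions observed at least 100 times
df_agg_popular = df_agg[df_agg.trans_count >= 100]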

Adjacency matrix

The adjacency matrix is a representation of a graph. You can read more about it on the `wiki <https://en.wikipedia.org/wiki/Adjacency_matrix>`__:

adj_count = analysis.get_adjacency(df_agg, 'trans_count')
adj_count

Out:

                                        ...   onboarding_login_Type1   onboarding_privacy_policyShown    ...
onboarding_login_Type1                  ...                      0.0                              0.0    ...
onboarding_privacy_policyShown          ...                   1675.0                              0.0    ...
onboarding__loginFailure                ...                      0.0                              0.0    ...
onboarding_privacy_policyTapToPolicy    ...                      0.0                              0.0    ...
onboarding_welcome_screen               ...                      0.0                              0.0    ...
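
Assuming each row holds the outgoing transition counts of one source event (as in the output above), you can turn the matrix into transition probabilities by normalizing each row to sum to 1, with a small pandas sketch:

# row-normalize counts into transition probabilities;
# rows with no outgoing transitions become NaN
adj_prob = adj_count.div(adj_count.sum(axis=1), axis=0)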

User clustering

We can also cluster users by the frequency of events in their paths:

countmap = analysis.calculate.calculate_frequency_map(df, settings)

[Figure: Histogram of frequencies]
[Figure: Heatmap of user trajectories]

On the heatmap, we can see that some users have very similar frequencies of usage across different functions.
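
For intuition, a comparable user-by-event frequency map can be built directly in pandas (an illustration; calculate_frequency_map may differ in details such as normalization):

# rows: users, columns: events, values: how often the user hit the event
freq_map = pd.crosstab(df.user_pseudo_id, df.event_name)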

The plot also suggests that it is useful to separate groups with different conversion rates:

analysis.utils.plot_clusters(df, countmap, n_clusters=5, plot_cnt=2)

[Figure: Distribution of the target class in the discovered clusters]
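
For illustration, the clustering step itself can be sketched with plain scikit-learn over the frequency map built above (plot_clusters wraps the equivalent logic together with the plots; n_clusters=5 mirrors the call above):

from sklearn.cluster import KMeans

# assign each user to one of 5 clusters based on event frequencies
labels = KMeans(n_clusters=5, random_state=0, n_init=10).fit_predict(freq_map)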

Graph visualization

We have two options to plot the graphs:

  1. With Python (fully local).

  2. With our API (in that case you’ll send your data to our servers; we don’t store it and use it only for visualization).

The second option produces a much clearer and more readable plot.

analysis.utils.plot_graph_python(df_agg, 'trans_count', settings)

[Figure: Python graph visualization]

from retentioneering.utils.export import plot_graph_api
plot_graph_api(df_lost, settings)
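
If you want a purely local alternative built on standard tools, the same transition graph can be drawn with networkx (a sketch using the df_agg frame from above):

import networkx as nx
import matplotlib.pyplot as plt

# build a directed graph weighted by transition counts
G = nx.from_pandas_edgelist(df_agg,
                            source='event_name',
                            target='next_event',
                            edge_attr='trans_count',
                            create_using=nx.DiGraph)
nx.draw_networkx(G, node_size=300, font_size=8)
plt.show()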

Lost-Passed classifier

Model fitting

clf = analysis.Model(df, target_event='lost', settings=settings)
clf.fit_model()

[Figure: Model metrics]

Fitting returns quality metrics for the model.
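
As a sanity check, you can compare these metrics with a simple baseline built on event frequencies (a sketch, not the retentioneering implementation; it reuses freq_map and lost_users_list from above and removes the outcome events so the baseline cannot cheat):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# features: event frequencies without the target events themselves
features = freq_map.drop(columns=['lost', 'passed'], errors='ignore')
target = features.index.isin(lost_users_list)

baseline = LogisticRegression(max_iter=1000)
print(cross_val_score(baseline, features, target, scoring='roc_auc').mean())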

Model inference

We have data for new users who have not yet passed or been lost.

Let’s load it into a pandas DataFrame:

test_data = pd.read_csv('example_datasets/test.csv')

Now we can predict probabilities for new users:

prediction = clf.infer(test_data)
prediction.head()

Out:

                     user_pseudo_id  not_target    target
0  000bf8e1812a0335c7e65d52b3f6e816    0.976125  0.023875
1  00275391998b3f87d798f6e7a1ec5c15    0.757970  0.242030
2  004ecbe8a710f3c7b5b3cbc9bc0c74b2    0.727521  0.272479
3  00530441b09d5494b09e936a97d5cb99    0.988654  0.011346
4  005502038cec478faf343fe54310a848    0.592515  0.407485
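
You can act on these probabilities directly, for example to pick users for an in-app intervention (the 0.3 threshold is an arbitrary illustration):

# users with a noticeable predicted probability of leaving
at_risk = prediction[prediction['target'] > 0.3]
at_risk.head()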

Understanding your data

You can plot a projection of user trajectories to understand what your data looks like:

clf.plot_projections()

[Figure: Projection of user trajectories]

Understanding the predictions of your model

You can also plot the results of the model inference over these projections to understand the cases where your model fails:

clf.plot_projections(sample=test_data.event_name.values, ids=test_data.user_pseudo_id.values)

[Figure: Model inference over the projection]

Visualizing the graph for a selected area

From the previous plot, you may be interested in which trajectories have high conversion rates.

You can select an area of the plot and visualize it as a graph:

# coordinates of the bounding box corners

bbox = [
    [-4, -12],
    [8, -26]
]

clf.plot_cluster_track(bbox)

[Figure: Python graph visualization]

The most important edges and nodes

You can find the most important edges and nodes in your model for debugging (e.g. to understand ‘leaky’ events) or to find problem transitions in your app.

Edges:

imp_tracks = clf.build_important_track()
imp_tracks[imp_tracks[1].notnull()]

Out:

                             0                              1
0  onboarding__chooselogintype         onboarding_login_type1
1       onboarding_login_type1       onboarding__loginfailure
2  onboarding__chooselogintype         onboarding_login_type2
3       onboarding_login_type2    onboarding_otherlogin__show
5     onboarding__loginfailure  onboarding_login_type1_cancel

Nodes:

imp_tracks[imp_tracks[1].isnull()][0].values

Out:

array(['onboarding__loginfailure', 'onboarding_login_type1',
       'onboarding_login_type1_cancel', 'onboarding_login_type2',
       'onboarding_otherlogin_privacy_policyshown',
       'onboarding_privacy_policydecline', 'onboarding_welcome_screen'],
      dtype=object)