Analysis ======== Please, see the `data preparation tutorial `__ for understand how to prepare data for this functions. Load data --------- .. code:: python import pandas as pd df = pd.read_csv('example_datasets/train.csv') First steps ----------- In this tutorial, we will describe the onboarding lost-passed case. **Goal** Our goal is to detect interface elements/screens of an app at which users' engagement drops significantly and induces them to leave the app without account registration. **Tasks** 1. Collect data 2. Prepare data 3. Analyze data 1. Build pivot tables 2. Visualize users path in the app 3. Build the classifier 1. Classifier helps you to pick out specific users paths 2. Classifier allows estimating the probability of the user's leaving from the app based on his current path. One can use this information to dynamically change the content of the app to prevent that. **Expected results** 1. You will identify the most "problematic" elements of the app 2. You will get the classifier which will allow you to predict the user's leaving from the app based on the current user's behavior Import retentioneering framework ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First of all, we need to import package module .. code:: python from retentioneering import analysis After that we shoud select the export folder in config (you can leave it empty and the script will create a folder with current timestamp). Note: if you don't want to leave this field empty, you have to specify an existing directory, because the script will not create it. .. code:: python settings = { 'export_folder': './experiments/new_experiment' } or .. code:: python settings = {} Events\` probability dynamics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python desc = analysis.get_desc_table(df, settings=settings, plot=True, target_event_list=['lost', 'passed']) .. image:: ../_static/plots/desc_table_both.png :width: 1200 :alt: Probability dynamics Each column of the table corresponds to a sequence number of the user's steps from the path and each row corresponds to event name. Values of the table show the probability that the user choose the appropriate event at the appropriate step. It's difficult to make a complex analysis from that table so it is better to split our users to those who leave the app and those who passed. Difference in passed and lost users behaviour ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code:: python # find users who get lost lost_users_list = df[df.event_name == 'lost'].user_pseudo_id.unique() # create filter for lost users filt = df.user_pseudo_id.isin(lost_users_list) # filter data for lost users trajectories df_lost = df[filt] # filter data for passed users trajectories df_passed = df[~filt] Plot dynamics for different groups. Plot for group of users who have **lost** event: .. code:: python desc_loss = analysis.get_desc_table(df_lost, settings=settings, plot=True, target_event_list=['lost', 'passed']) .. image:: ../_static/plots/desc_table_lost.png :width: 1200 :alt: Probability dynamics for lost users Plot for group of users who have **passed** event: .. code:: python desc_passed = analysis.get_desc_table(df_passed, settings=settings, plot=True, target_event_list=['lost', 'passed']) .. image:: ../_static/plots/desc_table_passed.png :width: 1200 :alt: Probability dynamics for passed users .. code:: python diff_df = analysis.get_diff(desc_loss, desc_passed, settings=settings, precalc=True) .. image:: ../_static/plots/desc_table_diff.png :width: 1200 :alt: Difference of probability dynamics over lost and passed users Agregates over user transitions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Lets aggregate our data over users transitions: .. code:: python agg_list = ['trans_count', 'dt_mean', 'dt_median', 'dt_min', 'dt_max'] df_agg = analysis.get_all_agg(df, agg_list) df_agg.head() Out: .. code-block:: none event_name next_event trans_count ... 0 onboarding__chooseLoginType lost 1 ... 1 onboarding__chooseLoginType onboarding_login_Type1 414 ... 2 onboarding__chooseLoginType onboarding_login_Type2 159 ... 3 onboarding__chooseLoginType onboarding_privacy_policyShown 2133 ... 4 onboarding__loginFailure lost 1 ... Now we can see which transitions take the most time and how often people have used different transitions. We could choose the 10 longest users' paths: .. code:: python df_agg.sort_values('trans_count', ascending=False).head(10) Out: .. code-block:: none event_name next_event trans_count ... 84 onboarding_welcome_screen onboarding_welcome_screen 5021 ... 85 onboarding_welcome_screen passed 2330 ... 3 onboarding__chooseLoginType onboarding_privacy_policyShown 2133 ... 79 onboarding_welcome_screen onboarding__chooseLoginType 1938 ... 67 onboarding_privacy_policyShown onboarding_login_Type1 1675 ... 11 onboarding_login_Type1 onboarding_privacy_policyAccepted 1666 ... 82 onboarding_welcome_screen onboarding_otherLogin__show 1601 ... 62 onboarding_privacy_policyAccepted onboarding_welcome_screen 1189 ... 78 onboarding_welcome_screen lost 1043 ... 47 onboarding_otherLogin__show onboarding_welcome_screen 876 ... You can see the events where users spend most of their time. It seems reasonable to analyze only popular events to get stable results. Adjacency matrix ~~~~~~~~~~~~~~~~ The adjacency matrix is the representation of the graph. You can read more about it on the`wiki `__: .. code:: python adj_count = analysis.get_adjacency(df_agg, 'trans_count') adj_count Out: .. code-block:: none ... onboarding_login_Type1 onboarding_privacy_policyShown ... onboarding_login_Type1 ... 0.0 0.0 ... onboarding_privacy_policyShown ... 1675.0 0.0 ... onboarding__loginFailure ... 0.0 0.0 ... onboarding_privacy_policyTapToPolicy ... 0.0 0.0 ... onboarding_welcome_screen ... 0.0 0.0 ... Users clustering ~~~~~~~~~~~~~~~~ Also, we could clusterize users by the frequency of events in their path: .. code:: python countmap = analysis.calculate.calculate_frequency_map(df, settings) .. image:: ../_static/plots/bar.png :width: 1200 :alt: Hist of frequencies .. image:: ../_static/plots/countmap.png :width: 1200 :alt: Heatmap of user trajectories On that plot, we can see that some users have pretty close frequencies of different functions usage. And we can see that it is useful to separate groups with different conversion rates: .. code:: python analysis.utils.plot_clusters(df, countmap, n_clusters=5, plot_cnt=2) .. image:: ../_static/plots/pie_cluster.png :width: 1200 :alt: Distribution of target class in founded clusters Graph visualization ~~~~~~~~~~~~~~~~~~~ We have two options to plot the graphs: 1. With python (this is local) 2. With our API (``in that case, you'll send your data to our servers, but we don't save it and using only for visualization``) In the second option, the plot looks much better and obvious. .. code:: python analysis.utils.plot_graph_python(df_agg, 'trans_count', settings) .. image:: ../_static/plots/graph_1.png :width: 1200 :alt: Python graph visualization .. code:: python from retentioneering.utils.export import plot_graph_api plot_graph_api(df_lost, settings) Lost-Passed classifier ~~~~~~~~~~~~~~~~~~~~~~ Model fitting ^^^^^^^^^^^^^ .. code:: python clf = analysis.Model(df, target_event='lost', settings=settings) clf.fit_model() .. image:: ../_static/plots/scores.png :width: 1200 :alt: Model metrics It returns metrics of quality of the model. Model inference ^^^^^^^^^^^^^^^ We have data for new users, who are not passed or lost already. Lets load it into pandas DataFrame: .. code:: python test_data = pd.read_csv('example_datasets/test.csv') Now we can predict probabilities for new users: .. code:: python prediction = clf.infer(test_data) prediction.head() Out: .. code-block:: none user_pseudo_id not_target target 0 000bf8e1812a0335c7e65d52b3f6e816 0.976125 0.023875 1 00275391998b3f87d798f6e7a1ec5c15 0.757970 0.242030 2 004ecbe8a710f3c7b5b3cbc9bc0c74b2 0.727521 0.272479 3 00530441b09d5494b09e936a97d5cb99 0.988654 0.011346 4 005502038cec478faf343fe54310a848 0.592515 0.407485 Understanding your data ^^^^^^^^^^^^^^^^^^^^^^^ You can plot projection of users trajectories to understand how your data looks likes: .. code:: python clf.plot_projections() .. image:: ../_static/plots/tsne_classes.png :width: 1200 :alt: Model metrics Understanding prediction of your model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Also, you can plot results of the model inference over that projections to understand the cases where your model fails: .. code:: python clf.plot_projections(sample=data.event_name.values, ids=data.user_pseudo_id.values) .. image:: ../_static/plots/tsne_probs.png :width: 1200 :alt: Model metrics Visualizing graph for area ^^^^^^^^^^^^^^^^^^^^^^^^^^ From the previous plot, you can be interested in what trajectories has high conversion rates. You can select the area on that plot and visualize it as a graph: .. code:: python # write coordinates bbox angles bbox = [ [-4, -12], [8, -26] ] clf.plot_cluster_track(bbox) .. image:: ../_static/plots/graph_2.png :width: 1200 :alt: Python graph visualization The most important edges ^^^^^^^^^^^^^^^^^^^^^^^^ You could find what was the most important edges and nodes in your model for debugging (e.g. it helps you to understand 'leaky' events) or to find problem transitions in your app. Edges: .. code:: python imp_tracks = clf.build_important_track() imp_tracks[imp_tracks[1].notnull()] Out: .. code-block:: none 0 1 0 onboarding__chooselogintype onboarding_login_type1 1 onboarding_login_type1 onboarding__loginfailure 2 onboarding__chooselogintype onboarding_login_type2 3 onboarding_login_type2 onboarding_otherlogin__show 5 onboarding__loginfailure onboarding_login_type1_cancel Nodes: .. code:: python imp_tracks[imp_tracks[1].isnull()][0].values Out: .. code-block:: none array(['onboarding__loginfailure', 'onboarding_login_type1', 'onboarding_login_type1_cancel', 'onboarding_login_type2', 'onboarding_otherlogin_privacy_policyshown', 'onboarding_privacy_policydecline', 'onboarding_welcome_screen'], dtype=object)