retentioneering.analysis package
Submodules
retentioneering.analysis.calculate module
retentioneering.analysis.calculate.calculate_frequency_hist(df, settings, target_events=None, make_plot=True, save=True, plot_name=None, figsize=(8, 5))
Calculate frequency of each event from input clickstream and plot a barplot
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- settings (dict) – experiment config (can be empty dict here)
- target_events (Union[tuple, list, str, None]) – name(s) of the event(s) that mark the target outcome (e.g. when predicting lost users, this is lost)
- make_plot (bool) – plot stats or not
- save (bool) – True if the graph should be saved
- plot_name (str) – name of file with graph plot
- figsize (tuple) – width, height in inches. If not provided, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns: pd.DataFrame
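The counting step behind such a histogram can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation: the helper name event_frequency and the output column name count are assumptions, and plotting and saving are left out.

```python
import pandas as pd

# Toy clickstream with the three columns the documentation asks for.
df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2", "u2", "u2"],
    "event_name": ["open", "lost", "open", "buy", "passed"],
    "event_timestamp": [1, 2, 1, 2, 3],
})


def event_frequency(clickstream):
    """Count how often each event occurs (hypothetical helper, not library code)."""
    counts = clickstream["event_name"].value_counts()
    # Normalise the output to two named columns regardless of pandas version.
    return counts.rename_axis("event_name").reset_index(name="count")


freq = event_frequency(df)
print(freq)
```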
retentioneering.analysis.calculate.calculate_frequency_map(df, settings, target_events=None, plot_name=None, make_plot=True, save=True, figsize_hist=(8, 5), figsize_heatmap=(10, 15))
Calculate frequency of each event for each user from input clickstream and plot a heatmap
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- settings (dict) – experiment config (can be empty dict here)
- target_events (Union[tuple, list, str, None]) – name(s) of the event(s) that mark the target outcome (e.g. when predicting lost users, this is lost)
- plot_name (str) – name of file with graph plot
- make_plot (bool) – plot stats or not
- save (bool) – True if the graph should be saved
- figsize_hist (tuple) – width, height in inches for bar plot with events. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
- figsize_heatmap (tuple) – width, height in inches for heatmap. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns: pd.DataFrame
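A per-user event count table of the kind a heatmap would be drawn from can be approximated with pd.crosstab. The sketch below assumes the result has users in the index and events in the columns; the actual layout returned by the library may differ.

```python
import pandas as pd

# Toy clickstream; the values are made up for illustration.
df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_name": ["open", "open", "buy", "open", "lost"],
    "event_timestamp": [1, 2, 3, 1, 2],
})

# Rows are users, columns are events, cells are event counts.
countmap = pd.crosstab(df["user_pseudo_id"], df["event_name"])
print(countmap)
```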
retentioneering.analysis.cluster module
retentioneering.analysis.cluster.add_cluster_of_users(data, users_clusters, how='left')
Add cluster of each user to clickstream data
Parameters: - data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least one column: user_pseudo_id
- users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
- how (str) – argument to pass in pd.merge function
Returns: pd.DataFrame
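Since how is passed on to pd.merge, its effect can be illustrated with pd.merge directly. The sketch assumes the join key is user_pseudo_id, which the parameter descriptions suggest but do not state outright.

```python
import pandas as pd

data = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2", "u3"],
    "event_name": ["open", "buy", "open", "open"],
})
users_clusters = pd.DataFrame({
    "user_pseudo_id": ["u1", "u2"],
    "cluster": [0, 1],
})

# how="left" keeps every clickstream row; users without a cluster get NaN.
merged = pd.merge(data, users_clusters, on="user_pseudo_id", how="left")

# how="inner" would instead drop the rows of unclustered users.
inner = pd.merge(data, users_clusters, on="user_pseudo_id", how="inner")
```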
retentioneering.analysis.cluster.calculate_cluster_stats(data, users_clusters, settings, target_events=('lost', 'passed'), make_plot=True, plot_count=2, save=True, plot_name=None, figsize=(10, 5))
Plot pie-chart with distribution of target events in clusters
Parameters: - data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
- settings (dict) – experiment config (can be empty dict here)
- target_events (list or tuple) – names of the events that mark the target outcome (e.g. when predicting lost users, this is lost)
- make_plot (bool) – plot stats or not
- plot_count (int) – number of plots for output
- save (bool) – True if the graph should be saved
- plot_name (str) – name of file with graph plot
- figsize (tuple) – width, height in inches. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns: np.array
retentioneering.analysis.cluster.cluster_users(countmap, n_clusters=None, clusterer=None)
Cluster users based on input dataframe and return DataFrame with user_pseudo_id and cluster for each user
Parameters: - countmap (pd.DataFrame) – input dataframe, should have user_id in index. All fields will be features in clustering algorithm
- n_clusters (int) – supposed number of clusters, could be None
- clusterer (func) – clustering algorithm. Should have fit_predict function
Returns: pd.DataFrame
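The clusterer contract described above (any object with a fit_predict method) can be illustrated with a toy stand-in. Both MedianSplitClusterer and cluster_users_sketch are hypothetical names invented for this sketch; in practice a scikit-learn clusterer would typically fill that role, and the library's own handling of n_clusters is not shown.

```python
import numpy as np
import pandas as pd


class MedianSplitClusterer:
    """Hypothetical clusterer: splits users by total activity around the median."""

    def fit_predict(self, X):
        totals = np.asarray(X).sum(axis=1)
        return (totals > np.median(totals)).astype(int)


# Users in the index, one feature column per event.
countmap = pd.DataFrame(
    {"open": [1, 5, 2, 9], "buy": [0, 3, 0, 4]},
    index=pd.Index(["u1", "u2", "u3", "u4"], name="user_pseudo_id"),
)


def cluster_users_sketch(countmap, clusterer):
    """Return one row per user with the label assigned by clusterer.fit_predict."""
    labels = clusterer.fit_predict(countmap.values)
    return pd.DataFrame({
        "user_pseudo_id": countmap.index.to_numpy(),
        "cluster": labels,
    })


users_clusters = cluster_users_sketch(countmap, MedianSplitClusterer())
```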
retentioneering.analysis.model module
class retentioneering.analysis.model.Model(data, target_event, settings, event_filter=None, n_start_events=None, emb_type='tf-idf', ngram_range=(1, 3), emb_dims=None, embedder=None)
Bases: object
Base model for classification
build_important_track()
Finds the tracks that are most important for determining target_event
Returns: most important edges in graph
Return type: pd.DataFrame
fit_model(model_type='logit')
Fits classifier
Parameters: model_type (str) – type of model (now only logit is supported)
Returns: None
plot_cluster_track(bbox)
Plots graph for users in selected area
Parameters: bbox (List[List[float]]) – coordinates of the top-left and bottom-right corners of the area
Returns: None
plot_projections(sample=None, target=None, ids=None)
Plots t-SNE projection of users' trajectories
Parameters: - sample (np.ndarray or pd.DataFrame) – sample of trajectories
- target (str) – column by which the data should be split (if target is None, the probabilities of target_event are highlighted)
- ids (list or other iterable) – list of user_ids for visualization in plot
Returns: None
predict_proba(sample)
Predicts probability of sample
Parameters: sample (np.ndarray or pd.DataFrame) – sample of users' vectorized tracks (e.g. with tf-idf transform)
Returns: probabilities of different classes as list of [not target_event probability, target_event probability]
Return type: np.ndarray
retentioneering.analysis.model.create_filter(data, n_folds=None)
Creates a filter of events of interest using a histogram
Parameters: - data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- n_folds (int) – number of folds in histogram (it affects how far technical events should be from users' events)
Returns: set
retentioneering.analysis.utils module
retentioneering.analysis.utils.filter_welcome(df)
Filter for truncated welcome visualization
Parameters: df – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns: dataset filtered to users' events
Return type: pd.DataFrame
retentioneering.analysis.utils.get_accums(agg, name, max_rank)
Creates Accumulator Variables
Parameters: - agg – Counts of events by step
- name – Name of Accumulator
- max_rank – Number of steps in pivot
Returns: Accumulator Variable
retentioneering.analysis.utils.get_adjacency(df, adj_type)
Creates graph adjacency matrix from table with aggregates by nodes
Parameters: - df (pd.DataFrame) – table with aggregates (from retentioneering.analysis.utils.get_all_agg function)
- adj_type (str) – name of col for weighting graph nodes (column name from df)
Returns: adjacency matrix
Return type: pd.DataFrame
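Turning an aggregate table into an adjacency matrix amounts to a pivot. The sketch below assumes the aggregate table has event_name and next_event columns plus one weight column per aggregate; that layout is not documented here, so treat the column names and the helper adjacency_sketch as hypothetical.

```python
import pandas as pd

# Assumed shape of an aggregate table: one row per (source, destination) edge.
df_agg = pd.DataFrame({
    "event_name": ["open", "open", "buy"],
    "next_event": ["buy", "lost", "passed"],
    "trans_count": [3, 1, 2],
})


def adjacency_sketch(df_agg, adj_type):
    """Pivot edge weights into a square node-by-node matrix (illustration only)."""
    nodes = sorted(set(df_agg["event_name"]) | set(df_agg["next_event"]))
    adj = df_agg.pivot_table(
        index="event_name",
        columns="next_event",
        values=adj_type,
        aggfunc="sum",
        fill_value=0,
    )
    # Make the matrix square so every node appears as both a row and a column.
    return adj.reindex(index=nodes, columns=nodes, fill_value=0)


adj = adjacency_sketch(df_agg, "trans_count")
```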
retentioneering.analysis.utils.get_agg(df, agg_type)
Create aggregates (weights) by time of graph nodes
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- agg_type (str) – type of aggregate, written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns: table with aggregates by nodes of graph
Return type: pd.DataFrame
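The agg_type naming convention can be unpacked with a single rsplit. The sketch assumes the aggregated quantity is time_to_next_event, grouped by event_name and next_event; the documentation does not spell this out, so both the column names and the helper agg_sketch are assumptions.

```python
import pandas as pd

# Assumed input: clickstream already extended with next_event and time_to_next_event.
shifted = pd.DataFrame({
    "event_name": ["open", "open", "open", "buy"],
    "next_event": ["buy", "buy", "lost", "passed"],
    "time_to_next_event": [10.0, 20.0, 5.0, 7.0],
})


def agg_sketch(shifted, agg_type):
    """Aggregate transition times per edge; agg_type looks like 'trans_count'."""
    # "trans_count" -> label "trans", aggregate function "count".
    _label, func = agg_type.rsplit("_", 1)
    return (
        shifted.groupby(["event_name", "next_event"])["time_to_next_event"]
        .agg(func)
        .reset_index(name=agg_type)
    )


counts = agg_sketch(shifted, "trans_count")
means = agg_sketch(shifted, "trans_mean")
```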
retentioneering.analysis.utils.get_all_agg(df, agg_list)
Create aggregates (weights) by time of graph nodes from agg_list
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- agg_list (List[str]) – list of needed aggregates, each written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns: table with aggregates by nodes of graph
Return type: pd.DataFrame
retentioneering.analysis.utils.get_desc_table(df, settings, target_event_list=['lost', 'passed'], max_steps=None, plot=True, plot_name=None)
Builds distribution of events over steps
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- settings (dict) – experiment config (can be empty dict here)
- target_event_list (list) – list of target events
- max_steps (int) –
- plot (bool) – if True then heatmap is plotted
- plot_name (str) –
Returns: Pivot table with distribution of events over steps
Return type: pd.DataFrame
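One way to read "distribution of events over steps" is: number each user's events in time order, then compute the share of each event at every step. The sketch below follows that reading; the per-step normalisation and the helper name desc_table_sketch are assumptions, and the library's handling of target events and max_steps is not reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_name": ["open", "buy", "passed", "open", "lost"],
    "event_timestamp": [0, 5, 9, 2, 4],
})


def desc_table_sketch(df):
    """Share of each event at every step of the users' trajectories."""
    ordered = df.sort_values(["user_pseudo_id", "event_timestamp"])
    # Step 1 is each user's first event, step 2 the second, and so on.
    step = (ordered.groupby("user_pseudo_id").cumcount() + 1).rename("step")
    # Each column (step) sums to 1.
    return pd.crosstab(ordered["event_name"], step, normalize="columns")


table = desc_table_sketch(df)
```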
retentioneering.analysis.utils.get_diff(df_old, df_new, settings, precalc=False, plot=True, plot_name=None)
Gets difference between two groups
Parameters: - df_old (pd.DataFrame) – Raw clickstream or calculated desc table of last version
- df_new (pd.DataFrame) – Raw clickstream or calculated desc table of new version
- settings (dict) – experiment config (can be empty dict here)
- precalc (bool) – if True, the inputs are treated as precalculated desc tables
- plot (bool) – if True then heatmap is plotted
- plot_name (str) –
Returns: Table of differences between two versions
Return type: pd.DataFrame
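For the precalc=True case, the comparison can be pictured as an element-wise subtraction of two desc tables. The sketch below assumes events in the index and steps in the columns, and treats an event missing from one version as zero; whether the library does the same is not stated here.

```python
import pandas as pd

# Two small precalculated tables: rows are events, columns are steps.
df_old = pd.DataFrame({1: [1.0, 0.0], 2: [0.5, 0.5]}, index=["open", "lost"])
df_new = pd.DataFrame({1: [1.0, 0.0], 2: [0.25, 0.75]}, index=["open", "buy"])

# fill_value=0 treats an event absent from one version as having zero share there.
diff = df_new.sub(df_old, fill_value=0)
print(diff)
```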
retentioneering.analysis.utils.get_shift(df)
Creates next_event and time_to_next_event
Parameters: df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns: source table with additional columns
Return type: pd.DataFrame
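The two added columns can be reproduced with a grouped shift. This is a minimal sketch assuming numeric timestamps and that each user's last event has no successor; shift_sketch is a hypothetical helper, not the library function.

```python
import pandas as pd

df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_name": ["open", "buy", "passed", "open", "lost"],
    "event_timestamp": [0, 5, 9, 2, 4],
})


def shift_sketch(df):
    """Add next_event and time_to_next_event within each user's trajectory."""
    out = df.sort_values(["user_pseudo_id", "event_timestamp"]).copy()
    grouped = out.groupby("user_pseudo_id")
    # shift(-1) pulls the following row's value up, per user.
    out["next_event"] = grouped["event_name"].shift(-1)
    out["time_to_next_event"] = (
        grouped["event_timestamp"].shift(-1) - out["event_timestamp"]
    )
    return out


shifted = shift_sketch(df)
```

The last event of every user ends up with a missing next_event and time_to_next_event, since there is nothing to shift into that row.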
retentioneering.analysis.utils.plot_clusters(data, countmap, target_events=['lost', 'passed'], n_clusters=None, plot_cnt=2, width=10, height=5)
Plot pie-chart with distribution of target events in clusters
Parameters: - data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- countmap (pd.DataFrame) – result of retentioneering.analysis.utils.plot_frequency_map
- target_events (List[str]) – names of the events that mark the target outcome (e.g. when predicting lost users, this is lost)
- n_clusters (int) – supposed number of clusters
- plot_cnt (int) – number of plots for output
- width (float) – width of plot
- height (float) – height of plot
Returns: None
retentioneering.analysis.utils.plot_frequency_map(df, settings, target_events=['lost', 'passed'], plot_name=None)
Plots frequency histogram and heatmap of users' event counts
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- settings (dict) – experiment config (can be empty dict here)
- target_events (List[str]) – names of the events that mark the target outcome (e.g. when predicting lost users, this is lost)
- plot_name (str) – name of file with graph plot
Returns: table with counts of events for users
Return type: pd.DataFrame
retentioneering.analysis.utils.plot_graph_python(df_agg, agg_type, settings, layout=<function random_layout>, plot_name=None)
Visualize trajectories from aggregated tables (with python)
Parameters: - df_agg (pd.DataFrame) – table with aggregates (from retentioneering.analysis.utils.get_all_agg function)
- agg_type (str) – name of col for weighting graph nodes (column name from df_agg)
- settings (dict) – experiment config (can be empty dict here)
- layout (func) –
- plot_name (str) – name of file with graph plot
Returns: None
retentioneering.analysis.utils.prepare_dataset(df, target_events, event_filter=None, n_start_events=None)
Prepares data for classifier inference
Parameters: - df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
- target_events (Union[list, None]) – name(s) of the event(s) that mark the target outcome (e.g. when predicting lost users, this is lost)
- event_filter (list or other iterable) – events to include in the analysis
- n_start_events – length of the user's trajectory, counted from the start
Returns: prepared data for inference (glued user events in one trajectory)
Return type: pd.DataFrame
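"Gluing" user events into one trajectory can be sketched as a grouped string join. The helper glue_trajectories, the output column names trajectory and target, and the choice to drop target events from the trajectory text are all assumptions made for this illustration; event_filter is omitted.

```python
import pandas as pd

df = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2"],
    "event_name": ["open", "buy", "passed", "open", "lost"],
    "event_timestamp": [0, 5, 9, 2, 4],
})


def glue_trajectories(df, target_events, n_start_events=None):
    """One row per user: space-joined events plus a boolean target flag."""
    ordered = df.sort_values(["user_pseudo_id", "event_timestamp"])
    is_target = ordered["event_name"].isin(target_events)
    # A user is a positive example if any of their events is a target event.
    target = is_target.groupby(ordered["user_pseudo_id"]).any().rename("target")
    steps = ordered[~is_target]
    if n_start_events is not None:
        # Keep only the first n events of each user's trajectory.
        steps = steps.groupby("user_pseudo_id").head(n_start_events)
    tracks = (
        steps.groupby("user_pseudo_id")["event_name"]
        .agg(" ".join)
        .rename("trajectory")
    )
    merged = pd.concat([tracks, target], axis=1)
    return merged.rename_axis("user_pseudo_id").reset_index()


prepared = glue_trajectories(df, ["lost"])
truncated = glue_trajectories(df, ["lost"], n_start_events=2)
```

A space-joined string per user is a convenient input for text vectorizers such as tf-idf over n-grams, which matches the emb_type and ngram_range options listed for Model.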
retentioneering.analysis.utils.prepare_prunned(df)
Filter for truncated welcome visualization
Parameters: df – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns: dataset filtered to users' events
Return type: pd.DataFrame
retentioneering.analysis.weighter module
retentioneering.analysis.weighter.calc_all_norm_mech(data, mechanics_events, mode='session', duration_thresh=1, len_thresh=None)
Calculates weights of different mechanics in users' sessions
Parameters: - data – clickstream data with a session column (rank of the user's session)
- mechanics_events – mapping of mechanic and its target events
- mode – if session, weights are calculated per session; if full, over the user's full history
- duration_thresh – session duration threshold used to delete technical (ping) sessions
- len_thresh – threshold on the number of events in a session, used to delete technical (ping) sessions
Returns: session description with weights of each mechanic
Return type: pd.DataFrame
retentioneering.analysis.weighter.mechanics_enrichment(data, mechanics, id_col, event_col, q=0.99, q2=0.99)
Enrich the list of events specific to each mechanic
Parameters: - data (pd.DataFrame) – clickstream data with a session column (rank of the user's session)
- mechanics (pd.DataFrame) – table with description in form [id_col, event_col], where id_col is a column with mechanic name and event_col is a column which contains target events specific for that mechanic
- id_col – name of the column with mechanic name
- event_col – name of the column with target events specific for that mechanic
- q (float in interval (0, 1)) – quantile for frequency of target events
- q2 (float in interval (0, 1)) – quantile for frequency of target events of other mechanic
Returns: mapping of mechanic and its target events
Return type: Dict[str, List[str]]