retentioneering.analysis package

Submodules

retentioneering.analysis.calculate module

retentioneering.analysis.calculate.calculate_frequency_hist(df, settings, target_events=None, make_plot=True, save=True, plot_name=None, figsize=(8, 5))[source]

Calculate frequency of each event from input clickstream and plot a barplot

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (Union[tuple, list, str, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • make_plot (bool) – plot stats or not
  • save (bool) – True if the graph should be saved
  • plot_name (str) – name of file with graph plot
  • figsize (tuple) – width, height in inches. If not provided, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

pd.DataFrame
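
A minimal usage sketch (the toy clickstream below is illustrative; its timestamps only need to match the format of your own export):

    import pandas as pd
    from retentioneering.analysis.calculate import calculate_frequency_hist

    # toy clickstream with the three required columns (illustrative values)
    df = pd.DataFrame({
        'event_name': ['welcome', 'signup', 'lost', 'welcome', 'passed'],
        'event_timestamp': [1, 2, 3, 1, 2],
        'user_pseudo_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    })

    freq = calculate_frequency_hist(df, settings={}, target_events='lost',
                                    make_plot=True, save=False)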

retentioneering.analysis.calculate.calculate_frequency_map(df, settings, target_events=None, plot_name=None, make_plot=True, save=True, figsize_hist=(8, 5), figsize_heatmap=(10, 15))[source]

Calculate frequency of each event for each user from input clickstream and plot a heatmap

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (Union[tuple, list, str, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • plot_name (str) – name of file with graph plot
  • make_plot (bool) – plot stats or not
  • save (bool) – True if the graph should be saved
  • figsize_hist (tuple) – width, height in inches for bar plot with events. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
  • figsize_heatmap (tuple) – width, height in inches for heatmap. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

pd.DataFrame
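
Usage sketch, reusing a clickstream df as in the previous example; both figure sizes are optional:

    from retentioneering.analysis.calculate import calculate_frequency_map

    freq_map = calculate_frequency_map(df, settings={},
                                       target_events=('lost', 'passed'),
                                       make_plot=True, save=False,
                                       figsize_hist=(8, 5),
                                       figsize_heatmap=(10, 15))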

retentioneering.analysis.cluster module

retentioneering.analysis.cluster.add_cluster_of_users(data, users_clusters, how='left')[source]

Add cluster of each user to clickstream data

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least one column: user_pseudo_id
  • users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
  • how (str) – argument to pass in pd.merge function
Returns:

pd.DataFrame
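
A minimal sketch, assuming data is a clickstream DataFrame with a user_pseudo_id column; users_clusters here is hand-made, in practice it usually comes from cluster_users below:

    import pandas as pd
    from retentioneering.analysis.cluster import add_cluster_of_users

    users_clusters = pd.DataFrame({'user_pseudo_id': ['u1', 'u2'],
                                   'cluster': [0, 1]})
    # left-join the cluster label onto every clickstream row
    data_with_clusters = add_cluster_of_users(data, users_clusters, how='left')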

retentioneering.analysis.cluster.calculate_cluster_stats(data, users_clusters, settings, target_events=('lost', 'passed'), make_plot=True, plot_count=2, save=True, plot_name=None, figsize=(10, 5))[source]

Plot pie-chart with distribution of target events in clusters

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (list or tuple) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • make_plot (bool) – plot stats or not
  • plot_count (int) – number of plots for output
  • save (bool) – True if the graph should be saved
  • plot_name (str) – name of file with graph plot
  • figsize (tuple) – width, height in inches. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

np.array
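
Usage sketch, assuming data is a clickstream and users_clusters maps user_pseudo_id to a cluster (e.g. the output of cluster_users below):

    from retentioneering.analysis.cluster import calculate_cluster_stats

    stats = calculate_cluster_stats(data, users_clusters, settings={},
                                    target_events=('lost', 'passed'),
                                    make_plot=True, plot_count=2, save=False)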

retentioneering.analysis.cluster.cluster_users(countmap, n_clusters=None, clusterer=None)[source]

Cluster users based on input dataframe and return DataFrame with user_pseudo_id and cluster for each user

Parameters:
  • countmap (pd.DataFrame) – input dataframe, should have user_id in index. All fields will be features in clustering algorithm
  • n_clusters (int) – supposed number of clusters, could be None
  • clusterer (func) – clustering algorithm. Should have fit_predict function
Returns:

pd.DataFrame
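
A sketch of both calling styles; countmap is assumed to be a user-by-event count table (e.g. the result of retentioneering.analysis.utils.plot_frequency_map), and sklearn's KMeans is just one possible clusterer with a fit_predict method:

    from sklearn.cluster import KMeans
    from retentioneering.analysis.cluster import cluster_users

    # let the function choose the algorithm for a supposed number of clusters
    users_clusters = cluster_users(countmap, n_clusters=4)

    # or pass any clusterer that implements fit_predict
    users_clusters = cluster_users(countmap, clusterer=KMeans(n_clusters=4))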

retentioneering.analysis.model module

class retentioneering.analysis.model.Model(data, target_event, settings, event_filter=None, n_start_events=None, emb_type='tf-idf', ngram_range=(1, 3), emb_dims=None, embedder=None)[source]

Bases: object

Base model for classification

build_important_track()[source]

Finds the most important tracks for definition of target_event

Returns:most important edges in graph
Return type:pd.DataFrame
fit_model(model_type='logit')[source]

Fits classifier

Parameters:model_type (str) – type of model (now only logit is supported)
Returns:None
plot()[source]

Plot metrics of model

Returns:None
plot_cluster_track(bbox)[source]

Plots graph for users in selected area

Parameters:bbox (List[List[float]]) – coordinates of the top-left and bottom-right corners of the area
Returns:None
plot_projections(sample=None, target=None, ids=None)[source]

Plots t-SNE projection of users' trajectories

Parameters:
  • sample (np.ndarray or pd.DataFrame) – sample of trajectories
  • target (str) – column by which the data should be split (if target is None, the probability of target_event is highlighted)
  • ids (list or other iterable) – list of user_ids for visualization in plot
Returns:

None

predict_proba(sample)[source]

Predicts probability of sample

Parameters:sample (np.ndarray or pd.DataFrame) – sample of users' vectorized tracks (e.g. with tf-idf transform)
Returns:probabilities of different classes as list of [not target_event probability, target_event probability]
Return type:np.ndarray
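
A sketch of a typical workflow with Model; df is assumed to be a clickstream with the three required columns and 'lost' is an assumed target event:

    from retentioneering.analysis.model import Model

    model = Model(df, target_event='lost', settings={},
                  emb_type='tf-idf', ngram_range=(1, 3))
    model.fit_model(model_type='logit')    # only logit is supported for now
    model.plot()                           # metrics of the fitted classifier
    edges = model.build_important_track()  # most important edges in the graph
    model.plot_projections()               # t-SNE projection of trajectories
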
retentioneering.analysis.model.create_filter(data, n_folds=None)[source]

Creates a filter of events of interest using a histogram

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • n_folds (int) – number of folds in the histogram (it affects how far technical events should be from users' events)
Returns:

set
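
A sketch showing how the resulting filter can be fed into Model; df is an assumed clickstream and n_folds is left at its default:

    from retentioneering.analysis.model import Model, create_filter

    event_filter = create_filter(df)   # set of events of interest
    model = Model(df, target_event='lost', settings={},
                  event_filter=event_filter)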

retentioneering.analysis.utils module

retentioneering.analysis.utils.filter_welcome(df)[source]

Filter for truncated welcome visualization

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:dataset filtered to user events
Return type:pd.DataFrame
retentioneering.analysis.utils.get_accums(agg, name, max_rank)[source]

Creates Accumulator Variables

Parameters:
  • agg – Counts of events by step
  • name – Name of Accumulator
  • max_rank – Number of steps in pivot
Returns:

Accumulator Variable

retentioneering.analysis.utils.get_adjacency(df, adj_type)[source]

Creates graph adjacency matrix from table with aggregates by nodes

Parameters:
  • df (pd.DataFrame) – table with aggregates (from the retentioneering.analysis.utils.get_all_agg function)
  • adj_type (str) – name of col for weighting graph nodes (column name from df)
Returns:

adjacency matrix

Return type:

pd.DataFrame

retentioneering.analysis.utils.get_agg(df, agg_type)[source]

Creates time-based aggregates (weights) for graph nodes

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • agg_type (str) – type of aggregate; should be written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns:

table with aggregates by nodes of graph

Return type:

pd.DataFrame

retentioneering.analysis.utils.get_all_agg(df, agg_list)[source]

Creates time-based aggregates (weights) for graph nodes from agg_list

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • agg_list (List[str]) – list of needed aggregates; each aggregate should be written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns:

table with aggregates by nodes of graph

Return type:

pd.DataFrame
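
A sketch chaining get_agg / get_all_agg with get_adjacency above; df is an assumed clickstream and 'trans_count' follows the documented ‘name’ + ‘_’ + aggregate type convention:

    from retentioneering.analysis.utils import get_agg, get_all_agg, get_adjacency

    agg_single = get_agg(df, 'trans_count')        # a single aggregate
    agg = get_all_agg(df, ['trans_count'])         # several aggregates at once
    adjacency = get_adjacency(agg, 'trans_count')  # weight the graph by that column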

retentioneering.analysis.utils.get_desc_table(df, settings, target_event_list=['lost', 'passed'], max_steps=None, plot=True, plot_name=None)[source]

Builds distribution of events over steps

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_event_list (list) – list of target events
  • max_steps (int) – maximum number of steps to include in the pivot table
  • plot (bool) – if True then heatmap is plotted
  • plot_name (str) – name of file with graph plot
Returns:

Pivot table with distribution of events over steps

Return type:

pd.DataFrame
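
Usage sketch; df is an assumed clickstream and max_steps is an arbitrary illustrative cap:

    from retentioneering.analysis.utils import get_desc_table

    desc = get_desc_table(df, settings={},
                          target_event_list=['lost', 'passed'],
                          max_steps=30, plot=True)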

retentioneering.analysis.utils.get_diff(df_old, df_new, settings, precalc=False, plot=True, plot_name=None)[source]

Gets difference between two groups

Parameters:
  • df_old (pd.DataFrame) – Raw clickstream or calculated desc table of last version
  • df_new (pd.DataFrame) – Raw clickstream or calculated desc table of new version
  • settings (dict) – experiment config (can be empty dict here)
  • precalc (bool) – if True then precalculated desc tables are used
  • plot (bool) – if True then heatmap is plotted
  • plot_name (str) – name of file with graph plot
Returns:

Table of differences between two versions

Return type:

pd.DataFrame
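
A sketch of both calling modes, assuming df_old and df_new are two clickstream versions:

    from retentioneering.analysis.utils import get_desc_table, get_diff

    # from raw clickstreams
    diff = get_diff(df_old, df_new, settings={}, precalc=False, plot=True)

    # or from precalculated desc tables
    desc_old = get_desc_table(df_old, settings={}, plot=False)
    desc_new = get_desc_table(df_new, settings={}, plot=False)
    diff = get_diff(desc_old, desc_new, settings={}, precalc=True, plot=True)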

retentioneering.analysis.utils.get_shift(df)[source]

Creates next_event and time_to_next_event columns

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:source table with additional columns
Return type:pd.DataFrame
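
A one-line sketch, with df an assumed clickstream:

    from retentioneering.analysis.utils import get_shift

    shifted = get_shift(df)   # adds next_event and time_to_next_event columns
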
retentioneering.analysis.utils.plot_clusters(data, countmap, target_events=['lost', 'passed'], n_clusters=None, plot_cnt=2, width=10, height=5)[source]

Plot pie-chart with distribution of target events in clusters

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • countmap (pd.DataFrame) – result of retentioneering.analysis.utils.plot_frequency_map
  • target_events (List[str]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • n_clusters (int) – supposed number of clusters
  • plot_cnt (int) – number of plots for output
  • width (float) – width of plot
  • height (float) – height of plot
Returns:

None

retentioneering.analysis.utils.plot_frequency_map(df, settings, target_events=['lost', 'passed'], plot_name=None)[source]

Plots frequency histogram and heatmap of users' event counts

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (List[str]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • plot_name (str) – name of file with graph plot
Returns:

table with counts of events for users

Return type:

pd.DataFrame
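
A sketch that feeds the returned count table into plot_clusters above; df is an assumed clickstream and the cluster count is illustrative:

    from retentioneering.analysis.utils import plot_frequency_map, plot_clusters

    countmap = plot_frequency_map(df, settings={},
                                  target_events=['lost', 'passed'])
    plot_clusters(df, countmap, target_events=['lost', 'passed'],
                  n_clusters=4, plot_cnt=2)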

retentioneering.analysis.utils.plot_graph_python(df_agg, agg_type, settings, layout=<function random_layout>, plot_name=None)[source]

Visualize trajectories from aggregated tables (with python)

Parameters:
  • df_agg (pd.DataFrame) – table with aggregates (from the retentioneering.analysis.utils.get_all_agg function)
  • agg_type (str) – name of col for weighting graph nodes (column name from df)
  • settings (dict) – experiment config (can be empty dict here)
  • layout (func) – graph layout function (defaults to random_layout)
  • plot_name (str) – name of file with graph plot
Returns:

None
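
A sketch building the aggregate table first; df is an assumed clickstream and the default random_layout is kept:

    from retentioneering.analysis.utils import get_all_agg, plot_graph_python

    agg = get_all_agg(df, ['trans_count'])
    plot_graph_python(agg, 'trans_count', settings={})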

retentioneering.analysis.utils.prepare_dataset(df, target_events, event_filter=None, n_start_events=None)[source]

Prepares data for classifier inference

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • target_events (Union[list, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • event_filter (list or other iterable) – list of events to use in the analysis
  • n_start_events – number of events to keep from the start of each user's trajectory
Returns:

prepared data for inference (user events concatenated into a single trajectory per user)

Return type:

pd.DataFrame
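
A sketch; df is an assumed clickstream, and the event filter and trajectory cut-off are illustrative:

    from retentioneering.analysis.utils import prepare_dataset

    prepared = prepare_dataset(df, target_events=['lost', 'passed'],
                               event_filter=None, n_start_events=10)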

retentioneering.analysis.utils.prepare_prunned(df)[source]

Filter for truncated welcome visualization

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:dataset filtered to user events
Return type:pd.DataFrame

retentioneering.analysis.weighter module

retentioneering.analysis.weighter.calc_all_norm_mech(data, mechanics_events, mode='session', duration_thresh=1, len_thresh=None)[source]

Calculates weights of different mechanics in users' sessions

Parameters:
  • data – clickstream data with a session column (rank of the user's session)
  • mechanics_events – mapping of each mechanic to its target events
  • mode – if session, weights are calculated over sessions; if full, over the full user story
  • duration_thresh – duration (time) threshold for deleting technical (ping) sessions
  • len_thresh – threshold on the number of events in a session for deleting technical (ping) sessions
Returns:

session description with weights of each mechanic

Return type:

pd.DataFrame
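
A sketch; sessions_df is assumed to be a clickstream with a session column, and the mechanics_events mapping is illustrative (in practice it can come from mechanics_enrichment below):

    from retentioneering.analysis.weighter import calc_all_norm_mech

    mechanics_events = {
        'onboarding': ['welcome', 'signup'],
        'purchase': ['add_to_cart', 'checkout'],
    }
    weights = calc_all_norm_mech(sessions_df, mechanics_events,
                                 mode='session', duration_thresh=1)
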
retentioneering.analysis.weighter.mechanics_enrichment(data, mechanics, id_col, event_col, q=0.99, q2=0.99)[source]

Enriches the list of events specific to a mechanic

Parameters:
  • data (pd.DataFrame) – clickstream data with a session column (rank of the user's session)
  • mechanics (pd.DataFrame) – table with description in form [id_col, event_col], where id_col is a column with mechanic name and event_col is a column which contains target events specific for that mechanic
  • id_col – name of the column with mechanic name
  • event_col – name of the column with target events specific for that mechanic
  • q (float in interval (0, 1)) – quantile for frequency of target events
  • q2 (float in interval (0, 1)) – quantile for frequency of target events of other mechanics
Returns:

mapping of mechanic and its target events

Return type:

Dict[str, List[str]]
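
A sketch; sessions_df is an assumed clickstream with a session column, and the mechanics table and its column names are illustrative:

    import pandas as pd
    from retentioneering.analysis.weighter import mechanics_enrichment

    mechanics = pd.DataFrame({
        'mechanic': ['onboarding', 'onboarding', 'purchase'],
        'target_event': ['welcome', 'signup', 'checkout'],
    })
    mechanics_events = mechanics_enrichment(sessions_df, mechanics,
                                            id_col='mechanic',
                                            event_col='target_event',
                                            q=0.99, q2=0.99)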

Module contents