retentioneering.analysis package

Submodules

retentioneering.analysis.calculate module

retentioneering.analysis.calculate.calculate_frequency_hist(df, settings, target_events=None, make_plot=True, save=True, plot_name=None, figsize=(8, 5))[source]

Calculate frequency of each event from input clickstream and plot a barplot

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (Union[tuple, list, str, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • make_plot (bool) – plot stats or not
  • save (bool) – True if the graph should be saved
  • plot_name (str) – name of file with graph plot
  • figsize (tuple) – width, height in inches. If not provided, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

pd.DataFrame
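
A minimal usage sketch (the toy clickstream below is illustrative; its timestamps only need to match the format of your own export):

    import pandas as pd
    from retentioneering.analysis.calculate import calculate_frequency_hist

    # toy clickstream with the three required columns (illustrative values)
    df = pd.DataFrame({
        'event_name': ['welcome', 'signup', 'lost', 'welcome', 'passed'],
        'event_timestamp': [1, 2, 3, 1, 2],
        'user_pseudo_id': ['u1', 'u1', 'u1', 'u2', 'u2'],
    })

    freq = calculate_frequency_hist(df, settings={}, target_events='lost',
                                    make_plot=True, save=False)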

retentioneering.analysis.calculate.calculate_frequency_map(df, settings, target_events=None, plot_name=None, make_plot=True, save=True, figsize_hist=(8, 5), figsize_heatmap=(10, 15))[source]

Calculate frequency of each event for each user from input clickstream and plot a heatmap

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (Union[tuple, list, str, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • plot_name (str) – name of file with graph plot
  • make_plot (bool) – plot stats or not
  • save (bool) – True if the graph should be saved
  • figsize_hist (tuple) – width, height in inches for bar plot with events. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
  • figsize_heatmap (tuple) – width, height in inches for heatmap. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

pd.DataFrame
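
Usage sketch, reusing a clickstream df as in the previous example; both figure sizes are optional:

    from retentioneering.analysis.calculate import calculate_frequency_map

    freq_map = calculate_frequency_map(df, settings={},
                                       target_events=('lost', 'passed'),
                                       make_plot=True, save=False,
                                       figsize_hist=(8, 5),
                                       figsize_heatmap=(10, 15))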

retentioneering.analysis.cluster module

retentioneering.analysis.cluster.add_cluster_of_users(data, users_clusters, how='left')[source]

Add cluster of each user to clickstream data

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least one column: user_pseudo_id
  • users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
  • how (str) – argument to pass in pd.merge function
Returns:

pd.DataFrame
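
A minimal sketch, assuming data is a clickstream DataFrame with a user_pseudo_id column; users_clusters here is hand-made, in practice it usually comes from cluster_users below:

    import pandas as pd
    from retentioneering.analysis.cluster import add_cluster_of_users

    users_clusters = pd.DataFrame({'user_pseudo_id': ['u1', 'u2'],
                                   'cluster': [0, 1]})
    # left-join the cluster label onto every clickstream row
    data_with_clusters = add_cluster_of_users(data, users_clusters, how='left')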

retentioneering.analysis.cluster.calculate_cluster_stats(data, users_clusters, settings, target_events=('lost', 'passed'), make_plot=True, plot_count=2, save=True, plot_name=None, figsize=(10, 5))[source]

Plot pie-chart with distribution of target events in clusters

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • users_clusters (pd.DataFrame) – DataFrame with user_pseudo_id and cluster for each user
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (list or tuple) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • make_plot (bool) – plot stats or not
  • plot_count (int) – number of plots for output
  • save (bool) – True if the graph should be saved
  • plot_name (str) – name of file with graph plot
  • figsize (tuple) – width, height in inches. If None, defaults to rcParams[“figure.figsize”] = [6.4, 4.8]
Returns:

np.array
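
Usage sketch, assuming data is a clickstream and users_clusters maps user_pseudo_id to a cluster (e.g. the output of cluster_users below):

    from retentioneering.analysis.cluster import calculate_cluster_stats

    stats = calculate_cluster_stats(data, users_clusters, settings={},
                                    target_events=('lost', 'passed'),
                                    make_plot=True, plot_count=2, save=False)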

retentioneering.analysis.cluster.cluster_users(countmap, n_clusters=None, clusterer=None)[source]

Cluster users based on input dataframe and return DataFrame with user_pseudo_id and cluster for each user

Parameters:
  • countmap (pd.DataFrame) – input dataframe, should have user_id in index. All fields will be features in clustering algorithm
  • n_clusters (int) – supposed number of clusters, could be None
  • clusterer (func) – clustering algorithm. Should have fit_predict function
Returns:

pd.DataFrame
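
A sketch of both calling styles; countmap is assumed to be a user-by-event count table (e.g. the result of retentioneering.analysis.utils.plot_frequency_map), and sklearn's KMeans is just one possible clusterer with a fit_predict method:

    from sklearn.cluster import KMeans
    from retentioneering.analysis.cluster import cluster_users

    # let the function choose the algorithm for a supposed number of clusters
    users_clusters = cluster_users(countmap, n_clusters=4)

    # or pass any clusterer that implements fit_predict
    users_clusters = cluster_users(countmap, clusterer=KMeans(n_clusters=4))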

retentioneering.analysis.model module

class retentioneering.analysis.model.Model(data, target_event, settings, event_filter=None, n_start_events=None, emb_type='tf-idf', ngram_range=(1, 3), emb_dims=None, embedder=None)[source]

Bases: object

Base model for classification

build_important_track()[source]

Finds the most important tracks for definition of target_event

Returns:most important edges in graph
Return type:pd.DataFrame
fit_model(model_type='logit')[source]

Fits classifier

Parameters:model_type (str) – type of model (now only logit is supported)
Returns:None
plot()[source]

Plot metrics of model

Returns:None
plot_cluster_track(bbox)[source]

Plots graph for users in selected area

Parameters:bbox (List[List[float]]) – coordinates of the top-left and bottom-right corners of the area
Returns:None
plot_projections(sample=None, target=None, ids=None)[source]

Plots t-SNE projection of users' trajectories

Parameters:
  • sample (np.ndarray or pd.DataFrame) – sample of trajectories
  • target (str) – column by which the data should be split (if target is None, the probability of target_event is highlighted)
  • ids (list or other iterable) – list of user_ids for visualization in plot
Returns:

None

predict_proba(sample)[source]

Predicts probability of sample

Parameters:sample (np.ndarray or pd.DataFrame) – sample of users' vectorized tracks (e.g. with tf-idf transform)
Returns:probabilities of different classes as list of [not target_event probability, target_event probability]
Return type:np.ndarray
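
A sketch of a typical workflow with Model; df is assumed to be a clickstream with the three required columns and 'lost' is an assumed target event:

    from retentioneering.analysis.model import Model

    model = Model(df, target_event='lost', settings={},
                  emb_type='tf-idf', ngram_range=(1, 3))
    model.fit_model(model_type='logit')    # only logit is supported for now
    model.plot()                           # metrics of the fitted classifier
    edges = model.build_important_track()  # most important edges in the graph
    model.plot_projections()               # t-SNE projection of trajectories
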
retentioneering.analysis.model.create_filter(data, n_folds=None)[source]

Creates a filter of events of interest using a histogram

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • n_folds (int) – number of folds in the histogram (it affects how far technical events should be from users' events)
Returns:

set
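
A sketch showing how the resulting filter can be fed into Model; df is an assumed clickstream and n_folds is left at its default:

    from retentioneering.analysis.model import Model, create_filter

    event_filter = create_filter(df)   # set of events of interest
    model = Model(df, target_event='lost', settings={},
                  event_filter=event_filter)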

retentioneering.analysis.utils module

retentioneering.analysis.utils.filter_welcome(df)[source]

Filter for truncated welcome visualization

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:dataset filtered to user events
Return type:pd.DataFrame
retentioneering.analysis.utils.get_accums(agg, name, max_rank)[source]

Creates Accumulator Variables

Parameters:
  • agg – Counts of events by step
  • name – Name of Accumulator
  • max_rank – Number of steps in pivot
Returns:

Accumulator Variable

retentioneering.analysis.utils.get_adjacency(df, adj_type)[source]

Creates graph adjacency matrix from table with aggregates by nodes

Parameters:
  • df (pd.DataFrame) – table with aggregates (from the retentioneering.analysis.utils.get_all_agg function)
  • adj_type (str) – name of col for weighting graph nodes (column name from df)
Returns:

adjacency matrix

Return type:

pd.DataFrame

retentioneering.analysis.utils.get_agg(df, agg_type)[source]

Creates time-based aggregates (weights) for graph nodes

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • agg_type (str) – type of aggregate; should be written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns:

table with aggregates by nodes of graph

Return type:

pd.DataFrame

retentioneering.analysis.utils.get_all_agg(df, agg_list)[source]

Creates time-based aggregates (weights) for graph nodes from agg_list

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • agg_list (List[str]) – list of needed aggregates; each aggregate should be written in the form ‘name’ + ‘_’ + aggregate type (e.g. trans_count, where trans is the name and count is the aggregate type). Aggregate types can be: max, min, mean, median, std, count. For the full list, see the pd.DataFrame.groupby().agg() documentation
Returns:

table with aggregates by nodes of graph

Return type:

pd.DataFrame
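
A sketch chaining get_agg / get_all_agg with get_adjacency above; df is an assumed clickstream and 'trans_count' follows the documented ‘name’ + ‘_’ + aggregate type convention:

    from retentioneering.analysis.utils import get_agg, get_all_agg, get_adjacency

    agg_single = get_agg(df, 'trans_count')        # a single aggregate
    agg = get_all_agg(df, ['trans_count'])         # several aggregates at once
    adjacency = get_adjacency(agg, 'trans_count')  # weight the graph by that column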

retentioneering.analysis.utils.get_desc_table(df, settings, target_event_list=['lost', 'passed'], max_steps=None, plot=True, plot_name=None)[source]

Builds distribution of events over steps

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_event_list (list) – list of target events
  • max_steps (int) – maximum number of steps to include in the pivot table
  • plot (bool) – if True then heatmap is plotted
  • plot_name (str) – name of file with graph plot
Returns:

Pivot table with distribution of events over steps

Return type:

pd.DataFrame
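
Usage sketch; df is an assumed clickstream and max_steps is an arbitrary illustrative cap:

    from retentioneering.analysis.utils import get_desc_table

    desc = get_desc_table(df, settings={},
                          target_event_list=['lost', 'passed'],
                          max_steps=30, plot=True)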

retentioneering.analysis.utils.get_diff(df_old, df_new, settings, precalc=False, plot=True, plot_name=None)[source]

Gets difference between two groups

Parameters:
  • df_old (pd.DataFrame) – Raw clickstream or calculated desc table of last version
  • df_new (pd.DataFrame) – Raw clickstream or calculated desc table of new version
  • settings (dict) – experiment config (can be empty dict here)
  • precalc (bool) – if True then precalculated desc tables are used
  • plot (bool) – if True then heatmap is plotted
  • plot_name (str) – name of file with graph plot
Returns:

Table of differences between two versions

Return type:

pd.DataFrame
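
A sketch of both calling modes, assuming df_old and df_new are two clickstream versions:

    from retentioneering.analysis.utils import get_desc_table, get_diff

    # from raw clickstreams
    diff = get_diff(df_old, df_new, settings={}, precalc=False, plot=True)

    # or from precalculated desc tables
    desc_old = get_desc_table(df_old, settings={}, plot=False)
    desc_new = get_desc_table(df_new, settings={}, plot=False)
    diff = get_diff(desc_old, desc_new, settings={}, precalc=True, plot=True)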

retentioneering.analysis.utils.get_shift(df)[source]

Creates next_event and time_to_next_event columns

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:source table with additional columns
Return type:pd.DataFrame
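
A one-line sketch, with df an assumed clickstream:

    from retentioneering.analysis.utils import get_shift

    shifted = get_shift(df)   # adds next_event and time_to_next_event columns
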
retentioneering.analysis.utils.plot_clusters(data, countmap, target_events=['lost', 'passed'], n_clusters=None, plot_cnt=2, width=10, height=5)[source]

Plot pie-chart with distribution of target events in clusters

Parameters:
  • data (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • countmap (pd.DataFrame) – result of retentioneering.analysis.utils.plot_frequency_map
  • target_events (List[str]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • n_clusters (int) – supposed number of clusters
  • plot_cnt (int) – number of plots for output
  • width (float) – width of plot
  • height (float) – height of plot
Returns:

None

retentioneering.analysis.utils.plot_frequency_map(df, settings, target_events=['lost', 'passed'], plot_name=None)[source]

Plots frequency histogram and heatmap of users' event counts

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • settings (dict) – experiment config (can be empty dict here)
  • target_events (List[str]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • plot_name (str) – name of file with graph plot
Returns:

table with counts of events for users

Return type:

pd.DataFrame
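
A sketch that feeds the returned count table into plot_clusters above; df is an assumed clickstream and the cluster count is illustrative:

    from retentioneering.analysis.utils import plot_frequency_map, plot_clusters

    countmap = plot_frequency_map(df, settings={},
                                  target_events=['lost', 'passed'])
    plot_clusters(df, countmap, target_events=['lost', 'passed'],
                  n_clusters=4, plot_cnt=2)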

retentioneering.analysis.utils.plot_graph_python(df_agg, agg_type, settings, layout=<function random_layout>, plot_name=None)[source]

Visualize trajectories from aggregated tables (with python)

Parameters:
  • df_agg (pd.DataFrame) – table with aggregates (from the retentioneering.analysis.utils.get_all_agg function)
  • agg_type (str) – name of col for weighting graph nodes (column name from df)
  • settings (dict) – experiment config (can be empty dict here)
  • layout (func) – graph layout function (defaults to random_layout)
  • plot_name (str) – name of file with graph plot
Returns:

None
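
A sketch building the aggregate table first; df is an assumed clickstream and the default random_layout is kept:

    from retentioneering.analysis.utils import get_all_agg, plot_graph_python

    agg = get_all_agg(df, ['trans_count'])
    plot_graph_python(agg, 'trans_count', settings={})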

retentioneering.analysis.utils.prepare_dataset(df, target_events, event_filter=None, n_start_events=None)[source]

Prepares data for classifier inference

Parameters:
  • df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
  • target_events (Union[list, None]) – name of the event(s) that signal the target function (e.g. for prediction of lost users it will be lost)
  • event_filter (list or other iterable) – list of events to use in the analysis
  • n_start_events – number of events to keep from the start of each user's trajectory
Returns:

prepared data for inference (user events concatenated into a single trajectory per user)

Return type:

pd.DataFrame
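
A sketch; df is an assumed clickstream, and the event filter and trajectory cut-off are illustrative:

    from retentioneering.analysis.utils import prepare_dataset

    prepared = prepare_dataset(df, target_events=['lost', 'passed'],
                               event_filter=None, n_start_events=10)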

retentioneering.analysis.utils.prepare_prunned(df)[source]

Filter for truncated welcome visualization

Parameters:df (pd.DataFrame) – data from BQ or your own (clickstream). Should have at least three columns: event_name, event_timestamp and user_pseudo_id
Returns:dataset filtered to user events
Return type:pd.DataFrame

retentioneering.analysis.weighter module

retentioneering.analysis.weighter.calc_all_norm_mech(data, mechanics_events, mode='session', duration_thresh=1, len_thresh=None)[source]

Calculates weights of different mechanics in users' sessions

Parameters:
  • data – clickstream data with a session column (rank of the user's session)
  • mechanics_events – mapping of each mechanic to its target events
  • mode – if session, weights are calculated over sessions; if full, over the full user story
  • duration_thresh – duration (time) threshold for deleting technical (ping) sessions
  • len_thresh – threshold on the number of events in a session for deleting technical (ping) sessions
Returns:

session description with weights of each mechanic

Return type:

pd.DataFrame
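
A sketch; sessions_df is assumed to be a clickstream with a session column, and the mechanics_events mapping is illustrative (in practice it can come from mechanics_enrichment below):

    from retentioneering.analysis.weighter import calc_all_norm_mech

    mechanics_events = {
        'onboarding': ['welcome', 'signup'],
        'purchase': ['add_to_cart', 'checkout'],
    }
    weights = calc_all_norm_mech(sessions_df, mechanics_events,
                                 mode='session', duration_thresh=1)
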
retentioneering.analysis.weighter.mechanics_enrichment(data, mechanics, id_col, event_col, q=0.99, q2=0.99)[source]

Enriches the list of events specific to a mechanic

Parameters:
  • data (pd.DataFrame) – clickstream data with a session column (rank of the user's session)
  • mechanics (pd.DataFrame) – table with description in form [id_col, event_col], where id_col is a column with mechanic name and event_col is a column which contains target events specific for that mechanic
  • id_col – name of the column with mechanic name
  • event_col – name of the column with target events specific for that mechanic
  • q (float in interval (0, 1)) – quantile for frequency of target events
  • q2 (float in interval (0, 1)) – quantile for frequency of target events of other mechanics
Returns:

mapping of mechanic and its target events

Return type:

Dict[str, List[str]]
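
A sketch; sessions_df is an assumed clickstream with a session column, and the mechanics table and its column names are illustrative:

    import pandas as pd
    from retentioneering.analysis.weighter import mechanics_enrichment

    mechanics = pd.DataFrame({
        'mechanic': ['onboarding', 'onboarding', 'purchase'],
        'target_event': ['welcome', 'signup', 'checkout'],
    })
    mechanics_events = mechanics_enrichment(sessions_df, mechanics,
                                            id_col='mechanic',
                                            event_col='target_event',
                                            q=0.99, q2=0.99)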

Module contents