retentioneering.utils package

Submodules

retentioneering.utils.bq_download module

retentioneering.utils.bq_download.download_events(client, job_config, user_filter_event_names=None, user_filter_event_table=None, dates_users=None, users_app_version=None, event_filter_event_names=None, event_filter_event_table=None, dates_events=None, events_app_version=None, count_events=None, use_last_events=False, random_user_limit=None, random_seed=None, settings=None, group_name=None, drop_duplicates=None, return_dataframe=True, return_only_query=False, hide_progress_bar=False, progress_bar_min_interval=4)[source]
Parameters:
  • client (bigquery.Client()) – bigquery client
  • job_config (bigquery.QueryJobConfig()) – bigquery client job config
  • user_filter_event_names (list) – filter on events for user selection
  • user_filter_event_table (str) – name of the table with users
  • dates_users (tuple or list) – first and last dates of first user appearance
  • users_app_version (str) – select only users with this app_version
  • event_filter_event_names (list) – select only users with such events
  • event_filter_event_table (str) – name of the table with events
  • dates_events (tuple or list) – first and last date of the event selection period
  • events_app_version (str) – app version filter for event table
  • count_events (int) – number of events taken from the event table for each user
  • use_last_events (bool) – if True, take the last events before the target event; otherwise take the first events after it
  • random_user_limit (int) – number of randomly selected users
  • random_seed (int) – random seed
  • settings (dict) – settings dict
  • group_name (str) – add a new column ‘group_name’ filled with this value
  • drop_duplicates (list) – list of columns in the bigquery table used to drop duplicates
  • return_dataframe (bool) – if True, return the data as a pd.DataFrame; otherwise as a list
  • return_only_query (bool) – return only the query string without running it
  • hide_progress_bar (bool) – hide tqdm progress bar
  • progress_bar_min_interval (int) – min interval of tqdm progress bar in seconds
Returns:

pd.DataFrame or list (or only the query string if return_only_query is True)
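
A minimal usage sketch; the key file, table name, dates, and filter combination below are placeholders, not values required by the API:

    from retentioneering.utils import bq_download, utils

    client = utils.init_client('service_account.json')                               # assumed key file path
    job_config = utils.init_job_config('my-project', 'my_dataset', 'events_buffer')  # assumed destination

    df = bq_download.download_events(
        client,
        job_config,
        event_filter_event_table='my_dataset.app_events',   # assumed source table name
        dates_events=('2019-01-01', '2019-01-31'),
        random_user_limit=1000,
        random_seed=42,
    )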

retentioneering.utils.bq_download.download_events_multi(client, job_config, settings=None, return_only_query=False, **kwargs)[source]

Generate queries from settings, run them in bigquery and download results.

Parameters:
  • client (bigquery.Client()) – bigquery client
  • job_config (bigquery.QueryJobConfig()) – bigquery client job config
  • settings (dict) – settings dict
  • return_only_query (bool) – return only the query strings for all queries without running them
  • **kwargs – options to pass to the download_events function

Returns:

pd.DataFrame or list
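
A hedged sketch: the settings dict drives query generation and its schema is defined by your experiment config, so the empty dict below is only a placeholder; return_only_query=True lets you inspect the generated queries without running them:

    from retentioneering.utils import bq_download, utils

    client = utils.init_client('service_account.json')                               # assumed key file path
    job_config = utils.init_job_config('my-project', 'my_dataset', 'events_buffer')  # assumed destination
    settings = {}  # placeholder: fill with your experiment config

    queries = bq_download.download_events_multi(
        client, job_config, settings=settings, return_only_query=True
    )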

retentioneering.utils.bq_download.download_table(client, dataset_id, table_id)[source]

Download table from bigquery

Parameters:
  • client (bigquery.Client()) – bigquery client
  • dataset_id (str) – target dataset id
  • table_id (str) – target table id
Returns:

pd.DataFrame
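
For example (the key file, dataset id, and table id are placeholders):

    from retentioneering.utils import bq_download, utils

    client = utils.init_client('service_account.json')  # assumed key file path
    users = bq_download.download_table(client, dataset_id='my_dataset', table_id='users')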

retentioneering.utils.bq_download.run_query(client, query, job_config=None, group_name=None, return_dataframe=True, return_only_query=False, hide_progress_bar=False, progress_bar_min_interval=4, **params)[source]

Run a query in bigquery and download results

Parameters:
  • client (bigquery.Client()) – bigquery client
  • query (str) – query to run (may contain format placeholders filled from **params)
  • job_config (bigquery.QueryJobConfig()) – bigquery client job config
  • group_name (str or None) – add a new column ‘group_name’ filled with this value
  • return_dataframe (bool) – if True, return the data as a pd.DataFrame; otherwise as a list
  • return_only_query (bool) – return only the query string without running it
  • hide_progress_bar (bool) – hide tqdm progress bar
  • progress_bar_min_interval (int) – min interval of tqdm progress bar in seconds
  • **params – options to pass to the query.format function

Returns:

list or pd.DataFrame
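
A sketch showing how **params are substituted into the query via query.format; the query text and placeholder names are illustrative:

    from retentioneering.utils import bq_download, utils

    client = utils.init_client('service_account.json')                               # assumed key file path
    job_config = utils.init_job_config('my-project', 'my_dataset', 'events_buffer')  # assumed destination

    # {project} and {dataset} are filled in by query.format from **params.
    query = 'SELECT user_id, event_name FROM [{project}:{dataset}.events] LIMIT 1000'
    df = bq_download.run_query(client, query, job_config=job_config,
                               project='my-project', dataset='my_dataset')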

retentioneering.utils.export module

retentioneering.utils.export.export_tracks(df, settings, users='all', task='lost', order='all', treshold=0.5, start_event=None, end_event=None)[source]

Visualize trajectories from event clickstream (with Mathematica)

Parameters:
  • df (pd.DataFrame) – event clickstream
  • settings (dict) – experiment config (can be an empty dict here)
  • users (str or list) – 'all' or a list of user ids to plot a specific group
  • task (str) – type of task for different visualization (can be lost or prunned_welcome)
  • order (int) – depth in sessions for filtering
  • treshold (float) – threshold for session splitting
  • start_event (str) – name of the first event in a trajectory
  • end_event (str) – name of the last event in a trajectory
Returns:

None
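
A hedged example, assuming df is an already prepared event clickstream (per the parameters above, settings may be an empty dict):

    from retentioneering.utils import export

    # df: pd.DataFrame with the event clickstream, prepared upstream
    export.export_tracks(df, settings={}, users='all', task='lost', treshold=0.5)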

retentioneering.utils.export.plot_graph_api(df, settings, users='all', task='lost', order='all', treshold=0.5, start_event=None, end_event=None)[source]

retentioneering.utils.preparing module

class retentioneering.utils.preparing.SessionSplitter(n_components)[source]

Bases: object

Class for session splitting

add_session_column(df, thr, sort)[source]

Creates a column with the session rank

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • thr (float) – time threshold from the previous step for session interruption
  • sort (bool) – whether sorting by User ID & Event Timestamp is required
Returns:

input data with a session column

Return type:

pd.DataFrame

add_time_from_prev_event(df, unit=None, delta_unit='s')[source]

Adds a column with the time from the previous event

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • unit (str) – type of string for pd.datetime parsing
  • delta_unit (str) – unit of the step in the timestamp column (e.g. seconds since 01-01-1970)
Returns:

input data with a from_prev_event column

Return type:

pd.DataFrame

fit(df, columns_config, unit=None, delta_unit='s')[source]

Fits a Gaussian mixture model used to estimate the session-splitting threshold

Parameters:
  • df – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • columns_config – Dictionary that maps to the required column names: {‘event_name_col’: Event Name Column, ‘event_timestamp_col’: Event Timestamp Column, ‘user_id_col’: User ID Column}
  • unit – type of string for pd.datetime parsing
  • delta_unit – unit of the step in the timestamp column (e.g. seconds since 01-01-1970)
Returns:

None

get_threshold(thr=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]))[source]

Finds the best threshold

Parameters:
  • thr (float in interval (0, 1)) – Probability threshold for session interruption
  • thrs (List[float] or iterable) – Timedelta values for threshold checking
Returns:

value of threshold

Return type:

float

predict(df, thr_prob=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]), sort=True)[source]

Predicts sessions for passed DataFrame

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • thr_prob (float) – Probability threshold for session interruption
  • thrs (List[float] or iterable) – Timedelta values for threshold checking
  • sort (bool) – whether sorting by User ID & Event Timestamp is required
Returns:

Passed DataFrame augmented with session column

Return type:

pd.DataFrame

visualize(df, figsize=(15, 5), dpi=500, **kwargs)[source]

Visualize the mixture of fitted distributions

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • figsize (tuple) – size of the plot
  • dpi (int) – dots per inch for saving the plot
Returns:

None
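
A minimal end-to-end sketch for this class; the column names in columns_config and n_components=2 are illustrative assumptions:

    from retentioneering.utils.preparing import SessionSplitter

    # df: clickstream DataFrame prepared upstream (e.g. via download_events)
    columns_config = {
        'event_name_col': 'event_name',            # assumed column names
        'event_timestamp_col': 'event_timestamp',
        'user_id_col': 'user_id',
    }

    splitter = SessionSplitter(n_components=2)      # assumed number of mixture components
    splitter.fit(df, columns_config, delta_unit='s')
    sessions = splitter.predict(df, thr_prob=0.95)  # returns df augmented with a session column
    splitter.visualize(df)                          # inspect the fitted mixture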

retentioneering.utils.preparing.add_first_and_last_events(df, first_event_name='fisrt_event', last_event_name='last_event')[source]

For every user and session, adds a first event named <first_event_name> and a last event named <last_event_name>

Parameters:
  • df (pd.DataFrame) – input DataFrame
  • first_event_name (str) – first event name
  • last_event_name (str) – last event name
Returns:

pd.DataFrame
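
For example (the event names are arbitrary labels):

    from retentioneering.utils import preparing

    # df: clickstream with a session column (see SessionSplitter above)
    df = preparing.add_first_and_last_events(
        df, first_event_name='session_start', last_event_name='session_end'
    )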

retentioneering.utils.preparing.add_lost_events(df, positive_event_name='passed', negative_event_name='lost', settings=None)[source]
Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • positive_event_name (str) – positive event name
  • negative_event_name (str) – negative event name which should be added if there is no positive event in the session
  • settings (dict) – config dict
Returns:

pd.DataFrame

retentioneering.utils.preparing.add_passed_event(df, positive_event_name='passed', filters=None, settings=None)[source]
Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • positive_event_name (str) – name of the positive event, which is added if the filter conditions are met
  • filters (dict) – dict with filter conditions
  • settings (dict) – config dict
Returns:

pd.DataFrame

retentioneering.utils.preparing.drop_duplicated_events(df, duplicate_thr_time=0, settings=None)[source]

Delete duplicated events (two events with the same event name where the time between them is less than duplicate_thr_time)

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • duplicate_thr_time (int) – threshold for time between events
  • settings (dict) – config dict
Returns:

pd.DataFrame
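
For example (the threshold value is illustrative; its unit depends on your timestamp column):

    from retentioneering.utils import preparing

    # Drop repeated identical events that occur within the threshold of each other
    df = preparing.drop_duplicated_events(df, duplicate_thr_time=2)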

retentioneering.utils.preparing.filter_events(df, filters=[], settings=None)[source]
Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • filters (list) – list, each element of which is a filter dict
  • settings (dict) – config dict
Returns:

pd.DataFrame

retentioneering.utils.preparing.filter_users(df, filters=[], settings=None)[source]
Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • filters (list) – list, each element of which is a filter dict
  • settings (dict) – config dict
Returns:

pd.DataFrame

retentioneering.utils.queries module

retentioneering.utils.utils module

class retentioneering.utils.utils.Config(filename, is_json=False)[source]

Bases: dict

A dict subclass extended with a saving (export) option

export(filename, is_json=False)[source]

Dumps config to file

Parameters:
  • filename (str) – output file name
  • is_json (bool) – save in JSON format (YAML otherwise)
Returns:
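
For example (file names are placeholders):

    from retentioneering.utils.utils import Config

    config = Config('settings.yaml')               # load config from a YAML file
    config['group_name'] = 'test'                  # behaves like a regular dict
    config.export('settings.json', is_json=True)   # save a JSON copy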

retentioneering.utils.utils.init_client(service_account_filepath, **kwargs)[source]

Return the bigquery.Client()

Parameters:
  • service_account_filepath (path) – path to the service account JSON key file
  • kwargs (keywords) – keyword arguments to pass to the bigquery.Client.from_service_account_json function
Returns:

bigquery.Client()
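
For example (the key file path is a placeholder; extra keywords are forwarded to bigquery.Client.from_service_account_json):

    from retentioneering.utils import utils

    client = utils.init_client('service_account.json', project='my-project')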

retentioneering.utils.utils.init_from_file(filename, is_json=False)[source]

Create a bigquery.Client() and a bigquery.QueryJobConfig() from a json or yaml file

Parameters:
  • filename (str) – path to file with config
  • is_json (bool) – read the file as json if True (as yaml otherwise)
Returns:
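
A hedged sketch, assuming the function returns the client and job config built from the file (the return value is not documented above):

    from retentioneering.utils import utils

    # Assumed to yield the bigquery.Client() and bigquery.QueryJobConfig() pair
    client, job_config = utils.init_from_file('bq_config.yaml')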

retentioneering.utils.utils.init_job_config(project, destination_dataset, destination_table)[source]

Return the bigquery.QueryJobConfig() with legacy SQL and a destination table set, to allow large results

Parameters:
  • project (str) – project name where destination table is
  • destination_dataset (str) – dataset id where destination table is
  • destination_table (str) – destination table id
Returns:

bigquery.QueryJobConfig()

Module contents