retentioneering.utils package¶
Submodules¶
retentioneering.utils.bq_download module¶
retentioneering.utils.bq_download.download_events(client, job_config, user_filter_event_names=None, user_filter_event_table=None, dates_users=None, users_app_version=None, event_filter_event_names=None, event_filter_event_table=None, dates_events=None, events_app_version=None, count_events=None, use_last_events=False, random_user_limit=None, random_seed=None, settings=None, group_name=None, drop_duplicates=None, return_dataframe=True, return_only_query=False, hide_progress_bar=False, progress_bar_min_interval=4)[source]¶
Parameters:
- client (bigquery.Client()) – bigquery client
- job_config (bigquery.QueryJobConfig()) – bigquery client job config
- user_filter_event_names (list) – filter on events for user selection
- user_filter_event_table (str) – name of the table with users
- dates_users (tuple or list) – first and last dates of first user appearance
- users_app_version (str) – select only users with this app_version
- event_filter_event_names (list) – select only users with such events
- event_filter_event_table (str) – name of the table with events
- dates_events (tuple or list) – first and last date of the event selection period
- events_app_version (str) – app version filter for event table
- count_events (int) – number of events taken from the event table for every user
- use_last_events (bool) – use last events before target event if true, use first events after otherwise
- random_user_limit (int) – number of randomly selected users
- random_seed (int) – random seed
- settings (dict) – settings dict
- group_name (str) – add a new column ‘group_name’ filled with this value
- drop_duplicates (list) – list of columns in bigquery table which are used to drop duplicates
- return_dataframe (bool) – if True, data is returned as a pd.DataFrame; otherwise as a list
- return_only_query (bool) – return only query string without running
- hide_progress_bar (bool) – hide tqdm progress bar
- progress_bar_min_interval (int) – min interval of tqdm progress bar in seconds
Returns: pd.DataFrame or list
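A minimal call sketch, assuming BigQuery credentials are already configured; the table name, event names, dates and limits below are placeholder assumptions, not values from the package:

    from google.cloud import bigquery
    from retentioneering.utils.bq_download import download_events

    client = bigquery.Client()              # assumes default application credentials
    job_config = bigquery.QueryJobConfig()  # see init_job_config below for a preconfigured one

    # 'analytics.app_events' and the event names are placeholders for your own data
    df = download_events(
        client, job_config,
        user_filter_event_names=['first_open'],
        user_filter_event_table='analytics.app_events',
        dates_users=('2019-01-01', '2019-01-31'),
        count_events=100,
        random_user_limit=1000,
        random_seed=42,
    )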
retentioneering.utils.bq_download.download_events_multi(client, job_config, settings=None, return_only_query=False, **kwargs)[source]¶
Generate queries from settings, run them in bigquery and download results.
Parameters:
- client (bigquery.Client()) – bigquery client
- job_config (bigquery.QueryJobConfig()) – bigquery client job config
- settings (dict) – settings dict
- return_only_query (bool) – return only the query strings for all queries, without running them
- **kwargs – options to pass to the download_events function
Returns: pd.DataFrame or list
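A sketch; the structure of settings is not documented in this section, so it is assumed to have been loaded elsewhere (e.g. via Config or init_from_file below):

    from retentioneering.utils.bq_download import download_events_multi

    # client, job_config: as created for download_events above
    # settings: config dict (its exact schema is not documented in this section)
    queries = download_events_multi(client, job_config, settings=settings,
                                    return_only_query=True)   # inspect the generated SQL
    df = download_events_multi(client, job_config, settings=settings)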
retentioneering.utils.bq_download.download_table(client, dataset_id, table_id)[source]¶
Download a table from bigquery.
Parameters:
- client (bigquery.Client()) – bigquery client
- dataset_id (str) – target dataset id
- table_id (str) – target table id
Returns: pd.DataFrame
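A one-line sketch; the dataset and table ids are placeholders:

    from retentioneering.utils.bq_download import download_table

    # client: as created for download_events above
    df = download_table(client, dataset_id='analytics', table_id='app_events_20190101')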
retentioneering.utils.bq_download.run_query(client, query, job_config=None, group_name=None, return_dataframe=True, return_only_query=False, hide_progress_bar=False, progress_bar_min_interval=4, **params)[source]¶
Run a query in bigquery and download results.
Parameters:
- client (bigquery.Client()) – bigquery client
- query (str) – query to run (may be a format string with named params)
- job_config (bigquery.QueryJobConfig()) – bigquery client job config
- group_name (str or None) – add a new column ‘group_name’ filled with this value
- return_dataframe (bool) – if True, data is returned as a pd.DataFrame; otherwise as a list
- return_only_query (bool) – return only query string without running
- hide_progress_bar (bool) – hide tqdm progress bar
- progress_bar_min_interval (int) – min interval of tqdm progress bar in seconds
- **params – options to pass to the query.format function
Returns: list or pd.DataFrame
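A sketch assuming the query is a Python format string whose placeholders are filled through **params; the project, dataset and table names are placeholders:

    from retentioneering.utils.bq_download import run_query

    # client, job_config: as created for download_events above
    query = """
    SELECT user_pseudo_id, event_name, event_timestamp
    FROM {project}.{dataset}.app_events
    LIMIT 1000
    """
    df = run_query(client, query, job_config=job_config, group_name='control',
                   project='my-project', dataset='analytics')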
retentioneering.utils.export module¶
retentioneering.utils.export.export_tracks(df, settings, users='all', task='lost', order='all', treshold=0.5, start_event=None, end_event=None)[source]¶
Visualize trajectories from an event clickstream (with Mathematica).
Parameters:
- df (pd.DataFrame) – event clickstream
- settings (dict) – experiment config (can be an empty dict here)
- users (str or list) – 'all' or a list of user ids to plot a specific group
- task (str) – type of task for different visualizations (can be lost or prunned_welcome)
- order (int) – depth in sessions for filtering
- threshold (float) – threshold for session splitting
- start_event (str) – name of the start event in a trajectory
- end_event (str) – name of the last event in a trajectory
Returns: None
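A minimal call sketch, assuming the Mathematica-based export environment is available; note the keyword spelling treshold follows the signature above, and an empty settings dict is allowed per the docstring:

    from retentioneering.utils.export import export_tracks

    # df: event clickstream DataFrame
    export_tracks(df, settings={}, users='all', task='lost', treshold=0.5)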
retentioneering.utils.preparing module¶
class retentioneering.utils.preparing.SessionSplitter(n_components)[source]¶
Bases: object
Class for session splitting processing.
add_session_column(df, thr, sort)[source]¶
Creates a column with the session rank.
Parameters:
- df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
- thr (float) – time threshold from previous step for session interruption
- sort (bool) – whether to sort by User ID & Event Timestamp
Returns: input data with a session column
Return type: pd.DataFrame
add_time_from_prev_event(df, unit=None, delta_unit='s')[source]¶
Adds a time-from-previous-event column.
Parameters:
- df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
- unit (str) – type of string for pd.datetime parsing
- delta_unit (str) – step in timestamp column (e.g. seconds from 01-01-1970)
Returns: input data with a from_prev_event column
Return type: pd.DataFrame
fit(df, columns_config, unit=None, delta_unit='s')[source]¶
Fits the Gaussian mixture model used to find the session-splitting threshold.
Parameters:
- df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
- columns_config (dict) – dictionary that maps the required column names: {‘event_name_col’: Event Name Column, ‘event_timestamp_col’: Event Timestamp Column, ‘user_id_col’: User ID Column}
- unit (str) – type of string for pd.datetime parsing
- delta_unit (str) – step in timestamp column (e.g. seconds from 01-01-1970)
Returns: None
get_threshold(thr=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]))[source]¶
Finds the best threshold.
Parameters:
- thr (float in interval (0, 1)) – Probability threshold for session interruption
- thrs (List[float] or iterable) – Timedelta values for threshold checking
Returns: value of threshold
Return type: float
predict(df, thr_prob=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]), sort=True)[source]¶
Predicts sessions for the passed DataFrame.
Parameters:
- df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
- thr_prob (float) – Probability threshold for session interruption
- thrs (List[float] or iterable) – Timedelta values for threshold checking
- sort (bool) – whether to sort by User ID & Event Timestamp
Returns: the passed DataFrame augmented with a session column
Return type: pd.DataFrame
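A sketch of the fit/predict workflow; the column names in columns_config and the number of mixture components are assumptions about your clickstream, not fixed by the class:

    from retentioneering.utils.preparing import SessionSplitter

    # df: clickstream DataFrame with the three columns named in columns_config
    columns_config = {
        'event_name_col': 'event_name',
        'event_timestamp_col': 'event_timestamp',
        'user_id_col': 'user_pseudo_id',
    }

    splitter = SessionSplitter(n_components=2)
    splitter.fit(df, columns_config, delta_unit='s')
    thr = splitter.get_threshold(thr=0.95)                 # inferred timedelta threshold
    df_with_sessions = splitter.predict(df, thr_prob=0.95, sort=True)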
retentioneering.utils.preparing.add_first_and_last_events(df, first_event_name='fisrt_event', last_event_name='last_event')[source]¶
For every user and session, adds a first event named <first_event_name> and a last event named <last_event_name>.
Parameters:
- df (pd.DataFrame) – input DataFrame
- first_event_name (str) – first event name
- last_event_name (str) – last event name
Returns: pd.DataFrame
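A sketch; the event names passed here are illustrative overrides of the defaults shown above:

    from retentioneering.utils.preparing import add_first_and_last_events

    # df: clickstream with user, session and event columns
    df = add_first_and_last_events(df, first_event_name='session_start',
                                   last_event_name='session_end')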
retentioneering.utils.preparing.add_lost_events(df, positive_event_name='passed', negative_event_name='lost', settings=None)[source]¶
Parameters:
- df (pd.DataFrame) – input pd.DataFrame
- positive_event_name (str) – positive event name
- negative_event_name (str) – negative event name which should be added if there is no positive event in the session
- settings (dict) – config dict
Returns: pd.DataFrame
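A sketch, assuming settings is the same config dict used by the other preparing helpers:

    from retentioneering.utils.preparing import add_lost_events

    df = add_lost_events(df, positive_event_name='passed',
                         negative_event_name='lost', settings=settings)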
retentioneering.utils.preparing.add_passed_event(df, positive_event_name='passed', filters=None, settings=None)[source]¶
Parameters:
- df (pd.DataFrame) – input pd.DataFrame
- positive_event_name (str) – name of the positive event, added when the filter conditions are true
- filters (dict) – dict with filter conditions
- settings (dict) – config dict
Returns: pd.DataFrame
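A sketch; the structure of the filters dict is not documented in this section, so the mapping below (column name to required value) is only an assumption:

    from retentioneering.utils.preparing import add_passed_event

    # hypothetical filter: mark sessions containing a purchase event as 'passed'
    df = add_passed_event(df, positive_event_name='passed',
                          filters={'event_name': 'purchase'}, settings=settings)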
retentioneering.utils.preparing.drop_duplicated_events(df, duplicate_thr_time=0, settings=None)[source]¶
Deletes duplicated events (two events with the same event name if the time between them is less than duplicate_thr_time).
Parameters:
- df (pd.DataFrame) – input pd.DataFrame
- duplicate_thr_time (int) – threshold for time between events
- settings (dict) – config dict
Returns: pd.DataFrame
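A sketch; the threshold value is illustrative:

    from retentioneering.utils.preparing import drop_duplicated_events

    # drop repeated events of the same name occurring within duplicate_thr_time of each other
    df = drop_duplicated_events(df, duplicate_thr_time=1, settings=settings)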
retentioneering.utils.queries module¶
retentioneering.utils.utils module¶
class retentioneering.utils.utils.Config(filename, is_json=False)[source]¶
Bases: dict
Enrichment of the dict class with a saving option.
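A sketch assuming Config loads the given file into itself; only plain dict behaviour is used below, since the save mechanism is not documented in this section:

    from retentioneering.utils.utils import Config

    config = Config('settings.yaml')     # yaml by default; use is_json=True for json
    print(list(config.keys()))           # regular dict access works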
retentioneering.utils.utils.init_client(service_account_filepath, **kwargs)[source]¶
Returns the bigquery.Client().
Parameters:
- service_account_filepath (path) – path to the service account JSON file
- kwargs (keywords) – keyword arguments to pass to the bigquery.Client.from_service_account_json function
Returns: bigquery.Client()
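A sketch; the file path is a placeholder for your service account key:

    from retentioneering.utils.utils import init_client

    client = init_client('path/to/service_account.json')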
retentioneering.utils.utils.init_from_file(filename, is_json=False)[source]¶
Create bigquery.Client() and bigquery.QueryJobConfig() from a json or yaml file.
Parameters:
- filename (str) – path to the config file
- is_json (bool) – read file as json if true (read as yaml otherwise)
Returns: bigquery.Client() and bigquery.QueryJobConfig()
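A sketch assuming the function returns the two objects it creates (the Returns field above is inferred); the config file path and its contents are placeholders:

    from retentioneering.utils.utils import init_from_file

    client, job_config = init_from_file('bq_settings.yaml')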
retentioneering.utils.utils.init_job_config(project, destination_dataset, destination_table)[source]¶
Returns a bigquery.QueryJobConfig() with legacy SQL and a destination table, to allow large results.
Parameters:
- project (str) – project name where the destination table is
- destination_dataset (str) – dataset id where destination table is
- destination_table (str) – destination table id
Returns: bigquery.QueryJobConfig()
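A sketch; the project, dataset and table ids are placeholders for a scratch table that can hold large results:

    from retentioneering.utils.utils import init_job_config

    job_config = init_job_config(project='my-project',
                                 destination_dataset='scratch',
                                 destination_table='bq_download_buffer')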