retentioneering.preparing package

Submodules

retentioneering.preparing.preparing module

class retentioneering.preparing.preparing.SessionSplitter(n_components)[source]

Bases: object

Class for splitting user events into sessions.

add_session_column(df, thr, sort)[source]

Creates columns with session rank.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • thr (float) – time threshold, measured from the previous event, above which a session is interrupted
  • sort (bool) – whether sorting by User ID and Event Timestamp is required
Returns:

input data with a session column

Return type:

pd.DataFrame
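The idea of ranking sessions by a time threshold can be sketched with plain pandas. This is not the library's implementation: the function name `add_session_column_sketch`, the column names `user_id`, `event_name` and `event_timestamp`, and the choice to express `thr` in seconds are all assumptions made for illustration.

```python
import pandas as pd

def add_session_column_sketch(df, thr, sort=True):
    """Hypothetical stand-in: number sessions per user by a time-gap threshold."""
    if sort:
        df = df.sort_values(["user_id", "event_timestamp"])
    df = df.copy()
    # Seconds since the same user's previous event (NaN for a user's first event).
    gap = df.groupby("user_id")["event_timestamp"].diff().dt.total_seconds()
    # A new session starts whenever the gap exceeds the threshold.
    new_session = (gap > thr).astype(int)
    # Counting session starts cumulatively per user gives a 0-based session rank.
    df["session"] = new_session.groupby(df["user_id"]).cumsum()
    return df

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_name": ["open", "view", "open", "open", "buy"],
    "event_timestamp": pd.to_datetime([
        "2020-01-01 10:00:00", "2020-01-01 10:01:00", "2020-01-01 12:00:00",
        "2020-01-01 09:00:00", "2020-01-01 09:00:30",
    ]),
})
with_sessions = add_session_column_sketch(events, thr=1800)
```

With a 30-minute threshold, the third event of user 1 arrives almost two hours after the second, so it opens that user's second session.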

add_time_from_prev_event(df, unit=None, delta_unit='s')[source]

Adds time from previous event column.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • unit (str) – type of string for pd.datetime parsing
  • delta_unit (str) – step in timestamp column (e.g. seconds from 01-01-1970)
Returns:

input data with column from_prev_event

Return type:

pd.DataFrame
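A minimal sketch of what a from_prev_event column could contain, assuming timestamps stored as seconds since 01-01-1970 and hypothetical column names `user_id` and `event_timestamp`. Passing `unit="s"` to `pd.to_datetime` here is an assumption about how the `unit` parameter is used, not a statement about the library's internals.

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event_name": ["open", "view", "open", "buy"],
    # Epoch seconds (assumed representation of the timestamp column).
    "event_timestamp": [1577872800, 1577872860, 1577872800, 1577872815],
})

# Parse epoch seconds, then take the difference to each user's previous event.
parsed = pd.to_datetime(events["event_timestamp"], unit="s")
events["from_prev_event"] = (
    parsed.groupby(events["user_id"]).diff().dt.total_seconds()
)
```

The first event of every user has no predecessor, so its value is missing (NaN) in this sketch.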

fit(df, columns_config, unit=None, delta_unit='s')[source]

Fits the Gaussian mixture model used to determine the session threshold.

Parameters:
  • df – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • columns_config – Dictionary that maps the required roles to column names: {'event_name_col': Event Name Column, 'event_timestamp_col': Event Timestamp Column, 'user_id_col': User ID Column}
  • unit – type of string for pd.datetime parsing
  • delta_unit – step in timestamp column (e.g. seconds from 01-01-1970)
Returns:

None
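As an illustration of the columns_config mapping described above, the snippet below builds such a dictionary and checks that a DataFrame contains the referenced columns. The three keys come from the parameter description; the column names `event`, `ts`, `uid` and the validation step itself are hypothetical.

```python
import pandas as pd

# Keys follow the documented mapping; the values are made-up column names.
columns_config = {
    "event_name_col": "event",
    "event_timestamp_col": "ts",
    "user_id_col": "uid",
}

raw = pd.DataFrame({
    "uid": [1, 1],
    "event": ["open", "view"],
    "ts": [0, 45],
    "platform": ["ios", "ios"],  # extra column, not referenced by the config
})

# Resolve the configured names and check that the DataFrame contains them.
required = [
    columns_config[key]
    for key in ("user_id_col", "event_timestamp_col", "event_name_col")
]
missing = [column for column in required if column not in raw.columns]
```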

get_threshold(thr=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]))[source]

Finds best threshold.

Parameters:
  • thr (float in interval (0, 1)) – Probability threshold for session interruption
  • thrs (List[float] or iterable) – Timedelta values for threshold checking
Returns:

value of threshold

Return type:

float
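One plausible reading of a probability threshold over candidate time deltas is sketched below using only the standard library. It assumes a two-component Gaussian mixture over time gaps and returns the first candidate at which the posterior probability of the long-gap component reaches `thr`. The component parameters, the helper names `normal_pdf` and `find_threshold`, and this interpretation are assumptions, not the library's documented behaviour.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def find_threshold(thr, thrs, within, between):
    """Return the first candidate whose long-gap posterior reaches thr.

    within / between are (weight, mean, std) tuples of hypothetical
    mixture components for in-session and between-session gaps.
    """
    for candidate in thrs:
        p_within = within[0] * normal_pdf(candidate, within[1], within[2])
        p_between = between[0] * normal_pdf(candidate, between[1], between[2])
        posterior = p_between / (p_within + p_between)
        if posterior >= thr:
            return candidate
    return None

# Candidate grid matching the documented default: 1.0, 1.25, ..., 3000.0.
candidates = [1.0 + 0.25 * i for i in range(11997)]

threshold = find_threshold(
    thr=0.95,
    thrs=candidates,
    within=(0.7, 30.0, 20.0),      # made-up in-session component
    between=(0.3, 1800.0, 600.0),  # made-up between-session component
)
```

Raising `thr` moves the returned value further from the in-session component, which makes session breaks rarer.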

predict(df, thr_prob=0.95, thrs=array([1.00000e+00, 1.25000e+00, 1.50000e+00, ..., 2.99950e+03, 2.99975e+03, 3.00000e+03]), sort=True)[source]

Predicts sessions for passed DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • thr_prob (float) – Probability threshold for session interruption
  • thrs (List[float] or iterable) – Timedelta values for threshold checking
  • sort (bool) – whether sorting by User ID and Event Timestamp is required
Returns:

self

Return type:

pd.DataFrame

visualize(df, figsize=(15, 5), dpi=500, **kwargs)[source]

Visualizes the mixture of fitted distributions.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns corresponding to Event Name, Event Timestamp and User ID
  • figsize (tuple) – size of plot
  • dpi (int) – dots per inch used when saving the plot
Returns:

None

retentioneering.preparing.preparing.add_first_and_last_events(df, first_event_name='fisrt_event', last_event_name='last_event')[source]

For every user and session, adds a first event named first_event_name and a last event named last_event_name.

Parameters:
  • df (pd.DataFrame) – input DataFrame
  • first_event_name (str) – name of the first event
  • last_event_name (str) – name of the last event
Returns:

self

Return type:

pd.DataFrame
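The effect described for this function can be approximated as follows. This is a sketch under assumptions: the column names `user_id`, `session`, `event_name` and `event_timestamp` are hypothetical, and copying the timestamp of the neighbouring real event onto each synthetic event is an illustrative choice rather than confirmed library behaviour.

```python
import pandas as pd

def add_first_and_last_sketch(df, first_event_name, last_event_name):
    """Hypothetical stand-in: wrap every (user, session) group in marker events."""
    keys = ["user_id", "session"]
    df = df.sort_values(keys + ["event_timestamp"])
    pieces = []
    for _, group in df.groupby(keys, sort=False):
        # Copy the boundary rows and rename them to serve as marker events.
        first = group.iloc[[0]].assign(event_name=first_event_name)
        last = group.iloc[[-1]].assign(event_name=last_event_name)
        pieces.extend([first, group, last])
    return pd.concat(pieces, ignore_index=True)

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "session": [0, 0, 0],
    "event_name": ["open", "view", "open"],
    "event_timestamp": [0, 30, 5],
})
wrapped = add_first_and_last_sketch(events, "first_event", "last_event")
```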

retentioneering.preparing.preparing.add_lost_events(df, positive_event_name='passed', negative_event_name='lost', settings=None)[source]

Adds new events named negative_event_name to the input DataFrame.

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • positive_event_name (str) – positive event name
  • negative_event_name (str) – negative event name which should be added if there is no positive event in the session
  • settings (dict) – config dict
Returns:

self

Return type:

pd.DataFrame
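A hedged sketch of the rule stated for negative_event_name: a session without the positive event receives a closing negative event. The column names and the decision to reuse the last event's timestamp are assumptions, and the `settings` dict is not modelled here.

```python
import pandas as pd

def add_lost_sketch(df, positive_event_name="passed", negative_event_name="lost"):
    """Hypothetical stand-in: append a negative event to sessions without a positive one."""
    keys = ["user_id", "session"]
    df = df.sort_values(keys + ["event_timestamp"])
    pieces = []
    for _, group in df.groupby(keys, sort=False):
        pieces.append(group)
        if not (group["event_name"] == positive_event_name).any():
            # Reuse the last row of the session as a template for the new event.
            pieces.append(group.iloc[[-1]].assign(event_name=negative_event_name))
    return pd.concat(pieces, ignore_index=True)

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "session": [0, 0, 0, 0],
    "event_name": ["open", "passed", "open", "view"],
    "event_timestamp": [0, 10, 0, 20],
})
labelled = add_lost_sketch(events)
```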

retentioneering.preparing.preparing.add_passed_event(df, positive_event_name='passed', filters=None, settings=None)[source]

Adds new events named positive_event_name and deletes all events after them.

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • positive_event_name (str) – name of the positive event which should be added if the filter conditions are met
  • filters (dict) – dict with filter conditions
  • settings (dict) – config dict
Returns:

self

Return type:

pd.DataFrame
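The truncation behaviour can be illustrated roughly as below. The structure of the `filters` dict is not described in this reference, so the sketch swaps it for a hypothetical `trigger_event` argument: the first occurrence of that event per user counts as the filter condition being met. Grouping per user and the column names are likewise assumptions.

```python
import pandas as pd

def add_passed_sketch(df, trigger_event, positive_event_name="passed"):
    """Hypothetical stand-in: add a positive event after the trigger and drop the rest."""
    df = df.sort_values(["user_id", "event_timestamp"])
    pieces = []
    for _, group in df.groupby("user_id", sort=False):
        is_hit = (group["event_name"] == trigger_event).tolist()
        if True not in is_hit:
            # No trigger for this user: keep the events unchanged.
            pieces.append(group)
            continue
        cut = is_hit.index(True)
        kept = group.iloc[: cut + 1]
        # Append the positive event right after the trigger; later events are dropped.
        positive = kept.iloc[[-1]].assign(event_name=positive_event_name)
        pieces.extend([kept, positive])
    return pd.concat(pieces, ignore_index=True)

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event_name": ["open", "purchase", "view", "open"],
    "event_timestamp": [0, 10, 20, 0],
})
truncated = add_passed_sketch(events, trigger_event="purchase")
```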

retentioneering.preparing.preparing.drop_duplicated_events(df, duplicate_thr_time=0, settings=None)[source]

Deletes duplicated events (two events with the same event name if the time between them is less than duplicate_thr_time).

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • duplicate_thr_time (int) – threshold for time between events
  • settings (dict) – config dict
Returns:

self

Return type:

pd.DataFrame
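The duplicate rule above could be expressed like this. It is a sketch only: the column names and numeric timestamps in seconds are assumptions, each event is compared with the row directly before it, and the strict "less than" comparison follows the description literally, so a threshold of 0 removes nothing in this version.

```python
import pandas as pd

def drop_duplicates_sketch(df, duplicate_thr_time=0):
    """Hypothetical stand-in: drop repeats of the same event within the threshold."""
    df = df.sort_values(["user_id", "event_timestamp"])
    # Compare every row with the row directly above it.
    same_user = df["user_id"].eq(df["user_id"].shift())
    same_name = df["event_name"].eq(df["event_name"].shift())
    gap = df["event_timestamp"].diff()
    is_duplicate = same_user & same_name & (gap < duplicate_thr_time)
    return df[~is_duplicate].reset_index(drop=True)

events = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2],
    "event_name": ["view", "view", "view", "open", "view"],
    "event_timestamp": [0, 2, 10, 11, 12],
})
deduplicated = drop_duplicates_sketch(events, duplicate_thr_time=5)
```

Only the second `view` of user 1 is dropped: it repeats the previous event name within 5 seconds, while the third `view` follows after 8 seconds and the `view` of user 2 belongs to a different user.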

retentioneering.preparing.preparing.filter_events(df, filters=[], settings=None)[source]

Apply filters to the input table.

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • filters (list) – list of filter dicts
  • settings (dict) – config dict
Returns:

self

Return type:

pd.DataFrame

retentioneering.preparing.preparing.filter_users(df, filters=[], settings=None)[source]

Applies filters to users from the input table and keeps all events of the users that match.

Parameters:
  • df (pd.DataFrame) – input pd.DataFrame
  • filters (list) – list of filter dicts
  • settings (dict) – config dict
Returns:

pd.DataFrame
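The difference between filtering events and filtering users can be seen in a small pandas sketch. Because the filter dict format is not specified in this reference, the example uses a hypothetical `event_name` argument as the only condition, and the column names are assumed.

```python
import pandas as pd

def filter_users_sketch(df, event_name):
    """Hypothetical stand-in: keep every event of users who have the given event."""
    # Step 1: find the users with at least one matching event.
    matching_users = df.loc[df["event_name"] == event_name, "user_id"].unique()
    # Step 2: keep all events of those users, not only the matching ones.
    return df[df["user_id"].isin(matching_users)]

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "event_name": ["open", "buy", "open", "buy"],
    "event_timestamp": [0, 10, 0, 5],
})
buyers = filter_users_sketch(events, event_name="buy")
```

User 1's `open` event survives because that user also has a `buy` event, whereas a plain event filter would have kept only the two `buy` rows.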

Module contents