Download data

Firstly, you should export your clickstream data as csv or other table format (or you can download data directly from BigQuery).

Data should have at least three columns: user_id, event_timestamp and event_name.

Prepare data for analysis

First of all, load the data in python using pandas:

import pandas as pd
data = pd.read_csv('path_to_your_data.csv')

You also could read data from other sources such as .xlsx (pd.read_excel), sql (pd.read_sql) and etc. Please, check the pandas documentation for other options

Columns renaming and formatting

Analysis submodule needs proper names of columns:

  1. Column with user ID should be named as user_pseudo_id
  2. Name of event should be named as event_name
  3. Timestamp of event should be named as event_timestamp. Also, it is needed to convert it to the integer type (seconds from 1970-01-01).

Rename your columns with pandas:

data = data.rename({
    'your_user_id_column_name': 'user_pseudo_id',
    'your_event_name_column_name': 'event_name',
    'your_event_timestamp_name': 'event_timestamp'
}, axis=1)

Check the type of your timestamp column:

print("""
Event timestamp type: {}
Event timestamp example: {}
""".format(
    data.event_timestamp.dtype,
    data.event_timestamp.iloc[0]
))

Out:

We see that here column with the timestamp is a python object (string).

You can use the following functions to convert it into seconds:

# converts string to datetime
data.event_timestamp = pd.to_datetime(data.event_timestamp)

# converts datetime to integer
data.event_timestamp = data.event_timestamp.astype(int) / 1e6

Add target events

Most of our tools aim to estimate how different trajectories leads to different target events. So you should add such events as lost and passed.

For example, there is a list of events that correspond to the passed onboarding:

from retentioneering import preparing
event_filter = ['newFlight', 'feed', 'tabbar', 'myFlights']
data = preparing.add_passed_event(data, positive_event_name='passed', filter=event_filter)

And all users who were not passed over some time have lost event:

data = preparing.add_lost_event(data, existed_event='pass', time_thresh=5)

Export data

data.to_csv('prepared_data.csv', index=False)