Download data¶
First, export your clickstream data as CSV or another tabular format (or download it directly from BigQuery).
The data should have at least three columns: user_id,
event_timestamp, and event_name.
Prepare data for analysis¶
First, load the data in Python using pandas:
import pandas as pd
data = pd.read_csv('path_to_your_data.csv')
You can also read data from other sources, such as .xlsx files
(pd.read_excel) or SQL databases (pd.read_sql). See the pandas
documentation for other options.
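As a minimal sketch of reading from SQL, the example below builds a throwaway in-memory SQLite database and loads it with pd.read_sql. The table name `events` and the sample rows are hypothetical and only stand in for your real event store.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real event store (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id TEXT, event_timestamp TEXT, event_name TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "2019-01-01 10:00:00", "welcome"),
        ("u1", "2019-01-01 10:00:05", "feed"),
        ("u2", "2019-01-01 11:30:00", "welcome"),
    ],
)
conn.commit()

# Load the query result into a DataFrame, one row per event.
data = pd.read_sql(
    "SELECT user_id, event_timestamp, event_name FROM events", conn
)
conn.close()
```

With a real database you would replace the connection and the query; the resulting DataFrame has the same shape as one loaded from CSV.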
Columns renaming and formatting¶
The analysis submodule requires specific column names:
- The user ID column should be named user_pseudo_id.
- The event name column should be named event_name.
- The event timestamp column should be named event_timestamp. It must also be converted to an integer type (seconds since 1970-01-01).
Rename your columns with pandas:
data = data.rename({
    'your_user_id_column_name': 'user_pseudo_id',
    'your_event_name_column_name': 'event_name',
    'your_event_timestamp_name': 'event_timestamp'
}, axis=1)
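After renaming, it can help to confirm that all three required columns are present before going further. The helper `missing_columns` and the sample column names (`uid`, `action`, `ts`) below are hypothetical and are not part of the library; this is only a sketch of such a check.

```python
import pandas as pd

REQUIRED_COLUMNS = ["user_pseudo_id", "event_name", "event_timestamp"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Hypothetical raw export with its own column names.
raw = pd.DataFrame({
    "uid": ["u1", "u2"],
    "action": ["welcome", "feed"],
    "ts": ["2019-01-01 10:00:00", "2019-01-01 10:00:05"],
})

renamed = raw.rename(columns={
    "uid": "user_pseudo_id",
    "action": "event_name",
    "ts": "event_timestamp",
})

print(missing_columns(raw))      # all three names are missing before renaming
print(missing_columns(renamed))  # nothing is missing after renaming
```

An empty list means the frame has the names the analysis submodule expects.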
Check the type of your timestamp column:
print("""
Event timestamp type: {}
Event timestamp example: {}
""".format(
    data.event_timestamp.dtype,
    data.event_timestamp.iloc[0]
))
Out:
Here the timestamp column is a Python object (string).
You can use the following functions to convert it into seconds:
# converts string to datetime
data.event_timestamp = pd.to_datetime(data.event_timestamp)
# converts datetime to integer seconds since 1970-01-01
data.event_timestamp = (data.event_timestamp - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
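The conversion is easy to check on a tiny frame with known answers. The sketch below uses two made-up timestamps and assumes they are timezone-naive; it computes whole seconds by dividing the offset from 1970-01-01 by one second, which does not depend on the resolution pandas uses internally.

```python
import pandas as pd

# Two made-up string timestamps with known epoch values.
data = pd.DataFrame({
    "event_timestamp": ["1970-01-01 00:00:10", "2019-01-01 00:00:00"],
})

# String -> datetime
parsed = pd.to_datetime(data["event_timestamp"])

# Datetime -> whole seconds since 1970-01-01 (integer)
epoch = pd.Timestamp("1970-01-01")
data["event_timestamp"] = (parsed - epoch) // pd.Timedelta(seconds=1)

print(data["event_timestamp"].tolist())  # [10, 1546300800]
```

If your timestamps carry a timezone, convert them to UTC and drop the timezone first, otherwise the subtraction raises an error.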
Add target events¶
Most of our tools estimate how different trajectories lead to
different target events, so you should add such events, for example
lost and passed.
For example, here is a list of events that correspond to passing onboarding:
from retentioneering import preparing
event_filter = ['newFlight', 'feed', 'tabbar', 'myFlights']
data = preparing.add_passed_event(data, positive_event_name='passed', filter=event_filter)
All users who have not passed within a certain time receive the lost event:
data = preparing.add_lost_event(data, existed_event='passed', time_thresh=5)
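To make the idea of target events concrete, here is a plain-pandas sketch of what labelling trajectories could look like. It is not the library's implementation: the helper `add_target_events` is hypothetical, it ignores the time threshold entirely, and it simply appends one synthetic event per user one second after that user's last event.

```python
import pandas as pd


def add_target_events(df, passed_events, passed_name="passed", lost_name="lost"):
    """Append one synthetic target event per user (illustrative, not library code)."""
    # Timestamp of each user's last event.
    last = df.groupby("user_pseudo_id")["event_timestamp"].max()
    # True for users who triggered at least one of the passed_events.
    has_passed = (
        df["event_name"].isin(passed_events)
        .groupby(df["user_pseudo_id"]).any()
        .reindex(last.index)
    )
    targets = pd.DataFrame({
        "user_pseudo_id": last.index,
        "event_name": has_passed.map({True: passed_name, False: lost_name}).values,
        "event_timestamp": last.values + 1,  # place the target right after the last event
    })
    out = pd.concat([df, targets], ignore_index=True)
    return out.sort_values(["user_pseudo_id", "event_timestamp"]).reset_index(drop=True)


# Made-up events: u1 reaches 'feed', u2 never reaches a passing event.
events = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2", "u2"],
    "event_name": ["welcome", "feed", "welcome", "settings"],
    "event_timestamp": [100, 105, 200, 230],
})

labelled = add_target_events(events, ["newFlight", "feed", "tabbar", "myFlights"])
print(labelled)
```

In this toy data u1's trajectory ends with passed and u2's ends with lost, which is the kind of labelling the analysis tools rely on.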
Export data¶
data.to_csv('prepared_data.csv', index=False)
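A quick round trip confirms that the exported file can be read back with the same columns. The sketch below writes to an in-memory buffer instead of a file and uses made-up rows; with index=False no extra index column appears in the output.

```python
import io

import pandas as pd

# Made-up prepared data with the three required columns.
data = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2"],
    "event_name": ["welcome", "passed", "lost"],
    "event_timestamp": [100, 106, 231],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
data.to_csv(buffer, index=False)

# Read it back and compare with the original frame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.equals(data))
```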