Download data¶
First, export your clickstream data as CSV or another tabular format (or download it directly from BigQuery).
The data should have at least three columns: user_id, event_timestamp and event_name.
Prepare data for analysis¶
First, load the data into Python using pandas:
import pandas as pd
data = pd.read_csv('path_to_your_data.csv')
You can also read data from other sources, such as .xlsx (pd.read_excel) or SQL (pd.read_sql). Check the pandas documentation for other options.
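As a rough illustration of the SQL route, the sketch below builds a throwaway in-memory SQLite database and reads it back with pd.read_sql. The table name events, the sample rows and the variable name sql_data are made up for the example; with a real database you would pass your own connection and query.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(':memory:')
pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'event_timestamp': ['2019-01-01 10:00:00',
                        '2019-01-01 10:00:05',
                        '2019-01-01 11:30:00'],
    'event_name': ['welcome', 'feed', 'welcome'],
}).to_sql('events', conn, index=False)  # 'events' is a hypothetical table name

# Read the clickstream back with a plain SQL query.
sql_data = pd.read_sql(
    'SELECT user_id, event_timestamp, event_name FROM events', conn
)
conn.close()
```

The resulting DataFrame has the same three columns as a CSV export, so the remaining steps apply unchanged.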
Columns renaming and formatting¶
The analysis submodule expects specific column names:
- The column with the user ID should be named user_pseudo_id
- The column with the event name should be named event_name
- The column with the event timestamp should be named event_timestamp. It must also be converted to an integer type (seconds since 1970-01-01).
Rename your columns with pandas:
data = data.rename({
'your_user_id_column_name': 'user_pseudo_id',
'your_event_name_column_name': 'event_name',
'your_event_timestamp_name': 'event_timestamp'
}, axis=1)
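After renaming, it can help to confirm that all three expected columns are present before going further. The helper below is a minimal sketch: missing_columns and the toy source column names (uid, event, ts) are hypothetical and not part of retentioneering.

```python
import pandas as pd

REQUIRED_COLUMNS = ['user_pseudo_id', 'event_name', 'event_timestamp']


def missing_columns(df):
    """Return the required column names that are absent from df."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]


# Toy frame with made-up source column names.
example = pd.DataFrame({
    'uid': ['u1', 'u2'],
    'event': ['welcome', 'feed'],
    'ts': ['2019-01-01 10:00:00', '2019-01-01 11:30:00'],
})
before = missing_columns(example)  # all three names are missing here

example = example.rename({
    'uid': 'user_pseudo_id',
    'event': 'event_name',
    'ts': 'event_timestamp',
}, axis=1)
after = missing_columns(example)  # empty list once the rename succeeded
```

An empty list means the frame has the column names listed above.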
Check the type of your timestamp column:
print("""
Event timestamp type: {}
Event timestamp example: {}
""".format(
data.event_timestamp.dtype,
data.event_timestamp.iloc[0]
))
Out:
In this example the timestamp column has the object dtype, which means it holds Python strings.
You can use the following code to convert it into seconds:
# converts string to datetime
data.event_timestamp = pd.to_datetime(data.event_timestamp)
# converts datetime to integer seconds since 1970-01-01
data.event_timestamp = (data.event_timestamp - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
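Because it is easy to end up with milliseconds or nanoseconds instead of seconds, one way to sanity-check the unit is to convert the integers back with pd.to_datetime(..., unit='s') and compare them with the parsed datetimes. The sketch below uses two made-up timestamps and assumes timezone-naive input.

```python
import pandas as pd

# Two made-up, timezone-naive timestamps.
raw = pd.Series(['2019-01-01 00:00:00', '2019-01-01 00:00:05'])
parsed = pd.to_datetime(raw)

# Whole seconds elapsed since 1970-01-01.
seconds = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# Interpreting the integers as seconds should give back the same datetimes.
restored = pd.to_datetime(seconds, unit='s')
round_trip_ok = bool((restored == parsed).all())
```

If round_trip_ok is False, the integer column is most likely in a different unit than seconds.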
Add target events¶
Most of our tools estimate how different trajectories lead to different target events, so you should add target events such as lost and passed.
For example, here is a list of events that indicate a passed onboarding:
from retentioneering import preparing
event_filter = ['newFlight', 'feed', 'tabbar', 'myFlights']
data = preparing.add_passed_event(data, positive_event_name='passed', filter=event_filter)
All users who have not passed within some time get the lost event:
data = preparing.add_lost_event(data, existed_event='passed', time_thresh=5)
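To make the idea of target events concrete, here is a plain-pandas illustration that labels each user as passed or lost. It is only a conceptual sketch and not how retentioneering implements add_passed_event or add_lost_event: the toy events are invented, and the time threshold is ignored, so every user without a target event is simply treated as lost.

```python
import pandas as pd

# Toy clickstream with already prepared column names.
events = pd.DataFrame({
    'user_pseudo_id': ['u1', 'u1', 'u2', 'u2'],
    'event_name': ['welcome', 'feed', 'welcome', 'settings'],
    'event_timestamp': [100, 105, 200, 230],
})
event_filter = ['newFlight', 'feed', 'tabbar', 'myFlights']

# Users with at least one event from event_filter count as passed.
is_target = events['event_name'].isin(event_filter)
passed_users = set(events.loc[is_target, 'user_pseudo_id'])

# Simplifying assumption: everyone else is lost (no time threshold applied).
outcome = {
    user: ('passed' if user in passed_users else 'lost')
    for user in events['user_pseudo_id'].unique()
}
```

Here u1 triggered feed and is labelled passed, while u2 never reached any event from the list and is labelled lost.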
Export data¶
data.to_csv('prepared_data.csv', index=False)