Thanks to user dnlbrky, we now have a third way to sessionize log data for an arbitrary timeout period (see methods 1 and 2), this time using data.table from R along with magrittr for piping:
library(magrittr)
library(data.table)
## Download, unzip, and load data (first 10,000 lines):
single_col_timestamp <- url("http://randyzwitch.com/wp-content/uploads/2015/01/single_col_timestamp.csv.gz") %>%
gzcon %>%
readLines(n=10000L) %>%
textConnection %>%
read.csv %>%
setDT
## Convert to timestamp:
single_col_timestamp[, event_timestamp:=as.POSIXct(event_timestamp)]
## Order by uid and event_timestamp:
setkey(single_col_timestamp, uid, event_timestamp)
## Sessionize the data (a gap of more than 30 minutes between events starts a new session).
## Note: diff() on a POSIXct column returns a difftime whose units vary with the data,
## so convert explicitly to minutes rather than assuming seconds:
single_col_timestamp[, session_id := paste(uid, cumsum(c(0, as.numeric(diff(event_timestamp), units = "mins")) > 30), sep = "_"), by = uid]
## Examine the results:
#single_col_timestamp[uid %like% "a55bb9"]
single_col_timestamp[session_id %like% "fc895c3babd"]
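To see why the cumsum/diff trick works, here is a minimal sketch using hypothetical gap values (in minutes) rather than the real log data: each gap larger than 30 minutes flags a new session, and the running sum of those flags becomes the per-user session counter that gets pasted onto the uid.

```r
## Hypothetical gaps (in minutes) between one user's consecutive events;
## the leading 0 stands in for the first event, which has no prior gap:
gaps <- c(0, 10, 50, 5, 200)

## TRUE wherever a new session starts (gap > 30 minutes):
new_session <- gaps > 30

## Running count of session starts = session number for each event:
cumsum(new_session)
## 0 0 1 1 2 -- events 1-2 share session 0, events 3-4 share session 1,
## and event 5 begins session 2
```

Because `cumsum()` only ever increases at a session boundary, every event inherits the number of the most recent boundary before it, which is exactly the sessionization behavior we want.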
I agree with dnlbrky that this feels a bit more natural than the dplyr method for heavy SQL users like me, but I still find the SQL method the most elegant and the easiest to understand. That's the great thing about open-source software, though: pick whatever tool you want and accomplish the task using whichever method you prefer.