Is it feasible to keep millions of keys in the state of a Spark Streaming job for two months?
I'm trying to solve a (simplified here) problem in Spark Streaming: suppose we have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z")
("user1", "click", "2015-04-14T21:05Z")

Now I want to gather the events per user and do some analysis of them. Let's say the output of that analysis is:

("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))

The events should be kept for 2 months. During that time there might be around 500 million such events and millions of unique users, which are the keys here.
My questions are:

- Is it feasible to use updateStateByKey on a DStream when I have millions of keys stored? (A sketch of what I mean follows these questions.)
- Am I right that DStream.window is of no use here, when the window length would be 2 months but the slide only a few seconds?
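For reference, a minimal sketch of what I mean by the updateStateByKey approach; the socket source, batch interval, and checkpoint path are just placeholders for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UserEventState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("user-event-state")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint") // required for stateful operations

    // Hypothetical source: lines like "user1,view,2015-04-14T21:04Z"
    val events = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(fields => (fields(0), (fields(1), fields(2))))

    // Called for every key on every batch, which is exactly the concern
    // when there are millions of keys.
    def updateHistory(newEvents: Seq[(String, String)],
                      state: Option[List[(String, String)]]): Option[List[(String, String)]] = {
      // Append this batch's events to the stored history.
      // (A real job would also need to drop events older than 2 months.)
      Some(state.getOrElse(List.empty[(String, String)]) ++ newEvents)
    }

    val userHistories = events.updateStateByKey(updateHistory _)
    userHistories.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```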
P.S. I found out that updateStateByKey is called on every key on every slide, which means it would be called millions of times every few seconds. That makes me doubt the design, and I'm rather thinking about alternative solutions like:
- using Cassandra for the state
- using Trident state (probably with Cassandra)
- using Samza with its state management.
I think it depends on how you will query the data in the future. I have had similar scenarios: I did the transformation through mapPartitions and reduceByKey, and stored the data in Cassandra.
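A rough sketch of that approach, assuming the DataStax spark-cassandra-connector is on the classpath and using a made-up events_ks.user_events table (not my exact code):

```scala
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector._

// events: DStream[(user, (activity, time))], as in the question.
def storePerBatch(events: DStream[(String, (String, String))]): Unit = {
  events.foreachRDD { (rdd, batchTime: Time) =>
    rdd
      // per-partition work, e.g. formatting each event as text
      .mapPartitions(_.map { case (user, (activity, time)) =>
        (user, s"$activity@$time")
      })
      // collapse this batch's events per user into a single value
      .reduceByKey(_ + "," + _)
      // one row per (user, batch); the two-month history then lives in
      // Cassandra instead of Spark Streaming state
      .map { case (user, batchEvents) => (user, batchTime.milliseconds, batchEvents) }
      .saveToCassandra("events_ks", "user_events",
        SomeColumns("user", "batch_ts", "events"))
  }
}
```

The per-user history is then reconstructed by querying Cassandra by user key, so Spark only ever holds one batch of events in memory.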