Is it feasible to keep millions of keys in the state of a Spark Streaming job for two months?


I'm trying to solve a (simplified here) problem in Spark Streaming: suppose we have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14t21:04z") ("user1", "click", "2015-04-14t21:05z") 

Now I want to gather the events per user and run some analysis on them. Let's say the output of the analysis is:

("user1", list(("view", "2015-04-14t21:04z"),("click", "2015-04-14t21:05z")) 

The events should be kept for 2 months. During that time there might be around 500 million such events, and millions of unique users, which are the keys here.

My questions are:

  • Is it feasible to do such a thing with updateStateByKey on a DStream when I have millions of keys stored? (See the sketch after this list.)
  • Am I right that DStream.window is of no use here, when the window length is 2 months and the slide is only a few seconds?
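For concreteness, here is a minimal sketch of what the updateStateByKey design would look like; the socket source, checkpoint path, and 5-second batch interval are placeholders, not part of the question:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UserEventState {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("user-event-state")
        val ssc = new StreamingContext(conf, Seconds(5))
        // updateStateByKey requires checkpointing; the path is a placeholder.
        ssc.checkpoint("hdfs:///tmp/user-event-checkpoints")

        // Placeholder source: CSV lines "user,activity,time" on a socket.
        val events = ssc.socketTextStream("localhost", 9999)
          .map(_.split(","))
          .map(fields => (fields(0), (fields(1), fields(2))))

        // Append this batch's events to the accumulated per-user list.
        // Note: the update function runs for every known key on every batch,
        // which is exactly the scaling concern raised below.
        val userHistory = events.updateStateByKey[List[(String, String)]] {
          (newEvents: Seq[(String, String)], state: Option[List[(String, String)]]) =>
            Some(state.getOrElse(Nil) ++ newEvents)
        }

        userHistory.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }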

P.S. I found out that updateStateByKey is called on all keys on every slide, which means it would be called millions of times every few seconds. This makes me doubt the design, and I'm rather considering alternative solutions such as:

  • using Cassandra for the state
  • using Trident state (probably with Cassandra)
  • using Samza with its state management.

I think it depends on how you will need to query the data in the future. I have had similar scenarios: I did the transformation through mapPartitions and reduceByKey, and stored the data in Cassandra.
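A rough sketch of that pattern, assuming the DataStax spark-cassandra-connector and a hypothetical analytics.user_events table partitioned by user (the names and schema are illustrative, not from the original answer):

    import com.datastax.spark.connector._            // DataStax spark-cassandra-connector
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical input: a DStream of raw CSV lines "user,activity,time".
    def persistEvents(lines: DStream[String]): Unit = {
      lines
        // mapPartitions: parse each partition's lines in one pass.
        .mapPartitions(_.map { line =>
          val Array(user, activity, time) = line.split(",")
          ((user, activity, time), 1)
        })
        // reduceByKey: collapse duplicate events within the batch.
        .reduceByKey(_ + _)
        .map { case ((user, activity, time), _) => (user, activity, time) }
        .foreachRDD { rdd =>
          // One row per event; reading all events of a user back is then a
          // single-partition query if "user" is the partition key.
          rdd.saveToCassandra("analytics", "user_events",
            SomeColumns("user", "activity", "time"))
        }
    }

With this layout the long-lived state lives in Cassandra rather than in the streaming job, so the job itself only has to handle each batch's data.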

