Is it feasible to keep millions of keys in the state of a Spark Streaming job for two months?

I'm trying to solve a (simplified here) problem in Spark Streaming: let's have a log of events made by users, where each event is a tuple (user name, activity, time), e.g.:

("user1", "view", "2015-04-14T21:04Z")
("user1", "click", "2015-04-14T21:05Z")

Now I would like to gather the events by user to do some analysis on them. Let's say the output is an analysis of:

("user1", List(("view", "2015-04-14T21:04Z"), ("click", "2015-04-14T21:05Z")))

The events should be kept for 2 months. During that time there might be around 500 million such events, and millions of unique users, which are the keys here.
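To get a feel for the scale, here is a back-of-envelope sizing sketch. Only the 500 million events figure comes from the question; the user count and the per-event and per-key byte costs are assumptions chosen for illustration, so the result is an order of magnitude at best.

```python
# Rough sizing of keeping every event in streaming state.
# All per-event and per-key byte costs are assumptions, not measurements.

EVENTS = 500_000_000      # from the question: ~500 million events in 2 months
USERS = 5_000_000         # assumed value for "millions of unique users"
BYTES_PER_EVENT = 100     # assumed: activity string + timestamp + object overhead
BYTES_PER_KEY = 100       # assumed: user name + per-key container overhead


def state_size_gib(events, users, bytes_per_event, bytes_per_key):
    """Return the estimated in-memory state size in GiB."""
    total_bytes = events * bytes_per_event + users * bytes_per_key
    return total_bytes / 2**30


estimate = state_size_gib(EVENTS, USERS, BYTES_PER_EVENT, BYTES_PER_KEY)
print(f"estimated state size: {estimate:.1f} GiB")
```

Under these assumptions the state is in the tens of GiB, before any serialization or checkpointing overhead is counted.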

My questions are:

  • Is it feasible to do such a thing with updateStateByKey on a DStream, when I have millions of keys stored?
  • Am I right that DStream.window is of no use here, when I have a window 2 months long and would like a slide of a few seconds?
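For reference, this is roughly what the per-key update function in the first question could look like. It is a minimal sketch in plain Python, not a tested Spark job: `update_events`, `parse_time`, and the 60-day `RETENTION` are hypothetical names and values, and the optional `now` argument exists only so the function can be exercised without Spark. It follows the (new values, previous state) shape that PySpark's `updateStateByKey` passes to its update function, where returning `None` drops the key.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=60)  # assumed reading of "2 months"


def parse_time(ts):
    """Parse timestamps shaped like '2015-04-14T21:04Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)


def update_events(new_events, old_events, now=None):
    """Merge one batch of (activity, time) pairs into a user's state.

    new_events: pairs that arrived for this user in the current batch.
    old_events: the previous state for this user, or None for a new user.
    Returns the pruned list, or None so that the key is removed from state.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - RETENTION
    merged = (old_events or []) + list(new_events)
    # Drop everything older than the retention period.
    kept = [(activity, ts) for activity, ts in merged if parse_time(ts) >= cutoff]
    return kept or None
```

With a DStream of `(user, (activity, time))` pairs this would be wired in as `events.updateStateByKey(update_events)`, with checkpointing enabled on the streaming context. Note that the pruning only happens when the function is invoked for a key, which is exactly the per-slide cost the question worries about.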

P.S. I found out that updateStateByKey is called on all the keys on every slide, which means it will be called millions of times every few seconds. That makes me doubt this design, and I'm rather thinking about alternative solutions like:

  • using Cassandra for state
  • using Trident state (with Cassandra probably)
  • using Samza with its state management.

I think it depends on how you will query the data in the future. I have had similar scenarios. I just made the transformation through mapPartitions and reduceByKey, and stored the data in Cassandra.
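The shape of that approach can be sketched without Spark or Cassandra. In the sketch below, `InMemoryEventStore` is a hypothetical stand-in for a Cassandra table keyed by user, and `group_batch` plays the role that reduceByKey would play inside one micro-batch; none of these names come from the answer.

```python
from collections import defaultdict


class InMemoryEventStore:
    """Hypothetical stand-in for a Cassandra table keyed by user."""

    def __init__(self):
        self.rows = defaultdict(list)

    def append(self, user, events):
        # A real store would issue one write per user here.
        self.rows[user].extend(events)


def group_batch(events):
    """Group one micro-batch of (user, activity, time) tuples by user."""
    grouped = defaultdict(list)
    for user, activity, ts in events:
        grouped[user].append((activity, ts))
    return grouped


def write_batch(store, events):
    """Append a batch to the store, touching only users seen in this batch."""
    for user, user_events in group_batch(events).items():
        store.append(user, user_events)


store = InMemoryEventStore()
write_batch(store, [
    ("user1", "view", "2015-04-14T21:04Z"),
    ("user1", "click", "2015-04-14T21:05Z"),
    ("user2", "view", "2015-04-14T21:06Z"),
])
```

The point of the design is that each batch only does work for the users that actually produced events, instead of visiting every key on every slide. The two-month retention would then be handled by the store, for example with an expiry on the written rows, rather than by the streaming job.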

