hadoop - How to extract keywords from lots of documents?
I have many documents, on the order of tens of thousands (maybe more). I'd like to extract keywords from each document, let's say 5 keywords per document, using Hadoop. Each document may talk about a unique topic. My current approach is to use Latent Dirichlet Allocation (LDA) as implemented in Mahout. However, since each document talks about a different topic, the number of extracted topics should equal the number of documents, which is large. Since LDA becomes inefficient when the number of topics becomes large, my approach is to randomly group the documents into small groups of 100 documents each, and then use Mahout LDA to extract 100 topics from each group. This approach works, but it may not be efficient, because each time I run Hadoop on only a small set of documents. Does anyone have a better (more efficient) idea for this?
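To make the grouping step concrete, here is a minimal Python sketch of the random partitioning described above. It covers only the split into groups of 100; the per-group Mahout LDA run is not shown. The function name `random_groups`, the `doc-00000` style IDs, and the fixed seed are assumptions made for illustration, not part of the original setup.

```python
import random


def random_groups(doc_ids, group_size=100, seed=0):
    """Shuffle document IDs and split them into groups of at most group_size.

    Hypothetical helper: the name, signature, and seed are illustrative only.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    shuffled = list(doc_ids)
    # A seeded generator keeps the grouping reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Slice the shuffled list into consecutive chunks; the last chunk may be smaller.
    return [shuffled[i:i + group_size] for i in range(0, len(shuffled), group_size)]


# Placeholder IDs standing in for the real documents (assumed, not from the post).
doc_ids = [f"doc-{i:05d}" for i in range(10_000)]
groups = random_groups(doc_ids)
print(len(groups), len(groups[0]))  # prints: 100 100
```

With 10,000 documents this yields 100 groups, so the per-group LDA job would be launched 100 times, which is where the overhead mentioned in the question comes from.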