nlp - use Python to convert files of word counts to a sparse matrix
I have a series of files, each one containing counts of words. Each file may contain different words. Here's an example:
filea:
word1,20
word2,10
word3,2

fileb:
word1,10
word4,50
word3,5
There are 20k files, each with tens of thousands of words.
I want to build a sparse matrix in which each row represents a file's word distribution, like what you'd get out of scikit-learn's CountVectorizer.
If word1, word2, word3 and word4 are the columns, and filea and fileb are the rows, I'd expect to get:
[[20,10,2,0],[10,0,5,50]]
How do I achieve that? If possible, I'd also like to be able to include only words that appear in at least n files.
You could use two dictionaries: one mapping each word to the number of files it appears in, and one mapping each file name to that file's word counts.
import collections

files = ["file1", "file2"]
all_words = collections.defaultdict(int)   # word -> number of files containing it
all_files = collections.defaultdict(dict)  # file name -> {word: count}
for filename in files:
    with open(filename) as f:
        for line in f:
            word, count = line.split(",")
            all_files[filename][word] = int(count)
            all_words[word] += 1
Then you can use those in a nested list comprehension to create the matrix:

>>> [[all_files[f].get(w, 0) for w in sorted(all_words)] for f in files]
[[20, 10, 2, 0], [10, 0, 5, 50]]
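Note that a list of lists like this is actually dense, which may not fit in memory at 20k files times tens of thousands of words each. A minimal sketch of building a real scipy.sparse matrix from the same dictionaries (assuming scipy is available; variable names follow the snippet above):

import scipy.sparse as sp

# Fix the column order once so every row uses the same word-to-column indexing.
vocab = {w: i for i, w in enumerate(sorted(all_words))}

rows, cols, data = [], [], []
for i, f in enumerate(files):
    for word, count in all_files[f].items():
        rows.append(i)
        cols.append(vocab[word])
        data.append(count)

# COO format is easy to build incrementally; CSR is better for row slicing and math.
matrix = sp.coo_matrix((data, (rows, cols)),
                       shape=(len(files), len(vocab))).tocsr()

For the two-file example, matrix.toarray() reproduces the dense [[20, 10, 2, 0], [10, 0, 5, 50]].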
Or, filtering for words that appear in more than one file (all_words[w] counts files, not total occurrences):

>>> [[all_files[f].get(w, 0) for w in sorted(all_words) if all_words[w] > 1] for f in files]
[[20, 2], [10, 5]]
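The same condition generalizes to "at least n files" and drops straight into the sparse construction above; a sketch with a hypothetical min_files parameter:

min_files = 2  # hypothetical threshold: keep words present in at least this many files

vocab = {w: i for i, w in enumerate(
    sorted(w for w in all_words if all_words[w] >= min_files))}

rows, cols, data = [], [], []
for i, f in enumerate(files):
    for word, count in all_files[f].items():
        if word in vocab:          # words below the threshold are skipped
            rows.append(i)
            cols.append(vocab[word])
            data.append(count)

matrix = sp.coo_matrix((data, (rows, cols)),
                       shape=(len(files), len(vocab))).tocsr()

If scikit-learn is already a dependency, sklearn.feature_extraction.DictVectorizer can also turn a list of per-file count dicts (like the values of all_files) into a sparse matrix in one fit_transform call, though it has no built-in n-files filter.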