nlp - Use Python to convert files of word counts to a sparse matrix


I have a series of files, each one containing counts of words. Each file can contain different words. Here's an example:

filea:

word1,20
word2,10
word3,2

fileb:

word1,10
word4,50
word3,5

There are 20k files, each with tens of thousands of words.

I want to build a sparse matrix where each row represents a file's word distribution, like what you'd get out of scikit-learn's CountVectorizer.

If word1, word2, word3, and word4 are the columns, and filea and fileb are the rows, I'd expect to get:

[[20,10,2,0],[10,0,5,50]] 

How do I achieve that? If possible, I'd also like to be able to include only words that appear in at least n files.

You can use two dictionaries: one mapping each word to the number of files it appears in, and one mapping each file name to the word counts in that file.

files = ["file1", "file2"] all_words = collections.defaultdict(int) all_files = collections.defaultdict(dict)  filename in files:     open(filename) f:         line in f:             word, count = line.split(",")             all_files[filename][word] = int(count)             all_words[word] += 1 

Then you can use a nested list comprehension to build the matrix. Note that this produces a dense list of lists rather than a true sparse matrix:

>>> [[all_files[f].get(w, 0) for w in sorted(all_words)] for f in files]
[[20, 10, 2, 0], [10, 0, 5, 50]]
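With 20k files and tens of thousands of distinct words, that dense list of lists may not fit comfortably in memory. Here's a minimal sketch of building a genuine scipy sparse matrix from the same dictionaries, assuming scipy is installed (vocab, rows, cols, and data are names I've introduced for illustration):

from scipy.sparse import csr_matrix

# Assign a column index to each word, in sorted order (matching the columns above).
vocab = {w: i for i, w in enumerate(sorted(all_words))}

# Collect (row, column, value) triplets; only nonzero entries are stored.
rows, cols, data = [], [], []
for r, f in enumerate(files):
    for word, count in all_files[f].items():
        rows.append(r)
        cols.append(vocab[word])
        data.append(count)

matrix = csr_matrix((data, (rows, cols)), shape=(len(files), len(vocab)))

matrix.toarray() reproduces the dense lists above, so you can check the result on the small example before running it on all 20k files.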

Or, filtering for words that appear in at least a minimum number of files (here, more than one):

>>> [[all_files[f].get(w, 0) for w in sorted(all_words) if all_words[w] > 1] for f in files]
[[20, 2], [10, 5]]
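Since the question mentions CountVectorizer: if scikit-learn is already a dependency, its DictVectorizer does essentially this job and returns a scipy sparse matrix directly. A sketch under that assumption (n and filtered are my names; the minimum-files filter is applied to the dictionaries first):

from sklearn.feature_extraction import DictVectorizer

# Keep only words that appear in at least n files.
n = 2
filtered = [{w: c for w, c in all_files[f].items() if all_words[w] >= n}
            for f in files]

vectorizer = DictVectorizer()                 # sparse output by default
matrix = vectorizer.fit_transform(filtered)   # one row per file

DictVectorizer sorts feature names by default, so the column order matches the sorted(all_words) order used above.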
