java - Multithreaded string processing blows up with #threads
I'm working on a multithreaded project where we have to parse a text file into a magic object, do some processing on the object, and aggregate the output. The old version of the code parsed the text in one thread and did the object processing in a thread pool using Java's ExecutorService. We weren't getting the performance boost we wanted, and it turned out that parsing takes longer than we thought relative to the processing time for each object, so I tried moving the parsing into the worker threads.
This should have worked, but what actually happens is that the time-per-object blows up as a function of the number of threads in the pool. It's worse than linear, but not quite as bad as exponential.
I've whittled it down to a small example that (on my machine, anyhow) shows the behavior. The example doesn't even create the magic object; it's just doing string manipulation. There are no inter-thread dependencies that I can see; I know split() isn't terribly efficient, but I can't imagine why it would s**t the bed in a multithreaded context. Have I missed something?
I'm running Java 7 on a 24-core machine. The lines are long, ~1MB each. There can be dozens of items in features, and 100k+ items in edges.
Sample input:
1 1 156 24 230 1350 id(foo):id(bar):w(house,pos):w(house,neg) 1->2:1@1.0 16->121:2@1.0,3@0.5
Sample command line for running 16 worker threads:
$ java -Xmx10g Foo 16 myfile.txt
Example code:

import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Foo implements Runnable {
    String line;
    int id;

    public Foo(String line, int id) {
        this.line = line;
        this.id = id;
    }

    public void run() {
        System.out.println(System.currentTimeMillis() + " job start " + this.id);
        // Line format: tab delimited
        //   x[4]
        //   graph[2]
        //   features[m]  <-- ':' delimited
        //   edges[n]
        String[] x = this.line.split("\t", 5);
        String[] graph = x[4].split("\t", 4);
        String[] features = graph[2].split(":");
        String[] edges = graph[3].split("\t");
        for (String e : edges) {
            String[] ee = e.split(":", 2);
            ee[0].split("->", 2);
            for (String f : ee[1].split(",")) {
                f.split("@", 2);
            }
        }
        System.out.println(System.currentTimeMillis() + " job done " + this.id);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.err.println("Reading " + args[1] + " in " + args[0] + " threads...");
        LineNumberReader reader = new LineNumberReader(new FileReader(args[1]));
        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (String line; (line = reader.readLine()) != null;) {
            pool.submit(new Foo(line, reader.getLineNumber()));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}
Updates:
- Reading the whole file into memory first has no effect. To be more specific, I read the whole file, adding each line to an ArrayList<String>, then iterated over the list to create the jobs for the pool. Doesn't that make the substrings-eating-the-heap hypothesis unlikely?
- Compiling one copy of the delimiter pattern to be used by all the worker threads has no effect. :( (A sketch of both experiments follows this list.)
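Roughly, the two experiments combined looked like this (a minimal sketch; the class name FooInMemory is just for illustration, and the per-edge parsing is elided since it is unchanged from Foo above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

public class FooInMemory {
    // One precompiled delimiter pattern shared by every worker (Pattern is thread-safe).
    static final Pattern TAB = Pattern.compile("\t");

    public static void main(String[] args) throws IOException, InterruptedException {
        // Experiment 1: read the entire file up front so no I/O happens after submission.
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(args[1]));
        for (String line; (line = reader.readLine()) != null;) lines.add(line);
        reader.close();

        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (final String line : lines) {
            pool.submit(new Runnable() {
                public void run() {
                    // Experiment 2: shared Pattern instead of line.split("\t", 5).
                    String[] x = TAB.split(line, 5);
                    // ... rest of the parsing exactly as in Foo.run() ...
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}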
Resolution:
I've converted the parsing code to use a custom splitting routine based on indexOf(), like so:

private String[] split(String string, char delim) {
    if (string.length() == 0) return new String[0];
    int nitems = 1;
    for (int i = 0; i < string.length(); i++) {
        if (string.charAt(i) == delim) nitems++;
    }
    String[] items = new String[nitems];
    int last = 0;
    for (int next = last, i = 0; i < items.length && next != -1; last = next + 1, i++) {
        next = string.indexOf(delim, last);
        items[i] = next < 0 ? string.substring(last) : string.substring(last, next);
    }
    return items;
}

Oddly enough this does not blow up as the number of threads increases, and I have no idea why. It's a functional workaround though, so I'll live with it...
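With that routine in place, the body of run() ends up looking roughly like this (a minimal sketch; note the routine takes a single char, so the two-character "->" delimiter still has to be handled separately):

// Inside Foo.run(), using the custom split(String, char) above.
String[] x = split(this.line, '\t');
String[] graph = split(x[4], '\t');
String[] features = split(graph[2], ':');
for (String e : split(graph[3], '\t')) {
    String[] ee = split(e, ':');
    // ee[0] is e.g. "1->2"; "->" is two characters and needs separate handling
    for (String f : split(ee[1], ',')) {
        split(f, '@');
    }
}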
In Java 7 (prior to update 6), String.split() uses String.substring() internally, which for "optimization" reasons does not create real new Strings, but empty String shells that point to sub-sections of the original one.
So when you split() a String into small pieces, the original one (maybe huge) is still in memory and may end up eating your heap. Since you parse big files, this might be a risk for you (this was changed in JDK 7u6, where substring() now makes a real copy).
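To illustrate with a minimal sketch (hugeLine() is just a stand-in for reading one of your ~1MB lines):

// Pre-7u6 semantics: substring() returns a String that reuses the parent's
// char[] with a different offset and length -- no characters are copied.
String line = hugeLine();             // hypothetical ~1MB line
String token = line.substring(0, 5);  // 5 visible chars, but it pins the whole 1MB char[]
String detached = new String(token);  // the copy constructor trims to a real 5-char array
// Dropping 'line' and 'token' now lets the 1MB array be garbage collected.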
Given that your format is well-known, I would recommend parsing each line "by hand" rather than using String.split() (regexes are bad for performance anyway), and creating real new Strings for the sub-parts.
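Something along these lines (a minimal sketch, not tested against your format; splitDetached is just an illustrative name):

import java.util.ArrayList;
import java.util.List;

class LineParser {
    // Walk the line with indexOf() and materialize each piece as a real,
    // detached String: new String(...) trims the backing array.
    static List<String> splitDetached(String s, char delim) {
        List<String> out = new ArrayList<String>();
        int start = 0;
        for (int next; (next = s.indexOf(delim, start)) != -1; start = next + 1) {
            out.add(new String(s.substring(start, next)));
        }
        out.add(new String(s.substring(start)));
        return out;
    }
}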