java - Multithreaded string processing blows up with #threads
I'm working on a multithreaded project where we have to parse a text file into a magic object, do some processing on the object, and aggregate the output. The old version of the code parsed the text in one thread and did the object processing in a thread pool using Java's ExecutorService. We weren't getting the performance boost we wanted, and it turned out that parsing takes longer than we thought relative to the processing time for each object, so I tried moving the parsing into the worker threads.
This should have worked, but what actually happens is that the time-per-object blows up as a function of the number of threads in the pool. It's worse than linear, but not quite as bad as exponential.
I've whittled it down to a small example that (on my machine, anyhow) shows the behavior. The example doesn't even create the magic object; it's just doing string manipulation. There are no inter-thread dependencies that I can see; I know split() isn't terribly efficient, but I can't imagine why it would s**t the bed in a multithreaded context. Have I missed something?
I'm running Java 7 on a 24-core machine. The lines are long, ~1MB each. There can be dozens of items in features, and 100k+ items in edges.
Sample input:
1 1 156 24 230 1350 id(foo):id(bar):w(house,pos):w(house,neg) 1->2:1@1.0 16->121:2@1.0,3@0.5
Sample command line for running 16 worker threads:
$ java -Xmx10g Foo 16 myfile.txt
Example code:

import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Foo implements Runnable {
    String line;
    int id;

    public Foo(String line, int id) {
        this.line = line;
        this.id = id;
    }

    public void run() {
        System.out.println(System.currentTimeMillis() + " job start " + this.id);
        // Line format: tab delimited
        //   x[4]
        //   graph[2]
        //   features[m]  <-- ':' delimited
        //   edges[n]
        String[] x = this.line.split("\t", 5);
        String[] graph = x[4].split("\t", 4);
        String[] features = graph[2].split(":");
        String[] edges = graph[3].split("\t");
        for (String e : edges) {
            String[] ee = e.split(":", 2);
            ee[0].split("->", 2);
            for (String f : ee[1].split(",")) {
                f.split("@", 2);
            }
        }
        System.out.println(System.currentTimeMillis() + " job done " + this.id);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.err.println("Reading " + args[1] + " in " + args[0] + " threads...");
        LineNumberReader reader = new LineNumberReader(new FileReader(args[1]));
        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (String line; (line = reader.readLine()) != null;) {
            pool.submit(new Foo(line, reader.getLineNumber()));
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}
Updates:
- Reading the whole file into memory first has no effect. To be more specific, I read the whole file, adding each line to an ArrayList<String>, then iterated over the list to create the jobs for the pool. Doesn't that make the substrings-eating-the-heap hypothesis unlikely?
- Compiling one copy of the delimiter pattern to be used by all the worker threads has no effect. :( (A sketch of both experiments follows this list.)
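Roughly, the two experiments combined looked like this (a minimal sketch; the class name FooInMemory is just for illustration, and the per-edge parsing is elided since it is unchanged from Foo above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

public class FooInMemory {
    // One precompiled delimiter pattern shared by every worker (Pattern is thread-safe).
    static final Pattern TAB = Pattern.compile("\t");

    public static void main(String[] args) throws IOException, InterruptedException {
        // Experiment 1: read the entire file up front so no I/O happens after submission.
        List<String> lines = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(args[1]));
        for (String line; (line = reader.readLine()) != null;) lines.add(line);
        reader.close();

        ExecutorService pool = Executors.newFixedThreadPool(Integer.parseInt(args[0]));
        for (final String line : lines) {
            pool.submit(new Runnable() {
                public void run() {
                    // Experiment 2: shared Pattern instead of line.split("\t", 5).
                    String[] x = TAB.split(line, 5);
                    // ... rest of the parsing exactly as in Foo.run() ...
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }
}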
Resolution:
I've converted the parsing code to use a custom splitting routine based on indexOf(), like so:

private String[] split(String string, char delim) {
    if (string.length() == 0) return new String[0];
    int nitems = 1;
    for (int i = 0; i < string.length(); i++) {
        if (string.charAt(i) == delim) nitems++;
    }
    String[] items = new String[nitems];
    int last = 0;
    for (int next = last, i = 0; i < items.length && next != -1; last = next + 1, i++) {
        next = string.indexOf(delim, last);
        items[i] = next < 0 ? string.substring(last) : string.substring(last, next);
    }
    return items;
}

Oddly enough this does not blow up as the number of threads increases, and I have no idea why. It's a functional workaround though, so I'll live with it...
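With that routine in place, the body of run() ends up looking roughly like this (a minimal sketch; note the routine takes a single char, so the two-character "->" delimiter still has to be handled separately):

// Inside Foo.run(), using the custom split(String, char) above.
String[] x = split(this.line, '\t');
String[] graph = split(x[4], '\t');
String[] features = split(graph[2], ':');
for (String e : split(graph[3], '\t')) {
    String[] ee = split(e, ':');
    // ee[0] is e.g. "1->2"; "->" is two characters and needs separate handling
    for (String f : split(ee[1], ',')) {
        split(f, '@');
    }
}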
In Java 7 (prior to update 6), String.split() uses String.substring() internally, which for "optimization" reasons does not create real new Strings, but empty String shells that point to sub-sections of the original one.
So when you split() a String into small pieces, the original one (maybe huge) is still in memory and may end up eating your heap. Since you parse big files, this might be a risk for you (this was changed in JDK 7u6, where substring() now makes a real copy).
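To illustrate with a minimal sketch (hugeLine() is just a stand-in for reading one of your ~1MB lines):

// Pre-7u6 semantics: substring() returns a String that reuses the parent's
// char[] with a different offset and length -- no characters are copied.
String line = hugeLine();             // hypothetical ~1MB line
String token = line.substring(0, 5);  // 5 visible chars, but it pins the whole 1MB char[]
String detached = new String(token);  // the copy constructor trims to a real 5-char array
// Dropping 'line' and 'token' now lets the 1MB array be garbage collected.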
Given that your format is well-known, I would recommend parsing each line "by hand" rather than using String.split() (regexes are bad for performance anyway), and creating real new Strings for the sub-parts.
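Something along these lines (a minimal sketch, not tested against your format; splitDetached is just an illustrative name):

import java.util.ArrayList;
import java.util.List;

class LineParser {
    // Walk the line with indexOf() and materialize each piece as a real,
    // detached String: new String(...) trims the backing array.
    static List<String> splitDetached(String s, char delim) {
        List<String> out = new ArrayList<String>();
        int start = 0;
        for (int next; (next = s.indexOf(delim, start)) != -1; start = next + 1) {
            out.add(new String(s.substring(start, next)));
        }
        out.add(new String(s.substring(start)));
        return out;
    }
}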