Making Ruby faster for memory-intensive programs

I'm running a pretty memory-intensive program in Ruby which is fast initially and then slows down as memory utilization increases.

The program has two phases: (1) build a large hash: string => list in memory, (2) do some computation on the hash. The slowdown occurs in phase 1.

Why does this happen? Is it that there are more calls to the garbage collector? Or is ruby swapping memory out to disk?

In either case, is there any configuration I can do to speed things up? For instance, can I increase the heap size or the maximum amount of memory that Ruby is allowed to consume? I didn't see anything in the man page.
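One quick way to tell whether the garbage collector is to blame is MRI's built-in GC profiler (available since 1.9.2). A minimal sketch, where build_hash is a hypothetical stand-in for your phase-1 code:

GC::Profiler.enable

build_hash            # hypothetical: your phase-1 hash construction

GC::Profiler.report   # prints per-collection timings to $stdout
puts GC.count         # total number of collections so far

If GC dominates, MRI exposes a few tunables as environment variables rather than man-page options: on 1.9.x, for instance, RUBY_HEAP_MIN_SLOTS (a larger initial heap) and RUBY_GC_MALLOC_LIMIT (how much malloc'd memory triggers a collection); RUBY_HEAP_MIN_SLOTS became RUBY_GC_HEAP_INIT_SLOTS in 2.1+.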


I find Ruby to be really slow with large datasets too. The way I deal with it is to run Ruby 1.9.2 or higher, or even JRuby if possible. If that is not enough, when traversing extremely large datasets I usually fall back on the map-reduce paradigm, so that I only need to keep one row in memory at a time. For something akin to your hash-building problem, I'd have a Ruby program emit key/value pairs to $stdout, either for chaining or to divert the output to a file:

$ ruby build_csv.rb > items.csv
$ cat items.csv
foo,23
foo,17
bar,42
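What build_csv.rb looks like depends on your phase-1 logic; the point is that it prints each pair as soon as it is produced instead of accumulating a giant hash. A hypothetical sketch, where each_item stands in for whatever enumerates your source data:

# build_csv.rb -- hypothetical sketch of phase 1: emit pairs immediately
each_item do |key, value|
  puts "#{key},#{value}"   # one key,value per line; nothing accumulates in memory
end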

Then have a second program that reads the data structure back into a hash:

@hsh = Hash.new { |hash, key| hash[key] = [] }  # default: a fresh array per key

File.foreach("items.csv") do |line|
  k, v = line.chomp.split(",")  # chomp strips the trailing newline
  @hsh[k] << v.to_i
end

The previous program could of course be more robust if it used the CSV library, which handles quoting and escaping. Either way, it reads the file into @hsh, which ends up looking like this:

@hsh => {"foo"=>[23, 17], "bar"=>[42]} 

Splitting a problem into many small programs really makes a difference for speed, because less is kept in memory. If the operation on the hash only needs to work on a single key at a time, it's easy to write that part as a program that just reads until a new key is found, produces output for the last key, and then proceeds with the new key, in much the same fashion as the first round (see the sketch below). Keeping memory use down like this, by splitting the work and writing intermediate results to files, speeds up the process a lot. If you can partition your data, you can also run several of the stage-one / mapping jobs at once, either in the shell or with threads.
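A minimal sketch of that single-key pass, assuming the input has been sorted by key first (e.g. with the sort command) and that the per-key operation is just a sum:

# reduce.rb -- hypothetical: process one key at a time from sorted input
current_key, values = nil, []

ARGF.each_line do |line|
  k, v = line.chomp.split(",")
  if k != current_key
    puts "#{current_key},#{values.inject(0, :+)}" if current_key  # emit the finished key
    current_key, values = k, []
  end
  values << v.to_i
end
puts "#{current_key},#{values.inject(0, :+)}" if current_key  # flush the final key

Run it as, for example:

$ sort items.csv | ruby reduce.rb
bar,42
foo,40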
