How to debug Ruby memory issues

An overview of how to deal with memory problems.

Here at RightScale we have recently noticed issues with some of our server array boxes running dangerously low on memory. This can be tracked using RightScale’s alerting as well as our instance monitoring page.

Currently we are using a restart script to periodically restart the daemons in order to recover; however, this should be considered a short-term solution. The problems we are facing fall into two groups: bloat and leaks.

Memory Bloat

Once Ruby has requested memory it does not return it to the operating system. This means that even though it may be able to garbage collect 100% of the used memory, if a long-running process requests a large portion of memory, the memory usage of the process will never go below that maximum. This is why it is important to ensure that your process only requests the memory it really needs. Again, this comes in two forms: start up and runtime.
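Before digging into those two forms, here is a small sketch illustrating the first point, that MRI holds on to memory even after the objects are collected. It assumes a Unix-like system where ps reports resident set size in kilobytes; the exact numbers will vary, but the resident size typically does not drop back down after the big allocation is collected.

# Sketch only: show that MRI keeps memory after garbage collection.
# Assumes a Unix-like system where `ps -o rss=` reports RSS in kilobytes.
def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i
end

puts "at boot:  #{rss_kb} KB"
big = Array.new(5_000_000) { "x" * 20 } # allocate a couple of hundred mb of small strings
puts "at peak:  #{rss_kb} KB"
big = nil
GC.start                                # the strings are now collectable
puts "after GC: #{rss_kb} KB"           # typically stays close to the peak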

Starting Up

When a Ruby process starts you should look to assign an appropriate amount of memory based on your application. I found an excellent article which describes some of the garbage collection tweaks that can be made, and I have tried to reword some of these explanations to help explain what they do.

RUBY_GC_HEAP_INIT_SLOTS

When starting up the process, Ruby assigns an initial heap size. Setting this to the number of slots used by the running state of your application (once you have required all of your environment, etc.) has several benefits. You remove the need to perform garbage collection during application start up, lowering CPU usage and start up time. It also ensures that memory growth factors do not cause unwanted bloat when the process has to request more memory during that start up period. There is also a fairly trivial process for finding this value, so I would recommend investigating it for any larger application.

# Require your app code here, e.g.
# require_relative 'environment.rb'
GC.start    # Run a full garbage collection
s = GC.stat # Take a snapshot of the GC statistics
# Take the current heap usage and apply the percentage of objects promoted to long term (old) memory
puts s[:heap_live_slot] * (100.0 * s[:old_object] / s[:heap_live_slot] / 100)

This will give you a reasonable sense of the kind of number you should be using. In our case this was 490130, which we simply rounded to 500000.
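As a rough sanity check (an assumption on my part rather than anything official), you can compare the number of garbage collections that happen during boot with and without the variable set; with a suitably large RUBY_GC_HEAP_INIT_SLOTS that number should drop noticeably.

# Rough sanity check, not an official contract: with RUBY_GC_HEAP_INIT_SLOTS
# set high enough, far fewer GC runs should happen while the app boots.
# require_relative 'environment.rb' # hypothetical app bootstrap, as above
puts "GC runs during start up: #{GC.count}"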

RUBY_GC_HEAP_FREE_SLOTS

Ruby uses a mark and sweep form of garbage collection. The mark phase is intrusive on the application because it must “stop the world” in order to mark the objects that are still in use (anything left unmarked is then swept). This value controls how many free slots the heap keeps available, and therefore how much allocation can happen before another of these mark operations is required. Very generally, a higher value will require more memory but will result in fewer garbage collection runs.
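To get a feel for a sensible value, you can inspect the free slot count after a full collection. This is only a rough guide; note the GC.stat key names below are the MRI 2.1 ones (newer Rubies pluralise them, e.g. :heap_free_slots).

# Rough guide only: see how many free slots the app is left with after a full GC.
# Key names are for MRI 2.1; newer Rubies use :heap_free_slots / :heap_live_slots.
GC.start
s = GC.stat
puts "free slots: #{s[:heap_free_slot]}"
puts "live slots: #{s[:heap_live_slot]}"
puts "GC runs so far: #{GC.count}"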

RUBY_GC_HEAP_GROWTH_FACTOR

This is the factor by which the heap can increase each time the memory limit of the process is exceeded. Having a large value allows the memory limit of the application to increase rapidly. For larger applications a high value here can potentially lead to bloat, as application start up will result in many memory limit increases. Assuming you have set a suitable value for RUBY_GC_HEAP_INIT_SLOTS, you should be able to lower this value to cater for runtime memory increases. If you are also limited in memory capacity, a lower factor is beneficial. The drawback is that you could potentially slow the system by making more requests for memory. The table below shows some examples, followed by a small sketch that reproduces the arithmetic.

Memory growth request | Factor 1.8 | Factor 1.25 | Factor 1
1                     | 18.000mb   | 12.500mb    | 10.000mb
2                     | 50.400mb   | 28.125mb    | 20.000mb
3                     | 141.120mb  | 63.280mb    | 40.000mb
4                     | 395.140mb  | 142.380mb   | 80.000mb
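For reference, the figures above can be reproduced (give or take rounding) with the small sketch below. It is purely an illustration of the compounding arithmetic behind the table, assuming a 10mb starting point where the first request scales the heap by the factor and each later request adds factor times the current size; it is not a model of MRI’s actual growth algorithm.

# Illustration of the table's compounding arithmetic only, not MRI's real algorithm.
# Assumes a 10mb starting heap; the first request scales it by the factor,
# each subsequent request adds factor * current size.
[1.8, 1.25, 1.0].each do |factor|
  size = 10.0 * factor
  puts "RUBY_GC_HEAP_GROWTH_FACTOR=#{factor}"
  1.upto(4) do |request|
    puts "  request #{request}: #{size.round(3)}mb"
    size += size * factor
  end
end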

RUBY_GC_HEAP_GROWTH_MAX_SLOTS

This is essentially a cap on the maximum growth size for each memory request, and it works in conjunction with RUBY_GC_HEAP_GROWTH_FACTOR. If you have to make several requests for memory growth and your factor is greater than 1, you can potentially end up asking for a huge increase in memory each time. Placing a cap on this allows the memory requested to grow with the needs of the application while staying within a reasonable limit. Again, the drawback is that you can potentially slow the system by making more requests for memory.
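The interaction is easiest to see with a toy calculation. The sketch below is an assumed, simplified model (growth of factor minus one times the current slot count, capped at the max), not the exact logic inside MRI, and the factor and cap values are made up for illustration.

# Assumed, simplified model of how the cap interacts with the growth factor;
# the factor and cap values here are made up for illustration.
GROWTH_FACTOR    = 1.25
MAX_GROWTH_SLOTS = 300_000

def next_heap_slots(current_slots)
  growth = (current_slots * (GROWTH_FACTOR - 1)).to_i
  current_slots + [growth, MAX_GROWTH_SLOTS].min
end

puts next_heap_slots(500_000)   # grows by 125,000 slots => 625,000
puts next_heap_slots(2_000_000) # growth capped at 300,000 slots => 2,300,000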

Runtime Memory Consumption

In order to reduce your memory footprint you first need to work out which methods are responsible for the increase in memory consumption. A great gem for collecting this information is ruby-prof. Sadly, in order to profile memory consumption we need to patch Ruby itself. Fortunately this is not as difficult as you may think. This guide assumes MRI Ruby version 2.1.2 and the use of rbenv to manage versions; however, the principles should apply regardless.

Patch Ruby To Collect Memory

The rvm-patchsets project provides railsexpress patches which add the functionality we are looking for.

cd tmp
mkdir patching
cd patching
wget https://raw.githubusercontent.com/skaes/rvm-patchsets/master/patches/ruby/2.1.2/railsexpress/03-display-more-detailed-stack-trace.patch
wget https://raw.githubusercontent.com/skaes/rvm-patchsets/master/patches/ruby/2.1.2/railsexpress/02-improve-gc-stats.patch
wget https://raw.githubusercontent.com/skaes/rvm-patchsets/master/patches/ruby/2.1.2/railsexpress/01-zero-broken-tests.patch
cat *.patch | rbenv install --patch -f 2.1.2

With the patch in place we can start generating some meaningful memory profiling reports. Before we dive in, the raw output can be quite hard to read, so we can install one last tool to help us interpret the results. I use QCachegrind, which is essentially the Mac version of KCachegrind. It can read the output from our Ruby profile and present it in a more intuitive format. OK, finally we can profile some code!

require 'ruby-prof'
# Remember to set ruby-prof to profile memory, as this is not the default
RubyProf.measure_mode = RubyProf::MEMORY

# Start will indicate when to begin collecting data
RubyProf.start

# This is the code that you want to actually profile
# Can be many lines of code or just one
AuditTagsJob.new.perform

# Stop the profiler and store the results
result = RubyProf.stop

# Write the results to a file
open("callgrind.profile", "w") do |f|
  RubyProf::CallTreePrinter.new(result).print(f, :min_percent => 1)
end

This produced the following when sorted by individual method memory consumption (self).

As can be seen, a lot of memory is being used creating regular expressions, along with URI parsing and string concatenation. The real art of solving these issues is working out where the problem is coming from. QCachegrind allows you to drill into what is calling these methods to try and gain some understanding of what’s causing the issue. The following shows the process that led to the eventual fix of our memory problem.

Ultimately this leads us to:

# for UrlEncoded escape the keys
def handle_key key
  parser.escape(key.to_s, Regexp.new("[^#{URI::PATTERN::UNRESERVED}]"))
end

This is the source for RestClient::Payload::UrlEncoded#handle_key.

We were building extremely large POST requests and passing the parameters as a hash, which meant we were using RestClient’s URL-encoding implementation. It has been updated since the version we were using, but we realised that we could provide a much simpler and more memory-efficient implementation.
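The replacement looked roughly like the sketch below. It is not our exact production code (the url and params here are placeholders), but it captures the idea: let the standard library build the form body once rather than escaping every key through a freshly built regular expression.

# A minimal sketch of the kind of replacement, not our exact production code.
require 'uri'

def post_large_form(url, params)
  # Let the standard library build the body once, instead of escaping each
  # key through RestClient's per-key regular expression.
  body = URI.encode_www_form(params) # {:a => 1, :b => "x y"} => "a=1&b=x+y"
  RestClient.post(url, body, :content_type => 'application/x-www-form-urlencoded')
end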

After this change the memory profile showed a significant improvement! We can now move on to resolve the next memory issue. Sadly it is not always as straightforward as this, but hopefully this shows how dramatic memory improvements can be made without too much effort. The inclusive memory usage is now only 32mb, compared with a whopping 538mb before the change.

Memory Leak

When memory is allocated that can never be garbage collected, it forms a permanent allocation. If the process repeats and keeps adding allocations that cannot be garbage collected, this is known as a memory leak. It will continually push up the memory usage of your process until it runs out of memory.
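A deliberately leaky sketch (the names here are made up) makes the distinction from bloat clear: every call adds objects to a collection that is never cleared, so no amount of garbage collection can reclaim them.

# Deliberately leaky sketch; the names are made up for illustration.
# Every call stores another object in a global constant that is never cleared,
# so the garbage collector can never reclaim it and memory grows on every call.
LEAKY_CACHE = []

def handle_request(payload)
  LEAKY_CACHE << payload.dup
end

10_000.times { |i| handle_request("request #{i}" * 100) }
puts LEAKY_CACHE.size # 10000 objects that will live for the life of the process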

A simple example of this (pre Ruby 2.2) was symbols. Symbols were global objects that did not get garbage collected, so if you dynamically created symbols you could easily introduce a memory leak into your application. As of Ruby 2.2, dynamically created symbols can be garbage collected and this is no longer a problem. In order to try and work out if any of our workers were leaking, we added a hook around the code that is executed.

class DelayedJobMemLogPlugin < Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.around(:invoke_job) do |job, *args, &block|
      GC.start
      start_mem = NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample
      block.call(job, *args)
      GC.start
      end_mem = NewRelic::Agent::Samplers::MemorySampler.new.sampler.get_sample

      mem_info = {
        :eventType    => "DelayedJobMemLog",
        :environment  => ADS.config.env,
        :job_name     => job.name,
        :start_memory => start_mem,
        :end_memory   => end_mem,
        :symbol_count => Symbol.all_symbols.size
      }

      RestClient::Request.execute(
        :method  => :post,
        :url     => INSIGHTS_APP_URL,
        :payload => mem_info.to_json,
        :headers => {
          'Content-Type' => 'application/json',
          'X-Insert-Key' => YOUR_API_KEY_HERE
        })
    end
  end
end

This forced a garbage collection and took a sample of memory usage, ran the worker code, and then ran another garbage collection and took another memory sample. Whilst not an exact science, the memory usage before and after should be similar; if there is a consistent increase you may have a memory leak. With this data in place you can create some cool graphs. At RightScale we are using NewRelic Insights, as can be seen below; it looks like our next port of call should be the ReservationUsagesJob.

From here you can use the tips outlined in the Runtime Memory Consumption section to try and diagnose what is causing the leak.

Resources