EC2 Micro Instance Throttling

January 31, 2012

I'm a big fan of Amazon EC2 micro instances. If you have modest requirements, or if you need to split functionality onto separate servers for security purposes, micro instances can be a very economical way to do it.

But be careful. While micro instances perform very well in short bursts, if you consume excessive CPU for more than a few seconds, the EC2 infrastructure may throttle back your VM, perhaps by more than a factor of thirty!

I stumbled across this when our monitoring tools were regularly reporting trouble connecting to our micro instance servers each day around 6:00 in the morning. It turns out that's when the system was running rkhunter, which can run for a minute or more at high CPU loads looking for malware. It seemed a shame to upgrade to a "small" instance, at four times the cost, just to satisfy a non-critical, daily batch job, so I set out to find some way to run rkhunter without running afoul of Amazon's throttling mechanism.

The solution is as follows.

First I had to find a duty cycle that stays under Amazon's throttling threshold. By experimenting with the script shown on gregsramblings.com, I found that one second of execution followed by nine seconds of sleep ran indefinitely without throttling.
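For readers who want to repeat the experiment, here is a rough sketch of that kind of probe. It is not the gregsramblings.com script, and the function names burst and probe are made up for illustration. It spins for a fixed busy period, reports how many loop iterations it managed, then sleeps. If the counts drop sharply from one cycle to the next, the instance is probably being throttled.

```shell
#!/bin/sh
# Hypothetical helper names; a sketch, not the script referenced above.

# burst SECONDS: spin for roughly SECONDS of wall-clock time and print
# how many loop iterations were completed in that time.
burst() {
    end=$(( $(date +%s) + $1 ))
    count=0
    while [ "$(date +%s)" -lt "$end" ]; do
        count=$((count + 1))
    done
    echo "$count"
}

# probe BUSY IDLE CYCLES: alternate BUSY seconds of spinning with IDLE
# seconds of sleep, CYCLES times, printing the count for each burst.
probe() {
    i=0
    while [ "$i" -lt "$3" ]; do
        echo "cycle $i: $(burst "$1") iterations"
        sleep "$2"
        i=$((i + 1))
    done
}

# Example: one second busy, nine seconds idle, sixty cycles (ten minutes):
#   probe 1 9 60
```

Try different busy and idle values until the iteration counts stay flat over a long run.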

To apply this to rkhunter, it helps to know a little about signals, which are the Unix/Linux way to manage running processes. You send a signal to a process using the kill utility, passing the name of the signal to send and the pid of the target process. You can use pkill to identify the process by name if you don't know the pid.

If you run kill without specifying a signal it assumes the TERM signal, which terminates the process. Obviously that doesn't help much here; however, there are two signals that do help. The STOP signal pauses the target process, removing it from the operating system's scheduling queue. The CONT signal resumes the process. We can use these two signals to ensure rkhunter goes about its business in sprints of one second, taking a nine-second breather in between and avoiding the wrath of the Amazon throttler.
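To get a feel for STOP and CONT before pointing them at rkhunter, you can practice on a harmless sleep process. This is just a sketch, assuming a Linux ps that accepts the state output field. The state_of helper is my own name, not a standard tool.

```shell
#!/bin/sh
# Start a harmless background process to practice on.
sleep 30 &
pid=$!

# Print the first letter of the process state that ps reports for a pid.
state_of() {
    ps -o state= -p "$1" | tr -d ' ' | cut -c1
}

kill -STOP "$pid"        # pause: the process is no longer scheduled
sleep 1                  # give the signal a moment to take effect
stopped_state=$(state_of "$pid")
echo "after STOP: $stopped_state"

kill -CONT "$pid"        # resume where it left off
sleep 1
resumed_state=$(state_of "$pid")
echo "after CONT: $resumed_state"

kill "$pid"              # no signal named, so TERM is sent
```

On my understanding of ps, a stopped process shows state T, and a resumed sleep goes back to S.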

OK, enough theory. To implement this I renamed /etc/cron.daily/rkhunter to /etc/cron.daily/rkhunter.norun. The daily scripts are launched by run-parts, which skips filenames containing a dot, so the renamed script no longer runs directly. Then I created a replacement called /etc/cron.daily/rkhunter-throttled that looks like this:

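#!/bin/sh

# Start the real rkhunter job in the background.
/etc/cron.daily/rkhunter.norun &

# Let rkhunter run for one second, then pause it for nine.
# pkill fails once no rkhunter process remains, which ends the loop.
while true; do
    sleep 1
    if ! pkill -STOP -x rkhunter >/dev/null 2>&1; then break; fi
    sleep 9
    if ! pkill -CONT -x rkhunter >/dev/null 2>&1; then break; fi
done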

Note the -x argument, which asks pkill to match the process name rkhunter exactly. If we don't do this the script pauses itself (because it's called rkhunter-throttled) and hangs forever.

This approach should work for any long-running background process on your instance, so long as you don't mind it taking ten times longer than usual.

Update (2012-02-13): Turns out the above approach had its own problems. I noticed that the throttled rkhunter runs were hanging around for days. Here's an example, generated by running ps -ef --forest:

root       542     1  0  2011 ?        00:00:10 cron
root     22026   542  0 Jan31 ?        00:00:00  \_ CRON
root     22027 22026  0 Jan31 ?        00:00:00  |   \_ /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
root     22028 22027  0 Jan31 ?        00:00:00  |       \_ run-parts --report /etc/cron.daily
root     22376 22028  0 Jan31 ?        00:01:28  |           \_ /bin/sh /etc/cron.daily/rkhunter-throttled
root     22377 22376  0 Jan31 ?        00:00:02  |               \_ /bin/sh /etc/cron.daily/rkhunter.norun
root     22380 22377  0 Jan31 ?        00:00:06  |               |   \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
root      3253 22380  0 Jan31 ?        00:00:05  |               |       \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
root      3255  3253  0 Jan31 ?        00:00:00  |               |           \_ sed -e s:^::
root     18732 22376  0 21:24 ?        00:00:00  |               \_ sleep 1

This looks pretty normal. rkhunter is spawning children and invoking other processes. Something's become stuck, but it's not clear why. Let's try looking at the process state by running ps -e -o pid,ppid,state,cmd --forest:

22026   542 S  \_ CRON
22027 22026 S  |   \_ /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
22028 22027 S  |       \_ run-parts --report /etc/cron.daily
22376 22028 S  |           \_ /bin/sh /etc/cron.daily/rkhunter-throttled
22377 22376 S  |               \_ /bin/sh /etc/cron.daily/rkhunter.norun
22380 22377 T  |               |   \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
 3253 22380 T  |               |       \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
 3255  3253 T  |               |           \_ sed -e s:^::
20305 22376 S  |               \_ sleep 9

According to the ps manual, state S means "Interruptible sleep", while T means "Stopped, either by a job control signal or because it is being traced". Notice that the sed process launched by rkhunter is stopped. This is troublesome, because our throttling script will only continue processes called rkhunter. No wonder our job got stuck.

But how did sed get stopped in the first place, considering we also only send the stop signal to processes named rkhunter? I'm not exactly sure, but the answer may lie in the way processes on Unix-like operating systems create child processes.

To create a child process, the parent process calls the fork system call, which creates a near-exact copy of the current process. Both parent and child then continue executing from the point where fork returns. After the fork, the child process typically wants to run a different executable, so it calls exec, which replaces the child's program with code from the specified executable while keeping the same pid.

Now consider how a process like rkhunter runs sed to do some work. It first calls fork. There are now two rkhunter processes running on the system. The child rkhunter then calls exec to replace itself with sed. So far so good. But what happens if we call pkill -STOP after the fork, but during the call to exec? As far as I can tell, the exec call, being a system call, is not immediately interrupted, but is instead allowed to complete before the process is stopped. When the exec call completes, we have a stopped sed, and our script will never continue it since it's no longer called rkhunter.
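You can watch a process change its name across exec without changing its pid. The following is a small sketch, assuming a Linux ps with the comm output field. The name_of helper is my own. The inner shell waits two seconds and then execs sleep in place of itself.

```shell
#!/bin/sh
# The inner shell lingers for two seconds, then replaces itself with sleep.
sh -c 'sleep 2; exec sleep 30' &
pid=$!

# Print the process name that ps reports for a pid.
name_of() {
    ps -o comm= -p "$1" | tr -d ' '
}

sleep 1
before=$(name_of "$pid")   # still the shell at this point
sleep 2
after=$(name_of "$pid")    # same pid, but now it is sleep

echo "pid $pid was '$before' and is now '$after'"
kill "$pid"
```

A name-based match such as pkill -x would find this pid under the first name but not the second, which is the trap the throttling script fell into.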

After pondering this problem for a while, I decided that process groups may provide a more robust way to implement throttling. Every process is a member of a process group, and a signal can be sent to all processes in a group at the same time. By default, a child process inherits the process group of its parent. The trick is to get rkhunter and its child processes into a different process group than our throttler script, so that we can stop and start the entire group without worrying about the timing of our STOP and CONT signals.
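The inheritance is easy to see for yourself. This sketch assumes it is run as a plain, non-interactive script (an interactive shell with job control puts each background job in its own group). The pgid_of helper is my own name.

```shell
#!/bin/sh
# Print the process group ID that ps reports for a pid.
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

# A background child started from a plain, non-interactive script.
sleep 30 &
child=$!

script_pgid=$(pgid_of $$)
child_pgid=$(pgid_of "$child")
echo "script pgid: $script_pgid, child pgid: $child_pgid"

kill "$child"
```

Both numbers should match, which is why signalling the group as-is would stop the throttler along with the job it is supposed to be managing.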

Luckily, the util-linux package on Ubuntu contains a utility called setsid, which simply runs a child process in a new session, which implies a new process group. Here's the new throttler script using process groups:

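#!/bin/sh

# Run the real job in its own session, and therefore its own process group.
setsid /etc/cron.daily/rkhunter.norun &

# Started this way from a non-interactive script, setsid is not a process
# group leader, so it does not fork and $! is also the new process group ID.
PGRP=$!

# A negative pid makes kill signal every process in that group.
# kill fails once the group is empty, which ends the loop.
while true; do
    sleep 1
    if ! kill -STOP -$PGRP >/dev/null 2>&1; then break; fi
    sleep 9
    if ! kill -CONT -$PGRP >/dev/null 2>&1; then break; fi
done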
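To convince yourself that one signal really reaches the whole tree, here is a small sketch using a stand-in job (a shell with two sleep children) instead of rkhunter. It assumes util-linux setsid and a Linux ps, run from a non-interactive script. The group_states helper is my own.

```shell
#!/bin/sh
# Launch a small process tree in its own session and process group.
setsid sh -c 'sleep 30 & sleep 30 & wait' &
leader=$!
sleep 1

# Print the state letters of every process whose pgid matches the argument.
group_states() {
    ps -e -o pgid= -o state= | awk -v g="$1" '$1 == g { printf "%s", $2 }'
}

kill -STOP "-$leader"          # one signal, delivered to the whole group
sleep 1
stopped=$(group_states "$leader")
echo "after STOP: $stopped"

kill -CONT "-$leader"
sleep 1
resumed=$(group_states "$leader")
echo "after CONT: $resumed"

kill -TERM "-$leader"          # clean up the whole group
```

Every member of the group should show T after the STOP and S after the CONT, no matter what each process happens to be called.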

This script has been running for a few days and so far there are no problems with rkhunter hanging around.
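Since the same trick should apply to other long-running jobs, here is a sketch of a more general version. The throttle function and its arguments are my own invention for illustration, and I have not run it in production the way I have the script above. It takes the run time, the pause time, and the command to throttle.

```shell
#!/bin/sh
# throttle RUN_SECS PAUSE_SECS COMMAND [ARGS...]
# Hypothetical wrapper: runs COMMAND in its own process group and lets it
# execute for RUN_SECS out of every RUN_SECS + PAUSE_SECS seconds.
throttle() {
    run_secs=$1
    pause_secs=$2
    shift 2

    # Assumes a non-interactive script, so $! is the new process group ID.
    setsid "$@" &
    pgrp=$!

    # kill fails once the group is gone, which ends the loop.
    while true; do
        sleep "$run_secs"
        kill -STOP "-$pgrp" 2>/dev/null || break
        sleep "$pause_secs"
        kill -CONT "-$pgrp" 2>/dev/null || break
    done

    # Pass the command's exit status back to the caller.
    wait "$pgrp"
}

# Example, matching the one-in-ten duty cycle used above:
#   throttle 1 9 /etc/cron.daily/rkhunter.norun
```

As before, expect the job to take roughly ten times longer at a one-in-ten duty cycle.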

By the way, if you're interested in the full gory details of Unix processes, signals, and such, I highly recommend Advanced Programming in the UNIX Environment by W. Richard Stevens. Although it is targeted at C programmers, it very clearly explains many of the concepts required by any Linux administrator.