EC2 Micro Instance Throttling
January 31, 2012
I'm a big fan of Amazon EC2 micro instances. If you have modest requirements, or if you need to split functionality onto separate servers for security purposes, micro instances can be a very economical way to do it.
But be careful. While micro instances perform very well in short bursts, if you consume excessive CPU for more than a few seconds, the EC2 infrastructure may throttle back your VM, perhaps by more than a factor of thirty!
I stumbled across this when our monitoring tools were regularly reporting trouble connecting to our micro instance servers each day around 6:00 in the morning. It turns out that's when the system was running rkhunter, which can run for a minute or more at high CPU loads looking for malware. It seemed a shame to upgrade to a "small" instance, at four times the cost, just to satisfy a non-critical, daily batch job, so I set out to find some way to run rkhunter without running afoul of Amazon's throttling mechanism.
The solution is as follows.
First I had to figure out how much CPU I could use before Amazon would start throttling the VM. By playing with the script shown on gregsramblings.com, I found that one second of execution followed by nine seconds of sleep could continue indefinitely without being throttled.
To apply this to rkhunter, it helps to know a little about signals, which are the Unix/Linux way to manage running processes. You send a signal to a process using the kill utility, passing the name of the signal to send and the pid of the target process. You can use pkill to identify the process by name if you don't know the pid. If you run kill without specifying a signal, it sends the TERM signal, which normally terminates the process. Obviously that doesn't help much here; however, there are two signals that do help. The STOP signal pauses the target process, removing it from the operating system's scheduling queue. The CONT signal resumes the process. We can use these two signals to ensure rkhunter goes about its business in sprints of one second, taking a nine-second breather in between and avoiding the wrath of the Amazon throttler.
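To see the two signals in action, here is a minimal sketch that pauses and resumes a throwaway sleep process and reads its state from /proc after each signal. It assumes Linux (for /proc) and a sleep that accepts fractional seconds, as GNU coreutils does; proc_state and wait_for_state are hypothetical helper names, not standard tools.

```sh
#!/bin/sh
# Sketch: pause and resume a process with STOP and CONT, reading its
# scheduler state from /proc after each signal (Linux-specific).

# Print the one-letter state of PID (S, R, T, ...). The name field in
# /proc/PID/stat is wrapped in parentheses and may contain spaces, so
# strip everything up to the last ") " before taking the first field.
proc_state() {
  sed 's/.*) //' "/proc/$1/stat" 2>/dev/null | cut -d' ' -f1
}

# Signals are delivered asynchronously, so poll for up to about 5 s.
wait_for_state() {
  i=0
  while [ "$i" -lt 50 ]; do
    [ "$(proc_state "$1")" = "$2" ] && return 0
    sleep 0.1
    i=$((i + 1))
  done
  return 1
}

sleep 60 &                 # stand-in for a long-running job like rkhunter
pid=$!

kill -STOP "$pid"          # taken off the scheduling queue: state T
wait_for_state "$pid" T && echo "after STOP: $(proc_state "$pid")"

kill -CONT "$pid"          # back on the queue; sleep is waiting again: S
wait_for_state "$pid" S && echo "after CONT: $(proc_state "$pid")"

kill "$pid"                # no signal named, so TERM
wait "$pid" 2>/dev/null || true
```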
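OK, enough theory. To implement this I renamed the daily rkhunter cron script to /etc/cron.daily/rkhunter.norun. The dot in the filename keeps run-parts, which cron uses to run the scripts in /etc/cron.daily, from running this script directly. Then I created a replacement called /etc/cron.daily/rkhunter-throttled, which looks like this: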
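```sh
#!/bin/sh
# Start the real rkhunter cron job in the background.
/etc/cron.daily/rkhunter.norun &

# Let rkhunter run for 1 second, then pause it for 9, until pkill finds
# no process named exactly "rkhunter" left to signal.
while true; do
  sleep 1
  if ! pkill -STOP -x rkhunter >/dev/null 2>&1; then break; fi
  sleep 9
  if ! pkill -CONT -x rkhunter > /dev/null 2>&1; then break; fi
done
```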
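Why does the rename work? Debian and Ubuntu's run-parts, when called without --lsbsysinit or --regex (as in the ps output further down), only runs files whose names consist entirely of letters, digits, underscores and hyphens. The sketch below imitates just that default name check in plain sh; run_parts_would_run is a hypothetical helper, not part of run-parts, and it ignores run-parts' other checks, such as whether the file is executable.

```sh
#!/bin/sh
# Sketch of run-parts' default name filter (illustration only):
# a name qualifies if it is non-empty and uses only [A-Za-z0-9_-].
run_parts_would_run() {
  case "$1" in
    '' | *[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains e.g. a dot
    *) return 0 ;;
  esac
}

for name in rkhunter rkhunter-throttled rkhunter.norun; do
  if run_parts_would_run "$name"; then
    echo "$name: run"
  else
    echo "$name: skipped"
  fi
done
```

On a real system, run-parts --test /etc/cron.daily prints the names of the scripts it would run without running them, which is an easy way to confirm the rename took effect.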
Note the -x argument in the pkill calls, which asks pkill to match the process name rkhunter exactly. Without it, the script would also pause itself (because it's called rkhunter-throttled) and hang forever.
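To make the -x point concrete: pkill and pgrep treat their pattern as an extended regular expression and, without -x, accept a match anywhere in the process name. The sketch below mimics both modes with grep -E and grep -Ex on a made-up list of names, so no processes are involved. On Linux the name pkill looks at is the kernel's command name, which is truncated to 15 characters, so the throttler script would appear as rkhunter-thrott.

```sh
#!/bin/sh
# Sketch: reproduce pkill's two matching modes with grep on a list of
# process names (hypothetical list; no real processes are signalled).
names='rkhunter
rkhunter-thrott
sed'

echo "pattern rkhunter, no -x:"
printf '%s\n' "$names" | grep -E 'rkhunter'     # rkhunter, rkhunter-thrott
echo "pattern rkhunter, with -x:"
printf '%s\n' "$names" | grep -Ex 'rkhunter'    # rkhunter only
```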
This approach should work for any long-running background process on your instance, so long as you don't mind it taking ten times longer than usual.
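As a back-of-the-envelope check on that factor of ten, assuming the job is CPU-bound and only makes progress during the one-second run windows, the wall-clock time is roughly CPU time × (run + pause) / run. The throttled_runtime function below is a hypothetical helper for that arithmetic.

```sh
#!/bin/sh
# Rough estimate, assuming a CPU-bound job that only progresses during
# the RUN seconds of each RUN+PAUSE cycle:
#   wall-clock seconds ~= cpu_seconds * (run + pause) / run
throttled_runtime() {
  cpu=$1 run=$2 pause=$3
  echo $(( cpu * (run + pause) / run ))
}

throttled_runtime 60 1 9    # one minute of CPU at 1 s on / 9 s off -> 600
```

So an rkhunter run that needs a minute or more of CPU stretches to ten minutes or more.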
Update (2012-02-13): Turns out the above approach had its own problems. I noticed that the throttled rkhunter runs were hanging around for days. Here's an example, generated by running ps -ef --forest:
```
root       542     1  0  2011 ?        00:00:10 cron
root     22026   542  0 Jan31 ?        00:00:00  \_ CRON
root     22027 22026  0 Jan31 ?        00:00:00  |   \_ /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
root     22028 22027  0 Jan31 ?        00:00:00  |       \_ run-parts --report /etc/cron.daily
root     22376 22028  0 Jan31 ?        00:01:28  |           \_ /bin/sh /etc/cron.daily/rkhunter-throttled
root     22377 22376  0 Jan31 ?        00:00:02  |               \_ /bin/sh /etc/cron.daily/rkhunter.norun
root     22380 22377  0 Jan31 ?        00:00:06  |               |   \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
root      3253 22380  0 Jan31 ?        00:00:05  |               |       \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
root      3255  3253  0 Jan31 ?        00:00:00  |               |           \_ sed -e s:^::
root     18732 22376  0 21:24 ?        00:00:00  |               \_ sleep 1
```
This looks pretty normal. rkhunter is spawning children and invoking other processes. Something's become stuck, but it's not clear why.
Let's try looking at the process state by running ps -e -o pid,ppid,state,cmd --forest:
```
22026   542 S  \_ CRON
22027 22026 S  |   \_ /bin/sh -c test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
22028 22027 S  |       \_ run-parts --report /etc/cron.daily
22376 22028 S  |           \_ /bin/sh /etc/cron.daily/rkhunter-throttled
22377 22376 S  |               \_ /bin/sh /etc/cron.daily/rkhunter.norun
22380 22377 T  |               |   \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
 3253 22380 T  |               |       \_ /bin/sh /usr/bin/rkhunter --cronjob --report-warnings-only --appendlog
 3255  3253 T  |               |           \_ sed -e s:^::
20305 22376 S  |               \_ sleep 9
```
According to the ps manual, state S means "Interruptible sleep" and T means "Stopped, either by a job control signal or because it is being traced". Notice that the sed process launched by rkhunter is stopped. This is troublesome, because our throttling script will only continue processes called rkhunter. No wonder our job got stuck.
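If you suspect the same problem on your own machine, it helps to list every stopped process, whatever its name. Here is a sketch, assuming Linux: list_stopped (a hypothetical helper) scans /proc and prints the PID and name of each process in state T, much like filtering the output of ps -e -o pid,state,cmd by hand.

```sh
#!/bin/sh
# Sketch: print "PID name" for every process in state T (stopped).
# Linux-specific: reads /proc/PID/stat directly instead of calling ps.
list_stopped() {
  for stat in /proc/[0-9]*/stat; do
    # A process may exit between the glob and the read; just skip it.
    line=$(cat "$stat" 2>/dev/null) || continue
    pid=${line%% *}                              # field 1: PID
    name=${line#*"("}; name=${name%")"*}         # field 2: name, inside (...)
    state=${line##*") "}; state=${state%% *}     # field 3: state letter
    if [ "$state" = T ]; then
      echo "$pid $name"
    fi
  done
}

list_stopped
```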
But how did sed get stopped in the first place, considering we only send the STOP signal to processes named rkhunter? I'm not exactly sure, but the answer may lie in the way processes on Unix-like operating systems create child processes.
To create a child process, the parent process calls the fork system call, which creates an exact copy of the current process. Both parent and child processes continue on in the code from the call to fork. After the fork, the child process typically wants to run a different executable, so it calls exec, which replaces the process's code with that of the specified executable.
Now consider how a process like rkhunter runs sed to do some work. It calls fork, and there are now two rkhunter processes running on the system. The child rkhunter then calls exec to replace itself with sed. So far so good. But what happens if pkill -STOP matches the child after the fork, while it is still called rkhunter, but during the call to exec? As far as I can tell, the exec call, being a system call, is not immediately interrupted, but is instead allowed to complete before the process is stopped. When the exec call completes, we have a stopped sed, and our script will never continue it, since it's no longer called rkhunter.
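The crux is that exec keeps the PID but changes the name that pkill matches against. This sketch, assuming Linux (/proc/PID/comm) and a sleep that accepts fractional seconds, starts a shell that waits on a FIFO and then execs sleep, and reports the name of the same PID before and after. The helper names and the FIFO are only scaffolding for the demo.

```sh
#!/bin/sh
# Sketch: the same PID keeps running across exec, but its name changes.
# Linux-specific: reads /proc/PID/comm. The FIFO only lets us look at the
# process both before and after it calls exec.
name_of() {
  cat "/proc/$1/comm" 2>/dev/null
}

# Poll for up to about 5 s until PID's name is NAME.
wait_for_name() {
  i=0
  while [ "$i" -lt 50 ]; do
    [ "$(name_of "$1")" = "$2" ] && return 0
    sleep 0.1
    i=$((i + 1))
  done
  return 1
}

dir=$(mktemp -d)
mkfifo "$dir/go"

# Stand-in for rkhunter's child: a shell that blocks on the FIFO and then
# replaces itself with another program, as the real child does with sed.
sh -c 'read x < "$1"; exec sleep 30' sh "$dir/go" &
pid=$!

wait_for_name "$pid" sh && echo "PID $pid before exec: $(name_of "$pid")"
echo go > "$dir/go"                  # unblock the child so it calls exec
wait_for_name "$pid" sleep && echo "PID $pid after exec: $(name_of "$pid")"

kill "$pid"
wait "$pid" 2>/dev/null || true
rm -r "$dir"
```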
After pondering this problem for a while, I decided that process groups may provide a more robust way to implement throttling. Every process is a member of a process group, and a signal can be sent to all processes in a group at the same time. By default, a child process inherits the process group of its parent. The trick is to get rkhunter and its child processes into a different process group than our throttler script, so that we can stop and start the entire group without worrying about the timing of our STOP and CONT signals.
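To see the inheritance, here is a small sketch that reads the process group ID (field 5 of /proc/PID/stat, so Linux only) of the shell and of a background child; pgid_of is a hypothetical helper. Both should print the same group, which is exactly the situation the throttler needs to avoid: a signal sent to that group would stop the throttler along with rkhunter.

```sh
#!/bin/sh
# Sketch: compare the process group of this shell and of a background
# child. Linux-specific: reads /proc/PID/stat.
pgid_of() {
  # Strip "PID (name) " first, because the name may contain spaces; the
  # remaining fields start with: state ppid pgrp ...
  sed 's/.*) //' "/proc/$1/stat" | cut -d' ' -f3
}

sleep 30 &            # the child inherits this shell's process group
child=$!

echo "shell $$: group $(pgid_of $$)"
echo "child $child: group $(pgid_of "$child")"

# kill -STOP -GROUP would stop every member of that group, and here that
# includes this shell, which is why rkhunter needs a group of its own.
kill "$child"
wait "$child" 2>/dev/null || true
```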
The util-linux package on Ubuntu contains a utility called setsid, which simply runs a program in a new session, which implies a new process group. Here's the new throttler script using setsid:
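```sh
#!/bin/sh
# Run the real rkhunter cron job in a new session, and hence a new process
# group. $! is the PID of that job, which also serves as its group ID.
setsid /etc/cron.daily/rkhunter.norun &
PGRP=$!

# A negative PID tells kill to signal the whole process group, so rkhunter
# and every child it spawns (such as sed) are stopped and resumed together.
while true; do
  sleep 1
  if ! kill -STOP -$PGRP >/dev/null 2>&1; then break; fi
  sleep 9
  if ! kill -CONT -$PGRP > /dev/null 2>&1; then break; fi
done
```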
This script has been running for a few days and so far there are no rkhunter processes hanging around.
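Since the process-group approach does not care what the job's processes are called, it generalizes to other long-running jobs. Here is a sketch of a reusable wrapper. throttle is a hypothetical function name; it needs setsid from util-linux, and like the script above it assumes it is called from a non-interactive shell, where background jobs are not process group leaders, so setsid does not need to fork again and $! doubles as the group ID.

```sh
#!/bin/sh
# Sketch: run a command throttled to RUN seconds on, PAUSE seconds off.
# Usage (hypothetical helper): throttle RUN PAUSE command [args...]
throttle() {
  if [ "$#" -lt 3 ]; then
    echo "usage: throttle RUN PAUSE command [args...]" >&2
    return 2
  fi
  run=$1 pause=$2
  shift 2

  # New session, hence new process group; its ID is the job's PID.
  setsid "$@" &
  pgrp=$!

  # Signalling -$pgrp reaches every process in the group, whatever its
  # name, so children forked and exec'd by the job pause and resume with
  # it. Once the whole group has exited, kill fails and the loop ends.
  while true; do
    sleep "$run"
    kill -STOP -"$pgrp" 2>/dev/null || break
    sleep "$pause"
    kill -CONT -"$pgrp" 2>/dev/null || break
  done

  wait "$pgrp"    # reap the job and pass on its exit status
}
```

For example, throttle 1 9 /etc/cron.daily/rkhunter.norun should behave like the script above, while also returning the job's exit status. One refinement worth considering: if the wrapper itself is killed during a pause, the job stays stopped, so a trap that sends CONT to the group on exit would make it more robust.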
By the way, if you're interested in the full gory details of Unix processes, signals, and such, I highly recommend Advanced Programming in the UNIX Environment by W. Richard Stevens. Although it is targeted at C programmers, it very clearly explains many of the concepts required by any Linux administrator.