Calculating Downtime On a Linux System

Whether you are an engineer or an end user, most of us who work with critical systems are concerned with the availability of the services those systems provide. These systems occasionally fail for one reason or another, and when they do we want to know just how long they were down. This is especially true if you work in a datacenter, work as an engineer, or are a customer who simply wants to know whether a provider is meeting its established SLA for uptime and availability.

The question I often get is, "Why would you choose not to rely on external services such as Pingdom when they are available?" The answer can be complex depending on who you are, but in general external services are subject to the same types of networking and routing issues that any of us are. Because of that, these services can paint a pretty grim picture even when no real server-side problem exists. It is generally better to rely on monitoring systems within your internal network, as well as data provided by server-side services, even if you have to write them yourself. With that out of the way, let's take a look at some basic information that might help us determine the downtime on a Linux system.

The first thing to look for when calculating downtime is reboots. Run the following command to see when this server has been rebooted:

[root@host ~]# last reboot
reboot   system boot  3.2.31           Thu Oct 18 01:08 - 08:05 (20+07:56)
reboot   system boot  3.2.28           Wed Aug 29 03:45 - 01:05 (49+21:19)
reboot   system boot  3.2.17           Fri May 18 03:05 - 03:41 (103+00:36)
reboot   system boot  3.2.15           Fri Apr 20 01:47 - 03:01 (28+01:13)

wtmp begins Sat Mar 31 04:31:21 2012

From the above output, we can see that this server was rebooted on Oct 18, Aug 29, May 18, and Apr 20. Each line also displays the time the server booted and the time it stopped, along with how long it was up. On Aug 29 the server booted at 3:45 AM, then stopped 49 days, 21 hours, and 19 minutes later, at 1:05 AM on Oct 18. The system started again at 1:08 AM on Oct 18, so we can determine that the server was offline for a total of 3 minutes.

In some cases, however, the reboot times will not provide enough information about the amount of downtime a system has seen. This is especially true if we are attempting to investigate how well we are meeting an SLA, for instance the golden rule of 99.9%. We should probably do a little more research, specifically around the times that the reboots listed above occurred. Generally speaking, in an enterprise environment it is fairly uncommon for a system reboot to occur unless there is a problem or a specific need.
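A quick sketch of that arithmetic, assuming GNU date is available to convert the wtmp timestamps into epoch seconds (the timestamps are copied from the example output above, and the 99.9% budget is shown for a round 30-day month):

```shell
# The Aug 29 boot stopped at 01:05 on Oct 18; the next boot began at 01:08.
stop=$(date -d "2012-10-18 01:05" +%s)
start=$(date -d "2012-10-18 01:08" +%s)
echo "downtime: $(( (start - stop) / 60 )) minutes"

# For comparison, a 99.9% SLA over a 30-day month allows roughly:
awk 'BEGIN { printf "budget: %.1f minutes/month\n", 30*24*60 * 0.001 }'
```

The same subtraction works for any pair of stop/start timestamps `last reboot` gives you.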

For each reboot, you should look over the sar data logged before the reboot in order to get an idea of what the server may have been doing. We are interested in the pre-reboot conditions in case those conditions are responsible for the downtime. I would also suggest looking at sar data for other dates and times to identify other potential issues. On this system, data collected by sar is logged in /var/log/sa/YearMonth/saDD (where DD is the day of the month), and all but the last two weeks of data are bzipped (in my case). The following one-liners will bunzip the last two months' worth of sar data, run "sar -r", and print the five lines with the highest "%commit" values for each day. They will also show each reboot as it is recorded by sar.

Before we go much further: some of you may not be familiar with sar, which is a topic for another post that I will get to another day. The short answer is that sar, the System Activity Reporter, historically archives information about conditions on a server. This information is often useful when attempting to narrow down bottlenecks or problems on a system.

For the current month:

[root@host ~]# cd /var/log/sa/$(date +%Y%m); ls -1 *.bz2 > files; while read file; do bunzip2 $file; done < files; ls -tr | grep -oP '^sa\d+' | while read i; do echo; echo "== $(ls -alh $i | awk '{print $6,$7,$8}') =="; sar -f $i -r 2>&1 | grep -v "Invalid" | awk '{print $9" "$1" "$2}' | sort -n | tail -5; sar -f $i | grep -i restart -C4; done; echo

For the previous month:

[root@host ~]# cd /var/log/sa/$(date -d "$(date +%Y-%m-15) -1 month" +%Y%m); ls -1 *.bz2 > files; while read file; do bunzip2 $file; done < files; ls -tr | grep -oP '^sa\d+' | while read i; do echo; echo "== $(ls -alh $i | awk '{print $6,$7,$8}') =="; sar -f $i -r 2>&1 | grep -v "Invalid" | awk '{print $9" "$1" "$2}' | sort -n | tail -5; sar -f $i | grep -i restart -C4; done; echo

You can change the sar commands to print any data that you believe may be relevant, but in my experience, when %commit grows over 1000% the server will typically become unresponsive. To better understand that you would certainly need to know what %commit represents, though again that is something better covered in a separate post. Other types of data might show performance problems that would not cripple a system but would certainly indicate slow or poor performance. In general I wouldn't qualify that as downtime, but it would be something worth investigating. You might then look into specific incidents in more detail to get an idea of just how long each event actually lasted and what impact it might have had on availability. For example, on this system I can see that the server was rebooted on Oct 18, so after bunzipping the sar data I can view Oct 18 with the following command:
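As a sketch of how you might flag those high %commit intervals programmatically, the filter below runs the same column-9 extraction the one-liners use, but against a couple of simulated "sar -r" lines (the data here is made up for illustration; on a real system you would pipe `sar -f /var/log/sa/saDD -r` into the awk filter, and note that the %commit column number can differ between sysstat versions, so verify it against the header line):

```shell
# Simulated "sar -r" sample lines; with 12-hour timestamps, %commit lands
# in field 9 (time, AM/PM, then seven memory columns). Flag anything over 1000.
printf '%s\n' \
  "01:10:01 AM 102400 204800 66.6 512 51200 307200 85.00" \
  "01:20:01 AM 102400 204800 66.6 512 51200 307200 1250.00" |
awk '$9+0 > 1000 { print $1, $9 }'
```

Only the second sample exceeds the threshold, so only its timestamp and %commit value are printed.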

[root@host ~]# sar -f /var/log/sa/201210/sa18 -r

In this instance the sar data doesn't indicate any particular issues, and I haven't included the output for the sake of saving space in the article. Another good place to look while you are investigating downtime is "sar -n DEV", which can reveal potential network issues. A large spike in traffic could indicate an attack, while a sudden drop could indicate other network issues.
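To make "sudden spike" concrete, here is a crude detector run over simulated per-interval rate samples (the numbers are made up; on a real system you would extract the rxkB/s or txkB/s column from "sar -n DEV" output for the interface you care about):

```shell
# Flag any sample that is more than 10x the previous one.
printf '%s\n' 12.4 11.9 980.2 13.1 |
awk 'NR > 1 && prev > 0 && $1 > 10 * prev { print "spike at sample " NR ": " $1 }
     { prev = $1 }'
```

A ratio test like this is deliberately rough; it catches the jump from ~12 to ~980 without needing to know what "normal" traffic looks like ahead of time.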

When you are finished with the sar data, you should clean up after yourself by bzipping the proper files. If you used the one-liners above, then the following should help you get everything cleaned up.

For the current month:

cd /var/log/sa/$(date +%Y%m); sed 's/\.bz2$//' files | while read file; do bzip2 $file; done; rm -v files

For the previous month:

cd /var/log/sa/$(date -d "$(date +%Y-%m-15) -1 month" +%Y%m); sed 's/\.bz2$//' files | while read file; do bzip2 $file; done; rm -v files

Outside of some basic math, that is all there really is to it. Further load isolation and investigation might lead you to problem points or bottlenecks that should be addressed, though recognizing them often takes time and experience. Additional articles posted here in the future may help you decipher sar output and find bottlenecks on a system.
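As a worked example of that basic math, using round numbers: the Aug 29 boot plus the Oct 18 outage cover roughly a 50-day window, during which we measured about 3 minutes of downtime, so availability over that window works out to:

```shell
# 3 minutes down across a ~50-day window:
awk 'BEGIN { down = 3; total = 50 * 24 * 60;
             printf "availability: %.4f%%\n", (1 - down / total) * 100 }'
```

That comes out just under five nines for the window, comfortably inside a 99.9% SLA; repeat the calculation with the total downtime from every reboot gap to get the figure for the whole wtmp record.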