I have two Ubuntu servers running at home, both with large RAID volumes set up via mdadm. This summer I had a total disk failure in one of my RAID5s, which luckily didn’t result in data loss. Thank you, RAID5!
In any case, it caused me to write a script that logs SMART data to a MySQL database. I also wrote an admin webpage that displays this data in an easy-to-follow way. Both the monitoring script and the admin pages are written in php5. I used php5 because it is easy to communicate with MySQL from it, and it has the string manipulation functions I needed. It could probably be done just as easily in Python though. The script is called as a cron job every 2 hours on the servers and every hour on the desktops when they’re running. Examples of the code I’m using are attached below and include the cron-ed script and the code generating the log output and plot.
What to look for?
Well, that is the big question. How do you know if a drive is about to fail? Researchers at Google looked into this topic back in 2007 in this paper: «Failure Trends in a Large Disk Drive Population». It is an interesting read if you are at all concerned about hard drive failure in servers. The results for how temperature affects the lifetime and failure rate of hard drives are especially interesting. It turns out, at least in their data, that low temperature isn’t such a good thing for the drives, contrary to what many people seem to assume. Up until now I have been concerned that my drives get too hot, but in fact they seem to be almost overcooled the way I have things set up now.
When I wrote these scripts this summer I decided to log primarily temperature and reallocated sector count, and these are what the log display scripts emphasize. After reading that paper, it seems I should also be logging scan errors. The colour coding I use in the temperature plot below is loosely based on Figure 4 in the paper and reflects what seems to be the optimal operating temperature for hard drives.
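To make the two logged values concrete, here is a minimal sketch of pulling them out of the attribute table that `smartctl -A` prints. It is written in Python rather than the php5 my actual scripts use, and the sample text, the function name `parse_smart_attributes` and the choice of the attributes `Temperature_Celsius` and `Reallocated_Sector_Ct` are assumptions for illustration. Attribute names and raw value formats vary between drive vendors.

```python
import re

# Illustrative excerpt of `smartctl -A` output; real values differ per drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8123
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (0 14 0 0)
"""


def parse_smart_attributes(text):
    """Map attribute name -> leading integer of its RAW_VALUE column."""
    attrs = {}
    for line in text.splitlines():
        # The table has ten columns; the raw value may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header line or anything that is not an attribute row
        match = re.match(r"\d+", fields[9])
        if match:
            attrs[fields[1]] = int(match.group())
    return attrs


attrs = parse_smart_attributes(SAMPLE)
temperature = attrs.get("Temperature_Celsius")
reallocated = attrs.get("Reallocated_Sector_Ct")
```

Using `.get()` means a drive that does not report one of the attributes yields `None` instead of an exception, which is easier to store as a missing value in the log.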
Screenshots



Screenshot 1: The overview page.
Screenshot 2: Details of one of the RAID-drives.
Screenshot 3: Details of one of the drives with reallocated sectors.
Code
The php source code I wrote is available in this file: hd-mon.tar.gz
It is specifically designed to work with my setup and hardware and probably isn’t universal, but it gives an idea of how I set it up. There are probably better ways of doing this though. I just call shell commands from php, do a simple search on the returned text string to pick out the values I want, and insert the data into a MySQL database.
I also included php snippets showing how the admin page is generated. These are not standalone php files; they need to be wrapped in a template. However, they reproduce what is seen in the screenshots above.
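The approach described above (run a shell command, search the returned text, store the result) can be sketched roughly as follows. This is not the code in hd-mon.tar.gz: it is Python instead of php5, it uses the standard library's `sqlite3` as a stand-in for MySQL so that it runs without a database server, and the table name `smart_log`, its columns and the function names are all hypothetical.

```python
import sqlite3
import subprocess
import time


def smartctl_output(device):
    """Return the text printed by `smartctl -A <device>` (usually needs root)."""
    # smartctl's exit status is a bitmask, so a non-zero code is not
    # necessarily a failure to read the drive; we only look at stdout.
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True, check=False)
    return result.stdout


def find_raw_value(text, attribute):
    """Simple search: integer in the RAW_VALUE column of one attribute, or None."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute:
            return int(fields[9]) if fields[9].isdigit() else None
    return None


def log_reading(conn, device, text):
    """Store one temperature / reallocated-sector reading for a device."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS smart_log ("
        "logged_at INTEGER, device TEXT, temperature INTEGER, reallocated INTEGER)"
    )
    conn.execute(
        "INSERT INTO smart_log VALUES (?, ?, ?, ?)",
        (int(time.time()), device,
         find_raw_value(text, "Temperature_Celsius"),
         find_raw_value(text, "Reallocated_Sector_Ct")),
    )
    conn.commit()
```

A cron-ed script would then call something like `log_reading(sqlite3.connect("smart_log.db"), "/dev/sda", smartctl_output("/dev/sda"))` for each drive. Passing the command output in as text keeps the parsing and storing parts testable on a machine without smartctl installed.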
Packages needed for these scripts to run:
- php5-cli for the php5 command line.
- mysql-client, php5-mysql for the database connection.
- smartmontools, which provides the smartctl command, to access the SMART data.
- mdadm to access RAID data (assuming you use mdadm for RAIDs in the first place).
All are available in the Ubuntu repository.
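As an illustration of what accessing RAID data through mdadm can look like, here is a small Python sketch that parses the `key : value` lines printed by `mdadm --detail /dev/md0`. Whether the scripts in the archive use `--detail` is not something this example tries to reproduce; the sample text, the function names and the health rule are assumptions.

```python
# Illustrative excerpt of `mdadm --detail /dev/md0` output for a degraded array.
SAMPLE_DETAIL = """\
/dev/md0:
        Version : 1.2
     Raid Level : raid5
   Raid Devices : 4
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
"""


def parse_mdadm_detail(text):
    """Collect the 'key : value' lines of the report into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # lines without ' : ' (device name, member table) are skipped
            info[key.strip()] = value.strip()
    return info


def array_is_healthy(info):
    """Hypothetical rule: not degraded and no failed member devices."""
    states = [s.strip() for s in info.get("State", "").split(",")]
    return "degraded" not in states and info.get("Failed Devices") == "0"


info = parse_mdadm_detail(SAMPLE_DETAIL)
healthy = array_is_healthy(info)
```

The same dict could be logged alongside the SMART readings so that a degraded array shows up on the overview page.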



