The Hidden Curse of High Uptime
A number of Unix/Linux people seem to pride themselves on obtaining the highest uptime they can. While this may seem like a little harmless fun, in a production environment (which is mostly a fun-free place) it can hide a number of problems that will later become major issues.
At some point the machine will have to come down for a power-off or reboot, and then it's expected to come back up; this is where the problems can start. In almost any environment, no matter how simple (and the problem gets worse as more complexity and more people are involved), changes will be made to the running system and given some testing time, and then they will be forgotten about and never made persistent enough to survive a reboot.
Whether it's the simple addition of a firewall rule that's never written to the config file, an unsaved routing-table entry, or forgetting to enable a service in rc.local, on any machine with a high uptime there is a chance that something won't come back up. And if it's a remote box, it'll be something that stops you getting in to fix it; Murphy ensures this.
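Taking the firewall case as a concrete example, a drift check is easy to sketch. On a Debian-style box the saved rules would typically live in /etc/iptables/rules.v4 and the live ruleset would come from `iptables-save`; both of those are assumptions here, and the sketch below uses stand-in temp files so it runs anywhere, without root or iptables installed.

```shell
#!/bin/sh
# Sketch of a firewall-rule drift check. In real use, "saved" would be
# /etc/iptables/rules.v4 (Debian convention, assumed) and "running" would be
# the output of `iptables-save`; temp files stand in for both here.

saved=$(mktemp)
running=$(mktemp)

# Stand-in data: someone added a rule live and never wrote it to the config.
cat > "$saved" <<'EOF'
-A INPUT -p tcp --dport 22 -j ACCEPT
EOF

cat > "$running" <<'EOF'
-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 8080 -j ACCEPT
EOF

if diff -u "$saved" "$running" >/dev/null; then
    drift=0
    echo "OK: running rules match the saved config"
else
    drift=1
    echo "DRIFT: running rules differ from the saved config and will be lost on reboot:"
    diff -u "$saved" "$running" || true
fi

rm -f "$saved" "$running"
```

Run something like this from cron and alert on drift, and the surprise gets caught while the person who made the change still remembers making it.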
My recommendation? Pick a schedule (monthly, quarterly, whatever your environment can stand), take the machines offline, and then see what doesn't come up (you do have monitoring in place, don't you?). If you have the opportunity, you should combine this with your UPS testing (and you had better be testing those!). If you can't afford to take a server down for testing, then you've got a resilience problem and a single point of failure that needs addressing.
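The schedule itself can be automated with a crontab entry. This is a sketch, assuming a Debian-style /etc/crontab (hence the extra user field) and the common day-of-month trick for "first Sunday of the month", since cron treats the day-of-month and day-of-week fields as OR rather than AND:

```shell
# Hypothetical /etc/crontab fragment: reboot at 06:00 on the first Sunday of
# each month. dom is restricted to 1-7 and the date test picks out the Sunday,
# because cron ORs the dom and dow fields. Note % must be escaped in crontabs.
0 6 1-7 * *  root  [ "$(date +\%u)" -eq 7 ] && /sbin/shutdown -r +5 "scheduled reboot test"
```

The five-minute delay on the shutdown gives anyone logged in a chance to cancel it if the timing turns out to be bad.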