No one likes a whinger - The systems fight back
After my little whine, I logged in to do my last checks for the evening and discovered that one of our webservers had died due to a hard drive going bang, our production environment Nagios box had lost one of its network connections, and a chunk of our SAN kit was complaining about power issues. Turns out that most of these were due to a power surge that killed a network switch and three of the rack power strips. On the plus side, no one outside of the systems team noticed. Resilience is a wonderful thing when you get it right.
Woke up this morning, checked the Nagioses Nagii and found
out that one of our other product's database servers had gone boom (my
fellow sysadmins were fixing that one) and the failover had mostly worked.
No interesting logs, no hardware problems, and a three-hour gap in syslog
(and only syslog) to help explain the outage.
What have I learned? That the production servers read my blog. And they hate me.