Failover Pairs - A Short Rant
Let’s cover the basics. If you’ve got two machines working as an identical failover pair then THEY SHOULD BE IDENTICAL. Adding services, hell, adding nearly anything, to only one of them is a mistake. You’ve now created a bias towards which one you need running, and you can no longer assume they’ll both do the same thing in the same situation, which defeats the whole point of having them. This might seem obvious, but the number of people who break this simple rule never fails to make that pretty little vein in my neck dance.
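One way to keep yourself honest here is to diff the two machines on a schedule. What follows is only a rough sketch: it assumes you can already dump each machine's installed services and versions into a dictionary (how you collect that is up to you), and the function name `diff_inventories` and the sample data are made up for illustration.

```python
def diff_inventories(primary, secondary):
    """Compare two {service_name: version} inventories and report drift.

    Both inventories are assumed to have been collected elsewhere;
    this function only compares them.
    """
    primary_names = set(primary)
    secondary_names = set(secondary)

    return {
        # Present on one machine but missing from the other.
        "only_primary": sorted(primary_names - secondary_names),
        "only_secondary": sorted(secondary_names - primary_names),
        # Present on both, but at different versions.
        "version_mismatch": sorted(
            name
            for name in primary_names & secondary_names
            if primary[name] != secondary[name]
        ),
    }


# Hypothetical inventories: someone added a reporting job to one box only.
primary = {"nginx": "1.24", "keepalived": "2.2", "cron-reports": "0.3"}
secondary = {"nginx": "1.24", "keepalived": "2.1"}

drift = diff_inventories(primary, secondary)
is_identical = not any(drift.values())
print(drift, is_identical)
```

If `is_identical` ever comes back false, you no longer have a pair, you have two different machines.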
Now we’ll discuss testing the failover. You should do regular, scheduled, signed-off failover tests. It might be difficult to get permission for a test when everything is working. This is typically because people don’t have enough confidence in the technology, people and process, often accompanied by uncertainty about the length and impact of the outage. In true chicken-and-egg style, you can only get that confidence by (successfully) performing the test and measuring the impact. You should have a staging setup that’ll let you perform the test as many times as you need to get it down pat. And then a couple more times, just to be certain, before you perform it in production.
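Measuring the impact doesn't need to be fancy. Here is a minimal sketch, assuming you ran some kind of probe against the service during the test and recorded a timestamp (in seconds) and a pass/fail result for each attempt. The function name `outage_windows` and the sample data are invented for the example, and the measured outage is only as precise as your probe interval.

```python
def outage_windows(samples):
    """Find outage windows in a list of (timestamp_seconds, ok) probe results.

    Samples must be sorted by time. Each window runs from the first failed
    probe to the first successful probe after it. A window whose end is
    None means the service was still down when the samples stopped.
    """
    windows = []
    start = None
    for timestamp, ok in samples:
        if not ok and start is None:
            start = timestamp  # outage begins
        elif ok and start is not None:
            windows.append((start, timestamp))  # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Hypothetical probe log from a staging failover test, one probe per second.
samples = [
    (0, True), (1, True),
    (2, False), (3, False), (4, False),
    (5, True), (6, True),
]

windows = outage_windows(samples)
longest = max(end - start for start, end in windows if end is not None)
print(windows, longest)
```

Run that across a few staging tests and you have an actual number to put in front of the people who sign off the production one.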
This also solves one of the related problems: things that happen rarely don’t get tested or explained, and the documentation drifts out of sync with reality. You should have a set of machines in staging that the new guys can play with, and these should be tested (with the documentation) on a set schedule.
An untested failover pair is a working machine and a hope - nothing more.