Low hanging BCP and DR scenarios
Just before Christmas I had to do some work on new business continuity plans (BCP) and disaster recovery (DR) documents. To help warm up and get myself in the right frame of mind I posted a few easy opening scenarios to Twitter for comment and I’ve decided to collect them back up and post here, in my external memory, for posterity.
Each of these ideas should be considered the most generic and low hanging fruit of your plans. They will not be enough on their own but if you don’t have something covering each of them you may well have some gaps.
Unavailable offices
What would happen if one of your office buildings became unavailable for a few days to a couple of weeks? Do you have a call tree in place to ensure everyone knows not to make the journey in and why? Do you have anything that’s only available in that location? Visual displays whose output you can’t see via any comparable route? A safe with AWS root credentials or physical security tokens? The easiest remedy is to be remote-friendly and practise the occasional enforced remote working week and game day.
Unavailable third party
What would happen if your main hosting provider became unavailable to you? There are a few ways this one could play out. Someone compromises your accounts, deletes everything and closes them (and makes S3 buckets using the same names as your deleted ones, the scoundrels!). A new law comes in preventing you from storing your data in their main jurisdiction. Or your finance team makes a mistake and your systems are shut down due to lack of payment. You should look at this differently to how you’d consider resilience in the cloud: being multi-region doesn’t help if Google shuts you down for lack of contract agreement or late payment.
On a related but smaller scale, pick any of your more focused vendors and run the same thought experiment. What would happen if you lost access to DataDog or PagerDuty for a week? What about Dockerhub? We’ll ignore the loss of Slack as a ‘happy accident’. Maybe one of your employees finally snaps from one too many if err != nil { and, after going on an emoji-fuelled rampage, gets your organisation banned from GitHub. You might have the code locally but which processes have to change? What things need to be migrated immediately and which can be left in a degraded state for a few days? Collecting information on these aspects can help inform the rebuild order and priorities if you do suffer from a BCP level issue.
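One way to capture that information is a small vendor inventory that can be sorted into a rough rebuild order. The sketch below is only an illustration: the Vendor fields, the outage tolerances and the fallbacks are all made-up assumptions, not recommendations for any of the named services.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Vendor:
    """One third party you depend on (all field names are hypothetical)."""
    name: str
    used_for: str
    tolerable_outage_days: int  # how long you could run degraded without it
    fallback: Optional[str] = None  # None means nobody has written one down


def rebuild_order(vendors: List[Vendor]) -> List[Vendor]:
    """Least tolerable outages first; ties broken by name for stable output."""
    return sorted(vendors, key=lambda v: (v.tolerable_outage_days, v.name))


def missing_fallbacks(vendors: List[Vendor]) -> List[str]:
    """Names of vendors that have no documented fallback at all."""
    return [v.name for v in vendors if not v.fallback]


# Example values are invented purely for illustration.
INVENTORY = [
    Vendor("GitHub", "source hosting and code review", 1, "self-hosted git mirror"),
    Vendor("PagerDuty", "on-call alerting", 0),
    Vendor("DataDog", "metrics and dashboards", 3),
    Vendor("Dockerhub", "base image registry", 2, "internal registry mirror"),
]

if __name__ == "__main__":
    for vendor in rebuild_order(INVENTORY):
        print(vendor.tolerable_outage_days, vendor.name, vendor.fallback or "NO FALLBACK")
```

Even a list this crude makes the awkward entries obvious: anything with a short tolerance and no fallback is where the next planning session should start.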
People. It’s always the people.
How many people, in which roles, can you afford to not have available? The scenarios range from an employee’s leaving lunch laying out an entire team with food poisoning (which I’ve had happen) to the positively spun one of the office lottery syndicate winning and all leaving without handing back their YubiKeys. Do you have any roles with special sign off powers for compliance or regulatory reasons? What is your response? There’s nothing wrong with the answer being ‘We can’t afford to have the extra headcount so we’d enter “maintenance mode” until they recover’ but it’s always nicer to think about these things when you’re not in the middle of one of them.
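The "how many people, in which roles" question can be checked mechanically once you have even a rough mapping of people to roles. This is a minimal sketch under assumed data: the names, roles and the thin_roles helper are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def thin_roles(staff: Dict[str, Set[str]], unavailable: Set[str], minimum: int = 1) -> List[str]:
    """Return the roles left with fewer than `minimum` available people."""
    all_roles = {role for roles in staff.values() for role in roles}
    available = Counter(
        role
        for person, roles in staff.items()
        if person not in unavailable
        for role in roles
    )
    # Counter returns 0 for roles nobody available can cover.
    return sorted(role for role in all_roles if available[role] < minimum)


# Invented example data: who can cover which role.
STAFF = {
    "alice": {"release sign-off", "on-call"},
    "bob": {"on-call", "dba"},
    "carol": {"on-call"},
}

if __name__ == "__main__":
    # The leaving lunch takes out alice and bob.
    print(thin_roles(STAFF, {"alice", "bob"}))
```

Running it against a few plausible absences shows which roles depend on a single person, which is usually where the sign off and compliance questions are hiding.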
How about crossing the streams? Both people that work at the small vendor you use to run your CI/CD lava lamps decide they’d rather do something non-technical and walk off into the sunset. What’s your response? Questions like these are useful things to include as part of your procurement process so you always (assuming you do the occasional review) have a deliberate fallback plan.
One of my favourite comments on the original Twitter thread, and related to people, was from tgulfie on Decimation. “Get a 10 sided die, when people arrive on Monday they roll it. When someone rolls a 10, they get an orange arm band. Anyone with an orange arm band is unavailable for any business operations, they can however watch and learn what happens when they are not there.” This could be that mythical “I’ll go back and write documentation for $foo” time we’ve always dreamed of.
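If nobody has a ten sided die to hand, the Monday roll is easy to simulate. This is a small sketch of the rule as quoted; the decimate function and the team names are invented for illustration, and passing in a seeded random.Random is just an assumption to make a run repeatable.

```python
import random
from typing import List, Optional


def decimate(people: List[str], rng: Optional[random.Random] = None, sides: int = 10) -> List[str]:
    """Roll one die per person; anyone rolling the top face gets an arm band."""
    if rng is None:
        rng = random.Random()
    # randint is inclusive at both ends, so this is a fair roll of 1..sides.
    return [person for person in people if rng.randint(1, sides) == sides]


if __name__ == "__main__":
    team = ["alice", "bob", "carol", "dave", "erin"]  # hypothetical team
    arm_bands = decimate(team)
    print("Unavailable this week:", arm_bands or "nobody")
```

With a one in ten chance per person, a small team will often have nobody picked in a given week, so it may be worth running the roll every Monday rather than as a one-off.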
None of these examples are really industry or company specific and they don’t take local context into consideration. Hopefully they are widely applicable enough to help kick start your own thinking and lead to more relevant and important questions for your own platform or company.