Small Business Disaster Recovery Plan
A Short Story
Your computer has been acting sluggish for the past couple of hours. All of a sudden, all you see is a screen saying your data is encrypted and you must send $1000 worth of bitcoin to an anonymous account. You hear on the news that a new virus was recently found, but that government agencies have shut down the command & control servers, so even if you do pay the bitcoin ransom you can’t decrypt your own files.
You look at the external hard drive sitting on your desk, thinking you luckily escaped that unfortunate fate. Or did you? You disconnect the drive and plug it into your laptop to check that all those QuickBooks backups, Word documents, and many other files are still there.
You double-click to open one of the Word docs: “Word was unable to read this document. It may be corrupt.” You try another Word doc, same result. QuickBooks can’t read one of the backup files either. It turns out the malware infected whatever was connected to the computer, including your cloud files, because they too worked just like files stored on the computer.
You fumble with the phone as you tap your IT admin’s contact number, hoping they have a solution. It goes to voicemail. A few minutes later you receive a call back. Your IT admin has been fielding the same situation with several other clients. He says not to worry: he made a backup during the weekend and stored the external drive in the fireproof safe. But he advises you not to touch anything until he can handle it, because the virus is still active.
“That’s great! Wait. The weekend? It’s Thursday. That’s four days of data I have to manually track down and re-enter?” “Unfortunately, yes. But it’s better than losing all of the data, right?” “Yeah, I guess. What about the website?” “It wasn’t affected by the current virus, but there are backups for that too.”
Disaster Averted With Disaster Recovery
Disaster Recovery (DR) isn’t an enjoyable thought because it requires thinking negatively to protect what exists. It’s better thought of as a necessary investment in your business: developing a plan that protects your business processes.
Oftentimes a small business has a limited digital presence, so disaster recovery isn’t much of a thought. The website works and business continues on. However, what happens when you think of the website as a sales channel and that channel suddenly dries up? Do you have alternative sales channels? How quickly can you get the website back online? Sometimes businesses don’t even have time to think of these things.
Now if your website is a core part of your business operations, those and many more questions become even more important. Typically a Disaster Recovery Plan covers all IT elements of a business (internal file sharing servers, workstations, your QuickBooks Desktop data, your website, etc.).
In the story earlier, the business owner was protected enough in that specific scenario of ransomware by having an offline, disconnected backup. Would the fireproof safe be enough if the office were to burn down? How long would it take to recover from that situation assuming the backup was good?
Insurance is only going to cover the physical items, and it’ll still take time to buy, install, and set up new computers, servers, the data, etc.
Seven Tiers Of Disaster Recovery
There’s a popular model to help businesses understand what tier of protection they have from tech disasters. It’s called the “seven tiers of disaster recovery”. The further a business moves up the tiers, the better equipped it is to handle an outage with minimal or no data loss and minimal loss of operations.
Tier 0: No backups
Tier 1: Backups with off-site storage of backups
Tier 2: Backups with standby off-site equipment. Backups sent to other site periodically.
Tier 3: Backups are electronically and automatically transmitted to off-site equipment
Tier 4: Data is transmitted to the off-site location more frequently than the usual backups. This could be every hour or in real time; however, it’s application-dependent.
Tier 5: Mission-critical data in a database is sent in real-time to off-site location
Tier 6: All data is delivered in real-time to off-site location with no or very little data loss
Tier 7: Automatically switching operations over to the off-site location without the need for manual response and intervention.
Surprisingly, a lot of small businesses operate in Tier 0. They don’t have any DR plan in place, no formal method of backups, and no testing. In the story earlier, the business was at about Tier 0.75. They had backups (good start!), including an offline, disconnected backup to protect against active malware, but it was still stored on-site, vulnerable to a fire or some other event.
One thing the seven tiers don’t really capture, though, is that an automatic mirror of data is not a replacement for regular backups and offline backups. Real-time mirroring of data is only one part of disaster recovery. The reason is that if the data is encrypted by malware, that encryption is also mirrored. There are possible options, like maintaining the ability to roll back to any point in the past, but that’s an advanced topic we won’t be covering.
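To make the mirroring point concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the dictionaries stand in for file storage, and the names `primary`, `mirror`, `snapshots`, `sync_mirror`, and `take_snapshot` are hypothetical, not part of any real backup product.

```python
# Toy model: dictionaries stand in for storage locations (an assumption
# made purely for illustration).
primary = {"invoice.docx": "original contents"}
mirror = {}      # real-time copy at an off-site location
snapshots = []   # point-in-time backups, kept offline

def sync_mirror():
    """Make the mirror match the primary exactly, as real-time mirroring does."""
    mirror.clear()
    mirror.update(primary)

def take_snapshot():
    """Store a separate copy of the primary as it exists right now."""
    snapshots.append(dict(primary))

take_snapshot()   # weekend backup
sync_mirror()     # mirror is up to date

# Malware encrypts the primary copy...
primary["invoice.docx"] = "ENCRYPTED"
sync_mirror()     # ...and the mirror faithfully copies the damage.

print(mirror["invoice.docx"])        # ENCRYPTED
print(snapshots[0]["invoice.docx"])  # original contents
```

The mirror protects against losing the primary location, but only the earlier snapshot still holds readable data.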
What about your website?
The same tiers also apply to your website. Your hosting company may perform backups for you, but they might not say exactly how they’re performed or where they’re stored. What happens if, like in a small business office, the backups are stored in the same location as the data, leaving both subject to the same disaster? Further, what obligates them not to do so? It’s unlikely there’s a contractual obligation, and there’s no common third-party certification for such a system either.
Such a setup is a business-level decision. The higher the tier, the greater the cost in technology, engineering, and staff to support it.
This is where you need to apply due diligence to ensure that your website and data are sufficiently protected. At minimum small businesses should strive for Tier 1.
RTO and RPO
Let’s take a step back and simplify the seven tiers a bit to two fundamental objectives from the perspective of a website:
How long can your website be offline before it significantly affects the ability to continue the business?
RTO: Recovery Time Objective
How much website data can be lost before it affects the ability to continue the business?
RPO: Recovery Point Objective
We saw that the tiers largely revolved around how quickly a business can resume IT operations (including any websites) and how much data would be lost in the process.
For some businesses it isn’t a significant issue for a website to be offline for a whole 24 hours. The site probably doesn’t generate much of any new data either. In that case it’s likely just an informational page. The main thing that business owner should be concerned about is having a copy of their website just in case there’s ever an issue with the hosting provider.
However, for other businesses it can be a significant impact to be offline for 24 hours (RTO). Furthermore, for some it can be very impactful to the business to lose hours of data (RPO). In that case a disaster recovery solution needs to be implemented to be within those two metrics.
Some business owners might say an hour is too long, which might be the case. An important factor is that the lower the RTO, the more budget is required. Typically that’ll involve doubling the equipment so a failing unit can be replaced automatically, and probably operating in a geographically different location. An hour might be too long, but is a solution too costly?
This is where the idea of disaster recovery being an investment comes into the discussion. If you’re just relying on the good faith of a hosting provider, is that really a sufficient investment?
3-2-1 Rule
Another simplification of the Seven Tiers is in the strategy of implementation. RTO and RPO are metrics that dictate how the strategy is implemented. The 3-2-1 Rule is a bare minimum disaster recovery strategy:
Have three or more copies of all data
Store the copies on at least two different storage or media types
Have at least one of the copies off-site
In some cases, you’re probably already doing this without even realizing it:
Your phone has pictures (one copy)
It may already be syncing those pictures to your desktop (second copy)
As well as syncing them to a cloud of some sort such as Google Photos or iCloud (third copy).
You effectively have three copies with one of them being off-site (cloud provider)
Should your phone be lost, they’re still on the computer. Should the house suffer a disaster where both the phone and computer are destroyed, they’re still in the cloud provider. It feels pretty good to have at least Tier 1 disaster recovery in place with the 3-2-1 Rule.
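The three conditions of the rule are simple enough to check mechanically. The sketch below is one way to express them in Python; the `BackupCopy` class, its fields, and the media labels are assumptions made up for this example.

```python
from dataclasses import dataclass

@dataclass
class BackupCopy:
    name: str      # where the copy lives (hypothetical label)
    media: str     # storage or media type
    offsite: bool  # True if stored away from the primary location

def satisfies_3_2_1(copies):
    """Return True when the copies meet all three parts of the 3-2-1 Rule."""
    enough_copies = len(copies) >= 3                   # three or more copies
    enough_media = len({c.media for c in copies}) >= 2 # two or more media types
    has_offsite = any(c.offsite for c in copies)       # at least one off-site
    return enough_copies and enough_media and has_offsite

# The phone pictures example from above.
photos = [
    BackupCopy("phone", "phone storage", offsite=False),
    BackupCopy("desktop sync", "desktop hard drive", offsite=False),
    BackupCopy("cloud sync", "cloud storage", offsite=True),
]

print(satisfies_3_2_1(photos))      # True
print(satisfies_3_2_1(photos[:2]))  # False: two copies, none off-site
```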
What is my RTO and RPO?
The most fundamental way to determine your RTO and RPO is to look at how much your business loses by not being online. If it’s an e-commerce site, an easy metric is how much your site makes per time period. Use that as the baseline for what budget is available to invest in minimizing the RTO and RPO. Also, factor in the hidden cost to the business’s image of not being available to consumers.
For other businesses, the right RTO and RPO will depend on what impact the business experiences. For businesses that are time-critical with their own contracted SLAs, it can be costly to be down.
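As a rough illustration of that baseline, the arithmetic can be sketched in a few lines of Python. The revenue figure and outage length are invented for the example, and the calculation assumes sales are spread evenly across the day, which is rarely exactly true. It also leaves out the harder-to-measure cost to business image.

```python
def downtime_cost(revenue_per_day, hours_down):
    """Estimate direct revenue lost during an outage.

    Assumes revenue is spread evenly over 24 hours (a simplification).
    """
    revenue_per_hour = revenue_per_day / 24
    return revenue_per_hour * hours_down

# Hypothetical shop making $2,400 per day that is offline for 12 hours.
print(downtime_cost(2400, 12))  # 1200.0
```

A number like this gives a starting point for how much it is reasonable to spend on reducing RTO and RPO.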
Be aware that cost-cutting and not giving IT the time it needs to practice DR plans periodically will only increase RTO and RPO. A DR plan isn’t a DR plan if it isn’t practiced and refined. By practicing a DR plan the RTO and RPO metrics can be better identified and improved if needed, either with more practice or automating parts of the DR plan as seen in the Seven Tiers.
What Can I Do?
First and foremost is to understand you need IT assistance to do this properly for your whole business. If you have IT acumen that’s great, you can set a lot of this up, but you should verify your setup with IT and delegate to them maintaining and testing of your disaster recovery efforts while you focus elsewhere.
I recommend hiring a local IT partner for your local assets such as workstations, any desktop-bound data such as QuickBooks Desktop, rotating offline backups or showing you how, and so forth. I would also task your IT partner or staff with verifying and reporting to you the status and progress of disaster recovery plans and testing results.
Remote assets such as web hosting can be handled by a different remote IT partner. Your local IT might not have the expertise to handle other systems like Linux or specific cloud environments like AWS. However, they should have enough expertise to know the above DR techniques. More importantly all IT teams should work together to verify DR plans and identify where improvement is needed.
How about my website then?
This is where the RPO metric becomes important. A typical backup runs every night and, if done correctly, is also replicated to an off-site location. However, most hosting providers don’t do this because it’s an additional expense that most consumers aren’t interested in paying for, and it would make the provider appear less competitive on price to the usual consumer.
This generally makes sense, though, because companies that care take matters into their own hands to ensure they have a sufficient disaster recovery process, and verify that those DR plans work for when they’re actually needed.
If there were to be an incident where the hosting company could recover with a local backup, what is the response time of the hosting company? That response time would be added to the RTO. In some cases the hosting provider’s terms state a 24-72 hour response via a support ticket.
Also, be careful of SLAs. Typically they cover only a small part of your hosting fees, and a minuscule percentage of what an outage actually costs your business. You or your staff need to have a personal investment in the continuation of operations that goes beyond what a provider’s SLAs can offer.
Assuming they were able to handle the situation within 12 hours, that would be an RTO of at least 12 hours from when the incident was identified, with an RPO of however long it had been since the last backup, at worst 23 hours or so.
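The scenario above can be worked through with timestamps. This is a minimal sketch using Python’s standard `datetime` module; the dates, times, and function names are hypothetical and chosen only to match the 12-hour and 23-hour figures in the example.

```python
from datetime import datetime

def data_loss_window(last_backup, incident):
    """Time between the last good backup and the incident (the RPO experienced)."""
    return incident - last_backup

def recovery_time(incident_identified, site_restored):
    """Time from identifying the incident to the site being back (the RTO experienced)."""
    return site_restored - incident_identified

# Hypothetical timeline: nightly backup at 2:00 AM, incident at 1:00 AM the
# next night, and the provider has the site restored 12 hours later.
last_backup = datetime(2024, 1, 4, 2, 0)
incident = datetime(2024, 1, 5, 1, 0)
restored = datetime(2024, 1, 5, 13, 0)

print(data_loss_window(last_backup, incident))  # 23:00:00
print(recovery_time(incident, restored))        # 12:00:00
```

Plugging in your own backup schedule and your provider’s stated response time gives a quick estimate of what you would face today.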
If your business can handle that kind of outage, great because then the existing hosting company will work fine. If not, then it’s time to work towards improving the situation.
My business can’t handle that kind of outage
As we saw with the seven tiers of DR, there are several methods to minimize RTO and RPO. Websites will have specific technologies and methods of implementing any of those tiers in a more cost-effective manner.
The typical SMB site, such as e-commerce or an application, won’t be hosted within the business but with a hosting provider. You’ll first need to work with the provider to identify what RTO and RPO your site could experience and then work on solutions to improve those metrics.
Most important is to be aware of what RTO and RPO you’ll be facing so it isn’t a surprise when a DR plan needs to be put into action. Just as important is to have a Business Continuity Plan that works with the DR plan so that staff can respond effectively.
Example DR and Business Continuity Steps
An example of a DR plan would include a subset of items such as:
Assess the situation with the hosting provider to determine what steps need to be taken on your side
Steps that your IT team will take if the hosting provider is offline such as experiencing a regional outage
Restore the database from a backup to a new server
Showing customers a status page with an expected resolution time to help keep support from being flooded with requests
Once the backup is restored, the steps needed to verify that the site is functional and that data integrity is intact
Opening the site to users again before or after manual restoration of data
A timeline of how often to practice the DR plan and identify & improve any deficiencies either with process improvements or additional technology
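One small, automatable piece of the verification step is confirming that a backup file has not changed since it was taken. The sketch below assumes a SHA-256 checksum is recorded at backup time and compared before restoring; the file name and contents are placeholders, and a real DR test would also check that the restored site actually works.

```python
import hashlib
import os
import tempfile

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_intact(path, expected_sha256):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == expected_sha256

# Demo with a throwaway file standing in for a database backup.
with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "site-backup.sql")  # hypothetical name
    with open(backup_path, "wb") as f:
        f.write(b"INSERT INTO orders VALUES (1);\n")
    recorded = sha256_of(backup_path)  # stored when the backup is taken

    intact_before = backup_is_intact(backup_path, recorded)

    with open(backup_path, "ab") as f:  # simulate corruption
        f.write(b"garbage")
    intact_after = backup_is_intact(backup_path, recorded)

print(intact_before, intact_after)  # True False
```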
The parts of a Business Continuity Plan would include a subset of items that are working around or with the DR plan, such as:
Steps staff will take while the DR plan is in effect, such as manually working with Excel files and how to share data that used to be shared through the site
Steps that staff need to take to re-add the new manually created data back to the site and when to officially switch back over to the site
Steps that staff need to take to restore lost data, such as using data from email receipts to re-create new users, orders, their status, etc.
A timeline of how often to practice the Business Continuity Plan so that staff know what to expect instead of disengaging because they can’t get work done or, worse, affecting other staff with negative comments and reactions
Moving Forward
We’ve briefly touched on various subjects in this article that should give you enough insight into what questions you need to ask and what staff you need to put in place to answer them confidently.
If you’d like to discuss your situation and how to improve it, we’d be more than happy to help guide you through the next steps and achieve them.
The key to remember is that this is an investment in your business.