How do you Achieve an Affordable Restore Time Objective (RTO) of Minutes for your Disaster Recovery Plan?:
Answer: Automation and Public Cloud
Almost every business executive and IT team understands the need for Disaster Recovery.
The goal is well understood by both business and IT executives: if an outage occurs, applications and infrastructure shift to another geographic location, with applications offline for only seconds or minutes and no data lost.
This goal, however, is seldom achieved due to the cost and complexity involved in a business continuity or disaster recovery architecture.
This leaves minute- or second-level business continuity to only the largest companies, those able to dedicate large amounts of money and resources to a solution.
Is there really an affordable Restore Time Objective (RTO)?
Contact us today if you’d like an immediate low cost solution.
Many companies, however, tell their executives (or even themselves) that they have a reasonable disaster recovery plan in place to get data and applications back online.
This architecture usually involves data replication through 3rd-party replication software, replication built into storage solutions, or native replication built into most relational databases.
Unfortunately, replicating data to another geographic location only addresses the DATA side of the equation, the Restore Point Objective (RPO).
The second measure of a disaster recovery or business continuity solution is how fast the actual application running in the second location can be accessed by users. This is the Restore Time Objective (RTO). While the data might be safe at a different location, if users cannot access the application for days, you don’t have a solution.
The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity. (Source: Wikipedia)
A data backup plan and restoration of electronic data is essential. Some data is vital to the survival of the business. (Source: Ready.gov)
While many companies spend a lot of resources on RPO solutions like replication, they tend to have only a loose plan for the application side of the equation, the Restore Time Objective (RTO).
RTO plans usually involve an “all hands on deck” effort where application owners work alongside infrastructure teams to deploy the application manually using app deployment tools. But any time a process relies on people “pressing buttons”…mistakes can happen.
Automation using DevOps tools now allows companies to achieve incredibly fast RTOs that align with low RPOs at a very affordable cost.
Instead of expensive active/active Disaster Recovery solutions, the rise of Warm and Cold Disaster Recovery solutions utilizing Public Cloud providers and leveraging automation is bringing Cloud Disaster Recovery to the forefront.
How is Cloud Computing lowering both cost and Restore Time Objective (RTO) for Disaster Recovery?
Let’s start by quickly reviewing the measurements of any Disaster Recovery / Business Continuity solution: Restore Time and Restore Point Objectives.
A Review of Restore Point and Restore Time Objectives
Restore Point and Restore Time Objectives (RPO & RTO) are used by IT executives to measure and set goals for their Disaster Recovery plans. The lower the time for each objective, the better.
Every executive wants these within seconds or minutes. The cost to deliver these goals, however, usually results in IT negotiating with business executives for longer timelines in exchange for lower costs. So let’s look at each individually.
Restore Point Objectives
Restore Point Objectives (RPO) are associated with data. The idea being: if a catastrophic event were to occur, how much data would the company lose before coming back online?
RPO is addressed with replication and snapshotting technology.
Data from the production data center is replicated real-time to a separate geographical location where versions of the data are snapshotted and stored locally on disk.
Since the replication is real time, the question becomes how to go back in time if a corruption occurs at the primary location, because the corruption would be replicated to the disaster recovery location as well.
The answer is some form of versioning technology, such as snapshotting or imaging, which allows the data to be frozen in time.
The more frequent the snapshotting, the lower the Restore Point Objective. And with disk and network transport prices falling, companies are finding it easier to implement strategies that greatly reduce their RPOs.
Consider, for example, a 15-minute RPO where snapshots of the active data are taken every 15 minutes. If a disaster or corruption were to occur, the last “clean” snapshot would be used for the restore. This ensures that no more than 15 minutes of data would be lost by the company. The RPO can be adjusted downward, but doing so uses more disk space and results in more snapshots to manage.
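The arithmetic behind that trade-off is simple enough to sketch. A minimal illustration in Python, where the 24-hour retention window is a hypothetical figure, not something prescribed by any particular tool:

```python
from datetime import timedelta

def worst_case_data_loss(snapshot_interval: timedelta) -> timedelta:
    """Worst-case data loss when restoring from the latest clean snapshot:
    a corruption just before the next snapshot loses almost one full
    interval of data, so the snapshot interval IS the effective RPO."""
    return snapshot_interval

def snapshots_retained(retention: timedelta, interval: timedelta) -> int:
    """Snapshots kept for a given retention window -- halving the
    interval doubles the number of snapshots to store and manage."""
    return int(retention / interval)

# The 15-minute RPO from the scenario above, with a hypothetical
# 24-hour retention window:
rpo = worst_case_data_loss(timedelta(minutes=15))   # at most 15 minutes lost
count = snapshots_retained(timedelta(hours=24), timedelta(minutes=15))  # 96 snapshots
```

Dropping the interval to 5 minutes would triple the snapshot count for the same retention window, which is exactly the disk-space and management cost the text describes.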
Restore Time Objectives
Restore Time Objective (RTO) measures the time it takes for the applications AND data to come back online and be available to customers when a disaster strikes.
The timer can start the minute the application goes offline or when a disaster is declared by operations; either way, it is a race to bring the applications back online for customers.
Active/Active or Hot Disaster Recovery
The fastest RTO times are delivered using an active/active or Hot data center strategy.
In an active/active Disaster Recovery scenario, either the application is running in both data centers simultaneously using synchronous replication, or a secondary data center with provisioned compute, storage, and networking resources sits in standby mode relying on asynchronous replication.
One of the issues with synchronous replication is the geographical limit on how far apart the data centers can be. For this article we will assume customers want geographically dispersed data centers and will be using asynchronous replication.
An Active/Active Disaster Recovery delivers an extremely low RTO.
Because the standby application and infrastructure are already active, the time needed to divert customers to the standby data center is relatively small. While this is a great strategy, an active/active Disaster Recovery solution is very costly.
The same level of infrastructure found in the active data center is provisioned in the DR data center, in addition to doubling the amount of maintenance on that infrastructure and its applications. Data center costs are essentially doubled while the DR facility stands by waiting for a declared disaster. This scenario has only been utilized by companies willing to invest large amounts of money in their disaster recovery solution.
Public Cloud Warm Disaster Recovery
We define a Warm Disaster Recovery architecture as one with an active database and storage within a Public Cloud, receiving replicated data from the active application in the primary data center. Networking, compute, and the application itself are not running, saving the customer money on infrastructure.
The main benefit of this strategy is a much lower monthly infrastructure cost vs. the active/active solution. The database is active but extremely small, just powerful enough to process the incoming data over a much longer period; because the application is not active, there are no latency concerns.
Because the database is active, daily consistency checks can be run against it to ensure there are no corruption issues that would render it unusable when it is needed.
Automation plays a key role in this architecture when a disaster is declared.
Using automated run books built with Ansible, Chef, SaltStack, or other configuration-management tools, the database infrastructure can be quickly scaled to production size while the network and compute infrastructure is built out. The application is then deployed and synchronized with the data. This automated infrastructure build-out and application deployment takes under 60 minutes, with most applications averaging 15 to 45 minutes.
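The essence of such a run book is an ordered sequence of steps that either all succeed or stop cleanly at the first failure. A minimal sketch in Python, where the step names are hypothetical and a real deployment would delegate each step to an Ansible, Chef, or SaltStack module:

```python
# Illustrative only: a declared-disaster run book as an ordered list of
# (step name, description) pairs, executed strictly in sequence.
RUNBOOK = [
    ("scale_database", "Resize the warm database to production capacity"),
    ("provision_network", "Build out networking and load balancers"),
    ("provision_compute", "Launch application servers"),
    ("deploy_application", "Install the application from the artifact repo"),
    ("sync_and_verify", "Point the app at the data and run health checks"),
]

def execute_runbook(runbook, executor):
    """Run each step in order, stopping at the first failure so the
    partial build-out can be inspected rather than compounded."""
    completed = []
    for name, description in runbook:
        if not executor(name):
            return completed, name          # (finished steps, failed step)
        completed.append(name)
    return completed, None                  # all steps succeeded

# Simulated executor in which every step succeeds:
done, failed = execute_runbook(RUNBOOK, lambda step: True)
```

The fail-fast design choice matters in a disaster: a half-built environment that halts with a known failed step is far easier to diagnose than one that keeps layering work on a broken foundation.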
Besides the low cost and quick RTO, another major benefit is that because no people are involved in deploying the application, the risk of human error is largely removed, helping ensure a smooth disaster recovery.
Cold Disaster Recovery
A Cold Disaster Recovery architecture is defined by having a storage target in place to receive active data transfers from the primary data center, with no compute, networking, or database resources active.
A Public Cloud instance running a 3rd-party replication solution will also be needed if you elect not to use native database replication. Snapshotting or imaging still needs to occur to maintain the desired RPO.
The main difference between Cold and Warm DR is that there is no active database in the Cold solution. This saves the infrastructure and licensing costs of running a database server in the Disaster Recovery data center.
The trade-off, however, is a longer RTO, which we will discuss.
As with the Warm solution, the Cold solution relies heavily on prebuilt automation scripts using Ansible, SaltStack, Chef, Puppet, or other automation solutions.
The compute, networking, and application layers are stored in an artifact repository such as Nexus. Once a disaster is declared, the compute, networking, and application files are retrieved from the repository, and the automation tool builds them out against the predefined run book, installing the database and application on the newly provisioned infrastructure. Everything is automated, and the time it takes to bring the entire application back online (the RTO) depends on the number of steps and the speed at which the public cloud provider can deliver the needed infrastructure services.
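Since the RTO is driven by the steps and provider delivery speed, it can be estimated up front. A back-of-the-envelope sketch, where every step name and duration is purely illustrative rather than measured from any real build-out:

```python
# Hypothetical step timings (minutes) for a cold-DR build-out.
SERIAL_STEPS = {
    "fetch_artifacts": 5,       # pull app + infra definitions from the repo
    "install_database": 20,     # provision the database and restore data
    "deploy_application": 15,   # configure and start the application
    "smoke_test": 10,           # verify before directing users over
}
PARALLEL_PROVISIONING = {
    "compute": 12,              # compute and networking build out
    "networking": 8,            # concurrently, so only the slowest
}                               # one contributes to the total

def estimated_rto(serial: dict, parallel: dict) -> int:
    """Total RTO estimate: serial steps add up; parallel steps cost
    only as much as the slowest among them."""
    return sum(serial.values()) + max(parallel.values())

rto_minutes = estimated_rto(SERIAL_STEPS, PARALLEL_PROVISIONING)
```

With these illustrative numbers the estimate lands at just over an hour, comfortably inside the under-two-hours range the next paragraphs describe for Cold DR.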
The RTOs of a Cold DR solution are slower than those of a Warm DR solution, but the monthly DR costs are extremely low: the customer pays only for object storage ($0.02 per GB per month) vs. block storage in a Warm DR ($0.10 per GB per month), and there is no active database server to maintain.
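To make that storage difference concrete, here is the arithmetic for a hypothetical 5 TB protected data set, using the example per-GB rates quoted above (actual provider pricing varies by region and tier):

```python
def monthly_storage_cost(gb: float, per_gb: float) -> float:
    """Flat per-GB monthly storage cost."""
    return gb * per_gb

data_gb = 5_000  # hypothetical 5 TB of protected data

# Cold DR: object storage at the example rate of $0.02/GB/month.
cold_dr = monthly_storage_cost(data_gb, 0.02)   # $100/month
# Warm DR: block storage at the example rate of $0.10/GB/month.
warm_dr = monthly_storage_cost(data_gb, 0.10)   # $500/month
```

A 5x monthly storage gap, before even counting the Warm solution's running database instance, which is the trade a business weighs against the Cold solution's longer RTO.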
This solution works well if customers are OK with using native database replication for their RPO solution. A 3rd-party replication solution may need additional compute resources in place, which could increase the monthly cost slightly. Depending on the application, we generally see an RTO under two hours for a Cold DR solution.
The Rise of Warm and Cold Disaster Recovery: Public Cloud + Automation
Before Public Cloud providers and automation tools, provisioning infrastructure within minutes was very challenging, which left Warm and Cold Disaster Recovery strategies as merely theoretical concepts. Companies such as SunGard built whole business models to tackle the issues of provisioning just-in-time infrastructure for companies, but could never come close to minute-based RTOs. Through automation and Public Cloud providers, however, companies can now deploy their applications and infrastructure in minutes and deliver Restore Time Objectives cost-effectively, in line with business expectations.
Using a Public Cloud platform such as Amazon Web Services, Microsoft Azure, Google Cloud Platform, or any other Public Cloud provider with the proper APIs, combined with an automation/change-management tool such as Ansible, SaltStack, Chef, Puppet, or any other automation tool, any company can deliver a GUARANTEED application Restore Time Objective (RTO) within minutes.
The actual length of time is determined by the application, how complicated it is to provision, and the amount of data.
How Public Cloud Providers using Software Defined Networking have enabled Cold and Warm Disaster Recovery Solutions.
Software defined networking has been a goal for many large companies for some time. VMware spent $1.26 billion on Nicira in 2012 and has spent a lot of time and money bringing this acquisition to market under the NSX brand.
Why is Software Defined Networking (SDN) so big in Disaster Recovery?
Because even when you have a DR solution that brings a VM back online, the networking process is still manual and time-consuming to work through, which inflates the overall Restore Time Objective.
While companies are now implementing Software Defined Networking within their own data centers, Software Defined Networking has existed for years with Public Cloud providers such as Amazon Web Services, Microsoft Azure, Rackspace Public Cloud, Google Cloud Platform, and IBM SoftLayer. These companies have been delivering all three infrastructure components as software: networking, compute, and storage.
When you combine the advantages that Public Cloud Providers deliver for Disaster Recovery the case is fairly compelling:
- Software Defined Networking (plus software defined Compute and Storage)
- Geographically dispersed locations.
- Ability to provision dedicated transport circuits from a primary data center to a Public Cloud data center.
- The falling cost of transport circuits.
- Hourly based billing for Disaster Recovery Infrastructure.
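Hourly billing in particular changes the DR economics: standby infrastructure that an active/active design would run around the clock is only billed while it actually exists. A back-of-the-envelope comparison, with entirely hypothetical instance counts and rates:

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Hypothetical fleet: 10 application servers at $0.50/hour each.
servers, hourly_rate = 10, 0.50

# Active/active: the full standby fleet runs around the clock.
active_active_monthly = servers * hourly_rate * HOURS_PER_MONTH

# Warm/Cold DR: the fleet is billed only while provisioned -- say
# 8 hours per month of DR drills and testing.
drill_hours = 8
on_demand_monthly = servers * hourly_rate * drill_hours
```

Even with these made-up numbers, the always-on standby fleet costs roughly two orders of magnitude more per month than one that exists only during drills, which is why hourly billing belongs on the list above.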
So why hasn’t every company implemented a Public Cloud Disaster Recovery Solution?
One of the main reasons is a lack of knowledge around Automation.
The importance of Automation in a Warm and Cold Disaster Recovery Solution
One of the key advantages of an active/active or Hot Disaster Recovery architecture is that the standby application is already running and is updated with the latest versions and patches.
The infrastructure is the same, and the standby application is updated whenever the main application is updated. In a Warm or Cold DR architecture, the application needs to be deployed (for Cold DR) and scaled (for Warm and Cold DR) along with the infrastructure. Again, the advantage of a Warm or Cold DR solution is that the cost to maintain this architecture is SIGNIFICANTLY less than an active/active architecture.
But how to quickly deploy and update applications to achieve a reasonable RTO?
The answer is with automation.
Using tools such as Ansible, Puppet, Chef, SaltStack, or other automation tools, companies can now treat their infrastructure AND applications as code and automate the deployment of both together.
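The core idea behind "infrastructure as code" can be shown in a few lines: the desired state is declared as data, and an apply step converges the current state toward it. A toy sketch, with hypothetical resource names; tools like Ansible and Puppet apply the same principle at full scale:

```python
# Desired production state, declared as plain data (resource counts
# are hypothetical).
DESIRED = {
    "web_servers": 4,
    "db_replicas": 2,
    "load_balancers": 1,
}

def converge(current: dict, desired: dict) -> dict:
    """Return the actions needed to move `current` to `desired`.
    Running it again once converged yields no actions (idempotence),
    which is what makes automated DR run books safe to re-run."""
    actions = {}
    for resource, want in desired.items():
        have = current.get(resource, 0)
        if have != want:
            actions[resource] = want - have  # +n to create, -n to destroy
    return actions

# A cold-DR scenario: replicas already exist, everything else must be built.
plan = converge({"web_servers": 0, "db_replicas": 2}, DESIRED)
```

Because the declaration and the apply logic are both code, the DR build-out can be version-controlled, reviewed, and tested exactly like the application it protects.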
Warm and Cold Disaster Recovery is a great first time use case for companies who want to start using Public Cloud resources.
PiNimbus is a DevOps consulting company that has worked to help companies consume Cloud through automation and containers. We have built a practice dedicated to helping customers achieve affordable Disaster Recovery using Public Cloud and automation, and have trained companies on how best to implement automation within their environments.
Are you ready for an Affordable Restore Time Objective?
Contact us Now for an Automated Cloud Disaster Recovery Solution!