There are many reasons why companies choose to leverage public cloud resources. One of the most common is that the cloud provider shares the responsibility for the infrastructure with the customer. This means that the consumer of cloud resources can spend more time and effort focusing on the application itself rather than on the underlying hardware. In the case of Infrastructure as a Service, the cloud provider typically owns everything from the virtual infrastructure down to the physical datacenters, including the physical networking, the physical storage, and the hypervisor that your resources are running on. This article covers the key considerations for data protection in Azure.
With that in mind, it is important to understand three things when it comes to service availability in the public cloud:
If we focus specifically on data protection, we’ll find that there are many options available and they all solve for different failure modes.
We do not see the physical disks or storage appliances underneath the covers in Azure. Instead, we are provided with various types of storage (blob, page, table, hot, cold, archive, etc.) and we simply pick what we want. Virtual storage in Azure is provided as a commodity and that’s the beauty of it. We don’t have to care about the storage appliances, storage networks, zoning, and LUNs that make it all happen. It just shows up and we can work with it. That said, what we do need to plan for are the various possible failures of these underlying platforms.
When provisioning a storage account in Azure, we are given many choices regarding storage availability, but most workloads will leverage either Locally Redundant Storage (LRS) or Geo-Redundant Storage (GRS). With LRS, we have three independent copies of the data within the same Azure datacenter. This means that you have a high degree of durability of your data within that Azure region.
So, what happens if there is a region-wide failure? If you provisioned with LRS storage, you’re waiting for that region to come back online. However, if you had provisioned with GRS, you have three independent copies of the data within the same Azure datacenter and three more independent copies in a datacenter in another Azure region. For example, primary copies of GRS storage that live in East US will be paired with secondary copies of the data in West US.
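The difference between the two options can be sketched as a small model. This is only an illustration of the copy counts described above, not an Azure API: the `RedundancyOption` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyOption:
    """Hypothetical model of a storage redundancy choice (not an Azure SDK type)."""
    name: str
    primary_copies: int    # copies kept in the primary region
    secondary_copies: int  # copies kept in the paired secondary region

    @property
    def total_copies(self) -> int:
        return self.primary_copies + self.secondary_copies

    def survives_region_outage(self) -> bool:
        # Data is still held somewhere only if a second region has copies.
        return self.secondary_copies > 0


# Copy counts as described in the text: three locally, plus three remotely for GRS.
LRS = RedundancyOption("LRS", primary_copies=3, secondary_copies=0)
GRS = RedundancyOption("GRS", primary_copies=3, secondary_copies=3)

for option in (LRS, GRS):
    print(option.name, option.total_copies, option.survives_region_outage())
```

Note that "survives" here only means the data still exists in another region. As the next paragraph explains, when that data becomes accessible is a separate question.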
Choosing whether to use LRS or GRS with your Azure storage is a critical decision point, but understanding how and when failover of these resources occurs is just as critical. With GRS storage, Microsoft chooses when your secondary copies in another region come online. This is not an option that the consumer has available on their end. In the event of a catastrophic failure in one region, GRS copies will be brought online in the secondary region by Microsoft. It is also important to note that this simply exposes the storage in that region. It does not automatically re-create virtual networks, virtual machines, Azure SQL instances, etc. It simply makes the storage available for you to re-create those resources yourself in the secondary region.
This level of dependence on Microsoft combined with the high level of post-failover provisioning is unacceptable to many organizations, which is completely understandable and justified. GRS storage is meant to be a piece of the puzzle and not a complete availability solution. Utilizing something like Azure Site Recovery to be able to control the full failover of all your IaaS resources is a much more attractive option but comes with higher management overhead and additional costs.
Azure Site Recovery (ASR) is one of the most important services available for heavy consumers of IaaS services in Azure. ASR started off as a way for customers who are still on-prem to leverage Azure as a Disaster Recovery site and fail their on-prem workloads into the cloud during a disaster. This is still a critical feature and one that is typically far more cost-effective than renting space and equipment in a second on-prem data center. However, with Microsoft’s recent launch of Azure-to-Azure regional replication and failover, ASR has captured an even more important role in IaaS architectures.
With GRS, we are replicating storage, plain and simple. We are not replicating VM configuration, automating DNS record updates, or any of the dozens of other things that must happen during a failover event. With Azure Site Recovery, we can do all of that. When you onboard servers into ASR, you create one or more recovery plans. A recovery plan specifies the order in which servers should be started (e.g. Active Directory first, then SQL, then the IIS servers) and can also take automated script-based actions, or pause to prompt the administrator to take manual action before continuing to the next step.
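To make the idea of a recovery plan concrete, here is a minimal sketch of one as an ordered list of steps. It does not use the ASR API. The `Step` and `RecoveryPlan` classes, the `confirm` callback, and the DNS entry are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """One entry in a hypothetical recovery plan."""
    description: str
    manual: bool = False                          # pause and ask an administrator
    script: Optional[Callable[[], None]] = None   # optional automated action


@dataclass
class RecoveryPlan:
    steps: List[Step] = field(default_factory=list)

    def run(self, confirm: Callable[[str], bool]) -> List[str]:
        """Run steps in order; stop if a manual step is not confirmed."""
        log: List[str] = []
        for step in self.steps:
            if step.manual:
                if not confirm(step.description):
                    log.append(f"halted: {step.description}")
                    break
                log.append(f"confirmed: {step.description}")
            else:
                if step.script is not None:
                    step.script()
                log.append(f"done: {step.description}")
        return log


dns_updates: List[str] = []  # stands in for a real DNS change
plan = RecoveryPlan([
    Step("start Active Directory"),
    Step("start SQL"),
    Step("verify databases are online", manual=True),
    Step("start IIS servers"),
    Step("update DNS records",
         script=lambda: dns_updates.append("app -> secondary region")),
])

log = plan.run(confirm=lambda prompt: True)
print("\n".join(log))
```

The ordering is the point: dependent tiers start only after the tiers they rely on, and a manual checkpoint can stop the plan before later steps run.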
If you feel that GRS storage doesn’t leave enough control in your hands, but running an entire standby replica of your app in a secondary region is too costly, then you should take a very long look at what ASR has to offer.
Most of my customers are worried about protecting against the type of platform-level failure that we just explored. However, most of the time data recovery is needed because someone made a mistake. This is especially true when we’re talking about a platform where end users can alter data, such as on a file share or something like a SharePoint document library. This is where it becomes very important to know what your data recovery capabilities are when adopting a cloud platform.
Virtual Machines running on top of Azure are not backed up natively. If you’re using LRS or GRS, you have some level of protection against an underlying hardware failure, but you do not have any protection against an end user deleting the wrong file, or a server administrator changing or deleting data. In this scenario, backups are just as important as ever. Microsoft has a cloud-native backup solution in Azure Backup that is relatively inexpensive and integrates very nicely with Azure Virtual Machines. This makes it quick and easy to back up your data to cloud storage for fast file-level recovery in the event of an end user or administrative mistake.
Traditional backups, although very important for VMs, may not be as useful for other cloud services. Azure SQL is a great example of this. If you’re leveraging Microsoft’s Database as a Service offering, you already have significant recoverability options built in.
“Full database backups happen weekly, differential database backups generally happen every few hours, and transaction log backups generally happen every 5-10 minutes. The first full backup is scheduled immediately after a database is created. […] After the first full backup, all further backups are scheduled automatically and managed silently in the background.”
When leveraging Azure SQL as your database platform, Microsoft has done all the heavy lifting for you. Sometimes there is still a need for an external tool for longer-term backups or retention due to compliance reasons, but recovery from “oops” moments becomes very fast and easy.
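To see why the schedule quoted above makes "oops" recovery fast, consider how a point-in-time restore is generally assembled from full, differential, and log backups. The sketch below is a simplified model of that general technique, not Azure SQL’s implementation (the service does this for you). The `Backup` type, the `restore_chain` function, the 12-hour differential interval, and the 10-minute log interval are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Backup(NamedTuple):
    kind: str            # "full", "diff", or "log"
    taken_at: datetime


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Pick the backups needed to restore to `target` (simplified model)."""
    ordered = sorted(backups, key=lambda b: b.taken_at)

    fulls = [b for b in ordered if b.kind == "full" and b.taken_at <= target]
    if not fulls:
        raise ValueError("no full backup exists at or before the target time")
    full = fulls[-1]
    chain = [full]

    # The latest differential after that full backup replaces all earlier ones.
    diffs = [b for b in ordered
             if b.kind == "diff" and full.taken_at < b.taken_at <= target]
    base = diffs[-1] if diffs else full
    if diffs:
        chain.append(base)

    # Replay logs after the base, up to the first one that covers the target.
    for b in ordered:
        if b.kind == "log" and b.taken_at > base.taken_at:
            chain.append(b)
            if b.taken_at >= target:
                break
    return chain


start = datetime(2024, 1, 7)  # arbitrary example date
backups = [Backup("full", start)]
backups += [Backup("diff", start + timedelta(hours=h)) for h in (12, 24)]
backups += [Backup("log", start + timedelta(minutes=10 * i)) for i in range(1, 157)]

target = start + timedelta(hours=24, minutes=25)
chain = restore_chain(backups, target)
print([b.kind for b in chain])
```

With these assumed intervals, a restore to a moment 24 hours and 25 minutes after the full backup needs only the full backup, one differential, and three log backups, rather than a replay of every log since the full backup.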
There’s no one-size-fits-all here. It’s important to understand what each cloud service that you’re using offers in terms of data protection, data durability, and recoverability. It’s also important to understand how you can architect your applications around the most common failure modes so that you’re not taken by surprise during an outage and left waiting for the cloud provider to bring services back online. Cloud providers generally do a good job of keeping their services up and running, but it’s still our job to predict where things can go wrong and be ready to react.