When your system goes down, it’s a disaster – plain and simple.
Whether it’s a ransomware attack that’s shut users out of their email or a software glitch that’s knocked over an e-commerce portal, end users don’t care about the cause of an outage.
They expect the systems they rely on to be “always-on.”
As such, preparing for a swift return to business-as-usual after a catastrophic event is a core component of any IT resilience strategy. Still, 4 out of 5 organizations have an availability gap, meaning their IT systems cannot meet recovery time objectives (RTOs) expected by the business.
In fact, IDC’s 2019 State of IT Resilience Survey found that 91% of respondents had experienced a tech-related business disruption in the past two years. Of these, 74% suffered some measurable impact.
Deliberate or otherwise, IT systems go down for many reasons. Although many organizations take every precaution imaginable, it’s near impossible to reduce the threat of downtime to zero.
Meanwhile, every IT team struggles to balance the expectations of an always-available business with revenue-generating projects. A lack of time or resources for IT resilience often means these vital operations don’t get the attention they deserve, leaving the business at risk of a devastating outage.
Below, we’ll explore the four biggest challenges organizations face in keeping systems running and why you should consider offloading IT resilience operations to refocus on digital transformation.
Recovering from Malicious Attacks
In the modern cybersecurity landscape, an organization’s defenses are under constant pressure. Malicious attacks come with such frequency that many IT leaders have conceded that intrusions are inevitable.
In fact, McAfee estimates the rate of ransomware attacks doubled in 2019. Meanwhile, exploits have evolved. As a rule, today’s threats attack across several vectors at once and iterate new versions to circumvent the most recent signatures and patches. It’s now common for ransomware to target, encrypt and delete backups along with source data.
Despite well-intentioned efforts to implement quality security measures, business disruption due to malicious intruders is a prevailing concern. Yet, the ability of many organizations to avoid and mitigate security-related disruption falls short.
When a breach occurs, IT’s ability to respond quickly and effectively is critical. Still, more than half of enterprises admit they can’t keep up with the volume of incidents. Furthermore, 83% feel they lack the processes to deal with successful intrusions. This deficiency comes at a high cost.
As Fortinet reported, the City of Baltimore spent over $18 million on recovery efforts (after refusing to pay a $100,000 ransom on FBI advice) when the RobbinHood ransomware sent critical municipal systems back to the proverbial Stone Age in May 2019.
With 3.5 million cybersecurity roles expected to go unfilled through 2021, the skills shortage means improving in-house capabilities in attack prevention and recovery will be expensive and slow-going.
Mitigating Human Error
As the IT service desk cliché goes, things often go wrong due to the “twelve-inch problem.” That is, the problem is sitting twelve inches from the screen.
In fact, an ITIC survey in 2018 ranked human error as the #1 cause of unplanned downtime. The causes range from misconfiguration of server hardware, operating systems or applications to failure to keep up with the latest security patches and updates.
In some cases, organizations lapse in sending IT personnel for training and certification and suffer the consequences of a corresponding slip in standards or best practices. In others, employees choose not to adhere to set protocols or standards. For example, a developer who commits broken code or an administrator who updates an untested package may cause significant problems through neglect.
“Stupid mistakes” like forgetting to check on data center temperature or disconnecting a critical device at the wrong time also exist in this category.
While human error is impossible to eliminate from the equation, these types of errors most often strike when IT teams lack the time or resources to do better. Human decision-making at the management level also plays a role in causing unplanned downtime. When budgetary constraints or shifting priorities prevent critical upgrades or comprehensive backups and testing, the results can be expensive.
ITIC also found 98% of organizations pegged the cost of an hour of downtime at $150,000 or more. Among these, 31% claimed an hour of system unavailability cost them over $400,000.
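To put those hourly figures in perspective, here is a minimal sketch of the arithmetic. It assumes cost scales linearly with outage length, which is a simplification of our own rather than something the ITIC survey states, and the outage durations are hypothetical examples.

```python
def downtime_cost(hours: float, cost_per_hour: float) -> float:
    """Estimate the cost of an outage, assuming cost grows linearly with time."""
    if hours < 0 or cost_per_hour < 0:
        raise ValueError("hours and cost_per_hour must be non-negative")
    return hours * cost_per_hour


# Hourly figures quoted from the ITIC survey above.
LOW_ESTIMATE = 150_000
HIGH_ESTIMATE = 400_000

# Hypothetical outage lengths, chosen only for illustration.
for hours in (1, 4, 8):
    low = downtime_cost(hours, LOW_ESTIMATE)
    high = downtime_cost(hours, HIGH_ESTIMATE)
    print(f"{hours}h outage: ${low:,.0f} to ${high:,.0f}")
```

Real costs rarely grow in a straight line, since reputational damage and missed transactions tend to compound, so treat the result as a rough floor.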
Overcoming Technical Failures
Technology advances in the last decade have increased the reliability of software, server hardware and its underlying components by leaps and bounds. Nevertheless, networking failures, faulty technology, incompatibility, bugs and even basic system upgrades can all wreak havoc on business operations.
In the short term, server, OS or application outages stop productivity in its tracks. Longer outages have a domino effect, preventing customers, suppliers and business partners from accessing data or applications vital to business processes and transactions.
As such, 80% of organizations require 99.99% uptime for mission-critical systems and hardware. This applies to everyone from major enterprises and institutions to startups and small businesses.
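It helps to translate an uptime target into the downtime it actually permits. The sketch below is a simple illustration that assumes a 365-day year and counts every minute of unavailability against the target. At 99.99%, that leaves roughly 52.6 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime target permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Compare the 99.99% target with two looser targets for context.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Set against the migration outages of 25 hours or more mentioned in this section, a budget of under an hour per year leaves very little room for error.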
Factors like uninterruptible power supply (UPS) failures, performance bottlenecks and hard drive failures in ageing hardware often cause operational downtime. Unplanned outages also occur when IT updates drivers, firmware or applications on legacy infrastructure.
System migrations are also a prevalent cause of downtime, often leaving systems offline for 25 to 100 hours. Operator error often leads to system or network crashes.
While backups have been a staple of any IT strategy for years, the growth in heterogeneous infrastructure environments has made IT management much more complex. Mixing legacy solutions with emerging private and public cloud options only adds to the challenge.
For instance, a major outage that kept Facebook, Instagram and WhatsApp dark for 14 hours in March 2019 was the result of a standard configuration change. This incident followed a spate of unplanned downtime for these sites caused by a “routine test” the previous November.
Many IT teams, now much leaner, spend more of their time on manual maintenance of existing infrastructure. For most of them, responding to outages, identifying quick resolutions and remediating issues to minimize business impact is a challenge.
This diverts IT away from efforts to deliver the revenue-generating projects that line-of-business owners demand.
Defending Against Disaster
Disasters like hurricanes, volcanic eruptions or earthquakes seem like obvious – and scary – threats to IT availability. Less dramatic occurrences like lightning strikes and excessive heat have proven to be more frequent causes of serious downtime.
Events like major floods or server room fires can cause harmful gaps in service availability. In these cases, vital hardware is often damaged beyond repair. Data – without a geographic redundancy plan – is lost forever. Nonetheless, ITIC found just 15% of IT leaders had confidence in their disaster recovery plans.
At the same time, storing all backups in one location creates a single point of failure, exposing the business in the event of a physical disaster.
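As a minimal sketch of how a team might check for this exposure, the snippet below flags a backup inventory in which every copy lives in the same place. The inventory format, backup names and site names are all hypothetical; a real check would pull this data from whatever backup tooling is in use.

```python
def is_single_location(backups: dict) -> bool:
    """Return True if all backup copies share one location.

    `backups` maps a backup name to the site that stores it. An empty
    inventory also returns True, since having no backups is a risk too.
    """
    return len(set(backups.values())) <= 1


# Hypothetical inventories, for illustration only.
all_on_site = {"nightly-db": "hq-server-room", "weekly-files": "hq-server-room"}
geo_redundant = {"nightly-db": "hq-server-room", "weekly-files": "cloud-region-east"}

for name, inventory in (("all_on_site", all_on_site), ("geo_redundant", geo_redundant)):
    status = "single point of failure" if is_single_location(inventory) else "geographically spread"
    print(f"{name}: {status}")
```

A check like this only confirms that copies sit in different places. It says nothing about whether those copies can actually be restored, which is why regular recovery testing still matters.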
Meanwhile, the US Federal Emergency Management Agency (FEMA) estimates that 40% of businesses do not reopen after a natural disaster. Returning to normal after such an event is a challenge, as the specific nature and extent of an emergency are often difficult to predict.
A continuity plan details the processes an organization should follow in case of a major disruption, whether caused by a fire, flood or another disaster. But like any IT resilience activity, getting it right takes time, energy and careful planning.
Get Back to Transforming Your Business
Whether the cause is malicious intent, simple error or the unpredictable whims of nature, being ready when things go wrong isn’t optional. But for many IT teams, managing the day-to-day operations required for thorough, effective backup and recovery is too complex and time-consuming. Using multiple backup and recovery tools only adds to the difficulty.
Working with a third-party service provider like Softchoice helps you offload IT resilience with expert guidance and 24×7 support. That way you can refocus on realizing the ideas that will drive your business forward.
Are you ready to take the next step toward IT resilience?
Protect your critical data and applications with our turnkey Backup as a Service solution. Reinforced by our deep understanding of data center and network technologies and enterprise-grade managed services, this offering helps you resolve issues faster and free IT resources to refocus on business transformation.