Surviving the Storm: Disaster-recovery for SAP HANA Downtime



Even a few hours of downtime can prove devastating for the mission-critical applications running on SAP HANA. If you want to keep your services online, it’s vital that you make a solid disaster recovery plan and be aware of where the risks are coming from, says Todd Doane of SIOS Technology.

No business ever likes the idea of their systems going offline. But when we’re talking about the mission-critical applications typically running on SAP HANA, even a short break in service can quickly become a disaster.

Between refunds to customers, costly IT support, and reputational damage, the bill for even a relatively small outage can quickly balloon. This is supported by research from the Information Technology Industry Council, which found that every hour of downtime can easily cost $300,000. For some larger businesses the cost is even higher, with major organizations putting their bill closer to $4 million per hour.

With downtime bringing such ruinous costs, it’s easy to see the importance of trying to ensure that your SAP HANA systems reach the heights of high availability (HA) – generally agreed on as 99.999% uptime.

However, actually achieving this level of reliability for SAP HANA is complicated. The truth is that most HA solutions for SAP HANA are complex, fragile, arduous to maintain, demanding to support, and difficult to test.

Even with backups in place, the failover process – where the software automatically detects the system is offline and switches to the backup – can be unreliable. There are specific failover best practices for each of the multiple layers in the SAP HANA environment, and there are many areas where it can go wrong. On top of this, the process often requires manual intervention – especially when operating in the cloud.
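The detect-and-switch logic described above can be sketched in a few lines. This is a simplified illustration, not any vendor's implementation: the function names (`run_monitor`, `check_primary`, `promote_secondary`) are hypothetical, and a real clustering solution would add fencing, quorum, and replication-state checks on top of this basic pattern.

```python
# Minimal sketch of an automated failover monitor (all names hypothetical).
# Real HA clustering software layers fencing and quorum logic on top of this.

def run_monitor(check_primary, promote_secondary, max_failures=3):
    """Poll the primary; promote the secondary after repeated failures.

    check_primary: callable returning True while the primary is healthy.
    promote_secondary: callable that performs the takeover.
    Returns the check number at which failover fired, or None if the
    primary stayed healthy for the whole (bounded) run.
    """
    failures = 0
    for attempt in range(1, 11):        # bounded loop for the sketch
        if check_primary():
            failures = 0                # any success resets the counter
        else:
            failures += 1
        if failures >= max_failures:    # avoid failing over on a single blip
            promote_secondary()
            return attempt
    return None

# Simulated outage: the primary goes down on the fourth health check.
health = iter([True, True, True] + [False] * 7)
events = []
when = run_monitor(lambda: next(health), lambda: events.append("takeover"))
print(when, events)  # → 6 ['takeover']
```

Requiring several consecutive failures before promoting the backup is one reason failover can feel slow in practice; tuning that threshold trades false alarms against recovery time.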

Planning for the Worst

There are three layers to SAP: The presentation layer, the application layer, and the database layer. Each of these has its own quirks and requirements when it comes to avoiding downtime.

The presentation layer is easy to protect as the web servers are mostly static, so load balancing two active copies is generally sufficient to ensure reliable availability. The application layer, however, is more difficult, and while SAP has resiliency features baked in, it still needs a third-party clustering solution to implement those features. Similarly, while the database layer has methods to failover to a backup server, these also need to be managed by third-party clustering solutions.
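Why two active copies suffice for the presentation layer becomes clear with a toy model. The sketch below is purely illustrative (the server names and the `healthy` probe are invented): a round-robin balancer simply skips an unhealthy copy, so losing one web server degrades capacity without interrupting service.

```python
import itertools

def balanced_requests(servers, healthy, n):
    """Round-robin n requests across servers, skipping unhealthy ones.

    servers: list of server names (both copies active, as described above).
    healthy: callable(server) -> bool, a stand-in for a real health probe.
    """
    pool = itertools.cycle(servers)
    routed = []
    for _ in range(n):
        for _ in range(len(servers)):   # try each server at most once
            server = next(pool)
            if healthy(server):
                routed.append(server)
                break
        else:
            routed.append(None)         # no healthy server: request fails
    return routed

# Two active copies; web2 drops out, traffic continues on web1 alone.
up = {"web1": True, "web2": False}
print(balanced_requests(["web1", "web2"], lambda s: up[s], 4))
# → ['web1', 'web1', 'web1', 'web1']
```

The application and database layers cannot be handled this way because they hold state; that is why they need clustering software to coordinate which node is authoritative.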

It’s clear, then, that choosing a solution that fits your business needs is a vital piece of the puzzle. However, there are several other elements to consider when minimizing downtime.

For example, it’s vital that we introduce as much resiliency as possible. We want to eliminate every single point of failure in our system – whether that’s our primary and backup servers being located in the same data center, or our disaster recovery plan hinging on a single member of staff who knows all the details and passwords.

Redundancy, therefore, is the watchword for maintaining high availability. This can seem like an incredible waste of money when everything is running smoothly, but if maintaining a solid backup saves you even a few hours of downtime a year, it can easily pay for itself ten times over.

On top of this, any HA or disaster recovery (DR) system needs to be well maintained. There need to be regular backups, and everything needs to be tested on a regular basis to ensure that all our systems and processes work the way they’re meant to.
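The "regular backups, regularly tested" discipline lends itself to automation. Below is a hedged sketch of a backup audit check – the catalog structure, field names, and thresholds are all invented for illustration; in practice this data would come from your backup tooling's own catalog.

```python
from datetime import datetime, timedelta

def audit_backups(backups, now, max_age_hours=24):
    """Flag backups that are stale or failed their last restore test.

    backups: list of dicts with 'name', 'taken_at' (datetime), and
    'restore_test_passed' (bool) -- a stand-in for real catalog data.
    Returns the names of backups that need attention.
    """
    stale_after = timedelta(hours=max_age_hours)
    problems = []
    for b in backups:
        if now - b["taken_at"] > stale_after:
            problems.append(f"{b['name']}: stale")
        elif not b["restore_test_passed"]:
            problems.append(f"{b['name']}: restore test failed")
    return problems

now = datetime(2024, 1, 10, 12, 0)
catalog = [
    {"name": "hana-full", "taken_at": now - timedelta(hours=2),
     "restore_test_passed": True},
    {"name": "hana-log", "taken_at": now - timedelta(hours=30),
     "restore_test_passed": True},
    {"name": "app-config", "taken_at": now - timedelta(hours=5),
     "restore_test_passed": False},
]
print(audit_backups(catalog, now))
# → ['hana-log: stale', 'app-config: restore test failed']
```

Note that the check treats an untested backup as a problem even when it is fresh – which is exactly the fire-extinguisher point.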

After all, we don’t want to find out that the fire extinguisher doesn’t work when we’re in the middle of trying to extinguish a fire.

The Scenarios of System Failure

Another element in maintaining high availability is careful preparation. As part of this, we need to know where the risks come from and what questions we need to consider when trying to alleviate them.

In general, there are four major categories of catastrophic system failure:

  • Hardware issues
  • Software issues
  • Environmental issues
  • Human intervention

Hardware failures are a fact of life. They’re frustrating, but they happen. The key areas we are concerned about involve errors in computing, storage, memory, and the network, and we need to prepare for failure in any one of them.

We must ask ourselves, what is our plan for recovery, and who is responsible for that recovery? Our cloud provider may be responsible for most of the hardware failures, but what if the failure occurs in our corporate network?

Next up are software issues. These are also relatively common, as the entire end-to-end SAP application is incredibly complex. The number of scripts, programs, services, frameworks, utilities, and operating systems involved creates a broad landscape of where a software failure can occur. We must have robust testing, monitoring, and backup recovery plans to ensure maximum uptime.

Environmental issues are the most attention-grabbing – and potentially devastating – source of system failure. Some locations are more prone to environmental issues than others. In Florida, where I live, hurricanes are always a threat in the summer and fall months, but the truth is that nowhere is immune from some kind of natural disaster.

Earthquakes, tornadoes, hurricanes, tsunamis, volcanic eruptions, blizzards, and extremes of heat and cold have all taken data centers out of service. The question, then, is mostly one of disaster recovery. When your data center is down, where is the service restored?

Finally, we have an unexpectedly common source of failure – human beings. Human error can come from several sources, including malicious attacks, mistakes, negligence, incompetence, and simple ignorance. Together, these factors account for up to 75% of data loss.

While we can control for many sources of human error, such as by adding strict security and access controls on our mission-critical systems, there’s no way to eliminate them completely. Published human error rate tables put the failure rate for complicated, non-routine tasks at around 10%, and, unfortunately, managing SAP HANA involves a large number of exactly such tasks. These can easily translate to errors, some of which will be serious enough to take the entire system offline.

The best – and only – way to truly eliminate human error from your system is to remove the humans themselves. Some protection frameworks for SAP allow you to create automations for complex processes, with protection suite packages that manage the automation for you.
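The core idea of such automation – replacing a manual runbook with code – can be sketched simply. This is a generic illustration, not any particular protection suite's API; the step names and actions are invented.

```python
def run_runbook(steps):
    """Execute recovery steps in order, stopping at the first failure.

    steps: list of (name, callable) pairs; each callable returns True on
    success. Encoding the runbook as code removes the per-step human
    decisions where a 10% error rate would otherwise apply.
    Returns (completed_step_names, succeeded).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, False   # halt so an operator can investigate
        completed.append(name)
    return completed, True

# Hypothetical takeover runbook with simulated actions.
log = []
runbook = [
    ("stop replication", lambda: log.append("stopped") or True),
    ("promote secondary", lambda: log.append("promoted") or True),
    ("redirect clients", lambda: log.append("redirected") or True),
]
done, ok = run_runbook(runbook)
print(done, ok)
# → ['stop replication', 'promote secondary', 'redirect clients'] True
```

Stopping at the first failed step, rather than pressing on, is deliberate: a half-executed recovery is often worse than a paused one.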

Considered Countermeasures

If you’re looking to reduce downtime as much as possible, you need solid backup and recovery procedures that should be automated and routinely tested. You need to have clearly defined processes and ensure that you educate your support staff on them.

It’s vital that you’re using the right HA software solution – one that fits the needs of your business and can guarantee the maximum possible uptime. One of the biggest ways to make an impact on downtime is by using top-rate HA clustering technology and support software that automates as many parts of the process as possible.
