How to Speed Up System Recovery Times
In addition to identifying single points of failure and deploying the associated recovery processes and procedures, it is important to consider how long each recovery takes, so that the system is as Highly Available as possible. For example, if the failure of the active network card in a server with dual network cards requires manual operator intervention to configure the redundant card and bring it online, then system uptime will be impacted.
All inherent system redundancy and the associated recovery procedures should be well defined, tested and, ideally, regularly practised as part of a preventative maintenance fire drill, so that system downtime caused by failure is minimised.
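The effect of recovery time on uptime can be estimated with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is illustrative only: the failure interval and both recovery times are hypothetical figures chosen for the dual network card example, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected unplanned downtime per 365-day year, in minutes."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical figures: one network card failure every 10,000 hours.
MTBF_HOURS = 10_000.0

# Assumed recovery times: 2 hours when an operator must configure the
# redundant card by hand, 30 seconds when failover is automatic.
manual = availability(MTBF_HOURS, mttr_hours=2.0)
automatic = availability(MTBF_HOURS, mttr_hours=30 / 3600)

print(f"manual recovery:    {annual_downtime_minutes(manual):.2f} min/year")
print(f"automatic recovery: {annual_downtime_minutes(automatic):.2f} min/year")
```

With the same failure rate, the only variable is the recovery time, which is why automating and rehearsing recovery procedures pays off directly in uptime.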
Preventative Maintenance
It is unrealistic to assume a critical system can continue to operate for a typical 3-5 year life cycle without any maintenance such as hardware replacements, Operating System upgrades, software patching or even a system reboot. Such needs can be treated as planned downtime events, which can also be mitigated using High Availability best practice.
System High Availability can be enhanced by scheduling maintenance windows at quiet times, when any interruption to service will have minimal impact, or, for example, by failing mission-critical services over to redundant servers so that the primary server can be maintained with minimal system downtime.
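The failover approach to planned maintenance can be sketched as follows. This is a minimal in-memory model, not the interface of any particular cluster product: the `Server` class, `move_services` and `maintain_with_failover` are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class Server:
    """Hypothetical model of a cluster node and the services it hosts."""
    name: str
    services: Set[str] = field(default_factory=set)
    in_maintenance: bool = False


def move_services(source: Server, target: Server, names: Set[str]) -> None:
    """Move the named services from source to target."""
    if target.in_maintenance:
        raise RuntimeError(f"{target.name} is in maintenance")
    missing = names - source.services
    if missing:
        raise ValueError(f"{source.name} does not host {sorted(missing)}")
    source.services -= names
    target.services |= names


def maintain_with_failover(
    primary: Server, standby: Server, task: Callable[[Server], None]
) -> None:
    """Fail services over, run the maintenance task, then fail back.

    If the task raises, the primary stays flagged as in maintenance and
    the services remain on the standby until an operator intervenes.
    """
    moved = set(primary.services)           # remember what to fail back
    move_services(primary, standby, moved)  # services now run on standby
    primary.in_maintenance = True
    task(primary)                           # e.g. patch and reboot
    primary.in_maintenance = False
    move_services(standby, primary, moved)  # fail back to the primary


primary = Server("node-a", {"database", "web"})
standby = Server("node-b")
maintain_with_failover(primary, standby, lambda s: print(f"patching {s.name}"))
```

In this model the services are unavailable only for the duration of the two moves, rather than for the whole maintenance task.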
Enhancing High Availability
Enhanced server High Availability is achieved by adding component redundancy and spare system capacity, so that the server can tolerate faults and recover from them quickly.
Enhanced system High Availability, on the other hand, can be achieved by identifying and analysing all other potential risks to system uptime and implementing a High Availability Cluster product to manage all other aspects of failure and recovery.
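The benefit of component redundancy can be approximated with a simple calculation. The sketch below assumes that redundant components fail independently and that failover between them is instantaneous, which real systems only approximate. The 99% single-component figure is hypothetical.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, any one of which
    is sufficient, assuming independent failures: 1 - (1 - a) ** n."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


# Hypothetical component that is available 99% of the time.
single = redundant_availability(0.99, copies=1)
dual = redundant_availability(0.99, copies=2)

print(f"single component: {single:.4%}")
print(f"dual redundant:   {dual:.4%}")
```

Under these assumptions a second component cuts the expected unavailability from 1% to 0.01%. Shared risks such as a common power feed break the independence assumption, which is why the remaining risks to system uptime still need to be analysed.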