How to Avoid a Single Point of Failure
Here we start to explore single points of failure, fault detection and recovery strategies. The key to understanding the risks and potential failures within the system is to identify all the Single Points of Failure (SPOFs) and to implement redundancy, fault detection and recovery strategies to cope with failure.
At a simple level, using only high-quality components and adding redundancy throughout the system will increase the MTBF of the overall system, albeit at a cost.
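The effect of component quality and redundancy can be put into rough numbers. The sketch below uses the standard steady-state availability model, MTBF / (MTBF + MTTR), and assumes that redundant copies fail independently and that one working copy is enough to keep the service up. The MTBF and MTTR figures and all function names are illustrative assumptions, not measurements.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` identical components running in parallel.

    Assumes independent failures and that a single working copy is sufficient.
    """
    return 1.0 - (1.0 - single) ** copies


def annual_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative figures only: 10,000 h between failures, 10 h to repair.
    one = availability(mtbf_hours=10_000, mttr_hours=10)
    for copies in (1, 2, 3):
        a = redundant_availability(one, copies)
        print(f"copies={copies} availability={a:.9f} "
              f"downtime={annual_downtime_hours(a):.4f} h/year")
```

Real components rarely fail fully independently (shared power, shared firmware, shared operators), so figures from a model like this are best read as an upper bound.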
It is critical to consider SPOFs and redundancy strategies across every aspect of the overall operation, including external dependencies, utilities, user access points and its operating environment, rather than just the system itself.
In addition to identifying Single Points of Failure and implementing redundancy strategies to minimise and recover from failure, it is also important to design and implement procedures for failure detection and recovery to minimise unplanned system downtime.
Identifying Single Points of Failure
The following table outlines aspects of the overall system that should be considered together with associated risks and mitigation, fault detection and recovery strategies to minimise system downtime:
Aspect | Consideration | Details
Physical system components | Risks | CPU, memory, I/O cards, power supplies, cooling fans, storage drives
 | Mitigation | Use high-quality enterprise-grade components with high MTBF ratings and redundant components throughout
 | Detection | System monitoring and alerts
 | Recovery considerations | Secure fast access to spare components, documentation and engineering capability
Storage | Risks | Connectivity, storage configuration, data integrity, performance
 | Mitigation | Implement optimal redundant storage design and configuration, e.g. mirroring
 | Detection | System monitoring and alerts
 | Recovery considerations | In-house capability for monitoring, storage configuration and backup and restore management
Software | Risks | Operating system, middleware and applications, licensing, patch management
 | Mitigation | Pre-production testing, controlled updates and upgrades, robust patch management
 | Detection | System software alerts, application monitoring agents, bug reports
 | Recovery considerations | Readily available and up-to-date backups and recovery procedures
User access | Risks | Network components, firewalls, routers, cabling, user authentication
 | Mitigation | Secure physical cabling, optimal network and firewall design, robust user authentication
 | Detection | Access monitoring, network alerts
 | Recovery considerations | On-site network engineering monitoring, schematics and testing capability
External dependencies | Risks | Third-party systems, online services, network configuration, other external system requirements
 | Mitigation | Ensure all external dependencies are known and identified, with service agreements and support
 | Detection | Access monitoring and alerts
 | Recovery considerations | On-site testing, monitoring and engineering capability
Environmental | Risks | Air quality and moisture, temperature, hygiene
 | Mitigation | Identify risks and implement air-conditioning and purifying, dehumidifying, dust collection, regular cleaning and preventative maintenance
 | Detection | Environmental monitoring and alerts
 | Recovery considerations | Consider relocation of system to cleaner and cooler environment
Physical security | Risks | Nearby risks, system console access, secure power and cabling
 | Mitigation | Secure systems away from physical hazards such as water sources and fire risks. Secure physical console access and cabling
 | Detection | Site inspections and reviews
 | Recovery considerations | Consider relocation of system to more secure location
Building security | Risks | Risk of damage from fire, earthquake, weather, terrorism
 | Mitigation | Locate systems away from flood and fire risks and areas prone to natural or other external potential disasters
 | Detection | Environmental reviews, news reports, weather forecasts
 | Recovery considerations | Consider relocation of system to safer location
Utilities | Risks | External power, network, cooling
 | Mitigation | Implement UPS, backup generators, multiple physical networks and providers, robust SLAs with utility providers. Contract with Disaster Recovery (DR) providers
 | Detection | System monitoring and alerts
 | Recovery considerations | Use backup utility providers, implement proven and tested DR procedures
Human error | Risks | Root user negligence
 | Mitigation | Manage and lock down system access for only essential users and procedures. Robust and comprehensive training for administrators and security
 | Detection | System monitoring and alerts, logging, use of tripwire technologies
 | Recovery considerations | Review incident, system password management, use of robust recovery procedures
Sabotage | Risks | Security, system access, port management, encryption, network manipulation, Denial of Service attacks
 | Mitigation | Review system access needs, lock down all network access that is not required, use encryption technologies, lock down security with service provider SLAs. Regular security penetration tests
 | Detection | System and firewall monitoring and alerts
 | Recovery considerations | Identify vulnerabilities leading to incident, lock down as appropriate, review security service providers and procedures
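"System monitoring and alerts" recurs throughout the table as the detection strategy. A minimal sketch of what that can mean in practice is a set of thresholds evaluated against the latest readings. The metric names, limits and the `evaluate` function below are hypothetical; real limits would come from vendor specifications and site policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str     # hypothetical metric name reported by a monitoring agent
    maximum: float  # highest acceptable value
    aspect: str     # table row the metric belongs to


# Assumed example limits, not recommendations.
THRESHOLDS = [
    Threshold("inlet_temp_c", 35.0, "Environmental"),
    Threshold("disk_used_pct", 90.0, "Storage"),
    Threshold("failed_psu_count", 0.0, "Physical system components"),
]


def evaluate(readings: dict[str, float]) -> list[str]:
    """Return one alert string per threshold that is breached or has no reading."""
    alerts = []
    for t in THRESHOLDS:
        value = readings.get(t.metric)
        if value is None:
            # A silent sensor is itself a fault worth alerting on.
            alerts.append(f"[{t.aspect}] {t.metric}: no reading received")
        elif value > t.maximum:
            alerts.append(f"[{t.aspect}] {t.metric}={value} exceeds {t.maximum}")
    return alerts


if __name__ == "__main__":
    for alert in evaluate({"inlet_temp_c": 41.5, "disk_used_pct": 40.0}):
        print(alert)
```

Treating a missing reading as an alert is a deliberate choice in this sketch: otherwise the monitoring path becomes an undetected single point of failure of its own.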
A High Availability Strategy
Identify and analyse all Single Points of Failure
A High Availability strategy should begin by identifying and analysing every Single Point of Failure that the overall system uses or depends on. For each one, define how the risk will be mitigated and which redundancy, fault detection and recovery strategies will minimise server downtime.
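One way to make this identification step systematic is to model the service path as a graph and test what happens when each component is removed. The sketch below is an illustration only: the topology and node names are invented, and the check covers just the path between two endpoints, so the endpoints themselves are not assessed.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal that ignores the `removed` node."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return intermediate nodes whose loss alone disconnects start from goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical topology: two switches and two servers, but only one firewall.
TOPOLOGY = {
    "users": ["firewall"],
    "firewall": ["switch_a", "switch_b"],
    "switch_a": ["server_1", "server_2"],
    "switch_b": ["server_1", "server_2"],
    "server_1": ["storage"],
    "server_2": ["storage"],
}

if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY, "users", "storage"))  # ['firewall']
```

A graph like this only captures components that have been written down, which is why the wider review of utilities, environment and external dependencies still matters.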
Identify components that can be bolstered with System Redundancy
As we have discussed, achieving higher levels of availability through increased redundancy usually results in an exponential increase in costs. A risk-reward and cost-benefit analysis should be undertaken to determine an acceptable trade-off between the minimum expected service availability and cost.
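A cost-benefit exercise of this kind can be sketched as a comparison of the yearly cost of extra redundancy against the yearly cost of the downtime it is expected to prevent. Everything in the example is an assumption made for illustration: the cost figures, the single-component availability, the independence of failures, and the function names.

```python
HOURS_PER_YEAR = 24 * 365


def total_annual_cost(copies, single_availability, cost_per_copy,
                      downtime_cost_per_hour):
    """Yearly hardware cost plus expected yearly cost of downtime.

    Assumes independent failures and that one working copy keeps the service up.
    """
    unavailability = (1.0 - single_availability) ** copies
    downtime_cost = unavailability * HOURS_PER_YEAR * downtime_cost_per_hour
    return copies * cost_per_copy + downtime_cost


def cheapest_redundancy(single_availability, cost_per_copy,
                        downtime_cost_per_hour, max_copies=5):
    """Return the number of copies (1..max_copies) with the lowest total cost."""
    return min(
        range(1, max_copies + 1),
        key=lambda n: total_annual_cost(
            n, single_availability, cost_per_copy, downtime_cost_per_hour
        ),
    )


if __name__ == "__main__":
    # Illustrative inputs: 99% available node, 20,000 per node per year,
    # 5,000 lost per hour of downtime.
    for n in range(1, 4):
        print(n, round(total_annual_cost(n, 0.99, 20_000, 5_000), 2))
    print("cheapest:", cheapest_redundancy(0.99, 20_000, 5_000))
```

With these inputs a second copy pays for itself many times over, while a third adds more cost than it saves. When downtime is cheap, the same calculation favours a single copy.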
Minimise server downtime
Implementing system redundancy and procedures to tolerate and recover from failures is key to minimising both planned and unplanned server downtime. It is also critical to deploy a framework that manages all the failure, detection and recovery scenarios outlined in this document and that is designed to recover automatically.
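To illustrate the idea of automatic recovery in general terms, the sketch below shows a heartbeat-based failover decision between two nodes. It is a teaching example under simplifying assumptions, not a description of how any particular product works: the class, node names and timeout are hypothetical, timestamps are passed in explicitly, and real clusters also need protection against split-brain, which is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    primary: str
    standby: str
    timeout_s: float = 5.0                        # assumed heartbeat timeout
    last_seen: dict = field(default_factory=dict)
    active: str = ""

    def __post_init__(self):
        # The primary runs the service until it is seen to fail.
        self.active = self.primary

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def is_alive(self, node: str, now: float) -> bool:
        seen = self.last_seen.get(node)
        return seen is not None and now - seen <= self.timeout_s

    def check(self, now: float) -> str:
        """Return the node that should run the service at time `now`."""
        if not self.is_alive(self.active, now):
            other = self.standby if self.active == self.primary else self.primary
            if self.is_alive(other, now):
                # Fail over only to a node that is known to be healthy.
                self.active = other
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor("node_a", "node_b")
    monitor.heartbeat("node_a", 0.0)
    monitor.heartbeat("node_b", 0.0)
    monitor.heartbeat("node_b", 6.0)   # node_a has gone quiet
    print(monitor.check(8.0))          # node_b
```

The sketch does not fail back automatically when the original node returns, which avoids moving the service twice for one incident. Whether that is the right policy depends on the deployment.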
Implement a High Availability Cluster framework
This is what High-Availability.com do, and our flagship product, RSF-1, has been providing cost-effective enterprise-grade High Availability technology to thousands of mission critical system deployments across all industries and around the globe for over 25 years.