Why is High Availability important?
IT systems and services are, by their very nature, intended to be continuously accessible and available, not just during the traditional business days and hours, but at all times. There is nothing more frustrating than systems being down for maintenance or just not available when service is required. This is especially true in today's globalised and increasingly online world where unfettered access to IT services is both expected and assumed.
As there is almost always a financial element associated with system downtime, whether through loss of direct revenue, productivity, reputation, or consequential loss, ensuring key systems operate with a high level of availability and accessibility is critical.
As most IT service contracts are governed by Service Level Agreements (SLAs) including provisions for service availability guarantees and penalties, it is important to design, build and run systems that operate within the prescribed service availability requirements.
Enhancing system High Availability is therefore not just aspirational but absolutely essential.
What is the importance of system redundancy?
Redundancy refers to the strategy of using spare components and capacity to keep systems running in the event of individual component failure or resource exhaustion. Examples of such individual component redundancy systems include dual power supplies, mirrored disk drives and Uninterruptible Power Supplies (UPS).
To minimise system downtime, it is critical not only to use the most reliable components, utilities, software and access points available, but also to plan for each and every dependency failing by implementing system redundancies.
Any system component, no matter how reliable, will eventually fail.
System reliability explained
Component reliability is often described by Mean Time Between Failures (MTBF) or Mean Time To Failure (MTTF) metrics, given as an expected amount of time that will elapse before, or between, failures. For example, today's Hard Disk Drives (HDDs) typically carry MTBF ratings in excess of 100,000 hours (around 11 years), whereas Solid State Drives (SSDs) are rated in excess of 1,000,000 hours (around 114 years). An alternative reliability metric, Annual Failure Rate (AFR), refers to the proportion of devices expected to fail within one year; typical figures are around 1.38% for HDDs and 1.05% for SSDs. It is worth noting that failure rates for such electro-mechanical components will vary depending on the operating environment, frequency of power on/off cycles and usage.
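To make the relationship between the two metrics concrete, here is a small sketch that converts between MTBF and AFR under a simple constant-failure-rate (exponential) model. The function names are illustrative, and note that vendor-quoted MTBF ratings and field-measured AFRs often disagree under this naive model, which is precisely why AFR is reported as a separate, empirical metric:

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def mtbf_to_afr(mtbf_hours: float) -> float:
    """Annualised failure probability under a constant-failure-rate
    (exponential) model: AFR = 1 - exp(-hours_per_year / MTBF)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def afr_to_mtbf(afr: float) -> float:
    """Inverse conversion: the MTBF implied by a given AFR."""
    return -HOURS_PER_YEAR / math.log(1 - afr)

print(f"100,000 h MTBF (HDD rating)  -> model AFR {mtbf_to_afr(100_000):.2%}")
print(f"1,000,000 h MTBF (SSD rating) -> model AFR {mtbf_to_afr(1_000_000):.2%}")
print(f"1.38% field AFR -> implied MTBF {afr_to_mtbf(0.0138):,.0f} h")
```

Running this shows the mismatch clearly: a 100,000-hour MTBF implies a model AFR of roughly 8%, far above the ~1.38% observed in the field, underlining that the two figures are measured and used differently.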
To put this failure risk into perspective, consider a mid-range computer facility with a storage farm of 200 disk drives. The HDD AFR figure suggests that, on average, two to three drives will fail each year. There is therefore a chance, however small, that the system's root disk and its mirror could fail simultaneously within the first year of operation. A single root drive failure should allow services to continue seamlessly on the mirrored drive, but if both the root and mirror drives failed simultaneously, the result would be considerable service downtime. Whilst this scenario might be put down to sheer bad luck, it should be treated as a foreseeable event that can be prevented, or at least quickly recovered from, with comprehensive recovery procedures.
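The arithmetic behind this scenario can be sketched as follows, using the AFR figure from the text. The calculation assumes failures are independent, and the pairwise figure is a crude upper bound since it ignores repair: with prompt drive replacement, the real window in which a second failure is dangerous is far shorter than a year:

```python
AFR_HDD = 0.0138   # annual failure rate per HDD (figure from the text)
DRIVES = 200       # size of the storage farm

# Expected number of drive failures across the farm per year.
expected_failures = DRIVES * AFR_HDD  # about 2.76 drives

# Probability that at least one drive in the farm fails in a year.
p_at_least_one = 1 - (1 - AFR_HDD) ** DRIVES

# Crude upper bound on a specific root/mirror pair both failing
# within the same year, assuming independent failures.
p_pair_same_year = AFR_HDD ** 2

print(f"Expected failures/year:        {expected_failures:.2f}")
print(f"P(at least one failure/year):  {p_at_least_one:.1%}")
print(f"P(root and mirror both fail):  {p_pair_same_year:.5f}")
```

Even the crude pairwise bound (roughly 1 in 5,000 per year) is small but clearly non-zero, which is the article's point: it is a foreseeable event worth planning for.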
What is a reliability and redundancy strategy?
A redundancy strategy should also include an element of remoteness to further reduce the risk of failure. For example, where possible, deploying redundant systems that do not share the same power source, network access or physical location can enhance overall system MTBF.
Adding redundancy inevitably adds cost, and although some redundant components can also be used to enhance capability and performance, the greater the availability required, the greater the cost, which can grow exponentially.
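The availability-versus-cost trade-off can be illustrated with the standard formula for components in parallel. This sketch assumes independent failures, which is exactly what the element of remoteness described above is intended to make plausible, since components sharing a power source or location tend to fail together:

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant components in parallel, assuming
    independent failures: 1 - (1 - a)^n."""
    return 1 - (1 - a) ** n

# Each additional redundant component buys (roughly) two more "nines"
# here, but at the cost of another full component: the cost of
# availability rises steeply.
for n in range(1, 4):
    print(f"{n} component(s) at 99% each -> {parallel_availability(0.99, n):.6f}")
```

With 99%-available components, one component gives 99% availability, two give roughly 99.99%, and three roughly 99.9999%; in practice, correlated failures mean real systems fall short of these idealised figures.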