Creating A Cluster Application or Service Management
In addition to automatic cluster service management, RSF-1 Cluster Services can be controlled directly by an operator through a comprehensive CLI, GUI and API. This allows the Cluster Manager or operator to add, remove, edit, stop, start and move services around the cluster as required. This is particularly useful when planning maintenance schedules and upgrades, and for load-balancing exercises.
RSF-1 Application Service Monitoring
In addition to cluster server heartbeats, RSF-1 provides a service monitoring framework that allows regular health-checks of application services to be performed. The health-checks are application service specific and can automatically trigger the stopping and restarting of application services running on the cluster.
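The monitoring behaviour described above can be illustrated with a minimal sketch. This is not RSF-1's implementation or API: the class name `ServiceMonitor`, the `check`, `stop` and `start` callables, and the `max_failures` threshold (the number of consecutive failed health-checks tolerated before a restart) are all assumptions made for illustration.

```python
from typing import Callable


class ServiceMonitor:
    """Hypothetical per-service monitor: runs a health-check and restarts on repeated failure."""

    def __init__(self, check: Callable[[], bool], stop: Callable[[], None],
                 start: Callable[[], None], max_failures: int = 3):
        self.check = check                # service-specific health-check
        self.stop = stop                  # service-specific stop action
        self.start = start                # service-specific start action
        self.max_failures = max_failures  # assumed threshold, not an RSF-1 setting
        self.failures = 0                 # consecutive failed checks so far
        self.restarts = 0                 # restarts triggered so far

    def poll(self) -> bool:
        """Run one health-check; return True if it triggered a restart."""
        try:
            healthy = bool(self.check())
        except Exception:
            # A check that raises is treated the same as a failed check.
            healthy = False
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.max_failures:
            return False
        # Too many consecutive failures: stop and restart the service.
        self.stop()
        self.start()
        self.failures = 0
        self.restarts += 1
        return True
```

In practice a scheduler would call `poll()` at a regular interval for each monitored service. Requiring several consecutive failures before acting is one possible design choice to avoid restarting a service on a single transient error.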
RSF-1 Service User Access
Any number of virtual network addresses can be associated with each application service for user access. This ensures continuity of user access when a service fails over between cluster server nodes.
RSF-1 Application Service Failover
If a cluster server node fails, all heartbeats from that node cease and the other cluster server nodes detect the loss immediately. For a node to be considered failed, every heartbeat channel and connection to it must have failed. Even then, a short timeout must pass before the detected failure is considered fatal. If new heartbeats are received during the timeout, service continues without incident. This avoids false failovers where a server node has temporarily become overloaded and frozen but quickly recovered.
Where a server failure is deemed fatal, that is, the timeout has passed without new heartbeats being received, the surviving cluster server nodes restart the services that have gone offline, provided those services are enabled and configured in automatic mode. Each service to be restarted initiates its own independent failover event.
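The detection and restart rules above can be sketched as a small state calculation. This is an illustrative model only, not RSF-1 code: the `PeerMonitor` class, the channel names, the `interval` and `timeout` values, and the dictionary keys `enabled` and `mode` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PeerMonitor:
    """Tracks heartbeats from one peer node across several channels (hypothetical model)."""
    channels: tuple          # illustrative channel names, e.g. ("net", "serial")
    interval: float          # expected gap between heartbeats, in seconds
    timeout: float           # total silence after which the failure is treated as fatal
    started_at: float = 0.0  # reference time for channels that were never heard
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, channel: str, now: float) -> None:
        self.last_seen[channel] = now

    def state(self, now: float) -> str:
        # One live channel is enough to keep the peer up, so only the newest
        # heartbeat across all channels matters.
        newest = max(self.last_seen.get(c, self.started_at) for c in self.channels)
        silence = now - newest
        if silence <= self.interval:
            return "up"
        if silence < self.timeout:
            return "suspect"  # all channels silent, but the timeout has not yet passed
        return "failed"       # timeout passed with no new heartbeats


def services_to_restart(services: list, peer_state: str) -> list:
    """Names of services a surviving node would restart, each as its own failover event."""
    if peer_state != "failed":
        return []
    return [s["name"] for s in services
            if s["enabled"] and s["mode"] == "automatic"]
```

A heartbeat received while the peer is in the "suspect" state simply returns it to "up", which corresponds to the case where a briefly frozen node recovers before the timeout.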
In the rare case where a failed server recovers after the timeout has passed, newly received heartbeats carrying conflicting state information, or any attempt to access underlying storage that has since been reserved elsewhere, will cause the recovered system to panic and reboot. Once back up and operational, the rebooted server will receive new heartbeats from the running cluster server nodes and enter standby mode for the services it is configured to run.
RSF-1 Application Service Failback
A key element of RSF-1's design is that application services should not automatically fail back to a recovered cluster server node, but should instead be moved manually by the system administrator at a convenient time. This ensures higher levels of uptime and predictable service availability.
This matters when a server exhibits intermittent failures, perhaps due to hardware issues or operating system misconfiguration that require troubleshooting and repair. If automatic failback were available and enabled, highly available services could end up constantly stopping and restarting within the cluster, resulting in avoidable downtime.
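The effect of an intermittently failing server can be illustrated with a toy model that counts how many times a service is moved, each move implying a stop and a restart. The function name, the boolean history of the preferred node's availability, and the `auto_failback` flag are assumptions for illustration; RSF-1 itself does not fail back automatically.

```python
def count_service_moves(preferred_node_up, auto_failback):
    """Count service moves given the up/down history of its preferred node (toy model)."""
    on_preferred = True  # the service starts on its preferred node
    moves = 0
    for up in preferred_node_up:
        if on_preferred and not up:
            # Preferred node failed: fail over to a surviving node.
            on_preferred = False
            moves += 1
        elif not on_preferred and up and auto_failback:
            # Preferred node recovered: hypothetical automatic failback.
            on_preferred = True
            moves += 1
    return moves


# A flapping node: up, down, up, down, up, down, up.
history = [True, False, True, False, True, False, True]
print(count_service_moves(history, auto_failback=True))   # 6
print(count_service_moves(history, auto_failback=False))  # 1
```

With manual failback the service moves once and then stays on the surviving node until the administrator decides the repaired server is trustworthy again.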
RSF-1 High Availability Cluster Principles
This section describes how RSF-1 implements the key principles and concepts of High Availability Cluster design that are necessary to provide enterprise-grade High Availability services. High-Availability.com pioneered many of these principles, and RSF-1 has continued to evolve significantly through successive technology waves, driven by constant innovation and changing customer needs.
RSF-1 was designed to be as Unix-like and as simple to operate as possible, bringing familiar concepts, configurability and manageability to capable system administrators, whilst keeping pace with technology innovations and a growing range of use cases.
A significant use case that High-Availability.com has also pioneered and worked with extensively is the provision of highly available storage services using ZFS, which is covered in the next section.