How Does Cluster Management Work?
Each service configured as an RSF-1 Highly Available service has its own configuration files that define how the service is started and shut down. These files are available to each cluster server node enabled to run that service. Designed to be similar to, and compatible with, the standard Unix rc.d structure, the configuration consists of a numerical sequence of startup scripts and associated shutdown scripts. These cover the acquisition of the underlying storage and filesystems that the service requires, application and service specific instructions, and virtual IP address configuration management for user network access.
If the execution of a service startup sequence fails, RSF-1 runs the shutdown sequence to ensure that all startup steps have been reversed, including releasing the associated network address(es) and underlying storage, and the service is marked as broken on that server. Administrator intervention is then required to resolve the issue on that server. If failover is enabled on other cluster server nodes, RSF-1 will subsequently attempt to start the failed service elsewhere in the cluster.
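The start, roll back and fail over behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not RSF-1 code: the Service class, the step names and numbers, the node names and the start_with_failover helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple

# (sequence number, description, action). All names here are hypothetical.
Step = Tuple[int, str, Callable[[str], bool]]


@dataclass
class Service:
    name: str
    start_steps: List[Step]
    stop_steps: List[Step]
    broken_on: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)


def stop_service(service: Service, node: str) -> None:
    """Run every shutdown step in numerical order."""
    for seq, desc, action in sorted(service.stop_steps, key=lambda s: s[0]):
        service.log.append(f"{node}: stop {seq:02d} {desc}")
        action(node)


def start_service(service: Service, node: str) -> bool:
    """Run startup steps in numerical order; on failure, shut down and mark broken."""
    for seq, desc, action in sorted(service.start_steps, key=lambda s: s[0]):
        service.log.append(f"{node}: start {seq:02d} {desc}")
        if not action(node):
            stop_service(service, node)      # reverse whatever was started
            service.broken_on.add(node)      # needs administrator attention
            return False
    return True


def start_with_failover(service: Service, nodes: List[str]) -> Optional[str]:
    """Try each eligible node in turn; return the node now running the service."""
    for node in nodes:
        if node not in service.broken_on and start_service(service, node):
            return node
    return None


def demo() -> Tuple[Service, Optional[str]]:
    ok = lambda node: True
    app_start = lambda node: node != "node-a"   # simulate a failure on node-a
    svc = Service(
        name="web",
        start_steps=[(10, "import storage", ok),
                     (20, "start application", app_start),
                     (30, "plumb virtual IP", ok)],
        stop_steps=[(10, "unplumb virtual IP", ok),
                    (20, "stop application", ok),
                    (30, "release storage", ok)],
    )
    return svc, start_with_failover(svc, ["node-a", "node-b"])
```

In this sketch, calling demo() leaves the service marked as broken on node-a, with its shutdown steps logged, and running on node-b.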
RSF-1 Cluster Heartbeats
Heartbeats are low-level communications between cluster server nodes used to determine server health and responsiveness. For RSF-1, the heartbeat takes the form of a sophisticated encrypted and checksummed packet of stateful information that describes the entire cluster status. As well as providing system health checks, the exchange of status information also ensures cluster integrity.
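To illustrate the idea of a checksummed status packet, the sketch below wraps a small status record in a CRC-32 checksum and rejects it if it has been damaged in transit. The real RSF-1 packet format is proprietary and is not reproduced here: the field names, the JSON encoding and the choice of CRC-32 are assumptions, and encryption is left out entirely.

```python
import json
import zlib
from typing import Optional


def encode_heartbeat(sender: str, sequence: int, services: dict) -> bytes:
    """Serialise a status record and prefix it with a 4-byte CRC-32 checksum."""
    payload = json.dumps(
        {"sender": sender, "seq": sequence, "services": services},
        sort_keys=True,
    ).encode("utf-8")
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def decode_heartbeat(packet: bytes) -> Optional[dict]:
    """Return the status record, or None if the checksum does not match."""
    if len(packet) < 4:
        return None
    expected = int.from_bytes(packet[:4], "big")
    payload = packet[4:]
    if zlib.crc32(payload) != expected:
        return None                      # corrupted packet: ignore it
    return json.loads(payload.decode("utf-8"))
```

A receiver built this way simply ignores a damaged packet, so a corrupted heartbeat is treated as a missed one and never as valid cluster state.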
RSF-1 was designed not to rely on any single component for high availability provision, and this includes the design and implementation of heartbeats. Most High Availability systems use only IP-based network media for heartbeats and are therefore wholly dependent on network connectivity for cluster communications. From its first release, RSF-1 was designed to also support RS-232C serial heartbeats, which use a low-level proprietary protocol and are therefore not dependent on any IP networking capability. A further non-IP based shared disk heartbeat is also available, thus providing three independent heartbeat channels.
Deploying non-IP based heartbeat channels also provides further protection against split-brain scenarios, and against the downtime associated with guarding against them, because the cluster continues to operate with healthy heartbeats even when all network connectivity is lost.
For optimum high availability operation, High-Availability.com therefore recommends that at least two independent heartbeat connections are implemented per channel and that at least two different forms of heartbeat channel are used. For example, two separate network heartbeats and two separate shared-disk heartbeats.
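The reasoning behind this recommendation can be shown with a small model: a peer is only considered unreachable when every heartbeat connection has gone quiet, so a second form of channel keeps the cluster informed when the network fails. The Heartbeat record, the 5 second timeout and both helper functions are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Heartbeat:
    kind: str         # "network", "serial" or "disk"
    name: str         # hypothetical connection label
    last_seen: float  # time the last valid heartbeat arrived, in seconds


def peer_is_alive(heartbeats: List[Heartbeat], now: float,
                  timeout: float = 5.0) -> bool:
    """The peer is alive if any single connection is still delivering heartbeats."""
    return any(now - hb.last_seen <= timeout for hb in heartbeats)


def meets_recommendation(heartbeats: List[Heartbeat]) -> bool:
    """At least two forms of channel, each with at least two connections."""
    per_kind = Counter(hb.kind for hb in heartbeats)
    return len(per_kind) >= 2 and all(n >= 2 for n in per_kind.values())


# Total network outage: both network heartbeats are stale, disk ones are fresh.
mixed = [
    Heartbeat("network", "net0", last_seen=0.0),
    Heartbeat("network", "net1", last_seen=0.0),
    Heartbeat("disk", "disk0", last_seen=99.0),
    Heartbeat("disk", "disk1", last_seen=99.0),
]
network_only = mixed[:2]
```

With the mixed configuration the peer is still seen as alive at time 100 despite the network outage, whereas the network-only configuration would wrongly conclude that the peer had failed.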
Storage Subsystem and Data Protection
On service startup or service failover, prior to attempting to mount underlying storage subsystems and filesystems, RSF-1 uses a sophisticated failsafe disk-fencing mechanism to lock storage devices exclusively to the master node for that service. Using low-level SCSI reservations, this ensures that the storage devices cannot be accessed by any other cluster server node. If any other server node attempts to access the reserved drives, that node will panic immediately and reboot. This ensures that the underlying storage and data are fully protected against corruption.
On normal service shutdown, any associated storage device reservations are freed, allowing the service to be restarted freely elsewhere in the cluster.
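The fencing behaviour can be pictured with a simple in-memory model of an exclusive reservation. This is a conceptual sketch only and issues no SCSI commands: the FencedDisk class, the NodePanic exception and the node and disk names are all assumptions, and treating a competing reservation attempt as a forbidden access is a simplification made for the example.

```python
from typing import Optional


class NodePanic(Exception):
    """Stands in for the panic and reboot of a node that touches a fenced disk."""


class FencedDisk:
    def __init__(self, name: str) -> None:
        self.name = name
        self.holder: Optional[str] = None   # node holding the reservation

    def _check(self, node: str) -> None:
        if self.holder is not None and self.holder != node:
            raise NodePanic(f"{node} touched {self.name} reserved by {self.holder}")

    def reserve(self, node: str) -> None:
        """Taken on service startup, before any filesystem is mounted."""
        self._check(node)
        self.holder = node

    def access(self, node: str) -> str:
        """Any I/O from a node that does not hold the reservation panics it."""
        self._check(node)
        return f"{node} accessed {self.name}"

    def release(self, node: str) -> None:
        """Freed on normal service shutdown so another node can start the service."""
        if self.holder == node:
            self.holder = None
```

For example, once node-a has reserved a disk, an access from node-b raises NodePanic, and after node-a releases the disk node-b is free to reserve it.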