Failover
Manual failover
When a manual failover is performed, pool data is automatically synchronised between nodes before the service is moved. To do this the following steps are undertaken:
- Disconnect the VIP(s) so clients are temporarily prevented from accessing the pool.
- Take a snapshot of the pool and apply it to the remote node to synchronise data across the cluster.
- Complete the service stop on the current node.
- Start the service on the other node and enable the VIP so clients regain access to the pool.
- Data synchronisation then resumes between the new active and passive nodes.
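The sequence above can be summarised as a minimal sketch in Python. This is purely illustrative: the function names, node names and VIP address are hypothetical placeholders that mirror the documented order of operations, not product commands or API calls.

```python
"""Illustrative sketch of the manual failover sequence described above.
All names are hypothetical placeholders, not part of the product's API."""

def disable_vip(vip: str) -> None:
    print(f"[1] VIP {vip} disabled - client access to the pool suspended")

def snapshot_and_apply(pool: str, remote: str) -> None:
    print(f"[2] snapshot of {pool} taken and applied on {remote}")

def stop_service(node: str) -> None:
    print(f"[3] service stopped on {node}")

def start_service(node: str, vip: str) -> None:
    print(f"[4] service started on {node}, VIP {vip} enabled")

def resume_sync(active: str, passive: str) -> None:
    print(f"[5] snapshot synchronisation resumed: {active} -> {passive}")

def manual_failover(pool: str, vip: str, current: str, target: str) -> None:
    # Mirrors the five documented steps, in order.
    disable_vip(vip)
    snapshot_and_apply(pool, target)
    stop_service(current)
    start_service(target, vip)
    resume_sync(active=target, passive=current)

if __name__ == "__main__":
    manual_failover(pool="SNPOOL", vip="192.168.10.50",
                    current="node-a", target="node-b")
```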
When an automatic failover occurs
Should an automatic failover from the active node to the passive node occur (i.e. when the active node crashes), the passive node will take over the service using its local copy of the data. The service then becomes locked to the now active node and data synchronisation is suspended.
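As a rough model of this behaviour, the sketch below (Python, purely illustrative, not product code) shows the surviving node taking over with its local copy, after which the service is locked to it and synchronisation is suspended. The data structures and function are assumptions made only to restate the description above.

```python
"""Assumed model of the automatic failover behaviour described above."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    last_applied_snapshot: str   # point in time the local pool copy reflects

@dataclass
class ClusterState:
    active: Node
    passive: Node
    sync_suspended: bool = False
    locked_to: Optional[str] = None

def automatic_failover(state: ClusterState) -> ClusterState:
    """The passive node takes over using its local (possibly stale) copy of
    the pool; the service is then locked to it and synchronisation stops."""
    new_active, failed = state.passive, state.active
    return ClusterState(
        active=new_active,
        passive=failed,
        sync_suspended=True,        # no further snapshots are pulled or applied
        locked_to=new_active.name,  # service stays here until sync is resumed
    )

state = ClusterState(active=Node("node-a", "12:00"), passive=Node("node-b", "12:00"))
print(automatic_failover(state))
```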
To understand why a service becomes locked after an automatic failover, consider the synchronisation state of a cluster:
```mermaid
gantt
    title Active and Passive nodes in sync
    dateFormat HH:mm
    axisFormat %H
    %% Get rid of the red line
    todayMarker off
    tickInterval 1hour
    Snapshots on Passive Node : passive, 00:00, 02:00
    Latest snapshot applied : milestone, m1, 02:00, 0m
    Snapshots on Active Node : active, 00:05, 2h
    Latest snapshot : milestone, m2, 02:00, 0m
    Current time : milestone, m3, 02:05, 0m
```
This diagram illustrates that in normal operation the passive node will be slightly behind the active node in terms of pool data synchronisation. A snapshot applied to the passive node will bring it in sync with the active node at the time the snapshot was taken; however, the active node will continue writing data, which will not be synchronised to the passive node until the next snapshot is taken and applied.
As snapshots are taken and applied at regular cyclic intervals, the active node will always be slightly ahead of the passive node in terms of data, up to a maximum delta of the snapshot interval. For example, with an interval of 15 minutes and a snapshot pulled and applied on the passive node at 12:00, by 12:14 the passive node will be 14 minutes behind the active node. If the active node crashes at this point, the passive node will take over the service using the data synchronised at 12:00, so there is up to a possible 14 minutes of unsynchronised data on the failed node. For this reason data synchronisation is suspended and the service is locked to the new active node to prevent a further failover.
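As a worked example of this arithmetic, the short sketch below (purely illustrative, not product code) computes the worst-case window of unsynchronised writes for a given snapshot interval, using the 12:00 / 12:14 figures from the text; the function name and dates are hypothetical.

```python
from datetime import datetime, timedelta

def unsynchronised_window(last_snapshot: datetime, crash_time: datetime,
                          interval: timedelta) -> timedelta:
    """Span of writes on the active node that never reached the passive node.
    In normal operation the gap cannot exceed the snapshot interval, because
    a newer snapshot would already have been pulled and applied."""
    gap = crash_time - last_snapshot
    return min(gap, interval)

# 15-minute interval, snapshot applied at 12:00, active node crashes at 12:14.
last = datetime(2024, 1, 1, 12, 0)    # date is arbitrary
crash = datetime(2024, 1, 1, 12, 14)
print(unsynchronised_window(last, crash, timedelta(minutes=15)))  # 0:14:00
```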
This is highlighted in the GUI cluster health as pool synchronisation "Suspended for <pool-name>":
Resuming synchronisation after an automatic failover
Synchronisation is suspended so that, once a failed node is brought back online but before any data is overwritten by the resynchronisation process, there is an opportunity to retrieve data from the original pool in a stable environment.
In the above example this would encompass any data written between 12:00 and 12:14 on the active node before it crashed. Note that when a failed node is brought back online, any clustered pools will be imported so that data retrieval can be performed.
Finally, once any required data has been retrieved from the pool on the passive node, instruct the cluster to restart synchronisation and clear any errors by clicking RESUME SYNC, located on the dashboard in the services section or in Settings -> Shared Nothing.
This will bring up a dialog showing all suspended pools. Select CLEAR LOCK for each pool for which synchronisation should be resumed and click OK to confirm:
In this example only SNPOOL was resumed, and the dashboard will be updated to reflect this:
Once all pools have been resumed, the GUI health will return to normal: