Terminology
Services
In an RSF-1 cluster, a service refers to a ZFS pool that is managed by the cluster. The cluster may have one or more services under its control, i.e. multiple pools. Furthermore, an individual service may consist of more than one pool, referred to as a pool group, where actions performed on that service are performed on all pools in the group.
A service instance is the combination of a service and a cluster node that that service is eligible to run on. For example, in a 2-node cluster each service will be configured to have two available instances - one on each node in the cluster. Only one instance of a service will be active at any one time.
Modes (automatic/manual)
Each service instance has a mode setting of either automatic or manual. The mode of a service is specific to each node in the cluster, so a service can be manual on one node and automatic on another. The meanings of the modes are:
AUTOMATIC
Automatic mode means the service instance will be automatically started when all of the following requirements are satisfied:
- The service instance is in the stopped state
- The service instance is not blocked
- No other instance of this service is in an active state
MANUAL
Manual mode means the service instance will never be started automatically on that node.
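The automatic-start conditions above can be sketched as a single predicate. This is a minimal illustration, not RSF-1 code: the `ServiceInstance` class, its field names, and `may_auto_start` are hypothetical, and any state other than STOPPED or BROKEN_SAFE is treated as active, following the state groups described later in this section.

```python
from dataclasses import dataclass

# Hypothetical model for illustration only; none of these names come from RSF-1.
INACTIVE_STATES = {"STOPPED", "BROKEN_SAFE"}


@dataclass
class ServiceInstance:
    node: str
    mode: str              # "AUTOMATIC" or "MANUAL"
    state: str             # e.g. "STOPPED", "RUNNING", "STOPPING"
    blocked: bool = False


def may_auto_start(candidate, all_instances):
    """Return True only if every documented automatic-start condition holds."""
    if candidate.mode != "AUTOMATIC":
        return False   # manual instances are never started automatically
    if candidate.state != "STOPPED":
        return False   # the instance must be in the stopped state
    if candidate.blocked:
        return False   # the instance must not be blocked
    # No other instance of this service may be in an active state.
    return all(other.state in INACTIVE_STATES
               for other in all_instances if other is not candidate)
```

With two instances of one service, the instance on the first node is eligible only while the instance on the second node is inactive; a peer that is still STOPPING prevents the start.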
State (running/stopped etc)
A service instance in the cluster will always be in a specific state. These states are divided into two main groups, active states and inactive states [1]. Individual states within these groups are transitional. For example, a starting state will transition to a running state once the startup steps for that service have completed successfully, and similarly a stopping state will transition to a stopped state once all the shutdown steps have completed successfully (note that this state change, stopping ==> stopped, also moves the service instance from the active state group to the inactive state group).
Active States
When a service instance is in an active state, it is utilising the resources of that service (e.g. an imported ZFS pool, a plumbed in VIP etc.). In this state the service is considered up and running and will not be started on any other node in the cluster until it transitions to an inactive state. For example, a service that is STOPPING on a node is still in an active state, so it cannot yet be started on any other node (see below for the definition of inactive states).
The following table describes all the active states.
Active State | Description |
---|---|
STARTING | The service is in the process of starting on this node. Service start scripts are currently running - when they complete successfully the service instance will transition to the RUNNING state. |
RUNNING | The service is running on this node and only this node. All service resources have been brought online. For ZFS clusters this means the main ZFS pool and any additional pools have been imported, any VIPs have been plumbed in and any configured logical units have been brought online. |
STOPPING | The service is in the process of stopping on this node. Service stop scripts are currently running - when they complete successfully the service instance will transition to the STOPPED state. |
PANICKING | While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts are running and when they are finished, the service instance will transition to PANICKED. |
PANICKED | While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts have been run. |
ABORTING | Service start scripts failed to complete successfully. Abort scripts are running (these are the same as service stop scripts). When abort scripts complete successfully the service instance will transition to the BROKEN_SAFE state (an inactive state). If any of the abort scripts fail to run successfully then the service transitions to a BROKEN_UNSAFE state and manual intervention is required. |
BROKEN_UNSAFE | The service has transitioned to a broken state because service stop or abort scripts failed to run successfully. Some or all service resources are likely to be online so it is not safe for the cluster to start another instance of this service on another node. This state can be caused by one of two circumstances: |
Inactive States
When a service instance is in an inactive state, no service resources are online. That means it is safe for another instance of the service to be started elsewhere in the cluster.
The following table describes all the inactive states.
Inactive State | Description |
---|---|
STOPPED | The service is stopped on this node. No service resources are online. |
BROKEN_SAFE [2] | This state can be the result of either of the following circumstances: |
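The state groups and the script-driven transitions described in the two tables can be summarised as a small lookup table. This is a sketch under stated assumptions: `TRANSITIONS`, `next_state`, and `safe_to_start_elsewhere` are hypothetical names, not part of RSF-1, and the outcome of failing panic scripts is left out because the text does not describe it.

```python
# Hypothetical names throughout; an illustration of the tables, not RSF-1 code.
ACTIVE_STATES = frozenset({
    "STARTING", "RUNNING", "STOPPING", "PANICKING",
    "PANICKED", "ABORTING", "BROKEN_UNSAFE",
})
INACTIVE_STATES = frozenset({"STOPPED", "BROKEN_SAFE"})

# (current state, scripts succeeded?) -> next state, as given in the tables.
# The result of failing panic scripts is not stated, so it has no entry here.
TRANSITIONS = {
    ("STARTING", True): "RUNNING",
    ("STARTING", False): "ABORTING",
    ("STOPPING", True): "STOPPED",
    ("STOPPING", False): "BROKEN_UNSAFE",
    ("ABORTING", True): "BROKEN_SAFE",
    ("ABORTING", False): "BROKEN_UNSAFE",
    ("PANICKING", True): "PANICKED",
}


def next_state(state, scripts_succeeded):
    """Return the state reached once the scripts for `state` have finished.

    States with no entry in TRANSITIONS are returned unchanged.
    """
    return TRANSITIONS.get((state, scripts_succeeded), state)


def safe_to_start_elsewhere(state):
    """Only inactive states guarantee that no service resources are online."""
    return state in INACTIVE_STATES
```

For example, a failed start followed by a clean abort ends in BROKEN_SAFE, which permits a start on another node, whereas a failed stop ends in BROKEN_UNSAFE, which does not.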
Blocked (blocked/unblocked)
The service blocked state is similar to the service mode (AUTOMATIC/MANUAL) except that instead of being set by the user, it is controlled automatically by the cluster's monitoring features.
For example, if network monitoring is enabled then the cluster constantly checks the network connectivity of any interfaces that VIPs are plumbed in on. If one of those interfaces becomes unavailable (link down, cable unplugged, switch dies etc.) then the cluster will automatically transition that service instance to blocked.
If a service instance becomes blocked while it is already running, the cluster will stop that instance so that it can be started on another node, provided there is another service instance in the cluster that is UNBLOCKED, AUTOMATIC and STOPPED; otherwise no action will be taken.
Also note that a service does not have to be running on a node for that service instance to become blocked. If a monitored resource such as a network interface becomes unavailable, the cluster will set that node's service instance to a blocked state, thus blocking that node from starting the service. Should the resource become available again, the cluster will clear the blocked state.
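The rule for a running instance that becomes blocked can be sketched as a small decision function. The `Instance` class and `should_stop_blocked_instance` are hypothetical names used only for illustration; this is one reading of the rule described in this section, not the cluster's actual implementation.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; not an RSF-1 data structure.
@dataclass
class Instance:
    node: str
    mode: str      # "AUTOMATIC" or "MANUAL"
    state: str     # e.g. "RUNNING", "STOPPED"
    blocked: bool


def should_stop_blocked_instance(running, peers):
    """Decide whether a running instance that became blocked should be stopped.

    The instance is stopped only if some peer instance could take over,
    i.e. a peer that is UNBLOCKED, AUTOMATIC and STOPPED. Otherwise no
    action is taken and the service stays where it is.
    """
    if not (running.blocked and running.state == "RUNNING"):
        return False
    return any(
        not peer.blocked and peer.mode == "AUTOMATIC" and peer.state == "STOPPED"
        for peer in peers
    )
```

If the only peer is in manual mode, or is itself blocked, the function returns False and the blocked instance keeps running.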
The following table describes all the blocked states.
Blocked State | Description |
---|---|
BLOCKED | The cluster's monitoring has detected a problem that affects this service instance. This service instance will not start until the problem is resolved, even if the service is in automatic mode. |
UNBLOCKED | The service instance is free to start as long as it is in automatic mode. |
[1] For example, running and stopping are members of the active group, whereas stopped is a member of the inactive group.
[2] A broken_safe state is considered an inactive state as, although the service was unable to start up successfully, it was able to free up all the resources during the shutdown/abort step (hence the safe state).