Configuration Guide
Introduction
This guide assumes a cluster and its services have already been created and configured; for details on creating clusters, please refer to the cluster creation guides.
Terminology
Services
In an RSF-1 cluster a service refers to a ZFS pool that is managed by the cluster. The cluster may have one or more services under its control, i.e. multiple pools. Furthermore, an individual service may consist of more than one pool - referred to as a pool group - where actions performed on that service will be performed on all pools in the group.
A service instance is the combination of a service and a cluster node on which that service is eligible to run. For example, in a 2-node cluster each service will be configured to have two available instances - one on each node in the cluster. Only one instance of a service will be active at any one time.
Modes (automatic/manual)
Each service instance has a mode setting of either `automatic` or `manual`. The mode of a service is specific to each node in the cluster, so a service can be `manual` on one node and `automatic` on another. The meanings of the modes are:
AUTOMATIC
Automatic mode means the service instance will be automatically started when all of the following requirements are satisfied:
- The service instance is in the stopped state
- The service instance is not blocked
- No other instance of this service is in an active state
MANUAL
Manual mode means the service instance will never be started automatically on that node.
State (running/stopped etc)
A service instance in the cluster is always in a specific state. These states are divided into two main groups, active states and inactive states¹. Individual states within these groups are transitional; for example, a `starting` state will transition to a `running` state once the startup steps for that service have completed successfully, and similarly a `stopping` state will transition to a `stopped` state once all the shutdown steps have completed successfully (note that this state change, `stopping` ==> `stopped`, also moves the service instance from the active state group to the inactive state group).
Active States
When a service instance is in an active state, it is utilising the resources of that service (e.g. an imported ZFS pool, a plumbed-in VIP, etc.). In this state the service is considered up and running and will not be started on any other node in the cluster until it transitions to an inactive state. For example, if a service is STOPPING on a node it is still in an active state, and cannot be started on any other node in the cluster until it transitions to an inactive state - see below for the definition of inactive states.
The following table describes all the active states.
| Active State | Description |
|---|---|
| STARTING | The service is in the process of starting on this node. Service start scripts are currently running - when they complete successfully the service instance will transition to the RUNNING state. |
| RUNNING | The service is running on this node and only this node. All service resources have been brought online. For ZFS clusters this means the main ZFS pool and any additional pools have been imported, any VIPs have been plumbed in and any configured logical units have been brought online. |
| STOPPING | The service is in the process of stopping on this node. Service stop scripts are currently running - when they complete successfully the service instance will transition to the STOPPED state. |
| PANICKING | While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts are running and when they are finished, the service instance will transition to PANICKED. |
| PANICKED | While the service was in an active state on this node, it was seen in an active state on another node. Panic scripts have been run. |
| ABORTING | Service start scripts failed to complete successfully. Abort scripts are running (these are the same as service stop scripts). When abort scripts complete successfully the service instance will transition to the BROKEN_SAFE state (an inactive state). If any of the abort scripts fail to run successfully then the service transitions to a BROKEN_UNSAFE state and manual intervention is required. |
| BROKEN_UNSAFE | The service has transitioned to a broken state because service stop or abort scripts failed to run successfully - either the stop scripts failed while the service was stopping, or the abort scripts failed after a failed service start. Some or all service resources are likely to be online, so it is not safe for the cluster to start another instance of this service on another node. |
Inactive States
When a service instance is in an inactive state, no service resources are online. That means it is safe for another instance of the service to be started elsewhere in the cluster.
The following table describes all the inactive states.
| Inactive State | Description |
|---|---|
| STOPPED | The service is stopped on this node. No service resources are online. |
| BROKEN_SAFE | The service is broken on this node (typically because service start scripts failed and the subsequent abort scripts completed successfully). No service resources remain online, so it is safe for another instance of the service to be started elsewhere in the cluster. |
Blocked (blocked/unblocked)
The service blocked state is similar to the service mode (`AUTOMATIC`/`MANUAL`) except that instead of being set by the user, it is controlled automatically by the cluster's monitoring features.
For example, if network monitoring is enabled then the cluster constantly checks the network connectivity of any interfaces that VIPs are plumbed in on. If one of those interfaces becomes unavailable (link down, cable unplugged, switch failure, etc.) then the cluster will automatically transition that service instance to blocked.
If a service instance becomes blocked when it is already running, the cluster will stop that instance to allow it to be started on another node, so long as there is another instance of that service in the cluster that is `UNBLOCKED`, `AUTOMATIC` and `STOPPED`; otherwise no action will be taken.
Also note, a service does not have to be running on a node for that service instance to become blocked - if a monitored resource such as a network interface becomes unavailable then the cluster will set the node's service instance to a blocked state, thus blocking that node from starting the service. Should the resource become available again, the cluster will clear the blocked state.
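As a hedged illustration (the interface name is an assumption, and this simply shows the kind of link-state information the monitoring reacts to), a link-down condition can be inspected on a node with standard tooling:

```
# Show the brief link state of the interface carrying a VIP (eth1 is hypothetical)
ip -br link show eth1
```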
The following table describes all the blocked states.
| Blocked State | Description |
|---|---|
| BLOCKED | The cluster's monitoring has detected a problem that affects this service instance. This service instance will not start until the problem is resolved, even if the service is in automatic mode. |
| UNBLOCKED | The service instance is free to start as long as it is in automatic mode. |
Dashboard
The Dashboard is the initial landing page when connecting to the webapp once a cluster has been created. It provides a quick overview of the current status of the cluster and allows you to perform operations such as stopping, starting and moving services between nodes:
The dashboard is made up of three main sections along with a navigation panel on the left hand side:
- The status panel, located at the top of the page, providing an instant view of the overall health of the cluster with node, service and heartbeat summary status.
- The nodes panel, detailing each node's availability in the cluster along with its IP address and heartbeat status.
- The services panel, detailing the services configured in the cluster, which node they are running on (if any), and any associated VIPs.
Clicking on the icon for an individual node or service brings up a context sensitive menu, described in the following sections.
Nodes panel
The nodes panel shows the status of each node in the cluster:
Clicking on a node opens a side menu that allows control of services known to that node. In the example above, clicking on the icon for `node-a` would bring up the following menu:
Available actions can then be viewed by clicking on the `⋮` button in the right hand column for an individual service:
Alternatively, the `⋮` button on the `Clustered Services` row brings up a menu that performs actions on all services on that node:
Services Panel
The services panel shows the status of each service in the cluster:
Clicking on a service opens up a side menu that allows control of that service in the cluster. In the example above, clicking on the icon for `pool1` would bring up the following menu:
Available actions can then be viewed by clicking on the `⋮` button in the right hand column for an individual service:
New Services
When a service is added to an RSF-1 High Availability cluster, its state will initially be set to `stopped`/`automatic` and the cluster will start the service on the service's preferred node.
Clustering a Docker Container
These steps show the process of creating a clustered Docker container. The container will be created using a standard Docker `compose.yaml` file.
- Navigate to HA-Cluster -> Docker in the webapp:

- Click `Cluster a Docker application` to get to the creation/addition page and fill in the fields. Available options:

    - Select HA Service - Select the service/pool to associate the container with in the event of a failover
    - Container Description - Optional description of the container
    - Location of `compose.yaml` file within selected service - The path in the selected pool/service to save the `compose.yaml`
    - Contents of `compose.yaml` file - Enter the contents of the `compose.yaml` file for the container

    An example `compose.yaml`:

    ```
    services:
      apache:
        image: httpd:latest
        container_name: my-apache-app
        ports:
          - 8080:80
        volumes:
          - ./website:/usr/local/apache2/htdocs
        restart: no
    ```
    Warning

    When adding your content, make sure to add `restart: no` to your service configurations. RSF-1 will manage the restart of clustered containers in the event of a failover.

- When finished click `Create`.

- By default the container will remain stopped until started. Click the `Start` button to spin up the container (see the check sketched below).
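As a hedged check (the container name and port mapping are taken from the example `compose.yaml` above), you can confirm the container is up on the node where the service is active using standard Docker tooling:

```
# Run on the node where the HA service is currently active
docker ps --filter name=my-apache-app    # the container should be listed as Up
curl -I http://localhost:8080            # Apache should answer on the mapped port
```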
Heartbeats
In the cluster, heartbeats perform the following roles:
- To continually monitor the other nodes in the cluster, ensuring they are active and available.
- Communicate cluster and service status to the other nodes in the cluster. Status information includes the mode and state of every service on that node (`manual`/`automatic`, `running`/`stopped`, etc.), along with any services that are currently blocked.
- Carry a checksum of the active cluster configuration on that node.
Configuration checksums
The configuration checksums must match on all cluster nodes to ensure the validity of the cluster; should a mismatch be detected then the cluster will lock the current state of all services (active or not) until the mismatch is resolved. This safety feature protects against unexpected behaviour as a result of unsynchronised configuration.
The cluster supports two types of heartbeats:
- Network heartbeats
- Disk heartbeats
Heartbeats are unidirectional; therefore, for each heartbeat configured there will be two channels (one to send and one to receive).
The same information and structures are transmitted over each type of heartbeat. The cluster supports multiple heartbeats of each type. When the cluster is first created a network heartbeat is automatically configured between cluster nodes using the node hostnames as the endpoints. Disk heartbeats are automatically configured when a service is created and under normal circumstances require no user intervention.
It is recommended practice to configure network heartbeats across any additional network interfaces.
For example, if the hostnames are on a `10.x.x.x` network, and an additional private network exists with `192.x.x.x` addresses, then an additional heartbeat can be configured on that private network.
Using the following example hosts file, an additional network heartbeat can be configured using the `node-a-priv` and `node-b-priv` addresses as endpoints:

```
10.0.0.1     node-a
10.0.0.2     node-b
192.168.72.1 node-a-priv
192.168.72.2 node-b-priv
```
By specifying the endpoint using the address of an additional interface the cluster will automatically route heartbeat packets down the correct network for that interface.
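Before adding the heartbeat, it can be worth confirming the private endpoints resolve and are reachable from each node (hostnames taken from the example hosts file above); a minimal check:

```
# Run from node-a; repeat in the other direction from node-b
getent hosts node-b-priv    # confirm the name resolves to the private address
ping -c 3 node-b-priv       # confirm the private link is reachable
```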
To view the cluster heartbeats navigate to `HA-Cluster -> Heartbeats` on the left side-menu:
Adding a Network Heartbeat
To add an additional network heartbeat to the cluster, select `Add Network Heartbeat Pair`.
In this example an additional physical network connection exists between the two nodes.
The endpoints for this additional network are given the names `node-a-priv` and `node-b-priv` respectively. These hostnames are then used when configuring the additional heartbeat:
Click `Submit` to add the heartbeat.
The new heartbeat will now be displayed on the `Heartbeats` status page:
Removing a Network Heartbeat
To remove a network heartbeat, select the heartbeat using the slider on the left hand side of the table and click the `remove selected` button:
Finally, confirm the action:
Disk heartbeats
Under normal circumstances it should not be necessary to add or remove disk heartbeats as this is handled automatically by the cluster.
NFS shares
Enabling clustered NFS
Note
Before enabling NFS please ensure all relevant packages (e.g. `nfs-kernel-server`) are installed and enabled on all nodes in the cluster.
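As a hedged example for a Debian/Ubuntu node (package and service names vary by distribution), this could look like:

```
# Run on every node in the cluster
apt install nfs-kernel-server
systemctl enable --now nfs-kernel-server
```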
By default RSF-1 does not manage NFS shares - the contents of the `/etc/exports` file are left to be managed by the system administrator manually on each node in the cluster.
To enable the management of the exports file from the webapp and synchronise it across all cluster nodes, navigate to `Shares -> NFS` and click `ENABLE NFS SHARE HANDLING`:
Once enabled the shares table will be shown:
Before creating new shares, the option to import the existing `/etc/exports` file is available (this option is disabled once any new shares are added via the webapp):
Clustering an NFS share
- Navigate to `Shares -> NFS` and click `+Add` on the NFS table to fill in the required info. The available options are:

    - `Description` - Description of the share (optional)
    - `Path` - Path of the directory/dataset to share - for example `/pool1/nfs`
    - `Export Options` - For a detailed description of the available options click the `SHOW NFS OPTIONS EXAMPLES` button.

- Click `✓` to add the share. The share will now be available and clustered (a quick client-side check is sketched below).
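As a hedged way to confirm the export is visible to clients (the address below is a placeholder for the service VIP or the node currently running the service; `showmount` is part of the standard NFS utilities):

```
# Query the export list from a client machine
showmount -e <vip-or-active-node>
```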
FSID setting for failover
NFS identifies each file system it exports using a file system UUID or the device number of the device holding the file system. NFS clients use this identifier to ensure consistency in mounted file systems; if this identifier changes then the client considers the mount stale and typically reports "Stale NFS file handle" meaning manual intervention is required.
In an HA environment there is no guarantee that these identifiers will be the same on failover to another node (it may for example have a different device numbering). To alleviate this problem each exported file system should be assigned a unique identifier (starting at 1 - see the note below on the root setting) using the NFS fsid= option, for example:
```
/tank     10.10.23.4(fsid=1)
/sales    10.10.23.5(fsid=2,sync,wdelay,no_subtree_check,ro,root_squash)
/accounts accounts.dept.foo.com(fsid=3,rw,no_root_squash)
```
Here each exported file system has been assigned a unique fsid thereby ensuring that no matter which cluster node exports the filesystem it will always have a consistent identifier exposed to clients.
For NFSv4 the option fsid=0 or fsid=root is reserved for the "root" export. When present all other exported directories must be below it, for example:
```
/srv/nfs      192.168.7.0/24(rw,fsid=root)
/srv/nfs/data 192.168.7.0/24(fsid=1,sync,wdelay,no_subtree_check,ro,root_squash)
```
Because `/srv/nfs` is marked as the root export, the export `/srv/nfs/data` is mounted by clients as `nfsserver:/data`. For further details see the NFS manual page.
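For illustration (hedged - `nfsserver` stands in for the cluster VIP or active node, and the mount point is arbitrary), an NFSv4 client would mount the data export relative to the root export like this:

```
# Mount the /srv/nfs/data export via the NFSv4 pseudo-root
mount -t nfs4 nfsserver:/data /mnt/data
```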
Modifying an NFS Share
To modify an NFS share, click the pencil icon to the left of the dataset:
When done, click `✓` to update the share.
Deleting an NFS Share
To delete an NFS share, click the trash can icon and then confirm the deletion:
SMB shares
Enabling Samba/SMB in the cluster
Note
Before enabling SMB please ensure all relevant Samba packages (e.g. Samba, NMB, Winbind) are installed and enabled on all nodes in the cluster.
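As a hedged example for a Debian/Ubuntu node (package and service names vary by distribution):

```
# Run on every node in the cluster
apt install samba winbind
systemctl enable --now smbd nmbd winbind
```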
By default RSF-1 does not manage SMB shares - the `smb.conf` file is left to be managed by the system administrator manually on each node in the cluster.
To enable the management of SMB from the webapp and synchronise it across all cluster nodes, navigate to `Shares -> SMB` and click `ENABLE SMB SHARE HANDLING`:
You will now be presented with the main SMB shares screen consisting of a number of tabs to handle different aspects of SMB configuration.
Initial SMB configuration
SMB/Samba provides numerous ways to configure authentication and sharing depending upon the environment and the complexity required. This guide documents two commonly used configurations:
- User Authentication - standalone clustered SMB sharing with local user authentication.
- ADS Authentication - member of an Active Directory domain with authentication managed by a domain controller.
Local User Authentication
With local user authentication, cluster users must be created with SMB support enabled. A user created this way will have the same login name, UID and GID on all nodes in the cluster, along with an equivalent Samba user entry to provide the required SMB authentication. See Unix users in this guide for further details.
Configuring Samba Globals for User Authentication
Navigate to `Shares -> SMB`, select the `GLOBALS` tab, then select `User` from the drop down `security` list and optionally set the desired workgroup name. Click `save changes`:
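As a hedged illustration of what this produces (the exact file is generated by the webapp and viewable from the `CONFIG` tab; the workgroup name here is an assumption), the resulting global section would typically look something like:

```
[global]
    security  = user
    workgroup = WORKGROUP
```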
ADS Authentication
RSF-1 can also be configured to use Active Directory for user authentication (ADS) when being deployed for use in a Microsoft environment.
In a Microsoft environment users are identified using security identifiers (SIDs). A SID is not just a number; it has a structure and is composed of several values, whereas Unix user and group identifiers consist of just a single number. Therefore a mechanism needs to be chosen to map SIDs to Unix identifiers. Winbind (part of the Samba suite) is capable of performing that mapping using a number of such mechanisms, known as Identity Mapping Backends; two of the most commonly used are `tdb` (Trivial Data Base) and `rid` (Relative IDentifier).
`tdb` - The default idmap backend is not advised for an RSF-1 cluster, as `tdb` generates and stores UIDs/GIDs locally on each cluster node and works on a "first come, first served" basis. When allocating UIDs/GIDs it simply uses the next available number with no consideration given to a clustered environment, which can lead to UID/GID mismatches between cluster nodes.
`rid` - This mechanism is recommended as the idmap backend for a clustered environment. `rid` implements a read-only API to retrieve account and group information from an Active Directory (AD) Domain Controller (DC) or NT4 primary domain controller (PDC). Using this approach ensures UID/GID continuity on all cluster nodes.
When using the `rid` backend, a Windows SID (for example `S-1-5-21-1636233473-1910532501-651140701-1105`) is mapped to a Unix UID/GID by taking the relative identifier part of the SID (the last set of digits - `1105` in the above example) and combining it with a preallocated range of numbers to provide a unique identifier that can be used for the Unix UID/GID. This preallocated range is configured using the Samba IDMAP entry.
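As a hedged worked example (assuming the `100001-300000` range used in the configuration below and Samba's default `base_rid` of 0), the `rid` backend computes the Unix ID as the SID's RID plus the low end of the configured range, so the example SID above would map to `1105 + 100001 = 101106`. Once the domain has been joined, the mapping can be checked with `wbinfo`:

```
# Resolve the example SID to the Unix UID allocated by the rid backend
wbinfo --sid-to-uid S-1-5-21-1636233473-1910532501-651140701-1105
```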
Configuring Samba Globals for ADS Authentication
- Navigate to `Shares -> SMB`, select the `GLOBALS` tab, then select `ADS` from the drop down security list, set the workgroup and realm name. Click `save changes`:

- Navigate to and open the `IDMAP` setting section. By default a wildcard entry is preconfigured (this is used by Samba as a catchall and is a required entry). Click `+Add` and configure an entry for the `rid` mapping. Enter the same workgroup name used in the security settings and enter the desired range of numbers to use for mapping the IDs. Select a range that starts after the wildcard range and provides enough scope to cover the expected maximum number of Windows users in the domain. Click the `✓` to update the mapping table:

- Click `SAVE CHANGES`. The resulting configuration file (viewable from the `CONFIG` tab) should look similar to the following:

    ```
    [global]
        encrypt passwords = Yes
        idmap config * : backend = tdb
        idmap config * : range = 3000-100000
        idmap config HACLAB : backend = rid
        idmap config HACLAB : range = 100001-300000
        realm = HACLAB.COM
        security = ADS
        server role = Member Server
        workgroup = HACLAB
    ```
Samba is now configured to be able to use ADS authentication.
Testing ADS authentication (optional)
ADS authentication can be tested by allowing users from the windows domain to login to the Unix cluster hosts. A successful login proves that Samba is able to authenticate using the Windows Domain Controller.
Some additional configuration is required as follows (remember to do this on all nodes in the cluster):
- Configure winbind authentication for users and groups in the name service switch file `/etc/nsswitch.conf`. Add `winbind` as a resolver for users and groups:

    ```
    passwd:         files winbind systemd
    group:          files winbind systemd
    ```

    This tells the operating system to look up users locally first (`/etc/passwd`), followed by `winbind`.
. -
Change DNS in
/etc/resolv.conf
so it refers to the Active Directory server:domain haclab.com search haclab.com nameserver 10.254.254.111
- Join the Active Directory domain:

    ```
    # net ads join -U Administrator
    Password for [HACLAB\Administrator]:
    Using short domain name -- HACLAB
    Joined 'WCALMA1' to dns domain 'haclab.com'
    No DNS domain configured for wcalma1. Unable to perform DNS Update.
    DNS update failed: NT_STATUS_INVALID_PARAMETER

    # net ads info
    LDAP server: 10.254.254.111
    LDAP server name: ws2022.haclab.com
    Realm: HACLAB.COM
    Bind Path: dc=HACLAB,dc=COM
    LDAP port: 389
    Server time: Fri, 13 Sep 2024 17:07:32 BST
    KDC server: 10.254.254.111
    Server time offset: 1
    Last machine account password change: Fri, 13 Sep 2024 16:57:04 BST
    ```
- Restart `winbind`:

    ```
    # systemctl restart winbind
    ```
- Query `winbind` to confirm it is able to query the Active Directory server (an additional check is sketched after this list):

    ```
    # wbinfo -u
    HACLAB\administrator
    HACLAB\guest
    HACLAB\krbtgt
    HACLAB\hacuser1
    HACLAB\hacuser2
    ```
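As an additional hedged check that the nsswitch changes are in effect (the user name is taken from the `wbinfo` output above), domain accounts should now also be visible through the standard name service:

```
# Look up a domain user via nsswitch/winbind
getent passwd 'HACLAB\hacuser1'
```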
Samba can be further configured to allow AD users to log in if so desired. The following additional steps are required:
- Enable auto creation of home directories. For Debian based systems:

    ```
    # vi /etc/pam.d/common-session
    # add to the end if you need (auto create a home directory at initial login)
    session optional        pam_mkhomedir.so skel=/etc/skel umask=077
    ```

    For RedHat based systems:

    ```
    # authselect enable-feature with-mkhomedir
    # systemctl enable --now oddjobd
    ```
- Configure a login shell for Samba. Navigate to `Shares -> SMB`, select the `GLOBALS` tab, expand the `miscellaneous` section and set an appropriate login shell for the system:

- Finally, restart `winbind`:

    ```
    # systemctl restart winbind
    ```
It should now be possible to login to the Unix servers using AD users.
Clustering an SMB Share
These steps show the process of creating a clustered SMB share.
- Shares are managed via the `SHARES` tab. Click `+Add` to create a new SMB share. Available options:

    - `Share Name` - The name of the SMB share
    - `Path` - Path of the folder to be shared, for example `/pool1/SMB`
    - `Valid Users` - A space separated list of valid users. When User authentication is in effect these will be Unix cluster users; for ADS authentication this can be local Unix cluster and Windows domain users.
- Click `✓` when done. The share will now be available and clustered (a Samba reload is automatically applied; a quick client-side check is sketched below):

- Advanced share settings can be applied once the share is created by clicking the cog on the left hand side of each individual share:
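As a hedged check from a client or from one of the nodes (the node and user names here follow the earlier examples and are assumptions), the new share should appear in the listing produced by `smbclient`:

```
# List the shares offered by the node currently running the service
smbclient -L //node-a -U hacuser1
```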
Additional SMB settings
LOCAL CONFIG
This tab is used to apply Samba configuration settings specific to each node rather than to all cluster nodes. One example of this is the `netbios name`, which needs to be unique on each node in the Windows domain.
CONFIG
This tab shows the current Samba configuration for each cluster node; this view includes the globals, shares and local configuration.
STATUS
This tab shows the status of the Samba daemon services that are running. It also allows management of the services per node.
Unix Users
Creating Users
Creating Unix users in the WebApp will create the user across all cluster nodes using the same credentials (Username, UID and GID).
- In the WebApp, navigate to `System -> UNIX Users`, and click `+Add`:

- Enter the Username and Password, and provide any of the additional information if required:

    - `List of Groups` - Add the user to any available groups (optional)
    - `UID/GID` - Specify the User ID and Group ID of the user (optional - if unspecified the next available UID/GID will be used)
    - `Add user to sudo group` - This user will be able to issue commands as a different user (requires the sudo package to be installed)
    - `Enable SMB support for user` - Adds this user to the valid Samba users
    - `Home Directory` - Specify the location for the user home directory (optional)
    - `Shell` - Specify the default shell for the user (optional)
- When done click `SAVE`. Once saved, the user will be created on all nodes in the cluster (a quick check is sketched below):

    Warning

    If the user name or UID specified already exists on any node in the cluster then the user add operation will fail with the message "Error creating user clusterwide..."
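As a hedged verification (the username is whatever was entered above; `hacuser1` is used here purely as an illustration), the same UID and GID should be reported on every node:

```
# Run on each node in the cluster - the output should match everywhere
id hacuser1
```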
Modifying Users
To modify a user, click on the pencil icon on the left hand side of the user list table:
Deleting Users
To delete a user from all cluster nodes, click the trash can icon and then confirm the deletion:
Note
Local users (users that exist on one node only) can only be modified and deleted by logging into the WebApp on the node where the user exists.
- A `broken_safe` state is considered a stopped state as, although the service was unable to start up successfully, it was able to free up all the resources during the shutdown/abort step (hence the `safe` state). ↩