RSF-1 for ZFS is a fully featured software-only middleware product that turns your Solaris, illumos, FreeBSD or Linux storage servers into highly available ZFS NAS cluster appliances, which can be installed and ready for enterprise storage use within minutes.

RSF-1 for ZFS allows multiple ZFS pools to be managed across multiple servers providing High Availability for both block and file services beyond a traditional two-node Active/Active or Active/Passive topology. With RSF-1 for ZFS Metro edition, highly available ZFS services can also span beyond the single data centre.

Managed by a standalone GUI, command line interface and a rich API, RSF-1 for ZFS can also be seamlessly integrated into your own management and administration toolset.

At the heart of the RSF-1 for ZFS solution is a mature and stable enterprise class high availability product. It was the first commercial HA solution for Sun/Solaris environments and has a 20+ year track record in data centres worldwide providing high-availability assurance for some of the most demanding customer service availability needs.

RSF-1 for ZFS has provided Enterprise-grade High-Availability ZFS Storage services to thousands of mission-critical deployments across all industries worldwide since 2009.

Example High Availability ZFS Topology

The following section describes how RSF-1 brings Highly Available storage services to ZFS in a two-node, shared storage topology.

This example consists of two storage servers (node A and node B) with shared storage made up of two ZFS pools (Pool1 and Pool2). The two storage nodes are interconnected with public network, private network and storage connectivity.

RSF-1 for ZFS installs on both servers and communicates via a number of heartbeat connections. Each heartbeat transmits RSF-1 state and control information describing each node’s view of the cluster. In this example, heartbeats are established via both private and public networks (TCP/IP) and via High-Availability’s unique stateful disk heartbeat mechanism. This mechanism ensures that in the event of total network failure, cluster control is maintained independently.

Any number of heartbeat connections (disk, network and serial) can be used in an RSF-1 cluster, and at least two different mechanisms are recommended. In this example, we are using two independent network heartbeats and two independent disk heartbeats.
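
To illustrate the network heartbeat concept only (RSF-1's actual wire protocol and daemons are its own), the following Python sketch shows one node periodically sending its view of the cluster to a peer over UDP and noticing when the peer's beats stop. The port number, interval, timeout and message format are all assumptions made for this example.

    # Minimal sketch of a network heartbeat: each node periodically sends its
    # view of the cluster over UDP and flags the peer as lost if nothing
    # arrives within a timeout. Port, interval and message format are
    # illustrative only, not RSF-1's protocol.
    import json
    import socket
    import time

    HB_PORT = 1195        # hypothetical heartbeat port
    HB_INTERVAL = 1.0     # seconds between beats
    HB_TIMEOUT = 5.0      # peer considered lost after this long without a beat

    def run_heartbeat(node_name, peer_addr, service_states):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", HB_PORT))
        sock.settimeout(HB_INTERVAL)
        last_seen = time.monotonic()
        while True:
            # Transmit this node's state and control information.
            beat = json.dumps({"node": node_name, "services": service_states})
            sock.sendto(beat.encode(), (peer_addr, HB_PORT))
            try:
                data, _ = sock.recvfrom(4096)
                print("peer view:", json.loads(data))
                last_seen = time.monotonic()
            except socket.timeout:
                pass
            if time.monotonic() - last_seen > HB_TIMEOUT:
                print("peer heartbeat lost - failover logic would start here")
            time.sleep(HB_INTERVAL)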

In this simple two-node, two-pool topology example, we are going to deploy an Active-Active configuration where each of the two servers will manage ZFS services for one of the two ZFS pools. Note that Active-Active here refers to the fact that both servers can actively run ZFS pool services.

On start-up, each node in the RSF-1 cluster determines which services need to be started as defined by the RSF-1 configuration preferences and will assume a role of “master” or “standby” for each service depending on the state of each service at that time.

When RSF-1 determines that a ZFS pool service is not already active, it initiates a countdown to become master for that service. Once that countdown has expired, it initiates service startup, informing the rest of the cluster that it will be master for that service. If the service is already running elsewhere, the node becomes the standby server for that service.
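
To picture the start-up decision, here is a minimal Python sketch assuming hypothetical helper callables (service_active_elsewhere and start_service) that RSF-1 itself does not expose under these names: the node waits out a per-service countdown and takes the master role only if no other node has claimed the service in the meantime.

    # Conceptual sketch of the start-up decision for one service: wait out a
    # countdown, then become master only if no other node has already started
    # the service. Names and timings are illustrative, not RSF-1's API.
    import time

    def decide_role(service, countdown_secs, service_active_elsewhere, start_service):
        deadline = time.monotonic() + countdown_secs
        while time.monotonic() < deadline:
            if service_active_elsewhere(service):
                print(f"{service}: already running elsewhere - becoming standby")
                return "standby"
            time.sleep(1)
        # Countdown expired and nobody else claimed the service: take it.
        print(f"{service}: countdown expired - becoming master")
        start_service(service)
        return "master"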

On service startup, the node begins by fencing the underlying storage to protect the ZFS pool data. It does this by locking access to the drives that make up the ZFS pool, ensuring that no other storage server can access them.
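
RSF-1's drive locking is built into the product, but the general technique can be sketched with SCSI-3 persistent reservations as exposed by the sg_persist utility from sg3_utils. The reservation key, reservation type and device list below are assumptions for illustration, not RSF-1's actual implementation.

    # Illustrative fencing sketch using SCSI-3 persistent reservations via
    # sg_persist (sg3_utils). RSF-1's own drive locking is internal to the
    # product; the key and reservation type here are purely illustrative.
    import subprocess

    RESERVATION_KEY = "0xabc123"   # hypothetical per-node key

    def fence_drive(device):
        # Register this node's key with the device, then take a
        # "write exclusive, registrants only" reservation (type 5).
        subprocess.run(["sg_persist", "--out", "--register",
                        "--param-sark=" + RESERVATION_KEY, device], check=True)
        subprocess.run(["sg_persist", "--out", "--reserve",
                        "--param-rk=" + RESERVATION_KEY,
                        "--prout-type=5", device], check=True)

    def fence_pool(devices):
        # Fence every drive that backs the ZFS pool, e.g. ["/dev/sdb", "/dev/sdc"].
        for dev in devices:
            fence_drive(dev)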

Once node A has protected and secured the underlying storage, it imports the ZFS pool, starts the associated ZFS services (file and/or block) and enables a Virtual IP interface for network access to ZFS Pool1.
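
In outline, that service start amounts to importing the pool and plumbing the virtual IP, as in this sketch using the standard zpool and Linux ip commands; the pool name, address and interface are example values, and enabling the file or block services is left as a comment.

    # Outline of a service start after fencing: import the pool, then bring up
    # the virtual IP clients use to reach it. Pool name, address and interface
    # are example values.
    import subprocess

    def start_pool_service(pool="Pool1", vip="192.0.2.10/24", iface="eth0"):
        subprocess.run(["zpool", "import", pool], check=True)   # import the ZFS pool
        # File and/or block services (NFS, SMB, iSCSI) would be enabled here.
        subprocess.run(["ip", "addr", "add", vip, "dev", iface], check=True)  # enable the VIP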

At the same time, node B executes the same process for ZFS Pool2. Each service has a defined node preference order and associated timeouts that must expire before a service is started. If, in this example, node B had not started within the ZFS Pool2 timeout after node A, node A would also assume control and start the service for Pool2.
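
One way to picture the preference order and timeouts is a per-service ordered list in which each lower-preference node adds an extra delay before it will start the service. The table and function below are a hypothetical illustration, not RSF-1's configuration format.

    # Hypothetical per-service preference order with per-node extra delays:
    # the first node starts the service after its normal countdown; later
    # nodes wait the extra delay as well, so they only take over if the
    # preferred node never starts the service.
    SERVICE_PREFERENCES = {
        "Pool1": [("nodeA", 0), ("nodeB", 60)],   # nodeB waits an extra 60 s for Pool1
        "Pool2": [("nodeB", 0), ("nodeA", 60)],   # nodeA waits an extra 60 s for Pool2
    }

    def extra_delay(service, node):
        for name, delay in SERVICE_PREFERENCES[service]:
            if name == node:
                return delay
        raise ValueError(f"{node} is not configured for {service}")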

The ZFS services are now available to the rest of the network and each pool is accessible via the Virtual IPs. Each node continues to constantly monitor all other nodes in the cluster via all the available heartbeat channels.

In the event of a node failure (as determined by loss of heartbeats from all live mechanisms), the failover process is initiated. In this example, let’s assume that node A has crashed. Network access to ZFS Pool1 will hang momentarily during the failover process.

Node B begins the failover process by first breaking the low-level locks placed on the underlying ZFS Pool1 drives. It then imports the ZFS pool, starts the associated ZFS services and enables the Virtual IP. After a short interruption while failover completes, network storage access continues. Although RSF-1 exploits a number of mechanisms to speed up ZFS pool import, actual failover time will vary with the complexity of the ZFS pool structure, such as the number of drives, RAID levels, volume of ZFS snapshots and so on.
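
Conceptually, node B's takeover is the mirror image of a normal service start: replace node A's drive reservations with its own, force-import the pool and move the virtual IP across. The sketch below uses sg_persist's preempt action, a forced zpool import and the Linux ip command; the keys, device list, pool name and address are all assumptions for illustration.

    # Conceptual failover sequence on node B after node A's heartbeats are lost:
    # preempt the failed node's drive reservations, force-import the pool and
    # bring up the virtual IP locally. All values are illustrative.
    import subprocess

    OUR_KEY = "0xbbb222"      # hypothetical key for node B
    FAILED_KEY = "0xaaa111"   # hypothetical key held by the failed node A

    def failover(pool, devices, vip, iface):
        for dev in devices:
            # Register node B's key, then preempt node A's reservation so the
            # drives are fenced for this node (SCSI-3 persistent reservations).
            subprocess.run(["sg_persist", "--out", "--register",
                            "--param-sark=" + OUR_KEY, dev], check=True)
            subprocess.run(["sg_persist", "--out", "--preempt",
                            "--param-rk=" + OUR_KEY, "--param-sark=" + FAILED_KEY,
                            "--prout-type=5", dev], check=True)
        subprocess.run(["zpool", "import", "-f", pool], check=True)           # forced import
        subprocess.run(["ip", "addr", "add", vip, "dev", iface], check=True)  # move the VIP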

Node B is now running ZFS services for both Pool1 and Pool2 with minimal disruption.

In the event that node A recovers during or after the failover process, RSF-1 triggers an immediate panic and shutdown, as the underlying storage devices have been reserved elsewhere. When node A is restarted (e.g. after repair and/or reboot), it will rejoin the cluster and act as the standby server for both services. The system administrator can manually fail back ZFS Pool1 to node A at a convenient time.

Whilst the above example describes a simple two-node two-pool architecture, RSF-1 for ZFS supports multiple nodes, multiple pools and multiple VIPs to provide extremely flexible ZFS storage topologies.

Please view the short video below to see how RSF-1 for ZFS provides highly available ZFS services in more detail.

Contact us today for a free, no-obligation evaluation to see how RSF-1 for ZFS can work for you.

RESOURCES

RSF-1 HA Plugin ZFS Storage Cluster Concept Guide (pdf)

RSF-1 for ZFS Washington University Case Study (pdf)