Bringing High Availability to ZFS Storage

What is ZFS?

ZFS Enterprise Features
  • Self Healing
  • End-to-end data integrity
  • Massive scalability
  • Storage Pooling
  • Combined file system and volume manager
  • RAID-Z
  • Deduplication
  • Compression
  • Unlimited snapshots

ZFS, together with its open-source sibling OpenZFS, is an advanced storage platform that combines a file system with a logical volume manager and incorporates built-in enterprise features such as data protection, replication, deduplication, compression and unlimited snapshots. It is inherently massively scalable, allowing file sizes of up to 16 exabytes and up to 256 quadrillion zettabytes of storage. Other key features include storage pooling, RAID-Z, copy-on-write, end-to-end data integrity verification and automatic repair.
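These capacity figures follow from ZFS's 128-bit design. Assuming the units are read as binary powers (1 EiB = 2^60 bytes, 1 ZiB = 2^70 bytes, and "quadrillion" taken as 2^50; this interpretation is ours, not stated in the text), the arithmetic works out as:

```latex
\begin{align*}
\text{max file size} &= 16\ \text{EiB} = 2^{4} \times 2^{60}\ \text{bytes} = 2^{64}\ \text{bytes} \\
\text{max storage}   &= 256 \times 2^{50} \times 2^{70}\ \text{bytes} = 2^{8+50+70}\ \text{bytes} = 2^{128}\ \text{bytes} \approx 3.4 \times 10^{38}\ \text{bytes}
\end{align*}
```

Read with decimal prefixes instead, 256 quadrillion zettabytes is 256 × 10^15 × 10^21 ≈ 2.56 × 10^38 bytes, the same order of magnitude.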

ZFS is a transactional, self-healing, 128-bit file system that supports almost unlimited storage capacity. One of its distinctive features is that it calculates checksums of all data and metadata, allowing it to detect data corruption and, where the pool has redundancy (such as mirroring or RAID-Z), repair it automatically.
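The detect-and-repair behaviour can be illustrated with a simplified model. In ZFS the checksum of each block is stored in its parent block pointer rather than alongside the data, so a damaged copy cannot vouch for itself. The sketch below is a hypothetical, in-memory illustration assuming a two-way mirror and SHA-256 (OpenZFS defaults to fletcher4 but also supports SHA-256); the `MirroredBlock` class is invented for the example and does not correspond to ZFS code.

```python
import hashlib

# Hypothetical, simplified model: not ZFS's on-disk format.
# The expected checksum is held separately from the data (in ZFS it lives in
# the parent block pointer), so a corrupted copy cannot "vouch for itself".


def checksum(data: bytes) -> str:
    # SHA-256 chosen for simplicity; OpenZFS defaults to fletcher4.
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """One logical block stored as two mirrored copies."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]  # one copy per mirror side
        self.expected = checksum(data)  # recorded at write time by the "parent"

    def read(self) -> bytes:
        # Find a copy whose contents still match the expected checksum.
        good = next((bytes(c) for c in self.copies
                     if checksum(bytes(c)) == self.expected), None)
        if good is None:
            raise IOError("no valid copy: corruption detected but not repairable")
        # Self-heal: overwrite any damaged copy with the verified data.
        for i, c in enumerate(self.copies):
            if checksum(bytes(c)) != self.expected:
                self.copies[i] = bytearray(good)
        return good


block = MirroredBlock(b"important data")
block.copies[0][0] ^= 0xFF                 # simulate silent corruption on one mirror side
print(block.read())                        # b'important data' (served from the good copy)
print(block.copies[0] == block.copies[1])  # True: the damaged copy was repaired
```

Without a second copy the same checksum still detects the corruption, but there is nothing to repair from, which is why self-healing depends on pool redundancy.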

Because it combines what have traditionally been separate storage components (data management, storage and device management, virtualized volumes and file system management) into one, ZFS does not require hardware RAID controllers or battery-backed NVRAM to ensure data integrity in the event of physical component failure.

A Brief History of ZFS

Designed to overcome the limitations of general-purpose file systems, ZFS was originally conceived and created by Sun Microsystems in 2001 for its proprietary Solaris platform. In 2005, the source code was released as part of the open-source OpenSolaris platform. After acquiring Sun in 2010, Oracle placed ZFS development back under a closed-source license on its Solaris platform, and it continues to sell its own proprietary ZFS-based storage appliances.

Since 2005, however, growing open-source communities have ported and improved ZFS on other Unix and Unix-like platforms, including the OpenSolaris-derived illumos, OpenIndiana and OmniOS operating systems, Linux, Mac OS X and FreeBSD. Since 2013, ongoing ZFS development and releases have been coordinated by OpenZFS, an umbrella organisation whose team comprises individuals and companies that use, improve and promote the ZFS file system, including many commercial organisations that embed ZFS in their own products.

Why the Enterprise should consider OpenZFS

ZFS Storage Components
  • Commodity x86 server hardware
  • Any mix of HDD, SSD and NVMe drives, including all-flash array (AFA) configurations
  • Choice of Solaris, illumos, OmniOS, Linux and FreeBSD operating systems
  • OpenZFS software

OpenZFS is a reliable, key component of enterprise storage because it:

  • Has been proven in demanding enterprise environments for nearly two decades
  • Has almost unlimited scalable capacity
  • Is self-healing
  • Includes many advanced enterprise storage features
  • Has a mature and growing open-source community driving its continued development
  • Has a wealth of commercial buy-in
  • Is very cost-effective compared to closed proprietary offerings
  • Is compatible with the latest storage technologies and components (e.g. flash, NVMe, InfiniBand)
  • Is being deployed widely in all storage tiers across all vertical markets

In addition, OpenZFS can be deployed on a wide range of commodity hardware from a myriad of suppliers, on the operating system of your choice. Open source, open architecture, commodity hardware, highly flexible topologies and a broad choice of hardware vendors make ZFS enterprise-ready.

Bringing High Availability to ZFS Storage Appliances

As more than just a file system, incorporating advanced logical volume management capability, ZFS has a number of built-in features that customers would normally pay large additional license fees for elsewhere, for example thin provisioning, iSCSI and Fibre Channel support, NFS and CIFS sharing, unlimited snapshots and replication.

Whilst ZFS has built-in data verification and integrity checking, it is not a clustered file system: a ZFS pool can only be served by a single server head at a time, which makes that server a clear single point of failure. If the server fails, even though the data may be safe, the ZFS storage and its associated file and block services become inaccessible.

For enterprise use, businesses demand high availability to ensure business continuity in the event of system breakdown or disaster. No matter how reliable the hardware, there is no guarantee that service will remain available through the failure or error of a single hardware component, whether a storage device or a server head.

High Availability considerations for ZFS
  • Not a clustered file system
  • No pool import protection
  • No inherent data fencing mechanisms
  • Varying ZFS pool import times
  • Managing ZFS cache devices
  • File & Block service failover time and timeout
  • Network connectivity and failover
  • Failover management framework

Adding a second server head to improve storage availability in the event of a single server failure can be achieved using simple active/passive high availability and failover technologies. However, bringing enterprise-grade high availability features to ZFS based storage appliances is significantly more complicated than simply failing over ZFS pools from one server head to another.

Because its inventors foresaw that physical disk drives gradually wear out, ZFS does an excellent job of protecting against data corruption through continuous integrity checking and automatic repair. However, it is reasonably easy to inadvertently corrupt an entire ZFS pool in a clustered configuration if it is imported on two or more storage server nodes simultaneously.

Although a ZFS pool's disks can be physically connected to any number of storage nodes, ZFS has no concept of a pool being accessed by more than one node at a time, and no inherent protection against the catastrophic data corruption caused by multiple simultaneous imports. Whilst storage high availability is vital to keep services up, securing and protecting the data comes first.

In addition to ensuring pool availability and data protection, bulletproof data fencing is needed to protect the physical data from split-brain scenarios. For example, a server, or its network access, may freeze for a few moments and then recover during or after a service failover; without fencing, the recovered server and the new owner could then write to the same pool at the same time, potentially corrupting it.
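One common way to implement this kind of fencing is with monotonically increasing ownership tokens (often called epochs or generation numbers): every takeover issues a new token, and the storage layer rejects any write carrying an older one. The sketch below is a generic, hypothetical illustration of that technique; it is not a description of how RSF-1 implements fencing, and the `FencedPool` class is invented for the example. In a real cluster the check has to be enforced by the shared storage itself (SCSI-3 persistent reservations are one such mechanism), never by the node that might be frozen.

```python
# Hypothetical illustration of fencing with monotonically increasing tokens.
# In a real cluster this check is enforced by the shared storage, never by the
# (possibly frozen) node itself.


class FencedPool:
    def __init__(self):
        self.current_epoch = 0  # highest token issued so far
        self.log = []           # writes that were actually accepted

    def acquire(self) -> int:
        """Issue a new ownership token, implicitly fencing out all older ones."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, node: str, epoch: int, record: str) -> bool:
        """Accept a write only if it carries the newest token."""
        if epoch != self.current_epoch:
            return False  # stale owner: fenced out
        self.log.append((node, record))
        return True


pool = FencedPool()
epoch_a = pool.acquire()                     # node-a imports the pool
pool.write("node-a", epoch_a, "write 1")     # accepted
# node-a freezes; the cluster fails the pool over to node-b
epoch_b = pool.acquire()
pool.write("node-b", epoch_b, "write 2")     # accepted
# node-a recovers, still believing it owns the pool, and tries to write
print(pool.write("node-a", epoch_a, "write 3"))  # False: fenced out
print(pool.log)  # [('node-a', 'write 1'), ('node-b', 'write 2')]
```

The key design point is that the stale node's own view is never trusted: node-a still believes it owns the pool, but the shared token check refuses its write.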

It is also important to provide high availability and integrity for the associated file and block services (e.g. NFS, SMB, iSCSI, ALUA), to maintain connections for virtualized environments, and to ensure that replication, snapshot and backup schedules continue to run.

More complex storage topologies may also involve a large number of ZFS pools serviced across multiple servers, in which case a simple active/passive (two-node) configuration is not sufficient. Simple high availability solutions that rely on network heartbeats may also not accommodate stretched storage topologies, in which servers reside in different physical locations, or topologies that include a remote, replicated third node at a backup or disaster recovery site.
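To avoid the false failovers that a single network heartbeat can trigger, an HA framework can watch several independent heartbeat channels and presume the peer has failed only when all of them fall silent. The sketch below illustrates that rule under stated assumptions: the channel names (two network paths and a shared-disk heartbeat), the 5-second timeout and the `HeartbeatMonitor` class are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
# Hypothetical sketch: presume the peer has failed only when every heartbeat
# channel has been silent for longer than the timeout.


class HeartbeatMonitor:
    def __init__(self, channels, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {ch: None for ch in channels}  # None = never heard

    def record(self, channel: str, now: float) -> None:
        """Note a heartbeat from the peer on `channel` at time `now` (seconds)."""
        self.last_seen[channel] = now

    def live_channels(self, now: float) -> list:
        """Channels that have delivered a heartbeat within the timeout."""
        return [ch for ch, t in self.last_seen.items()
                if t is not None and now - t <= self.timeout_s]

    def peer_presumed_down(self, now: float) -> bool:
        # Losing one path (e.g. a WAN link in a stretched cluster) is not enough;
        # failover is considered only when all paths are silent.
        return not self.live_channels(now)


channels = ["net-primary", "net-secondary", "shared-disk"]
mon = HeartbeatMonitor(channels, timeout_s=5.0)
for ch in channels:
    mon.record(ch, now=0.0)
mon.record("shared-disk", now=8.0)       # network paths go quiet; disk heartbeat continues
print(mon.peer_presumed_down(now=10.0))  # False: shared-disk is still alive
print(mon.peer_presumed_down(now=20.0))  # True: every channel is silent
```

Independent channels only help if they fail independently; two network heartbeats over the same switch add little protection against a single switch failure.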

RSF-1 for ZFS
  • Multiple server, multiple pool failover support
  • File and block service failover
  • Multiple heartbeat mechanisms
  • Bulletproof data fencing
  • ZFS-specific fast failover
  • Stretch metro capability
  • API integration framework
  • GUI and CLI administration toolset
  • Proven 20+ year enterprise pedigree
  • Proven 10+ year ZFS support
  • Thousands of ZFS enterprise deployments

Bringing these enterprise-grade high availability features to a critical ZFS storage deployment also requires a bulletproof and proven high availability solution; this is where RSF-1 for ZFS comes in.

RSF-1 for ZFS allows multiple ZFS pools to be managed across multiple servers, providing High Availability for both block and file services beyond a traditional two-node Active/Active or Active/Passive topology. With RSF-1 for ZFS Metro edition, highly available ZFS services can also extend beyond a single data centre.
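Managing multiple pools across multiple servers can be pictured as each pool carrying an ordered list of preferred servers and being served by the first of those servers that is currently up. The following is a hypothetical sketch of that idea only, not RSF-1's configuration format or placement algorithm; the pool names, node names and the `place_pools` function are invented for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical sketch: each pool lists its preferred servers in order and is
# served by the first one that is currently up.


def place_pools(preferences: Dict[str, List[str]],
                up: Set[str]) -> Dict[str, Optional[str]]:
    placement: Dict[str, Optional[str]] = {}
    for pool, servers in preferences.items():
        # None means no surviving server is eligible to host this pool.
        placement[pool] = next((s for s in servers if s in up), None)
    return placement


prefs = {
    "pool-vm":     ["node1", "node2", "node3"],
    "pool-backup": ["node2", "node3", "node1"],
    "pool-files":  ["node3", "node1", "node2"],
}
print(place_pools(prefs, up={"node1", "node2", "node3"}))
# {'pool-vm': 'node1', 'pool-backup': 'node2', 'pool-files': 'node3'}
print(place_pools(prefs, up={"node1", "node3"}))  # node2 has failed
# {'pool-vm': 'node1', 'pool-backup': 'node3', 'pool-files': 'node3'}
```

Giving each pool a different first preference spreads the load in an active/active layout during normal operation, while the fallback order decides where each pool moves when a server fails. In practice any such move must be preceded by fencing the failed owner.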

A typical 2-server RSF-1 for ZFS High Availability Topology

RSF-1 for ZFS is therefore more than a disaster recovery solution that fails back to a previous point-in-time snapshot: it provides real-time failover in the event of failure or disaster, with data always up to date and failover transparent to system users. It is also a sophisticated system administration tool that allows controlled management of storage services, facilitating smooth and seamless system upgrades and maintenance.

Managed by a standalone GUI, a command line interface and a rich API, RSF-1 for ZFS can also be easily integrated into your own administration toolset.

At the heart of the RSF-1 for ZFS solution is a mature and stable enterprise class high availability product. It was the first commercial HA solution for Sun/Solaris environments and has a 20+ year track record in data centres worldwide providing high-availability assurance for some of the most demanding customer service availability needs. RSF-1 for ZFS has provided Enterprise-grade High-Availability ZFS Storage services to thousands of mission-critical deployments across all industries worldwide since 2009.