A ZFS pool holds two types of data: the actual data being stored in the pool (e.g. videos, pictures, documents) and
additional information known as metadata (pool properties, history, the DDT, pointers to the actual data on-disk, and so on).
By default the metadata is stored alongside the actual data in the pool's vdevs, meaning that whenever metadata is required ZFS must
first read it from those vdevs, followed by another read to fetch the actual data; note also that ZFS may have
to scan multiple vdevs when searching for and reading metadata.
A ZFS special device is a vdev dedicated to storing a pool's metadata in preference to storing that data in the pool's
regular vdevs. Adding a dedicated SSD/NVMe special device can therefore deliver significant performance improvements, especially
when the pool consists of many small files.
Important
When special vdevs are added to a pool they become an integral part of that pool:
should the special vdev become unavailable, the whole pool becomes unavailable. For this
reason it is critical to introduce redundancy when creating/adding a special vdev; at the very least
it should be mirrored.
Special device considerations:
If more than one special device is specified, then allocations are load-balanced between those devices.
ZFS does not load balance within an individual mirrored special device - if a pool has more
than one special device (each of which should be mirrored) then it will load balance between them.
When adding a special device, existing metadata in the pool is not migrated - only new
metadata will be stored there; best practice is therefore to add special devices at pool creation time (see the example after this list).
If the special device becomes full, then ZFS will switch back to using the pool vdevs to store metadata.
In a shared storage cluster, any pools utilising special devices must have those devices in the shared storage,
not held locally in the cluster nodes themselves.
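As a sketch (the pool and device names here are illustrative, and the special vdev is mirrored as advised above), a pool could be created with a special vdev:
# zpool create -o ashift=12 pool1 mirror sdc sdd special mirror nvme0n1 nvme1n1
or a mirrored special vdev added to an existing pool:
# zpool add pool1 special mirror nvme0n1 nvme1n1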
Small files
Special devices may also be provisioned to store small files up to a specified block size,
dictated by the ZFS property special_small_blocks=size. Files smaller than or equal to this value will
be saved to the special device; valid values are zero or a power of two from 512B to 1M. The default is a size of 0,
which equates to no small files being saved to the special device. For clustering we do not recommend using
the special device for small files (i.e. leave special_small_blocks with the value 0).
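Should small file support be required on a non-clustered pool, it can be enabled per dataset; a minimal sketch (the pool and dataset names are illustrative):
# zfs set special_small_blocks=64K pool1/dataset1
# zfs get special_small_blocks pool1/dataset1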
General Recommendations
Alignment Shift (ashift=n)
The ashift property determines the minimum block allocation size
that ZFS will use per vdev (not per pool, as is sometimes
mistakenly thought). Ideally this value should match
the sector size of the underlying physical device, the
sector size being the smallest physical unit that can
be read from or written to that device.
Traditionally hard drives had a sector size of 512 bytes; nowadays
most drives come with a 4KiB sector size and some even with an
8KiB sector size (for example modern SSDs).
When a device is added to a vdev (including at pool creation) ZFS will
attempt to automatically detect the underlying sector size by querying
the OS, and then set the ashift property accordingly. However, disks
can mis-report this information in order to remain compatible with older operating systems that
only support 512 byte sectors (most notably Windows XP). We
therefore strongly advise administrators to check the real
sector size of devices being added to a pool and to set the ashift
property accordingly, as shown below.
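On Linux, one way to check the sector sizes a drive reports is with lsblk (the device name here is illustrative):
# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdc
A physical sector size (PHY-SEC) of 4096 combined with a logical sector size (LOG-SEC) of 512 indicates a drive emulating 512 byte sectors, for which ashift=12 is the appropriate setting.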
Read/Write amplification
Setting the ZFS block size too low can result in a significant
performance degradation referred to as read/write amplification.
For example, if the ZFS block size were set to 512 bytes but the
underlying device sector size is 4KiB, then every 512 byte write
forces the device to read the full 4KiB sector, modify the 512 bytes
within it, and write the entire 4KiB sector back out again.
Aligning the ZFS block size with the device
sector size avoids this read/write penalty. In contrast, setting the
ZFS block size greater than the device sector size carries little or no
performance penalty; indeed, on devices with 512 byte sectors we
recommend a 4KiB ZFS block size setting for future-proofing.
An ashift value is a bit shift value (i.e. 2 to the power of ashift), so
a 512 byte block size is set as ashift=9 (2^9 = 512) and a 4KiB block size
as ashift=12 (2^12 = 4096). Valid ashift values range from 9 to 16, with the
default value of 0 meaning that ZFS should auto-detect the sector size.
Once set there's no going back
Once set, the ashift value of a vdev is immutable. Specifying the wrong value can
irrevocably harm the pool with an under-performing vdev, the only
real remedy being to destroy the pool and start again.
Here are some examples of setting the ashift property for 4KiB devices, firstly when creating a pool:
# zpool create -o ashift=12 pool1 mirror sdc sdd
Adding devices to the pool:
# zpool add -o ashift=12 pool1 mirror sde sdf
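To verify the ashift actually in use by a pool's vdevs, one option (assuming the zdb utility is available on the system) is to inspect the pool configuration:
# zdb -C pool1 | grep ashift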
Access time (atime=on|off)
Like most Unix filesystems, ZFS maintains three timestamps per file,
the access timestamp (atime), the modified timestamp (mtime), and the
changed timestamp (ctime). The access time represents the last time
the file was read, the modified time represents the last time the
contents were modified and the changed timestamp refers to the last
time some metadata related to the file was changed.
Maintaining atime means ZFS has to update the access time every time
a file is read and commit that update to disk for every access
operation (for example during a backup). This introduces unnecessary
overhead and a performance hit.
ZFS allows you to control the updating of the atime timestamp through
the atime flag. Its value is set to either on or off (it
is set to on by default for backwards compatibility - POSIX
standards require that the access time for a file properly reflect
the last time it was read). We recommend setting this property to off.
To check the current atime setting run the following command:
# zfs get atime <pool>
To disable access time for a whole pool:
# zfs set atime=off <pool>
The atime setting can also be applied to individual datasets. To check
the current setting:
# zfs get atime <pool>/<dataset>
To disable access time for a dataset:
# zfs set atime=off <pool>/<dataset>
Linux specific: relatime
ZFS on Linux provides an additional setting, relatime. Setting this
value to on means the access time is only updated under one of two
circumstances:
If the mtime or ctime value has changed
If the existing access time has not been updated within the
past 24 hours (it will be updated the next time the file is accessed)
In order to use relatime the atime setting must also be enabled:
# zfs set atime=on <pool>
# zfs set relatime=on <pool>
or
# zfs set atime=on <pool>/<dataset>
# zfs set relatime=on <pool>/<dataset>
Record size (recordsize=n)
The recordsize property sets the maximum size of a logical block in a
ZFS dataset. Unlike many other file systems, ZFS has a variable
record size, meaning a file is stored either as a single block of
varying size (up to recordsize) or as multiple blocks of recordsize bytes.
The default recordsize for ZFS is 128KiB, meaning ZFS will dynamically
allocate blocks of any size from 512B to 128KiB depending on the size
of the file being written. A file whose size is greater than recordsize
will be written in full 128KiB blocks (so a file of size 156KiB will be
written as two 128KiB blocks, with the first block containing the
first 128KiB of the file and the second block containing the remaining
28KiB).
This space allocation behaviour can be demonstrated with the following
example. Firstly, create a file system with a 100M quota and a
record size of 1M:
# zfs create -o quota=100M -o recordsize=1M pool/recordsize
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             5.43M  1.74G    25K  /pool
pool/recordsize    24K   100M    24K  /pool/recordsize
Next create 10 files of size 512K:
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=512K count=1; done
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             10.3M  1.74G    24K  /pool
pool/recordsize  5.03M  95.0M  5.03M  /pool/recordsize
ZFS is reporting an overall used space of 5.03M. This is the expected
outcome as each file will only occupy 512K of disk space, even though
the record size is 1M. If ZFS were allocating the full record size of
1M then the used space would be in the region of 10M.
Now repeat the same exercise but using a file size of 1148K
(i.e. slightly larger than the 1M record size):
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=1148K count=1; done
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             25.5M  1.72G    24K  /pool
pool/recordsize  20.0M  80.0M  20.0M  /pool/recordsize
ZFS is now reporting an overall space usage of 20M, thus showing it is
using a record size of 1M, with each file requiring two 1M records.
Database considerations
When creating filesystems for database use, one of the most
important considerations is record size. Typically databases
write fixed-size blocks in a random manner, so the
record size of the ZFS filesystem should match the block size
used by the database.
To show why this is important, consider a filesystem with a record size
of 128KiB used by a database with a 64KiB block size. Whenever the database
writes a block, ZFS must first read the 128KiB record, update
64KiB of that record and then write the whole record back. If, however,
the ZFS record size is 64KiB, then ZFS need only
write the 64KiB record, avoiding the overhead of write
amplification.
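As a sketch (the pool and dataset names are illustrative), a dataset intended for a database using a 64KiB block size might be created and checked as follows:
# zfs create -o recordsize=64K pool1/db
# zfs get recordsize pool1/db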
Large file considerations
When dealing with larger files, such as videos and photos, where
data is typically read and written sequentially, the recordsize
should be set to 1M for optimal performance.
This reduces the overall IOPS load on the system by reducing the number of
individual records that need to be processed. It can also help with
compression, as compression is performed on an individual record basis.
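For example (the dataset name is illustrative), an existing dataset holding video files could be adjusted with:
# zfs set recordsize=1M pool1/media
Remember that, as noted below, the new record size only applies to files written after the change.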
Other considerations
Modifying a file system's record size only affects files created after
the modification; existing files are left untouched.
Record size must be a power of 2 and cannot exceed 1M (a limit raised
from 128KiB by the introduction of the large_blocks feature flag in 2015):
# zfs create -o quota=100m -o recordsize=4M pool/recordsize
cannot create 'pool/recordsize': 'recordsize' must be power of 2 from 512B to 1M
Compression (compression=setting)
The ZFS compression feature, when enabled, compresses data before it
is written out to disk (as the copy-on-write nature of ZFS makes it easier to
offer inline compression). By default ZFS does not enable
compression. To check the current compression value of a pool use the
following command:
# zfs get compression <pool>
NAME    PROPERTY     VALUE  SOURCE
<pool>  compression  off    default
The available compression types are as follows:
Setting      Description
off          No compression
on           Default compression (currently lz4)
gzip         GNU gzip compression - the Unix standard
gzip-[1-9]   Selects a specific gzip level; gzip-1 provides the fastest gzip compression, gzip-9 provides the best data compression, and gzip-6 is the default when gzip is specified
lzjb         Performance optimised with good compression. The original ZFS compression algorithm, deprecated in favour of lz4, which is superior in all respects
zle          Zero Length Encoding - compresses long sequences of zeros whilst leaving normal data untouched. Aimed at incompressible data (i.e. already compressed formats) such as GIF, MP4 and JPEG
lz4          A streaming algorithm with rapid compression/decompression; the best overall choice
zstd         A newer compression algorithm from the creator of lz4; zstd offers better compression ratios but is slower than lz4
The recommended setting for compression is currently lz4, which is the
default method chosen when compression is set to on. However, we
recommend specifying the setting so it's clear which algorithm is being used:
# zfs set compression=lz4 <pool>
# zfs get compression <pool>
NAME    PROPERTY     VALUE  SOURCE
<pool>  compression  lz4    local
The compression property is inherited, so setting compression on a pool
will result in that value being used for any child datasets:
# zfs get compression pool
NAME  PROPERTY     VALUE  SOURCE
pool  compression  off    local
# zfs create pool/compressiontest
# zfs get compression pool/compressiontest
NAME                  PROPERTY     VALUE  SOURCE
pool/compressiontest  compression  off    inherited from pool
# zfs set compression=lz4 pool
# zfs get compression pool/compressiontest
NAME                  PROPERTY     VALUE  SOURCE
pool/compressiontest  compression  lz4    inherited from pool
Apply compression early on in a pool's life
The compression setting is only applied to newly written
data; it is not retrospectively applied to data already written to
disk. We therefore recommend enabling compression at pool
creation time.
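Once compression has been enabled and data written, its effectiveness can be gauged with the compressratio property (the pool name is illustrative):
# zfs get compressratio pool1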
Linux - Extended Attributes (xattr=on|off|sa|dir)
The xattr property defines how ZFS will handle Linux extended
attributes (xattrs) in a file system. The default setting of on (which maps to
dir) stores those attributes in hidden subdirectories. This can
result in a performance impact as multiple lookups may be
required when accessing a file.
Changing this setting to sa (System Attributes) means the attributes
are stored directly in the inodes, resulting in fewer IO requests when
extended attributes are in use. For a file system with many small
files this can deliver a significant performance improvement - for file
systems with fewer, larger files the impact is less significant.
The default setting of on/dir is for backwards compatibility,
as the original implementation of ZFS did not have the facility
to store extended attributes in inodes.
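A minimal sketch of applying this recommendation (the pool name is illustrative; the setting can equally be applied per dataset):
# zfs set xattr=sa pool1
# zfs get xattr pool1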
Recommended settings
Property     Recommended Value  Description
ashift       12                 4KiB block size
atime        off                Do not update atime on file read
recordsize   64KiB              Smaller record size for databases (match the database block size)
recordsize   128KiB             Standard usage (mixture of file sizes)
recordsize   1M                 Recommended for large files
compression  lz4                Set compression to use the lz4 algorithm
xattr        sa                 Store Linux extended attributes in inodes rather than in files in hidden folders
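As a closing sketch (pool and device names are illustrative), the recommendations above could be combined at pool creation time, leaving recordsize at its 128KiB default and overriding it per dataset for databases or large files:
# zpool create -o ashift=12 -O atime=off -O compression=lz4 -O xattr=sa pool1 mirror sdc sdd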