ZFS Tuning and Optimisation
General Recommendations
Alignment Shift (ashift=n)
The ashift property determines the block allocation size that ZFS will use per vdev (not per pool as is sometimes mistakenly thought). Ideally this value should be set to the sector size of the underlying physical device (the sector size being the smallest physical unit that can be read or written from/to that device).
Traditionally hard drives had a sector size of 512 bytes; nowadays most drives come with a 4KiB sector size and some even with an 8KiB sector size (for example modern SSDs).
When a device is added to a vdev (including at pool creation) ZFS will attempt to detect the underlying sector size automatically by querying the OS, and then set the ashift property accordingly. However, disks can mis-report this information in order to remain compatible with older operating systems that only support 512 byte sectors (most notably Windows XP). We therefore strongly advise administrators to be aware of the real sector size of devices being added to a pool and to set the ashift property accordingly.
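On Linux, one way to check what a device actually reports is lsblk, which can print both the physical and logical sector sizes (the device name below is a placeholder for your own disk):
# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdc
A drive reporting a 4096 byte PHY-SEC behind a 512 byte LOG-SEC is a 4KiB device presenting 512 byte logical sectors, and should be given ashift=12.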
Read/Write amplification
Setting the ZFS block size too low can result in a significant performance degradation referred to as read/write amplification. For example, if the ZFS block size were set to 512 bytes but the underlying device sector size is 4KiB, then each 512 byte write forces the device to read the containing 4KiB sector, modify 512 bytes of it, and write the full 4KiB sector back out. Aligning the ZFS block size with the device sector size avoids this read-modify-write penalty. In contrast, setting the ZFS block size greater than the device sector size carries little or no performance penalty; indeed, on devices with 512 byte sectors we recommend a 4KiB ZFS block size setting for future-proofing.
An ashift value is a bit shift value (i.e. 2 to the power of ashift), so a 512 byte block size is set as ashift=9 (2^9 = 512). The ashift values range from 9 to 16, with the default value 0 meaning that ZFS should auto-detect the sector size.
Once set there's no going back
The ashift property is immutable once set on a vdev. Specifying the wrong value can irrevocably harm the pool with an under-performing vdev, with the only real remedy being to destroy the pool and start again.
Here are some examples of setting the ashift property for 4KiB devices, first when creating a pool and then when adding a vdev to an existing pool:
# zpool create -o ashift=12 pool1 mirror sdc sdd
# zpool add -o ashift=12 pool1 mirror sde sdf
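To verify the ashift value actually in use per vdev, the pool configuration can be inspected with zdb (a quick check, assuming the pool appears in the default cache file; the exact output format varies between platforms):
# zdb -C pool1 | grep ashift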
Access time (atime=on|off)
Like most Unix filesystems, ZFS maintains three timestamps per file, the access timestamp (atime), the modified timestamp (mtime), and the changed timestamp (ctime). The access time represents the last time the file was read, the modified time represents the last time the contents were modified and the changed timestamp refers to the last time some metadata related to the file was changed.
Maintaining atime means ZFS must update the access time every time a file is read, and commit that update to disk, turning every read into a write (consider, for example, a backup job that reads every file). This introduces unnecessary overhead and a performance hit.
ZFS allows you to control the updating of the atime timestamp through the atime property, whose value is either on or off. It is set to on by default for backwards compatibility: POSIX standards require that the access time of a file properly reflect the last time it was read. We recommend setting this property to off.
To check the current atime setting run the following command:
# zfs get atime <pool>
To disable access time for a whole pool:
# zfs set atime=off <pool>
The atime setting can also be applied to individual datasets. To check the current setting:
# zfs get atime <pool>/<dataset>
To disable access time for a dataset:
# zfs set atime=off <pool>/<dataset>
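To review the setting across a pool and all of its datasets in one pass, zfs get also accepts the -r (recursive) flag:
# zfs get -r atime <pool>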
Linux specific: relatime
ZFS on Linux provides an additional property, relatime. Setting this value to on means the access time is only updated under one of two circumstances:
- If the mtime or ctime values have changed
- If the existing access time has not been updated within the past 24 hours (it will be updated the next time the file is accessed)
In order to use relatime, the atime setting must also be enabled:
# zfs set atime=on <pool>
# zfs set relatime=on <pool>
# zfs set atime=on <pool>/<dataset>
# zfs set relatime=on <pool>/<dataset>
Record size (recordsize=n)
The recordsize property sets the maximum size of a logical block in a ZFS dataset. Unlike many other file systems, ZFS uses a variable record size, meaning a file is stored either as a single block of variable size or as multiple blocks of recordsize bytes.
The default recordsize for ZFS is 128KiB, meaning ZFS will dynamically allocate blocks of any size from 512B to 128KiB depending on the size of the file being written. A file whose size is greater than recordsize will have a block size of 128KiB (so a file of size 156KiB will be written as two blocks of 128KiB, with the first block containing the first 128KiB of the file and the second block containing the remaining 28KiB).
This space allocation behaviour can be demonstrated with the following example. First, create a file system with a 100M quota and a record size of 1M:
# zfs create -o quota=100M -o recordsize=1M pool/recordsize
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool 5.43M 1.74G 25K /pool
pool/recordsize 24K 100M 24K /pool/recordsize
Next create 10 files of size 512K:
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=512K count=1; done
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool 10.3M 1.74G 24K /pool
pool/recordsize 5.03M 95.0M 5.03M /pool/recordsize
ZFS is reporting an overall used space of 5.03M. This is the expected outcome as each file will only occupy 512K of disk space, even though the record size is 1M. If ZFS were allocating the full record size of 1M then the used space would be in the region of 10M.
Now repeat the same exercise but using a file size of 1148K (i.e. slightly larger than the 1M record size):
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=1148K count=1; done
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool 25.5M 1.72G 24K /pool
pool/recordsize 20.0M 80.0M 20.0M /pool/recordsize
ZFS is now reporting an overall space usage of 20M, thus showing it is using a record size of 1M, with each file requiring two 1M records.
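As a quick sanity check, du can confirm the on-disk footprint of an individual file; with each 1148K file occupying two 1M records (and compression off, as is the default), du should report roughly 2M per file:
# du -h /pool/recordsize/1.txt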
Database considerations
When creating file systems for database use, one of the most important considerations is record size. Databases typically write fixed-size blocks in a random access pattern. Because of this, the record size of the ZFS file system should match the block size used by the database.
To show why this is important, consider a file system with a record size of 128KiB used by a database with a 64KiB block size. Whenever the database writes a block, ZFS must first read the 128KiB record, update 64KiB of that record and then write the whole record back. If the ZFS record size is 64KiB instead, ZFS need only perform a write of the 64KiB record, avoiding the overhead of write amplification.
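As a sketch, a dataset for a database writing 64KiB blocks might be created as follows (pool/db is a placeholder name; match recordsize to your database's actual block size, which for many relational databases is 8K or 16K, and note that atime=off follows the earlier recommendation):
# zfs create -o recordsize=64K -o atime=off pool/db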
Large file considerations
When dealing with larger files, such as videos or photos, where data is typically read and written sequentially, the recordsize should be set to 1M for optimal performance.
This reduces the overall IOPS load on the system by reducing the number of individual records needing to be processed. It can also help with compression, as compression is performed on a per-record basis.
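For example, a dataset dedicated to large media files (pool/media is a placeholder name) could be created with:
# zfs create -o recordsize=1M pool/media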
Other considerations
- Modifying a file system's record size only affects files created after the modification; existing files are left untouched (see the sketch after this list).
- Record size must be a power of 2 and cannot exceed 1M (a limit raised to 1M with the introduction of the large_blocks feature flag in 2015):
# zfs create -o quota=100m -o recordsize=4M pool/recordsize
cannot create 'pool/recordsize': 'recordsize' must be power of 2 from 512B to 1M
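As a sketch of the first point, using the file names from the earlier example: after lowering the record size, a fresh copy of a file picks up the new value while the original is left untouched:
# zfs set recordsize=128K pool/recordsize
# cp /pool/recordsize/1.txt /pool/recordsize/1-new.txt
Here 1-new.txt is written with 128K records, while 1.txt keeps its original 1M records.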
Compression (compression=setting)
The ZFS compression feature, when enabled, compresses data before it is written out to disk (as the copy-on-write nature of ZFS makes it easier to offer inline compression). By default ZFS does not enable compression. To check the current compression value of a pool use the following command:
# zfs get compression <pool>
NAME PROPERTY VALUE SOURCE
<pool> compression off default
The available compression types are as follows:
Setting | Description |
---|---|
off | No compression |
on | Default compression (currently lz4) |
gzip | GNU gzip compression - the Unix standard |
gzip-[1-9] | Selects a specific gzip level. gzip-1 provides the fastest gzip compression. gzip-9 provides the best data compression. gzip-6 is the default when you specify gzip |
lzjb | Performance optimised with good compression. The original ZFS compression algorithm, deprecated in favour of lz4, which is superior in all respects |
zle | Zero Length Encoding - compresses long runs of zeros whilst leaving normal data untouched. Aimed at incompressible data (i.e. already compressed formats such as GIF, MP4, JPEG etc.) |
lz4 | A streaming algorithm with rapid compression/decompression, the best overall choice |
zstd | A newer compression algorithm from the author of lz4. Zstd offers better compression ratios than lz4, but is slower |
The recommended setting for compression is currently lz4, which is the default method chosen when compression is set to on. However, we recommend specifying the setting explicitly so it is clear which algorithm is being used:
# zfs set compression=lz4 <pool>
# zfs get compression <pool>
NAME PROPERTY VALUE SOURCE
<pool> compression lz4 local
Datasets inherit their compression setting from the parent unless it is set locally, as the following example shows (here the pool starts with compression off):
# zfs get compression pool
NAME PROPERTY VALUE SOURCE
pool compression off local
# zfs create pool/compressiontest
# zfs get compression pool/compressiontest
NAME PROPERTY VALUE SOURCE
pool/compressiontest compression off inherited from pool
# zfs set compression=lz4 pool
# zfs get compression pool/compressiontest
NAME PROPERTY VALUE SOURCE
pool/compressiontest compression lz4 inherited from pool
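Once compression is enabled, its effectiveness can be checked at any time through the read-only compressratio property:
# zfs get compressratio pool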
Apply compression early on in a pool's life
The compression setting is only applied to newly written data; it is not retrospectively applied to data already written to disk. We therefore recommend enabling compression at pool creation time.
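File system properties such as compression can be set at pool creation time using zpool create's -O flag (note the capital O, which sets properties on the pool's root dataset, as opposed to the lowercase -o used for pool properties such as ashift):
# zpool create -o ashift=12 -O compression=lz4 pool1 mirror sdc sdd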
Linux - Extended Attributes (xattr=on|off|sa|dir)
The xattr property defines how ZFS handles Linux eXtended ATTRibutes in a file system. The default setting of on (which maps to dir) stores those attributes in hidden subdirectories. This can result in a performance impact, as multiple attribute lookups may be required when accessing a file.
Changing this setting to sa (System Attributes) means the attributes are stored directly in the inodes, resulting in fewer IO requests when extended attributes are in use. For a file system with many small files this can bring a significant performance improvement; for file systems with fewer, larger files the impact is less significant.
The default setting of on/dir is retained for backwards compatibility, as the original implementation of ZFS did not have the facility to store extended attributes in inodes.
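Switching to system-attribute storage is a one-line change (as with the other properties discussed, it only affects newly created files):
# zfs set xattr=sa <pool>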
Recommended settings
Property | Recommended Value | Description |
---|---|---|
ashift | 12 | 4KiB sector size |
atime | off | Do not update atime on file read |
recordsize | 64KiB | Smaller record sizes for databases (match the database block size) |
recordsize | 128KiB | Standard usage (mixture of file sizes) |
recordsize | 1M | Recommended for large files |
compression | lz4 | Set compression to use the lz4 algorithm |
xattr | sa | Store Linux extended attributes in inodes rather than as files in hidden directories |