A ZFS pool holds two types of data: the actual data being stored in the pool (e.g. videos, pictures, documents) and
additional information known as metadata (pool properties, history, the DDT, pointers to the actual data on-disk, and so on).
By default the metadata is stored alongside the actual data in the pool's vdevs, meaning that whenever metadata is required ZFS must
first read it from those vdevs, followed by another read to fetch the actual data; note also that ZFS may have
to scan multiple vdevs when searching for and reading metadata.
A ZFS special device is a vdev dedicated to storing a pool's metadata in preference to storing that data in the pool's
regular vdevs. Adding a dedicated SSD/NVMe special device can therefore deliver significant performance improvements, especially
when the pool consists of many small files.
Important
When special vdevs are added to a pool they become an integral part of that pool:
should the special vdev become unavailable, the whole pool becomes unavailable. For this
reason it is critical to introduce redundancy when creating/adding a special vdev; at the very least
it should be mirrored.
Special device considerations:
If more than one special device is specified, then allocations are load-balanced between those devices.
ZFS does not load balance within an individual mirrored special device - if a pool has more
than one special device (each of which should be mirrored) then it will load balance between them.
When adding a special device, existing metadata in the pool is not migrated - only new
metadata will be stored there; best practice is therefore to add special devices at pool creation time (see the example after this list).
If the special device becomes full, then ZFS will switch back to using the pool vdevs to store metadata.
In a shared storage cluster, any pools utilising special devices must have those devices in the shared storage,
not held locally in the cluster nodes themselves.
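As a sketch (the pool and device names here are illustrative, and the special vdev is mirrored as advised above), a pool could be created with a special vdev:
# zpool create -o ashift=12 pool1 mirror sdc sdd special mirror nvme0n1 nvme1n1
or a mirrored special vdev added to an existing pool:
# zpool add pool1 special mirror nvme0n1 nvme1n1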
Small files
Special devices may also be provisioned to store small files up to a specified block size,
dictated by the ZFS property special_small_blocks=size. Files smaller than or equal to this value will
be saved to the special device; valid values are zero or a power of two from 512B to 1M. The default is a size of 0,
which equates to no small files being saved to the special device. For clustering we do not recommend using
the special device for small files (i.e. leave special_small_blocks with the value 0).
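Should small file support be required on a non-clustered pool, it can be enabled per dataset; a minimal sketch (the pool and dataset names are illustrative):
# zfs set special_small_blocks=64K pool1/dataset1
# zfs get special_small_blocks pool1/dataset1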
General Recommendations
Alignment Shift (ashift=n)
The ashift property determines the minimum block allocation size
that ZFS will use per vdev (not per pool, as is sometimes
mistakenly thought). Ideally this value should match
the sector size of the underlying physical device, the
sector size being the smallest physical unit that can
be read from or written to that device.
Traditionally hard drives had a sector size of 512 bytes; nowadays
most drives come with a 4KiB sector size and some even with an
8KiB sector size (for example modern SSDs).
When a device is added to a vdev (including at pool creation) ZFS will
attempt to automatically detect the underlying sector size by querying
the OS, and then set the ashift property accordingly. However, disks
can mis-report this information in order to remain compatible with older operating systems that
only support 512 byte sectors (most notably Windows XP). We
therefore strongly advise administrators to check the real
sector size of devices being added to a pool and to set the ashift
property accordingly, as shown below.
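On Linux, one way to check the sector sizes a drive reports is with lsblk (the device name here is illustrative):
# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdc
A physical sector size (PHY-SEC) of 4096 combined with a logical sector size (LOG-SEC) of 512 indicates a drive emulating 512 byte sectors, for which ashift=12 is the appropriate setting.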
Read/Write amplification
Setting the ZFS block size too low can result in a significant
performance degradation referred to as read/write amplification.
For example, if the ZFS block size were set to 512 bytes but the
underlying device sector size is 4KiB, then every 512 byte write
forces the device to read the full 4KiB sector, modify the 512 bytes
within it, and write the entire 4KiB sector back out again.
Aligning the ZFS block size with the device
sector size avoids this read/write penalty. In contrast, setting the
ZFS block size greater than the device sector size carries little or no
performance penalty; indeed, on devices with 512 byte sectors we
recommend a 4KiB ZFS block size setting for future-proofing.
An ashift value is a bit shift value (i.e. 2 to the power of ashift), so
a 512 byte block size is set as ashift=9 (2^9 = 512) and a 4KiB block size
as ashift=12 (2^12 = 4096). Valid ashift values range from 9 to 16, with the
default value of 0 meaning that ZFS should auto-detect the sector size.
Once set there's no going back
Once set, the ashift value of a vdev is immutable. Specifying the wrong value can
irrevocably harm the pool with an under-performing vdev, the only
real remedy being to destroy the pool and start again.
Here are some examples of setting the ashift property for 4KiB devices, firstly when creating a pool:
# zpool create -o ashift=12 pool1 mirror sdc sdd
Adding devices to the pool:
# zpool add -o ashift=12 pool1 mirror sde sdf
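To verify the ashift actually in use by a pool's vdevs, one option (assuming the zdb utility is available on the system) is to inspect the pool configuration:
# zdb -C pool1 | grep ashift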
Access time (atime=on|off)
Like most Unix filesystems, ZFS maintains three timestamps per file,
the access timestamp (atime), the modified timestamp (mtime), and the
changed timestamp (ctime). The access time represents the last time
the file was read, the modified time represents the last time the
contents were modified and the changed timestamp refers to the last
time some metadata related to the file was changed.
Maintaining atime means ZFS has to update the access time every time
a file is read and commit that update to disk for every access
operation (for example during a backup). This introduces unnecessary
overhead and a performance hit.
ZFS allows you to control the updating of the atime timestamp through
the atime flag. Its value is set to either on or off (it
is set to on by default for backwards compatibility - POSIX
standards require that the access time for a file properly reflect
the last time it was read). We recommend setting this property to off.
To check the current atime setting run the following command:
# zfs get atime <pool>
To disable access time for a whole pool:
# zfs set atime=off <pool>
The atime setting can also be applied to individual datasets. To check
the current setting:
# zfs get atime <pool>/<dataset>
To disable access time for a dataset:
# zfs set atime=off <pool>/<dataset>
Linux specific: relatime
ZFS on Linux provides an additional setting, relatime. Setting this
value to on means the access time is only updated under one of two
circumstances:
If the mtime or ctime value has changed
If the existing access time has not been updated within the
past 24 hours (it will be updated the next time the file is accessed)
In order to use relatime the atime setting must also be enabled:
# zfs set atime=on <pool>
# zfs set relatime=on <pool>
or
# zfs set atime=on <pool>/<dataset>
# zfs set relatime=on <pool>/<dataset>
Record size (recordsize=n)
The recordsize property sets the maximum size of a logical block in a
ZFS dataset. Unlike many other file systems, ZFS has a variable
record size, meaning a file is stored either as a single block of
varying size (up to recordsize) or as multiple blocks of recordsize bytes.
The default recordsize for ZFS is 128KiB, meaning ZFS will dynamically
allocate blocks of any size from 512B to 128KiB depending on the size
of the file being written. A file whose size is greater than recordsize
will be written in full 128KiB blocks (so a file of size 156KiB will be
written as two 128KiB blocks, with the first block containing the
first 128KiB of the file and the second block containing the remaining
28KiB).
This space allocation behaviour can be demonstrated with the following
example. Firstly, create a file system with a 100M quota and a
record size of 1M:
# zfs create -o quota=100M -o recordsize=1M pool/recordsize
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             5.43M  1.74G    25K  /pool
pool/recordsize    24K   100M    24K  /pool/recordsize
Next create 10 files of size 512K:
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=512K count=1; done
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             10.3M  1.74G    24K  /pool
pool/recordsize  5.03M  95.0M  5.03M  /pool/recordsize
ZFS is reporting an overall used space of 5.03M. This is the expected
outcome as each file will only occupy 512K of disk space, even though
the record size is 1M. If ZFS were allocating the full record size of
1M then the used space would be in the region of 10M.
Now repeat the same exercise but using a file size of 1148K
(i.e. slightly larger than the 1M record size):
# for f in {1..10}; do dd if=/dev/urandom of=/pool/recordsize/${f}.txt bs=1148K count=1; done
# zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
pool             25.5M  1.72G    24K  /pool
pool/recordsize  20.0M  80.0M  20.0M  /pool/recordsize
ZFS is now reporting an overall space usage of 20M, thus showing it is
using a record size of 1M, with each file requiring two 1M records.
Database considerations
When creating filesystems for database use, one of the most
important considerations is record size. Typically databases
write fixed-size blocks in a random manner, so the
record size of the ZFS filesystem should match the block size
used by the database.
To show why this is important, consider a filesystem with a record size
of 128KiB used by a database with a 64KiB block size. Whenever the database
writes a block, ZFS must first read the 128KiB record, update
64KiB of that record and then write the whole record back. If, however,
the ZFS record size is 64KiB, then ZFS need only
write the 64KiB record, avoiding the overhead of write
amplification.
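As a sketch (the pool and dataset names are illustrative), a dataset intended for a database using a 64KiB block size might be created and checked as follows:
# zfs create -o recordsize=64K pool1/db
# zfs get recordsize pool1/db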
Large file considerations
When dealing with larger files, such as videos and photos, where
data is typically read and written sequentially, the recordsize
should be set to 1M for optimal performance.
This reduces the overall IOPS load on the system by reducing the number of
individual records that need to be processed. It can also help with
compression, as compression is performed on an individual record basis.
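For example (the dataset name is illustrative), an existing dataset holding video files could be adjusted with:
# zfs set recordsize=1M pool1/media
Remember that, as noted below, the new record size only applies to files written after the change.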
Other considerations
Modifying a file system's record size only affects files created after
the modification; existing files are left untouched.
Record size must be a power of 2 and cannot exceed 1M (a limit raised
from 128KiB by the introduction of the large_blocks feature flag in 2015):
# zfs create -o quota=100m -o recordsize=4M pool/recordsize
cannot create 'pool/recordsize': 'recordsize' must be power of 2 from 512B to 1M
Compression (compression=setting)
The ZFS compression feature, when enabled, compresses data before it
is written out to disk (as the copy-on-write nature of ZFS makes it easier to
offer inline compression). By default ZFS does not enable
compression. To check the current compression value of a pool use the
following command:
# zfs get compression <pool>
NAME    PROPERTY     VALUE  SOURCE
<pool>  compression  off    default
The available compression types are as follows:
Setting      Description
off          No compression
on           Default compression (currently lz4)
gzip         GNU gzip compression - the Unix standard
gzip-[1-9]   Selects a specific gzip level; gzip-1 provides the fastest gzip compression, gzip-9 provides the best data compression, and gzip-6 is the default when gzip is specified
lzjb         Performance optimised with good compression. The original ZFS compression algorithm, deprecated in favour of lz4, which is superior in all respects
zle          Zero Length Encoding - compresses long sequences of zeros whilst leaving normal data untouched. Aimed at incompressible data (i.e. already compressed formats) such as GIF, MP4 and JPEG
lz4          A streaming algorithm with rapid compression/decompression; the best overall choice
zstd         A newer compression algorithm from the creator of lz4; zstd offers better compression ratios but is slower than lz4
The recommended setting for compression is currently lz4, which is the
default method chosen when compression is set to on. However, we
recommend specifying the setting so it's clear which algorithm is being used:
# zfs set compression=lz4 <pool>
# zfs get compression <pool>
NAME    PROPERTY     VALUE  SOURCE
<pool>  compression  lz4    local
The compression property is inherited, so setting compression on a pool
will result in that value being used for any child datasets:
# zfs get compression pool
NAME  PROPERTY     VALUE  SOURCE
pool  compression  off    local
# zfs create pool/compressiontest
# zfs get compression pool/compressiontest
NAME                  PROPERTY     VALUE  SOURCE
pool/compressiontest  compression  off    inherited from pool
# zfs set compression=lz4 pool
# zfs get compression pool/compressiontest
NAME                  PROPERTY     VALUE  SOURCE
pool/compressiontest  compression  lz4    inherited from pool
Apply compression early on in a pool's life
The compression setting is only applied to newly written
data; it is not retrospectively applied to data already written to
disk. We therefore recommend enabling compression at pool
creation time.
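Once compression has been enabled and data written, its effectiveness can be gauged with the compressratio property (the pool name is illustrative):
# zfs get compressratio pool1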
Linux - Extended Attributes (xattr=on|off|sa|dir)
The xattr property defines how ZFS will handle Linux extended
attributes (xattrs) in a file system. The default setting of on (which maps to
dir) stores those attributes in hidden subdirectories. This can
result in a performance impact as multiple lookups may be
required when accessing a file.
Changing this setting to sa (System Attributes) means the attributes
are stored directly in the inodes, resulting in fewer IO requests when
extended attributes are in use. For a file system with many small
files this can deliver a significant performance improvement - for file
systems with fewer, larger files the impact is less significant.
The default setting of on/dir is for backwards compatibility,
as the original implementation of ZFS did not have the facility
to store extended attributes in inodes.
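A minimal sketch of applying this recommendation (the pool name is illustrative; the setting can equally be applied per dataset):
# zfs set xattr=sa pool1
# zfs get xattr pool1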
Recommended settings
Property     Recommended Value  Description
ashift       12                 4KiB block size
atime        off                Do not update atime on file read
recordsize   64KiB              Smaller record size for databases (match the database block size)
recordsize   128KiB             Standard usage (mixture of file sizes)
recordsize   1M                 Recommended for large files
compression  lz4                Set compression to use the lz4 algorithm
xattr        sa                 Store Linux extended attributes in inodes rather than in files in hidden folders
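As a closing sketch (pool and device names are illustrative), the recommendations above could be combined at pool creation time, leaving recordsize at its 128KiB default and overriding it per dataset for databases or large files:
# zpool create -o ashift=12 -O atime=off -O compression=lz4 -O xattr=sa pool1 mirror sdc sdd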