diff --git a/docs/Performance and Tuning/Workload Tuning.rst b/docs/Performance and Tuning/Workload Tuning.rst
index 0334a77..27f3b46 100644
--- a/docs/Performance and Tuning/Workload Tuning.rst
+++ b/docs/Performance and Tuning/Workload Tuning.rst
@@ -1,3 +1,6 @@
+Workload Tuning
+===============
+
 Below are tips for various workloads.
 
 .. _basic_concepts:
@@ -36,7 +39,7 @@ providing a superior hit rate.
 
 In addition, a dedicated cache device (typically a SSD) can be added to
 the pool, with
-``zpool add``\ *``poolname``*\ ``cache``\ *``devicename``*. The cache
+``zpool add POOLNAME cache DEVICENAME``. The cache
 device is managed by the L2ARC, which scans entries that are next to be
 evicted and writes them to the cache device. The data stored in ARC and
 L2ARC can be controlled via the ``primarycache`` and ``secondarycache``
@@ -90,7 +93,7 @@ respective methods are as follows:
   on FreeBSD; see for example `FreeBSD on 4K sector drives
   `__ (2011-01-01)
-- `ashift= `__
+- `ashift= `__
   on ZFS on Linux
 - -o ashift= also works with both MacZFS (pool version 8) and ZFS-OSX
   (pool version 5000).
@@ -102,7 +105,7 @@ syntax `__ has contributed a `database of
+In addition, there is a `database of
 drives known to misreport sector sizes `__
 to the ZFS on Linux project. It is used to automatically adjust ashift
@@ -133,9 +136,9 @@ The following compression algorithms are available:
 
 - LZ4
 
   - New algorithm added after feature flags were created. It is
-    significantly superior to LZJB in all metrics tested. It is new
-    default compression algorithm (compression=on) in
-    OpenZFS\ `1 `__.
+    significantly superior to LZJB in all metrics tested. It is the `new
+    default compression algorithm `__
+    (compression=on) in OpenZFS.
     It is available on all platforms as of 2020.
 
 - LZJB
@@ -159,7 +162,7 @@ The following compression algorithms are available:
 
 If you want to use compression and are uncertain which to use, use LZ4.
 It averages a 2.1:1 compression ratio while gzip-1 averages 2.7:1, but
 gzip is much slower. Both figures are obtained from `testing by the LZ4
-project `__ on the Silesia corpus. The
+project `__ on the Silesia corpus. The
 greater compression ratio of gzip is usually only worthwhile for rarely
 accessed data.
@@ -265,8 +268,8 @@ Metaslab Allocator
 
 ZFS top level vdevs are divided into metaslabs from which blocks can be
 independently allocated to allow concurrent IOs to perform
-allocations without blocking one another. At present, there is a
-regression\ `2 `__ on the
+allocations without blocking one another. At present, `there is a
+regression `__ on the
 Linux and Mac OS X ports that causes serialization to occur.
 
 By default, the selection of a metaslab is biased toward lower LBAs to
@@ -280,8 +283,8 @@ The metaslab allocator will allocate blocks on a first-fit basis when a
 metaslab has more than or equal to 4 percent free space and a best-fit
 basis when a metaslab has less than 4 percent free space. The former is
 much faster than the latter, but it is not possible to tell when this
-behavior occurs from the pool's free space. However, the command \`zdb
--mmm $POOLNAME\` will provide this information.
+behavior occurs from the pool's free space. However, the command ``zdb
+-mmm $POOLNAME`` will provide this information.
 
 .. _pool_geometry:
 
@@ -371,8 +374,7 @@ Free Space
 
 Keep pool free space above 10% to prevent many metaslabs from reaching
 the 4% free space threshold to switch from first-fit to best-fit
 allocation
-strategies. When the threshold is hit, the `metaslab
-allocator `__ becomes very CPU
+strategies. When the threshold is hit, the :ref:`metaslab_allocator` becomes very CPU
 intensive in an attempt to protect itself from fragmentation. This
 reduces IOPS, especially as more metaslabs reach the 4% threshold.
@@ -405,13 +407,12 @@ Note that larger record sizes will increase compression ratios on
 compressible data by allowing compression algorithms to process more
 data at a time.
 
-.. _nvme_low_level_formatting:
+.. _nvme_low_level_formatting_link:
 
 NVMe low level formatting
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-See
-`Hardware#NVMe_low_level_formatting `__.
+See :ref:`nvme_low_level_formatting`.
 
 .. _pool_geometry_1:
 
@@ -430,17 +431,16 @@ If your workload involves fsync or O_SYNC and your pool is backed by
 mechanical storage, consider adding one or more SLOG devices. Pools
 that have multiple SLOG devices will distribute ZIL operations across
 them. The best choice for SLOG device(s) is likely Optane / 3D XPoint SSDs.
-See
-`Hardware#Optane_.2F_3D_XPoint_SSDs `__
+See :ref:`optane_3d_xpoint_ssds`
 for a description of them. If an Optane / 3D XPoint SSD is an option,
 the rest of this section on synchronous I/O need not be read. If Optane
 / 3D XPoint SSDs are not an option, see
-`Hardware#NAND_Flash_SSDs `__ for suggestions
+:ref:`nand_flash_ssds` for suggestions
 for NAND flash SSDs and also read the information below.
 
 To ensure maximum ZIL performance on NAND flash SSD-based SLOG devices,
 you should also overprovision spare area to increase
-IOPS\ `3 `__. Only
+IOPS [#ssd_iops]_. Only
 about 4GB is needed, so the rest can be left as overprovisioned storage.
 The choice of 4GB is somewhat arbitrary. Most systems do not write
 anything close to 4GB to ZIL between transaction group commits, so
@@ -495,15 +495,15 @@ Whole disks
 
 Whole disks should be given to ZFS rather than partitions. If you must
 use a partition, make certain that the partition is properly aligned to
-avoid read-modify-write overhead. See the section on `Alignment
-Shift `__ for a
-description of proper alignment. Also, see the section on `Whole Disks
-versus Partitions `__
+avoid read-modify-write overhead. See the section on
+:ref:`Alignment Shift (ashift) `
+for a description of proper alignment. Also, see the section on
+:ref:`Whole Disks versus Partitions `
 for a description of changes in ZFS behavior when operating on a
 partition.
 
 Single disk RAID 0 arrays from RAID controllers are not equivalent to
-whole disks. The `Hardware `__ page
+whole disks. The :ref:`hardware_raid_controllers` page
 explains in detail.
 
 .. _bit_torrent:
@@ -539,7 +539,7 @@ and are subject to significant sequential read workloads after creation.
 Database workloads
 ------------------
 
-Setting redundant_metadata=mostly can increase IOPS by at least a few
+Setting ``redundant_metadata=mostly`` can increase IOPS by at least a few
 percentage points by eliminating redundant metadata at the lowest level
 of the indirect block tree. This comes with the caveat that data loss
 will occur if a metadata block pointing to data blocks is corrupted and
@@ -553,18 +553,18 @@ InnoDB
 ^^^^^^
 
 Make separate datasets for InnoDB's data files and log files. Set
-recordsize=16K on InnoDB's data files to avoid expensive partial record
+``recordsize=16K`` on InnoDB's data files to avoid expensive partial record
 writes and leave recordsize=128K on the log files. Set
-primarycache=metadata on both to prefer InnoDB's
-caching.\ `4 `__
-Set logbias=throughput on the data to stop ZIL from writing twice.
+``primarycache=metadata`` on both to prefer InnoDB's
+caching [#mysql_basic]_.
+Set ``logbias=throughput`` on the data to stop ZIL from writing twice.
 
-Set skip-innodb_doublewrite in my.cnf to prevent innodb from writing
+Set ``skip-innodb_doublewrite`` in my.cnf to prevent innodb from writing
 twice. The double writes are a data integrity feature meant to protect
 against corruption from partially-written records, but those are not
-possible on ZFS. It should be noted that Percona’s
-blog\ `5 `__
-had advocated using an ext4 configuration where double writes were
+possible on ZFS. It should be noted that `Percona’s
+blog had advocated `__
+using an ext4 configuration where double writes were
 turned off for a performance gain, but later recanted it because it
 caused data corruption. Following a well-timed power failure, an in
 place filesystem such as ext4 can have half of an 8KB record be old while
@@ -578,15 +578,15 @@ off for better performance.
 
 On Linux, the driver's AIO implementation is a compatibility shim that
 just barely passes the POSIX standard. InnoDB performance suffers when
-using its default AIO codepath. Set innodb_use_native_aio=0 and
-innodb_use_atomic_writes=0 in my.cnf to disable AIO. Both of these
+using its default AIO codepath. Set ``innodb_use_native_aio=0`` and
+``innodb_use_atomic_writes=0`` in my.cnf to disable AIO. Both of these
 settings must be disabled to disable AIO.
 
 PostgreSQL
 ~~~~~~~~~~
 
-Make separate datasets for PostgreSQL's data and WAL. Set recordsize=8K
-on both to avoid expensive partial record writes. Set logbias=throughput
+Make separate datasets for PostgreSQL's data and WAL. Set ``recordsize=8K``
+on both to avoid expensive partial record writes. Set ``logbias=throughput``
 on PostgreSQL's data to avoid writing twice.
 
 SQLite
@@ -594,12 +594,12 @@
 
 Make a separate dataset for the database. Set the recordsize to 64K. Set
 the SQLite page size to 65536
-bytes\ `6 `__.
+bytes [#sqlite_ps]_.
 
 Note that SQLite databases typically are not exercised enough to merit
 special tuning, but this will provide it. Note the side effect on cache
 size mentioned at
-SQLite.org\ `7 `__.
+SQLite.org [#sqlite_ps_change]_.
 
 .. _file_servers:
 
@@ -609,7 +609,7 @@ File servers
 ------------
 
 Create a dedicated dataset for files being served.
 
 See
-`Performance_tuning#Sequential_workloads `__
+:ref:`Sequential workloads `
 for configuration recommendations.
 
 .. _sequential_workloads:
 
@@ -619,12 +619,12 @@ Sequential workloads
 --------------------
 
 Set recordsize=1M on datasets that are subject to sequential workloads.
 
 Read
-`Performance_tuning#Larger_record_sizes `__
+:ref:`Larger record sizes `
 for documentation on things that should be known before setting 1M
 record sizes.
 
-Set compression=lz4 as per the general recommendation for `LZ4
-compression `__.
+Set compression=lz4 as per the general recommendation for :ref:`LZ4
+compression `.
 
 .. _video_games_directories:
 
@@ -637,7 +637,7 @@ Video games directories
 -----------------------
 
 Create a dedicated dataset, set recordsize=1M on it and configure
 the game download application to place games there. Specific
 information on how to configure various ones is below.
 
 See
-`Performance_tuning#Sequential_workloads `__
+:ref:`Sequential workloads `
 for configuration recommendations before installing games.
 
 Note that the performance gains from this tuning are likely to be small
 and limited to initial game download performance.
@@ -683,3 +683,10 @@ QEMU / KVM / Xen
 ~~~~~~~~~~~~~~~~
 
 AIO should be used to maximize IOPS when using files for guest storage.
+
+.. rubric:: Footnotes
+
+.. [#ssd_iops]
+.. [#mysql_basic]
+.. [#sqlite_ps]
+.. [#sqlite_ps_change]
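
As a rough illustration of the InnoDB recommendations in the patched text, the properties could be applied with something like the sketch below. The pool name ``tank``, the dataset layout, and the mountpoints are hypothetical and are not part of this patch; adjust them to the system at hand::

    # Hypothetical layout: one dataset for InnoDB data, one for its logs.
    zfs create tank/mysql
    zfs create -o recordsize=16K -o primarycache=metadata -o logbias=throughput \
        -o mountpoint=/var/lib/mysql tank/mysql/data
    zfs create -o recordsize=128K -o primarycache=metadata \
        -o mountpoint=/var/lib/mysql-log tank/mysql/log

The same pattern applies to the PostgreSQL and sequential-workload advice: one dataset per workload, with ``recordsize`` matched to the application's I/O size.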