commit c47d449832ad333fbf9d00625c8aad2fcb570db1 Author: George Melikov Date: Sat May 16 16:39:45 2020 +0300 Initial wiki md to rst auto convertation diff --git a/docs/Admin-Documentation.rst b/docs/Admin-Documentation.rst new file mode 100644 index 0000000..27bbf7c --- /dev/null +++ b/docs/Admin-Documentation.rst @@ -0,0 +1,6 @@ +- `Aaron Toponce's ZFS on Linux User + Guide `__ +- `OpenZFS System + Administration `__ +- `Oracle Solaris ZFS Administration + Guide `__ diff --git a/docs/Async-Write.rst b/docs/Async-Write.rst new file mode 100644 index 0000000..21876a8 --- /dev/null +++ b/docs/Async-Write.rst @@ -0,0 +1,36 @@ +Async Writes +~~~~~~~~~~~~ + +The number of concurrent operations issued for the async write I/O class +follows a piece-wise linear function defined by a few adjustable points. + +:: + + | o---------| <-- zfs_vdev_async_write_max_active + ^ | /^ | + | | / | | + active | / | | + I/O | / | | + count | / | | + | / | | + |-------o | | <-- zfs_vdev_async_write_min_active + 0|_______^______|_________| + 0% | | 100% of zfs_dirty_data_max + | | + | `-- zfs_vdev_async_write_active_max_dirty_percent + `--------- zfs_vdev_async_write_active_min_dirty_percent + +Until the amount of dirty data exceeds a minimum percentage of the dirty +data allowed in the pool, the I/O scheduler will limit the number of +concurrent operations to the minimum. As that threshold is crossed, the +number of concurrent operations issued increases linearly to the maximum +at the specified maximum percentage of the dirty data allowed in the +pool. + +Ideally, the amount of dirty data on a busy pool will stay in the sloped +part of the function between +zfs_vdev_async_write_active_min_dirty_percent and +zfs_vdev_async_write_active_max_dirty_percent. If it exceeds the maximum +percentage, this indicates that the rate of incoming data is greater +than the rate that the backend storage can handle. In this case, we must +further throttle incoming writes, as described in the next section. diff --git a/docs/Buildbot-Options.rst b/docs/Buildbot-Options.rst new file mode 100644 index 0000000..9fba242 --- /dev/null +++ b/docs/Buildbot-Options.rst @@ -0,0 +1,245 @@ +There are a number of ways to control the ZFS Buildbot at a commit +level. This page provides a summary of various options that the ZFS +Buildbot supports and how it impacts testing. More detailed information +regarding its implementation can be found at the `ZFS Buildbot Github +page `__. + +Choosing Builders +----------------- + +By default, all commits in your ZFS pull request are compiled by the +BUILD builders. Additionally, the top commit of your ZFS pull request is +tested by TEST builders. However, there is the option to override which +types of builder should be used on a per commit basis. In this case, you +can add +``Requires-builders: `` +to your commit message. A comma separated list of options can be +provided. 
Supported options are: + +- ``all``: This commit should be built by all available builders +- ``none``: This commit should not be built by any builders +- ``style``: This commit should be built by STYLE builders +- ``build``: This commit should be built by all BUILD builders +- ``arch``: This commit should be built by BUILD builders tagged as + 'Architectures' +- ``distro``: This commit should be built by BUILD builders tagged as + 'Distributions' +- ``test``: This commit should be built and tested by the TEST builders + (excluding the Coverage TEST builders) +- ``perf``: This commit should be built and tested by the PERF builders +- ``coverage`` : This commit should be built and tested by the Coverage + TEST builders +- ``unstable`` : This commit should be built and tested by the Unstable + TEST builders (currently only the Fedora Rawhide TEST builder) + +A couple of examples on how to use ``Requires-builders:`` in commit +messages can be found below. + +.. _preventing-a-commit-from-being-built-and-tested: + +Preventing a commit from being built and tested. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + This is a commit message + + This text is part of the commit message body. + + Signed-off-by: Contributor + Requires-builders: none + +.. _submitting-a-commit-to-style-and-test-builders-only: + +Submitting a commit to STYLE and TEST builders only. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + This is a commit message + + This text is part of the commit message body. + + Signed-off-by: Contributor + Requires-builders: style test + +Requiring SPL Versions +---------------------- + +Currently, the ZFS Buildbot attempts to choose the correct SPL branch to +build based on a pull request's base branch. In the cases where a +specific SPL version needs to be built, the ZFS buildbot supports +specifying an SPL version for pull request testing. By opening a pull +request against ZFS and adding ``Requires-spl:`` in a commit message, +you can instruct the buildbot to use a specific SPL version. Below are +examples of a commit messages that specify the SPL version. + +Build SPL from a specific pull request +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + This is a commit message + + This text is part of the commit message body. + + Signed-off-by: Contributor + Requires-spl: refs/pull/123/head + +Build SPL branch ``spl-branch-name`` from ``zfsonlinux/spl`` repository +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + This is a commit message + + This text is part of the commit message body. + + Signed-off-by: Contributor + Requires-spl: spl-branch-name + +Requiring Kernel Version +------------------------ + +Currently, Kernel.org builders will clone and build the master branch of +Linux. In cases where a specific version of the Linux kernel needs to be +built, the ZFS buildbot supports specifying the Linux kernel to be built +via commit message. By opening a pull request against ZFS and adding +``Requires-kernel:`` in a commit message, you can instruct the buildbot +to use a specific Linux kernel. Below is an example commit message that +specifies a specific Linux kernel tag. + +.. _build-linux-kernel-version-414: + +Build Linux Kernel Version 4.14 +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + This is a commit message + + This text is part of the commit message body. 
+
+   Signed-off-by: Contributor
+   Requires-kernel: v4.14
+
+Build Steps Overrides
+---------------------
+
+Each builder will execute or skip build steps based on its default preferences. In some scenarios, it might be possible to skip various build steps. The ZFS buildbot supports overriding the defaults of all builders in a commit message. The list of available overrides is:
+
+- ``Build-linux: <Yes|No>``: All builders should build Linux for this commit
+- ``Build-lustre: <Yes|No>``: All builders should build Lustre for this commit
+- ``Build-spl: <Yes|No>``: All builders should build the SPL for this commit
+- ``Build-zfs: <Yes|No>``: All builders should build ZFS for this commit
+- ``Built-in: <Yes|No>``: All Linux builds should build in SPL and ZFS
+- ``Check-lint: <Yes|No>``: All builders should perform lint checks for this commit
+- ``Configure-lustre: <options>``: Provide ``<options>`` as configure flags when building Lustre
+- ``Configure-spl: <options>``: Provide ``<options>`` as configure flags when building the SPL
+- ``Configure-zfs: <options>``: Provide ``<options>`` as configure flags when building ZFS
+
+A couple of examples on how to use overrides in commit messages can be found below.
+
+Skip building the SPL and build Lustre without ldiskfs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+   This is a commit message
+
+   This text is part of the commit message body.
+
+   Signed-off-by: Contributor
+   Build-lustre: Yes
+   Configure-lustre: --disable-ldiskfs
+   Build-spl: No
+
+Build ZFS Only
+~~~~~~~~~~~~~~
+
+::
+
+   This is a commit message
+
+   This text is part of the commit message body.
+
+   Signed-off-by: Contributor
+   Build-lustre: No
+   Build-spl: No
+
+Configuring Tests with the TEST File
+------------------------------------
+
+At the top level of the ZFS source tree, there is the `TEST file `__ which contains variables that control whether and how a specific test should run. Below is a list of each variable and a brief description of what each variable controls.
+ +- ``TEST_PREPARE_WATCHDOG`` - Enables the Linux kernel watchdog +- ``TEST_PREPARE_SHARES`` - Start NFS and Samba servers +- ``TEST_SPLAT_SKIP`` - Determines if ``splat`` testing is skipped +- ``TEST_SPLAT_OPTIONS`` - Command line options to provide to ``splat`` +- ``TEST_ZTEST_SKIP`` - Determines if ``ztest`` testing is skipped +- ``TEST_ZTEST_TIMEOUT`` - The length of time ``ztest`` should run +- ``TEST_ZTEST_DIR`` - Directory where ``ztest`` will create vdevs +- ``TEST_ZTEST_OPTIONS`` - Options to pass to ``ztest`` +- ``TEST_ZTEST_CORE_DIR`` - Directory for ``ztest`` to store core dumps +- ``TEST_ZIMPORT_SKIP`` - Determines if ``zimport`` testing is skipped +- ``TEST_ZIMPORT_DIR`` - Directory used during ``zimport`` +- ``TEST_ZIMPORT_VERSIONS`` - Source versions to test +- ``TEST_ZIMPORT_POOLS`` - Names of the pools for ``zimport`` to use + for testing +- ``TEST_ZIMPORT_OPTIONS`` - Command line options to provide to + ``zimport`` +- ``TEST_XFSTESTS_SKIP`` - Determines if ``xfstest`` testing is skipped +- ``TEST_XFSTESTS_URL`` - URL to download ``xfstest`` from +- ``TEST_XFSTESTS_VER`` - Name of the tarball to download from + ``TEST_XFSTESTS_URL`` +- ``TEST_XFSTESTS_POOL`` - Name of pool to create and used by + ``xfstest`` +- ``TEST_XFSTESTS_FS`` - Name of dataset for use by ``xfstest`` +- ``TEST_XFSTESTS_VDEV`` - Name of the vdev used by ``xfstest`` +- ``TEST_XFSTESTS_OPTIONS`` - Command line options to provide to + ``xfstest`` +- ``TEST_ZFSTESTS_SKIP`` - Determines if ``zfs-tests`` testing is + skipped +- ``TEST_ZFSTESTS_DIR`` - Directory to store files and loopback devices +- ``TEST_ZFSTESTS_DISKS`` - Space delimited list of disks that + ``zfs-tests`` is allowed to use +- ``TEST_ZFSTESTS_DISKSIZE`` - File size of file based vdevs used by + ``zfs-tests`` +- ``TEST_ZFSTESTS_ITERS`` - Number of times ``test-runner`` should + execute its set of tests +- ``TEST_ZFSTESTS_OPTIONS`` - Options to provide ``zfs-tests`` +- ``TEST_ZFSTESTS_RUNFILE`` - The runfile to use when running + ``zfs-tests`` +- ``TEST_ZFSTESTS_TAGS`` - List of tags to provide to ``test-runner`` +- ``TEST_ZFSSTRESS_SKIP`` - Determines if ``zfsstress`` testing is + skipped +- ``TEST_ZFSSTRESS_URL`` - URL to download ``zfsstress`` from +- ``TEST_ZFSSTRESS_VER`` - Name of the tarball to download from + ``TEST_ZFSSTRESS_URL`` +- ``TEST_ZFSSTRESS_RUNTIME`` - Duration to run ``runstress.sh`` +- ``TEST_ZFSSTRESS_POOL`` - Name of pool to create and use for + ``zfsstress`` testing +- ``TEST_ZFSSTRESS_FS`` - Name of dataset for use during ``zfsstress`` + tests +- ``TEST_ZFSSTRESS_FSOPT`` - File system options to provide to + ``zfsstress`` +- ``TEST_ZFSSTRESS_VDEV`` - Directory to store vdevs for use during + ``zfsstress`` tests +- ``TEST_ZFSSTRESS_OPTIONS`` - Command line options to provide to + ``runstress.sh`` diff --git a/docs/Building-ZFS.rst b/docs/Building-ZFS.rst new file mode 100644 index 0000000..1fae4d6 --- /dev/null +++ b/docs/Building-ZFS.rst @@ -0,0 +1,243 @@ +GitHub Repositories +~~~~~~~~~~~~~~~~~~~ + +The official source for ZFS on Linux is maintained at GitHub by the +`zfsonlinux `__ organization. The +project consists of two primary git repositories named +`spl `__ and +`zfs `__, both are required to build +ZFS on Linux. + +**NOTE:** The SPL was merged in to the +`zfs `__ repository, the last major +release with a separate SPL is ``0.7``. + +- **SPL**: The SPL is thin shim layer which is responsible for + implementing the fundamental interfaces required by OpenZFS. 
It's this layer which allows OpenZFS to be used across multiple platforms.
+
+- **ZFS**: The ZFS repository contains a copy of the upstream OpenZFS code which has been adapted and extended for Linux. The vast majority of the core OpenZFS code is self-contained and can be used without modification.
+
+Installing Dependencies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The first thing you'll need to do is prepare your environment by installing a full development tool chain. In addition, development headers for both the kernel and the following libraries must be available. It is important to note that if the development kernel headers for the currently running kernel aren't installed, the modules won't compile properly.
+
+The following dependencies should be installed to build the latest ZFS 0.8 release.
+
+- **RHEL/CentOS 7**:
+
+.. code:: sh
+
+   sudo yum install epel-release gcc make autoconf automake libtool rpm-build dkms libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python python2-devel python-setuptools python-cffi libffi-devel
+
+- **RHEL/CentOS 8, Fedora**:
+
+.. code:: sh
+
+   sudo dnf install gcc make autoconf automake libtool rpm-build dkms libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python3 python3-devel python3-setuptools python3-cffi libffi-devel
+
+- **Debian, Ubuntu**:
+
+.. code:: sh
+
+   sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev libelf-dev linux-headers-$(uname -r) python3 python3-dev python3-setuptools python3-cffi libffi-dev
+
+Build Options
+~~~~~~~~~~~~~
+
+There are two options for building ZFS on Linux; the correct one largely depends on your requirements.
+
+- **Packages**: Often it can be useful to build custom packages from git which can be installed on a system. This is the best way to perform integration testing with systemd, dracut, and udev. The downside to using packages is that it greatly increases the time required to build, install, and test a change.
+
+- **In-tree**: Development can be done entirely in the SPL and ZFS source trees. This speeds up development by allowing developers to rapidly iterate on a patch. When working in-tree, developers can leverage incremental builds, load/unload kernel modules, execute utilities, and verify all their changes with the ZFS Test Suite.
+
+The remainder of this page focuses on the **in-tree** option, which is the recommended method of development for the majority of changes. See the Custom Packages page for additional information on building custom packages.
+
+Developing In-Tree
+~~~~~~~~~~~~~~~~~~
+
+Clone from GitHub
+^^^^^^^^^^^^^^^^^
+
+Start by cloning the SPL and ZFS repositories from GitHub. The repositories have a **master** branch for development and a series of **\*-release** branches for tagged releases. After checking out the repository, your clone will default to the master branch. Tagged releases may be built by checking out spl/zfs-x.y.z tags with matching version numbers or matching release branches. Avoid using mismatched versions, as this can result in build failures due to interface changes.
+
+**NOTE:** The SPL was merged into the `zfs `__ repository; the last release with a separate SPL is ``0.7``.
+
+::
+
+   git clone https://github.com/zfsonlinux/zfs
+
+If you need the 0.7 release or older:
+
+::
+
+   git clone https://github.com/zfsonlinux/spl
+
+Configure and Build
+^^^^^^^^^^^^^^^^^^^
+
+Developers working on a change should always create a new topic branch based off of master. This will make it easy to open a pull request with your change later. The master branch is kept stable with extensive `regression testing `__ of every pull request before and after it's merged. Every effort is made to catch defects as early as possible and to keep them out of the tree. Developers should be comfortable frequently rebasing their work against the latest master branch.
+
+If you want to build the 0.7 release or older, you should compile the SPL first:
+
+::
+
+   cd ./spl
+   git checkout master
+   sh autogen.sh
+   ./configure
+   make -s -j$(nproc)
+
+In this example we'll use the master branch and walk through a stock **in-tree** build, so we don't need to build the SPL separately. Start by checking out the desired branch, then build the ZFS and SPL source in the traditional autotools fashion.
+
+::
+
+   cd ./zfs
+   git checkout master
+   sh autogen.sh
+   ./configure
+   make -s -j$(nproc)
+
+| **tip:** ``--with-linux=PATH`` and ``--with-linux-obj=PATH`` can be passed to configure to specify a kernel installed in a non-default location. This option is also supported when building ZFS.
+| **tip:** ``--enable-debug`` can be set to enable all ASSERTs and additional correctness tests. This option is also supported when building ZFS.
+| **tip:** for version ``<=0.7``, ``--with-spl=PATH`` and ``--with-spl-obj=PATH``, where ``PATH`` is a full path, can be passed to configure if it is unable to locate the SPL.
+
+**Optional:** Build packages
+
+::
+
+   make deb #example for Debian/Ubuntu
+
+Install
+^^^^^^^
+
+You can run ``zfs-tests.sh`` without installing ZFS; see below. If you have reason to install ZFS after building it, pay attention to how your distribution handles kernel modules. On Ubuntu, for example, the modules from this repository install in the ``extra`` kernel module path, which is not in the standard ``depmod`` search path. Therefore, for the duration of your testing, edit ``/etc/depmod.d/ubuntu.conf`` and add ``extra`` to the beginning of the search path.
+
+You may then install using ``sudo make install; sudo ldconfig; sudo depmod``. You'd uninstall with ``sudo make uninstall; sudo ldconfig; sudo depmod``.
+
+.. _running-zloopsh-and-zfs-testssh:
+
+Running zloop.sh and zfs-tests.sh
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you wish to run the ZFS Test Suite (ZTS), then ``ksh`` and a few additional utilities must be installed.
+
+- **RHEL/CentOS 7:**
+
+.. code:: sh
+
+   sudo yum install ksh bc fio acl sysstat mdadm lsscsi parted attr dbench nfs-utils samba rng-tools pax perf
+
+- **RHEL/CentOS 8, Fedora:**
+
+.. code:: sh
+
+   sudo dnf install ksh bc fio acl sysstat mdadm lsscsi parted attr dbench nfs-utils samba rng-tools pax perf
+
+- **Debian, Ubuntu:**
+
+.. code:: sh
+
+   sudo apt install ksh bc fio acl sysstat mdadm lsscsi parted attr dbench nfs-kernel-server samba rng-tools pax linux-tools-common selinux-utils quota
+
+There are a few helper scripts provided in the top-level scripts directory designed to aid developers working with in-tree builds.
+
+- **zfs-helpers.sh:** Certain functionality (e.g. /dev/zvol/) depends on the ZFS provided udev helper scripts being installed on the system.
+ This script can be used to create symlinks on the system from the + installation location to the in-tree helper. These links must be in + place to successfully run the ZFS Test Suite. The **-i** and **-r** + options can be used to install and remove the symlinks. + +:: + + sudo ./scripts/zfs-helpers.sh -i + +- **zfs.sh:** The freshly built kernel modules can be loaded using + ``zfs.sh``. This script can latter be used to unload the kernel + modules with the **-u** option. + +:: + + sudo ./scripts/zfs.sh + +- **zloop.sh:** A wrapper to run ztest repeatedly with randomized + arguments. The ztest command is a user space stress test designed to + detect correctness issues by concurrently running a random set of + test cases. If a crash is encountered, the ztest logs, any associated + vdev files, and core file (if one exists) are collected and moved to + the output directory for analysis. + +:: + + sudo ./scripts/zloop.sh + +- **zfs-tests.sh:** A wrapper which can be used to launch the ZFS Test + Suite. Three loopback devices are created on top of sparse files + located in ``/var/tmp/`` and used for the regression test. Detailed + directions for the ZFS Test Suite can be found in the + `README `__ + located in the top-level tests directory. + +:: + + ./scripts/zfs-tests.sh -vx + +**tip:** The **delegate** tests will be skipped unless group read +permission is set on the zfs directory and its parents. diff --git a/docs/Checksums.rst b/docs/Checksums.rst new file mode 100644 index 0000000..a4028a5 --- /dev/null +++ b/docs/Checksums.rst @@ -0,0 +1,124 @@ +Checksums and Their Use in ZFS +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +End-to-end checksums are a key feature of ZFS and an important +differentiator for ZFS over other RAID implementations and filesystems. +Advantages of end-to-end checksums include: + +- detects data corruption upon reading from media +- blocks that are detected as corrupt are automatically repaired if + possible, by using the RAID protection in suitably configured pools, + or redundant copies (see the zfs ``copies`` property) +- periodic scrubs can check data to detect and repair latent media + degradation (bit rot) and corruption from other sources +- checksums on ZFS replication streams, ``zfs send`` and + ``zfs receive``, ensure the data received is not corrupted by + intervening storage or transport mechanisms + +Checksum Algorithms +^^^^^^^^^^^^^^^^^^^ + +The checksum algorithms in ZFS can be changed for datasets (filesystems +or volumes). The checksum algorithm used for each block is stored in the +block pointer (metadata). The block checksum is calculated when the +block is written, so changing the algorithm only affects writes +occurring after the change. + +The checksum algorithm for a dataset can be changed by setting the +``checksum`` property: + +.. code:: bash + + zfs set checksum=sha256 pool_name/dataset_name + ++-----------+-----------------+-----------------+-----------------+ +| Checksum | Ok for dedup | Compatible with | Notes | +| | and nopwrite? | other ZFS | | +| | | i | | +| | | mplementations? 
| | ++===========+=================+=================+=================+ +| on | see notes | yes | ``on`` is a | +| | | | short hand for | +| | | | ``fletcher4`` | +| | | | for non-deduped | +| | | | datasets and | +| | | | ``sha256`` for | +| | | | deduped | +| | | | datasets | ++-----------+-----------------+-----------------+-----------------+ +| off | no | yes | Do not do use | +| | | | ``off`` | ++-----------+-----------------+-----------------+-----------------+ +| fletcher2 | no | yes | Deprecated | +| | | | implementation | +| | | | of Fletcher | +| | | | checksum, use | +| | | | ``fletcher4`` | +| | | | instead | ++-----------+-----------------+-----------------+-----------------+ +| fletcher4 | no | yes | Fletcher | +| | | | algorithm, also | +| | | | used for | +| | | | ``zfs send`` | +| | | | streams | ++-----------+-----------------+-----------------+-----------------+ +| sha256 | yes | yes | Default for | +| | | | deduped | +| | | | datasets | ++-----------+-----------------+-----------------+-----------------+ +| noparity | no | yes | Do not use | +| | | | ``noparity`` | ++-----------+-----------------+-----------------+-----------------+ +| sha512 | yes | requires pool | salted | +| | | feature | ``sha512`` | +| | | ``org.i | currently not | +| | | llumos:sha512`` | supported for | +| | | | any filesystem | +| | | | on the boot | +| | | | pools | ++-----------+-----------------+-----------------+-----------------+ +| skein | yes | requires pool | salted | +| | | feature | ``skein`` | +| | | ``org. | currently not | +| | | illumos:skein`` | supported for | +| | | | any filesystem | +| | | | on the boot | +| | | | pools | ++-----------+-----------------+-----------------+-----------------+ +| edonr | yes | requires pool | salted | +| | | feature | ``edonr`` | +| | | ``org. | currently not | +| | | illumos:edonr`` | supported for | +| | | | any filesystem | +| | | | on the boot | +| | | | pools | ++-----------+-----------------+-----------------+-----------------+ + +Checksum Accelerators +^^^^^^^^^^^^^^^^^^^^^ + +ZFS has the ability to offload checksum operations to the Intel +QuickAssist Technology (QAT) adapters. + +Checksum Microbenchmarks +^^^^^^^^^^^^^^^^^^^^^^^^ + +Some ZFS features use microbenchmarks when the ``zfs.ko`` kernel module +is loaded to determine the optimal algorithm for checksums. The results +of the microbenchmarks are observable in the ``/proc/spl/kstat/zfs`` +directory. The winning algorithm is reported as the "fastest" and +becomes the default. The default can be overridden by setting zfs module +parameters. + +========= ==================================== ======================== +Checksum Results Filename ``zfs`` module parameter +========= ==================================== ======================== +Fletcher4 /proc/spl/kstat/zfs/fletcher_4_bench zfs_fletcher_4_impl +========= ==================================== ======================== + +Disabling Checksums +^^^^^^^^^^^^^^^^^^^ + +While it may be tempting to disable checksums to improve CPU +performance, it is widely considered by the ZFS community to be an +extrodinarily bad idea. Don't disable checksums. diff --git a/docs/Custom-Packages.rst b/docs/Custom-Packages.rst new file mode 100644 index 0000000..3f0efbf --- /dev/null +++ b/docs/Custom-Packages.rst @@ -0,0 +1,204 @@ +The following instructions assume you are building from an official +`release tarball `__ +(version 0.8.0 or newer) or directly from the `git +repository `__. 
Most users should not +need to do this and should preferentially use the distribution packages. +As a general rule the distribution packages will be more tightly +integrated, widely tested, and better supported. However, if your +distribution of choice doesn't provide packages, or you're a developer +and want to roll your own, here's how to do it. + +The first thing to be aware of is that the build system is capable of +generating several different types of packages. Which type of package +you choose depends on what's supported on your platform and exactly what +your needs are. + +- **DKMS** packages contain only the source code and scripts for + rebuilding the kernel modules. When the DKMS package is installed + kernel modules will be built for all available kernels. Additionally, + when the kernel is upgraded new kernel modules will be automatically + built for that kernel. This is particularly convenient for desktop + systems which receive frequent kernel updates. The downside is that + because the DKMS packages build the kernel modules from source a full + development environment is required which may not be appropriate for + large deployments. + +- **kmods** packages are binary kernel modules which are compiled + against a specific version of the kernel. This means that if you + update the kernel you must compile and install a new kmod package. If + you don't frequently update your kernel, or if you're managing a + large number of systems, then kmod packages are a good choice. + +- **kABI-tracking kmod** Packages are similar to standard binary kmods + and may be used with Enterprise Linux distributions like Red Hat and + CentOS. These distributions provide a stable kABI (Kernel Application + Binary Interface) which allows the same binary modules to be used + with new versions of the distribution provided kernel. + +By default the build system will generate user packages and both DKMS +and kmod style kernel packages if possible. The user packages can be +used with either set of kernel packages and do not need to be rebuilt +when the kernel is updated. You can also streamline the build process by +building only the DKMS or kmod packages as shown below. + +Be aware that when building directly from a git repository you must +first run the *autogen.sh* script to create the *configure* script. This +will require installing the GNU autotools packages for your +distribution. To perform any of the builds, you must install all the +necessary development tools and headers for your distribution. + +It is important to note that if the development kernel headers for the +currently running kernel aren't installed, the modules won't compile +properly. + +- `Red Hat, CentOS and Fedora <#red-hat-centos-and-fedora>`__ +- `Debian and Ubuntu <#debian-and-ubuntu>`__ + +RHEL, CentOS and Fedora +----------------------- + +Make sure that the required packages are installed to build the latest +ZFS 0.8 release: + +- **RHEL/CentOS 7**: + +.. code:: sh + + sudo yum install epel-release gcc make autoconf automake libtool rpm-build dkms libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python python2-devel python-setuptools python-cffi libffi-devel + +- **RHEL/CentOS 8, Fedora**: + +.. 
code:: sh + + sudo dnf install gcc make autoconf automake libtool rpm-build kernel-rpm-macros dkms libtirpc-devel libblkid-devel libuuid-devel libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel elfutils-libelf-devel kernel-devel-$(uname -r) python3 python3-devel python3-setuptools python3-cffi libffi-devel + +`Get the source code <#get-the-source-code>`__. + +DKMS +~~~~ + +Building rpm-based DKMS and user packages can be done as follows: + +.. code:: sh + + $ cd zfs + $ ./configure + $ make -j1 rpm-utils rpm-dkms + $ sudo yum localinstall *.$(uname -p).rpm *.noarch.rpm + +kmod +~~~~ + +The key thing to know when building a kmod package is that a specific +Linux kernel must be specified. At configure time the build system will +make an educated guess as to which kernel you want to build against. +However, if configure is unable to locate your kernel development +headers, or you want to build against a different kernel, you must +specify the exact path with the *--with-linux* and *--with-linux-obj* +options. + +.. code:: sh + + $ cd zfs + $ ./configure + $ make -j1 rpm-utils rpm-kmod + $ sudo yum localinstall *.$(uname -p).rpm + +kABI-tracking kmod +~~~~~~~~~~~~~~~~~~ + +The process for building kABI-tracking kmods is almost identical to for +building normal kmods. However, it will only produce binaries which can +be used by multiple kernels if the distribution supports a stable kABI. +In order to request kABI-tracking package the *--with-spec=redhat* +option must be passed to configure. + +**NOTE:** This type of package is not available for Fedora. + +.. code:: sh + + $ cd zfs + $ ./configure --with-spec=redhat + $ make -j1 rpm-utils rpm-kmod + $ sudo yum localinstall *.$(uname -p).rpm + +Debian and Ubuntu +----------------- + +Make sure that the required packages are installed: + +.. code:: sh + + sudo apt install build-essential autoconf automake libtool gawk alien fakeroot dkms libblkid-dev uuid-dev libudev-dev libssl-dev zlib1g-dev libaio-dev libattr1-dev libelf-dev linux-headers-$(uname -r) python3 python3-dev python3-setuptools python3-cffi libffi-dev + +`Get the source code <#get-the-source-code>`__. + +.. _kmod-1: + +kmod +~~~~ + +The key thing to know when building a kmod package is that a specific +Linux kernel must be specified. At configure time the build system will +make an educated guess as to which kernel you want to build against. +However, if configure is unable to locate your kernel development +headers, or you want to build against a different kernel, you must +specify the exact path with the *--with-linux* and *--with-linux-obj* +options. + +.. code:: sh + + $ cd zfs + $ ./configure + $ make -j1 deb-utils deb-kmod + $ for file in *.deb; do sudo gdebi -q --non-interactive $file; done + +.. _dkms-1: + +DKMS +~~~~ + +Building deb-based DKMS and user packages can be done as follows: + +.. code:: sh + + $ sudo apt-get install dkms + $ cd zfs + $ ./configure + $ make -j1 deb-utils deb-dkms + $ for file in *.deb; do sudo gdebi -q --non-interactive $file; done + +Get the Source Code +------------------- + +Released Tarball +~~~~~~~~~~~~~~~~ + +The released tarball contains the latest fully tested and released +version of ZFS. This is the preferred source code location for use in +production systems. If you want to use the official released tarballs, +then use the following commands to fetch and prepare the source. + +.. 
code:: sh + + $ wget http://archive.zfsonlinux.org/downloads/zfsonlinux/zfs/zfs-x.y.z.tar.gz + $ tar -xzf zfs-x.y.z.tar.gz + +Git Master Branch +~~~~~~~~~~~~~~~~~ + +The Git *master* branch contains the latest version of the software, and +will probably contain fixes that, for some reason, weren't included in +the released tarball. This is the preferred source code location for +developers who intend to modify ZFS. If you would like to use the git +version, you can clone it from Github and prepare the source like this. + +.. code:: sh + + $ git clone https://github.com/zfsonlinux/zfs.git + $ cd zfs + $ ./autogen.sh + +Once the source has been prepared you'll need to decide what kind of +packages you're building and jump the to appropriate section above. Note +that not all package types are supported for all platforms. diff --git a/docs/Debian-Buster-Encrypted-Root-on-ZFS.rst b/docs/Debian-Buster-Encrypted-Root-on-ZFS.rst new file mode 100644 index 0000000..f3813cf --- /dev/null +++ b/docs/Debian-Buster-Encrypted-Root-on-ZFS.rst @@ -0,0 +1,47 @@ +This experimental guide has been made official at [[Debian Buster Root +on ZFS]]. + +If you have an existing system installed from the experimental guide, +adjust your sources: + +:: + + vi /etc/apt/sources.list.d/buster-backports.list + deb http://deb.debian.org/debian buster-backports main contrib + deb-src http://deb.debian.org/debian buster-backports main contrib + + vi /etc/apt/preferences.d/90_zfs + Package: libnvpair1linux libuutil1linux libzfs2linux libzpool2linux zfs-dkms zfs-initramfs zfs-test zfsutils-linux zfs-zed + Pin: release n=buster-backports + Pin-Priority: 990 + +This will allow you to upgrade from the locally-built packages to the +official buster-backports packages. + +You should set a root password before upgrading: + +:: + + passwd + +Apply updates: + +:: + + apt update + apt dist-upgrade + +Reboot: + +:: + + reboot + +If the bpool fails to import, then enter the rescue shell (which +requires a root password) and run: + +:: + + zpool import -f bpool + zpool export bpool + reboot diff --git a/docs/Debian-Buster-Root-on-ZFS.rst b/docs/Debian-Buster-Root-on-ZFS.rst new file mode 100644 index 0000000..334c40d --- /dev/null +++ b/docs/Debian-Buster-Root-on-ZFS.rst @@ -0,0 +1,1152 @@ +Caution +~~~~~~~ + +- This HOWTO uses a whole physical disk. +- Do not use these instructions for dual-booting. +- Backup your data. Any existing data will be lost. + +System Requirements +~~~~~~~~~~~~~~~~~~~ + +- `64-bit Debian GNU/Linux Buster Live CD w/ GUI (e.g. gnome + iso) `__ +- `A 64-bit kernel is strongly + encouraged. `__ +- Installing on a drive which presents 4KiB logical sectors (a “4Kn” + drive) only works with UEFI booting. This not unique to ZFS. `GRUB + does not and will not work on 4Kn with legacy (BIOS) + booting. `__ + +Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of +memory is recommended for normal performance in basic workloads. If you +wish to use deduplication, you will need `massive amounts of +RAM `__. Enabling +deduplication is a permanent change that cannot be easily reverted. + +Support +------- + +If you need help, reach out to the community using the `zfs-discuss +mailing list `__ +or IRC at #zfsonlinux on `freenode `__. If you +have a bug report or feature request related to this HOWTO, please `file +a new issue `__ and +mention @rlaager. + +Contributing +------------ + +Edit permission on this wiki is restricted. Also, GitHub wikis do not +support pull requests. 
However, you can clone the wiki using git. + +1) ``git clone https://github.com/zfsonlinux/zfs.wiki.git`` +2) Make your changes. +3) Use ``git diff > my-changes.patch`` to create a patch. (Advanced git + users may wish to ``git commit`` to a branch and + ``git format-patch``.) +4) `File a new issue `__, + mention @rlaager, and attach the patch. + +Encryption +---------- + +This guide supports three different encryption options: unencrypted, +LUKS (full-disk encryption), and ZFS native encryption. With any option, +all ZFS features are fully available. + +Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance. + +LUKS encrypts almost everything: the OS, swap, home directories, and +anything else. The only unencrypted data is the bootloader, kernel, and +initrd. The system cannot boot without the passphrase being entered at +the console. Performance is good, but LUKS sits underneath ZFS, so if +multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk. + +ZFS native encryption encrypts the data and most metadata in the root +pool. It does not encrypt dataset or snapshot names or properties. The +boot pool is not encrypted at all, but it only contains the bootloader, +kernel, and initrd. (Unless you put a password in ``/etc/fstab``, the +initrd is unlikely to contain sensitive data.) The system cannot boot +without the passphrase being entered at the console. Performance is +good. As the encryption happens in ZFS, even if multiple disks (mirror +or raidz topologies) are used, the data only has to be encrypted once. + +Step 1: Prepare The Install Environment +--------------------------------------- + +1.1 Boot the Debian GNU/Linux Live CD. If prompted, login with the +username ``user`` and password ``live``. Connect your system to the +Internet as appropriate (e.g. join your WiFi network). + +1.2 Optional: Install and start the OpenSSH server in the Live CD +environment: + +If you have a second system, using SSH to access the target system can +be convenient. + +:: + + sudo apt update + sudo apt install --yes openssh-server + sudo systemctl restart ssh + +**Hint:** You can find your IP address with +``ip addr show scope global | grep inet``. Then, from your main machine, +connect with ``ssh user@IP``. + +1.3 Become root: + +:: + + sudo -i + +1.4 Setup and update the repositories: + +:: + + echo deb http://deb.debian.org/debian buster contrib >> /etc/apt/sources.list + echo deb http://deb.debian.org/debian buster-backports main contrib >> /etc/apt/sources.list + apt update + +1.5 Install ZFS in the Live CD environment: + +:: + + apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r) + apt install --yes -t buster-backports --no-install-recommends zfs-dkms + modprobe zfs + apt install --yes -t buster-backports zfsutils-linux + +- The dkms dependency is installed manually just so it comes from + buster and not buster-backports. This is not critical. +- We need to get the module built and loaded before installing + zfsutils-linux or `zfs-mount.service will fail to + start `__. + +Step 2: Disk Formatting +----------------------- + +2.1 Set a variable with the disk name: + +:: + + DISK=/dev/disk/by-id/scsi-SATA_disk1 + +Always use the long ``/dev/disk/by-id/*`` aliases with ZFS. Using the +``/dev/sd*`` device nodes directly can cause sporadic import failures, +especially on systems that have more than one storage pool. 
+ +**Hints:** + +- ``ls -la /dev/disk/by-id`` will list the aliases. +- Are you doing this in a virtual machine? If your virtual disk is + missing from ``/dev/disk/by-id``, use ``/dev/vda`` if you are using + KVM with virtio; otherwise, read the + `troubleshooting <#troubleshooting>`__ section. + +2.2 If you are re-using a disk, clear it as necessary: + +If the disk was previously used in an MD array, zero the superblock: + +:: + + apt install --yes mdadm + mdadm --zero-superblock --force $DISK + +Clear the partition table: + +:: + + sgdisk --zap-all $DISK + +2.3 Partition your disk(s): + +Run this if you need legacy (BIOS) booting: + +:: + + sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK + +Run this for UEFI booting (for use now or in the future): + +:: + + sgdisk -n2:1M:+512M -t2:EF00 $DISK + +Run this for the boot pool: + +:: + + sgdisk -n3:0:+1G -t3:BF01 $DISK + +Choose one of the following options: + +2.3a Unencrypted or ZFS native encryption: + +:: + + sgdisk -n4:0:0 -t4:BF01 $DISK + +2.3b LUKS: + +:: + + sgdisk -n4:0:0 -t4:8300 $DISK + +If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool. + +2.4 Create the boot pool: + +:: + + zpool create -o ashift=12 -d \ + -o feature@async_destroy=enabled \ + -o feature@bookmarks=enabled \ + -o feature@embedded_data=enabled \ + -o feature@empty_bpobj=enabled \ + -o feature@enabled_txg=enabled \ + -o feature@extensible_dataset=enabled \ + -o feature@filesystem_limits=enabled \ + -o feature@hole_birth=enabled \ + -o feature@large_blocks=enabled \ + -o feature@lz4_compress=enabled \ + -o feature@spacemap_histogram=enabled \ + -o feature@userobj_accounting=enabled \ + -o feature@zpool_checkpoint=enabled \ + -o feature@spacemap_v2=enabled \ + -o feature@project_quota=enabled \ + -o feature@resilver_defer=enabled \ + -o feature@allocation_classes=enabled \ + -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \ + -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt bpool ${DISK}-part3 + +You should not need to customize any of the options for the boot pool. + +GRUB does not support all of the zpool features. See +``spa_feature_names`` in +`grub-core/fs/zfs/zfs.c `__. +This step creates a separate boot pool for ``/boot`` with the features +limited to only those that GRUB supports, allowing the root pool to use +any/all features. Note that GRUB opens the pool read-only, so all +read-only compatible features are "supported" by GRUB. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... bpool mirror /dev/disk/by-id/scsi-SATA_disk1-part3 /dev/disk/by-id/scsi-SATA_disk2-part3`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). +- The pool name is arbitrary. If changed, the new name must be used + consistently. The ``bpool`` convention originated in this HOWTO. 
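+
+Optionally, you can sanity-check the new boot pool before continuing. This is only a quick look, assuming you kept the pool name ``bpool``; the second command lists the pool's feature flags so you can confirm that only the GRUB-compatible features enabled above are active:
+
+::
+
+   zpool status bpool
+   zpool get -H -o property,value all bpool | grep feature@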
+ +2.5 Create the root pool: + +Choose one of the following options: + +2.5a Unencrypted: + +:: + + zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt rpool ${DISK}-part4 + +2.5b LUKS: + +:: + + apt install --yes cryptsetup + cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4 + cryptsetup luksOpen ${DISK}-part4 luks1 + zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt rpool /dev/mapper/luks1 + +2.5c ZFS native encryption: + +:: + + zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O encryption=aes-256-gcm -O keylocation=prompt -O keyformat=passphrase \ + -O mountpoint=/ -R /mnt rpool ${DISK}-part4 + +- The use of ``ashift=12`` is recommended here because many drives + today have 4KiB (or larger) physical sectors, even though they + present 512B logical sectors. Also, a future replacement drive may + have 4KiB physical sectors (in which case ``ashift=12`` is desirable) + or 4KiB logical sectors (in which case ``ashift=12`` is required). +- Setting ``-O acltype=posixacl`` enables POSIX ACLs globally. If you + do not want this, remove that option, but later add + ``-o acltype=posixacl`` (note: lowercase "o") to the ``zfs create`` + for ``/var/log``, as `journald requires + ACLs `__ +- Setting ``normalization=formD`` eliminates some corner cases relating + to UTF-8 filename normalization. It also implies ``utf8only=on``, + which means that only UTF-8 filenames are allowed. If you care to + support non-UTF-8 filenames, do not use this option. For a discussion + of why requiring UTF-8 filenames may be a bad idea, see `The problems + with enforced UTF-8 only + filenames `__. +- Setting ``relatime=on`` is a middle ground between classic POSIX + ``atime`` behavior (with its significant performance impact) and + ``atime=off`` (which provides the best performance by completely + disabling atime updates). Since Linux 2.6.30, ``relatime`` has been + the default for other filesystems. See `RedHat's + documentation `__ + for further information. +- Setting ``xattr=sa`` `vastly improves the performance of extended + attributes `__. + Inside ZFS, extended attributes are used to implement POSIX ACLs. + Extended attributes can also be used by user-space applications. + `They are used by some desktop GUI + applications. `__ + `They can be used by Samba to store Windows ACLs and DOS attributes; + they are required for a Samba Active Directory domain + controller. `__ + Note that ```xattr=sa`` is + Linux-specific. `__ + If you move your ``xattr=sa`` pool to another OpenZFS implementation + besides ZFS-on-Linux, extended attributes will not be readable + (though your data will be). If portability of extended attributes is + important to you, omit the ``-O xattr=sa`` above. Even if you do not + want ``xattr=sa`` for the whole pool, it is probably fine to use it + for ``/var/log``. +- Make sure to include the ``-part4`` portion of the drive path. If you + forget that, you are specifying the whole disk, which ZFS will then + re-partition, and you will lose the bootloader partition(s). +- For LUKS, the key size chosen is 512 bits. However, XTS mode requires + two keys, so the LUKS key is split in half. 
Thus, ``-s 512`` means + AES-256. +- ZFS native encryption uses ``aes-256-ccm`` by default. `AES-GCM seems + to be generally preferred over + AES-CCM `__, + `is faster + now `__, + and `will be even faster in the + future `__. +- Your passphrase will likely be the weakest link. Choose wisely. See + `section 5 of the cryptsetup + FAQ `__ + for guidance. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part4 /dev/disk/by-id/scsi-SATA_disk2-part4`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). For LUKS, use + ``/dev/mapper/luks1``, ``/dev/mapper/luks2``, etc., which you will + have to create using ``cryptsetup``. +- The pool name is arbitrary. If changed, the new name must be used + consistently. On systems that can automatically install to ZFS, the + root pool is named ``rpool`` by default. + +Step 3: System Installation +--------------------------- + +3.1 Create filesystem datasets to act as containers: + +:: + + zfs create -o canmount=off -o mountpoint=none rpool/ROOT + zfs create -o canmount=off -o mountpoint=none bpool/BOOT + +On Solaris systems, the root filesystem is cloned and the suffix is +incremented for major system changes through ``pkg image-update`` or +``beadm``. Similar functionality for APT is possible but currently +unimplemented. Even without such a tool, it can still be used for +manually created clones. + +3.2 Create filesystem datasets for the root and boot filesystems: + +:: + + zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian + zfs mount rpool/ROOT/debian + + zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/debian + zfs mount bpool/BOOT/debian + +With ZFS, it is not normally necessary to use a mount command (either +``mount`` or ``zfs mount``). This situation is an exception because of +``canmount=noauto``. + +3.3 Create datasets: + +:: + + zfs create rpool/home + zfs create -o mountpoint=/root rpool/home/root + zfs create -o canmount=off rpool/var + zfs create -o canmount=off rpool/var/lib + zfs create rpool/var/log + zfs create rpool/var/spool + +The datasets below are optional, depending on your preferences and/or +software choices. 
+ +If you wish to exclude these from snapshots: + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/cache + zfs create -o com.sun:auto-snapshot=false rpool/var/tmp + chmod 1777 /mnt/var/tmp + +If you use /opt on this system: + +:: + + zfs create rpool/opt + +If you use /srv on this system: + +:: + + zfs create rpool/srv + +If you use /usr/local on this system: + +:: + + zfs create -o canmount=off rpool/usr + zfs create rpool/usr/local + +If this system will have games installed: + +:: + + zfs create rpool/var/games + +If this system will store local email in /var/mail: + +:: + + zfs create rpool/var/mail + +If this system will use Snap packages: + +:: + + zfs create rpool/var/snap + +If you use /var/www on this system: + +:: + + zfs create rpool/var/www + +If this system will use GNOME: + +:: + + zfs create rpool/var/lib/AccountsService + +If this system will use Docker (which manages its own datasets & +snapshots): + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker + +If this system will use NFS (locking): + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs + +A tmpfs is recommended later, but if you want a separate dataset for +/tmp: + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/tmp + chmod 1777 /mnt/tmp + +The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data such as logs (in ``/var/log``). This will be especially +important if/when a ``beadm`` or similar utility is integrated. The +``com.sun.auto-snapshot`` setting is used by some ZFS snapshot utilities +to exclude transient data. + +If you do nothing extra, ``/tmp`` will be stored as part of the root +filesystem. Alternatively, you can create a separate dataset for +``/tmp``, as shown above. This keeps the ``/tmp`` data out of snapshots +of your root filesystem. It also allows you to set a quota on +``rpool/tmp``, if you want to limit the maximum space used. Otherwise, +you can use a tmpfs (RAM filesystem) later. + +3.4 Install the minimal system: + +:: + + debootstrap buster /mnt + zfs set devices=off rpool + +The ``debootstrap`` command leaves the new system in an unconfigured +state. An alternative to using ``debootstrap`` is to copy the entirety +of a working system into the new ZFS root. + +Step 4: System Configuration +---------------------------- + +4.1 Configure the hostname (change ``HOSTNAME`` to the desired +hostname). + +:: + + echo HOSTNAME > /mnt/etc/hostname + + vi /mnt/etc/hosts + Add a line: + 127.0.1.1 HOSTNAME + or if the system has a real name in DNS: + 127.0.1.1 FQDN HOSTNAME + +**Hint:** Use ``nano`` if you find ``vi`` confusing. + +4.2 Configure the network interface: + +Find the interface name: + +:: + + ip addr show + +Adjust NAME below to match your interface name: + +:: + + vi /mnt/etc/network/interfaces.d/NAME + auto NAME + iface NAME inet dhcp + +Customize this file if the system is not a DHCP client. 
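+
+As a sketch only, a static configuration might look like the following; the interface name and the ``192.0.2.x`` addresses are placeholders, so substitute your own values:
+
+::
+
+   vi /mnt/etc/network/interfaces.d/NAME
+   auto NAME
+   iface NAME inet static
+       address 192.0.2.10
+       netmask 255.255.255.0
+       gateway 192.0.2.1
+
+With a static configuration, you will typically also need to provide a nameserver entry in ``/etc/resolv.conf`` on the new system.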
+ +4.3 Configure the package sources: + +:: + + vi /mnt/etc/apt/sources.list + deb http://deb.debian.org/debian buster main contrib + deb-src http://deb.debian.org/debian buster main contrib + + vi /mnt/etc/apt/sources.list.d/buster-backports.list + deb http://deb.debian.org/debian buster-backports main contrib + deb-src http://deb.debian.org/debian buster-backports main contrib + + vi /mnt/etc/apt/preferences.d/90_zfs + Package: libnvpair1linux libuutil1linux libzfs2linux libzfslinux-dev libzpool2linux python3-pyzfs pyzfs-doc spl spl-dkms zfs-dkms zfs-dracut zfs-initramfs zfs-test zfsutils-linux zfsutils-linux-dev zfs-zed + Pin: release n=buster-backports + Pin-Priority: 990 + +4.4 Bind the virtual filesystems from the LiveCD environment to the new +system and ``chroot`` into it: + +:: + + mount --rbind /dev /mnt/dev + mount --rbind /proc /mnt/proc + mount --rbind /sys /mnt/sys + chroot /mnt /usr/bin/env DISK=$DISK bash --login + +**Note:** This is using ``--rbind``, not ``--bind``. + +4.5 Configure a basic system environment: + +:: + + ln -s /proc/self/mounts /etc/mtab + apt update + + apt install --yes locales + dpkg-reconfigure locales + +Even if you prefer a non-English system language, always ensure that +``en_US.UTF-8`` is available. + +:: + + dpkg-reconfigure tzdata + +4.6 Install ZFS in the chroot environment for the new system: + +:: + + apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64 + apt install --yes zfs-initramfs + +4.7 For LUKS installs only, setup crypttab: + +:: + + apt install --yes cryptsetup + + echo luks1 UUID=$(blkid -s UUID -o value ${DISK}-part4) none \ + luks,discard,initramfs > /etc/crypttab + +- The use of ``initramfs`` is a work-around for `cryptsetup does not + support + ZFS `__. + +**Hint:** If you are creating a mirror or raidz topology, repeat the +``/etc/crypttab`` entries for ``luks2``, etc. adjusting for each disk. + +4.8 Install GRUB + +Choose one of the following options: + +4.8a Install GRUB for legacy (BIOS) booting + +:: + + apt install --yes grub-pc + +Install GRUB to the disk(s), not the partition(s). + +4.8b Install GRUB for UEFI booting + +:: + + apt install dosfstools + mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2 + mkdir /boot/efi + echo PARTUUID=$(blkid -s PARTUUID -o value ${DISK}-part2) \ + /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab + mount /boot/efi + apt install --yes grub-efi-amd64 shim-signed + +- The ``-s 1`` for ``mkdosfs`` is only necessary for drives which + present 4 KiB logical sectors (“4Kn” drives) to meet the minimum + cluster size (given the partition size of 512 MiB) for FAT32. It also + works fine on drives which present 512 B sectors. + +**Note:** If you are creating a mirror or raidz topology, this step only +installs GRUB on the first disk. The other disk(s) will be handled +later. + +4.9 Set a root password + +:: + + passwd + +4.10 Enable importing bpool + +This ensures that ``bpool`` is always imported, regardless of whether +``/etc/zfs/zpool.cache`` exists, whether it is in the cachefile or not, +or whether ``zfs-import-scan.service`` is enabled. 
+ +:: + + vi /etc/systemd/system/zfs-import-bpool.service + [Unit] + DefaultDependencies=no + Before=zfs-import-scan.service + Before=zfs-import-cache.service + + [Service] + Type=oneshot + RemainAfterExit=yes + ExecStart=/sbin/zpool import -N -o cachefile=none bpool + + [Install] + WantedBy=zfs-import.target + +:: + + systemctl enable zfs-import-bpool.service + +4.11 Optional (but recommended): Mount a tmpfs to /tmp + +If you chose to create a ``/tmp`` dataset above, skip this step, as they +are mutually exclusive choices. Otherwise, you can put ``/tmp`` on a +tmpfs (RAM filesystem) by enabling the ``tmp.mount`` unit. + +:: + + cp /usr/share/systemd/tmp.mount /etc/systemd/system/ + systemctl enable tmp.mount + +4.12 Optional (but kindly requested): Install popcon + +The ``popularity-contest`` package reports the list of packages install +on your system. Showing that ZFS is popular may be helpful in terms of +long-term attention from the distro. + +:: + + apt install --yes popularity-contest + +Choose Yes at the prompt. + +Step 5: GRUB Installation +------------------------- + +5.1 Verify that the ZFS boot filesystem is recognized: + +:: + + grub-probe /boot + +5.2 Refresh the initrd files: + +:: + + update-initramfs -u -k all + +**Note:** When using LUKS, this will print "WARNING could not determine +root device from /etc/fstab". This is because `cryptsetup does not +support +ZFS `__. + +5.3 Workaround GRUB's missing zpool-features support: + +:: + + vi /etc/default/grub + Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian" + +5.4 Optional (but highly recommended): Make debugging GRUB easier: + +:: + + vi /etc/default/grub + Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT + Uncomment: GRUB_TERMINAL=console + Save and quit. + +Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired. + +5.5 Update the boot configuration: + +:: + + update-grub + +**Note:** Ignore errors from ``osprober``, if present. + +5.6 Install the boot loader + +5.6a For legacy (BIOS) booting, install GRUB to the MBR: + +:: + + grub-install $DISK + +Note that you are installing GRUB to the whole disk, not a partition. + +If you are creating a mirror or raidz topology, repeat the +``grub-install`` command for each disk in the pool. + +5.6b For UEFI booting, install GRUB: + +:: + + grub-install --target=x86_64-efi --efi-directory=/boot/efi \ + --bootloader-id=debian --recheck --no-floppy + +It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later. + +5.7 Verify that the ZFS module is installed: + +:: + + ls /boot/grub/*/zfs.mod + +5.8 Fix filesystem mount ordering + +Until there is support for mounting ``/boot`` in the initramfs, we also +need to mount that, because it was marked ``canmount=noauto``. Also, +with UEFI, we need to ensure it is mounted before its child filesystem +``/boot/efi``. + +We need to activate ``zfs-mount-generator``. This makes systemd aware of +the separate mountpoints, which is important for things like +``/var/log`` and ``/var/tmp``. In turn, ``rsyslog.service`` depends on +``var-log.mount`` by way of ``local-fs.target`` and services using the +``PrivateTmp`` feature of systemd automatically use +``After=var-tmp.mount``. 
+ +For UEFI booting, unmount /boot/efi first: + +:: + + umount /boot/efi + +Everything else applies to both BIOS and UEFI booting: + +:: + + zfs set mountpoint=legacy bpool/BOOT/debian + echo bpool/BOOT/debian /boot zfs \ + nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab + + mkdir /etc/zfs/zfs-list.cache + touch /etc/zfs/zfs-list.cache/rpool + ln -s /usr/lib/zfs-linux/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d + zed -F & + +Verify that zed updated the cache by making sure this is not empty: + +:: + + cat /etc/zfs/zfs-list.cache/rpool + +If it is empty, force a cache update and check again: + +:: + + zfs set canmount=noauto rpool/ROOT/debian + +Stop zed: + +:: + + fg + Press Ctrl-C. + +Fix the paths to eliminate /mnt: + +:: + + sed -Ei "s|/mnt/?|/|" /etc/zfs/zfs-list.cache/rpool + +Step 6: First Boot +------------------ + +6.1 Snapshot the initial installation: + +:: + + zfs snapshot bpool/BOOT/debian@install + zfs snapshot rpool/ROOT/debian@install + +In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space. + +6.2 Exit from the ``chroot`` environment back to the LiveCD environment: + +:: + + exit + +6.3 Run these commands in the LiveCD environment to unmount all +filesystems: + +:: + + mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + zpool export -a + +6.4 Reboot: + +:: + + reboot + +6.5 Wait for the newly installed system to boot normally. Login as root. + +6.6 Create a user account: + +:: + + zfs create rpool/home/YOURUSERNAME + adduser YOURUSERNAME + cp -a /etc/skel/. /home/YOURUSERNAME + chown -R YOURUSERNAME:YOURUSERNAME /home/YOURUSERNAME + +6.7 Add your user account to the default set of groups for an +administrator: + +:: + + usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video YOURUSERNAME + +6.8 Mirror GRUB + +If you installed to multiple disks, install GRUB on the additional +disks: + +6.8a For legacy (BIOS) booting: + +:: + + dpkg-reconfigure grub-pc + Hit enter until you get to the device selection screen. + Select (using the space bar) all of the disks (not partitions) in your pool. + +6.8b UEFI + +:: + + umount /boot/efi + +For the second and subsequent disks (increment debian-2 to -3, etc.): + +:: + + dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \ + of=/dev/disk/by-id/scsi-SATA_disk2-part2 + efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \ + -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi' + + mount /boot/efi + +Step 7: (Optional) Configure Swap +--------------------------------- + +**Caution**: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. This issue is currently being investigated in: +`https://github.com/zfsonlinux/zfs/issues/7734 `__ + +7.1 Create a volume dataset (zvol) for use as a swap device: + +:: + + zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \ + -o logbias=throughput -o sync=always \ + -o primarycache=metadata -o secondarycache=none \ + -o com.sun:auto-snapshot=false rpool/swap + +You can adjust the size (the ``4G`` part) to your needs. + +The compression algorithm is set to ``zle`` because it is the cheapest +available algorithm. As this guide recommends ``ashift=12`` (4 kiB +blocks on disk), the common case of a 4 kiB page size means that no +compression algorithm can reduce I/O. 
The exception is all-zero pages, +which are dropped by ZFS; but some form of compression has to be enabled +to get this behavior. + +7.2 Configure the swap device: + +**Caution**: Always use long ``/dev/zvol`` aliases in configuration +files. Never use a short ``/dev/zdX`` device name. + +:: + + mkswap -f /dev/zvol/rpool/swap + echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab + echo RESUME=none > /etc/initramfs-tools/conf.d/resume + +The ``RESUME=none`` is necessary to disable resuming from hibernation. +This does not work, as the zvol is not present (because the pool has not +yet been imported) at the time the resume script runs. If it is not +disabled, the boot process hangs for 30 seconds waiting for the swap +zvol to appear. + +7.3 Enable the swap device: + +:: + + swapon -av + +Step 8: Full Software Installation +---------------------------------- + +8.1 Upgrade the minimal system: + +:: + + apt dist-upgrade --yes + +8.2 Install a regular set of software: + +:: + + tasksel + +8.3 Optional: Disable log compression: + +As ``/var/log`` is already compressed by ZFS, logrotate’s compression is +going to burn CPU and disk I/O for (in most cases) very little gain. +Also, if you are making snapshots of ``/var/log``, logrotate’s +compression will actually waste space, as the uncompressed data will +live on in the snapshot. You can edit the files in ``/etc/logrotate.d`` +by hand to comment out ``compress``, or use this loop (copy-and-paste +highly recommended): + +:: + + for file in /etc/logrotate.d/* ; do + if grep -Eq "(^|[^#y])compress" "$file" ; then + sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file" + fi + done + +8.4 Reboot: + +:: + + reboot + +Step 9: Final Cleanup +~~~~~~~~~~~~~~~~~~~~~ + +9.1 Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally. + +9.2 Optional: Delete the snapshots of the initial installation: + +:: + + sudo zfs destroy bpool/BOOT/debian@install + sudo zfs destroy rpool/ROOT/debian@install + +9.3 Optional: Disable the root password + +:: + + sudo usermod -p '*' root + +9.4 Optional: Re-enable the graphical boot process: + +If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer. + +:: + + sudo vi /etc/default/grub + Add quiet to GRUB_CMDLINE_LINUX_DEFAULT + Comment out GRUB_TERMINAL=console + Save and quit. + + sudo update-grub + +**Note:** Ignore errors from ``osprober``, if present. + +9.5 Optional: For LUKS installs only, backup the LUKS header: + +:: + + sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \ + --header-backup-file luks1-header.dat + +Store that backup somewhere safe (e.g. cloud storage). It is protected +by your LUKS passphrase, but you may wish to use additional encryption. + +**Hint:** If you created a mirror or raidz topology, repeat this for +each LUKS volume (``luks2``, etc.). + +Troubleshooting +--------------- + +Rescuing using a Live CD +~~~~~~~~~~~~~~~~~~~~~~~~ + +Go through `Step 1: Prepare The Install +Environment <#step-1-prepare-the-install-environment>`__. + +For LUKS, first unlock the disk(s): + +:: + + apt install --yes cryptsetup + cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1 + Repeat for additional disks, if this is a mirror or raidz topology. 
+ +Mount everything correctly: + +:: + + zpool export -a + zpool import -N -R /mnt rpool + zpool import -N -R /mnt bpool + zfs load-key -a + zfs mount rpool/ROOT/debian + zfs mount -a + +If needed, you can chroot into your installed environment: + +:: + + mount --rbind /dev /mnt/dev + mount --rbind /proc /mnt/proc + mount --rbind /sys /mnt/sys + chroot /mnt /bin/bash --login + mount /boot + mount -a + +Do whatever you need to do to fix your system. + +When done, cleanup: + +:: + + exit + mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + zpool export -a + reboot + +MPT2SAS +~~~~~~~ + +Most problem reports for this tutorial involve ``mpt2sas`` hardware that +does slow asynchronous drive initialization, like some IBM M1015 or +OEM-branded cards that have been flashed to the reference LSI firmware. + +The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +`https://github.com/zfsonlinux/zfs/issues/330 `__. + +Most LSI cards are perfectly compatible with ZoL. If your card has this +glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X in +/etc/default/zfs. The system will wait X seconds for all drives to +appear before importing the pool. + +Areca +~~~~~ + +Systems that require the ``arcsas`` blob driver should add it to the +``/etc/initramfs-tools/modules`` file and run +``update-initramfs -u -k all``. + +Upgrade or downgrade the Areca driver if something like +``RIP: 0010:[] [] native_read_tsc+0x6/0x20`` +appears anywhere in kernel log. ZoL is unstable on systems that emit +this error message. + +VMware +~~~~~~ + +- Set ``disk.EnableUUID = "TRUE"`` in the vmx file or vsphere + configuration. Doing this ensures that ``/dev/disk`` aliases are + created in the guest. + +QEMU/KVM/XEN +~~~~~~~~~~~~ + +Set a unique serial number on each virtual disk using libvirt or qemu +(e.g. ``-drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890``). + +To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host: + +:: + + sudo apt install ovmf + + sudo vi /etc/libvirt/qemu.conf + Uncomment these lines: + nvram = [ + "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd", + "/usr/share/OVMF/OVMF_CODE.secboot.fd:/usr/share/OVMF/OVMF_VARS.fd", + "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd", + "/usr/share/AAVMF/AAVMF32_CODE.fd:/usr/share/AAVMF/AAVMF32_VARS.fd" + ] + + sudo systemctl restart libvirtd.service diff --git a/docs/Debian-GNU-Linux-initrd-documentation.rst b/docs/Debian-GNU-Linux-initrd-documentation.rst new file mode 100644 index 0000000..00a638f --- /dev/null +++ b/docs/Debian-GNU-Linux-initrd-documentation.rst @@ -0,0 +1,122 @@ +Supported boot parameters +========================= + +- rollback= Do a rollback of specified snapshot. +- zfs_debug= Debug the initrd script +- zfs_force= Force importing the pool. Should not be + necessary. +- zfs= Don't try to import ANY pool, mount ANY filesystem or + even load the module. +- rpool= Use this pool for root pool. +- bootfs=/ Use this dataset for root filesystem. +- root=/ Use this dataset for root filesystem. +- root=ZFS=/ Use this dataset for root filesystem. +- root=zfs:/ Use this dataset for root filesystem. +- root=zfs:AUTO Try to detect both pool and rootfs + +In all these cases, could also be @. 
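+
+For example (the pool and dataset names here are only illustrative), a
+kernel command line for a root filesystem on rpool/ROOT/debian might end
+with:
+
+::
+
+   root=ZFS=rpool/ROOT/debian ro boot=zfs
+
+or, to let the initrd detect the pool and root filesystem on its own:
+
+::
+
+   root=zfs:AUTO ro boot=zfs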
+
+The reason there are so many supported boot options to get the root
+filesystem is that there are a lot of different ways to boot ZFS out
+there, and I wanted to make sure I supported them all.
+
+Pool imports
+============
+
+Import using /dev/disk/by-\*
+----------------------------
+
+The initrd will, if the variable USE_DISK_BY_ID is set in the file
+/etc/default/zfs, try to import using the /dev/disk/by-\* links. It will
+try to import in this order:
+
+1. /dev/disk/by-vdev
+2. /dev/disk/by-\*
+3. /dev
+
+Import using cache file
+-----------------------
+
+If all of these imports fail (or if USE_DISK_BY_ID is unset), it will
+then try to import using the cache file.
+
+Last ditch attempt at importing
+-------------------------------
+
+If that ALSO fails, it will try one more time, without any -d or -c
+options.
+
+Booting
+=======
+
+Booting from snapshot:
+----------------------
+
+Enter the snapshot for the root= parameter like in this example:
+
+::
+
+   linux /ROOT/debian-1@/boot/vmlinuz-3.2.0-4-amd64 root=ZFS=rpool/ROOT/debian-1@some_snapshot ro boot=zfs $bootfs quiet
+
+This will clone the snapshot rpool/ROOT/debian-1@some_snapshot into the
+filesystem rpool/ROOT/debian-1_some_snapshot and use that as the root
+filesystem. The original filesystem and snapshot are left alone in this
+case.
+
+**BEWARE** that it will first blindly destroy the
+rpool/ROOT/debian-1_some_snapshot filesystem before trying to clone the
+snapshot into it again. So if you've booted from the same snapshot
+previously and made some changes in that root filesystem, they will be
+undone by the destruction of the filesystem.
+
+Snapshot rollback
+-----------------
+
+From version 0.6.4-1-3 it is now also possible to specify rollback=1 to
+do a rollback of the snapshot instead of cloning it. **BEWARE** that
+this will destroy *all* snapshots done after the specified snapshot!
+
+Select snapshot dynamically
+---------------------------
+
+From version 0.6.4-1-3 it is now also possible to specify a NULL
+snapshot name (such as root=rpool/ROOT/debian-1@) and if so, the initrd
+script will discover all snapshots below that filesystem (sans the at),
+and output a list of snapshots for the user to choose from.
+
+Booting from native encrypted filesystem
+----------------------------------------
+
+Although there is currently no support for native encryption in ZFS On
+Linux, there is a patch floating around 'out there', and the initrd
+supports loading the key and unlocking such an encrypted filesystem.
+
+Separated filesystems
+---------------------
+
+Descended filesystems
+~~~~~~~~~~~~~~~~~~~~~
+
+If there are separate filesystems (for example a separate dataset for
+/usr), the snapshot boot code will try to find the snapshot under each
+filesystem and clone (or roll back) them.
+
+Example:
+
+::
+
+   rpool/ROOT/debian-1@some_snapshot
+   rpool/ROOT/debian-1/usr@some_snapshot
+
+These will create the following filesystems respectively (if not doing a
+rollback):
+
+::
+
+   rpool/ROOT/debian-1_some_snapshot
+   rpool/ROOT/debian-1/usr_some_snapshot
+
+The initrd code will use the mountpoint option (if any) in the original
+(without the snapshot part) dataset to find *where* it should mount the
+dataset. Otherwise, it will use the name of the dataset below the root
+filesystem (rpool/ROOT/debian-1 in this example) for the mount point.
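+
+As a rough sketch of what the snapshot boot described above amounts to
+(the names are the ones from the example; the initrd performs these
+steps automatically and the exact commands it runs may differ):
+
+::
+
+   # Destroyed blindly if it already exists from a previous snapshot boot:
+   zfs destroy rpool/ROOT/debian-1_some_snapshot
+   # Clone each snapshot into its own filesystem and boot from the first one:
+   zfs clone rpool/ROOT/debian-1@some_snapshot rpool/ROOT/debian-1_some_snapshot
+   zfs clone rpool/ROOT/debian-1/usr@some_snapshot rpool/ROOT/debian-1/usr_some_snapshot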
diff --git a/docs/Debian-Stretch-Root-on-ZFS.rst b/docs/Debian-Stretch-Root-on-ZFS.rst new file mode 100644 index 0000000..5157a8a --- /dev/null +++ b/docs/Debian-Stretch-Root-on-ZFS.rst @@ -0,0 +1,1052 @@ +Newer release available +~~~~~~~~~~~~~~~~~~~~~~~ + +- See [[Debian Buster Root on ZFS]] for new installs. + +Caution +~~~~~~~ + +- This HOWTO uses a whole physical disk. +- Do not use these instructions for dual-booting. +- Backup your data. Any existing data will be lost. + +System Requirements +~~~~~~~~~~~~~~~~~~~ + +- `64-bit Debian GNU/Linux Stretch Live + CD `__ +- `A 64-bit kernel is strongly + encouraged. `__ +- Installing on a drive which presents 4KiB logical sectors (a “4Kn” + drive) only works with UEFI booting. This not unique to ZFS. `GRUB + does not and will not work on 4Kn with legacy (BIOS) + booting. `__ + +Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of +memory is recommended for normal performance in basic workloads. If you +wish to use deduplication, you will need `massive amounts of +RAM `__. Enabling +deduplication is a permanent change that cannot be easily reverted. + +Support +------- + +If you need help, reach out to the community using the `zfs-discuss +mailing list `__ +or IRC at #zfsonlinux on `freenode `__. If you +have a bug report or feature request related to this HOWTO, please `file +a new issue `__ and +mention @rlaager. + +Contributing +------------ + +Edit permission on this wiki is restricted. Also, GitHub wikis do not +support pull requests. However, you can clone the wiki using git. + +1) ``git clone https://github.com/zfsonlinux/zfs.wiki.git`` +2) Make your changes. +3) Use ``git diff > my-changes.patch`` to create a patch. (Advanced git + users may wish to ``git commit`` to a branch and + ``git format-patch``.) +4) `File a new issue `__, + mention @rlaager, and attach the patch. + +Encryption +---------- + +This guide supports two different encryption options: unencrypted and +LUKS (full-disk encryption). ZFS native encryption has not yet been +released. With either option, all ZFS features are fully available. + +Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance. + +LUKS encrypts almost everything: the OS, swap, home directories, and +anything else. The only unencrypted data is the bootloader, kernel, and +initrd. The system cannot boot without the passphrase being entered at +the console. Performance is good, but LUKS sits underneath ZFS, so if +multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk. + +Step 1: Prepare The Install Environment +--------------------------------------- + +1.1 Boot the Debian GNU/Linux Live CD. If prompted, login with the +username ``user`` and password ``live``. Connect your system to the +Internet as appropriate (e.g. join your WiFi network). + +1.2 Optional: Install and start the OpenSSH server in the Live CD +environment: + +If you have a second system, using SSH to access the target system can +be convenient. + +:: + + $ sudo apt update + $ sudo apt install --yes openssh-server + $ sudo systemctl restart ssh + +**Hint:** You can find your IP address with +``ip addr show scope global | grep inet``. Then, from your main machine, +connect with ``ssh user@IP``. 
+ +1.3 Become root: + +:: + + $ sudo -i + +1.4 Setup and update the repositories: + +:: + + # echo deb http://deb.debian.org/debian stretch contrib >> /etc/apt/sources.list + # echo deb http://deb.debian.org/debian stretch-backports main contrib >> /etc/apt/sources.list + # apt update + +1.5 Install ZFS in the Live CD environment: + +:: + + # apt install --yes debootstrap gdisk dkms dpkg-dev linux-headers-$(uname -r) + # apt install --yes -t stretch-backports zfs-dkms + # modprobe zfs + +- The dkms dependency is installed manually just so it comes from + stretch and not stretch-backports. This is not critical. + +Step 2: Disk Formatting +----------------------- + +2.1 If you are re-using a disk, clear it as necessary: + +:: + + If the disk was previously used in an MD array, zero the superblock: + # apt install --yes mdadm + # mdadm --zero-superblock --force /dev/disk/by-id/scsi-SATA_disk1 + + Clear the partition table: + # sgdisk --zap-all /dev/disk/by-id/scsi-SATA_disk1 + +2.2 Partition your disk(s): + +:: + + Run this if you need legacy (BIOS) booting: + # sgdisk -a1 -n1:24K:+1000K -t1:EF02 /dev/disk/by-id/scsi-SATA_disk1 + + Run this for UEFI booting (for use now or in the future): + # sgdisk -n2:1M:+512M -t2:EF00 /dev/disk/by-id/scsi-SATA_disk1 + + Run this for the boot pool: + # sgdisk -n3:0:+1G -t3:BF01 /dev/disk/by-id/scsi-SATA_disk1 + +Choose one of the following options: + +2.2a Unencrypted: + +:: + + # sgdisk -n4:0:0 -t4:BF01 /dev/disk/by-id/scsi-SATA_disk1 + +2.2b LUKS: + +:: + + # sgdisk -n4:0:0 -t4:8300 /dev/disk/by-id/scsi-SATA_disk1 + +Always use the long ``/dev/disk/by-id/*`` aliases with ZFS. Using the +``/dev/sd*`` device nodes directly can cause sporadic import failures, +especially on systems that have more than one storage pool. + +**Hints:** + +- ``ls -la /dev/disk/by-id`` will list the aliases. +- Are you doing this in a virtual machine? If your virtual disk is + missing from ``/dev/disk/by-id``, use ``/dev/vda`` if you are using + KVM with virtio; otherwise, read the + `troubleshooting <#troubleshooting>`__ section. +- If you are creating a mirror or raidz topology, repeat the + partitioning commands for all the disks which will be part of the + pool. + +2.3 Create the boot pool: + +:: + + # zpool create -o ashift=12 -d \ + -o feature@async_destroy=enabled \ + -o feature@bookmarks=enabled \ + -o feature@embedded_data=enabled \ + -o feature@empty_bpobj=enabled \ + -o feature@enabled_txg=enabled \ + -o feature@extensible_dataset=enabled \ + -o feature@filesystem_limits=enabled \ + -o feature@hole_birth=enabled \ + -o feature@large_blocks=enabled \ + -o feature@lz4_compress=enabled \ + -o feature@spacemap_histogram=enabled \ + -o feature@userobj_accounting=enabled \ + -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \ + -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt \ + bpool /dev/disk/by-id/scsi-SATA_disk1-part3 + +You should not need to customize any of the options for the boot pool. + +GRUB does not support all of the zpool features. See +``spa_feature_names`` in +`grub-core/fs/zfs/zfs.c `__. +This step creates a separate boot pool for ``/boot`` with the features +limited to only those that GRUB supports, allowing the root pool to use +any/all features. Note that GRUB opens the pool read-only, so all +read-only compatible features are "supported" by GRUB. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... 
bpool mirror /dev/disk/by-id/scsi-SATA_disk1-part3 /dev/disk/by-id/scsi-SATA_disk2-part3`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). +- The pool name is arbitrary. If changed, the new name must be used + consistently. The ``bpool`` convention originated in this HOWTO. + +2.4 Create the root pool: + +Choose one of the following options: + +2.4a Unencrypted: + +:: + + # zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt \ + rpool /dev/disk/by-id/scsi-SATA_disk1-part4 + +2.4b LUKS: + +:: + + # apt install --yes cryptsetup + # cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 \ + /dev/disk/by-id/scsi-SATA_disk1-part4 + # cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1 + # zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt \ + rpool /dev/mapper/luks1 + +- The use of ``ashift=12`` is recommended here because many drives + today have 4KiB (or larger) physical sectors, even though they + present 512B logical sectors. Also, a future replacement drive may + have 4KiB physical sectors (in which case ``ashift=12`` is desirable) + or 4KiB logical sectors (in which case ``ashift=12`` is required). +- Setting ``-O acltype=posixacl`` enables POSIX ACLs globally. If you + do not want this, remove that option, but later add + ``-o acltype=posixacl`` (note: lowercase "o") to the ``zfs create`` + for ``/var/log``, as `journald requires + ACLs `__ +- Setting ``normalization=formD`` eliminates some corner cases relating + to UTF-8 filename normalization. It also implies ``utf8only=on``, + which means that only UTF-8 filenames are allowed. If you care to + support non-UTF-8 filenames, do not use this option. For a discussion + of why requiring UTF-8 filenames may be a bad idea, see `The problems + with enforced UTF-8 only + filenames `__. +- Setting ``relatime=on`` is a middle ground between classic POSIX + ``atime`` behavior (with its significant performance impact) and + ``atime=off`` (which provides the best performance by completely + disabling atime updates). Since Linux 2.6.30, ``relatime`` has been + the default for other filesystems. See `RedHat's + documentation `__ + for further information. +- Setting ``xattr=sa`` `vastly improves the performance of extended + attributes `__. + Inside ZFS, extended attributes are used to implement POSIX ACLs. + Extended attributes can also be used by user-space applications. + `They are used by some desktop GUI + applications. `__ + `They can be used by Samba to store Windows ACLs and DOS attributes; + they are required for a Samba Active Directory domain + controller. `__ + Note that ```xattr=sa`` is + Linux-specific. `__ + If you move your ``xattr=sa`` pool to another OpenZFS implementation + besides ZFS-on-Linux, extended attributes will not be readable + (though your data will be). If portability of extended attributes is + important to you, omit the ``-O xattr=sa`` above. Even if you do not + want ``xattr=sa`` for the whole pool, it is probably fine to use it + for ``/var/log``. +- Make sure to include the ``-part4`` portion of the drive path. If you + forget that, you are specifying the whole disk, which ZFS will then + re-partition, and you will lose the bootloader partition(s). 
+- For LUKS, the key size chosen is 512 bits. However, XTS mode requires + two keys, so the LUKS key is split in half. Thus, ``-s 512`` means + AES-256. +- Your passphrase will likely be the weakest link. Choose wisely. See + `section 5 of the cryptsetup + FAQ `__ + for guidance. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part4 /dev/disk/by-id/scsi-SATA_disk2-part4`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). For LUKS, use + ``/dev/mapper/luks1``, ``/dev/mapper/luks2``, etc., which you will + have to create using ``cryptsetup``. +- The pool name is arbitrary. If changed, the new name must be used + consistently. On systems that can automatically install to ZFS, the + root pool is named ``rpool`` by default. + +Step 3: System Installation +--------------------------- + +3.1 Create filesystem datasets to act as containers: + +:: + + # zfs create -o canmount=off -o mountpoint=none rpool/ROOT + # zfs create -o canmount=off -o mountpoint=none bpool/BOOT + +On Solaris systems, the root filesystem is cloned and the suffix is +incremented for major system changes through ``pkg image-update`` or +``beadm``. Similar functionality for APT is possible but currently +unimplemented. Even without such a tool, it can still be used for +manually created clones. + +3.2 Create filesystem datasets for the root and boot filesystems: + +:: + + # zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/debian + # zfs mount rpool/ROOT/debian + + # zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/debian + # zfs mount bpool/BOOT/debian + +With ZFS, it is not normally necessary to use a mount command (either +``mount`` or ``zfs mount``). This situation is an exception because of +``canmount=noauto``. 
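+
+If you want to double-check the layout at this point, the following
+optional commands list the relevant properties and the currently mounted
+ZFS filesystems (output will vary by system):
+
+::
+
+   # zfs get canmount,mountpoint rpool/ROOT/debian bpool/BOOT/debian
+   # zfs mount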
+ +3.3 Create datasets: + +:: + + # zfs create rpool/home + # zfs create -o mountpoint=/root rpool/home/root + # zfs create -o canmount=off rpool/var + # zfs create -o canmount=off rpool/var/lib + # zfs create rpool/var/log + # zfs create rpool/var/spool + + The datasets below are optional, depending on your preferences and/or + software choices: + + If you wish to exclude these from snapshots: + # zfs create -o com.sun:auto-snapshot=false rpool/var/cache + # zfs create -o com.sun:auto-snapshot=false rpool/var/tmp + # chmod 1777 /mnt/var/tmp + + If you use /opt on this system: + # zfs create rpool/opt + + If you use /srv on this system: + # zfs create rpool/srv + + If you use /usr/local on this system: + # zfs create -o canmount=off rpool/usr + # zfs create rpool/usr/local + + If this system will have games installed: + # zfs create rpool/var/games + + If this system will store local email in /var/mail: + # zfs create rpool/var/mail + + If this system will use Snap packages: + # zfs create rpool/var/snap + + If you use /var/www on this system: + # zfs create rpool/var/www + + If this system will use GNOME: + # zfs create rpool/var/lib/AccountsService + + If this system will use Docker (which manages its own datasets & snapshots): + # zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker + + If this system will use NFS (locking): + # zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs + + A tmpfs is recommended later, but if you want a separate dataset for /tmp: + # zfs create -o com.sun:auto-snapshot=false rpool/tmp + # chmod 1777 /mnt/tmp + +The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data such as logs (in ``/var/log``). This will be especially +important if/when a ``beadm`` or similar utility is integrated. The +``com.sun.auto-snapshot`` setting is used by some ZFS snapshot utilities +to exclude transient data. + +If you do nothing extra, ``/tmp`` will be stored as part of the root +filesystem. Alternatively, you can create a separate dataset for +``/tmp``, as shown above. This keeps the ``/tmp`` data out of snapshots +of your root filesystem. It also allows you to set a quota on +``rpool/tmp``, if you want to limit the maximum space used. Otherwise, +you can use a tmpfs (RAM filesystem) later. + +3.4 Install the minimal system: + +:: + + # debootstrap stretch /mnt + # zfs set devices=off rpool + +The ``debootstrap`` command leaves the new system in an unconfigured +state. An alternative to using ``debootstrap`` is to copy the entirety +of a working system into the new ZFS root. + +Step 4: System Configuration +---------------------------- + +4.1 Configure the hostname (change ``HOSTNAME`` to the desired +hostname). + +:: + + # echo HOSTNAME > /mnt/etc/hostname + + # vi /mnt/etc/hosts + Add a line: + 127.0.1.1 HOSTNAME + or if the system has a real name in DNS: + 127.0.1.1 FQDN HOSTNAME + +**Hint:** Use ``nano`` if you find ``vi`` confusing. + +4.2 Configure the network interface: + +:: + + Find the interface name: + # ip addr show + + # vi /mnt/etc/network/interfaces.d/NAME + auto NAME + iface NAME inet dhcp + +Customize this file if the system is not a DHCP client. 
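+
+For example, a static configuration might look like the following (the
+addresses are placeholders; substitute the values for your network):
+
+::
+
+   # vi /mnt/etc/network/interfaces.d/NAME
+   auto NAME
+   iface NAME inet static
+       address 192.0.2.10
+       netmask 255.255.255.0
+       gateway 192.0.2.1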
+ +4.3 Configure the package sources: + +:: + + # vi /mnt/etc/apt/sources.list + deb http://deb.debian.org/debian stretch main contrib + deb-src http://deb.debian.org/debian stretch main contrib + + # vi /mnt/etc/apt/sources.list.d/stretch-backports.list + deb http://deb.debian.org/debian stretch-backports main contrib + deb-src http://deb.debian.org/debian stretch-backports main contrib + + # vi /mnt/etc/apt/preferences.d/90_zfs + Package: libnvpair1linux libuutil1linux libzfs2linux libzpool2linux spl-dkms zfs-dkms zfs-test zfsutils-linux zfsutils-linux-dev zfs-zed + Pin: release n=stretch-backports + Pin-Priority: 990 + +4.4 Bind the virtual filesystems from the LiveCD environment to the new +system and ``chroot`` into it: + +:: + + # mount --rbind /dev /mnt/dev + # mount --rbind /proc /mnt/proc + # mount --rbind /sys /mnt/sys + # chroot /mnt /bin/bash --login + +**Note:** This is using ``--rbind``, not ``--bind``. + +4.5 Configure a basic system environment: + +:: + + # ln -s /proc/self/mounts /etc/mtab + # apt update + + # apt install --yes locales + # dpkg-reconfigure locales + +Even if you prefer a non-English system language, always ensure that +``en_US.UTF-8`` is available. + +:: + + # dpkg-reconfigure tzdata + +4.6 Install ZFS in the chroot environment for the new system: + +:: + + # apt install --yes dpkg-dev linux-headers-amd64 linux-image-amd64 + # apt install --yes zfs-initramfs + +4.7 For LUKS installs only, setup crypttab: + +:: + + # apt install --yes cryptsetup + + # echo luks1 UUID=$(blkid -s UUID -o value \ + /dev/disk/by-id/scsi-SATA_disk1-part4) none \ + luks,discard,initramfs > /etc/crypttab + +- The use of ``initramfs`` is a work-around for `cryptsetup does not + support + ZFS `__. + +**Hint:** If you are creating a mirror or raidz topology, repeat the +``/etc/crypttab`` entries for ``luks2``, etc. adjusting for each disk. + +4.8 Install GRUB + +Choose one of the following options: + +4.8a Install GRUB for legacy (BIOS) booting + +:: + + # apt install --yes grub-pc + +Install GRUB to the disk(s), not the partition(s). + +4.8b Install GRUB for UEFI booting + +:: + + # apt install dosfstools + # mkdosfs -F 32 -s 1 -n EFI /dev/disk/by-id/scsi-SATA_disk1-part2 + # mkdir /boot/efi + # echo PARTUUID=$(blkid -s PARTUUID -o value \ + /dev/disk/by-id/scsi-SATA_disk1-part2) \ + /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab + # mount /boot/efi + # apt install --yes grub-efi-amd64 shim + +- The ``-s 1`` for ``mkdosfs`` is only necessary for drives which + present 4 KiB logical sectors (“4Kn” drives) to meet the minimum + cluster size (given the partition size of 512 MiB) for FAT32. It also + works fine on drives which present 512 B sectors. + +**Note:** If you are creating a mirror or raidz topology, this step only +installs GRUB on the first disk. The other disk(s) will be handled +later. + +4.9 Set a root password + +:: + + # passwd + +4.10 Enable importing bpool + +This ensures that ``bpool`` is always imported, regardless of whether +``/etc/zfs/zpool.cache`` exists, whether it is in the cachefile or not, +or whether ``zfs-import-scan.service`` is enabled. 
+ +:: + + # vi /etc/systemd/system/zfs-import-bpool.service + [Unit] + DefaultDependencies=no + Before=zfs-import-scan.service + Before=zfs-import-cache.service + + [Service] + Type=oneshot + RemainAfterExit=yes + ExecStart=/sbin/zpool import -N -o cachefile=none bpool + + [Install] + WantedBy=zfs-import.target + + # systemctl enable zfs-import-bpool.service + +4.11 Optional (but recommended): Mount a tmpfs to /tmp + +If you chose to create a ``/tmp`` dataset above, skip this step, as they +are mutually exclusive choices. Otherwise, you can put ``/tmp`` on a +tmpfs (RAM filesystem) by enabling the ``tmp.mount`` unit. + +:: + + # cp /usr/share/systemd/tmp.mount /etc/systemd/system/ + # systemctl enable tmp.mount + +4.12 Optional (but kindly requested): Install popcon + +The ``popularity-contest`` package reports the list of packages install +on your system. Showing that ZFS is popular may be helpful in terms of +long-term attention from the distro. + +:: + + # apt install --yes popularity-contest + +Choose Yes at the prompt. + +Step 5: GRUB Installation +------------------------- + +5.1 Verify that the ZFS boot filesystem is recognized: + +:: + + # grub-probe /boot + zfs + +5.2 Refresh the initrd files: + +:: + + # update-initramfs -u -k all + update-initramfs: Generating /boot/initrd.img-4.9.0-8-amd64 + +**Note:** When using LUKS, this will print "WARNING could not determine +root device from /etc/fstab". This is because `cryptsetup does not +support +ZFS `__. + +5.3 Workaround GRUB's missing zpool-features support: + +:: + + # vi /etc/default/grub + Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/debian" + +5.4 Optional (but highly recommended): Make debugging GRUB easier: + +:: + + # vi /etc/default/grub + Remove quiet from: GRUB_CMDLINE_LINUX_DEFAULT + Uncomment: GRUB_TERMINAL=console + Save and quit. + +Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired. + +5.5 Update the boot configuration: + +:: + + # update-grub + Generating grub configuration file ... + Found linux image: /boot/vmlinuz-4.9.0-8-amd64 + Found initrd image: /boot/initrd.img-4.9.0-8-amd64 + done + +**Note:** Ignore errors from ``osprober``, if present. + +5.6 Install the boot loader + +5.6a For legacy (BIOS) booting, install GRUB to the MBR: + +:: + + # grub-install /dev/disk/by-id/scsi-SATA_disk1 + Installing for i386-pc platform. + Installation finished. No error reported. + +Do not reboot the computer until you get exactly that result message. +Note that you are installing GRUB to the whole disk, not a partition. + +If you are creating a mirror or raidz topology, repeat the +``grub-install`` command for each disk in the pool. + +5.6b For UEFI booting, install GRUB: + +:: + + # grub-install --target=x86_64-efi --efi-directory=/boot/efi \ + --bootloader-id=debian --recheck --no-floppy + +5.7 Verify that the ZFS module is installed: + +:: + + # ls /boot/grub/*/zfs.mod + +5.8 Fix filesystem mount ordering + +`Until ZFS gains a systemd mount +generator `__, there are +races between mounting filesystems and starting certain daemons. In +practice, the issues (e.g. +`#5754 `__) seem to be +with certain filesystems in ``/var``, specifically ``/var/log`` and +``/var/tmp``. Setting these to use ``legacy`` mounting, and listing them +in ``/etc/fstab`` makes systemd aware that these are separate +mountpoints. 
In turn, ``rsyslog.service`` depends on ``var-log.mount`` +by way of ``local-fs.target`` and services using the ``PrivateTmp`` +feature of systemd automatically use ``After=var-tmp.mount``. + +Until there is support for mounting ``/boot`` in the initramfs, we also +need to mount that, because it was marked ``canmount=noauto``. Also, +with UEFI, we need to ensure it is mounted before its child filesystem +``/boot/efi``. + +``rpool`` is guaranteed to be imported by the initramfs, so there is no +point in adding ``x-systemd.requires=zfs-import.target`` to those +filesystems. + +:: + + For UEFI booting, unmount /boot/efi first: + # umount /boot/efi + + Everything else applies to both BIOS and UEFI booting: + + # zfs set mountpoint=legacy bpool/BOOT/debian + # echo bpool/BOOT/debian /boot zfs \ + nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab + + # zfs set mountpoint=legacy rpool/var/log + # echo rpool/var/log /var/log zfs nodev,relatime 0 0 >> /etc/fstab + + # zfs set mountpoint=legacy rpool/var/spool + # echo rpool/var/spool /var/spool zfs nodev,relatime 0 0 >> /etc/fstab + + If you created a /var/tmp dataset: + # zfs set mountpoint=legacy rpool/var/tmp + # echo rpool/var/tmp /var/tmp zfs nodev,relatime 0 0 >> /etc/fstab + + If you created a /tmp dataset: + # zfs set mountpoint=legacy rpool/tmp + # echo rpool/tmp /tmp zfs nodev,relatime 0 0 >> /etc/fstab + +Step 6: First Boot +------------------ + +6.1 Snapshot the initial installation: + +:: + + # zfs snapshot bpool/BOOT/debian@install + # zfs snapshot rpool/ROOT/debian@install + +In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space. + +6.2 Exit from the ``chroot`` environment back to the LiveCD environment: + +:: + + # exit + +6.3 Run these commands in the LiveCD environment to unmount all +filesystems: + +:: + + # mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + # zpool export -a + +6.4 Reboot: + +:: + + # reboot + +6.5 Wait for the newly installed system to boot normally. Login as root. + +6.6 Create a user account: + +:: + + # zfs create rpool/home/YOURUSERNAME + # adduser YOURUSERNAME + # cp -a /etc/skel/.[!.]* /home/YOURUSERNAME + # chown -R YOURUSERNAME:YOURUSERNAME /home/YOURUSERNAME + +6.7 Add your user account to the default set of groups for an +administrator: + +:: + + # usermod -a -G audio,cdrom,dip,floppy,netdev,plugdev,sudo,video YOURUSERNAME + +6.8 Mirror GRUB + +If you installed to multiple disks, install GRUB on the additional +disks: + +6.8a For legacy (BIOS) booting: + +:: + + # dpkg-reconfigure grub-pc + Hit enter until you get to the device selection screen. + Select (using the space bar) all of the disks (not partitions) in your pool. + +6.8b UEFI + +:: + + # umount /boot/efi + + For the second and subsequent disks (increment debian-2 to -3, etc.): + # dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \ + of=/dev/disk/by-id/scsi-SATA_disk2-part2 + # efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \ + -p 2 -L "debian-2" -l '\EFI\debian\grubx64.efi' + + # mount /boot/efi + +Step 7: (Optional) Configure Swap +--------------------------------- + +**Caution**: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. 
This issue is currently being investigated in: +`https://github.com/zfsonlinux/zfs/issues/7734 `__ + +7.1 Create a volume dataset (zvol) for use as a swap device: + +:: + + # zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \ + -o logbias=throughput -o sync=always \ + -o primarycache=metadata -o secondarycache=none \ + -o com.sun:auto-snapshot=false rpool/swap + +You can adjust the size (the ``4G`` part) to your needs. + +The compression algorithm is set to ``zle`` because it is the cheapest +available algorithm. As this guide recommends ``ashift=12`` (4 kiB +blocks on disk), the common case of a 4 kiB page size means that no +compression algorithm can reduce I/O. The exception is all-zero pages, +which are dropped by ZFS; but some form of compression has to be enabled +to get this behavior. + +7.2 Configure the swap device: + +**Caution**: Always use long ``/dev/zvol`` aliases in configuration +files. Never use a short ``/dev/zdX`` device name. + +:: + + # mkswap -f /dev/zvol/rpool/swap + # echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab + # echo RESUME=none > /etc/initramfs-tools/conf.d/resume + +The ``RESUME=none`` is necessary to disable resuming from hibernation. +This does not work, as the zvol is not present (because the pool has not +yet been imported) at the time the resume script runs. If it is not +disabled, the boot process hangs for 30 seconds waiting for the swap +zvol to appear. + +7.3 Enable the swap device: + +:: + + # swapon -av + +Step 8: Full Software Installation +---------------------------------- + +8.1 Upgrade the minimal system: + +:: + + # apt dist-upgrade --yes + +8.2 Install a regular set of software: + +:: + + # tasksel + +8.3 Optional: Disable log compression: + +As ``/var/log`` is already compressed by ZFS, logrotate’s compression is +going to burn CPU and disk I/O for (in most cases) very little gain. +Also, if you are making snapshots of ``/var/log``, logrotate’s +compression will actually waste space, as the uncompressed data will +live on in the snapshot. You can edit the files in ``/etc/logrotate.d`` +by hand to comment out ``compress``, or use this loop (copy-and-paste +highly recommended): + +:: + + # for file in /etc/logrotate.d/* ; do + if grep -Eq "(^|[^#y])compress" "$file" ; then + sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file" + fi + done + +8.4 Reboot: + +:: + + # reboot + +Step 9: Final Cleanup +~~~~~~~~~~~~~~~~~~~~~ + +9.1 Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally. + +9.2 Optional: Delete the snapshots of the initial installation: + +:: + + $ sudo zfs destroy bpool/BOOT/debian@install + $ sudo zfs destroy rpool/ROOT/debian@install + +9.3 Optional: Disable the root password + +:: + + $ sudo usermod -p '*' root + +9.4 Optional: Re-enable the graphical boot process: + +If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer. + +:: + + $ sudo vi /etc/default/grub + Add quiet to GRUB_CMDLINE_LINUX_DEFAULT + Comment out GRUB_TERMINAL=console + Save and quit. + + $ sudo update-grub + +**Note:** Ignore errors from ``osprober``, if present. + +9.5 Optional: For LUKS installs only, backup the LUKS header: + +:: + + $ sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \ + --header-backup-file luks1-header.dat + +Store that backup somewhere safe (e.g. cloud storage). It is protected +by your LUKS passphrase, but you may wish to use additional encryption. 
+ +**Hint:** If you created a mirror or raidz topology, repeat this for +each LUKS volume (``luks2``, etc.). + +Troubleshooting +--------------- + +Rescuing using a Live CD +~~~~~~~~~~~~~~~~~~~~~~~~ + +Go through `Step 1: Prepare The Install +Environment <#step-1-prepare-the-install-environment>`__. + +This will automatically import your pool. Export it and re-import it to +get the mounts right: + +:: + + For LUKS, first unlock the disk(s): + # apt install --yes cryptsetup + # cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1 + Repeat for additional disks, if this is a mirror or raidz topology. + + # zpool export -a + # zpool import -N -R /mnt rpool + # zpool import -N -R /mnt bpool + # zfs mount rpool/ROOT/debian + # zfs mount -a + +If needed, you can chroot into your installed environment: + +:: + + # mount --rbind /dev /mnt/dev + # mount --rbind /proc /mnt/proc + # mount --rbind /sys /mnt/sys + # chroot /mnt /bin/bash --login + # mount /boot + # mount -a + +Do whatever you need to do to fix your system. + +When done, cleanup: + +:: + + # exit + # mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + # zpool export -a + # reboot + +MPT2SAS +~~~~~~~ + +Most problem reports for this tutorial involve ``mpt2sas`` hardware that +does slow asynchronous drive initialization, like some IBM M1015 or +OEM-branded cards that have been flashed to the reference LSI firmware. + +The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +`https://github.com/zfsonlinux/zfs/issues/330 `__. + +Most LSI cards are perfectly compatible with ZoL. If your card has this +glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X in +/etc/default/zfs. The system will wait X seconds for all drives to +appear before importing the pool. + +Areca +~~~~~ + +Systems that require the ``arcsas`` blob driver should add it to the +``/etc/initramfs-tools/modules`` file and run +``update-initramfs -u -k all``. + +Upgrade or downgrade the Areca driver if something like +``RIP: 0010:[] [] native_read_tsc+0x6/0x20`` +appears anywhere in kernel log. ZoL is unstable on systems that emit +this error message. + +VMware +~~~~~~ + +- Set ``disk.EnableUUID = "TRUE"`` in the vmx file or vsphere + configuration. Doing this ensures that ``/dev/disk`` aliases are + created in the guest. + +QEMU/KVM/XEN +~~~~~~~~~~~~ + +Set a unique serial number on each virtual disk using libvirt or qemu +(e.g. ``-drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890``). + +To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host: + +:: + + $ sudo apt install ovmf + $ sudo vi /etc/libvirt/qemu.conf + Uncomment these lines: + nvram = [ + "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd", + "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd" + ] + $ sudo service libvirt-bin restart diff --git a/docs/Debian.rst b/docs/Debian.rst new file mode 100644 index 0000000..a2bdd76 --- /dev/null +++ b/docs/Debian.rst @@ -0,0 +1,67 @@ +Offical ZFS on Linux +`DKMS `__ +style packages are available from the `Debian GNU/Linux +repository `__ for the +following configurations. The packages previously hosted at +archive.zfsonlinux.org will not be updated and are not recommended for +new installations. 
+ +**Debian Releases:** Stretch (oldstable), Buster (stable), and newer +(testing, sid) **Architectures:** amd64 + +Table of contents +================= + +- `Installation <#installation>`__ +- `Related Links <#related-links>`__ + +Installation +------------ + +For Debian Buster, ZFS packages are included in the `contrib +repository `__. + +If you want to boot from ZFS, see [[Debian Buster Root on ZFS]] instead. +For troubleshooting existing installations on Stretch, see [[Debian +Stretch Root on ZFS]]. + +The `backports +repository `__ often +provides newer releases of ZFS. You can use it as follows: + +Add the backports repository: + +:: + + # vi /etc/apt/sources.list.d/buster-backports.list + deb http://deb.debian.org/debian buster-backports main contrib + deb-src http://deb.debian.org/debian buster-backports main contrib + + # vi /etc/apt/preferences.d/90_zfs + Package: libnvpair1linux libuutil1linux libzfs2linux libzpool2linux spl-dkms zfs-dkms zfs-test zfsutils-linux zfsutils-linux-dev zfs-zed + Pin: release n=buster-backports + Pin-Priority: 990 + +Update the list of packages: + +:: + + # apt update + +Install the kernel headers and other dependencies: + +:: + + # apt install --yes dpkg-dev linux-headers-$(uname -r) linux-image-amd64 + +Install the zfs packages: + +:: + + # apt-get install zfs-dkms zfsutils-linux + +Related Links +------------- + +- [[Debian GNU Linux initrd documentation]] +- [[Debian Buster Root on ZFS]] diff --git a/docs/Debugging.rst b/docs/Debugging.rst new file mode 100644 index 0000000..6efabb8 --- /dev/null +++ b/docs/Debugging.rst @@ -0,0 +1,2 @@ +The future home for documenting ZFS on Linux development and debugging +techniques. diff --git a/docs/Developer-Resources.rst b/docs/Developer-Resources.rst new file mode 100644 index 0000000..e786d1e --- /dev/null +++ b/docs/Developer-Resources.rst @@ -0,0 +1,16 @@ +Developer Resources +=================== + +| [[Custom Packages]] +| [[Building ZFS]] +| `Buildbot + Status `__ +| `Buildbot + Options `__ +| `OpenZFS + Tracking `__ +| [[OpenZFS Patches]] +| [[OpenZFS Exceptions]] +| `OpenZFS + Documentation `__ +| [[Git and GitHub for beginners]] diff --git a/docs/FAQ.rst b/docs/FAQ.rst new file mode 100644 index 0000000..7f244a7 --- /dev/null +++ b/docs/FAQ.rst @@ -0,0 +1,741 @@ +Table Of Contents +----------------- + +- `What is ZFS on Linux <#what-is-zfs-on-linux>`__ +- `Hardware Requirements <#hardware-requirements>`__ +- `Do I have to use ECC memory for + ZFS? 
<#do-i-have-to-use-ecc-memory-for-zfs>`__ +- `Installation <#installation>`__ +- `Supported Architectures <#supported-architectures>`__ +- `Supported Kernels <#supported-kernels>`__ +- `32-bit vs 64-bit Systems <#32-bit-vs-64-bit-systems>`__ +- `Booting from ZFS <#booting-from-zfs>`__ +- `Selecting /dev/ names when creating a + pool <#selecting-dev-names-when-creating-a-pool>`__ +- `Setting up the /etc/zfs/vdev_id.conf + file <#setting-up-the-etczfsvdev_idconf-file>`__ +- `Changing /dev/ names on an existing + pool <#changing-dev-names-on-an-existing-pool>`__ +- `The /etc/zfs/zpool.cache file <#the-etczfszpoolcache-file>`__ +- `Generating a new /etc/zfs/zpool.cache + file <#generating-a-new-etczfszpoolcache-file>`__ +- `Sending and Receiving Streams <#sending-and-receiving-streams>`__ + + - `hole_birth Bugs <#hole_birth-bugs>`__ + - `Sending Large Blocks <#sending-large-blocks>`__ + +- `CEPH/ZFS <#cephzfs>`__ + + - `ZFS Configuration <#zfs-configuration>`__ + - `CEPH Configuration (ceph.conf} <#ceph-configuration-cephconf>`__ + - `Other General Guidelines <#other-general-guidelines>`__ + +- `Performance Considerations <#performance-considerations>`__ +- `Advanced Format Disks <#advanced-format-disks>`__ +- `ZVOL used space larger than + expected <#ZVOL-used-space-larger-than-expected>`__ +- `Using a zvol for a swap device <#using-a-zvol-for-a-swap-device>`__ +- `Using ZFS on Xen Hypervisor or Xen + Dom0 <#using-zfs-on-xen-hypervisor-or-xen-dom0>`__ +- `udisks2 creates /dev/mapper/ entries for + zvol <#udisks2-creating-devmapper-entries-for-zvol>`__ +- `Licensing <#licensing>`__ +- `Reporting a problem <#reporting-a-problem>`__ +- `Does ZFS on Linux have a Code of + Conduct? <#does-zfs-on-linux-have-a-code-of-conduct>`__ + +What is ZFS on Linux +-------------------- + +The ZFS on Linux project is an implementation of +`OpenZFS `__ designed to work in a +Linux environment. OpenZFS is an outstanding storage platform that +encompasses the functionality of traditional filesystems, volume +managers, and more, with consistent reliability, functionality and +performance across all distributions. Additional information about +OpenZFS can be found in the `OpenZFS wikipedia +article `__. + +Hardware Requirements +--------------------- + +Because ZFS was originally designed for Sun Solaris it was long +considered a filesystem for large servers and for companies that could +afford the best and most powerful hardware available. But since the +porting of ZFS to numerous OpenSource platforms (The BSDs, Illumos and +Linux - under the umbrella organization "OpenZFS"), these requirements +have been lowered. + +The suggested hardware requirements are: + +- ECC memory. This isn't really a requirement, but it's highly + recommended. +- 8GB+ of memory for the best performance. It's perfectly possible to + run with 2GB or less (and people do), but you'll need more if using + deduplication. + +Do I have to use ECC memory for ZFS? +------------------------------------ + +Using ECC memory for OpenZFS is strongly recommended for enterprise +environments where the strongest data integrity guarantees are required. +Without ECC memory rare random bit flips caused by cosmic rays or by +faulty memory can go undetected. If this were to occur OpenZFS (or any +other filesystem) will write the damaged data to disk and be unable to +automatically detect the corruption. + +Unfortunately, ECC memory is not always supported by consumer grade +hardware. And even when it is ECC memory will be more expensive. 
For +home users the additional safety brought by ECC memory might not justify +the cost. It's up to you to determine what level of protection your data +requires. + +Installation +------------ + +ZFS on Linux is available for all major Linux distributions. Refer to +the [[getting started]] section of the wiki for links to installations +instructions for many popular distributions. If your distribution isn't +listed you can always build ZFS on Linux from the latest official +`tarball `__. + +Supported Architectures +----------------------- + +ZFS on Linux is regularly compiled for the following architectures: +x86_64, x86, aarch64, arm, ppc64, ppc. + +Supported Kernels +----------------- + +The `notes `__ for a given +ZFS on Linux release will include a range of supported kernels. Point +releases will be tagged as needed in order to support the *stable* +kernel available from `kernel.org `__. The +oldest supported kernel is 2.6.32 due to its prominence in Enterprise +Linux distributions. + +.. _32-bit-vs-64-bit-systems: + +32-bit vs 64-bit Systems +------------------------ + +You are **strongly** encouraged to use a 64-bit kernel. ZFS on Linux +will build for 32-bit kernels but you may encounter stability problems. + +ZFS was originally developed for the Solaris kernel which differs from +the Linux kernel in several significant ways. Perhaps most importantly +for ZFS it is common practice in the Solaris kernel to make heavy use of +the virtual address space. However, use of the virtual address space is +strongly discouraged in the Linux kernel. This is particularly true on +32-bit architectures where the virtual address space is limited to 100M +by default. Using the virtual address space on 64-bit Linux kernels is +also discouraged but the address space is so much larger than physical +memory it is less of an issue. + +If you are bumping up against the virtual memory limit on a 32-bit +system you will see the following message in your system logs. You can +increase the virtual address size with the boot option ``vmalloc=512M``. + +:: + + vmap allocation for size 4198400 failed: use vmalloc= to increase size. + +However, even after making this change your system will likely not be +entirely stable. Proper support for 32-bit systems is contingent upon +the OpenZFS code being weaned off its dependence on virtual memory. This +will take some time to do correctly but it is planned for OpenZFS. This +change is also expected to improve how efficiently OpenZFS manages the +ARC cache and allow for tighter integration with the standard Linux page +cache. + +Booting from ZFS +---------------- + +Booting from ZFS on Linux is possible and many people do it. There are +excellent walk throughs available for [[Debian]], [[Ubuntu]] and +`Gentoo `__. + +Selecting /dev/ names when creating a pool +------------------------------------------ + +There are different /dev/ names that can be used when creating a ZFS +pool. Each option has advantages and drawbacks, the right choice for +your ZFS pool really depends on your requirements. For development and +testing using /dev/sdX naming is quick and easy. A typical home server +might prefer /dev/disk/by-id/ naming for simplicity and readability. +While very large configurations with multiple controllers, enclosures, +and switches will likely prefer /dev/disk/by-vdev naming for maximum +control. But in the end, how you choose to identify your disks is up to +you. 
+
+- **/dev/sdX, /dev/hdX:** Best for development/test pools
+
+  - Summary: The top level /dev/ names are the default for consistency
+    with other ZFS implementations. They are available under all Linux
+    distributions and are commonly used. However, because they are not
+    persistent they should only be used with ZFS for development/test
+    pools.
+  - Benefits: This method is easy for a quick test, the names are
+    short, and they will be available on all Linux distributions.
+  - Drawbacks: The names are not persistent and will change depending
+    on the order in which the disks are detected. Adding or removing
+    hardware for your system can easily cause the names to change. You
+    would then need to remove the zpool.cache file and re-import the
+    pool using the new names.
+  - Example: ``zpool create tank sda sdb``
+
+- **/dev/disk/by-id/:** Best for small pools (less than 10 disks)
+
+  - Summary: This directory contains disk identifiers with more human
+    readable names. The disk identifier usually consists of the
+    interface type, vendor name, model number, device serial number,
+    and partition number. This approach is more user friendly because
+    it simplifies identifying a specific disk.
+  - Benefits: Nice for small systems with a single disk controller.
+    Because the names are persistent and guaranteed not to change, it
+    doesn't matter how the disks are attached to the system. You can
+    take them all out, randomly mix them up on the desk, put them
+    back anywhere in the system and your pool will still be
+    automatically imported correctly.
+  - Drawbacks: Configuring redundancy groups based on physical
+    location becomes difficult and error prone.
+  - Example:
+    ``zpool create tank scsi-SATA_Hitachi_HTS7220071201DP1D10DGG6HMRP``
+
+- **/dev/disk/by-path/:** Good for large pools (greater than 10 disks)
+
+  - Summary: This approach uses device names which include the
+    physical cable layout in the system, which means that a particular
+    disk is tied to a specific location. The name describes the PCI
+    bus number, as well as enclosure names and port numbers. This
+    allows the most control when configuring a large pool.
+  - Benefits: Encoding the storage topology in the name is not only
+    helpful for locating a disk in large installations, but it also
+    allows you to explicitly lay out your redundancy groups over
+    multiple adapters or enclosures.
+  - Drawbacks: These names are long, cumbersome, and difficult for a
+    human to manage.
+  - Example:
+    ``zpool create tank pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-1:0:0:0``
+
+- **/dev/disk/by-vdev/:** Best for large pools (greater than 10 disks)
+
+  - Summary: This approach provides administrative control over device
+    naming using the configuration file /etc/zfs/vdev_id.conf. Names
+    for disks in JBODs can be generated automatically to reflect their
+    physical location by enclosure IDs and slot numbers. The names can
+    also be manually assigned based on existing udev device links,
+    including those in /dev/disk/by-path or /dev/disk/by-id. This
+    allows you to pick your own unique meaningful names for the disks.
+    These names will be displayed by all the zfs utilities so they can
+    be used to clarify the administration of a large complex pool. See
+    the vdev_id and vdev_id.conf man pages for further details.
+  - Benefits: The main benefit of this approach is that it allows you
+    to choose meaningful human-readable names. Beyond that, the
+    benefits depend on the naming method employed.
If the names are + derived from the physical path the benefits of /dev/disk/by-path + are realized. On the other hand, aliasing the names based on drive + identifiers or WWNs has the same benefits as using + /dev/disk/by-id. + - Drawbacks: This method relies on having a /etc/zfs/vdev_id.conf + file properly configured for your system. To configure this file + please refer to section `Setting up the /etc/zfs/vdev_id.conf + file <#setting-up-the-etczfsvdev_idconf-file>`__. As with + benefits, the drawbacks of /dev/disk/by-id or /dev/disk/by-path + may apply depending on the naming method employed. + - Example: ``zpool create tank mirror A1 B1 mirror A2 B2`` + +.. _setting-up-the-etczfsvdev_idconf-file: + +Setting up the /etc/zfs/vdev_id.conf file +----------------------------------------- + +In order to use /dev/disk/by-vdev/ naming the ``/etc/zfs/vdev_id.conf`` +must be configured. The format of this file is described in the +vdev_id.conf man page. Several examples follow. + +A non-multipath configuration with direct-attached SAS enclosures and an +arbitrary slot re-mapping. + +:: + + multipath no + topology sas_direct + phys_per_port 4 + + # PCI_SLOT HBA PORT CHANNEL NAME + channel 85:00.0 1 A + channel 85:00.0 0 B + + # Linux Mapped + # Slot Slot + slot 0 2 + slot 1 6 + slot 2 0 + slot 3 3 + slot 4 5 + slot 5 7 + slot 6 4 + slot 7 1 + +A SAS-switch topology. Note that the channel keyword takes only two +arguments in this example. + +:: + + topology sas_switch + + # SWITCH PORT CHANNEL NAME + channel 1 A + channel 2 B + channel 3 C + channel 4 D + +A multipath configuration. Note that channel names have multiple +definitions - one per physical path. + +:: + + multipath yes + + # PCI_SLOT HBA PORT CHANNEL NAME + channel 85:00.0 1 A + channel 85:00.0 0 B + channel 86:00.0 1 A + channel 86:00.0 0 B + +A configuration using device link aliases. + +:: + + # by-vdev + # name fully qualified or base name of device link + alias d1 /dev/disk/by-id/wwn-0x5000c5002de3b9ca + alias d2 wwn-0x5000c5002def789e + +After defining the new disk names run ``udevadm trigger`` to prompt udev +to parse the configuration file. This will result in a new +/dev/disk/by-vdev directory which is populated with symlinks to /dev/sdX +names. Following the first example above, you could then create the new +pool of mirrors with the following command: + +:: + + $ zpool create tank \ + mirror A0 B0 mirror A1 B1 mirror A2 B2 mirror A3 B3 \ + mirror A4 B4 mirror A5 B5 mirror A6 B6 mirror A7 B7 + + $ zpool status + pool: tank + state: ONLINE + scan: none requested + config: + + NAME STATE READ WRITE CKSUM + tank ONLINE 0 0 0 + mirror-0 ONLINE 0 0 0 + A0 ONLINE 0 0 0 + B0 ONLINE 0 0 0 + mirror-1 ONLINE 0 0 0 + A1 ONLINE 0 0 0 + B1 ONLINE 0 0 0 + mirror-2 ONLINE 0 0 0 + A2 ONLINE 0 0 0 + B2 ONLINE 0 0 0 + mirror-3 ONLINE 0 0 0 + A3 ONLINE 0 0 0 + B3 ONLINE 0 0 0 + mirror-4 ONLINE 0 0 0 + A4 ONLINE 0 0 0 + B4 ONLINE 0 0 0 + mirror-5 ONLINE 0 0 0 + A5 ONLINE 0 0 0 + B5 ONLINE 0 0 0 + mirror-6 ONLINE 0 0 0 + A6 ONLINE 0 0 0 + B6 ONLINE 0 0 0 + mirror-7 ONLINE 0 0 0 + A7 ONLINE 0 0 0 + B7 ONLINE 0 0 0 + + errors: No known data errors + +Changing /dev/ names on an existing pool +---------------------------------------- + +Changing the /dev/ names on an existing pool can be done by simply +exporting the pool and re-importing it with the -d option to specify +which new names should be used. For example, to use the custom names in +/dev/disk/by-vdev: + +:: + + $ zpool export tank + $ zpool import -d /dev/disk/by-vdev tank + +.. 
_the-etczfszpoolcache-file:

The /etc/zfs/zpool.cache file
-----------------------------

Whenever a pool is imported on the system it will be added to the
``/etc/zfs/zpool.cache`` file. This file stores pool configuration
information, such as the device names and pool state. If this file
exists when running the ``zpool import`` command then it will be used to
determine the list of pools available for import. When a pool is not
listed in the cache file it will need to be detected and imported using
the ``zpool import -d /dev/disk/by-id`` command.

.. _generating-a-new-etczfszpoolcache-file:

Generating a new /etc/zfs/zpool.cache file
------------------------------------------

The ``/etc/zfs/zpool.cache`` file will be automatically updated when
your pool configuration is changed. However, if for some reason it
becomes stale you can force the generation of a new
``/etc/zfs/zpool.cache`` file by setting the cachefile property on the
pool.

::

   $ zpool set cachefile=/etc/zfs/zpool.cache tank

Conversely, the cache file can be disabled by setting ``cachefile=none``.
This is useful for failover configurations where the pool should always
be explicitly imported by the failover software.

::

   $ zpool set cachefile=none tank

Sending and Receiving Streams
-----------------------------

hole_birth Bugs
~~~~~~~~~~~~~~~

The hole_birth feature has/had bugs, the result of which is that, if you
do a ``zfs send -i`` (or ``-R``, since it uses ``-i``) from an affected
dataset, the receiver *will not see any checksum or other errors, but
will not match the source*.

ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the
faulty metadata which causes this issue *on the sender side*.

For more details, see the [[hole_birth FAQ]].

Sending Large Blocks
~~~~~~~~~~~~~~~~~~~~

When sending incremental streams which contain large blocks (>128K), the
``--large-block`` flag must be specified. Inconsistent use of the flag
between incremental sends can result in files being incorrectly zeroed
when they are received. Raw encrypted send/recvs automatically imply the
``--large-block`` flag and are therefore unaffected.

For more details, see `issue
6224 `__.

CEPH/ZFS
--------

There is a lot of tuning that can be done depending on the workload
being put on CEPH/ZFS, as well as some general guidelines. Some of these
are outlined below.

ZFS Configuration
~~~~~~~~~~~~~~~~~

The CEPH filestore back-end relies heavily on xattrs; for optimal
performance all CEPH workloads will benefit from the following ZFS
dataset parameters:

- ``xattr=sa``
- ``dnodesize=auto``

Beyond that, rbd/cephfs-focused workloads typically benefit from a small
recordsize (16K-128K), while objectstore/s3/rados-focused workloads
benefit from a large recordsize (128K-1M).

.. _ceph-configuration-cephconf:

CEPH Configuration (ceph.conf)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additionally, CEPH sets various values internally for handling xattrs
based on the underlying filesystem. As CEPH only officially
supports/detects XFS and BTRFS, for all other filesystems it falls back
to rather `limited "safe"
values `__.
On newer releases, the need for larger xattrs will prevent OSDs from
even starting.

The officially recommended workaround (`see
here `__)
has some severe downsides, and more specifically is geared toward
filesystems with "limited" xattr support such as ext4.
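
To make the dataset recommendations above concrete, here is a minimal,
hedged sketch of creating a dataset to back a filestore OSD; the pool
and dataset names are hypothetical, and the recordsize shown assumes an
rbd/cephfs style workload:

::

   $ zfs create -o xattr=sa -o dnodesize=auto -o recordsize=16K tank/ceph-osd0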

ZFS does not have an internal limit on xattr length, so we can treat it
similarly to how CEPH treats XFS. We can override three internal values
to match those used with XFS (`see
here `__
and
`here `__)
and allow ZFS to be used without the severe limitations of the
"official" workaround.

::

   [osd]
   filestore_max_inline_xattrs = 10
   filestore_max_inline_xattr_size = 65536
   filestore_max_xattr_value_size = 65536

Other General Guidelines
~~~~~~~~~~~~~~~~~~~~~~~~

- Use a separate journal device. Do not colocate the CEPH journal on a
  ZFS dataset if at all possible; this will quickly lead to terrible
  fragmentation, not to mention terrible performance up front even
  before fragmentation (the CEPH journal does a dsync for every write).
- Use a SLOG device, even with a separate CEPH journal device. For some
  workloads, skipping SLOG and setting ``logbias=throughput`` may be
  acceptable.
- Use a high-quality SLOG/CEPH journal device. A consumer-grade SSD, or
  even NVMe device, WILL NOT DO (Samsung 830, 840, 850, etc.) for a
  variety of reasons. CEPH will kill them quickly, on top of the
  performance being quite low in this use. Generally recommended are
  [Intel DC S3610, S3700, S3710, P3600, P3700], or [Samsung SM853,
  SM863], or better.
- If using a high-quality SSD or NVMe device (as mentioned above), you
  CAN share the SLOG and CEPH journal on a single device with good
  results. A ratio of 4 HDDs to 1 SSD (Intel DC S3710 200GB), with each
  SSD partitioned (remember to align!) into 4x10GB (for ZIL/SLOG) +
  4x20GB (for CEPH journal) has been reported to work well.

Again: CEPH + ZFS will KILL a consumer-grade SSD VERY quickly. Even
ignoring the lack of power-loss protection and endurance ratings, you
will be very disappointed with the performance of a consumer-grade SSD
under such a workload.

Performance Considerations
--------------------------

To achieve good performance with your pool there are some easy best
practices you should follow. Additionally, it should be made clear that
the ZFS on Linux implementation has not yet been optimized for
performance. As the project matures we can expect performance to
improve.

- **Evenly balance your disks across controllers:** Often the limiting
  factor for performance is not the disk but the controller. By
  balancing your disks evenly across controllers you can often improve
  throughput.
- **Create your pool using whole disks:** When running zpool create use
  whole disk names. This will allow ZFS to automatically partition the
  disk to ensure correct alignment. It will also improve
  interoperability with other OpenZFS implementations which honor the
  wholedisk property.
- **Have enough memory:** A minimum of 2GB of memory is recommended for
  ZFS. Additional memory is strongly recommended when the compression
  and deduplication features are enabled.
- **Improve performance by setting ashift=12:** You may be able to
  improve performance for some workloads by setting ``ashift=12``. This
  tuning can only be set when block devices are first added to a pool,
  such as when the pool is first created or when a new vdev is added to
  the pool. This tuning parameter can result in a decrease of capacity
  for RAIDZ configurations.

Advanced Format Disks
---------------------

Advanced Format (AF) is a disk format which natively uses a 4,096 byte
sector size instead of the traditional 512 byte sector size. To maintain
compatibility with legacy systems many AF disks emulate a sector size of
512 bytes.

By default, ZFS will automatically detect the sector size of the drive,
but an AF disk emulating 512 byte sectors will report the emulated
sector size. This combination can result in poorly aligned disk accesses
which will greatly degrade the pool performance.

Therefore, the ability to set the ashift property has been added to the
zpool command. This allows users to explicitly assign the sector size
when devices are first added to a pool (typically at pool creation time
or when adding a vdev to the pool). The ashift values range from 9 to
16, with the default value 0 meaning that ZFS should auto-detect the
sector size. This value is actually a bit shift value, so the ashift
value for 512 bytes is 9 (2^9 = 512) while the ashift value for 4,096
bytes is 12 (2^12 = 4,096).

To force the pool to use 4,096 byte sectors at pool creation time, you
may run:

::

   $ zpool create -o ashift=12 tank mirror sda sdb

To force the pool to use 4,096 byte sectors when adding a vdev to a
pool, you may run:

::

   $ zpool add -o ashift=12 tank mirror sdc sdd

ZVOL used space larger than expected
------------------------------------

Depending on the filesystem used on the zvol (e.g. ext4) and the usage
(e.g. deletion and creation of many files), the ``used`` and
``referenced`` properties reported by the zvol may be larger than the
"actual" space that is being used as reported by the consumer. This can
happen due to the way some filesystems work, in which they prefer to
allocate files in new untouched blocks rather than the fragmented used
blocks marked as free. This forces ZFS to reference all blocks that the
underlying filesystem has ever touched.

This is in itself not much of a problem, as when the ``used`` property
reaches the configured ``volsize`` the underlying filesystem will start
reusing blocks. The problem arises, however, if you want to snapshot the
zvol, as the space referenced by the snapshots will contain the unused
blocks.

This issue can be prevented by using the ``fstrim`` command, which
allows the kernel to tell ZFS which blocks are unused. Executing
``fstrim`` before a snapshot is taken will ensure a minimum snapshot
size. Adding the ``discard`` option for the mounted zvol in
``/etc/fstab`` effectively lets the Linux kernel issue the trim commands
continuously, without the need to execute fstrim on demand.

Using a zvol for a swap device
------------------------------

You may use a zvol as a swap device but you'll need to configure it
appropriately.

**CAUTION:** for now, swap on a zvol may lead to deadlock; in this case
please send your logs
`here `__.

- Set the volume block size to match your system's page size. This
  tuning prevents ZFS from having to perform read-modify-write
  operations on a larger block while the system is already low on
  memory.
- Set the ``logbias=throughput`` and ``sync=always`` properties. Data
  written to the volume will be flushed immediately to disk, freeing up
  memory as quickly as possible.
- Set ``primarycache=metadata`` to avoid keeping swap data in RAM via
  the ARC.
- Disable automatic snapshots of the swap device.

::

   $ zfs create -V 4G -b $(getconf PAGESIZE) \
       -o logbias=throughput -o sync=always \
       -o primarycache=metadata \
       -o com.sun:auto-snapshot=false rpool/swap

Using ZFS on Xen Hypervisor or Xen Dom0
---------------------------------------

It is usually recommended to keep virtual machine storage and hypervisor
pools quite separate.
However, a few people have managed to successfully
deploy and run ZFS on Linux using the same machine configured as Dom0.
There are a few caveats:

- Set a fair amount of memory in grub.conf, dedicated to Dom0.

  - dom0_mem=16384M,max:16384M

- Allocate no more than 30-40% of Dom0's memory to ZFS in
  ``/etc/modprobe.d/zfs.conf``.

  - options zfs zfs_arc_max=6442450944

- Disable Xen's auto-ballooning in ``/etc/xen/xl.conf``.
- Watch out for any Xen bugs, such as `this
  one `__ related to
  ballooning.

udisks2 creating /dev/mapper/ entries for zvol
----------------------------------------------

To prevent udisks2 from creating /dev/mapper entries that must be
manually removed or maintained during zvol remove / rename, create a
udev rule such as ``/etc/udev/rules.d/80-udisks2-ignore-zfs.rules`` with
the following contents:

::

   ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_FS_TYPE}=="zfs_member", ENV{ID_PART_ENTRY_TYPE}=="6a898cc3-1dd2-11b2-99a6-080020736631", ENV{UDISKS_IGNORE}="1"

Licensing
---------

ZFS is licensed under the Common Development and Distribution License
(`CDDL `__),
and the Linux kernel is licensed under the GNU General Public License
Version 2 (`GPLv2 `__). While
both are free open source licenses, they are restrictive licenses. The
combination of them causes problems because it prevents using pieces of
code exclusively available under one license with pieces of code
exclusively available under the other in the same binary. In the case of
the kernel, this prevents us from distributing ZFS on Linux as part of
the kernel binary. However, there is nothing in either license that
prevents distributing it in the form of a binary module or in the form
of source code.

Additional reading and opinions:

- `Software Freedom Law
  Center `__
- `Software Freedom
  Conservancy `__
- `Free Software
  Foundation `__
- `Encouraging closed source
  modules `__

Reporting a problem
-------------------

You can open a new issue and search existing issues using the public
`issue tracker `__. The issue
tracker is used to organize outstanding bug reports, feature requests,
and other development tasks. Anyone may post comments after signing up
for a GitHub account.

Please make sure that what you're actually seeing is a bug and not a
support issue. If in doubt, please ask on the mailing list first, and if
you're then asked to file an issue, do so.

When opening a new issue, include this information at the top of the
issue:

- What distribution you're using and the version.
- What spl/zfs packages you're using and the version.
- Describe the problem you're observing.
- Describe how to reproduce the problem.
- Include any warnings/errors/backtraces from the system logs.

When a new issue is opened it's not uncommon for a developer to request
additional information about the problem. In general, the more detail
you share about a problem, the quicker a developer can resolve it. For
example, providing a simple test case is always exceptionally helpful.
Be prepared to work with the developer looking into your bug in order
to get it resolved. They may ask for information like the following; a
short sketch of commands that collects most of it is shown after this
list:

- Your pool configuration as reported by ``zdb`` or ``zpool status``.
- Your hardware configuration, such as:

  - Number of CPUs.
  - Amount of memory.
  - Whether your system has ECC memory.
  - Whether it is running under a VMM/Hypervisor.
  - Kernel version.
  - Values of the spl/zfs module parameters.

- Stack traces which may be logged to ``dmesg``.
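
A minimal, hedged sketch of commands that gather most of this
information on a typical Linux system (paths may differ by
distribution; attach the output to your issue):

::

   $ zpool status -v
   $ uname -r
   $ dmesg | grep -iE 'zfs|spl'
   $ grep . /sys/module/{spl,zfs}/parameters/* 2>/dev/null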
+ +Does ZFS on Linux have a Code of Conduct? +----------------------------------------- + +Yes, the ZFS on Linux community has a code of conduct. See the `Code of +Conduct `__ for details. diff --git a/docs/Fedora.rst b/docs/Fedora.rst new file mode 100644 index 0000000..3e764cd --- /dev/null +++ b/docs/Fedora.rst @@ -0,0 +1,69 @@ +Only +`DKMS `__ +style packages can be provided for Fedora from the official +zfsonlinux.org repository. This is because Fedora is a fast moving +distribution which does not provide a stable kABI. These packages track +the official ZFS on Linux tags and are updated as new versions are +released. Packages are available for the following configurations: + +| **Fedora Releases:** 30, 31 +| **Architectures:** x86_64 + +To simplify installation a zfs-release package is provided which +includes a zfs.repo configuration file and the ZFS on Linux public +signing key. All official ZFS on Linux packages are signed using this +key, and by default both yum and dnf will verify a package's signature +before allowing it be to installed. Users are strongly encouraged to +verify the authenticity of the ZFS on Linux public key using the +fingerprint listed here. + +| **Location:** /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux +| **Fedora 30 Package:** + `http://download.zfsonlinux.org/fedora/zfs-release.fc30.noarch.rpm `__ +| **Fedora 31 Package:** + `http://download.zfsonlinux.org/fedora/zfs-release.fc31.noarch.rpm `__ +| **Fedora 32 Package:** + `http://download.zfsonlinux.org/fedora/zfs-release.fc32.noarch.rpm `__ +| **Download from:** + `pgp.mit.edu `__ +| **Fingerprint:** C93A FFFD 9F3F 7B03 C310 CEB6 A9D5 A1C0 F14A B620 + +.. code:: sh + + $ sudo dnf install http://download.zfsonlinux.org/fedora/zfs-release$(rpm -E %dist).noarch.rpm + $ gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux + pub 2048R/F14AB620 2013-03-21 ZFS on Linux + Key fingerprint = C93A FFFD 9F3F 7B03 C310 CEB6 A9D5 A1C0 F14A B620 + sub 2048R/99685629 2013-03-21 + +The ZFS on Linux packages should be installed with ``dnf`` on Fedora. +Note that it is important to make sure that the matching *kernel-devel* +package is installed for the running kernel since DKMS requires it to +build ZFS. + +.. code:: sh + + $ sudo dnf install kernel-devel zfs + +If the Fedora provided *zfs-fuse* package is already installed on the +system. Then the ``dnf swap`` command should be used to replace the +existing fuse packages with the ZFS on Linux packages. + +.. code:: sh + + $ sudo dnf swap zfs-fuse zfs + +Testing Repositories +-------------------- + +In addition to the primary *zfs* repository a *zfs-testing* repository +is available. This repository, which is disabled by default, contains +the latest version of ZFS on Linux which is under active development. +These packages are made available in order to get feedback from users +regarding the functionality and stability of upcoming releases. These +packages **should not** be used on production systems. Packages from the +testing repository can be installed as follows. + +:: + + $ sudo dnf --enablerepo=zfs-testing install kernel-devel zfs diff --git a/docs/Getting-Started.rst b/docs/Getting-Started.rst new file mode 100644 index 0000000..c8c30d4 --- /dev/null +++ b/docs/Getting-Started.rst @@ -0,0 +1,14 @@ +To get started with OpenZFS refer to the provided documentation for your +distribution. It will cover the recommended installation method and any +distribution specific information. 
First time OpenZFS users are +encouraged to check out Aaron Toponce's `excellent +documentation `__. + +| `ArchLinux `__ +| [[Debian]] +| [[Fedora]] +| `FreeBSD `__ +| `Gentoo `__ +| `openSUSE `__ +| [[RHEL and CentOS]] +| [[Ubuntu]] diff --git a/docs/Git-and-GitHub-for-beginners.rst b/docs/Git-and-GitHub-for-beginners.rst new file mode 100644 index 0000000..a553be5 --- /dev/null +++ b/docs/Git-and-GitHub-for-beginners.rst @@ -0,0 +1,210 @@ +Git and GitHub for beginners (ZoL edition) +========================================== + +This is a very basic rundown of how to use Git and GitHub to make +changes. + +Recommended reading: `ZFS on Linux +CONTRIBUTING.md `__ + +First time setup +================ + +If you've never used Git before, you'll need a little setup to start +things off. + +:: + + git config --global user.name "My Name" + git config --global user.email myemail@noreply.non + +Cloning the initial repository +============================== + +The easiest way to get started is to click the fork icon at the top of +the main repository page. From there you need to download a copy of the +forked repository to your computer: + +:: + + git clone https://github.com//zfs.git + +This sets the "origin" repository to your fork. This will come in handy +when creating pull requests. To make pulling from the "upstream" +repository as changes are made, it is very useful to establish the +upstream repository as another remote (man git-remote): + +:: + + cd zfs + git remote add upstream https://github.com/zfsonlinux/zfs.git + +Preparing and making changes +============================ + +In order to make changes it is recommended to make a branch, this lets +you work on several unrelated changes at once. It is also not +recommended to make changes to the master branch unless you own the +repository. + +:: + + git checkout -b my-new-branch + +From here you can make your changes and move on to the next step. + +Recommended reading: `C Style and Coding Standards for +SunOS `__, +`ZFS on Linux Developer +Resources `__, +`OpenZFS Developer +Resources `__ + +Testing your patches before pushing +=================================== + +Before committing and pushing, you may want to test your patches. There +are several tests you can run against your branch such as style +checking, and functional tests. All pull requests go through these tests +before being pushed to the main repository, however testing locally +takes the load off the build/test servers. This step is optional but +highly recommended, however the test suite should be run on a virtual +machine or a host that currently does not use ZFS. You may need to +install ``shellcheck`` and ``flake8`` to run the ``checkstyle`` +correctly. + +:: + + sh autogen.sh + ./configure + make checkstyle + +Recommended reading: `Building +ZFS `__, `ZFS Test +Suite +README `__ + +Committing your changes to be pushed +==================================== + +When you are done making changes to your branch there are a few more +steps before you can make a pull request. + +:: + + git commit --all --signoff + +This command opens an editor and adds all unstaged files from your +branch. Here you need to describe your change and add a few things: + +:: + + + # Please enter the commit message for your changes. Lines starting + # with '#' will be ignored, and an empty message aborts the commit. + # On branch my-new-branch + # Changes to be committed: + # (use "git reset HEAD ..." 
to unstage) + # + # modified: hello.c + # + +The first thing we need to add is the commit message. This is what is +displayed on the git log, and should be a short description of the +change. By style guidelines, this has to be less than 72 characters in +length. + +Underneath the commit message you can add a more descriptive text to +your commit. The lines in this section have to be less than 72 +characters. + +When you are done, the commit should look like this: + +:: + + Add hello command + + This is a test commit with a descriptive commit message. + This message can be more than one line as shown here. + + Signed-off-by: My Name + Closes #9998 + Issue #9999 + # Please enter the commit message for your changes. Lines starting + # with '#' will be ignored, and an empty message aborts the commit. + # On branch my-new-branch + # Changes to be committed: + # (use "git reset HEAD ..." to unstage) + # + # modified: hello.c + # + +You can also reference issues and pull requests if you are filing a pull +request for an existing issue as shown above. Save and exit the editor +when you are done. + +Pushing and creating the pull request +===================================== + +Home stretch. You've made your change and made the commit. Now it's time +to push it. + +:: + + git push --set-upstream origin my-new-branch + +This should ask you for your github credentials and upload your changes +to your repository. + +The last step is to either go to your repository or the upstream +repository on GitHub and you should see a button for making a new pull +request for your recently committed branch. + +Correcting issues with your pull request +======================================== + +Sometimes things don't always go as planned and you may need to update +your pull request with a correction to either your commit message, or +your changes. This can be accomplished by re-pushing your branch. If you +need to make code changes or ``git add`` a file, you can do those now, +along with the following: + +:: + + git commit --amend + git push --force + +This will return you to the commit editor screen, and push your changes +over top of the old ones. Do note that this will restart the process of +any build/test servers currently running and excessively pushing can +cause delays in processing of all pull requests. + +Maintaining your repository +=========================== + +When you wish to make changes in the future you will want to have an +up-to-date copy of the upstream repository to make your changes on. Here +is how you keep updated: + +:: + + git checkout master + git pull upstream master + git push origin master + +This will make sure you are on the master branch of the repository, grab +the changes from upstream, then push them back to your repository. + +Final words +=========== + +This is a very basic introduction to Git and GitHub, but should get you +on your way to contributing to many open source projects. Not all +projects have style requirements and some may have different processes +to getting changes committed so please refer to their documentation to +see if you need to do anything different. One topic we have not touched +on is the ``git rebase`` command which is a little more advanced for +this wiki article. 
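
For reference, here is a minimal, hedged sketch of the ``git rebase``
workflow mentioned above, using the branch and remote names from earlier
in this guide:

::

   # Replay your branch on top of the latest upstream master
   git checkout my-new-branch
   git fetch upstream
   git rebase upstream/master

   # A rebase rewrites history, so the branch must be force-pushed
   git push --force origin my-new-branch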
+ +Additional resources: `Github Help `__, +`Atlassian Git Tutorials `__ diff --git a/docs/HOWTO-install-Debian-GNU-Linux-to-a-Native-ZFS-Root-Filesystem.rst b/docs/HOWTO-install-Debian-GNU-Linux-to-a-Native-ZFS-Root-Filesystem.rst new file mode 100644 index 0000000..1845d61 --- /dev/null +++ b/docs/HOWTO-install-Debian-GNU-Linux-to-a-Native-ZFS-Root-Filesystem.rst @@ -0,0 +1 @@ +This page has moved to [[Debian Jessie Root on ZFS]]. diff --git a/docs/Home.rst b/docs/Home.rst new file mode 100644 index 0000000..407e47d --- /dev/null +++ b/docs/Home.rst @@ -0,0 +1,19 @@ +.. raw:: html + +

[[/img/480px-Open-ZFS-Secondary-Logo-Colour-halfsize.png|alt=openzfs]]

+ +Welcome to the OpenZFS GitHub wiki. This wiki provides documentation for +users and developers working with (or contributing to) the OpenZFS +project. New users or system administrators should refer to the +documentation for their favorite platform to get started. + ++----------------------+----------------------+----------------------+ +| [[Getting Started]] | [[Project and | [[Developer | +| | Community]] | Resources]] | ++======================+======================+======================+ +| How to get started | About the project | Technical | +| with OpenZFS on your | and how to | documentation | +| favorite platform | contribute | discussing the | +| | | OpenZFS | +| | | implementation | ++----------------------+----------------------+----------------------+ diff --git a/docs/License.rst b/docs/License.rst new file mode 100644 index 0000000..564bf81 --- /dev/null +++ b/docs/License.rst @@ -0,0 +1,9 @@ +|Creative Commons License| + +Wiki content is licensed under a `Creative Commons +Attribution-ShareAlike +license `__ unless +otherwise noted. + +.. |Creative Commons License| image:: https://i.creativecommons.org/l/by-sa/3.0/88x31.png + :target: http://creativecommons.org/licenses/by-sa/3.0/ diff --git a/docs/Mailing-Lists.rst b/docs/Mailing-Lists.rst new file mode 100644 index 0000000..28096b3 --- /dev/null +++ b/docs/Mailing-Lists.rst @@ -0,0 +1,31 @@ ++----------------------+----------------------+----------------------+ +|              | Description | List Archive | +|             List     | | | +|                      | | | ++======================+======================+======================+ +| `zfs-annou | A low-traffic list | `arch | +| nce@list.zfsonlinux. | for announcements | ive `__ | +| ups/zfs-announce>`__ | | | ++----------------------+----------------------+----------------------+ +| `zfs-dis | A user discussion | `arc | +| cuss@list.zfsonlinux | list for issues | hive `__ | +| oups/zfs-discuss>`__ | usability | | ++----------------------+----------------------+----------------------+ +| `zfs | A development list | `a | +| -devel@list.zfsonlin | for developers to | rchive `__ | +| groups/zfs-devel>`__ | | | ++----------------------+----------------------+----------------------+ +| `devel | A | `archive `__ | +| iki/Mailing_list>`__ | developers to review | | +| | ZFS code and | | +| | architecture changes | | +| | from all platforms | | ++----------------------+----------------------+----------------------+ diff --git a/docs/OpenZFS-Patches.rst b/docs/OpenZFS-Patches.rst new file mode 100644 index 0000000..4b50ee6 --- /dev/null +++ b/docs/OpenZFS-Patches.rst @@ -0,0 +1,315 @@ +The ZFS on Linux project is an adaptation of the upstream `OpenZFS +repository `__ designed to work in +a Linux environment. This upstream repository acts as a location where +new features, bug fixes, and performance improvements from all the +OpenZFS platforms can be integrated. Each platform is responsible for +tracking the OpenZFS repository and merging the relevant improvements +back in to their release. + +For the ZFS on Linux project this tracking is managed through an +`OpenZFS tracking `__ +page. The page is updated regularly and shows a list of OpenZFS commits +and their status in regard to the ZFS on Linux master branch. + +This page describes the process of applying outstanding OpenZFS commits +to ZFS on Linux and submitting those changes for inclusion. 
As a +developer this is a great way to familiarize yourself with ZFS on Linux +and to begin quickly making a valuable contribution to the project. The +following guide assumes you have a `github +account `__, +are familiar with git, and are used to developing in a Linux +environment. + +Porting OpenZFS changes to ZFS on Linux +--------------------------------------- + +Setup the Environment +~~~~~~~~~~~~~~~~~~~~~ + +**Clone the source.** Start by making a local clone of the +`spl `__ and +`zfs `__ repositories. + +:: + + $ git clone -o zfsonlinux https://github.com/zfsonlinux/spl.git + $ git clone -o zfsonlinux https://github.com/zfsonlinux/zfs.git + +**Add remote repositories.** Using the GitHub web interface +`fork `__ the +`zfs `__ repository in to your +personal GitHub account. Add your new zfs fork and the +`openzfs `__ repository as remotes +and then fetch both repositories. The OpenZFS repository is large and +the initial fetch may take some time over a slow connection. + +:: + + $ cd zfs + $ git remote add git@github.com:/zfs.git + $ git remote add openzfs https://github.com/openzfs/openzfs.git + $ git fetch --all + +**Build the source.** Compile the spl and zfs master branches. These +branches are always kept stable and this is a useful verification that +you have a full build environment installed and all the required +dependencies are available. This may also speed up the compile time +latter for small patches where incremental builds are an option. + +:: + + $ cd ../spl + $ sh autogen.sh && ./configure --enable-debug && make -s -j$(nproc) + $ + $ cd ../zfs + $ sh autogen.sh && ./configure --enable-debug && make -s -j$(nproc) + +Pick a patch +~~~~~~~~~~~~ + +Consult the `OpenZFS +tracking `__ page and +select a patch which has not yet been applied. For your first patch you +will want to select a small patch to familiarize yourself with the +process. + +Porting a Patch +~~~~~~~~~~~~~~~ + +There are 2 methods: + +- `cherry-pick (easier) <#cherry-pick>`__ +- `manual merge <#manual-merge>`__ + +Please read about `manual merge <#manual-merge>`__ first to learn the +whole process. + +Cherry-pick +^^^^^^^^^^^ + +You can start to +`cherry-pick `__ by your own, +but we have made a special +`script `__, +which tries to +`cherry-pick `__ the patch +automatically and generates the description. + +0) Prepare environment: + +Mandatory git settings (add to ``~/.gitconfig``): + +:: + + [merge] + renameLimit = 999999 + [user] + email = mail@yourmail.com + name = Your Name + +Download the script: + +:: + + wget https://raw.githubusercontent.com/zfsonlinux/zfs-buildbot/master/scripts/openzfs-merge.sh + +1) Run: + +:: + + ./openzfs-merge.sh -d path_to_zfs_folder -c openzfs_commit_hash + +This command will fetch all repositories, create a new branch +``autoport-ozXXXX`` (XXXX - OpenZFS issue number), try to cherry-pick, +compile and check cstyle on success. + +If it succeeds without any merge conflicts - go to ``autoport-ozXXXX`` +branch, it will have ready to pull commit. Congratulations, you can go +to step 7! + +Otherwise you should go to step 2. + +2) Resolve all merge conflicts manually. Easy method - install + `Meld `__ or any other diff tool and run + ``git mergetool``. + +3) Check all compile and cstyle errors (See `Testing a + patch <#testing-a-patch>`__). + +4) Commit your changes with any description. 
+ +5) Update commit description (last commit will be changed): + +:: + + ./openzfs-merge.sh -d path_to_zfs_folder -g openzfs_commit_hash + +6) Add any porting notes (if you have modified something): + ``git commit --amend`` + +7) Push your commit to github: + ``git push autoport-ozXXXX`` + +8) Create a pull request to ZoL master branch. + +9) Go to `Testing a patch <#testing-a-patch>`__ section. + +Manual merge +^^^^^^^^^^^^ + +**Create a new branch.** It is important to create a new branch for +every commit you port to ZFS on Linux. This will allow you to easily +submit your work as a GitHub pull request and it makes it possible to +work on multiple OpenZFS changes concurrently. All development branches +need to be based off of the ZFS master branch and it's helpful to name +the branches after the issue number you're working on. + +:: + + $ git checkout -b openzfs- master + +**Generate a patch.** One of the first things you'll notice about the +ZFS on Linux repository is that it is laid out differently than the +OpenZFS repository. Organizationally it is much flatter, this is +possible because it only contains the code for OpenZFS not an entire OS. +That means that in order to apply a patch from OpenZFS the path names in +the patch must be changed. A script called zfs2zol-patch.sed has been +provided to perform this translation. Use the ``git format-patch`` +command and this script to generate a patch. + +:: + + $ git format-patch --stdout ^.. | \ + ./scripts/zfs2zol-patch.sed >openzfs-.diff + +**Apply the patch.** In many cases the generated patch will apply +cleanly to the repository. However, it's important to keep in mind the +zfs2zol-patch.sed script only translates the paths. There are often +additional reasons why a patch might not apply. In some cases hunks of +the patch may not be applicable to Linux and should be dropped. In other +cases a patch may depend on other changes which must be applied first. +The changes may also conflict with Linux specific modifications. In all +of these cases the patch will need to be manually modified to apply +cleanly while preserving the its original intent. + +:: + + $ git am ./openzfs-.diff + +**Update the commit message.** By using ``git format-patch`` to generate +the patch and then ``git am`` to apply it the original comment and +authorship will be preserved. However, due to the formatting of the +OpenZFS commit you will likely find that the entire commit comment has +been squashed in to the subject line. Use ``git commit --amend`` to +cleanup the comment and be careful to follow `these standard +guidelines `__. + +The summary line of an OpenZFS commit is often very long and you should +truncate it to 50 characters. This is useful because it preserves the +correct formatting of ``git log --pretty=oneline`` command. Make sure to +leave a blank line between the summary and body of the commit. Then +include the full OpenZFS commit message wrapping any lines which exceed +72 characters. Finally, add a ``Ported-by`` tag with your contact +information and both a ``OpenZFS-issue`` and ``OpenZFS-commit`` tag with +appropriate links. You'll want to verify your commit contains all of the +following information: + +- The subject line from the original OpenZFS patch in the form: + "OpenZFS - short description". +- The original patch authorship should be preserved. +- The OpenZFS commit message. +- The following tags: + + - **Authored by:** Original patch author + - **Reviewed by:** All OpenZFS reviewers from the original patch. 
+ - **Approved by:** All OpenZFS reviewers from the original patch. + - **Ported-by:** Your name and email address. + - **OpenZFS-issue:** https ://www.illumos.org/issues/issue + - **OpenZFS-commit:** https + ://github.com/openzfs/openzfs/commit/hash + +- **Porting Notes:** An optional section describing any changes + required when porting. + +For example, OpenZFS issue 6873 was `applied to +Linux `__ from this +upstream `OpenZFS +commit `__. + +:: + + OpenZFS 6873 - zfs_destroy_snaps_nvl leaks errlist + + Authored by: Chris Williamson + Reviewed by: Matthew Ahrens + Reviewed by: Paul Dagnelie + Ported-by: Denys Rtveliashvili + + lzc_destroy_snaps() returns an nvlist in errlist. + zfs_destroy_snaps_nvl() should nvlist_free() it before returning. + + OpenZFS-issue: https://www.illumos.org/issues/6873 + OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ee06391 + +Testing a Patch +~~~~~~~~~~~~~~~ + +**Build the source.** Verify the patched source compiles without errors +and all warnings are resolved. + +:: + + $ make -s -j$(nproc) + +**Run the style checker.** Verify the patched source passes the style +checker, the command should return without printing any output. + +:: + + $ make cstyle + +**Open a Pull Request.** When your patch builds cleanly and passes the +style checks `open a new pull +request `__. +The pull request will be queued for `automated +testing `__. As part of the +testing the change is built for a wide range of Linux distributions and +a battery of functional and stress tests are run to detect regressions. + +:: + + $ git push openzfs- + +**Fix any issues.** Testing takes approximately 2 hours to fully +complete and the results are posted in the GitHub `pull +request `__. All the tests +are expected to pass and you should investigate and resolve any test +failures. The `test +scripts `__ +are all available and designed to run locally in order reproduce an +issue. Once you've resolved the issue force update the pull request to +trigger a new round of testing. Iterate until all the tests are passing. + +:: + + # Fix issue, amend commit, force update branch. + $ git commit --amend + $ git push --force openzfs- + +Merging the Patch +~~~~~~~~~~~~~~~~~ + +**Review.** Lastly one of the ZFS on Linux maintainers will make a final +review of the patch and may request additional changes. Once the +maintainer is happy with the final version of the patch they will add +their signed-off-by, merge it to the master branch, mark it complete on +the tracking page, and thank you for your contribution to the project! + +Porting ZFS on Linux changes to OpenZFS +--------------------------------------- + +Often an issue will be first fixed in ZFS on Linux or a new feature +developed. Changes which are not Linux specific should be submitted +upstream to the OpenZFS GitHub repository for review. The process for +this is described in the `OpenZFS +README `__. diff --git a/docs/OpenZFS-Tracking.rst b/docs/OpenZFS-Tracking.rst new file mode 100644 index 0000000..6d7cf37 --- /dev/null +++ b/docs/OpenZFS-Tracking.rst @@ -0,0 +1,2 @@ +This page is obsolete, use +`http://build.zfsonlinux.org/openzfs-tracking.html `__ diff --git a/docs/OpenZFS-exceptions.rst b/docs/OpenZFS-exceptions.rst new file mode 100644 index 0000000..6f74066 --- /dev/null +++ b/docs/OpenZFS-exceptions.rst @@ -0,0 +1,569 @@ +Commit exceptions used to explicitly reference a given Linux commit. +These exceptions are useful for a variety of reasons. 
+ +**This page is used to generate**\ `OpenZFS +Tracking `__\ **page.** + +Format: +^^^^^^^ + +- ``|-|`` - The OpenZFS commit isn't applicable + to Linux, or the OpenZFS -> ZFS on Linux commit matching is unable to + associate the related commits due to lack of information (denoted by + a -). +- ``||`` - The fix was merged to Linux + prior to their being an OpenZFS issue. +- ``|!|`` - The commit is applicable but not + applied for the reason described in the comment. + ++------------------+-------------------+-----------------------------+ +| OpenZFS issue id | status/ZFS commit | comment | ++==================+===================+=============================+ +| 10500 | 03916905 | | ++------------------+-------------------+-----------------------------+ +| 10154 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 10067 | - | The only ZFS change was to | +| | | zfs remap, which was | +| | | removed on Linux. | ++------------------+-------------------+-----------------------------+ +| 9884 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 9851 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 9683 | - | Not applicable to Linux due | +| | | to devids not being used | ++------------------+-------------------+-----------------------------+ +| 9680 | - | Applied and rolled back in | +| | | OpenZFS, additional changes | +| | | needed. | ++------------------+-------------------+-----------------------------+ +| 9672 | 29445fe3 | | ++------------------+-------------------+-----------------------------+ +| 9626 | 59e6e7ca | | ++------------------+-------------------+-----------------------------+ +| 9635 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 9623 | 22448f08 | | ++------------------+-------------------+-----------------------------+ +| 9621 | 305bc4b3 | | ++------------------+-------------------+-----------------------------+ +| 9539 | 5228cf01 | | ++------------------+-------------------+-----------------------------+ +| 9512 | b4555c77 | | ++------------------+-------------------+-----------------------------+ +| 9487 | 48fbb9dd | | ++------------------+-------------------+-----------------------------+ +| 9466 | 272b5d73 | | ++------------------+-------------------+-----------------------------+ +| 9433 | 0873bb63 | | ++------------------+-------------------+-----------------------------+ +| 9421 | 64c1dcef | | ++------------------+-------------------+-----------------------------+ +| 9237 | - | Introduced by 8567 which | +| | | was never applied to Linux | ++------------------+-------------------+-----------------------------+ +| 9194 | - | Not applicable the '-o | +| | | ashift=value' option is | +| | | provided on Linux | ++------------------+-------------------+-----------------------------+ +| 9077 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 9027 | 4a5d7f82 | | ++------------------+-------------------+-----------------------------+ +| 9018 | 3ec34e55 | | ++------------------+-------------------+-----------------------------+ +| 8984 | ! 
| WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 8969 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 8942 | 650258d7 | | ++------------------+-------------------+-----------------------------+ +| 8941 | 390d679a | | ++------------------+-------------------+-----------------------------+ +| 8858 | - | Not applicable to Linux | ++------------------+-------------------+-----------------------------+ +| 8856 | - | Not applicable to Linux due | +| | | to Encryption (b525630) | ++------------------+-------------------+-----------------------------+ +| 8809 | ! | Adding libfakekernel needs | +| | | to be done by refactoring | +| | | existing code. | ++------------------+-------------------+-----------------------------+ +| 8713 | 871e0732 | | ++------------------+-------------------+-----------------------------+ +| 8661 | 1ce23dca | | ++------------------+-------------------+-----------------------------+ +| 8648 | f763c3d1 | | ++------------------+-------------------+-----------------------------+ +| 8602 | a032ac4 | | ++------------------+-------------------+-----------------------------+ +| 8601 | d99a015 | Equivalent fix included in | +| | | initial commit | ++------------------+-------------------+-----------------------------+ +| 8590 | 935e2c2 | | ++------------------+-------------------+-----------------------------+ +| 8569 | - | This change isn't relevant | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 8567 | - | An alternate fix was | +| | | applied for Linux. | ++------------------+-------------------+-----------------------------+ +| 8552 | 935e2c2 | | ++------------------+-------------------+-----------------------------+ +| 8521 | ee6370a7 | | ++------------------+-------------------+-----------------------------+ +| 8502 | ! | Apply when porting OpenZFS | +| | | 7955 | ++------------------+-------------------+-----------------------------+ +| 8477 | 92e43c1 | | ++------------------+-------------------+-----------------------------+ +| 8454 | - | An alternate fix was | +| | | applied for Linux. | ++------------------+-------------------+-----------------------------+ +| 8408 | 5f1346c | | ++------------------+-------------------+-----------------------------+ +| 8379 | - | This change isn't relevant | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 8376 | - | This change isn't relevant | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 8311 | ! | Need to assess | +| | | applicability to Linux. | ++------------------+-------------------+-----------------------------+ +| 8304 | - | This change isn't relevant | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 8300 | 44f09cd | | ++------------------+-------------------+-----------------------------+ +| 8265 | - | The large_dnode feature has | +| | | been implemented for Linux. | ++------------------+-------------------+-----------------------------+ +| 8168 | 78d95ea | | ++------------------+-------------------+-----------------------------+ +| 8138 | 44f09cd | The spelling fix to the zfs | +| | | man page came in with the | +| | | mdoc conversion. | ++------------------+-------------------+-----------------------------+ +| 8108 | - | An equivalent Linux | +| | | specific fix was made. 
| ++------------------+-------------------+-----------------------------+ +| 8064 | - | This change isn't relevant | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 8021 | 7657def | | ++------------------+-------------------+-----------------------------+ +| 8022 | e55ebf6 | | ++------------------+-------------------+-----------------------------+ +| 8013 | - | The change is illumos | +| | | specific and not applicable | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 7982 | - | The change is illumos | +| | | specific and not applicable | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 7970 | c30e58c | | ++------------------+-------------------+-----------------------------+ +| 7956 | cda0317 | | ++------------------+-------------------+-----------------------------+ +| 7955 | ! | Need to assess | +| | | applicability to Linux. If | +| | | porting, apply 8502. | ++------------------+-------------------+-----------------------------+ +| 7869 | df7eecc | | ++------------------+-------------------+-----------------------------+ +| 7816 | - | The change is illumos | +| | | specific and not applicable | +| | | for Linux. | ++------------------+-------------------+-----------------------------+ +| 7803 | - | This functionality is | +| | | provided by | +| | | ``upda | +| | | te_vdev_config_dev_strs()`` | +| | | on Linux. | ++------------------+-------------------+-----------------------------+ +| 7801 | 0eef1bd | Commit f25efb3 in | +| | | openzfs/master has a small | +| | | change for linting which is | +| | | being ported. | ++------------------+-------------------+-----------------------------+ +| 7779 | - | The change isn't relevant, | +| | | ``zfs_ctldir.c`` was | +| | | rewritten for Linux. | ++------------------+-------------------+-----------------------------+ +| 7740 | 32d41fb | | ++------------------+-------------------+-----------------------------+ +| 7739 | 582cc014 | | ++------------------+-------------------+-----------------------------+ +| 7730 | e24e62a | | ++------------------+-------------------+-----------------------------+ +| 7710 | - | None of the illumos build | +| | | system is used under Linux. | ++------------------+-------------------+-----------------------------+ +| 7602 | 44f09cd | | ++------------------+-------------------+-----------------------------+ +| 7591 | 541a090 | | ++------------------+-------------------+-----------------------------+ +| 7586 | c443487 | | ++------------------+-------------------+-----------------------------+ +| 7570 | - | Due to differences in the | +| | | block layer all discards | +| | | are handled asynchronously | +| | | under Linux. This | +| | | functionality could be | +| | | ported but it's unclear to | +| | | what purpose. | ++------------------+-------------------+-----------------------------+ +| 7542 | - | The Linux libshare code | +| | | differs significantly from | +| | | the upstream OpenZFS code. | +| | | Since this change doesn't | +| | | address a Linux specific | +| | | issue it doesn't need to be | +| | | ported. The eventual plan | +| | | is to retire all of the | +| | | existing libshare code and | +| | | use the ZED to more | +| | | flexibly control filesystem | +| | | sharing. | ++------------------+-------------------+-----------------------------+ +| 7512 | - | None of the illumos build | +| | | system is used under Linux. 
| ++------------------+-------------------+-----------------------------+ +| 7497 | - | DTrace is isn't readily | +| | | available under Linux. | ++------------------+-------------------+-----------------------------+ +| 7446 | ! | Need to assess | +| | | applicability to Linux. | ++------------------+-------------------+-----------------------------+ +| 7430 | 68cbd56 | | ++------------------+-------------------+-----------------------------+ +| 7402 | 690fe64 | | ++------------------+-------------------+-----------------------------+ +| 7345 | 058ac9b | | ++------------------+-------------------+-----------------------------+ +| 7278 | - | Dynamic ARC tuning is | +| | | handled slightly | +| | | differently under Linux and | +| | | this case is covered by | +| | | arc_tuning_update() | ++------------------+-------------------+-----------------------------+ +| 7238 | - | zvol_swap test already | +| | | disabled in ZoL | ++------------------+-------------------+-----------------------------+ +| 7194 | d7958b4 | | ++------------------+-------------------+-----------------------------+ +| 7164 | b1b85c87 | | ++------------------+-------------------+-----------------------------+ +| 7041 | 33c0819 | | ++------------------+-------------------+-----------------------------+ +| 7016 | d3c2ae1 | | ++------------------+-------------------+-----------------------------+ +| 6914 | - | Under Linux the | +| | | arc_meta_limit can be tuned | +| | | with the | +| | | zfs_arc_meta_limit_percent | +| | | module option. | ++------------------+-------------------+-----------------------------+ +| 6875 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 6843 | f5f087e | | ++------------------+-------------------+-----------------------------+ +| 6841 | 4254acb | | ++------------------+-------------------+-----------------------------+ +| 6781 | 15313c5 | | ++------------------+-------------------+-----------------------------+ +| 6765 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 6764 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 6763 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 6762 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 6648 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6578 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6577 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6575 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6568 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6528 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6494 | - | The ``vdev_disk.c`` and | +| | | ``vdev_file.c`` files have | +| | | been reworked extensively | +| | | for Linux. The proposed | +| | | changes are not needed. 
| ++------------------+-------------------+-----------------------------+ +| 6468 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6465 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6434 | 472e7c6 | | ++------------------+-------------------+-----------------------------+ +| 6421 | ca0bf58 | | ++------------------+-------------------+-----------------------------+ +| 6418 | 131cc95 | | ++------------------+-------------------+-----------------------------+ +| 6391 | ee06391 | | ++------------------+-------------------+-----------------------------+ +| 6390 | 85802aa | | ++------------------+-------------------+-----------------------------+ +| 6388 | 0de7c55 | | ++------------------+-------------------+-----------------------------+ +| 6386 | 485c581 | | ++------------------+-------------------+-----------------------------+ +| 6385 | f3ad9cd | | ++------------------+-------------------+-----------------------------+ +| 6369 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6368 | 2024041 | | ++------------------+-------------------+-----------------------------+ +| 6346 | 058ac9b | | ++------------------+-------------------+-----------------------------+ +| 6334 | 1a04bab | | ++------------------+-------------------+-----------------------------+ +| 6290 | 017da6 | | ++------------------+-------------------+-----------------------------+ +| 6250 | - | Linux handles crash dumps | +| | | in a fundamentally | +| | | different way than Illumos. | +| | | The proposed changes are | +| | | not needed. | ++------------------+-------------------+-----------------------------+ +| 6249 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6248 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 6220 | - | The b_thawed debug code was | +| | | unused under Linux and | +| | | removed. | ++------------------+-------------------+-----------------------------+ +| 6209 | - | The Linux user space mutex | +| | | implementation is based on | +| | | phtread primitives. | ++------------------+-------------------+-----------------------------+ +| 6095 | f866a4ea | | ++------------------+-------------------+-----------------------------+ +| 6091 | c11f100 | | ++------------------+-------------------+-----------------------------+ +| 5984 | 480f626 | | ++------------------+-------------------+-----------------------------+ +| 5966 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 5961 | 22872ff | | ++------------------+-------------------+-----------------------------+ +| 5882 | 83e9986 | | ++------------------+-------------------+-----------------------------+ +| 5815 | - | This patch could be adapted | +| | | if needed use equivalent | +| | | Linux functionality. | ++------------------+-------------------+-----------------------------+ +| 5770 | c3275b5 | | ++------------------+-------------------+-----------------------------+ +| 5769 | dd26aa5 | | ++------------------+-------------------+-----------------------------+ +| 5768 | - | The change isn't relevant, | +| | | ``zfs_ctldir.c`` was | +| | | rewritten for Linux. 
| ++------------------+-------------------+-----------------------------+ +| 5766 | 4dd1893 | | ++------------------+-------------------+-----------------------------+ +| 5693 | 0f7d2a4 | | ++------------------+-------------------+-----------------------------+ +| 5692 | ! | This functionality should | +| | | be ported in such a way | +| | | that it can be integrated | +| | | with ``filefrag(8)``. | ++------------------+-------------------+-----------------------------+ +| 5684 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 5410 | 0bf8501 | | ++------------------+-------------------+-----------------------------+ +| 5409 | b23d543 | | ++------------------+-------------------+-----------------------------+ +| 5379 | - | This particular issue never | +| | | impacted Linux due to the | +| | | need for a modified | +| | | zfs_putpage() | +| | | implementation. | ++------------------+-------------------+-----------------------------+ +| 5316 | - | The illumos idmap facility | +| | | isn't available under | +| | | Linux. This patch could | +| | | still be applied to | +| | | minimize code delta or all | +| | | HAVE_IDMAP chunks could be | +| | | removed on Linux for better | +| | | readability. | ++------------------+-------------------+-----------------------------+ +| 5313 | ec8501e | | ++------------------+-------------------+-----------------------------+ +| 5312 | ! | This change should be made | +| | | but the ideal time to do it | +| | | is when the spl repository | +| | | is folded in to the zfs | +| | | repository (planned for | +| | | 0.8). At this time we'll | +| | | want to cleanup many of the | +| | | includes. | ++------------------+-------------------+-----------------------------+ +| 5219 | ef56b07 | | ++------------------+-------------------+-----------------------------+ +| 5179 | 3f4058c | | ++------------------+-------------------+-----------------------------+ +| 5149 | - | Equivalent Linux | +| | | functionality is provided | +| | | by the | +| | | ``zvol_max_discard_blocks`` | +| | | module option. | ++------------------+-------------------+-----------------------------+ +| 5148 | - | Discards are handled | +| | | differently under Linux, | +| | | there is no DKIOCFREE | +| | | ioctl. | ++------------------+-------------------+-----------------------------+ +| 5136 | e8b96c6 | | ++------------------+-------------------+-----------------------------+ +| 4752 | aa9af22 | | ++------------------+-------------------+-----------------------------+ +| 4745 | 411bf20 | | ++------------------+-------------------+-----------------------------+ +| 4698 | 4fcc437 | | ++------------------+-------------------+-----------------------------+ +| 4620 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 4573 | 10b7549 | | ++------------------+-------------------+-----------------------------+ +| 4571 | 6e1b9d0 | | ++------------------+-------------------+-----------------------------+ +| 4570 | b1d13a6 | | ++------------------+-------------------+-----------------------------+ +| 4391 | 78e2739 | | ++------------------+-------------------+-----------------------------+ +| 4465 | cda0317 | | ++------------------+-------------------+-----------------------------+ +| 4263 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 4242 | - | Neither vnodes or their | +| | | associated events exist | +| | | under Linux. 
| ++------------------+-------------------+-----------------------------+ +| 4206 | 2820bc4 | | ++------------------+-------------------+-----------------------------+ +| 4188 | 2e7b765 | | ++------------------+-------------------+-----------------------------+ +| 4181 | 44f09cd | | ++------------------+-------------------+-----------------------------+ +| 4161 | - | The Linux user space | +| | | reader/writer | +| | | implementation is based on | +| | | phtread primitives. | ++------------------+-------------------+-----------------------------+ +| 4128 | ! | The | +| | | ldi_ev_register_callbacks() | +| | | interface doesn't exist | +| | | under Linux. It may be | +| | | possible to receive similar | +| | | notifications via the scsi | +| | | error handlers or possibly | +| | | a different interface. | ++------------------+-------------------+-----------------------------+ +| 4072 | - | None of the illumos build | +| | | system is used under Linux. | ++------------------+-------------------+-----------------------------+ +| 3947 | 7f9d994 | | ++------------------+-------------------+-----------------------------+ +| 3928 | - | Neither vnodes or their | +| | | associated events exist | +| | | under Linux. | ++------------------+-------------------+-----------------------------+ +| 3871 | d1d7e268 | | ++------------------+-------------------+-----------------------------+ +| 3747 | 090ff09 | | ++------------------+-------------------+-----------------------------+ +| 3705 | - | The Linux implementation | +| | | uses the lz4 workspace kmem | +| | | cache to resolve the stack | +| | | issue. | ++------------------+-------------------+-----------------------------+ +| 3606 | c5b247f | | ++------------------+-------------------+-----------------------------+ +| 3580 | - | Linux provides generic | +| | | ioctl handlers get/set | +| | | block device information. | ++------------------+-------------------+-----------------------------+ +| 3543 | 8dca0a9 | | ++------------------+-------------------+-----------------------------+ +| 3512 | 67629d0 | | ++------------------+-------------------+-----------------------------+ +| 3507 | 43a696e | | ++------------------+-------------------+-----------------------------+ +| 3444 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 3371 | 44f09cd | | ++------------------+-------------------+-----------------------------+ +| 3311 | 6bb24f4 | | ++------------------+-------------------+-----------------------------+ +| 3301 | - | The Linux implementation of | +| | | ``vdev_disk.c`` does not | +| | | include this comment. | ++------------------+-------------------+-----------------------------+ +| 3258 | 9d81146 | | ++------------------+-------------------+-----------------------------+ +| 3254 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 3246 | cc92e9d | | ++------------------+-------------------+-----------------------------+ +| 2933 | - | None of the illumos build | +| | | system is used under Linux. | ++------------------+-------------------+-----------------------------+ +| 2897 | fb82700 | | ++------------------+-------------------+-----------------------------+ +| 2665 | 32a9872 | | ++------------------+-------------------+-----------------------------+ +| 2130 | 460a021 | | ++------------------+-------------------+-----------------------------+ +| 1974 | - | This change was entirely | +| | | replaced in the ARC | +| | | restructuring. 
| ++------------------+-------------------+-----------------------------+ +| 1898 | - | The zfs_putpage() function | +| | | was rewritten to properly | +| | | integrate with the Linux | +| | | VM. | ++------------------+-------------------+-----------------------------+ +| 1700 | - | Not applicable to Linux, | +| | | the discard implementation | +| | | is entirely different. | ++------------------+-------------------+-----------------------------+ +| 1618 | ca67b33 | | ++------------------+-------------------+-----------------------------+ +| 1337 | 2402458 | | ++------------------+-------------------+-----------------------------+ +| 1126 | e43b290 | | ++------------------+-------------------+-----------------------------+ +| 763 | 3cee226 | | ++------------------+-------------------+-----------------------------+ +| 742 | ! | WIP to support NFSv4 ACLs | ++------------------+-------------------+-----------------------------+ +| 701 | 460a021 | | ++------------------+-------------------+-----------------------------+ +| 348 | - | The Linux implementation of | +| | | ``vdev_disk.c`` must have | +| | | this differently. | ++------------------+-------------------+-----------------------------+ +| 243 | - | Manual updates have been | +| | | made separately for Linux. | ++------------------+-------------------+-----------------------------+ +| 184 | - | The zfs_putpage() function | +| | | was rewritten to properly | +| | | integrate with the Linux | +| | | VM. | ++------------------+-------------------+-----------------------------+ diff --git a/docs/Project-and-Community.rst b/docs/Project-and-Community.rst new file mode 100644 index 0000000..b52a8a2 --- /dev/null +++ b/docs/Project-and-Community.rst @@ -0,0 +1,24 @@ +OpenZFS is storage software which combines the functionality of +traditional filesystems, volume manager, and more. OpenZFS includes +protection against data corruption, support for high storage capacities, +efficient data compression, snapshots and copy-on-write clones, +continuous integrity checking and automatic repair, remote replication +with ZFS send and receive, and RAID-Z. + +OpenZFS brings together developers from the illumos, Linux, FreeBSD and +OS X platforms, and a wide range of companies -- both online and at the +annual OpenZFS Developer Summit. High-level goals of the project include +raising awareness of the quality, utility and availability of +open-source implementations of ZFS, encouraging open communication about +ongoing efforts toward improving open-source variants of ZFS, and +ensuring consistent reliability, functionality and performance of all +distributions of ZFS. + +| `Admin + Documentation `__ +| [[FAQ]] +| [[Mailing Lists]] +| `Releases `__ +| `Issue Tracker `__ +| `Roadmap `__ +| [[Signing Keys]] diff --git a/docs/RHEL-and-CentOS.rst b/docs/RHEL-and-CentOS.rst new file mode 100644 index 0000000..4c14c22 --- /dev/null +++ b/docs/RHEL-and-CentOS.rst @@ -0,0 +1,166 @@ +`kABI-tracking +kmod `__ +or +`DKMS `__ +style packages are provided for RHEL / CentOS based distributions from +the official zfsonlinux.org repository. These packages track the +official ZFS on Linux tags and are updated as new versions are released. +Packages are available for the following configurations: + +| **EL Releases:** 6.x, 7.x, 8.x +| **Architectures:** x86_64 + +To simplify installation a zfs-release package is provided which +includes a zfs.repo configuration file and the ZFS on Linux public +signing key. 
All official ZFS on Linux packages are signed using this
+key, and by default yum will verify a package's signature before
+allowing it to be installed. Users are strongly encouraged to verify the
+authenticity of the ZFS on Linux public key using the fingerprint listed
+here.
+
+| **Location:** /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
+| **EL6 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el6.noarch.rpm `__
+| **EL7.5 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el7_5.noarch.rpm `__
+| **EL7.6 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el7_6.noarch.rpm `__
+| **EL7.7 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el7_7.noarch.rpm `__
+| **EL7.8 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el7_8.noarch.rpm `__
+| **EL8.0 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el8_0.noarch.rpm `__
+| **EL8.1 Package:**
+  `http://download.zfsonlinux.org/epel/zfs-release.el8_1.noarch.rpm `__
+| **Note:** Starting with EL7.7 **zfs-0.8** will become the default,
+  EL7.6 and older will continue to track the **zfs-0.7** point releases.
+
+| **Download from:**
+  `pgp.mit.edu `__
+| **Fingerprint:** C93A FFFD 9F3F 7B03 C310 CEB6 A9D5 A1C0 F14A B620
+
+::
+
+   $ sudo yum install http://download.zfsonlinux.org/epel/zfs-release..noarch.rpm
+   $ gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
+   pub  2048R/F14AB620 2013-03-21 ZFS on Linux
+   Key fingerprint = C93A FFFD 9F3F 7B03 C310 CEB6 A9D5 A1C0 F14A B620
+   sub  2048R/99685629 2013-03-21
+
+After installing the zfs-release package and verifying the public key,
+users can opt to install either the kABI-tracking kmod or DKMS style
+packages. For most users the kABI-tracking kmod packages are recommended
+in order to avoid needing to rebuild ZFS for every kernel update. DKMS
+packages are recommended for users running a non-distribution kernel or
+for users who wish to apply local customizations to ZFS on Linux.
+
+kABI-tracking kmod
+------------------
+
+By default the zfs-release package is configured to install DKMS style
+packages so they will work with a wide range of kernels. In order to
+install the kABI-tracking kmods, the default repository in the
+*/etc/yum.repos.d/zfs.repo* file must be switched from *zfs* to
+*zfs-kmod*. Keep in mind that the kABI-tracking kmods are only verified
+to work with the distribution provided kernel.
+
+.. code:: diff
+
+    # /etc/yum.repos.d/zfs.repo
+    [zfs]
+    name=ZFS on Linux for EL 7 - dkms
+    baseurl=http://download.zfsonlinux.org/epel/7/$basearch/
+   -enabled=1
+   +enabled=0
+    metadata_expire=7d
+    gpgcheck=1
+    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
+   @@ -9,7 +9,7 @@
+    [zfs-kmod]
+    name=ZFS on Linux for EL 7 - kmod
+    baseurl=http://download.zfsonlinux.org/epel/7/kmod/$basearch/
+   -enabled=0
+   +enabled=1
+    metadata_expire=7d
+    gpgcheck=1
+    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
+
+The ZFS on Linux packages can now be installed using yum.
+
+::
+
+   $ sudo yum install zfs
+
+DKMS
+----
+
+To install DKMS style packages, issue the following yum commands. First
+add the `EPEL repository `__ which
+provides DKMS by installing the *epel-release* package, then the
+*kernel-devel* and *zfs* packages. Note that it is important to make
+sure that the matching *kernel-devel* package is installed for the
+running kernel since DKMS requires it to build ZFS.
+ +:: + + $ sudo yum install epel-release + $ sudo yum install "kernel-devel-uname-r == $(uname -r)" zfs + +Important Notices +----------------- + +.. _rhelcentos-7x-kmod-package-upgrade: + +RHEL/CentOS 7.x kmod package upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When updating to a new RHEL/CentOS 7.x release the existing kmod +packages will not work due to upstream kABI changes in the kernel. After +upgrading to 7.x users must uninstall ZFS and then reinstall it as +described in the `kABI-tracking +kmod `__ +section. Compatible kmod packages will be installed from the matching +CentOS 7.x repository. + +:: + + $ sudo yum remove zfs zfs-kmod spl spl-kmod libzfs2 libnvpair1 libuutil1 libzpool2 zfs-release + $ sudo yum install http://download.zfsonlinux.org/epel/zfs-release.el7_6.noarch.rpm + $ sudo yum autoremove + $ sudo yum clean metadata + $ sudo yum install zfs + +Switching from DKMS to kABI-tracking kmod +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When switching from DKMS to kABI-tracking kmods first uninstall the +existing DKMS packages. This should remove the kernel modules for all +installed kernels but in practice it's not always perfectly reliable. +Therefore, it's recommended that you manually remove any remaining ZFS +kernel modules as shown. At this point the kABI-tracking kmods can be +installed as described in the section above. + +:: + + $ sudo yum remove zfs zfs-kmod spl spl-kmod libzfs2 libnvpair1 libuutil1 libzpool2 zfs-release + + $ sudo find /lib/modules/ \( -name "splat.ko" -or -name "zcommon.ko" \ + -or -name "zpios.ko" -or -name "spl.ko" -or -name "zavl.ko" -or \ + -name "zfs.ko" -or -name "znvpair.ko" -or -name "zunicode.ko" \) \ + -exec /bin/rm {} \; + +Testing Repositories +-------------------- + +In addition to the primary *zfs* repository a *zfs-testing* repository +is available. This repository, which is disabled by default, contains +the latest version of ZFS on Linux which is under active development. +These packages are made available in order to get feedback from users +regarding the functionality and stability of upcoming releases. These +packages **should not** be used on production systems. Packages from the +testing repository can be installed as follows. + +:: + + $ sudo yum --enablerepo=zfs-testing install kernel-devel zfs diff --git a/docs/Signing-Keys.rst b/docs/Signing-Keys.rst new file mode 100644 index 0000000..90d1cac --- /dev/null +++ b/docs/Signing-Keys.rst @@ -0,0 +1,61 @@ +All tagged ZFS on Linux +`releases `__ are signed by +the official maintainer for that branch. These signatures are +automatically verified by GitHub and can be checked locally by +downloading the maintainers public key. + +Maintainers +----------- + +Release branch (spl/zfs-*-release) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +| **Maintainer:** `Ned Bass `__ +| **Download:** + `pgp.mit.edu `__ +| **Key ID:** C77B9667 +| **Fingerprint:** 29D5 610E AE29 41E3 55A2 FE8A B974 67AA C77B 9667 + +| **Maintainer:** `Tony Hutter `__ +| **Download:** + `pgp.mit.edu `__ +| **Key ID:** D4598027 +| **Fingerprint:** 4F3B A9AB 6D1F 8D68 3DC2 DFB5 6AD8 60EE D459 8027 + +Master branch (master) +~~~~~~~~~~~~~~~~~~~~~~ + +| **Maintainer:** `Brian Behlendorf `__ +| **Download:** + `pgp.mit.edu `__ +| **Key ID:** C6AF658B +| **Fingerprint:** C33D F142 657E D1F7 C328 A296 0AB9 E991 C6AF 658B + +Checking the Signature of a Git Tag +----------------------------------- + +First import the public key listed above in to your key ring. 
+
+::
+
+   $ gpg --keyserver pgp.mit.edu --recv C6AF658B
+   gpg: requesting key C6AF658B from hkp server pgp.mit.edu
+   gpg: key C6AF658B: "Brian Behlendorf " not changed
+   gpg: Total number processed: 1
+   gpg:              unchanged: 1
+
+After the public key is imported, the signature of a git tag can be
+verified as shown.
+
+::
+
+   $ git tag --verify zfs-0.6.5
+   object 7a27ad00ae142b38d4aef8cc0af7a72b4c0e44fe
+   type commit
+   tag zfs-0.6.5
+   tagger Brian Behlendorf  1441996302 -0700
+
+   ZFS Version 0.6.5
+   gpg: Signature made Fri 11 Sep 2015 11:31:42 AM PDT using DSA key ID C6AF658B
+   gpg: Good signature from "Brian Behlendorf "
+   gpg:                 aka "Brian Behlendorf (LLNL) "
diff --git a/docs/Troubleshooting.rst b/docs/Troubleshooting.rst
new file mode 100644
index 0000000..ab330c0
--- /dev/null
+++ b/docs/Troubleshooting.rst
@@ -0,0 +1,107 @@
+DRAFT
+=====
+
+This page contains tips for troubleshooting ZFS on Linux and what info
+developers might want for bug triage.
+
+- `About Log Files <#about-log-files>`__
+
+  - `Generic Kernel Log <#generic-kernel-log>`__
+  - `ZFS Kernel Module Debug
+    Messages <#zfs-kernel-module-debug-messages>`__
+
+- `Unkillable Process <#unkillable-process>`__
+- `ZFS Events <#zfs-events>`__
+
+--------------
+
+About Log Files
+---------------
+
+Log files can be very useful for troubleshooting. In some cases,
+interesting information is stored in multiple log files that are
+correlated to system events.
+
+Pro tip: logging infrastructure tools like *elasticsearch*, *fluentd*,
+*influxdb*, or *splunk* can simplify log analysis and event correlation.
+
+Generic Kernel Log
+~~~~~~~~~~~~~~~~~~
+
+Typically, Linux kernel log messages are available from ``dmesg -T``,
+``/var/log/syslog``, or wherever kernel log messages are sent (e.g. by
+``rsyslogd``).
+
+ZFS Kernel Module Debug Messages
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ZFS kernel modules use an internal log buffer for detailed logging
+information. This log information is available in the pseudo file
+``/proc/spl/kstat/zfs/dbgmsg`` for ZFS builds where the ZFS module
+parameter `zfs_dbgmsg_enable =
+1 `__ is set.
+
+--------------
+
+Unkillable Process
+------------------
+
+Symptom: a ``zfs`` or ``zpool`` command appears hung, does not return,
+and is not killable
+
+Likely cause: kernel thread hung or panic
+
+Log files of interest: `Generic Kernel Log <#generic-kernel-log>`__,
+`ZFS Kernel Module Debug Messages <#zfs-kernel-module-debug-messages>`__
+
+Important information: if a kernel thread is stuck, then a backtrace of
+the stuck thread may appear in the logs. In some cases, the stuck thread
+is not logged until the deadman timer expires. See also `debug
+tunables `__.
+
+--------------
+
+ZFS Events
+----------
+
+ZFS uses an event-based messaging interface for communication of
+important events to other consumers running on the system. The ZFS Event
+Daemon (zed) is a userland daemon that listens for these events and
+processes them. zed is extensible so you can write shell scripts or
+other programs that subscribe to events and take action. For example,
+the script usually installed at ``/etc/zfs/zed.d/all-syslog.sh`` writes
+a formatted event message to ``syslog``. See the man page for ``zed(8)``
+for more information.
+
+A history of events is also available via the ``zpool events`` command.
+This history begins at ZFS kernel module load and includes events from
+any pool. These events are stored in RAM and limited in count to a value
+determined by the kernel tunable
+`zfs_event_len_max `__.
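+
+As a quick check (a minimal sketch; the sysfs path assumes the ``zfs``
+module is loaded and exposes its parameters under
+``/sys/module/zfs/parameters``), you can read the current event limit
+and follow new events as they arrive:
+
+::
+
+   Show the current event count limit:
+   cat /sys/module/zfs/parameters/zfs_event_len_max
+
+   Follow new events as they are generated:
+   zpool events -f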
+
+``zed`` has an internal throttling mechanism to prevent overconsumption
+of system resources while processing ZFS events.
+
+More detailed information about events is observable using
+``zpool events -v``. The contents of the verbose events are subject to
+change, based on the event and information available at the time of the
+event.
+
+Each event has a class identifier used for filtering event types.
+Commonly seen events are those related to pool management with class
+``sysevent.fs.zfs.*`` including import, export, configuration updates,
+and ``zpool history`` updates.
+
+Events related to errors are reported as class ``ereport.*``. These can
+be invaluable for troubleshooting. Some faults can cause multiple
+ereports as various layers of the software deal with the fault. For
+example, on a simple pool without parity protection, a faulty disk could
+cause an ``ereport.io`` during a read from the disk that results in an
+``ereport.fs.zfs.checksum`` at the pool level. These events are also
+reflected by the error counters observed in ``zpool status``. If you see
+checksum or read/write errors in ``zpool status``, then there should be
+one or more corresponding ereports in the ``zpool events`` output.
+
+.. _draft-1:
+
+DRAFT
+=====
diff --git a/docs/Ubuntu-16.04-Root-on-ZFS.rst b/docs/Ubuntu-16.04-Root-on-ZFS.rst
new file mode 100644
index 0000000..29a39f2
--- /dev/null
+++ b/docs/Ubuntu-16.04-Root-on-ZFS.rst
@@ -0,0 +1,921 @@
+Newer release available
+~~~~~~~~~~~~~~~~~~~~~~~
+
+- See [[Ubuntu 18.04 Root on ZFS]] for new installs.
+
+Caution
+~~~~~~~
+
+- This HOWTO uses a whole physical disk.
+- Do not use these instructions for dual-booting.
+- Backup your data. Any existing data will be lost.
+
+System Requirements
+~~~~~~~~~~~~~~~~~~~
+
+- `64-bit Ubuntu 16.04.5 ("Xenial") Desktop
+  CD `__
+  (*not* the server image)
+- `A 64-bit kernel is strongly
+  encouraged. `__
+- A drive which presents 512B logical sectors. Installing on a drive
+  which presents 4KiB logical sectors (a “4Kn” drive) should work with
+  UEFI partitioning, but this has not been tested.
+
+Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of
+memory is recommended for normal performance in basic workloads. If you
+wish to use deduplication, you will need `massive amounts of
+RAM `__. Enabling
+deduplication is a permanent change that cannot be easily reverted.
+
+Support
+-------
+
+If you need help, reach out to the community using the `zfs-discuss
+mailing list `__
+or IRC at #zfsonlinux on `freenode `__. If you
+have a bug report or feature request related to this HOWTO, please `file
+a new issue `__ and
+mention @rlaager.
+
+Encryption
+----------
+
+This guide supports the three different Ubuntu encryption options:
+unencrypted, LUKS (full-disk encryption), and eCryptfs (home directory
+encryption).
+
+Unencrypted does not encrypt anything, of course. All ZFS features are
+fully available. With no encryption happening, this option naturally has
+the best performance.
+
+LUKS encrypts almost everything: the OS, swap, home directories, and
+anything else. The only unencrypted data is the bootloader, kernel, and
+initrd. The system cannot boot without the passphrase being entered at
+the console. All ZFS features are fully available. Performance is good,
+but LUKS sits underneath ZFS, so if multiple disks (mirror or raidz
+configurations) are used, the data has to be encrypted once per disk.
+
+eCryptfs protects the contents of the specified home directories.
This +guide also recommends encrypted swap when using eCryptfs. Other +operating system directories, which may contain sensitive data, logs, +and/or configuration information, are not encrypted. ZFS compression is +useless on the encrypted home directories. ZFS snapshots are not +automatically and transparently mounted when using eCryptfs, and +manually mounting them requires serious knowledge of eCryptfs +administrative commands. eCryptfs sits above ZFS, so the encryption only +happens once, regardless of the number of disks in the pool. The +performance of eCryptfs may be lower than LUKS in single-disk scenarios. + +If you want encryption, LUKS is recommended. + +Step 1: Prepare The Install Environment +--------------------------------------- + +1.1 Boot the Ubuntu Live CD. Select Try Ubuntu. Connect your system to +the Internet as appropriate (e.g. join your WiFi network). Open a +terminal (press Ctrl-Alt-T). + +1.2 Setup and update the repositories: + +:: + + $ sudo apt-add-repository universe + $ sudo apt update + +1.3 Optional: Start the OpenSSH server in the Live CD environment: + +If you have a second system, using SSH to access the target system can +be convenient. + +:: + + $ passwd + There is no current password; hit enter at that prompt. + $ sudo apt --yes install openssh-server + +**Hint:** You can find your IP address with +``ip addr show scope global | grep inet``. Then, from your main machine, +connect with ``ssh ubuntu@IP``. + +1.4 Become root: + +:: + + $ sudo -i + +1.5 Install ZFS in the Live CD environment: + +:: + + # apt install --yes debootstrap gdisk zfs-initramfs + +**Note:** You can ignore the two error lines about "AppStream". They are +harmless. + +Step 2: Disk Formatting +----------------------- + +2.1 If you are re-using a disk, clear it as necessary: + +:: + + If the disk was previously used in an MD array, zero the superblock: + # apt install --yes mdadm + # mdadm --zero-superblock --force /dev/disk/by-id/scsi-SATA_disk1 + + Clear the partition table: + # sgdisk --zap-all /dev/disk/by-id/scsi-SATA_disk1 + +2.2 Partition your disk: + +:: + + Run this if you need legacy (BIOS) booting: + # sgdisk -a1 -n2:34:2047 -t2:EF02 /dev/disk/by-id/scsi-SATA_disk1 + + Run this for UEFI booting (for use now or in the future): + # sgdisk -n3:1M:+512M -t3:EF00 /dev/disk/by-id/scsi-SATA_disk1 + +Choose one of the following options: + +2.2a Unencrypted or eCryptfs: + +:: + + # sgdisk -n1:0:0 -t1:BF01 /dev/disk/by-id/scsi-SATA_disk1 + +2.2b LUKS: + +:: + + # sgdisk -n4:0:+512M -t4:8300 /dev/disk/by-id/scsi-SATA_disk1 + # sgdisk -n1:0:0 -t1:8300 /dev/disk/by-id/scsi-SATA_disk1 + +Always use the long ``/dev/disk/by-id/*`` aliases with ZFS. Using the +``/dev/sd*`` device nodes directly can cause sporadic import failures, +especially on systems that have more than one storage pool. + +**Hints:** + +- ``ls -la /dev/disk/by-id`` will list the aliases. +- Are you doing this in a virtual machine? If your virtual disk is + missing from ``/dev/disk/by-id``, use ``/dev/vda`` if you are using + KVM with virtio; otherwise, read the + `troubleshooting `__ + section. 
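+
+Optional: before creating the pool, you can confirm that the partition
+layout matches what you expect (a quick sanity check only; the disk path
+shown is the example disk used throughout this guide):
+
+::
+
+   # sgdisk --print /dev/disk/by-id/scsi-SATA_disk1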
+ +2.3 Create the root pool: + +Choose one of the following options: + +2.3a Unencrypted or eCryptfs: + +:: + + # zpool create -o ashift=12 \ + -O atime=off -O canmount=off -O compression=lz4 -O normalization=formD \ + -O mountpoint=/ -R /mnt \ + rpool /dev/disk/by-id/scsi-SATA_disk1-part1 + +2.3b LUKS: + +:: + + # cryptsetup luksFormat -c aes-xts-plain64 -s 256 -h sha256 \ + /dev/disk/by-id/scsi-SATA_disk1-part1 + # cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part1 luks1 + # zpool create -o ashift=12 \ + -O atime=off -O canmount=off -O compression=lz4 -O normalization=formD \ + -O mountpoint=/ -R /mnt \ + rpool /dev/mapper/luks1 + +**Notes:** + +- The use of ``ashift=12`` is recommended here because many drives + today have 4KiB (or larger) physical sectors, even though they + present 512B logical sectors. Also, a future replacement drive may + have 4KiB physical sectors (in which case ``ashift=12`` is desirable) + or 4KiB logical sectors (in which case ``ashift=12`` is required). +- Setting ``normalization=formD`` eliminates some corner cases relating + to UTF-8 filename normalization. It also implies ``utf8only=on``, + which means that only UTF-8 filenames are allowed. If you care to + support non-UTF-8 filenames, do not use this option. For a discussion + of why requiring UTF-8 filenames may be a bad idea, see `The problems + with enforced UTF-8 only + filenames `__. +- Make sure to include the ``-part1`` portion of the drive path. If you + forget that, you are specifying the whole disk, which ZFS will then + re-partition, and you will lose the bootloader partition(s). +- For LUKS, the key size chosen is 256 bits. However, XTS mode requires + two keys, so the LUKS key is split in half. Thus, ``-s 256`` means + AES-128, which is the LUKS and Ubuntu default. +- Your passphrase will likely be the weakest link. Choose wisely. See + `section 5 of the cryptsetup + FAQ `__ + for guidance. + +**Hints:** + +- The root pool does not have to be a single disk; it can have a mirror + or raidz topology. In that case, repeat the partitioning commands for + all the disks which will be part of the pool. Then, create the pool + using + ``zpool create ... rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part1 /dev/disk/by-id/scsi-SATA_disk2-part1`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). +- The pool name is arbitrary. On systems that can automatically install + to ZFS, the root pool is named ``rpool`` by default. If you work with + multiple systems, it might be wise to use ``hostname``, + ``hostname0``, or ``hostname-1`` instead. + +Step 3: System Installation +--------------------------- + +3.1 Create a filesystem dataset to act as a container: + +:: + + # zfs create -o canmount=off -o mountpoint=none rpool/ROOT + +On Solaris systems, the root filesystem is cloned and the suffix is +incremented for major system changes through ``pkg image-update`` or +``beadm``. Similar functionality for APT is possible but currently +unimplemented. Even without such a tool, it can still be used for +manually created clones. + +3.2 Create a filesystem dataset for the root filesystem of the Ubuntu +system: + +:: + + # zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/ubuntu + # zfs mount rpool/ROOT/ubuntu + +With ZFS, it is not normally necessary to use a mount command (either +``mount`` or ``zfs mount``). This situation is an exception because of +``canmount=noauto``. 
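+
+If you want to double-check the result (optional; shown only as a
+verification step), confirm that the dataset is mounted under the ``-R
+/mnt`` altroot:
+
+::
+
+   # zfs get canmount,mountpoint,mounted rpool/ROOT/ubuntu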
+ +3.3 Create datasets: + +:: + + # zfs create -o setuid=off rpool/home + # zfs create -o mountpoint=/root rpool/home/root + # zfs create -o canmount=off -o setuid=off -o exec=off rpool/var + # zfs create -o com.sun:auto-snapshot=false rpool/var/cache + # zfs create rpool/var/log + # zfs create rpool/var/spool + # zfs create -o com.sun:auto-snapshot=false -o exec=on rpool/var/tmp + + If you use /srv on this system: + # zfs create rpool/srv + + If this system will have games installed: + # zfs create rpool/var/games + + If this system will store local email in /var/mail: + # zfs create rpool/var/mail + + If this system will use NFS (locking): + # zfs create -o com.sun:auto-snapshot=false \ + -o mountpoint=/var/lib/nfs rpool/var/nfs + +The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data such as logs (in ``/var/log``). This will be especially +important if/when a ``beadm`` or similar utility is integrated. Since we +are creating multiple datasets anyway, it is trivial to add some +restrictions (for extra security) at the same time. The +``com.sun.auto-snapshot`` setting is used by some ZFS snapshot utilities +to exclude transient data. + +3.4 For LUKS installs only: + +:: + + # mke2fs -t ext2 /dev/disk/by-id/scsi-SATA_disk1-part4 + # mkdir /mnt/boot + # mount /dev/disk/by-id/scsi-SATA_disk1-part4 /mnt/boot + +3.5 Install the minimal system: + +:: + + # chmod 1777 /mnt/var/tmp + # debootstrap xenial /mnt + # zfs set devices=off rpool + +The ``debootstrap`` command leaves the new system in an unconfigured +state. An alternative to using ``debootstrap`` is to copy the entirety +of a working system into the new ZFS root. + +Step 4: System Configuration +---------------------------- + +4.1 Configure the hostname (change ``HOSTNAME`` to the desired +hostname). + +:: + + # echo HOSTNAME > /mnt/etc/hostname + + # vi /mnt/etc/hosts + Add a line: + 127.0.1.1 HOSTNAME + or if the system has a real name in DNS: + 127.0.1.1 FQDN HOSTNAME + +**Hint:** Use ``nano`` if you find ``vi`` confusing. + +4.2 Configure the network interface: + +:: + + Find the interface name: + # ip addr show + + # vi /mnt/etc/network/interfaces.d/NAME + auto NAME + iface NAME inet dhcp + +Customize this file if the system is not a DHCP client. + +4.3 Configure the package sources: + +:: + + # vi /mnt/etc/apt/sources.list + deb http://archive.ubuntu.com/ubuntu xenial main universe + deb-src http://archive.ubuntu.com/ubuntu xenial main universe + + deb http://security.ubuntu.com/ubuntu xenial-security main universe + deb-src http://security.ubuntu.com/ubuntu xenial-security main universe + + deb http://archive.ubuntu.com/ubuntu xenial-updates main universe + deb-src http://archive.ubuntu.com/ubuntu xenial-updates main universe + +4.4 Bind the virtual filesystems from the LiveCD environment to the new +system and ``chroot`` into it: + +:: + + # mount --rbind /dev /mnt/dev + # mount --rbind /proc /mnt/proc + # mount --rbind /sys /mnt/sys + # chroot /mnt /bin/bash --login + +**Note:** This is using ``--rbind``, not ``--bind``. + +4.5 Configure a basic system environment: + +:: + + # locale-gen en_US.UTF-8 + +Even if you prefer a non-English system language, always ensure that +``en_US.UTF-8`` is available. 
+ +:: + + # echo LANG=en_US.UTF-8 > /etc/default/locale + + # dpkg-reconfigure tzdata + + # ln -s /proc/self/mounts /etc/mtab + # apt update + # apt install --yes ubuntu-minimal + + If you prefer nano over vi, install it: + # apt install --yes nano + +4.6 Install ZFS in the chroot environment for the new system: + +:: + + # apt install --yes --no-install-recommends linux-image-generic + # apt install --yes zfs-initramfs + +4.7 For LUKS installs only: + +:: + + # echo UUID=$(blkid -s UUID -o value \ + /dev/disk/by-id/scsi-SATA_disk1-part4) \ + /boot ext2 defaults 0 2 >> /etc/fstab + + # apt install --yes cryptsetup + + # echo luks1 UUID=$(blkid -s UUID -o value \ + /dev/disk/by-id/scsi-SATA_disk1-part1) none \ + luks,discard,initramfs > /etc/crypttab + + # vi /etc/udev/rules.d/99-local-crypt.rules + ENV{DM_NAME}!="", SYMLINK+="$env{DM_NAME}" + ENV{DM_NAME}!="", SYMLINK+="dm-name-$env{DM_NAME}" + + # ln -s /dev/mapper/luks1 /dev/luks1 + +**Notes:** + +- The use of ``initramfs`` is a work-around for `cryptsetup does not + support + ZFS `__. +- The 99-local-crypt.rules file and symlink in /dev are a work-around + for `grub-probe assuming all devices are in + /dev `__. + +4.8 Install GRUB + +Choose one of the following options: + +4.8a Install GRUB for legacy (MBR) booting + +:: + + # apt install --yes grub-pc + +Install GRUB to the disk(s), not the partition(s). + +4.8b Install GRUB for UEFI booting + +:: + + # apt install dosfstools + # mkdosfs -F 32 -n EFI /dev/disk/by-id/scsi-SATA_disk1-part3 + # mkdir /boot/efi + # echo PARTUUID=$(blkid -s PARTUUID -o value \ + /dev/disk/by-id/scsi-SATA_disk1-part3) \ + /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab + # mount /boot/efi + # apt install --yes grub-efi-amd64 + +4.9 Setup system groups: + +:: + + # addgroup --system lpadmin + # addgroup --system sambashare + +4.10 Set a root password + +:: + + # passwd + +4.11 Fix filesystem mount ordering + +`Until ZFS gains a systemd mount +generator `__, there are +races between mounting filesystems and starting certain daemons. In +practice, the issues (e.g. +`#5754 `__) seem to be +with certain filesystems in ``/var``, specifically ``/var/log`` and +``/var/tmp``. Setting these to use ``legacy`` mounting, and listing them +in ``/etc/fstab`` makes systemd aware that these are separate +mountpoints. In turn, ``rsyslog.service`` depends on ``var-log.mount`` +by way of ``local-fs.target`` and services using the ``PrivateTmp`` +feature of systemd automatically use ``After=var-tmp.mount``. + +:: + + # zfs set mountpoint=legacy rpool/var/log + # zfs set mountpoint=legacy rpool/var/tmp + # cat >> /etc/fstab << EOF + rpool/var/log /var/log zfs defaults 0 0 + rpool/var/tmp /var/tmp zfs defaults 0 0 + EOF + +Step 5: GRUB Installation +------------------------- + +5.1 Verify that the ZFS root filesystem is recognized: + +:: + + # grub-probe / + zfs + +**Note:** GRUB uses ``zpool status`` in order to determine the location +of devices. `grub-probe assumes all devices are in +/dev `__. +The ``zfs-initramfs`` package `ships udev rules that create +symlinks `__ +to `work around the +problem `__, +but `there have still been reports of +problems `__. 
+If this happens, you will get an error saying +``grub-probe: error: failed to get canonical path`` and should run the +following: + +:: + + # export ZPOOL_VDEV_NAME_PATH=YES + +5.2 Refresh the initrd files: + +:: + + # update-initramfs -c -k all + update-initramfs: Generating /boot/initrd.img-4.4.0-21-generic + +**Note:** When using LUKS, this will print "WARNING could not determine +root device from /etc/fstab". This is because `cryptsetup does not +support +ZFS `__. + +5.3 Optional (but highly recommended): Make debugging GRUB easier: + +:: + + # vi /etc/default/grub + Comment out: GRUB_HIDDEN_TIMEOUT=0 + Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT + Uncomment: GRUB_TERMINAL=console + Save and quit. + +Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired. + +5.4 Update the boot configuration: + +:: + + # update-grub + Generating grub configuration file ... + Found linux image: /boot/vmlinuz-4.4.0-21-generic + Found initrd image: /boot/initrd.img-4.4.0-21-generic + done + +5.5 Install the boot loader + +5.5a For legacy (MBR) booting, install GRUB to the MBR: + +:: + + # grub-install /dev/disk/by-id/scsi-SATA_disk1 + Installing for i386-pc platform. + Installation finished. No error reported. + +Do not reboot the computer until you get exactly that result message. +Note that you are installing GRUB to the whole disk, not a partition. + +If you are creating a mirror, repeat the grub-install command for each +disk in the pool. + +5.5b For UEFI booting, install GRUB: + +:: + + # grub-install --target=x86_64-efi --efi-directory=/boot/efi \ + --bootloader-id=ubuntu --recheck --no-floppy + +5.6 Verify that the ZFS module is installed: + +:: + + # ls /boot/grub/*/zfs.mod + +Step 6: First Boot +------------------ + +6.1 Snapshot the initial installation: + +:: + + # zfs snapshot rpool/ROOT/ubuntu@install + +In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space. + +6.2 Exit from the ``chroot`` environment back to the LiveCD environment: + +:: + + # exit + +6.3 Run these commands in the LiveCD environment to unmount all +filesystems: + +:: + + # mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + # zpool export rpool + +6.4 Reboot: + +:: + + # reboot + +6.5 Wait for the newly installed system to boot normally. Login as root. + +6.6 Create a user account: + +Choose one of the following options: + +6.6a Unencrypted or LUKS: + +:: + + # zfs create rpool/home/YOURUSERNAME + # adduser YOURUSERNAME + # cp -a /etc/skel/.[!.]* /home/YOURUSERNAME + # chown -R YOURUSERNAME:YOURUSERNAME /home/YOURUSERNAME + +6.6b eCryptfs: + +:: + + # apt install ecryptfs-utils + + # zfs create -o compression=off -o mountpoint=/home/.ecryptfs/YOURUSERNAME \ + rpool/home/temp-YOURUSERNAME + # adduser --encrypt-home YOURUSERNAME + # zfs rename rpool/home/temp-YOURUSERNAME rpool/home/YOURUSERNAME + +The temporary name for the dataset is required to work-around `a bug in +ecryptfs-setup-private `__. +Otherwise, it will fail with an error saying the home directory is +already mounted; that check is not specific enough in the pattern it +uses. + +**Note:** Automatically mounted snapshots (i.e. the ``.zfs/snapshots`` +directory) will not work through eCryptfs. You can do another eCryptfs +mount manually if you need to access files in a snapshot. 
A script to +automate the mounting should be possible, but has not yet been +implemented. + +6.7 Add your user account to the default set of groups for an +administrator: + +:: + + # usermod -a -G adm,cdrom,dip,lpadmin,plugdev,sambashare,sudo YOURUSERNAME + +6.8 Mirror GRUB + +If you installed to multiple disks, install GRUB on the additional +disks: + +6.8a For legacy (MBR) booting: + +:: + + # dpkg-reconfigure grub-pc + Hit enter until you get to the device selection screen. + Select (using the space bar) all of the disks (not partitions) in your pool. + +6.8b UEFI + +:: + + # umount /boot/efi + + For the second and subsequent disks (increment ubuntu-2 to -3, etc.): + # dd if=/dev/disk/by-id/scsi-SATA_disk1-part3 \ + of=/dev/disk/by-id/scsi-SATA_disk2-part3 + # efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \ + -p 3 -L "ubuntu-2" -l '\EFI\Ubuntu\grubx64.efi' + + # mount /boot/efi + +Step 7: Configure Swap +---------------------- + +7.1 Create a volume dataset (zvol) for use as a swap device: + +:: + + # zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \ + -o logbias=throughput -o sync=always \ + -o primarycache=metadata -o secondarycache=none \ + -o com.sun:auto-snapshot=false rpool/swap + +You can adjust the size (the ``4G`` part) to your needs. + +The compression algorithm is set to ``zle`` because it is the cheapest +available algorithm. As this guide recommends ``ashift=12`` (4 kiB +blocks on disk), the common case of a 4 kiB page size means that no +compression algorithm can reduce I/O. The exception is all-zero pages, +which are dropped by ZFS; but some form of compression has to be enabled +to get this behavior. + +7.2 Configure the swap device: + +Choose one of the following options: + +7.2a Unencrypted or LUKS: + +**Caution**: Always use long ``/dev/zvol`` aliases in configuration +files. Never use a short ``/dev/zdX`` device name. + +:: + + # mkswap -f /dev/zvol/rpool/swap + # echo /dev/zvol/rpool/swap none swap defaults 0 0 >> /etc/fstab + +7.2b eCryptfs: + +:: + + # apt install cryptsetup + # echo cryptswap1 /dev/zvol/rpool/swap /dev/urandom \ + swap,cipher=aes-xts-plain64:sha256,size=256 >> /etc/crypttab + # systemctl daemon-reload + # systemctl start systemd-cryptsetup@cryptswap1.service + # echo /dev/mapper/cryptswap1 none swap defaults 0 0 >> /etc/fstab + +7.3 Enable the swap device: + +:: + + # swapon -av + +Step 8: Full Software Installation +---------------------------------- + +8.1 Upgrade the minimal system: + +:: + + # apt dist-upgrade --yes + +8.2 Install a regular set of software: + +Choose one of the following options: + +8.2a Install a command-line environment only: + +:: + + # apt install --yes ubuntu-standard + +8.2b Install a full GUI environment: + +:: + + # apt install --yes ubuntu-desktop + +**Hint**: If you are installing a full GUI environment, you will likely +want to manage your network with NetworkManager. In that case, +``rm /etc/network/interfaces.d/eth0``. + +8.3 Optional: Disable log compression: + +As ``/var/log`` is already compressed by ZFS, logrotate’s compression is +going to burn CPU and disk I/O for (in most cases) very little gain. +Also, if you are making snapshots of ``/var/log``, logrotate’s +compression will actually waste space, as the uncompressed data will +live on in the snapshot. 
You can edit the files in ``/etc/logrotate.d`` +by hand to comment out ``compress``, or use this loop (copy-and-paste +highly recommended): + +:: + + # for file in /etc/logrotate.d/* ; do + if grep -Eq "(^|[^#y])compress" "$file" ; then + sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file" + fi + done + +8.4 Reboot: + +:: + + # reboot + +Step 9: Final Cleanup +~~~~~~~~~~~~~~~~~~~~~ + +9.1 Wait for the system to boot normally. Login using the account you +created. Ensure the system (including networking) works normally. + +9.2 Optional: Delete the snapshot of the initial installation: + +:: + + $ sudo zfs destroy rpool/ROOT/ubuntu@install + +9.3 Optional: Disable the root password + +:: + + $ sudo usermod -p '*' root + +9.4 Optional: + +If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer. + +:: + + $ sudo vi /etc/default/grub + Uncomment GRUB_HIDDEN_TIMEOUT=0 + Add quiet and splash to GRUB_CMDLINE_LINUX_DEFAULT + Comment out GRUB_TERMINAL=console + Save and quit. + + $ sudo update-grub + +Troubleshooting +--------------- + +Rescuing using a Live CD +~~~~~~~~~~~~~~~~~~~~~~~~ + +Boot the Live CD and open a terminal. + +Become root and install the ZFS utilities: + +:: + + $ sudo -i + # apt update + # apt install --yes zfsutils-linux + +This will automatically import your pool. Export it and re-import it to +get the mounts right: + +:: + + # zpool export -a + # zpool import -N -R /mnt rpool + # zfs mount rpool/ROOT/ubuntu + # zfs mount -a + +If needed, you can chroot into your installed environment: + +:: + + # mount --rbind /dev /mnt/dev + # mount --rbind /proc /mnt/proc + # mount --rbind /sys /mnt/sys + # chroot /mnt /bin/bash --login + +Do whatever you need to do to fix your system. + +When done, cleanup: + +:: + + # mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + # zpool export rpool + # reboot + +MPT2SAS +~~~~~~~ + +Most problem reports for this tutorial involve ``mpt2sas`` hardware that +does slow asynchronous drive initialization, like some IBM M1015 or +OEM-branded cards that have been flashed to the reference LSI firmware. + +The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +`https://github.com/zfsonlinux/zfs/issues/330 `__. + +Most LSI cards are perfectly compatible with ZoL. If your card has this +glitch, try setting rootdelay=X in GRUB_CMDLINE_LINUX. The system will +wait up to X seconds for all drives to appear before importing the pool. + +Areca +~~~~~ + +Systems that require the ``arcsas`` blob driver should add it to the +``/etc/initramfs-tools/modules`` file and run +``update-initramfs -c -k all``. + +Upgrade or downgrade the Areca driver if something like +``RIP: 0010:[] [] native_read_tsc+0x6/0x20`` +appears anywhere in kernel log. ZoL is unstable on systems that emit +this error message. + +VMware +~~~~~~ + +- Set ``disk.EnableUUID = "TRUE"`` in the vmx file or vsphere + configuration. Doing this ensures that ``/dev/disk`` aliases are + created in the guest. + +QEMU/KVM/XEN +~~~~~~~~~~~~ + +Set a unique serial number on each virtual disk using libvirt or qemu +(e.g. ``-drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890``). 
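+
+If the guest is managed by libvirt, the serial can instead be set in the
+domain XML (a sketch; ``GUESTNAME`` and the serial value are only
+placeholders):
+
+::
+
+   $ sudo virsh edit GUESTNAME
+   Inside each <disk> element, add a serial element, for example:
+   <serial>1234567890</serial>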
+ +To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host: + +:: + + $ sudo apt install ovmf + $ sudo vi /etc/libvirt/qemu.conf + Uncomment these lines: + nvram = [ + "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd", + "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd" + ] + $ sudo service libvirt-bin restart diff --git a/docs/Ubuntu-18.04-Root-on-ZFS.rst b/docs/Ubuntu-18.04-Root-on-ZFS.rst new file mode 100644 index 0000000..bde9b21 --- /dev/null +++ b/docs/Ubuntu-18.04-Root-on-ZFS.rst @@ -0,0 +1,1133 @@ +Caution +~~~~~~~ + +- This HOWTO uses a whole physical disk. +- Do not use these instructions for dual-booting. +- Backup your data. Any existing data will be lost. + +System Requirements +~~~~~~~~~~~~~~~~~~~ + +- `Ubuntu 18.04.3 ("Bionic") Desktop + CD `__ + (*not* any server images) +- Installing on a drive which presents 4KiB logical sectors (a “4Kn” + drive) only works with UEFI booting. This not unique to ZFS. `GRUB + does not and will not work on 4Kn with legacy (BIOS) + booting. `__ + +Computers that have less than 2 GiB of memory run ZFS slowly. 4 GiB of +memory is recommended for normal performance in basic workloads. If you +wish to use deduplication, you will need `massive amounts of +RAM `__. Enabling +deduplication is a permanent change that cannot be easily reverted. + +Support +------- + +If you need help, reach out to the community using the `zfs-discuss +mailing list `__ +or IRC at #zfsonlinux on `freenode `__. If you +have a bug report or feature request related to this HOWTO, please `file +a new issue `__ and +mention @rlaager. + +Contributing +------------ + +Edit permission on this wiki is restricted. Also, GitHub wikis do not +support pull requests. However, you can clone the wiki using git. + +1) ``git clone https://github.com/zfsonlinux/zfs.wiki.git`` +2) Make your changes. +3) Use ``git diff > my-changes.patch`` to create a patch. (Advanced git + users may wish to ``git commit`` to a branch and + ``git format-patch``.) +4) `File a new issue `__, + mention @rlaager, and attach the patch. + +Encryption +---------- + +This guide supports two different encryption options: unencrypted and +LUKS (full-disk encryption). ZFS native encryption has not yet been +released. With either option, all ZFS features are fully available. + +Unencrypted does not encrypt anything, of course. With no encryption +happening, this option naturally has the best performance. + +LUKS encrypts almost everything: the OS, swap, home directories, and +anything else. The only unencrypted data is the bootloader, kernel, and +initrd. The system cannot boot without the passphrase being entered at +the console. Performance is good, but LUKS sits underneath ZFS, so if +multiple disks (mirror or raidz topologies) are used, the data has to be +encrypted once per disk. + +Step 1: Prepare The Install Environment +--------------------------------------- + +1.1 Boot the Ubuntu Live CD. Select Try Ubuntu. Connect your system to +the Internet as appropriate (e.g. join your WiFi network). Open a +terminal (press Ctrl-Alt-T). + +1.2 Setup and update the repositories: + +:: + + sudo apt-add-repository universe + sudo apt update + +1.3 Optional: Install and start the OpenSSH server in the Live CD +environment: + +If you have a second system, using SSH to access the target system can +be convenient. + +:: + + passwd + There is no current password; hit enter at that prompt. 
+ sudo apt install --yes openssh-server + +**Hint:** You can find your IP address with +``ip addr show scope global | grep inet``. Then, from your main machine, +connect with ``ssh ubuntu@IP``. + +1.4 Become root: + +:: + + sudo -i + +1.5 Install ZFS in the Live CD environment: + +:: + + apt install --yes debootstrap gdisk zfs-initramfs + +Step 2: Disk Formatting +----------------------- + +2.1 Set a variable with the disk name: + +:: + + DISK=/dev/disk/by-id/scsi-SATA_disk1 + +Always use the long ``/dev/disk/by-id/*`` aliases with ZFS. Using the +``/dev/sd*`` device nodes directly can cause sporadic import failures, +especially on systems that have more than one storage pool. + +**Hints:** + +- ``ls -la /dev/disk/by-id`` will list the aliases. +- Are you doing this in a virtual machine? If your virtual disk is + missing from ``/dev/disk/by-id``, use ``/dev/vda`` if you are using + KVM with virtio; otherwise, read the + `troubleshooting <#troubleshooting>`__ section. + +2.2 If you are re-using a disk, clear it as necessary: + +If the disk was previously used in an MD array, zero the superblock: + +:: + + apt install --yes mdadm + mdadm --zero-superblock --force $DISK + +Clear the partition table: + +:: + + sgdisk --zap-all $DISK + +2.3 Partition your disk(s): + +Run this if you need legacy (BIOS) booting: + +:: + + sgdisk -a1 -n1:24K:+1000K -t1:EF02 $DISK + +Run this for UEFI booting (for use now or in the future): + +:: + + sgdisk -n2:1M:+512M -t2:EF00 $DISK + +Run this for the boot pool: + +:: + + sgdisk -n3:0:+1G -t3:BF01 $DISK + +Choose one of the following options: + +2.3a Unencrypted: + +:: + + sgdisk -n4:0:0 -t4:BF01 $DISK + +2.3b LUKS: + +:: + + sgdisk -n4:0:0 -t4:8300 $DISK + +If you are creating a mirror or raidz topology, repeat the partitioning +commands for all the disks which will be part of the pool. + +2.4 Create the boot pool: + +:: + + zpool create -o ashift=12 -d \ + -o feature@async_destroy=enabled \ + -o feature@bookmarks=enabled \ + -o feature@embedded_data=enabled \ + -o feature@empty_bpobj=enabled \ + -o feature@enabled_txg=enabled \ + -o feature@extensible_dataset=enabled \ + -o feature@filesystem_limits=enabled \ + -o feature@hole_birth=enabled \ + -o feature@large_blocks=enabled \ + -o feature@lz4_compress=enabled \ + -o feature@spacemap_histogram=enabled \ + -o feature@userobj_accounting=enabled \ + -O acltype=posixacl -O canmount=off -O compression=lz4 -O devices=off \ + -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt bpool ${DISK}-part3 + +You should not need to customize any of the options for the boot pool. + +GRUB does not support all of the zpool features. See +``spa_feature_names`` in +`grub-core/fs/zfs/zfs.c `__. +This step creates a separate boot pool for ``/boot`` with the features +limited to only those that GRUB supports, allowing the root pool to use +any/all features. Note that GRUB opens the pool read-only, so all +read-only compatible features are "supported" by GRUB. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... bpool mirror /dev/disk/by-id/scsi-SATA_disk1-part3 /dev/disk/by-id/scsi-SATA_disk2-part3`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). +- The pool name is arbitrary. If changed, the new name must be used + consistently. The ``bpool`` convention originated in this HOWTO. 
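+
+Optional: if you want to review which features ended up enabled on the
+boot pool (purely a verification step; it can be skipped), list them
+with:
+
+::
+
+   zpool get all bpool | grep feature@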
+ +2.5 Create the root pool: + +Choose one of the following options: + +2.5a Unencrypted: + +:: + + zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt rpool ${DISK}-part4 + +2.5b LUKS: + +:: + + cryptsetup luksFormat -c aes-xts-plain64 -s 512 -h sha256 ${DISK}-part4 + cryptsetup luksOpen ${DISK}-part4 luks1 + zpool create -o ashift=12 \ + -O acltype=posixacl -O canmount=off -O compression=lz4 \ + -O dnodesize=auto -O normalization=formD -O relatime=on -O xattr=sa \ + -O mountpoint=/ -R /mnt rpool /dev/mapper/luks1 + +- The use of ``ashift=12`` is recommended here because many drives + today have 4KiB (or larger) physical sectors, even though they + present 512B logical sectors. Also, a future replacement drive may + have 4KiB physical sectors (in which case ``ashift=12`` is desirable) + or 4KiB logical sectors (in which case ``ashift=12`` is required). +- Setting ``-O acltype=posixacl`` enables POSIX ACLs globally. If you + do not want this, remove that option, but later add + ``-o acltype=posixacl`` (note: lowercase "o") to the ``zfs create`` + for ``/var/log``, as `journald requires + ACLs `__ +- Setting ``normalization=formD`` eliminates some corner cases relating + to UTF-8 filename normalization. It also implies ``utf8only=on``, + which means that only UTF-8 filenames are allowed. If you care to + support non-UTF-8 filenames, do not use this option. For a discussion + of why requiring UTF-8 filenames may be a bad idea, see `The problems + with enforced UTF-8 only + filenames `__. +- Setting ``relatime=on`` is a middle ground between classic POSIX + ``atime`` behavior (with its significant performance impact) and + ``atime=off`` (which provides the best performance by completely + disabling atime updates). Since Linux 2.6.30, ``relatime`` has been + the default for other filesystems. See `RedHat's + documentation `__ + for further information. +- Setting ``xattr=sa`` `vastly improves the performance of extended + attributes `__. + Inside ZFS, extended attributes are used to implement POSIX ACLs. + Extended attributes can also be used by user-space applications. + `They are used by some desktop GUI + applications. `__ + `They can be used by Samba to store Windows ACLs and DOS attributes; + they are required for a Samba Active Directory domain + controller. `__ + Note that ```xattr=sa`` is + Linux-specific. `__ + If you move your ``xattr=sa`` pool to another OpenZFS implementation + besides ZFS-on-Linux, extended attributes will not be readable + (though your data will be). If portability of extended attributes is + important to you, omit the ``-O xattr=sa`` above. Even if you do not + want ``xattr=sa`` for the whole pool, it is probably fine to use it + for ``/var/log``. +- Make sure to include the ``-part4`` portion of the drive path. If you + forget that, you are specifying the whole disk, which ZFS will then + re-partition, and you will lose the bootloader partition(s). +- For LUKS, the key size chosen is 512 bits. However, XTS mode requires + two keys, so the LUKS key is split in half. Thus, ``-s 512`` means + AES-256. +- Your passphrase will likely be the weakest link. Choose wisely. See + `section 5 of the cryptsetup + FAQ `__ + for guidance. + +**Hints:** + +- If you are creating a mirror or raidz topology, create the pool using + ``zpool create ... 
rpool mirror /dev/disk/by-id/scsi-SATA_disk1-part4 /dev/disk/by-id/scsi-SATA_disk2-part4`` + (or replace ``mirror`` with ``raidz``, ``raidz2``, or ``raidz3`` and + list the partitions from additional disks). For LUKS, use + ``/dev/mapper/luks1``, ``/dev/mapper/luks2``, etc., which you will + have to create using ``cryptsetup``. +- The pool name is arbitrary. If changed, the new name must be used + consistently. On systems that can automatically install to ZFS, the + root pool is named ``rpool`` by default. + +Step 3: System Installation +--------------------------- + +3.1 Create filesystem datasets to act as containers: + +:: + + zfs create -o canmount=off -o mountpoint=none rpool/ROOT + zfs create -o canmount=off -o mountpoint=none bpool/BOOT + +On Solaris systems, the root filesystem is cloned and the suffix is +incremented for major system changes through ``pkg image-update`` or +``beadm``. Similar functionality for APT is possible but currently +unimplemented. Even without such a tool, it can still be used for +manually created clones. + +3.2 Create filesystem datasets for the root and boot filesystems: + +:: + + zfs create -o canmount=noauto -o mountpoint=/ rpool/ROOT/ubuntu + zfs mount rpool/ROOT/ubuntu + + zfs create -o canmount=noauto -o mountpoint=/boot bpool/BOOT/ubuntu + zfs mount bpool/BOOT/ubuntu + +With ZFS, it is not normally necessary to use a mount command (either +``mount`` or ``zfs mount``). This situation is an exception because of +``canmount=noauto``. + +3.3 Create datasets: + +:: + + zfs create rpool/home + zfs create -o mountpoint=/root rpool/home/root + zfs create -o canmount=off rpool/var + zfs create -o canmount=off rpool/var/lib + zfs create rpool/var/log + zfs create rpool/var/spool + +The datasets below are optional, depending on your preferences and/or +software choices. + +If you wish to exclude these from snapshots: + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/cache + zfs create -o com.sun:auto-snapshot=false rpool/var/tmp + chmod 1777 /mnt/var/tmp + +If you use /opt on this system: + +:: + + zfs create rpool/opt + +If you use /srv on this system: + +:: + + zfs create rpool/srv + +If you use /usr/local on this system: + +:: + + zfs create -o canmount=off rpool/usr + zfs create rpool/usr/local + +If this system will have games installed: + +:: + + zfs create rpool/var/games + +If this system will store local email in /var/mail: + +:: + + zfs create rpool/var/mail + +If this system will use Snap packages: + +:: + + zfs create rpool/var/snap + +If you use /var/www on this system: + +:: + + zfs create rpool/var/www + +If this system will use GNOME: + +:: + + zfs create rpool/var/lib/AccountsService + +If this system will use Docker (which manages its own datasets & +snapshots): + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/lib/docker + +If this system will use NFS (locking): + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/var/lib/nfs + +A tmpfs is recommended later, but if you want a separate dataset for +/tmp: + +:: + + zfs create -o com.sun:auto-snapshot=false rpool/tmp + chmod 1777 /mnt/tmp + +The primary goal of this dataset layout is to separate the OS from user +data. This allows the root filesystem to be rolled back without rolling +back user data such as logs (in ``/var/log``). This will be especially +important if/when a ``beadm`` or similar utility is integrated. The +``com.sun.auto-snapshot`` setting is used by some ZFS snapshot utilities +to exclude transient data. 
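If you would like to review the resulting layout before continuing, the
following optional check lists the datasets created so far along with the
properties discussed above (it assumes the ``bpool`` and ``rpool`` names
used in this guide):

::

    zfs list -o name,canmount,mountpoint -r bpool rpool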
+ +If you do nothing extra, ``/tmp`` will be stored as part of the root +filesystem. Alternatively, you can create a separate dataset for +``/tmp``, as shown above. This keeps the ``/tmp`` data out of snapshots +of your root filesystem. It also allows you to set a quota on +``rpool/tmp``, if you want to limit the maximum space used. Otherwise, +you can use a tmpfs (RAM filesystem) later. + +3.4 Install the minimal system: + +:: + + debootstrap bionic /mnt + zfs set devices=off rpool + +The ``debootstrap`` command leaves the new system in an unconfigured +state. An alternative to using ``debootstrap`` is to copy the entirety +of a working system into the new ZFS root. + +Step 4: System Configuration +---------------------------- + +4.1 Configure the hostname (change ``HOSTNAME`` to the desired +hostname). + +:: + + echo HOSTNAME > /mnt/etc/hostname + + vi /mnt/etc/hosts + Add a line: + 127.0.1.1 HOSTNAME + or if the system has a real name in DNS: + 127.0.1.1 FQDN HOSTNAME + +**Hint:** Use ``nano`` if you find ``vi`` confusing. + +4.2 Configure the network interface: + +Find the interface name: + +:: + + ip addr show + +Adjust NAME below to match your interface name: + +:: + + vi /mnt/etc/netplan/01-netcfg.yaml + network: + version: 2 + ethernets: + NAME: + dhcp4: true + +Customize this file if the system is not a DHCP client. + +4.3 Configure the package sources: + +:: + + vi /mnt/etc/apt/sources.list + deb http://archive.ubuntu.com/ubuntu bionic main universe + deb-src http://archive.ubuntu.com/ubuntu bionic main universe + + deb http://security.ubuntu.com/ubuntu bionic-security main universe + deb-src http://security.ubuntu.com/ubuntu bionic-security main universe + + deb http://archive.ubuntu.com/ubuntu bionic-updates main universe + deb-src http://archive.ubuntu.com/ubuntu bionic-updates main universe + +4.4 Bind the virtual filesystems from the LiveCD environment to the new +system and ``chroot`` into it: + +:: + + mount --rbind /dev /mnt/dev + mount --rbind /proc /mnt/proc + mount --rbind /sys /mnt/sys + chroot /mnt /usr/bin/env DISK=$DISK bash --login + +**Note:** This is using ``--rbind``, not ``--bind``. + +4.5 Configure a basic system environment: + +:: + + ln -s /proc/self/mounts /etc/mtab + apt update + + dpkg-reconfigure locales + +Even if you prefer a non-English system language, always ensure that +``en_US.UTF-8`` is available. + +:: + + dpkg-reconfigure tzdata + +If you prefer nano over vi, install it: + +:: + + apt install --yes nano + +4.6 Install ZFS in the chroot environment for the new system: + +:: + + apt install --yes --no-install-recommends linux-image-generic + apt install --yes zfs-initramfs + +**Hint:** For the HWE kernel, install ``linux-image-generic-hwe-18.04`` +instead of ``linux-image-generic``. + +4.7 For LUKS installs only, setup crypttab: + +:: + + apt install --yes cryptsetup + + echo luks1 UUID=$(blkid -s UUID -o value ${DISK}-part4) none \ + luks,discard,initramfs > /etc/crypttab + +- The use of ``initramfs`` is a work-around for `cryptsetup does not + support + ZFS `__. + +**Hint:** If you are creating a mirror or raidz topology, repeat the +``/etc/crypttab`` entries for ``luks2``, etc. adjusting for each disk. + +4.8 Install GRUB + +Choose one of the following options: + +4.8a Install GRUB for legacy (BIOS) booting + +:: + + apt install --yes grub-pc + +Install GRUB to the disk(s), not the partition(s). 
+ +4.8b Install GRUB for UEFI booting + +:: + + apt install dosfstools + mkdosfs -F 32 -s 1 -n EFI ${DISK}-part2 + mkdir /boot/efi + echo PARTUUID=$(blkid -s PARTUUID -o value ${DISK}-part2) \ + /boot/efi vfat nofail,x-systemd.device-timeout=1 0 1 >> /etc/fstab + mount /boot/efi + apt install --yes grub-efi-amd64-signed shim-signed + +- The ``-s 1`` for ``mkdosfs`` is only necessary for drives which + present 4 KiB logical sectors (“4Kn” drives) to meet the minimum + cluster size (given the partition size of 512 MiB) for FAT32. It also + works fine on drives which present 512 B sectors. + +**Note:** If you are creating a mirror or raidz topology, this step only +installs GRUB on the first disk. The other disk(s) will be handled +later. + +4.9 Set a root password + +:: + + passwd + +4.10 Enable importing bpool + +This ensures that ``bpool`` is always imported, regardless of whether +``/etc/zfs/zpool.cache`` exists, whether it is in the cachefile or not, +or whether ``zfs-import-scan.service`` is enabled. + +:: + + vi /etc/systemd/system/zfs-import-bpool.service + [Unit] + DefaultDependencies=no + Before=zfs-import-scan.service + Before=zfs-import-cache.service + + [Service] + Type=oneshot + RemainAfterExit=yes + ExecStart=/sbin/zpool import -N -o cachefile=none bpool + + [Install] + WantedBy=zfs-import.target + +:: + + systemctl enable zfs-import-bpool.service + +4.11 Optional (but recommended): Mount a tmpfs to /tmp + +If you chose to create a ``/tmp`` dataset above, skip this step, as they +are mutually exclusive choices. Otherwise, you can put ``/tmp`` on a +tmpfs (RAM filesystem) by enabling the ``tmp.mount`` unit. + +:: + + cp /usr/share/systemd/tmp.mount /etc/systemd/system/ + systemctl enable tmp.mount + +4.12 Setup system groups: + +:: + + addgroup --system lpadmin + addgroup --system sambashare + +Step 5: GRUB Installation +------------------------- + +5.1 Verify that the ZFS boot filesystem is recognized: + +:: + + grub-probe /boot + +5.2 Refresh the initrd files: + +:: + + update-initramfs -u -k all + +**Note:** When using LUKS, this will print "WARNING could not determine +root device from /etc/fstab". This is because `cryptsetup does not +support +ZFS `__. + +5.3 Workaround GRUB's missing zpool-features support: + +:: + + vi /etc/default/grub + Set: GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/ubuntu" + +5.4 Optional (but highly recommended): Make debugging GRUB easier: + +:: + + vi /etc/default/grub + Comment out: GRUB_TIMEOUT_STYLE=hidden + Set: GRUB_TIMEOUT=5 + Below GRUB_TIMEOUT, add: GRUB_RECORDFAIL_TIMEOUT=5 + Remove quiet and splash from: GRUB_CMDLINE_LINUX_DEFAULT + Uncomment: GRUB_TERMINAL=console + Save and quit. + +Later, once the system has rebooted twice and you are sure everything is +working, you can undo these changes, if desired. + +5.5 Update the boot configuration: + +:: + + update-grub + +**Note:** Ignore errors from ``osprober``, if present. + +5.6 Install the boot loader + +5.6a For legacy (BIOS) booting, install GRUB to the MBR: + +:: + + grub-install $DISK + +Note that you are installing GRUB to the whole disk, not a partition. + +If you are creating a mirror or raidz topology, repeat the +``grub-install`` command for each disk in the pool. + +5.6b For UEFI booting, install GRUB: + +:: + + grub-install --target=x86_64-efi --efi-directory=/boot/efi \ + --bootloader-id=ubuntu --recheck --no-floppy + +It is not necessary to specify the disk here. If you are creating a +mirror or raidz topology, the additional disks will be handled later. 
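**Hint (UEFI only):** As an optional check that is not part of the
original procedure, you can confirm that ``grub-install`` registered a
firmware boot entry and populated the EFI System Partition:

::

    efibootmgr -v
    ls /boot/efi/EFI

An entry labeled ``ubuntu`` (from ``--bootloader-id=ubuntu``) should
appear in the ``efibootmgr`` output, and a matching ``ubuntu`` directory
should exist on the EFI System Partition.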
+ +5.7 Verify that the ZFS module is installed: + +:: + + ls /boot/grub/*/zfs.mod + +5.8 Fix filesystem mount ordering + +`Until ZFS gains a systemd mount +generator `__, there are +races between mounting filesystems and starting certain daemons. In +practice, the issues (e.g. +`#5754 `__) seem to be +with certain filesystems in ``/var``, specifically ``/var/log`` and +``/var/tmp``. Setting these to use ``legacy`` mounting, and listing them +in ``/etc/fstab`` makes systemd aware that these are separate +mountpoints. In turn, ``rsyslog.service`` depends on ``var-log.mount`` +by way of ``local-fs.target`` and services using the ``PrivateTmp`` +feature of systemd automatically use ``After=var-tmp.mount``. + +Until there is support for mounting ``/boot`` in the initramfs, we also +need to mount that, because it was marked ``canmount=noauto``. Also, +with UEFI, we need to ensure it is mounted before its child filesystem +``/boot/efi``. + +``rpool`` is guaranteed to be imported by the initramfs, so there is no +point in adding ``x-systemd.requires=zfs-import.target`` to those +filesystems. + +For UEFI booting, unmount /boot/efi first: + +:: + + umount /boot/efi + +Everything else applies to both BIOS and UEFI booting: + +:: + + zfs set mountpoint=legacy bpool/BOOT/ubuntu + echo bpool/BOOT/ubuntu /boot zfs \ + nodev,relatime,x-systemd.requires=zfs-import-bpool.service 0 0 >> /etc/fstab + + zfs set mountpoint=legacy rpool/var/log + echo rpool/var/log /var/log zfs nodev,relatime 0 0 >> /etc/fstab + + zfs set mountpoint=legacy rpool/var/spool + echo rpool/var/spool /var/spool zfs nodev,relatime 0 0 >> /etc/fstab + +If you created a /var/tmp dataset: + +:: + + zfs set mountpoint=legacy rpool/var/tmp + echo rpool/var/tmp /var/tmp zfs nodev,relatime 0 0 >> /etc/fstab + +If you created a /tmp dataset: + +:: + + zfs set mountpoint=legacy rpool/tmp + echo rpool/tmp /tmp zfs nodev,relatime 0 0 >> /etc/fstab + +Step 6: First Boot +------------------ + +6.1 Snapshot the initial installation: + +:: + + zfs snapshot bpool/BOOT/ubuntu@install + zfs snapshot rpool/ROOT/ubuntu@install + +In the future, you will likely want to take snapshots before each +upgrade, and remove old snapshots (including this one) at some point to +save space. + +6.2 Exit from the ``chroot`` environment back to the LiveCD environment: + +:: + + exit + +6.3 Run these commands in the LiveCD environment to unmount all +filesystems: + +:: + + mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + zpool export -a + +6.4 Reboot: + +:: + + reboot + +6.5 Wait for the newly installed system to boot normally. Login as root. + +6.6 Create a user account: + +:: + + zfs create rpool/home/YOURUSERNAME + adduser YOURUSERNAME + cp -a /etc/skel/. /home/YOURUSERNAME + chown -R YOURUSERNAME:YOURUSERNAME /home/YOURUSERNAME + +6.7 Add your user account to the default set of groups for an +administrator: + +:: + + usermod -a -G adm,cdrom,dip,lpadmin,plugdev,sambashare,sudo YOURUSERNAME + +6.8 Mirror GRUB + +If you installed to multiple disks, install GRUB on the additional +disks: + +6.8a For legacy (BIOS) booting: + +:: + + dpkg-reconfigure grub-pc + Hit enter until you get to the device selection screen. + Select (using the space bar) all of the disks (not partitions) in your pool. 
+ +6.8b UEFI + +:: + + umount /boot/efi + +For the second and subsequent disks (increment ubuntu-2 to -3, etc.): + +:: + + dd if=/dev/disk/by-id/scsi-SATA_disk1-part2 \ + of=/dev/disk/by-id/scsi-SATA_disk2-part2 + efibootmgr -c -g -d /dev/disk/by-id/scsi-SATA_disk2 \ + -p 2 -L "ubuntu-2" -l '\EFI\ubuntu\shimx64.efi' + + mount /boot/efi + +Step 7: (Optional) Configure Swap +--------------------------------- + +**Caution**: On systems with extremely high memory pressure, using a +zvol for swap can result in lockup, regardless of how much swap is still +available. This issue is currently being investigated in: +`https://github.com/zfsonlinux/zfs/issues/7734 `__ + +7.1 Create a volume dataset (zvol) for use as a swap device: + +:: + + zfs create -V 4G -b $(getconf PAGESIZE) -o compression=zle \ + -o logbias=throughput -o sync=always \ + -o primarycache=metadata -o secondarycache=none \ + -o com.sun:auto-snapshot=false rpool/swap + +You can adjust the size (the ``4G`` part) to your needs. + +The compression algorithm is set to ``zle`` because it is the cheapest +available algorithm. As this guide recommends ``ashift=12`` (4 kiB +blocks on disk), the common case of a 4 kiB page size means that no +compression algorithm can reduce I/O. The exception is all-zero pages, +which are dropped by ZFS; but some form of compression has to be enabled +to get this behavior. + +7.2 Configure the swap device: + +**Caution**: Always use long ``/dev/zvol`` aliases in configuration +files. Never use a short ``/dev/zdX`` device name. + +:: + + mkswap -f /dev/zvol/rpool/swap + echo /dev/zvol/rpool/swap none swap discard 0 0 >> /etc/fstab + echo RESUME=none > /etc/initramfs-tools/conf.d/resume + +The ``RESUME=none`` is necessary to disable resuming from hibernation. +This does not work, as the zvol is not present (because the pool has not +yet been imported) at the time the resume script runs. If it is not +disabled, the boot process hangs for 30 seconds waiting for the swap +zvol to appear. + +7.3 Enable the swap device: + +:: + + swapon -av + +Step 8: Full Software Installation +---------------------------------- + +8.1 Upgrade the minimal system: + +:: + + apt dist-upgrade --yes + +8.2 Install a regular set of software: + +Choose one of the following options: + +8.2a Install a command-line environment only: + +:: + + apt install --yes ubuntu-standard + +8.2b Install a full GUI environment: + +:: + + apt install --yes ubuntu-desktop + vi /etc/gdm3/custom.conf + In the [daemon] section, add: InitialSetupEnable=false + +**Hint**: If you are installing a full GUI environment, you will likely +want to manage your network with NetworkManager: + +:: + + vi /etc/netplan/01-netcfg.yaml + network: + version: 2 + renderer: NetworkManager + +8.3 Optional: Disable log compression: + +As ``/var/log`` is already compressed by ZFS, logrotate’s compression is +going to burn CPU and disk I/O for (in most cases) very little gain. +Also, if you are making snapshots of ``/var/log``, logrotate’s +compression will actually waste space, as the uncompressed data will +live on in the snapshot. You can edit the files in ``/etc/logrotate.d`` +by hand to comment out ``compress``, or use this loop (copy-and-paste +highly recommended): + +:: + + for file in /etc/logrotate.d/* ; do + if grep -Eq "(^|[^#y])compress" "$file" ; then + sed -i -r "s/(^|[^#y])(compress)/\1#\2/" "$file" + fi + done + +8.4 Reboot: + +:: + + reboot + +Step 9: Final Cleanup +~~~~~~~~~~~~~~~~~~~~~ + +9.1 Wait for the system to boot normally. 
Login using the account you +created. Ensure the system (including networking) works normally. + +9.2 Optional: Delete the snapshots of the initial installation: + +:: + + sudo zfs destroy bpool/BOOT/ubuntu@install + sudo zfs destroy rpool/ROOT/ubuntu@install + +9.3 Optional: Disable the root password + +:: + + sudo usermod -p '*' root + +9.4 Optional: Re-enable the graphical boot process: + +If you prefer the graphical boot process, you can re-enable it now. If +you are using LUKS, it makes the prompt look nicer. + +:: + + sudo vi /etc/default/grub + Uncomment: GRUB_TIMEOUT_STYLE=hidden + Add quiet and splash to: GRUB_CMDLINE_LINUX_DEFAULT + Comment out: GRUB_TERMINAL=console + Save and quit. + + sudo update-grub + +**Note:** Ignore errors from ``osprober``, if present. + +9.5 Optional: For LUKS installs only, backup the LUKS header: + +:: + + sudo cryptsetup luksHeaderBackup /dev/disk/by-id/scsi-SATA_disk1-part4 \ + --header-backup-file luks1-header.dat + +Store that backup somewhere safe (e.g. cloud storage). It is protected +by your LUKS passphrase, but you may wish to use additional encryption. + +**Hint:** If you created a mirror or raidz topology, repeat this for +each LUKS volume (``luks2``, etc.). + +Troubleshooting +--------------- + +Rescuing using a Live CD +~~~~~~~~~~~~~~~~~~~~~~~~ + +Go through `Step 1: Prepare The Install +Environment <#step-1-prepare-the-install-environment>`__. + +For LUKS, first unlock the disk(s): + +:: + + cryptsetup luksOpen /dev/disk/by-id/scsi-SATA_disk1-part4 luks1 + Repeat for additional disks, if this is a mirror or raidz topology. + +Mount everything correctly: + +:: + + zpool export -a + zpool import -N -R /mnt rpool + zpool import -N -R /mnt bpool + zfs mount rpool/ROOT/ubuntu + zfs mount -a + +If needed, you can chroot into your installed environment: + +:: + + mount --rbind /dev /mnt/dev + mount --rbind /proc /mnt/proc + mount --rbind /sys /mnt/sys + chroot /mnt /bin/bash --login + mount /boot + mount -a + +Do whatever you need to do to fix your system. + +When done, cleanup: + +:: + + exit + mount | grep -v zfs | tac | awk '/\/mnt/ {print $3}' | xargs -i{} umount -lf {} + zpool export -a + reboot + +MPT2SAS +~~~~~~~ + +Most problem reports for this tutorial involve ``mpt2sas`` hardware that +does slow asynchronous drive initialization, like some IBM M1015 or +OEM-branded cards that have been flashed to the reference LSI firmware. + +The basic problem is that disks on these controllers are not visible to +the Linux kernel until after the regular system is started, and ZoL does +not hotplug pool members. See +`https://github.com/zfsonlinux/zfs/issues/330 `__. + +Most LSI cards are perfectly compatible with ZoL. If your card has this +glitch, try setting ZFS_INITRD_PRE_MOUNTROOT_SLEEP=X in +/etc/default/zfs. The system will wait X seconds for all drives to +appear before importing the pool. + +Areca +~~~~~ + +Systems that require the ``arcsas`` blob driver should add it to the +``/etc/initramfs-tools/modules`` file and run +``update-initramfs -u -k all``. + +Upgrade or downgrade the Areca driver if something like +``RIP: 0010:[] [] native_read_tsc+0x6/0x20`` +appears anywhere in kernel log. ZoL is unstable on systems that emit +this error message. + +VMware +~~~~~~ + +- Set ``disk.EnableUUID = "TRUE"`` in the vmx file or vsphere + configuration. Doing this ensures that ``/dev/disk`` aliases are + created in the guest. + +QEMU/KVM/XEN +~~~~~~~~~~~~ + +Set a unique serial number on each virtual disk using libvirt or qemu +(e.g. 
``-drive if=none,id=disk1,file=disk1.qcow2,serial=1234567890``). + +To be able to use UEFI in guests (instead of only BIOS booting), run +this on the host: + +:: + + sudo apt install ovmf + + sudo vi /etc/libvirt/qemu.conf + Uncomment these lines: + nvram = [ + "/usr/share/OVMF/OVMF_CODE.fd:/usr/share/OVMF/OVMF_VARS.fd", + "/usr/share/AAVMF/AAVMF_CODE.fd:/usr/share/AAVMF/AAVMF_VARS.fd" + ] + + sudo service libvirt-bin restart diff --git a/docs/Ubuntu.rst b/docs/Ubuntu.rst new file mode 100644 index 0000000..3d04320 --- /dev/null +++ b/docs/Ubuntu.rst @@ -0,0 +1,10 @@ +ZFS packages are `provided by the +distribution `__. + +If you want to use ZFS as your root filesystem, see these instructions: + +- [[Ubuntu 18.04 Root on ZFS]] + +For troubleshooting existing installations, see: + +- 16.04: [[Ubuntu 16.04 Root on ZFS]] diff --git a/docs/Workflow-Accept-PR.rst b/docs/Workflow-Accept-PR.rst new file mode 100644 index 0000000..50c62dd --- /dev/null +++ b/docs/Workflow-Accept-PR.rst @@ -0,0 +1,11 @@ +Accept a PR +=========== + +After a PR is generated, it is available to be commented upon by project +members. They may request additional changes, please work with them. + +In addition, project members may accept PRs; this is not an automatic +process. By convention, PRs aren't accepted for at least a day, to allow +all members a chance to comment. + +After a PR has been accepted, it is available to be merged. diff --git a/docs/Workflow-Close-PR.rst b/docs/Workflow-Close-PR.rst new file mode 100644 index 0000000..02c36d2 --- /dev/null +++ b/docs/Workflow-Close-PR.rst @@ -0,0 +1,2 @@ +Close a PR +========== diff --git a/docs/Workflow-Commit-Often.rst b/docs/Workflow-Commit-Often.rst new file mode 100644 index 0000000..7661073 --- /dev/null +++ b/docs/Workflow-Commit-Often.rst @@ -0,0 +1,24 @@ +Commit Often +============ + +When writing complex code, it is strongly suggested that developers save +their changes, and commit those changes to their local repository, on a +frequent basis. In general, this means every hour or two, or when a +specific milestone is hit in the development. This allows you to easily +*checkpoint* your work. + +Details of this process can be found in the `Commit the +changes `__ +page. + +In addition, it is suggested that the changes be pushed to your forked +Github repository with the ``git push`` command at least every day, as a +backup. Changes should also be pushed prior to running a test, in case +your system crashes. This project works with kernel software. A crash +while testing development software could easily cause loss of data. + +For developers who want to keep their development branches clean, it +might be useful to +`squash `__ +commits from time to time, even before you're ready to `create a +PR `__. diff --git a/docs/Workflow-Commit.rst b/docs/Workflow-Commit.rst new file mode 100644 index 0000000..14b9d20 --- /dev/null +++ b/docs/Workflow-Commit.rst @@ -0,0 +1,76 @@ +Commit the Changes +================== + +In order for your changes to be merged into the ZFS on Linux project, +you must first send the changes made in your *topic* branch to your +*local* repository. This can be done with the ``git commit -sa``. If +there are any new files, they will be reported as *untracked*, and they +will not be created in the *local* repository. To add newly created +files to the *local* repository, use the ``git add (file-name) ...`` +command. + +The ``-s`` option adds a *signed off by* line to the commit. 
This *signed off by* line is required for the ZFS on Linux project. It
performs the following functions:

- It is an acceptance of the `License
  Terms `__ of
  the project.
- It is the developer's certification that they have the right to
  submit the patch for inclusion into the code base.
- It indicates agreement to the `Developer's Certificate of
  Origin `__.

The ``-a`` option causes all modified files in the current branch to be
*staged* prior to performing the commit. A list of the modified files in
the *local* branch can be created by the use of the ``git status``
command. If there are files that have been modified that shouldn't be
part of the commit, they can either be rolled back in the current
branch, or the files can be manually staged with the
``git add (file-name) ...`` command, and the ``git commit -s`` command
can be run without the ``-a`` option.

When you run the ``git commit`` command, an editor will appear to allow
you to enter the commit message. The following requirements apply to a
commit message:

- The first line is a title for the commit, and must be no longer than
  50 characters.
- The second line should be blank, separating the title of the commit
  message from the body of the commit message.
- There may be one or more lines in the commit message describing the
  reason for the changes (the body of the commit message). These lines
  must be no longer than 72 characters, and may contain blank lines.

  - If the commit closes an Issue, there should be a line in the body
    with the string ``Closes``, followed by the issue number. If
    multiple issues are closed, multiple lines should be used.

- After the body of the commit message, there should be a blank line.
  This separates the body from the *signed off by* line.
- The *signed off by* line should have been created by the
  ``git commit -s`` command. If not, the line has the following format:

  - The string "Signed-off-by:"
  - The name of the developer. Please do not use pseudonyms or
    make any anonymous contributions.
  - The email address of the developer, enclosed by angle brackets
    ("<>").
  - An example of this is
    ``Signed-off-by: Random Developer ``

- If the commit changes only documentation, the line
  ``Requires-builders: style`` may be included in the body. This will
  cause only the *style* testing to be run. This can save a significant
  amount of time when Github runs the automated testing. For
  information on other testing options, please see the `Buildbot
  options `__
  page.

For more information about writing commit messages, please visit `How to
Write a Git Commit
Message `__.

After the changes have been committed to your *local* repository, they
should be pushed to your *forked* repository. This is done with the
``git push`` command.

diff --git a/docs/Workflow-Conflicts.rst b/docs/Workflow-Conflicts.rst
new file mode 100644
index 0000000..aa2e624
--- /dev/null
+++ b/docs/Workflow-Conflicts.rst
@@ -0,0 +1,2 @@
Fix Conflicts
=============

diff --git a/docs/Workflow-Create-Branch.rst b/docs/Workflow-Create-Branch.rst
new file mode 100644
index 0000000..8e6a37a
--- /dev/null
+++ b/docs/Workflow-Create-Branch.rst
@@ -0,0 +1,32 @@
Create a Branch
===============

With small projects, it's possible to develop code as commits directly
on the *master* branch. In the ZFS-on-Linux project, that sort of
development would create havoc and make it difficult to open a PR or
rebase the code.
For this reason, development in the ZFS-on-Linux +project is done on *topic* branches. + +The following commands will perform the required functions: + +:: + + $ cd zfs + $ git fetch upstream master + $ git checkout master + $ git merge upstream/master + $ git branch (topic-branch-name) + $ git checkout (topic-branch-name) + +1. Navigate to your *local* repository. +2. Fetch the updates from the *upstream* repository. +3. Set the current branch to *master*. +4. Merge the fetched updates into the *local* repository. +5. Create a new *topic* branch on the updated *master* branch. The name + of the branch should be either the name of the feature (preferred for + development of features) or an indication of the issue being worked + on (preferred for bug fixes). +6. Set the current branch to the newly created *topic* branch. + +**Pro Tip**: The ``git checkout -b (topic-branch-name)`` command can be +used to create and checkout a new branch with one command. diff --git a/docs/Workflow-Create-Github-Account.rst b/docs/Workflow-Create-Github-Account.rst new file mode 100644 index 0000000..f55e1f9 --- /dev/null +++ b/docs/Workflow-Create-Github-Account.rst @@ -0,0 +1,18 @@ +Create a Github Account +======================= + +This page goes over how to create a Github account. There are no special +settings needed to use your Github account on the `ZFS on Linux +Project `__. + +Github did an excellent job of documenting how to create an account. The +following link provides everything you need to know to get your Github +account up and running. + +`https://help.github.com/articles/signing-up-for-a-new-github-account/ `__ + +In addition, the following articles might be useful: + +- `https://help.github.com/articles/keeping-your-account-and-data-secure/ `__ +- `https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/ `__ +- `https://help.github.com/articles/adding-a-fallback-authentication-method-with-recover-accounts-elsewhere/ `__ diff --git a/docs/Workflow-Create-Test.rst b/docs/Workflow-Create-Test.rst new file mode 100644 index 0000000..2cdc94b --- /dev/null +++ b/docs/Workflow-Create-Test.rst @@ -0,0 +1,2 @@ +Create a New Test +================= diff --git a/docs/Workflow-Delete-Branch.rst b/docs/Workflow-Delete-Branch.rst new file mode 100644 index 0000000..12046bb --- /dev/null +++ b/docs/Workflow-Delete-Branch.rst @@ -0,0 +1,11 @@ +Delete a Branch +=============== + +When a commit has been accepted and merged into the main ZFS repository, +the developer's topic branch should be deleted. This is also appropriate +if the developer abandons the change, and could be appropriate if they +change the direction of the change. + +To delete a topic branch, navigate to the base directory of your local +Git repository and use the ``git branch -d (branch-name)`` command. The +name of the branch should be the same as the branch that was created. diff --git a/docs/Workflow-Generate-PR.rst b/docs/Workflow-Generate-PR.rst new file mode 100644 index 0000000..36751d7 --- /dev/null +++ b/docs/Workflow-Generate-PR.rst @@ -0,0 +1,2 @@ +Generate a PR +============= diff --git a/docs/Workflow-Get-Source.rst b/docs/Workflow-Get-Source.rst new file mode 100644 index 0000000..1b5746c --- /dev/null +++ b/docs/Workflow-Get-Source.rst @@ -0,0 +1,52 @@ +.. raw:: html + + + +Get the Source Code +=================== + +This document goes over how a developer can get the ZFS source code for +the purpose of making changes to it. 
For other purposes, please see the +`Get the Source +Code `__ +page. + +The Git *master* branch contains the latest version of the software, +including changes that weren't included in the released tarball. This is +the preferred source code location and procedure for ZFS development. If +you would like to do development work for the `ZFS on Linux +Project `__, you can fork the Github +repository and prepare the source by using the following process. + +1. Go to the `ZFS on Linux Project `__ + and fork both the ZFS and SPL repositories. This will create two new + repositories (your *forked* repositories) under your account. + Detailed instructions can be found at + `https://help.github.com/articles/fork-a-repo/ `__. +2. Clone both of these repositories onto your development system. This + will create your *local* repositories. As an example, if your Github + account is *newzfsdeveloper*, the commands to clone the repositories + would be: + +:: + + $ mkdir zfs-on-linux + $ cd zfs-on-linux + $ git clone https://github.com/newzfsdeveloper/spl.git + $ git clone https://github.com/newzfsdeveloper/zfs.git + +3. Enter the following commands to make the necessary linkage to the + *upstream master* repositories and prepare the source to be compiled: + +:: + + $ cd spl + $ git remote add upstream https://github.com/zfsonlinux/spl.git + $ ./autogen.sh + $ cd ../zfs + $ git remote add upstream https://github.com/zfsonlinux/zfs.git + $ ./autogen.sh + cd .. + +The ``./autogen.sh`` script generates the build files. If the build +system is updated by any developer, these scripts need to be run again. diff --git a/docs/Workflow-Install-Git.rst b/docs/Workflow-Install-Git.rst new file mode 100644 index 0000000..3bdaceb --- /dev/null +++ b/docs/Workflow-Install-Git.rst @@ -0,0 +1,50 @@ +Install Git +=========== + +To work with the ZFS software on Github, it's necessary to install the +Git software on your computer and set it up. This page covers that +process for some common Linux operating systems. Other Linux operating +systems should be similar. + +Install the Software Package +---------------------------- + +The first step is to actually install the Git software package. This +package can be found in the repositories used by most Linux +distributions. If your distribution isn't listed here, or you'd like to +install from source, please have a look in the `official Git +documentation `__. + +Red Hat and CentOS +~~~~~~~~~~~~~~~~~~ + +:: + + # yum install git + +Fedora +~~~~~~ + +:: + + $ sudo dnf install git + +Debian and Ubuntu +~~~~~~~~~~~~~~~~~ + +:: + + $ sudo apt install git + +Configuring Git +--------------- + +Your user name and email address must be set within Git before you can +make commits to the ZFS project. In addition, your preferred text editor +should be set to whatever you would like to use. 
+ +:: + + $ git config --global user.name "John Doe" + $ git config --global user.email johndoe@example.com + $ git config --global core.editor emacs diff --git a/docs/Workflow-Large-Features.rst b/docs/Workflow-Large-Features.rst new file mode 100644 index 0000000..0f5a0d3 --- /dev/null +++ b/docs/Workflow-Large-Features.rst @@ -0,0 +1,2 @@ +Adding Large Features +===================== diff --git a/docs/Workflow-Merge-PR.rst b/docs/Workflow-Merge-PR.rst new file mode 100644 index 0000000..c757290 --- /dev/null +++ b/docs/Workflow-Merge-PR.rst @@ -0,0 +1,9 @@ +Merge a PR +========== + +Once all the feedback has been addressed, the PR will be merged into the +*master* branch by a member with write permission (most members don't +have this permission). + +After the PR has been merged, it is eligible to be added to the +*release* branch. diff --git a/docs/Workflow-Rebase.rst b/docs/Workflow-Rebase.rst new file mode 100644 index 0000000..9e5ed5d --- /dev/null +++ b/docs/Workflow-Rebase.rst @@ -0,0 +1,28 @@ +Rebase the Update +================= + +Updates to the ZFS on Linux project should always be based on the +current *master* branch. This makes them easier to merge into the +repository. + +There are two steps in the rebase process. The first step is to update +the *local master* branch from the *upstream master* repository. This +can be done by entering the following commands: + +:: + + $ git fetch upstream master + $ git checkout master + $ git merge upstream/master + +The second step is to perform the actual rebase of the updates. This is +done by entering the command ``git rebase upstream/master``. If there +are any conflicts between the updates in your *local* branch and the +updates in the *upstream master* branch, you will be informed of them, +and allowed to correct them (see the +`Conflicts `__ +page). + +This would also be a good time to +`squash `__ your +commits. diff --git a/docs/Workflow-Squash.rst b/docs/Workflow-Squash.rst new file mode 100644 index 0000000..efa08d1 --- /dev/null +++ b/docs/Workflow-Squash.rst @@ -0,0 +1,2 @@ +Squash the Commits +================== diff --git a/docs/Workflow-Test.rst b/docs/Workflow-Test.rst new file mode 100644 index 0000000..8b64adc --- /dev/null +++ b/docs/Workflow-Test.rst @@ -0,0 +1,106 @@ +Testing Changes to ZFS +====================== + +The code in the ZFS on Linux project is quite complex. A minor error in +a change could easily introduce new bugs into the software, causing +unforeseeable problems. In an attempt to avoid this, the ZTS (ZFS Test +Suite) was developed. This test suite is run against multiple +architectures and distributions by the Github system when a PR (Pull +Request) is submitted. + +A subset of the full test suite can be run by the developer to perform a +preliminary verification of the changes in their *local* repository. + +Style Testing +------------- + +The first part of the testing is to verify that the software meets the +project's style guidelines. To verify that the code meets those +guidelines, run ``make checkstyle`` from the *local* repository. + +Basic Functionality Testing +--------------------------- + +The second part of the testing is to verify basic functionality. This is +to ensure that the changes made don't break previous functionality. + +There are a few helper scripts provided in the top-level scripts +directory designed to aid developers working with in-tree builds. + +- **zfs-helper.sh:** Certain functionality (i.e. 
/dev/zvol/) depends on
  the ZFS provided udev helper scripts being installed on the system.
  This script can be used to create symlinks on the system from the
  installation location to the in-tree helper. These links must be in
  place to successfully run the ZFS Test Suite. The ``-i`` and ``-r``
  options can be used to install and remove the symlinks.

::

    $ sudo ./scripts/zfs-helpers.sh -i

- **zfs.sh:** The freshly built kernel modules from the *local*
  repository can be loaded using ``zfs.sh``. This script will load
  those modules, **even if there are ZFS modules loaded** from another
  location, which could cause long-term problems if any of the
  non-testing file-systems on the system use ZFS.

This script can later be used to unload the kernel modules with the
``-u`` option.

::

    $ sudo ./scripts/zfs.sh

- **zfs-tests.sh:** A wrapper which can be used to launch the ZFS Test
  Suite. Three loopback devices are created on top of sparse files
  located in ``/var/tmp/`` and used for the regression test. Detailed
  directions for running the ZTS can be found in the `ZTS
  Readme `__ file.

**WARNING**: This script should **only** be run on a development system.
It makes configuration changes to the system to run the tests, and it
*tries* to remove those changes after completion, but the change removal
could fail, and dynamic changes of this nature are usually undesirable on
a production system. For more information on the changes made, please
see the `ZTS
Readme `__ file.

::

    $ sudo ./scripts/zfs-tests.sh -vx

**Tip:** The **delegate** tests will be skipped unless group read
permission is set on the zfs directory and its parents.

- **zloop.sh:** A wrapper to run ztest repeatedly with randomized
  arguments. The ztest command is a user space stress test designed to
  detect correctness issues by concurrently running a random set of
  test cases. If a crash is encountered, the ztest logs, any associated
  vdev files, and core file (if one exists) are collected and moved to
  the output directory for analysis.

If there are any failures in this test, please see the `zloop
debugging `__
page.

::

    $ sudo ./scripts/zloop.sh

Change Testing
--------------

Finally, it's necessary to verify that the changes made actually do what
they were intended to do. The extent of the testing depends on the
complexity of the changes.

After the changes are tested, if the testing can be automated for
addition to ZTS, a `new
test `__
should be created. This test should be part of the PR that resolves the
issue or adds the feature. If the feature is split into multiple PRs,
some testing should be included in the first, with additions to the test
as required.

It should be noted that if the change adds too many lines of code that
don't get tested by ZTS, the change will not pass testing.
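Putting the pieces together, a typical local verification pass using only
the scripts described above might look like the following sketch, run from
the top level of your *local* repository:

::

    $ make checkstyle                   # style testing
    $ sudo ./scripts/zfs-helpers.sh -i  # install the udev helper symlinks
    $ sudo ./scripts/zfs.sh             # load the freshly built kernel modules
    $ sudo ./scripts/zfs-tests.sh -vx   # run the ZFS Test Suite
    $ sudo ./scripts/zfs.sh -u          # unload the kernel modules when finished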
diff --git a/docs/Workflow-Update-PR.rst b/docs/Workflow-Update-PR.rst new file mode 100644 index 0000000..8d02c30 --- /dev/null +++ b/docs/Workflow-Update-PR.rst @@ -0,0 +1,2 @@ +Update a PR +=========== diff --git a/docs/Workflow-Zloop-Debugging.rst b/docs/Workflow-Zloop-Debugging.rst new file mode 100644 index 0000000..1e5d344 --- /dev/null +++ b/docs/Workflow-Zloop-Debugging.rst @@ -0,0 +1,2 @@ +Debugging *Zloop* Failures +========================== diff --git a/docs/ZFS-Transaction-Delay.rst b/docs/ZFS-Transaction-Delay.rst new file mode 100644 index 0000000..a206f09 --- /dev/null +++ b/docs/ZFS-Transaction-Delay.rst @@ -0,0 +1,105 @@ +ZFS Transaction Delay +~~~~~~~~~~~~~~~~~~~~~ + +ZFS write operations are delayed when the backend storage isn't able to +accommodate the rate of incoming writes. This delay process is known as +the ZFS write throttle. + +If there is already a write transaction waiting, the delay is relative +to when that transaction will finish waiting. Thus the calculated delay +time is independent of the number of threads concurrently executing +transactions. + +If there is only one waiter, the delay is relative to when the +transaction started, rather than the current time. This credits the +transaction for "time already served." For example, if a write +transaction requires reading indirect blocks first, then the delay is +counted at the start of the transaction, just prior to the indirect +block reads. + +The minimum time for a transaction to take is calculated as: + +:: + + min_time = zfs_delay_scale * (dirty - min) / (max - dirty) + min_time is then capped at 100 milliseconds + +The delay has two degrees of freedom that can be adjusted via tunables: + +1. The percentage of dirty data at which we start to delay is defined by + zfs_delay_min_dirty_percent. This is typically be at or above + zfs_vdev_async_write_active_max_dirty_percent so delays occur after + writing at full speed has failed to keep up with the incoming write + rate. +2. The scale of the curve is defined by zfs_delay_scale. Roughly + speaking, this variable determines the amount of delay at the + midpoint of the curve. + +:: + + delay + 10ms +-------------------------------------------------------------*+ + | *| + 9ms + *+ + | *| + 8ms + *+ + | * | + 7ms + * + + | * | + 6ms + * + + | * | + 5ms + * + + | * | + 4ms + * + + | * | + 3ms + * + + | * | + 2ms + (midpoint) * + + | | ** | + 1ms + v *** + + | zfs_delay_scale ----------> ******** | + 0 +-------------------------------------*********----------------+ + 0% <- zfs_dirty_data_max -> 100% + +Note that since the delay is added to the outstanding time remaining on +the most recent transaction, the delay is effectively the inverse of +IOPS. Here the midpoint of 500 microseconds translates to 2000 IOPS. The +shape of the curve was chosen such that small changes in the amount of +accumulated dirty data in the first 3/4 of the curve yield relatively +small differences in the amount of delay. 
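To make the relationship between the curve and IOPS concrete, here is a
small worked example. It assumes the default ``zfs_delay_scale`` of
500,000 ns (500 microseconds), which corresponds to the midpoint described
above:

::

    min_time = zfs_delay_scale * (dirty - min) / (max - dirty)

    At the midpoint, (dirty - min) equals (max - dirty), so:

        min_time = zfs_delay_scale = 500,000 ns = 500 us

    Since the delay is effectively the inverse of IOPS:

        1 s / 500 us = 2000 IOPS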
+ +The effects can be easier to understand when the amount of delay is +represented on a log scale: + +:: + + delay + 100ms +-------------------------------------------------------------++ + + + + | | + + *+ + 10ms + *+ + + ** + + | (midpoint) ** | + + | ** + + 1ms + v **** + + + zfs_delay_scale ----------> ***** + + | **** | + + **** + + 100us + ** + + + * + + | * | + + * + + 10us + * + + + + + | | + + + + +--------------------------------------------------------------+ + 0% <- zfs_dirty_data_max -> 100% + +Note here that only as the amount of dirty data approaches its limit +does the delay start to increase rapidly. The goal of a properly tuned +system should be to keep the amount of dirty data out of that range by +first ensuring that the appropriate limits are set for the I/O scheduler +to reach optimal throughput on the backend storage, and then by changing +the value of zfs_delay_scale to increase the steepness of the curve. diff --git a/docs/ZFS-on-Linux-Module-Parameters.rst b/docs/ZFS-on-Linux-Module-Parameters.rst new file mode 100644 index 0000000..4ca102c --- /dev/null +++ b/docs/ZFS-on-Linux-Module-Parameters.rst @@ -0,0 +1,9351 @@ +ZFS on Linux Module Parameters +============================== + +Most of the ZFS kernel module parameters are accessible in the SysFS +``/sys/module/zfs/paramaters`` directory. Current value can be observed +by + +.. code:: shell + + cat /sys/module/zfs/parameters/PARAMETER + +Many of these can be changed by writing new values. These are denoted by +Change|Dynamic in the PARAMETER details below. + +.. code:: shell + + echo NEWVALUE >> /sys/module/zfs/parameters/PARAMETER + +If the parameter is not dynamically adjustable, an error can occur and +the value will not be set. It can be helpful to check the permissions +for the PARAMETER file in SysFS. + +In some cases, the parameter must be set prior to loading the kernel +modules or it is desired to have the parameters set automatically at +boot time. For many distros, this can be accomplished by creating a file +named ``/etc/modprobe.d/zfs.conf`` containing a text line for each +module parameter using the format: + +:: + + # change PARAMETER for workload XZY to solve problem PROBLEM_DESCRIPTION + # changed by YOUR_NAME on DATE + options zfs PARAMETER=VALUE + +Some parameters related to ZFS operations are located in module +parameters other than in the ``zfs`` kernel module. These are documented +in the individual parameter description. Unless otherwise noted, the +tunable applies to the ``zfs`` kernel module. For example, the ``icp`` +kernel module parameters are visible in the +``/sys/module/icp/parameters`` directory and can be set by default at +boot time by changing the ``/etc/modprobe.d/icp.conf`` file. + +See the man page for *modprobe.d* for more information. + +zfs-module-parameters Manual Page +--------------------------------- + +The *zfs-module-parameters(5)* man page contains brief descriptions of +the module parameters. Alas, man pages are not as suitable for quick +reference as wiki pages. This wiki page is intended to be a better +cross-reference and capture some of the wisdom of ZFS developers and +practitioners. + +ZFS Module Parameters +--------------------- + +The ZFS kernel module, ``zfs.ko``, parameters are detailed below. + +To observe the list of parameters along with a short synopsis of each +parameter, use the ``modinfo`` command: + +.. code:: bash + + modinfo zfs + +Tags +---- + +The list of parameters is quite large and resists hierarchical +representation. 
To assist in quickly finding relevant information +quickly, each module parameter has a "Tags" row with keywords for +frequent searches. + +.. _tags-1: + +Tags +---- + +ABD +^^^ + +- `zfs_abd_scatter_enabled <#zfs_abd_scatter_enabled>`__ +- `zfs_abd_scatter_max_order <#zfs_abd_scatter_max_order>`__ +- `zfs_compressed_arc_enabled <#zfs_compressed_arc_enabled>`__ + +allocation +^^^^^^^^^^ + +- `dmu_object_alloc_chunk_shift <#dmu_object_alloc_chunk_shift>`__ +- `metaslab_aliquot <#metaslab_aliquot>`__ +- `metaslab_bias_enabled <#metaslab_bias_enabled>`__ +- `metaslab_debug_load <#metaslab_debug_load>`__ +- `metaslab_debug_unload <#metaslab_debug_unload>`__ +- `metaslab_force_ganging <#metaslab_force_ganging>`__ +- `metaslab_fragmentation_factor_enabled <#metaslab_fragmentation_factor_enabled>`__ +- `zfs_metaslab_fragmentation_threshold <#zfs_metaslab_fragmentation_threshold>`__ +- `metaslab_lba_weighting_enabled <#metaslab_lba_weighting_enabled>`__ +- `metaslab_preload_enabled <#metaslab_preload_enabled>`__ +- `zfs_metaslab_segment_weight_enabled <#zfs_metaslab_segment_weight_enabled>`__ +- `zfs_metaslab_switch_threshold <#zfs_metaslab_switch_threshold>`__ +- `metaslabs_per_vdev <#metaslabs_per_vdev>`__ +- `zfs_mg_fragmentation_threshold <#zfs_mg_fragmentation_threshold>`__ +- `zfs_mg_noalloc_threshold <#zfs_mg_noalloc_threshold>`__ +- `spa_asize_inflation <#spa_asize_inflation>`__ +- `spa_load_verify_data <#spa_load_verify_data>`__ +- `spa_slop_shift <#spa_slop_shift>`__ +- `zfs_vdev_default_ms_count <#zfs_vdev_default_ms_count>`__ + +ARC +^^^ + +- `zfs_abd_scatter_min_size <#zfs_abd_scatter_min_size>`__ +- `zfs_arc_average_blocksize <#zfs_arc_average_blocksize>`__ +- `zfs_arc_dnode_limit <#zfs_arc_dnode_limit>`__ +- `zfs_arc_dnode_limit_percent <#zfs_arc_dnode_limit_percent>`__ +- `zfs_arc_dnode_reduce_percent <#zfs_arc_dnode_reduce_percent>`__ +- `zfs_arc_evict_batch_limit <#zfs_arc_evict_batch_limit>`__ +- `zfs_arc_grow_retry <#zfs_arc_grow_retry>`__ +- `zfs_arc_lotsfree_percent <#zfs_arc_lotsfree_percent>`__ +- `zfs_arc_max <#zfs_arc_max>`__ +- `zfs_arc_meta_adjust_restarts <#zfs_arc_meta_adjust_restarts>`__ +- `zfs_arc_meta_limit <#zfs_arc_meta_limit>`__ +- `zfs_arc_meta_limit_percent <#zfs_arc_meta_limit_percent>`__ +- `zfs_arc_meta_min <#zfs_arc_meta_min>`__ +- `zfs_arc_meta_prune <#zfs_arc_meta_prune>`__ +- `zfs_arc_meta_strategy <#zfs_arc_meta_strategy>`__ +- `zfs_arc_min <#zfs_arc_min>`__ +- `zfs_arc_min_prefetch_lifespan <#zfs_arc_min_prefetch_lifespan>`__ +- `zfs_arc_min_prefetch_ms <#zfs_arc_min_prefetch_ms>`__ +- `zfs_arc_min_prescient_prefetch_ms <#zfs_arc_min_prescient_prefetch_ms>`__ +- `zfs_arc_overflow_shift <#zfs_arc_overflow_shift>`__ +- `zfs_arc_p_dampener_disable <#zfs_arc_p_dampener_disable>`__ +- `zfs_arc_p_min_shift <#zfs_arc_p_min_shift>`__ +- `zfs_arc_pc_percent <#zfs_arc_pc_percent>`__ +- `zfs_arc_shrink_shift <#zfs_arc_shrink_shift>`__ +- `zfs_arc_sys_free <#zfs_arc_sys_free>`__ +- `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +- `dbuf_cache_shift <#dbuf_cache_shift>`__ +- `dbuf_metadata_cache_shift <#dbuf_metadata_cache_shift>`__ +- `zfs_disable_dup_eviction <#zfs_disable_dup_eviction>`__ +- `l2arc_feed_again <#l2arc_feed_again>`__ +- `l2arc_feed_min_ms <#l2arc_feed_min_ms>`__ +- `l2arc_feed_secs <#l2arc_feed_secs>`__ +- `l2arc_headroom <#l2arc_headroom>`__ +- `l2arc_headroom_boost <#l2arc_headroom_boost>`__ +- `l2arc_nocompress <#l2arc_nocompress>`__ +- `l2arc_noprefetch <#l2arc_noprefetch>`__ +- `l2arc_norw <#l2arc_norw>`__ +- `l2arc_write_boost 
<#l2arc_write_boost>`__ +- `l2arc_write_max <#l2arc_write_max>`__ +- `zfs_multilist_num_sublists <#zfs_multilist_num_sublists>`__ +- `spa_load_verify_shift <#spa_load_verify_shift>`__ + +channel_programs +^^^^^^^^^^^^^^^^ + +- `zfs_lua_max_instrlimit <#zfs_lua_max_instrlimit>`__ +- `zfs_lua_max_memlimit <#zfs_lua_max_memlimit>`__ + +checkpoint +^^^^^^^^^^ + +- `zfs_spa_discard_memory_limit <#zfs_spa_discard_memory_limit>`__ + +checksum +^^^^^^^^ + +- `zfs_checksums_per_second <#zfs_checksums_per_second>`__ +- `zfs_fletcher_4_impl <#zfs_fletcher_4_impl>`__ +- `zfs_nopwrite_enabled <#zfs_nopwrite_enabled>`__ +- `zfs_qat_checksum_disable <#zfs_qat_checksum_disable>`__ + +compression +^^^^^^^^^^^ + +- `zfs_compressed_arc_enabled <#zfs_compressed_arc_enabled>`__ +- `zfs_qat_compress_disable <#zfs_qat_compress_disable>`__ +- `zfs_qat_disable <#zfs_qat_disable>`__ + +CPU +^^^ + +- `zfs_fletcher_4_impl <#zfs_fletcher_4_impl>`__ +- `zfs_mdcomp_disable <#zfs_mdcomp_disable>`__ +- `spl_kmem_cache_kmem_threads <#spl_kmem_cache_kmem_threads>`__ +- `spl_kmem_cache_magazine_size <#spl_kmem_cache_magazine_size>`__ +- `spl_taskq_thread_bind <#spl_taskq_thread_bind>`__ +- `spl_taskq_thread_priority <#spl_taskq_thread_priority>`__ +- `spl_taskq_thread_sequential <#spl_taskq_thread_sequential>`__ +- `zfs_vdev_raidz_impl <#zfs_vdev_raidz_impl>`__ + +dataset +^^^^^^^ + +- `zfs_max_dataset_nesting <#zfs_max_dataset_nesting>`__ + +dbuf_cache +^^^^^^^^^^ + +- `dbuf_cache_hiwater_pct <#dbuf_cache_hiwater_pct>`__ +- `dbuf_cache_lowater_pct <#dbuf_cache_lowater_pct>`__ +- `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +- `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +- `dbuf_cache_max_shift <#dbuf_cache_max_shift>`__ +- `dbuf_cache_shift <#dbuf_cache_shift>`__ +- `dbuf_metadata_cache_max_bytes <#dbuf_metadata_cache_max_bytes>`__ +- `dbuf_metadata_cache_shift <#dbuf_metadata_cache_shift>`__ + +debug +^^^^^ + +- `zfs_dbgmsg_enable <#zfs_dbgmsg_enable>`__ +- `zfs_dbgmsg_maxsize <#zfs_dbgmsg_maxsize>`__ +- `zfs_dbuf_state_index <#zfs_dbuf_state_index>`__ +- `zfs_deadman_checktime_ms <#zfs_deadman_checktime_ms>`__ +- `zfs_deadman_enabled <#zfs_deadman_enabled>`__ +- `zfs_deadman_failmode <#zfs_deadman_failmode>`__ +- `zfs_deadman_synctime_ms <#zfs_deadman_synctime_ms>`__ +- `zfs_deadman_ziotime_ms <#zfs_deadman_ziotime_ms>`__ +- `zfs_flags <#zfs_flags>`__ +- `zfs_free_leak_on_eio <#zfs_free_leak_on_eio>`__ +- `zfs_nopwrite_enabled <#zfs_nopwrite_enabled>`__ +- `zfs_object_mutex_size <#zfs_object_mutex_size>`__ +- `zfs_read_history <#zfs_read_history>`__ +- `zfs_read_history_hits <#zfs_read_history_hits>`__ +- `spl_panic_halt <#spl_panic_halt>`__ +- `zfs_txg_history <#zfs_txg_history>`__ +- `zfs_zevent_cols <#zfs_zevent_cols>`__ +- `zfs_zevent_console <#zfs_zevent_console>`__ +- `zfs_zevent_len_max <#zfs_zevent_len_max>`__ +- `zil_replay_disable <#zil_replay_disable>`__ +- `zio_deadman_log_all <#zio_deadman_log_all>`__ +- `zio_decompress_fail_fraction <#zio_decompress_fail_fraction>`__ +- `zio_delay_max <#zio_delay_max>`__ + +dedup +^^^^^ + +- `zfs_ddt_data_is_special <#zfs_ddt_data_is_special>`__ +- `zfs_disable_dup_eviction <#zfs_disable_dup_eviction>`__ + +delay +^^^^^ + +- `zfs_delays_per_second <#zfs_delays_per_second>`__ + +delete +^^^^^^ + +- `zfs_async_block_max_blocks <#zfs_async_block_max_blocks>`__ +- `zfs_delete_blocks <#zfs_delete_blocks>`__ +- `zfs_free_bpobj_enabled <#zfs_free_bpobj_enabled>`__ +- `zfs_free_max_blocks <#zfs_free_max_blocks>`__ +- `zfs_free_min_time_ms <#zfs_free_min_time_ms>`__ +- 
`zfs_obsolete_min_time_ms <#zfs_obsolete_min_time_ms>`__ +- `zfs_per_txg_dirty_frees_percent <#zfs_per_txg_dirty_frees_percent>`__ + +discard +^^^^^^^ + +- `zvol_max_discard_blocks <#zvol_max_discard_blocks>`__ + +disks +^^^^^ + +- `zfs_nocacheflush <#zfs_nocacheflush>`__ +- `zil_nocacheflush <#zil_nocacheflush>`__ + +DMU +^^^ + +- `zfs_async_block_max_blocks <#zfs_async_block_max_blocks>`__ +- `dmu_object_alloc_chunk_shift <#dmu_object_alloc_chunk_shift>`__ +- `zfs_dmu_offset_next_sync <#zfs_dmu_offset_next_sync>`__ + +encryption +^^^^^^^^^^ + +- `icp_aes_impl <#icp_aes_impl>`__ +- `icp_gcm_impl <#icp_gcm_impl>`__ +- `zfs_key_max_salt_uses <#zfs_key_max_salt_uses>`__ +- `zfs_qat_encrypt_disable <#zfs_qat_encrypt_disable>`__ + +filesystem +^^^^^^^^^^ + +- `zfs_admin_snapshot <#zfs_admin_snapshot>`__ +- `zfs_delete_blocks <#zfs_delete_blocks>`__ +- `zfs_expire_snapshot <#zfs_expire_snapshot>`__ +- `zfs_free_max_blocks <#zfs_free_max_blocks>`__ +- `zfs_max_recordsize <#zfs_max_recordsize>`__ +- `zfs_read_chunk_size <#zfs_read_chunk_size>`__ + +fragmentation +^^^^^^^^^^^^^ + +- `zfs_metaslab_fragmentation_threshold <#zfs_metaslab_fragmentation_threshold>`__ +- `zfs_mg_fragmentation_threshold <#zfs_mg_fragmentation_threshold>`__ +- `zfs_mg_noalloc_threshold <#zfs_mg_noalloc_threshold>`__ + +HDD +^^^ + +- `metaslab_lba_weighting_enabled <#metaslab_lba_weighting_enabled>`__ +- `zfs_vdev_mirror_rotating_inc <#zfs_vdev_mirror_rotating_inc>`__ +- `zfs_vdev_mirror_rotating_seek_inc <#zfs_vdev_mirror_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_seek_offset <#zfs_vdev_mirror_rotating_seek_offset>`__ + +hostid +^^^^^^ + +- `spl_hostid <#spl_hostid>`__ +- `spl_hostid_path <#spl_hostid_path>`__ + +import +^^^^^^ + +- `zfs_autoimport_disable <#zfs_autoimport_disable>`__ +- `zfs_max_missing_tvds <#zfs_max_missing_tvds>`__ +- `zfs_multihost_fail_intervals <#zfs_multihost_fail_intervals>`__ +- `zfs_multihost_history <#zfs_multihost_history>`__ +- `zfs_multihost_import_intervals <#zfs_multihost_import_intervals>`__ +- `zfs_multihost_interval <#zfs_multihost_interval>`__ +- `zfs_recover <#zfs_recover>`__ +- `spa_config_path <#spa_config_path>`__ +- `spa_load_print_vdev_tree <#spa_load_print_vdev_tree>`__ +- `spa_load_verify_maxinflight <#spa_load_verify_maxinflight>`__ +- `spa_load_verify_metadata <#spa_load_verify_metadata>`__ +- `spa_load_verify_shift <#spa_load_verify_shift>`__ +- `zvol_inhibit_dev <#zvol_inhibit_dev>`__ + +L2ARC +^^^^^ + +- `l2arc_feed_again <#l2arc_feed_again>`__ +- `l2arc_feed_min_ms <#l2arc_feed_min_ms>`__ +- `l2arc_feed_secs <#l2arc_feed_secs>`__ +- `l2arc_headroom <#l2arc_headroom>`__ +- `l2arc_headroom_boost <#l2arc_headroom_boost>`__ +- `l2arc_nocompress <#l2arc_nocompress>`__ +- `l2arc_noprefetch <#l2arc_noprefetch>`__ +- `l2arc_norw <#l2arc_norw>`__ +- `l2arc_write_boost <#l2arc_write_boost>`__ +- `l2arc_write_max <#l2arc_write_max>`__ + +memory +^^^^^^ + +- `zfs_abd_scatter_enabled <#zfs_abd_scatter_enabled>`__ +- `zfs_abd_scatter_max_order <#zfs_abd_scatter_max_order>`__ +- `zfs_arc_average_blocksize <#zfs_arc_average_blocksize>`__ +- `zfs_arc_grow_retry <#zfs_arc_grow_retry>`__ +- `zfs_arc_lotsfree_percent <#zfs_arc_lotsfree_percent>`__ +- `zfs_arc_max <#zfs_arc_max>`__ +- `zfs_arc_pc_percent <#zfs_arc_pc_percent>`__ +- `zfs_arc_shrink_shift <#zfs_arc_shrink_shift>`__ +- `zfs_arc_sys_free <#zfs_arc_sys_free>`__ +- `zfs_dedup_prefetch <#zfs_dedup_prefetch>`__ +- `zfs_max_recordsize <#zfs_max_recordsize>`__ +- `metaslab_debug_load <#metaslab_debug_load>`__ +- 
`metaslab_debug_unload <#metaslab_debug_unload>`__ +- `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__ +- `zfs_scan_strict_mem_lim <#zfs_scan_strict_mem_lim>`__ +- `spl_kmem_alloc_max <#spl_kmem_alloc_max>`__ +- `spl_kmem_alloc_warn <#spl_kmem_alloc_warn>`__ +- `spl_kmem_cache_expire <#spl_kmem_cache_expire>`__ +- `spl_kmem_cache_kmem_limit <#spl_kmem_cache_kmem_limit>`__ +- `spl_kmem_cache_kmem_threads <#spl_kmem_cache_kmem_threads>`__ +- `spl_kmem_cache_magazine_size <#spl_kmem_cache_magazine_size>`__ +- `spl_kmem_cache_max_size <#spl_kmem_cache_max_size>`__ +- `spl_kmem_cache_obj_per_slab <#spl_kmem_cache_obj_per_slab>`__ +- `spl_kmem_cache_obj_per_slab_min <#spl_kmem_cache_obj_per_slab_min>`__ +- `spl_kmem_cache_reclaim <#spl_kmem_cache_reclaim>`__ +- `spl_kmem_cache_slab_limit <#spl_kmem_cache_slab_limit>`__ + +metadata +^^^^^^^^ + +- `zfs_mdcomp_disable <#zfs_mdcomp_disable>`__ + +metaslab +^^^^^^^^ + +- `metaslab_aliquot <#metaslab_aliquot>`__ +- `metaslab_bias_enabled <#metaslab_bias_enabled>`__ +- `metaslab_debug_load <#metaslab_debug_load>`__ +- `metaslab_debug_unload <#metaslab_debug_unload>`__ +- `metaslab_fragmentation_factor_enabled <#metaslab_fragmentation_factor_enabled>`__ +- `metaslab_lba_weighting_enabled <#metaslab_lba_weighting_enabled>`__ +- `metaslab_preload_enabled <#metaslab_preload_enabled>`__ +- `zfs_metaslab_segment_weight_enabled <#zfs_metaslab_segment_weight_enabled>`__ +- `zfs_metaslab_switch_threshold <#zfs_metaslab_switch_threshold>`__ +- `metaslabs_per_vdev <#metaslabs_per_vdev>`__ +- `zfs_vdev_min_ms_count <#zfs_vdev_min_ms_count>`__ +- `zfs_vdev_ms_count_limit <#zfs_vdev_ms_count_limit>`__ + +mirror +^^^^^^ + +- `zfs_vdev_mirror_non_rotating_inc <#zfs_vdev_mirror_non_rotating_inc>`__ +- `zfs_vdev_mirror_non_rotating_seek_inc <#zfs_vdev_mirror_non_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_inc <#zfs_vdev_mirror_rotating_inc>`__ +- `zfs_vdev_mirror_rotating_seek_inc <#zfs_vdev_mirror_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_seek_offset <#zfs_vdev_mirror_rotating_seek_offset>`__ + +MMP +^^^ + +- `zfs_multihost_fail_intervals <#zfs_multihost_fail_intervals>`__ +- `zfs_multihost_history <#zfs_multihost_history>`__ +- `zfs_multihost_import_intervals <#zfs_multihost_import_intervals>`__ +- `zfs_multihost_interval <#zfs_multihost_interval>`__ +- `spl_hostid <#spl_hostid>`__ +- `spl_hostid_path <#spl_hostid_path>`__ + +panic +^^^^^ + +- `spl_panic_halt <#spl_panic_halt>`__ + +prefetch +^^^^^^^^ + +- `zfs_arc_min_prefetch_ms <#zfs_arc_min_prefetch_ms>`__ +- `zfs_arc_min_prescient_prefetch_ms <#zfs_arc_min_prescient_prefetch_ms>`__ +- `zfs_dedup_prefetch <#zfs_dedup_prefetch>`__ +- `l2arc_noprefetch <#l2arc_noprefetch>`__ +- `zfs_no_scrub_prefetch <#zfs_no_scrub_prefetch>`__ +- `zfs_pd_bytes_max <#zfs_pd_bytes_max>`__ +- `zfs_prefetch_disable <#zfs_prefetch_disable>`__ +- `zfetch_array_rd_sz <#zfetch_array_rd_sz>`__ +- `zfetch_max_distance <#zfetch_max_distance>`__ +- `zfetch_max_streams <#zfetch_max_streams>`__ +- `zfetch_min_sec_reap <#zfetch_min_sec_reap>`__ +- `zvol_prefetch_bytes <#zvol_prefetch_bytes>`__ + +QAT +^^^ + +- `zfs_qat_checksum_disable <#zfs_qat_checksum_disable>`__ +- `zfs_qat_compress_disable <#zfs_qat_compress_disable>`__ +- `zfs_qat_disable <#zfs_qat_disable>`__ +- `zfs_qat_encrypt_disable <#zfs_qat_encrypt_disable>`__ + +raidz +^^^^^ + +- `zfs_vdev_raidz_impl <#zfs_vdev_raidz_impl>`__ + +receive +^^^^^^^ + +- `zfs_disable_ivset_guid_check <#zfs_disable_ivset_guid_check>`__ +- `zfs_recv_queue_length 
<#zfs_recv_queue_length>`__ + +remove +^^^^^^ + +- `zfs_obsolete_min_time_ms <#zfs_obsolete_min_time_ms>`__ +- `zfs_remove_max_segment <#zfs_remove_max_segment>`__ + +resilver +^^^^^^^^ + +- `zfs_resilver_delay <#zfs_resilver_delay>`__ +- `zfs_resilver_disable_defer <#zfs_resilver_disable_defer>`__ +- `zfs_resilver_min_time_ms <#zfs_resilver_min_time_ms>`__ +- `zfs_scan_checkpoint_intval <#zfs_scan_checkpoint_intval>`__ +- `zfs_scan_fill_weight <#zfs_scan_fill_weight>`__ +- `zfs_scan_idle <#zfs_scan_idle>`__ +- `zfs_scan_ignore_errors <#zfs_scan_ignore_errors>`__ +- `zfs_scan_issue_strategy <#zfs_scan_issue_strategy>`__ +- `zfs_scan_legacy <#zfs_scan_legacy>`__ +- `zfs_scan_max_ext_gap <#zfs_scan_max_ext_gap>`__ +- `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__ +- `zfs_scan_mem_lim_soft_fact <#zfs_scan_mem_lim_soft_fact>`__ +- `zfs_scan_strict_mem_lim <#zfs_scan_strict_mem_lim>`__ +- `zfs_scan_suspend_progress <#zfs_scan_suspend_progress>`__ +- `zfs_scan_vdev_limit <#zfs_scan_vdev_limit>`__ +- `zfs_top_maxinflight <#zfs_top_maxinflight>`__ +- `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +- `zfs_vdev_scrub_min_active <#zfs_vdev_scrub_min_active>`__ + +scrub +^^^^^ + +- `zfs_no_scrub_io <#zfs_no_scrub_io>`__ +- `zfs_no_scrub_prefetch <#zfs_no_scrub_prefetch>`__ +- `zfs_scan_checkpoint_intval <#zfs_scan_checkpoint_intval>`__ +- `zfs_scan_fill_weight <#zfs_scan_fill_weight>`__ +- `zfs_scan_idle <#zfs_scan_idle>`__ +- `zfs_scan_issue_strategy <#zfs_scan_issue_strategy>`__ +- `zfs_scan_legacy <#zfs_scan_legacy>`__ +- `zfs_scan_max_ext_gap <#zfs_scan_max_ext_gap>`__ +- `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__ +- `zfs_scan_mem_lim_soft_fact <#zfs_scan_mem_lim_soft_fact>`__ +- `zfs_scan_min_time_ms <#zfs_scan_min_time_ms>`__ +- `zfs_scan_strict_mem_lim <#zfs_scan_strict_mem_lim>`__ +- `zfs_scan_suspend_progress <#zfs_scan_suspend_progress>`__ +- `zfs_scan_vdev_limit <#zfs_scan_vdev_limit>`__ +- `zfs_scrub_delay <#zfs_scrub_delay>`__ +- `zfs_scrub_min_time_ms <#zfs_scrub_min_time_ms>`__ +- `zfs_top_maxinflight <#zfs_top_maxinflight>`__ +- `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +- `zfs_vdev_scrub_min_active <#zfs_vdev_scrub_min_active>`__ + +send +^^^^ + +- `ignore_hole_birth <#ignore_hole_birth>`__ +- `zfs_override_estimate_recordsize <#zfs_override_estimate_recordsize>`__ +- `zfs_pd_bytes_max <#zfs_pd_bytes_max>`__ +- `zfs_send_corrupt_data <#zfs_send_corrupt_data>`__ +- `zfs_send_queue_length <#zfs_send_queue_length>`__ +- `zfs_send_unmodified_spill_blocks <#zfs_send_unmodified_spill_blocks>`__ + +snapshot +^^^^^^^^ + +- `zfs_admin_snapshot <#zfs_admin_snapshot>`__ +- `zfs_expire_snapshot <#zfs_expire_snapshot>`__ + +SPA +^^^ + +- `spa_asize_inflation <#spa_asize_inflation>`__ +- `spa_load_print_vdev_tree <#spa_load_print_vdev_tree>`__ +- `spa_load_verify_data <#spa_load_verify_data>`__ +- `spa_load_verify_shift <#spa_load_verify_shift>`__ +- `spa_slop_shift <#spa_slop_shift>`__ +- `zfs_sync_pass_deferred_free <#zfs_sync_pass_deferred_free>`__ +- `zfs_sync_pass_dont_compress <#zfs_sync_pass_dont_compress>`__ +- `zfs_sync_pass_rewrite <#zfs_sync_pass_rewrite>`__ +- `zfs_sync_taskq_batch_pct <#zfs_sync_taskq_batch_pct>`__ +- `zfs_txg_timeout <#zfs_txg_timeout>`__ + +special_vdev +^^^^^^^^^^^^ + +- `zfs_ddt_data_is_special <#zfs_ddt_data_is_special>`__ +- `zfs_special_class_metadata_reserve_pct <#zfs_special_class_metadata_reserve_pct>`__ +- `zfs_user_indirect_is_special <#zfs_user_indirect_is_special>`__ + +SSD +^^^ + +- `metaslab_lba_weighting_enabled 
<#metaslab_lba_weighting_enabled>`__ +- `zfs_vdev_mirror_non_rotating_inc <#zfs_vdev_mirror_non_rotating_inc>`__ +- `zfs_vdev_mirror_non_rotating_seek_inc <#zfs_vdev_mirror_non_rotating_seek_inc>`__ + +taskq +^^^^^ + +- `spl_max_show_tasks <#spl_max_show_tasks>`__ +- `spl_taskq_kick <#spl_taskq_kick>`__ +- `spl_taskq_thread_bind <#spl_taskq_thread_bind>`__ +- `spl_taskq_thread_dynamic <#spl_taskq_thread_dynamic>`__ +- `spl_taskq_thread_priority <#spl_taskq_thread_priority>`__ +- `spl_taskq_thread_sequential <#spl_taskq_thread_sequential>`__ +- `zfs_zil_clean_taskq_nthr_pct <#zfs_zil_clean_taskq_nthr_pct>`__ +- `zio_taskq_batch_pct <#zio_taskq_batch_pct>`__ + +trim +^^^^ + +- `zfs_trim_extent_bytes_max <#zfs_trim_extent_bytes_max>`__ +- `zfs_trim_extent_bytes_min <#zfs_trim_extent_bytes_min>`__ +- `zfs_trim_metaslab_skip <#zfs_trim_metaslab_skip>`__ +- `zfs_trim_queue_limit <#zfs_trim_queue_limit>`__ +- `zfs_trim_txg_batch <#zfs_trim_txg_batch>`__ +- `zfs_vdev_aggregate_trim <#zfs_vdev_aggregate_trim>`__ + +vdev +^^^^ + +- `zfs_checksum_events_per_second <#zfs_checksum_events_per_second>`__ +- `metaslab_aliquot <#metaslab_aliquot>`__ +- `metaslab_bias_enabled <#metaslab_bias_enabled>`__ +- `zfs_metaslab_fragmentation_threshold <#zfs_metaslab_fragmentation_threshold>`__ +- `metaslabs_per_vdev <#metaslabs_per_vdev>`__ +- `zfs_mg_fragmentation_threshold <#zfs_mg_fragmentation_threshold>`__ +- `zfs_mg_noalloc_threshold <#zfs_mg_noalloc_threshold>`__ +- `zfs_multihost_interval <#zfs_multihost_interval>`__ +- `zfs_scan_vdev_limit <#zfs_scan_vdev_limit>`__ +- `zfs_slow_io_events_per_second <#zfs_slow_io_events_per_second>`__ +- `zfs_vdev_aggregate_trim <#zfs_vdev_aggregate_trim>`__ +- `zfs_vdev_aggregation_limit <#zfs_vdev_aggregation_limit>`__ +- `zfs_vdev_aggregation_limit_non_rotating <#zfs_vdev_aggregation_limit_non_rotating>`__ +- `zfs_vdev_async_read_max_active <#zfs_vdev_async_read_max_active>`__ +- `zfs_vdev_async_read_min_active <#zfs_vdev_async_read_min_active>`__ +- `zfs_vdev_async_write_active_max_dirty_percent <#zfs_vdev_async_write_active_max_dirty_percent>`__ +- `zfs_vdev_async_write_active_min_dirty_percent <#zfs_vdev_async_write_active_min_dirty_percent>`__ +- `zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ +- `zfs_vdev_async_write_min_active <#zfs_vdev_async_write_min_active>`__ +- `zfs_vdev_cache_bshift <#zfs_vdev_cache_bshift>`__ +- `zfs_vdev_cache_max <#zfs_vdev_cache_max>`__ +- `zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ +- `zfs_vdev_initializing_max_active <#zfs_vdev_initializing_max_active>`__ +- `zfs_vdev_initializing_min_active <#zfs_vdev_initializing_min_active>`__ +- `zfs_vdev_max_active <#zfs_vdev_max_active>`__ +- `zfs_vdev_min_ms_count <#zfs_vdev_min_ms_count>`__ +- `zfs_vdev_mirror_non_rotating_inc <#zfs_vdev_mirror_non_rotating_inc>`__ +- `zfs_vdev_mirror_non_rotating_seek_inc <#zfs_vdev_mirror_non_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_inc <#zfs_vdev_mirror_rotating_inc>`__ +- `zfs_vdev_mirror_rotating_seek_inc <#zfs_vdev_mirror_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_seek_offset <#zfs_vdev_mirror_rotating_seek_offset>`__ +- `zfs_vdev_ms_count_limit <#zfs_vdev_ms_count_limit>`__ +- `zfs_vdev_queue_depth_pct <#zfs_vdev_queue_depth_pct>`__ +- `zfs_vdev_raidz_impl <#zfs_vdev_raidz_impl>`__ +- `zfs_vdev_read_gap_limit <#zfs_vdev_read_gap_limit>`__ +- `zfs_vdev_removal_max_active <#zfs_vdev_removal_max_active>`__ +- `zfs_vdev_removal_min_active <#zfs_vdev_removal_min_active>`__ +- `zfs_vdev_scheduler 
<#zfs_vdev_scheduler>`__ +- `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +- `zfs_vdev_scrub_min_active <#zfs_vdev_scrub_min_active>`__ +- `zfs_vdev_sync_read_max_active <#zfs_vdev_sync_read_max_active>`__ +- `zfs_vdev_sync_read_min_active <#zfs_vdev_sync_read_min_active>`__ +- `zfs_vdev_sync_write_max_active <#zfs_vdev_sync_write_max_active>`__ +- `zfs_vdev_sync_write_min_active <#zfs_vdev_sync_write_min_active>`__ +- `zfs_vdev_trim_max_active <#zfs_vdev_trim_max_active>`__ +- `zfs_vdev_trim_min_active <#zfs_vdev_trim_min_active>`__ +- `vdev_validate_skip <#vdev_validate_skip>`__ +- `zfs_vdev_write_gap_limit <#zfs_vdev_write_gap_limit>`__ +- `zio_dva_throttle_enabled <#zio_dva_throttle_enabled>`__ +- `zio_slow_io_ms <#zio_slow_io_ms>`__ + +vdev_cache +^^^^^^^^^^ + +- `zfs_vdev_cache_bshift <#zfs_vdev_cache_bshift>`__ +- `zfs_vdev_cache_max <#zfs_vdev_cache_max>`__ +- `zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ + +vdev_initialize +^^^^^^^^^^^^^^^ + +- `zfs_initialize_value <#zfs_initialize_value>`__ + +vdev_removal +^^^^^^^^^^^^ + +- `zfs_condense_indirect_commit_entry_delay_ms <#zfs_condense_indirect_commit_entry_delay_ms>`__ +- `zfs_condense_indirect_vdevs_enable <#zfs_condense_indirect_vdevs_enable>`__ +- `zfs_condense_max_obsolete_bytes <#zfs_condense_max_obsolete_bytes>`__ +- `zfs_condense_min_mapping_bytes <#zfs_condense_min_mapping_bytes>`__ +- `zfs_reconstruct_indirect_combinations_max <#zfs_reconstruct_indirect_combinations_max>`__ +- `zfs_removal_ignore_errors <#zfs_removal_ignore_errors>`__ +- `zfs_removal_suspend_progress <#zfs_removal_suspend_progress>`__ +- `vdev_removal_max_span <#vdev_removal_max_span>`__ + +volume +^^^^^^ + +- `zfs_max_recordsize <#zfs_max_recordsize>`__ +- `zvol_inhibit_dev <#zvol_inhibit_dev>`__ +- `zvol_major <#zvol_major>`__ +- `zvol_max_discard_blocks <#zvol_max_discard_blocks>`__ +- `zvol_prefetch_bytes <#zvol_prefetch_bytes>`__ +- `zvol_request_sync <#zvol_request_sync>`__ +- `zvol_threads <#zvol_threads>`__ +- `zvol_volmode <#zvol_volmode>`__ + +write_throttle +^^^^^^^^^^^^^^ + +- `zfs_delay_min_dirty_percent <#zfs_delay_min_dirty_percent>`__ +- `zfs_delay_scale <#zfs_delay_scale>`__ +- `zfs_dirty_data_max <#zfs_dirty_data_max>`__ +- `zfs_dirty_data_max_max <#zfs_dirty_data_max_max>`__ +- `zfs_dirty_data_max_max_percent <#zfs_dirty_data_max_max_percent>`__ +- `zfs_dirty_data_max_percent <#zfs_dirty_data_max_percent>`__ +- `zfs_dirty_data_sync <#zfs_dirty_data_sync>`__ +- `zfs_dirty_data_sync_percent <#zfs_dirty_data_sync_percent>`__ + +zed +^^^ + +- `zfs_checksums_per_second <#zfs_checksums_per_second>`__ +- `zfs_delays_per_second <#zfs_delays_per_second>`__ +- `zio_slow_io_ms <#zio_slow_io_ms>`__ + +ZIL +^^^ + +- `zfs_commit_timeout_pct <#zfs_commit_timeout_pct>`__ +- `zfs_immediate_write_sz <#zfs_immediate_write_sz>`__ +- `zfs_zil_clean_taskq_maxalloc <#zfs_zil_clean_taskq_maxalloc>`__ +- `zfs_zil_clean_taskq_minalloc <#zfs_zil_clean_taskq_minalloc>`__ +- `zfs_zil_clean_taskq_nthr_pct <#zfs_zil_clean_taskq_nthr_pct>`__ +- `zil_nocacheflush <#zil_nocacheflush>`__ +- `zil_replay_disable <#zil_replay_disable>`__ +- `zil_slog_bulk <#zil_slog_bulk>`__ + +ZIO_scheduler +^^^^^^^^^^^^^ + +- `zfs_dirty_data_sync <#zfs_dirty_data_sync>`__ +- `zfs_dirty_data_sync_percent <#zfs_dirty_data_sync_percent>`__ +- `zfs_resilver_delay <#zfs_resilver_delay>`__ +- `zfs_scan_idle <#zfs_scan_idle>`__ +- `zfs_scrub_delay <#zfs_scrub_delay>`__ +- `zfs_top_maxinflight <#zfs_top_maxinflight>`__ +- `zfs_txg_timeout <#zfs_txg_timeout>`__ +- 
`zfs_vdev_aggregate_trim <#zfs_vdev_aggregate_trim>`__ +- `zfs_vdev_aggregation_limit <#zfs_vdev_aggregation_limit>`__ +- `zfs_vdev_aggregation_limit_non_rotating <#zfs_vdev_aggregation_limit_non_rotating>`__ +- `zfs_vdev_async_read_max_active <#zfs_vdev_async_read_max_active>`__ +- `zfs_vdev_async_read_min_active <#zfs_vdev_async_read_min_active>`__ +- `zfs_vdev_async_write_active_max_dirty_percent <#zfs_vdev_async_write_active_max_dirty_percent>`__ +- `zfs_vdev_async_write_active_min_dirty_percent <#zfs_vdev_async_write_active_min_dirty_percent>`__ +- `zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ +- `zfs_vdev_async_write_min_active <#zfs_vdev_async_write_min_active>`__ +- `zfs_vdev_initializing_max_active <#zfs_vdev_initializing_max_active>`__ +- `zfs_vdev_initializing_min_active <#zfs_vdev_initializing_min_active>`__ +- `zfs_vdev_max_active <#zfs_vdev_max_active>`__ +- `zfs_vdev_queue_depth_pct <#zfs_vdev_queue_depth_pct>`__ +- `zfs_vdev_read_gap_limit <#zfs_vdev_read_gap_limit>`__ +- `zfs_vdev_removal_max_active <#zfs_vdev_removal_max_active>`__ +- `zfs_vdev_removal_min_active <#zfs_vdev_removal_min_active>`__ +- `zfs_vdev_scheduler <#zfs_vdev_scheduler>`__ +- `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +- `zfs_vdev_scrub_min_active <#zfs_vdev_scrub_min_active>`__ +- `zfs_vdev_sync_read_max_active <#zfs_vdev_sync_read_max_active>`__ +- `zfs_vdev_sync_read_min_active <#zfs_vdev_sync_read_min_active>`__ +- `zfs_vdev_sync_write_max_active <#zfs_vdev_sync_write_max_active>`__ +- `zfs_vdev_sync_write_min_active <#zfs_vdev_sync_write_min_active>`__ +- `zfs_vdev_trim_max_active <#zfs_vdev_trim_max_active>`__ +- `zfs_vdev_trim_min_active <#zfs_vdev_trim_min_active>`__ +- `zfs_vdev_write_gap_limit <#zfs_vdev_write_gap_limit>`__ +- `zio_dva_throttle_enabled <#zio_dva_throttle_enabled>`__ +- `zio_requeue_io_start_cut_in_line <#zio_requeue_io_start_cut_in_line>`__ +- `zio_taskq_batch_pct <#zio_taskq_batch_pct>`__ + +Index +----- + +- `zfs_abd_scatter_enabled <#zfs_abd_scatter_enabled>`__ +- `zfs_abd_scatter_max_order <#zfs_abd_scatter_max_order>`__ +- `zfs_abd_scatter_min_size <#zfs_abd_scatter_min_size>`__ +- `zfs_admin_snapshot <#zfs_admin_snapshot>`__ +- `zfs_arc_average_blocksize <#zfs_arc_average_blocksize>`__ +- `zfs_arc_dnode_limit <#zfs_arc_dnode_limit>`__ +- `zfs_arc_dnode_limit_percent <#zfs_arc_dnode_limit_percent>`__ +- `zfs_arc_dnode_reduce_percent <#zfs_arc_dnode_reduce_percent>`__ +- `zfs_arc_evict_batch_limit <#zfs_arc_evict_batch_limit>`__ +- `zfs_arc_grow_retry <#zfs_arc_grow_retry>`__ +- `zfs_arc_lotsfree_percent <#zfs_arc_lotsfree_percent>`__ +- `zfs_arc_max <#zfs_arc_max>`__ +- `zfs_arc_meta_adjust_restarts <#zfs_arc_meta_adjust_restarts>`__ +- `zfs_arc_meta_limit <#zfs_arc_meta_limit>`__ +- `zfs_arc_meta_limit_percent <#zfs_arc_meta_limit_percent>`__ +- `zfs_arc_meta_min <#zfs_arc_meta_min>`__ +- `zfs_arc_meta_prune <#zfs_arc_meta_prune>`__ +- `zfs_arc_meta_strategy <#zfs_arc_meta_strategy>`__ +- `zfs_arc_min <#zfs_arc_min>`__ +- `zfs_arc_min_prefetch_lifespan <#zfs_arc_min_prefetch_lifespan>`__ +- `zfs_arc_min_prefetch_ms <#zfs_arc_min_prefetch_ms>`__ +- `zfs_arc_min_prescient_prefetch_ms <#zfs_arc_min_prescient_prefetch_ms>`__ +- `zfs_arc_overflow_shift <#zfs_arc_overflow_shift>`__ +- `zfs_arc_p_dampener_disable <#zfs_arc_p_dampener_disable>`__ +- `zfs_arc_p_min_shift <#zfs_arc_p_min_shift>`__ +- `zfs_arc_pc_percent <#zfs_arc_pc_percent>`__ +- `zfs_arc_shrink_shift <#zfs_arc_shrink_shift>`__ +- `zfs_arc_sys_free <#zfs_arc_sys_free>`__ 
+- `zfs_async_block_max_blocks <#zfs_async_block_max_blocks>`__ +- `zfs_autoimport_disable <#zfs_autoimport_disable>`__ +- `zfs_checksum_events_per_second <#zfs_checksum_events_per_second>`__ +- `zfs_checksums_per_second <#zfs_checksums_per_second>`__ +- `zfs_commit_timeout_pct <#zfs_commit_timeout_pct>`__ +- `zfs_compressed_arc_enabled <#zfs_compressed_arc_enabled>`__ +- `zfs_condense_indirect_commit_entry_delay_ms <#zfs_condense_indirect_commit_entry_delay_ms>`__ +- `zfs_condense_indirect_vdevs_enable <#zfs_condense_indirect_vdevs_enable>`__ +- `zfs_condense_max_obsolete_bytes <#zfs_condense_max_obsolete_bytes>`__ +- `zfs_condense_min_mapping_bytes <#zfs_condense_min_mapping_bytes>`__ +- `zfs_dbgmsg_enable <#zfs_dbgmsg_enable>`__ +- `zfs_dbgmsg_maxsize <#zfs_dbgmsg_maxsize>`__ +- `dbuf_cache_hiwater_pct <#dbuf_cache_hiwater_pct>`__ +- `dbuf_cache_lowater_pct <#dbuf_cache_lowater_pct>`__ +- `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +- `dbuf_cache_max_shift <#dbuf_cache_max_shift>`__ +- `dbuf_cache_shift <#dbuf_cache_shift>`__ +- `dbuf_metadata_cache_max_bytes <#dbuf_metadata_cache_max_bytes>`__ +- `dbuf_metadata_cache_shift <#dbuf_metadata_cache_shift>`__ +- `zfs_dbuf_state_index <#zfs_dbuf_state_index>`__ +- `zfs_ddt_data_is_special <#zfs_ddt_data_is_special>`__ +- `zfs_deadman_checktime_ms <#zfs_deadman_checktime_ms>`__ +- `zfs_deadman_enabled <#zfs_deadman_enabled>`__ +- `zfs_deadman_failmode <#zfs_deadman_failmode>`__ +- `zfs_deadman_synctime_ms <#zfs_deadman_synctime_ms>`__ +- `zfs_deadman_ziotime_ms <#zfs_deadman_ziotime_ms>`__ +- `zfs_dedup_prefetch <#zfs_dedup_prefetch>`__ +- `zfs_delay_min_dirty_percent <#zfs_delay_min_dirty_percent>`__ +- `zfs_delay_scale <#zfs_delay_scale>`__ +- `zfs_delays_per_second <#zfs_delays_per_second>`__ +- `zfs_delete_blocks <#zfs_delete_blocks>`__ +- `zfs_dirty_data_max <#zfs_dirty_data_max>`__ +- `zfs_dirty_data_max_max <#zfs_dirty_data_max_max>`__ +- `zfs_dirty_data_max_max_percent <#zfs_dirty_data_max_max_percent>`__ +- `zfs_dirty_data_max_percent <#zfs_dirty_data_max_percent>`__ +- `zfs_dirty_data_sync <#zfs_dirty_data_sync>`__ +- `zfs_dirty_data_sync_percent <#zfs_dirty_data_sync_percent>`__ +- `zfs_disable_dup_eviction <#zfs_disable_dup_eviction>`__ +- `zfs_disable_ivset_guid_check <#zfs_disable_ivset_guid_check>`__ +- `dmu_object_alloc_chunk_shift <#dmu_object_alloc_chunk_shift>`__ +- `zfs_dmu_offset_next_sync <#zfs_dmu_offset_next_sync>`__ +- `zfs_expire_snapshot <#zfs_expire_snapshot>`__ +- `zfs_flags <#zfs_flags>`__ +- `zfs_fletcher_4_impl <#zfs_fletcher_4_impl>`__ +- `zfs_free_bpobj_enabled <#zfs_free_bpobj_enabled>`__ +- `zfs_free_leak_on_eio <#zfs_free_leak_on_eio>`__ +- `zfs_free_max_blocks <#zfs_free_max_blocks>`__ +- `zfs_free_min_time_ms <#zfs_free_min_time_ms>`__ +- `icp_aes_impl <#icp_aes_impl>`__ +- `icp_gcm_impl <#icp_gcm_impl>`__ +- `ignore_hole_birth <#ignore_hole_birth>`__ +- `zfs_immediate_write_sz <#zfs_immediate_write_sz>`__ +- `zfs_initialize_value <#zfs_initialize_value>`__ +- `zfs_key_max_salt_uses <#zfs_key_max_salt_uses>`__ +- `l2arc_feed_again <#l2arc_feed_again>`__ +- `l2arc_feed_min_ms <#l2arc_feed_min_ms>`__ +- `l2arc_feed_secs <#l2arc_feed_secs>`__ +- `l2arc_headroom <#l2arc_headroom>`__ +- `l2arc_headroom_boost <#l2arc_headroom_boost>`__ +- `l2arc_nocompress <#l2arc_nocompress>`__ +- `l2arc_noprefetch <#l2arc_noprefetch>`__ +- `l2arc_norw <#l2arc_norw>`__ +- `l2arc_write_boost <#l2arc_write_boost>`__ +- `l2arc_write_max <#l2arc_write_max>`__ +- `zfs_lua_max_instrlimit <#zfs_lua_max_instrlimit>`__ +- 
`zfs_lua_max_memlimit <#zfs_lua_max_memlimit>`__ +- `zfs_max_dataset_nesting <#zfs_max_dataset_nesting>`__ +- `zfs_max_missing_tvds <#zfs_max_missing_tvds>`__ +- `zfs_max_recordsize <#zfs_max_recordsize>`__ +- `zfs_mdcomp_disable <#zfs_mdcomp_disable>`__ +- `metaslab_aliquot <#metaslab_aliquot>`__ +- `metaslab_bias_enabled <#metaslab_bias_enabled>`__ +- `metaslab_debug_load <#metaslab_debug_load>`__ +- `metaslab_debug_unload <#metaslab_debug_unload>`__ +- `metaslab_force_ganging <#metaslab_force_ganging>`__ +- `metaslab_fragmentation_factor_enabled <#metaslab_fragmentation_factor_enabled>`__ +- `zfs_metaslab_fragmentation_threshold <#zfs_metaslab_fragmentation_threshold>`__ +- `metaslab_lba_weighting_enabled <#metaslab_lba_weighting_enabled>`__ +- `metaslab_preload_enabled <#metaslab_preload_enabled>`__ +- `zfs_metaslab_segment_weight_enabled <#zfs_metaslab_segment_weight_enabled>`__ +- `zfs_metaslab_switch_threshold <#zfs_metaslab_switch_threshold>`__ +- `metaslabs_per_vdev <#metaslabs_per_vdev>`__ +- `zfs_mg_fragmentation_threshold <#zfs_mg_fragmentation_threshold>`__ +- `zfs_mg_noalloc_threshold <#zfs_mg_noalloc_threshold>`__ +- `zfs_multihost_fail_intervals <#zfs_multihost_fail_intervals>`__ +- `zfs_multihost_history <#zfs_multihost_history>`__ +- `zfs_multihost_import_intervals <#zfs_multihost_import_intervals>`__ +- `zfs_multihost_interval <#zfs_multihost_interval>`__ +- `zfs_multilist_num_sublists <#zfs_multilist_num_sublists>`__ +- `zfs_no_scrub_io <#zfs_no_scrub_io>`__ +- `zfs_no_scrub_prefetch <#zfs_no_scrub_prefetch>`__ +- `zfs_nocacheflush <#zfs_nocacheflush>`__ +- `zfs_nopwrite_enabled <#zfs_nopwrite_enabled>`__ +- `zfs_object_mutex_size <#zfs_object_mutex_size>`__ +- `zfs_obsolete_min_time_ms <#zfs_obsolete_min_time_ms>`__ +- `zfs_override_estimate_recordsize <#zfs_override_estimate_recordsize>`__ +- `zfs_pd_bytes_max <#zfs_pd_bytes_max>`__ +- `zfs_per_txg_dirty_frees_percent <#zfs_per_txg_dirty_frees_percent>`__ +- `zfs_prefetch_disable <#zfs_prefetch_disable>`__ +- `zfs_qat_checksum_disable <#zfs_qat_checksum_disable>`__ +- `zfs_qat_compress_disable <#zfs_qat_compress_disable>`__ +- `zfs_qat_disable <#zfs_qat_disable>`__ +- `zfs_qat_encrypt_disable <#zfs_qat_encrypt_disable>`__ +- `zfs_read_chunk_size <#zfs_read_chunk_size>`__ +- `zfs_read_history <#zfs_read_history>`__ +- `zfs_read_history_hits <#zfs_read_history_hits>`__ +- `zfs_reconstruct_indirect_combinations_max <#zfs_reconstruct_indirect_combinations_max>`__ +- `zfs_recover <#zfs_recover>`__ +- `zfs_recv_queue_length <#zfs_recv_queue_length>`__ +- `zfs_removal_ignore_errors <#zfs_removal_ignore_errors>`__ +- `zfs_removal_suspend_progress <#zfs_removal_suspend_progress>`__ +- `zfs_remove_max_segment <#zfs_remove_max_segment>`__ +- `zfs_resilver_delay <#zfs_resilver_delay>`__ +- `zfs_resilver_disable_defer <#zfs_resilver_disable_defer>`__ +- `zfs_resilver_min_time_ms <#zfs_resilver_min_time_ms>`__ +- `zfs_scan_checkpoint_intval <#zfs_scan_checkpoint_intval>`__ +- `zfs_scan_fill_weight <#zfs_scan_fill_weight>`__ +- `zfs_scan_idle <#zfs_scan_idle>`__ +- `zfs_scan_ignore_errors <#zfs_scan_ignore_errors>`__ +- `zfs_scan_issue_strategy <#zfs_scan_issue_strategy>`__ +- `zfs_scan_legacy <#zfs_scan_legacy>`__ +- `zfs_scan_max_ext_gap <#zfs_scan_max_ext_gap>`__ +- `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__ +- `zfs_scan_mem_lim_soft_fact <#zfs_scan_mem_lim_soft_fact>`__ +- `zfs_scan_min_time_ms <#zfs_scan_min_time_ms>`__ +- `zfs_scan_strict_mem_lim <#zfs_scan_strict_mem_lim>`__ +- `zfs_scan_suspend_progress 
<#zfs_scan_suspend_progress>`__ +- `zfs_scan_vdev_limit <#zfs_scan_vdev_limit>`__ +- `zfs_scrub_delay <#zfs_scrub_delay>`__ +- `zfs_scrub_min_time_ms <#zfs_scrub_min_time_ms>`__ +- `zfs_send_corrupt_data <#zfs_send_corrupt_data>`__ +- `send_holes_without_birth_time <#send_holes_without_birth_time>`__ +- `zfs_send_queue_length <#zfs_send_queue_length>`__ +- `zfs_send_unmodified_spill_blocks <#zfs_send_unmodified_spill_blocks>`__ +- `zfs_slow_io_events_per_second <#zfs_slow_io_events_per_second>`__ +- `spa_asize_inflation <#spa_asize_inflation>`__ +- `spa_config_path <#spa_config_path>`__ +- `zfs_spa_discard_memory_limit <#zfs_spa_discard_memory_limit>`__ +- `spa_load_print_vdev_tree <#spa_load_print_vdev_tree>`__ +- `spa_load_verify_data <#spa_load_verify_data>`__ +- `spa_load_verify_maxinflight <#spa_load_verify_maxinflight>`__ +- `spa_load_verify_metadata <#spa_load_verify_metadata>`__ +- `spa_load_verify_shift <#spa_load_verify_shift>`__ +- `spa_slop_shift <#spa_slop_shift>`__ +- `zfs_special_class_metadata_reserve_pct <#zfs_special_class_metadata_reserve_pct>`__ +- `spl_hostid <#spl_hostid>`__ +- `spl_hostid_path <#spl_hostid_path>`__ +- `spl_kmem_alloc_max <#spl_kmem_alloc_max>`__ +- `spl_kmem_alloc_warn <#spl_kmem_alloc_warn>`__ +- `spl_kmem_cache_expire <#spl_kmem_cache_expire>`__ +- `spl_kmem_cache_kmem_limit <#spl_kmem_cache_kmem_limit>`__ +- `spl_kmem_cache_kmem_threads <#spl_kmem_cache_kmem_threads>`__ +- `spl_kmem_cache_magazine_size <#spl_kmem_cache_magazine_size>`__ +- `spl_kmem_cache_max_size <#spl_kmem_cache_max_size>`__ +- `spl_kmem_cache_obj_per_slab <#spl_kmem_cache_obj_per_slab>`__ +- `spl_kmem_cache_obj_per_slab_min <#spl_kmem_cache_obj_per_slab_min>`__ +- `spl_kmem_cache_reclaim <#spl_kmem_cache_reclaim>`__ +- `spl_kmem_cache_slab_limit <#spl_kmem_cache_slab_limit>`__ +- `spl_max_show_tasks <#spl_max_show_tasks>`__ +- `spl_panic_halt <#spl_panic_halt>`__ +- `spl_taskq_kick <#spl_taskq_kick>`__ +- `spl_taskq_thread_bind <#spl_taskq_thread_bind>`__ +- `spl_taskq_thread_dynamic <#spl_taskq_thread_dynamic>`__ +- `spl_taskq_thread_priority <#spl_taskq_thread_priority>`__ +- `spl_taskq_thread_sequential <#spl_taskq_thread_sequential>`__ +- `zfs_sync_pass_deferred_free <#zfs_sync_pass_deferred_free>`__ +- `zfs_sync_pass_dont_compress <#zfs_sync_pass_dont_compress>`__ +- `zfs_sync_pass_rewrite <#zfs_sync_pass_rewrite>`__ +- `zfs_sync_taskq_batch_pct <#zfs_sync_taskq_batch_pct>`__ +- `zfs_top_maxinflight <#zfs_top_maxinflight>`__ +- `zfs_trim_extent_bytes_max <#zfs_trim_extent_bytes_max>`__ +- `zfs_trim_extent_bytes_min <#zfs_trim_extent_bytes_min>`__ +- `zfs_trim_metaslab_skip <#zfs_trim_metaslab_skip>`__ +- `zfs_trim_queue_limit <#zfs_trim_queue_limit>`__ +- `zfs_trim_txg_batch <#zfs_trim_txg_batch>`__ +- `zfs_txg_history <#zfs_txg_history>`__ +- `zfs_txg_timeout <#zfs_txg_timeout>`__ +- `zfs_unlink_suspend_progress <#zfs_unlink_suspend_progress>`__ +- `zfs_user_indirect_is_special <#zfs_user_indirect_is_special>`__ +- `zfs_vdev_aggregate_trim <#zfs_vdev_aggregate_trim>`__ +- `zfs_vdev_aggregation_limit <#zfs_vdev_aggregation_limit>`__ +- `zfs_vdev_aggregation_limit_non_rotating <#zfs_vdev_aggregation_limit_non_rotating>`__ +- `zfs_vdev_async_read_max_active <#zfs_vdev_async_read_max_active>`__ +- `zfs_vdev_async_read_min_active <#zfs_vdev_async_read_min_active>`__ +- `zfs_vdev_async_write_active_max_dirty_percent <#zfs_vdev_async_write_active_max_dirty_percent>`__ +- `zfs_vdev_async_write_active_min_dirty_percent <#zfs_vdev_async_write_active_min_dirty_percent>`__ +- 
`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ +- `zfs_vdev_async_write_min_active <#zfs_vdev_async_write_min_active>`__ +- `zfs_vdev_cache_bshift <#zfs_vdev_cache_bshift>`__ +- `zfs_vdev_cache_max <#zfs_vdev_cache_max>`__ +- `zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ +- `zfs_vdev_default_ms_count <#zfs_vdev_default_ms_count>`__ +- `zfs_vdev_initializing_max_active <#zfs_vdev_initializing_max_active>`__ +- `zfs_vdev_initializing_min_active <#zfs_vdev_initializing_min_active>`__ +- `zfs_vdev_max_active <#zfs_vdev_max_active>`__ +- `zfs_vdev_min_ms_count <#zfs_vdev_min_ms_count>`__ +- `zfs_vdev_mirror_non_rotating_inc <#zfs_vdev_mirror_non_rotating_inc>`__ +- `zfs_vdev_mirror_non_rotating_seek_inc <#zfs_vdev_mirror_non_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_inc <#zfs_vdev_mirror_rotating_inc>`__ +- `zfs_vdev_mirror_rotating_seek_inc <#zfs_vdev_mirror_rotating_seek_inc>`__ +- `zfs_vdev_mirror_rotating_seek_offset <#zfs_vdev_mirror_rotating_seek_offset>`__ +- `zfs_vdev_ms_count_limit <#zfs_vdev_ms_count_limit>`__ +- `zfs_vdev_queue_depth_pct <#zfs_vdev_queue_depth_pct>`__ +- `zfs_vdev_raidz_impl <#zfs_vdev_raidz_impl>`__ +- `zfs_vdev_read_gap_limit <#zfs_vdev_read_gap_limit>`__ +- `zfs_vdev_removal_max_active <#zfs_vdev_removal_max_active>`__ +- `vdev_removal_max_span <#vdev_removal_max_span>`__ +- `zfs_vdev_removal_min_active <#zfs_vdev_removal_min_active>`__ +- `zfs_vdev_scheduler <#zfs_vdev_scheduler>`__ +- `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +- `zfs_vdev_scrub_min_active <#zfs_vdev_scrub_min_active>`__ +- `zfs_vdev_sync_read_max_active <#zfs_vdev_sync_read_max_active>`__ +- `zfs_vdev_sync_read_min_active <#zfs_vdev_sync_read_min_active>`__ +- `zfs_vdev_sync_write_max_active <#zfs_vdev_sync_write_max_active>`__ +- `zfs_vdev_sync_write_min_active <#zfs_vdev_sync_write_min_active>`__ +- `zfs_vdev_trim_max_active <#zfs_vdev_trim_max_active>`__ +- `zfs_vdev_trim_min_active <#zfs_vdev_trim_min_active>`__ +- `vdev_validate_skip <#vdev_validate_skip>`__ +- `zfs_vdev_write_gap_limit <#zfs_vdev_write_gap_limit>`__ +- `zfs_zevent_cols <#zfs_zevent_cols>`__ +- `zfs_zevent_console <#zfs_zevent_console>`__ +- `zfs_zevent_len_max <#zfs_zevent_len_max>`__ +- `zfetch_array_rd_sz <#zfetch_array_rd_sz>`__ +- `zfetch_max_distance <#zfetch_max_distance>`__ +- `zfetch_max_streams <#zfetch_max_streams>`__ +- `zfetch_min_sec_reap <#zfetch_min_sec_reap>`__ +- `zfs_zil_clean_taskq_maxalloc <#zfs_zil_clean_taskq_maxalloc>`__ +- `zfs_zil_clean_taskq_minalloc <#zfs_zil_clean_taskq_minalloc>`__ +- `zfs_zil_clean_taskq_nthr_pct <#zfs_zil_clean_taskq_nthr_pct>`__ +- `zil_nocacheflush <#zil_nocacheflush>`__ +- `zil_replay_disable <#zil_replay_disable>`__ +- `zil_slog_bulk <#zil_slog_bulk>`__ +- `zio_deadman_log_all <#zio_deadman_log_all>`__ +- `zio_decompress_fail_fraction <#zio_decompress_fail_fraction>`__ +- `zio_delay_max <#zio_delay_max>`__ +- `zio_dva_throttle_enabled <#zio_dva_throttle_enabled>`__ +- `zio_requeue_io_start_cut_in_line <#zio_requeue_io_start_cut_in_line>`__ +- `zio_slow_io_ms <#zio_slow_io_ms>`__ +- `zio_taskq_batch_pct <#zio_taskq_batch_pct>`__ +- `zvol_inhibit_dev <#zvol_inhibit_dev>`__ +- `zvol_major <#zvol_major>`__ +- `zvol_max_discard_blocks <#zvol_max_discard_blocks>`__ +- `zvol_prefetch_bytes <#zvol_prefetch_bytes>`__ +- `zvol_request_sync <#zvol_request_sync>`__ +- `zvol_threads <#zvol_threads>`__ +- `zvol_volmode <#zvol_volmode>`__ + +.. 
_zfs-module-parameters-1: + +ZFS Module Parameters +===================== + +ignore_hole_birth +~~~~~~~~~~~~~~~~~ + +When set, the hole_birth optimization will not be used and all holes +will always be sent by ``zfs send`` In the source code, +ignore_hole_birth is an alias for and SysFS PARAMETER for +`send_holes_without_birth_time <#send_holes_without_birth_time>`__. + ++-------------------+-------------------------------------------------+ +| ignore_hole_birth | Notes | ++===================+=================================================+ +| Tags | `send <#send>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Enable if you suspect your datasets are | +| | affected by a bug in hole_birth during | +| | ``zfs send`` operations | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=disabled, 1=enabled | ++-------------------+-------------------------------------------------+ +| Default | 1 (hole birth optimization is ignored) | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | TBD | ++-------------------+-------------------------------------------------+ + +l2arc_feed_again +~~~~~~~~~~~~~~~~ + +Turbo L2ARC cache warm-up. When the L2ARC is cold the fill interval will +be set to aggressively fill as fast as possible. + ++-------------------+-------------------------------------------------+ +| l2arc_feed_again | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If cache devices exist and it is desired to | +| | fill them as fast as possible | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=disabled, 1=enabled | ++-------------------+-------------------------------------------------+ +| Default | 1 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | TBD | ++-------------------+-------------------------------------------------+ + +l2arc_feed_min_ms +~~~~~~~~~~~~~~~~~ + +Minimum time period for aggressively feeding the L2ARC. The L2ARC feed +thread wakes up once per second (see +`l2arc_feed_secs <#l2arc_feed_secs>`__) to look for data to feed into +the L2ARC. ``l2arc_feed_min_ms`` only affects the turbo L2ARC cache +warm-up and allows the aggressiveness to be adjusted. 
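As a general illustration, the L2ARC feed tunables in this group can be read and, because they are dynamic, adjusted at runtime through ``/sys/module/zfs/parameters``; a ``modprobe.d`` entry makes a change persistent. This is only a sketch, and the value ``400`` is illustrative rather than a recommendation.

::

   # current values
   cat /sys/module/zfs/parameters/l2arc_feed_secs
   cat /sys/module/zfs/parameters/l2arc_feed_min_ms

   # reduce warm-up aggressiveness for the current boot (run as root)
   echo 400 > /sys/module/zfs/parameters/l2arc_feed_min_ms

   # make the setting persistent across module reloads
   echo "options zfs l2arc_feed_min_ms=400" >> /etc/modprobe.d/zfs.conf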
+ ++-------------------+-------------------------------------------------+ +| l2arc_feed_min_ms | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If cache devices exist and | +| | `l2arc_feed_again <#l2arc_feed_again>`__ and | +| | the feed is too aggressive, then this tunable | +| | can be adjusted to reduce the impact of the | +| | fill | ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | milliseconds | ++-------------------+-------------------------------------------------+ +| Range | 0 to (1000 \* l2arc_feed_secs) | ++-------------------+-------------------------------------------------+ +| Default | 200 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | 0.6 and later | ++-------------------+-------------------------------------------------+ + +l2arc_feed_secs +~~~~~~~~~~~~~~~ + +Seconds between waking the L2ARC feed thread. One feed thread works for +all cache devices in turn. + +If the pool that owns a cache device is imported readonly, then the feed +thread is delayed 5 \* `l2arc_feed_secs <#l2arc_feed_secs>`__ before +moving onto the next cache device. If multiple pools are imported with +cache devices and one pool with cache is imported readonly, the L2ARC +feed rate to all caches can be slowed. + +================= ================================== +l2arc_feed_secs Notes +================= ================================== +Tags `ARC <#arc>`__, `L2ARC <#l2arc>`__ +When to change Do not change +Data Type uint64 +Units seconds +Range 1 to UINT64_MAX +Default 1 +Change Dynamic +Versions Affected 0.6 and later +================= ================================== + +l2arc_headroom +~~~~~~~~~~~~~~ + +How far through the ARC lists to search for L2ARC cacheable content, +expressed as a multiplier of `l2arc_write_max <#l2arc_write_max>`__ + ++-------------------+-------------------------------------------------+ +| l2arc_headroom | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If the rate of change in the ARC is faster than | +| | the overall L2ARC feed rate, then increasing | +| | l2arc_headroom can increase L2ARC efficiency. | +| | Setting the value too large can cause the L2ARC | +| | feed thread to consume more CPU time looking | +| | for data to feed. 
| ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | unit | ++-------------------+-------------------------------------------------+ +| Range | 0 to UINT64_MAX | ++-------------------+-------------------------------------------------+ +| Default | 2 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | 0.6 and later | ++-------------------+-------------------------------------------------+ + +l2arc_headroom_boost +~~~~~~~~~~~~~~~~~~~~ + +Percentage scale for `l2arc_headroom <#l2arc_headroom>`__ when L2ARC +contents are being successfully compressed before writing. + ++----------------------+----------------------------------------------+ +| l2arc_headroom_boost | Notes | ++======================+==============================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++----------------------+----------------------------------------------+ +| When to change | If average compression efficiency is greater | +| | than 2:1, then increasing | +| | `l2a | +| | rc_headroom_boost <#l2arc_headroom_boost>`__ | +| | can increase the L2ARC feed rate | ++----------------------+----------------------------------------------+ +| Data Type | uint64 | ++----------------------+----------------------------------------------+ +| Units | percent | ++----------------------+----------------------------------------------+ +| Range | 100 to UINT64_MAX, when set to 100, the | +| | L2ARC headroom boost feature is effectively | +| | disabled | ++----------------------+----------------------------------------------+ +| Default | 200 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | all | ++----------------------+----------------------------------------------+ + +l2arc_nocompress +~~~~~~~~~~~~~~~~ + +Disable writing compressed data to cache devices. Disabling allows the +legacy behavior of writing decompressed data to cache devices. + ++-------------------+-------------------------------------------------+ +| l2arc_nocompress | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | When testing compressed L2ARC feature | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=store compressed blocks in cache device, | +| | 1=store uncompressed blocks in cache device | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | deprecated in v0.7.0 by new compressed ARC | +| | design | ++-------------------+-------------------------------------------------+ + +l2arc_noprefetch +~~~~~~~~~~~~~~~~ + +Disables writing prefetched, but unused, buffers to cache devices. 
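One way to judge whether changing this parameter helps a particular workload is to compare the L2ARC counters in ``arcstats`` before and after the change, while the workload runs. The commands below are a sketch; run them as root and interpret the counters as trends over time rather than single samples.

::

   # L2ARC hit, miss and size counters
   grep -E '^l2_(hits|misses|size) ' /proc/spl/kstat/zfs/arcstats

   # allow prefetched buffers to be written to the cache devices
   echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

   # re-check the counters after the workload has run for a while
   grep -E '^l2_(hits|misses|size) ' /proc/spl/kstat/zfs/arcstats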
+ ++-------------------+-------------------------------------------------+ +| l2arc_noprefetch | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__, | +| | `prefetch <#prefetch>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Setting to 0 can increase L2ARC hit rates for | +| | workloads where the ARC is too small for a read | +| | workload that benefits from prefetching. Also, | +| | if the main pool devices are very slow, setting | +| | to 0 can improve some workloads such as | +| | backups. | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=write prefetched but unused buffers to cache | +| | devices, 1=do not write prefetched but unused | +| | buffers to cache devices | ++-------------------+-------------------------------------------------+ +| Default | 1 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.0 and later | ++-------------------+-------------------------------------------------+ + +l2arc_norw +~~~~~~~~~~ + +Disables writing to cache devices while they are being read. + ++-------------------+-------------------------------------------------+ +| l2arc_norw | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | In the early days of SSDs, some devices did not | +| | perform well when reading and writing | +| | simultaneously. Modern SSDs do not have these | +| | issues. | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=read and write simultaneously, 1=avoid writes | +| | when reading for antique SSDs | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +l2arc_write_boost +~~~~~~~~~~~~~~~~~ + +Until the ARC fills, increases the L2ARC fill rate +`l2arc_write_max <#l2arc_write_max>`__ by ``l2arc_write_boost``. + ++-------------------+-------------------------------------------------+ +| l2arc_write_boost | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | To fill the cache devices more aggressively | +| | after pool import. 
| ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 0 to UINT64_MAX | ++-------------------+-------------------------------------------------+ +| Default | 8,388,608 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +l2arc_write_max +~~~~~~~~~~~~~~~ + +Maximum number of bytes to be written to each cache device for each +L2ARC feed thread interval (see `l2arc_feed_secs <#l2arc_feed_secs>`__). +The actual limit can be adjusted by +`l2arc_write_boost <#l2arc_write_boost>`__. By default +`l2arc_feed_secs <#l2arc_feed_secs>`__ is 1 second, delivering a maximum +write workload to cache devices of 8 MiB/sec. + ++-------------------+-------------------------------------------------+ +| l2arc_write_max | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `L2ARC <#l2arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If the cache devices can sustain the write | +| | workload, increasing the rate of cache device | +| | fill when workloads generate new data at a rate | +| | higher than l2arc_write_max can increase L2ARC | +| | hit rate | ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 1 to UINT64_MAX | ++-------------------+-------------------------------------------------+ +| Default | 8,388,608 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +metaslab_aliquot +~~~~~~~~~~~~~~~~ + +Sets the metaslab granularity. Nominally, ZFS will try to allocate this +amount of data to a top-level vdev before moving on to the next +top-level vdev. This is roughly similar to what would be referred to as +the "stripe size" in traditional RAID arrays. + +When tuning for HDDs, it can be more efficient to have a few larger, +sequential writes to a device rather than switching to the next device. +Monitoring the size of contiguous writes to the disks relative to the +write throughput can be used to determine if increasing +``metaslab_aliquot`` can help. For modern devices, it is unlikely that +decreasing ``metaslab_aliquot`` from the default will help. + +If there is only one top-level vdev, this tunable is not used. 
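As a rough way to apply the guidance above, the per-vdev request-size histograms from ``zpool iostat -r`` (in versions that support request-size histograms) can be compared before and after raising ``metaslab_aliquot``. This is a sketch; ``tank`` is a hypothetical pool name and 1 MiB is only an illustrative value.

::

   # per-vdev request size histograms, sampled every 10 seconds
   zpool iostat -r tank 10

   # current allocation granularity in bytes
   cat /sys/module/zfs/parameters/metaslab_aliquot

   # try a larger granularity, for example 1 MiB (run as root)
   echo 1048576 > /sys/module/zfs/parameters/metaslab_aliquot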
+ ++-------------------+-------------------------------------------------+ +| metaslab_aliquot | Notes | ++===================+=================================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__, `vdev <#vdev>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If write performance increases as devices more | +| | efficiently write larger, contiguous blocks | ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 0 to UINT64_MAX | ++-------------------+-------------------------------------------------+ +| Default | 524,288 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +metaslab_bias_enabled +~~~~~~~~~~~~~~~~~~~~~ + +Enables metaslab group biasing based on a top-level vdev's utilization +relative to the pool. Nominally, all top-level devs are the same size +and the allocation is spread evenly. When the top-level vdevs are not of +the same size, for example if a new (empty) top-level is added to the +pool, this allows the new top-level vdev to get a larger portion of new +allocations. + ++-----------------------+---------------------------------------------+ +| metaslab_bias_enabled | Notes | ++=======================+=============================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__, `vdev <#vdev>`__ | ++-----------------------+---------------------------------------------+ +| When to change | If a new top-level vdev is added and you do | +| | not want to bias new allocations to the new | +| | top-level vdev | ++-----------------------+---------------------------------------------+ +| Data Type | boolean | ++-----------------------+---------------------------------------------+ +| Range | 0=spread evenly across top-level vdevs, | +| | 1=bias spread to favor less full top-level | +| | vdevs | ++-----------------------+---------------------------------------------+ +| Default | 1 | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-----------------------+---------------------------------------------+ + +zfs_metaslab_segment_weight_enabled +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Enables metaslab allocation based on largest free segment rather than +total amount of free space. The goal is to avoid metaslabs that exhibit +free space fragmentation: when there is a lot of small free spaces, but +few larger free spaces. + +If ``zfs_metaslab_segment_weight_enabled`` is enabled, then +`metaslab_fragmentation_factor_enabled <#metaslab_fragmentation_factor_enabled>`__ +is ignored. 
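For reference, the pool-wide fragmentation metric that motivates segment-based weighting can be read with ``zpool get``, and the current setting of this parameter is visible in ``/sys``. A sketch, with ``tank`` as a hypothetical pool name:

::

   # free-space fragmentation of the pool, as a percentage
   zpool get fragmentation tank

   # current segment-based weighting setting
   cat /sys/module/zfs/parameters/zfs_metaslab_segment_weight_enabled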
+ ++----------------------------------+----------------------------------+ +| zfs | Notes | +| _metaslab_segment_weight_enabled | | ++==================================+==================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__ | ++----------------------------------+----------------------------------+ +| When to change | When testing allocation and | +| | fragmentation | ++----------------------------------+----------------------------------+ +| Data Type | boolean | ++----------------------------------+----------------------------------+ +| Range | 0=do not consider metaslab | +| | fragmentation, 1=avoid metaslabs | +| | where free space is highly | +| | fragmented | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------------+----------------------------------+ + +zfs_metaslab_switch_threshold +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When using segment-based metaslab selection (see +`zfs_metaslab_segment_weight_enabled <#zfs_metaslab_segment_weight_enabled>`__), +continue allocating from the active metaslab until +``zfs_metaslab_switch_threshold`` worth of free space buckets have been +exhausted. + ++-------------------------------+-------------------------------------+ +| zfs_metaslab_switch_threshold | Notes | ++===============================+=====================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__ | ++-------------------------------+-------------------------------------+ +| When to change | When testing allocation and | +| | fragmentation | ++-------------------------------+-------------------------------------+ +| Data Type | uint64 | ++-------------------------------+-------------------------------------+ +| Units | free spaces | ++-------------------------------+-------------------------------------+ +| Range | 0 to UINT64_MAX | ++-------------------------------+-------------------------------------+ +| Default | 2 | ++-------------------------------+-------------------------------------+ +| Change | Dynamic | ++-------------------------------+-------------------------------------+ +| Versions Affected | v0.7.0 and later | ++-------------------------------+-------------------------------------+ + +metaslab_debug_load +~~~~~~~~~~~~~~~~~~~ + +When enabled, all metaslabs are loaded into memory during pool import. +Nominally, metaslab space map information is loaded and unloaded as +needed (see `metaslab_debug_unload <#metaslab_debug_unload>`__) + +It is difficult to predict how much RAM is required to store a space +map. An empty or completely full metaslab has a small space map. +However, a highly fragmented space map can consume significantly more +memory. + +Enabling ``metaslab_debug_load`` can increase pool import time. 
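Because the trade-off here is RAM against pool import time, this parameter is normally set before the pool is imported, for example through ``modprobe.d`` or on the ``modprobe`` command line. A sketch:

::

   # load all metaslab space maps at import time, persistently
   echo "options zfs metaslab_debug_load=1" >> /etc/modprobe.d/zfs.conf

   # or pass it when loading the module (only if zfs is not already loaded)
   modprobe zfs metaslab_debug_load=1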
+ ++---------------------+-----------------------------------------------+ +| metaslab_debug_load | Notes | ++=====================+===============================================+ +| Tags | `allocation <#allocation>`__, | +| | `memory <#memory>`__, | +| | `metaslab <#metaslab>`__ | ++---------------------+-----------------------------------------------+ +| When to change | When RAM is plentiful and pool import time is | +| | not a consideration | ++---------------------+-----------------------------------------------+ +| Data Type | boolean | ++---------------------+-----------------------------------------------+ +| Range | 0=do not load all metaslab info at pool | +| | import, 1=dynamically load metaslab info as | +| | needed | ++---------------------+-----------------------------------------------+ +| Default | 0 | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------+-----------------------------------------------+ + +metaslab_debug_unload +~~~~~~~~~~~~~~~~~~~~~ + +When enabled, prevents metaslab information from being dynamically +unloaded from RAM. Nominally, metaslab space map information is loaded +and unloaded as needed (see +`metaslab_debug_load <#metaslab_debug_load>`__) + +It is difficult to predict how much RAM is required to store a space +map. An empty or completely full metaslab has a small space map. +However, a highly fragmented space map can consume significantly more +memory. + +Enabling ``metaslab_debug_unload`` consumes RAM that would otherwise be +freed. + ++-----------------------+---------------------------------------------+ +| metaslab_debug_unload | Notes | ++=======================+=============================================+ +| Tags | `allocation <#allocation>`__, | +| | `memory <#memory>`__, | +| | `metaslab <#metaslab>`__ | ++-----------------------+---------------------------------------------+ +| When to change | When RAM is plentiful and the penalty for | +| | dynamically reloading metaslab info from | +| | the pool is high | ++-----------------------+---------------------------------------------+ +| Data Type | boolean | ++-----------------------+---------------------------------------------+ +| Range | 0=dynamically unload metaslab info, | +| | 1=unload metaslab info only upon pool | +| | export | ++-----------------------+---------------------------------------------+ +| Default | 0 | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-----------------------+---------------------------------------------+ + +metaslab_fragmentation_factor_enabled +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Enable use of the fragmentation metric in computing metaslab weights. + +In version v0.7.0, if +`zfs_metaslab_segment_weight_enabled <#zfs_metaslab_segment_weight_enabled>`__ +is enabled, then ``metaslab_fragmentation_factor_enabled`` is ignored. 
+ ++----------------------------------+----------------------------------+ +| metas | Notes | +| lab_fragmentation_factor_enabled | | ++==================================+==================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__ | ++----------------------------------+----------------------------------+ +| When to change | To test metaslab fragmentation | ++----------------------------------+----------------------------------+ +| Data Type | boolean | ++----------------------------------+----------------------------------+ +| Range | 0=do not consider metaslab free | +| | space fragmentation, 1=try to | +| | avoid fragmented metaslabs | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------------+----------------------------------+ + +metaslabs_per_vdev +~~~~~~~~~~~~~~~~~~ + +When a vdev is added, it will be divided into approximately, but no more +than, this number of metaslabs. + ++--------------------+------------------------------------------------+ +| metaslabs_per_vdev | Notes | ++====================+================================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__, `vdev <#vdev>`__ | ++--------------------+------------------------------------------------+ +| When to change | When testing metaslab allocation | ++--------------------+------------------------------------------------+ +| Data Type | uint64 | ++--------------------+------------------------------------------------+ +| Units | metaslabs | ++--------------------+------------------------------------------------+ +| Range | 16 to UINT64_MAX | ++--------------------+------------------------------------------------+ +| Default | 200 | ++--------------------+------------------------------------------------+ +| Change | Prior to pool creation or adding new top-level | +| | vdevs | ++--------------------+------------------------------------------------+ +| Versions Affected | all | ++--------------------+------------------------------------------------+ + +metaslab_preload_enabled +~~~~~~~~~~~~~~~~~~~~~~~~ + +Enable metaslab group preloading. Each top-level vdev has a metaslab +group. By default, up to 3 copies of metadata can exist and are +distributed across multiple top-level vdevs. +``metaslab_preload_enabled`` allows the corresponding metaslabs to be +preloaded, thus improving allocation efficiency. 
+ ++--------------------------+------------------------------------------+ +| metaslab_preload_enabled | Notes | ++==========================+==========================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__ | ++--------------------------+------------------------------------------+ +| When to change | When testing metaslab allocation | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=do not preload metaslab info, | +| | 1=preload up to 3 metaslabs | ++--------------------------+------------------------------------------+ +| Default | 1 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------+------------------------------------------+ + +metaslab_lba_weighting_enabled +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Modern HDDs have uniform bit density and constant angular velocity. +Therefore, the outer recording zones are faster (higher bandwidth) than +the inner zones by the ratio of outer to inner track diameter. The +difference in bandwidth can be 2:1, and is often available in the HDD +detailed specifications or drive manual. For HDDs when +``metaslab_lba_weighting_enabled`` is true, write allocation preference +is given to the metaslabs representing the outer recording zones. Thus +the allocation to metaslabs prefers faster bandwidth over free space. + +If the devices are not rotational, yet misrepresent themselves to the OS +as rotational, then disabling ``metaslab_lba_weighting_enabled`` can +result in more even, free-space-based allocation. + ++--------------------------------+------------------------------------+ +| metaslab_lba_weighting_enabled | Notes | ++================================+====================================+ +| Tags | `allocation <#allocation>`__, | +| | `metaslab <#metaslab>`__, | +| | `HDD <#hdd>`__, `SSD <#ssd>`__ | ++--------------------------------+------------------------------------+ +| When to change | disable if using only SSDs and | +| | version v0.6.4 or earlier | ++--------------------------------+------------------------------------+ +| Data Type | boolean | ++--------------------------------+------------------------------------+ +| Range | 0=do not use LBA weighting, 1=use | +| | LBA weighting | ++--------------------------------+------------------------------------+ +| Default | 1 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Verfication | The rotational setting described | +| | by a block device in sysfs by | +| | observing | +| | ``/sys/ | +| | block/DISK_NAME/queue/rotational`` | ++--------------------------------+------------------------------------+ +| Versions Affected | prior to v0.6.5, the check for | +| | non-rotation media did not exist | ++--------------------------------+------------------------------------+ + +spa_config_path +~~~~~~~~~~~~~~~ + +By default, the ``zpool import`` command searches for pool information +in the ``zpool.cache`` file. If the pool to be imported has an entry in +``zpool.cache`` then the devices do not have to be scanned to determine +if they are pool members. The path to the cache file is spa_config_path. 
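+
+As a quick sanity check, the sketch below (illustrative only, assuming
+the parameter is exposed under ``/sys/module/zfs/parameters``) prints
+the currently configured cache file path and whether that file is
+present:
+
+::
+
+   # Report the configured zpool.cache location and whether it exists.
+   import os
+
+   def cache_file_status():
+       with open("/sys/module/zfs/parameters/spa_config_path") as f:
+           path = f.read().strip()
+       if os.path.exists(path):
+           return "%s (%d bytes)" % (path, os.path.getsize(path))
+       return "%s (missing; devices are scanned on import)" % path
+
+   if __name__ == "__main__":
+       print(cache_file_status())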
+ +For more information on ``zpool import`` and the ``-o cachefile`` and +``-d`` options, see the man page for zpool(8) + +See also `zfs_autoimport_disable <#zfs_autoimport_disable>`__ + ++-------------------+-------------------------------------------------+ +| spa_config_path | Notes | ++===================+=================================================+ +| Tags | `import <#import>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If creating a non-standard distribution and the | +| | cachefile property is inconvenient | ++-------------------+-------------------------------------------------+ +| Data Type | string | ++-------------------+-------------------------------------------------+ +| Default | ``/etc/zfs/zpool.cache`` | ++-------------------+-------------------------------------------------+ +| Change | Dynamic, applies only to the next invocation of | +| | ``zpool import`` | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +spa_asize_inflation +~~~~~~~~~~~~~~~~~~~ + +Multiplication factor used to estimate actual disk consumption from the +size of data being written. The default value is a worst case estimate, +but lower values may be valid for a given pool depending on its +configuration. Pool administrators who understand the factors involved +may wish to specify a more realistic inflation factor, particularly if +they operate close to quota or capacity limits. + +The worst case space requirement for allocation is single-sector +max-parity RAIDZ blocks, in which case the space requirement is exactly +4 times the size, accounting for a maximum of 3 parity blocks. This is +added to the maximum number of ZFS ``copies`` parameter (copies max=3). +Additional space is required if the block could impact deduplication +tables. Altogether, the worst case is 24. + +If the estimation is not correct, then quotas or out-of-space conditions +can lead to optimistic expectations of the ability to allocate. +Applications are typically not prepared to deal with such failures and +can misbehave. + ++---------------------+-----------------------------------------------+ +| spa_asize_inflation | Notes | ++=====================+===============================================+ +| Tags | `allocation <#allocation>`__, `SPA <#spa>`__ | ++---------------------+-----------------------------------------------+ +| When to change | If the allocation requirements for the | +| | workload are well known and quotas are used | ++---------------------+-----------------------------------------------+ +| Data Type | uint64 | ++---------------------+-----------------------------------------------+ +| Units | unit | ++---------------------+-----------------------------------------------+ +| Range | 1 to 24 | ++---------------------+-----------------------------------------------+ +| Default | 24 | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.6.3 and later | ++---------------------+-----------------------------------------------+ + +spa_load_verify_data +~~~~~~~~~~~~~~~~~~~~ + +An extreme rewind import (see ``zpool import -X``) normally performs a +full traversal of all blocks in the pool for verification. If this +parameter is set to 0, the traversal skips non-metadata blocks. 
It can +be toggled once the import has started to stop or start the traversal of +non-metadata blocks. See also +`spa_load_verify_metadata <#spa_load_verify_metadata>`__. + ++----------------------+----------------------------------------------+ +| spa_load_verify_data | Notes | ++======================+==============================================+ +| Tags | `allocation <#allocation>`__, `SPA <#spa>`__ | ++----------------------+----------------------------------------------+ +| When to change | At the risk of data integrity, to speed | +| | extreme import of large pool | ++----------------------+----------------------------------------------+ +| Data Type | boolean | ++----------------------+----------------------------------------------+ +| Range | 0=do not verify data upon pool import, | +| | 1=verify pool data upon import | ++----------------------+----------------------------------------------+ +| Default | 1 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------+----------------------------------------------+ + +spa_load_verify_metadata +~~~~~~~~~~~~~~~~~~~~~~~~ + +An extreme rewind import (see ``zpool import -X``) normally performs a +full traversal of all blocks in the pool for verification. If this +parameter is set to 0, the traversal is not performed. It can be toggled +once the import has started to stop or start the traversal. See +`spa_load_verify_data <#spa_load_verify_data>`__ + ++--------------------------+------------------------------------------+ +| spa_load_verify_metadata | Notes | ++==========================+==========================================+ +| Tags | `import <#import>`__ | ++--------------------------+------------------------------------------+ +| When to change | At the risk of data integrity, to speed | +| | extreme import of large pool | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=do not verify metadata upon pool | +| | import, 1=verify pool metadata upon | +| | import | ++--------------------------+------------------------------------------+ +| Default | 1 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------+------------------------------------------+ + +spa_load_verify_maxinflight +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Maximum number of concurrent I/Os during the data verification performed +during an extreme rewind import (see ``zpool import -X``) + ++-----------------------------+---------------------------------------+ +| spa_load_verify_maxinflight | Notes | ++=============================+=======================================+ +| Tags | `import <#import>`__ | ++-----------------------------+---------------------------------------+ +| When to change | During an extreme rewind import, to | +| | match the concurrent I/O capabilities | +| | of the pool devices | ++-----------------------------+---------------------------------------+ +| Data Type | int | ++-----------------------------+---------------------------------------+ +| Units | I/Os | ++-----------------------------+---------------------------------------+ +| Range | 1 to MAX_INT | 
++-----------------------------+---------------------------------------+ +| Default | 10,000 | ++-----------------------------+---------------------------------------+ +| Change | Dynamic | ++-----------------------------+---------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-----------------------------+---------------------------------------+ + +spa_slop_shift +~~~~~~~~~~~~~~ + +Normally, the last 3.2% (1/(2^\ ``spa_slop_shift``)) of pool space is +reserved to ensure the pool doesn't run completely out of space, due to +unaccounted changes (e.g. to the MOS). This also limits the worst-case +time to allocate space. When less than this amount of free space exists, +most ZPL operations (e.g. write, create) return error:no space (ENOSPC). + +Changing spa_slop_shift affects the currently loaded ZFS module and all +imported pools. spa_slop_shift is not stored on disk. Beware when +importing full pools on systems with larger spa_slop_shift can lead to +over-full conditions. + +The minimum SPA slop space is limited to 128 MiB. + ++-------------------+-------------------------------------------------+ +| spa_slop_shift | Notes | ++===================+=================================================+ +| Tags | `allocation <#allocation>`__, `SPA <#spa>`__ | ++-------------------+-------------------------------------------------+ +| When to change | For large pools, when 3.2% may be too | +| | conservative and more usable space is desired, | +| | consider increasing ``spa_slop_shift`` | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | shift | ++-------------------+-------------------------------------------------+ +| Range | 1 to MAX_INT, however the practical upper limit | +| | is 15 for a system with 4TB of RAM | ++-------------------+-------------------------------------------------+ +| Default | 5 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++-------------------+-------------------------------------------------+ + +zfetch_array_rd_sz +~~~~~~~~~~~~~~~~~~ + +If prefetching is enabled, do not prefetch blocks larger than +``zfetch_array_rd_sz`` size. + +================== ================================================= +zfetch_array_rd_sz Notes +================== ================================================= +Tags `prefetch <#prefetch>`__ +When to change To allow prefetching when using large block sizes +Data Type unsigned long +Units bytes +Range 0 to MAX_ULONG +Default 1,048,576 (1 MiB) +Change Dynamic +Versions Affected all +================== ================================================= + +zfetch_max_distance +~~~~~~~~~~~~~~~~~~~ + +Limits the maximum number of bytes to prefetch per stream. 
+ ++---------------------+-----------------------------------------------+ +| zfetch_max_distance | Notes | ++=====================+===============================================+ +| Tags | `prefetch <#prefetch>`__ | ++---------------------+-----------------------------------------------+ +| When to change | Consider increasing read workloads that use | +| | large blocks and exhibit high prefetch hit | +| | ratios | ++---------------------+-----------------------------------------------+ +| Data Type | uint | ++---------------------+-----------------------------------------------+ +| Units | bytes | ++---------------------+-----------------------------------------------+ +| Range | 0 to UINT_MAX | ++---------------------+-----------------------------------------------+ +| Default | 8,388,608 | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.7.0 | ++---------------------+-----------------------------------------------+ + +zfetch_max_streams +~~~~~~~~~~~~~~~~~~ + +Maximum number of prefetch streams per file. + +For version v0.7.0 and later, when prefetching small files the number of +prefetch streams is automatically reduced below to prevent the streams +from overlapping. + ++--------------------+------------------------------------------------+ +| zfetch_max_streams | Notes | ++====================+================================================+ +| Tags | `prefetch <#prefetch>`__ | ++--------------------+------------------------------------------------+ +| When to change | If the workload benefits from prefetching and | +| | has more than ``zfetch_max_streams`` | +| | concurrent reader threads | ++--------------------+------------------------------------------------+ +| Data Type | uint | ++--------------------+------------------------------------------------+ +| Units | streams | ++--------------------+------------------------------------------------+ +| Range | 1 to MAX_UINT | ++--------------------+------------------------------------------------+ +| Default | 8 | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| Versions Affected | all | ++--------------------+------------------------------------------------+ + +zfetch_min_sec_reap +~~~~~~~~~~~~~~~~~~~ + +Prefetch streams that have been accessed in ``zfetch_min_sec_reap`` +seconds are automatically stopped. + +=================== =========================== +zfetch_min_sec_reap Notes +=================== =========================== +Tags `prefetch <#prefetch>`__ +When to change To test prefetch efficiency +Data Type uint +Units seconds +Range 0 to MAX_UINT +Default 2 +Change Dynamic +Versions Affected all +=================== =========================== + +zfs_arc_dnode_limit_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Percentage of ARC metadata space that can be used for dnodes. + +The value calculated for ``zfs_arc_dnode_limit_percent`` can be +overridden by `zfs_arc_dnode_limit <#zfs_arc_dnode_limit>`__. 
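+
+A back-of-the-envelope sketch of how this percentage and the
+``zfs_arc_dnode_limit`` override described below combine into an
+effective dnode limit (the values used here are illustrative):
+
+::
+
+   # Effective dnode limit: a non-zero zfs_arc_dnode_limit (bytes) wins;
+   # otherwise the limit is zfs_arc_dnode_limit_percent of arc_meta_limit.
+   def effective_dnode_limit(arc_meta_limit,
+                             dnode_limit_percent=10,
+                             dnode_limit_bytes=0):
+       if dnode_limit_bytes:
+           return dnode_limit_bytes
+       return arc_meta_limit * dnode_limit_percent // 100
+
+   # Example: 12 GiB arc_meta_limit at the default 10% -> ~1.2 GiB for dnodes
+   print(effective_dnode_limit(12 * 2**30))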
+
++-----------------------------+---------------------------------------+
+| zfs_arc_dnode_limit_percent | Notes                                 |
++=============================+=======================================+
+| Tags                        | `ARC <#arc>`__                        |
++-----------------------------+---------------------------------------+
+| When to change              | Consider increasing if ``arc_prune``  |
+|                             | is using excessive system time and    |
+|                             | ``/proc/spl/kstat/zfs/arcstats``      |
+|                             | shows ``arc_dnode_size`` is near or   |
+|                             | over ``arc_dnode_limit``              |
++-----------------------------+---------------------------------------+
+| Data Type                   | int                                   |
++-----------------------------+---------------------------------------+
+| Units                       | percent of arc_meta_limit             |
++-----------------------------+---------------------------------------+
+| Range                       | 0 to 100                              |
++-----------------------------+---------------------------------------+
+| Default                     | 10                                    |
++-----------------------------+---------------------------------------+
+| Change                      | Dynamic                               |
++-----------------------------+---------------------------------------+
+| Versions Affected           | v0.7.0 and later                      |
++-----------------------------+---------------------------------------+
+
+zfs_arc_dnode_limit
+~~~~~~~~~~~~~~~~~~~
+
+When the number of bytes consumed by dnodes in the ARC exceeds
+``zfs_arc_dnode_limit`` bytes, demand for new metadata can take space
+from the dnodes already cached in the ARC.
+
+The default value of 0 indicates that the limit is calculated as
+`zfs_arc_dnode_limit_percent <#zfs_arc_dnode_limit_percent>`__ of the
+ARC metadata space.
+
+``zfs_arc_dnode_limit`` is similar in purpose to
+`zfs_arc_meta_prune <#zfs_arc_meta_prune>`__, which applies to ARC
+metadata in general.
+
++---------------------+-----------------------------------------------+
+| zfs_arc_dnode_limit | Notes                                         |
++=====================+===============================================+
+| Tags                | `ARC <#arc>`__                                |
++---------------------+-----------------------------------------------+
+| When to change      | Consider increasing if ``arc_prune`` is using |
+|                     | excessive system time and                     |
+|                     | ``/proc/spl/kstat/zfs/arcstats`` shows        |
+|                     | ``arc_dnode_size`` is near or over            |
+|                     | ``arc_dnode_limit``                           |
++---------------------+-----------------------------------------------+
+| Data Type           | uint64                                        |
++---------------------+-----------------------------------------------+
+| Units               | bytes                                         |
++---------------------+-----------------------------------------------+
+| Range               | 0 to MAX_UINT64                               |
++---------------------+-----------------------------------------------+
+| Default             | 0 (uses                                       |
+|                     | `zfs_arc_dnode_lim                            |
+|                     | it_percent <#zfs_arc_dnode_limit_percent>`__) |
++---------------------+-----------------------------------------------+
+| Change              | Dynamic                                       |
++---------------------+-----------------------------------------------+
+| Versions Affected   | v0.7.0 and later                              |
++---------------------+-----------------------------------------------+
+
+zfs_arc_dnode_reduce_percent
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Percentage of ARC dnodes to try to evict in response to demand for
+non-metadata when the number of bytes consumed by dnodes exceeds
+`zfs_arc_dnode_limit <#zfs_arc_dnode_limit>`__.
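+
+A rough sketch of the eviction target this implies, assuming the
+percentage applies to the dnode space used above the limit (as the
+Units entry in the table below indicates):
+
+::
+
+   # Try to evict a percentage of the dnode space above zfs_arc_dnode_limit.
+   def dnode_evict_target(arc_dnode_size, arc_dnode_limit,
+                          reduce_percent=10):
+       excess = max(0, arc_dnode_size - arc_dnode_limit)
+       return excess * reduce_percent // 100
+
+   # Example: 1.5 GiB of dnodes against a 1 GiB limit -> evict roughly 51 MiB
+   print(dnode_evict_target(int(1.5 * 2**30), 2**30))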
+ ++------------------------------+--------------------------------------+ +| zfs_arc_dnode_reduce_percent | Notes | ++==============================+======================================+ +| Tags | `ARC <#arc>`__ | ++------------------------------+--------------------------------------+ +| When to change | Testing dnode cache efficiency | ++------------------------------+--------------------------------------+ +| Data Type | uint64 | ++------------------------------+--------------------------------------+ +| Units | percent of size of dnode space used | +| | above | +| | `zfs_arc_d | +| | node_limit <#zfs_arc_dnode_limit>`__ | ++------------------------------+--------------------------------------+ +| Range | 0 to 100 | ++------------------------------+--------------------------------------+ +| Default | 10 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.7.0 and later | ++------------------------------+--------------------------------------+ + +zfs_arc_average_blocksize +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ARC's buffer hash table is sized based on the assumption of an +average block size of ``zfs_arc_average_blocksize``. The default of 8 +KiB uses approximately 1 MiB of hash table per 1 GiB of physical memory +with 8-byte pointers. + ++---------------------------+-----------------------------------------+ +| zfs_arc_average_blocksize | Notes | ++===========================+=========================================+ +| Tags | `ARC <#arc>`__, `memory <#memory>`__ | ++---------------------------+-----------------------------------------+ +| When to change | For workloads where the known average | +| | blocksize is larger, increasing | +| | ``zfs_arc_average_blocksize`` can | +| | reduce memory usage | ++---------------------------+-----------------------------------------+ +| Data Type | int | ++---------------------------+-----------------------------------------+ +| Units | bytes | ++---------------------------+-----------------------------------------+ +| Range | 512 to 16,777,216 | ++---------------------------+-----------------------------------------+ +| Default | 8,192 | ++---------------------------+-----------------------------------------+ +| Change | Prior to zfs module load | ++---------------------------+-----------------------------------------+ +| Versions Affected | all | ++---------------------------+-----------------------------------------+ + +zfs_arc_evict_batch_limit +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Number ARC headers to evict per sublist before proceeding to another +sublist. This batch-style operation prevents entire sublists from being +evicted at once but comes at a cost of additional unlocking and locking. + +========================= ============================== +zfs_arc_evict_batch_limit Notes +========================= ============================== +Tags `ARC <#arc>`__ +When to change Testing ARC multilist features +Data Type int +Units count of ARC headers +Range 1 to INT_MAX +Default 10 +Change Dynamic +Versions Affected v0.6.5 and later +========================= ============================== + +zfs_arc_grow_retry +~~~~~~~~~~~~~~~~~~ + +When the ARC is shrunk due to memory demand, do not retry growing the +ARC for ``zfs_arc_grow_retry`` seconds. This operates as a damper to +prevent oscillating grow/shrink cycles when there is memory pressure. 
+
+If ``zfs_arc_grow_retry`` = 0, the internal default of 5 seconds is
+used.
+
+================== ====================================
+zfs_arc_grow_retry Notes
+================== ====================================
+Tags               `ARC <#arc>`__, `memory <#memory>`__
+When to change     TBD
+Data Type          int
+Units              seconds
+Range              1 to MAX_INT
+Default            0
+Change             Dynamic
+Versions Affected  v0.6.5 and later
+================== ====================================
+
+zfs_arc_lotsfree_percent
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Throttle ARC memory consumption, effectively throttling I/O, when free
+system memory drops below this percentage of total system memory.
+Setting ``zfs_arc_lotsfree_percent`` to 0 disables the throttle.
+
+The ``memory_throttle_count`` counter in
+``/proc/spl/kstat/zfs/arcstats`` can indicate throttle activity.
+
+======================== ====================================
+zfs_arc_lotsfree_percent Notes
+======================== ====================================
+Tags                     `ARC <#arc>`__, `memory <#memory>`__
+When to change           TBD
+Data Type                int
+Units                    percent
+Range                    0 to 100
+Default                  10
+Change                   Dynamic
+Versions Affected        v0.6.5 and later
+======================== ====================================
+
+zfs_arc_max
+~~~~~~~~~~~
+
+Maximum size of ARC in bytes. If set to 0 then the maximum ARC size is
+set to 1/2 of system RAM.
+
+``zfs_arc_max`` can be changed dynamically, with some caveats. It
+cannot be set back to 0 while running, and reducing it below the
+current ARC size will not cause the ARC to shrink without memory
+pressure to induce shrinking.
+
++-------------------+-------------------------------------------------+
+| zfs_arc_max       | Notes                                           |
++===================+=================================================+
+| Tags              | `ARC <#arc>`__, `memory <#memory>`__            |
++-------------------+-------------------------------------------------+
+| When to change    | Reduce if ARC competes too much with other      |
+|                   | applications, increase if ZFS is the primary    |
+|                   | application and can use more RAM                |
++-------------------+-------------------------------------------------+
+| Data Type         | uint64                                          |
++-------------------+-------------------------------------------------+
+| Units             | bytes                                           |
++-------------------+-------------------------------------------------+
+| Range             | 67,108,864 to RAM size in bytes                 |
++-------------------+-------------------------------------------------+
+| Default           | 0 (uses default of RAM size in bytes / 2)       |
++-------------------+-------------------------------------------------+
+| Change            | Dynamic (see description above)                 |
++-------------------+-------------------------------------------------+
+| Verification      | ``c`` column in ``arcstat.py`` or               |
+|                   | ``/proc/spl/kstat/zfs/arcstats`` entry          |
+|                   | ``c_max``                                       |
++-------------------+-------------------------------------------------+
+| Versions Affected | all                                             |
++-------------------+-------------------------------------------------+
+
+zfs_arc_meta_adjust_restarts
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The number of restart passes to make while scanning the ARC,
+attempting to free buffers in order to stay below
+`zfs_arc_meta_limit <#zfs_arc_meta_limit>`__.
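+
+Several parameters in this section list a Verification entry that
+points at ``/proc/spl/kstat/zfs/arcstats`` (for example ``c_max``
+above). A minimal sketch for reading such entries, assuming the usual
+kstat layout of a header followed by ``name type data`` columns:
+
+::
+
+   # Read selected entries from the ARC kstats file.
+   def read_arcstats(path="/proc/spl/kstat/zfs/arcstats"):
+       stats = {}
+       with open(path) as f:
+           for line in f:
+               fields = line.split()
+               # data rows have exactly three columns and a numeric value
+               if len(fields) == 3 and fields[2].isdigit():
+                   stats[fields[0]] = int(fields[2])
+       return stats
+
+   if __name__ == "__main__":
+       arc = read_arcstats()
+       for key in ("c", "c_min", "c_max", "arc_meta_limit"):
+           print(key, arc.get(key))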
+ +============================ ======================================= +zfs_arc_meta_adjust_restarts Notes +============================ ======================================= +Tags `ARC <#arc>`__ +When to change Testing ARC metadata adjustment feature +Data Type int +Units restarts +Range 0 to INT_MAX +Default 4,096 +Change Dynamic +Versions Affected v0.6.5 and later +============================ ======================================= + +zfs_arc_meta_limit +~~~~~~~~~~~~~~~~~~ + +Sets the maximum allowed size metadata buffers in the ARC. When +`zfs_arc_meta_limit <#zfs_arc_meta_limit>`__ is reached metadata buffers +are reclaimed, even if the overall ``c_max`` has not been reached. + +In version v0.7.0, with a default value = 0, +``zfs_arc_meta_limit_percent`` is used to set ``arc_meta_limit`` + ++--------------------+------------------------------------------------+ +| zfs_arc_meta_limit | Notes | ++====================+================================================+ +| Tags | `ARC <#arc>`__ | ++--------------------+------------------------------------------------+ +| When to change | For workloads where the metadata to data ratio | +| | in the ARC can be changed to improve ARC hit | +| | rates | ++--------------------+------------------------------------------------+ +| Data Type | uint64 | ++--------------------+------------------------------------------------+ +| Units | bytes | ++--------------------+------------------------------------------------+ +| Range | 0 to ``c_max`` | ++--------------------+------------------------------------------------+ +| Default | 0 | ++--------------------+------------------------------------------------+ +| Change | Dynamic, except that it cannot be set back to | +| | 0 for a specific percent of the ARC; it must | +| | be set to an explicit value | ++--------------------+------------------------------------------------+ +| Verification | ``/proc/spl/kstat/zfs/arcstats`` entry | +| | ``arc_meta_limit`` | ++--------------------+------------------------------------------------+ +| Versions Affected | all | ++--------------------+------------------------------------------------+ + +zfs_arc_meta_limit_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Sets the limit to ARC metadata, ``arc_meta_limit``, as a percentage of +the maximum size target of the ARC, ``c_max`` + +Prior to version v0.7.0, the +`zfs_arc_meta_limit <#zfs_arc_meta_limit>`__ was used to set the limit +as a fixed size. ``zfs_arc_meta_limit_percent`` provides a more +convenient interface for setting the limit. 
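+
+A small sketch of the precedence described above: a non-zero
+``zfs_arc_meta_limit`` sets the limit directly, while the default of 0
+derives it from ``zfs_arc_meta_limit_percent`` of ``c_max``
+(illustrative values):
+
+::
+
+   # arc_meta_limit: explicit bytes win; otherwise a percentage of c_max.
+   def arc_meta_limit(c_max, meta_limit_bytes=0, meta_limit_percent=75):
+       if meta_limit_bytes:
+           return meta_limit_bytes
+       return c_max * meta_limit_percent // 100
+
+   # Example: 16 GiB c_max at the default 75% -> 12 GiB allowed for metadata
+   print(arc_meta_limit(16 * 2**30))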
+
++----------------------------+----------------------------------------+
+| zfs_arc_meta_limit_percent | Notes                                  |
++============================+========================================+
+| Tags                       | `ARC <#arc>`__                         |
++----------------------------+----------------------------------------+
+| When to change             | For workloads where the metadata to    |
+|                            | data ratio in the ARC can be changed   |
+|                            | to improve ARC hit rates               |
++----------------------------+----------------------------------------+
+| Data Type                  | uint64                                 |
++----------------------------+----------------------------------------+
+| Units                      | percent of ``c_max``                   |
++----------------------------+----------------------------------------+
+| Range                      | 0 to 100                               |
++----------------------------+----------------------------------------+
+| Default                    | 75                                     |
++----------------------------+----------------------------------------+
+| Change                     | Dynamic                                |
++----------------------------+----------------------------------------+
+| Verification               | ``/proc/spl/kstat/zfs/arcstats`` entry |
+|                            | ``arc_meta_limit``                     |
++----------------------------+----------------------------------------+
+| Versions Affected          | v0.7.0 and later                       |
++----------------------------+----------------------------------------+
+
+zfs_arc_meta_min
+~~~~~~~~~~~~~~~~
+
+The minimum allowed size in bytes that metadata buffers may consume in
+the ARC. This value defaults to 0, which disables a floor on the
+amount of the ARC devoted to metadata.
+
+When evicting data from the ARC, if the ``metadata_size`` is less than
+``arc_meta_min`` then data is evicted instead of metadata.
+
++-------------------+---------------------------------------------------------+
+| zfs_arc_meta_min  | Notes                                                   |
++===================+=========================================================+
+| Tags              | `ARC <#arc>`__                                          |
++-------------------+---------------------------------------------------------+
+| When to change    |                                                         |
++-------------------+---------------------------------------------------------+
+| Data Type         | uint64                                                  |
++-------------------+---------------------------------------------------------+
+| Units             | bytes                                                   |
++-------------------+---------------------------------------------------------+
+| Range             | 16,777,216 to ``c_max``                                 |
++-------------------+---------------------------------------------------------+
+| Default           | 0 (use internal default 16 MiB)                         |
++-------------------+---------------------------------------------------------+
+| Change            | Dynamic                                                 |
++-------------------+---------------------------------------------------------+
+| Verification      | ``/proc/spl/kstat/zfs/arcstats`` entry ``arc_meta_min`` |
++-------------------+---------------------------------------------------------+
+| Versions Affected | all                                                     |
++-------------------+---------------------------------------------------------+
+
+zfs_arc_meta_prune
+~~~~~~~~~~~~~~~~~~
+
+``zfs_arc_meta_prune`` sets the number of dentries and znodes to be
+scanned looking for entries which can be dropped. This provides a
+mechanism to ensure the ARC can honor the ``arc_meta_limit`` and
+reclaim otherwise pinned ARC buffers. Pruning may be required when the
+ARC size reaches ``arc_meta_limit`` because dentries and znodes can pin
+buffers in the ARC. Increasing this value causes the dentry and znode
+caches to be pruned more aggressively and the arc_prune thread to
+become more active. Setting ``zfs_arc_meta_prune`` to 0 will disable
+pruning.
+ ++--------------------+------------------------------------------------+ +| zfs_arc_meta_prune | Notes | ++====================+================================================+ +| Tags | `ARC <#arc>`__ | ++--------------------+------------------------------------------------+ +| When to change | TBD | ++--------------------+------------------------------------------------+ +| Data Type | uint64 | ++--------------------+------------------------------------------------+ +| Units | entries | ++--------------------+------------------------------------------------+ +| Range | 0 to INT_MAX | ++--------------------+------------------------------------------------+ +| Default | 10,000 | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| ! Verification | Prune activity is counted by the | +| | ``/proc/spl/kstat/zfs/arcstats`` entry | +| | ``arc_prune`` | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++--------------------+------------------------------------------------+ + +zfs_arc_meta_strategy +~~~~~~~~~~~~~~~~~~~~~ + +Defines the strategy for ARC metadata eviction (meta reclaim strategy). +A value of 0 (META_ONLY) will evict only the ARC metadata. A value of 1 +(BALANCED) indicates that additional data may be evicted if required in +order to evict the requested amount of metadata. + ++-----------------------+---------------------------------------------+ +| zfs_arc_meta_strategy | Notes | ++=======================+=============================================+ +| Tags | `ARC <#arc>`__ | ++-----------------------+---------------------------------------------+ +| When to change | Testing ARC metadata eviction | ++-----------------------+---------------------------------------------+ +| Data Type | int | ++-----------------------+---------------------------------------------+ +| Units | enum | ++-----------------------+---------------------------------------------+ +| Range | 0=evict metadata only, 1=also evict data | +| | buffers if they can free metadata buffers | +| | for eviction | ++-----------------------+---------------------------------------------+ +| Default | 1 (BALANCED) | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++-----------------------+---------------------------------------------+ + +zfs_arc_min +~~~~~~~~~~~ + +Minimum ARC size limit. When the ARC is asked to shrink, it will stop +shrinking at ``c_min`` as tuned by ``zfs_arc_min``. 
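+
+The kernel default shown in the table below (the larger of 32 MiB and
+1/32 of system memory) can be estimated for a given machine with a
+short sketch:
+
+::
+
+   # Estimate the default c_min on this machine: max(32 MiB, physmem / 32).
+   import os
+
+   def default_arc_min():
+       physmem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
+       return max(32 * 2**20, physmem // 32)
+
+   # Example: a 128 GiB host ends up with a 4 GiB floor for the ARC
+   print(default_arc_min() / 2**30, "GiB")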
+ ++-------------------+-------------------------------------------------+ +| zfs_arc_min | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If the primary focus of the system is ZFS, then | +| | increasing can ensure the ARC gets a minimum | +| | amount of RAM | ++-------------------+-------------------------------------------------+ +| Data Type | uint64 | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 33,554,432 to ``c_max`` | ++-------------------+-------------------------------------------------+ +| Default | For kernel: greater of 33,554,432 (32 MiB) and | +| | memory size / 32. For user-land: greater of | +| | 33,554,432 (32 MiB) and ``c_max`` / 2. | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Verification | ``/proc/spl/kstat/zfs/arcstats`` entry | +| | ``c_min`` | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +zfs_arc_min_prefetch_ms +~~~~~~~~~~~~~~~~~~~~~~~ + +Minimum time prefetched blocks are locked in the ARC. + +A value of 0 represents the default of 1 second. However, once changed, +dynamically setting to 0 will not return to the default. + +======================= ======================================== +zfs_arc_min_prefetch_ms Notes +======================= ======================================== +Tags `ARC <#arc>`__, `prefetch <#prefetch>`__ +When to change TBD +Data Type int +Units milliseconds +Range 1 to INT_MAX +Default 0 (use internal default of 1000 ms) +Change Dynamic +Versions Affected v0.8.0 and later +======================= ======================================== + +zfs_arc_min_prescient_prefetch_ms +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Minimum time "prescient prefetched" blocks are locked in the ARC. These +blocks are meant to be prefetched fairly aggresively ahead of the code +that may use them. + +A value of 0 represents the default of 6 seconds. However, once changed, +dynamically setting to 0 will not return to the default. 
+ ++----------------------------------+----------------------------------+ +| z | Notes | +| fs_arc_min_prescient_prefetch_ms | | ++==================================+==================================+ +| Tags | `ARC <#arc>`__, | +| | `prefetch <#prefetch>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | milliseconds | ++----------------------------------+----------------------------------+ +| Range | 1 to INT_MAX | ++----------------------------------+----------------------------------+ +| Default | 0 (use internal default of 6000 | +| | ms) | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.8.0 and later | ++----------------------------------+----------------------------------+ + +zfs_multilist_num_sublists +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To allow more fine-grained locking, each ARC state contains a series of +lists (sublists) for both data and metadata objects. Locking is +performed at the sublist level. This parameters controls the number of +sublists per ARC state, and also applies to other uses of the multilist +data structure. + ++----------------------------+----------------------------------------+ +| zfs_multilist_num_sublists | Notes | ++============================+========================================+ +| Tags | `ARC <#arc>`__ | ++----------------------------+----------------------------------------+ +| When to change | TBD | ++----------------------------+----------------------------------------+ +| Data Type | int | ++----------------------------+----------------------------------------+ +| Units | lists | ++----------------------------+----------------------------------------+ +| Range | 1 to INT_MAX | ++----------------------------+----------------------------------------+ +| Default | 0 (internal value is greater of number | +| | of online CPUs or 4) | ++----------------------------+----------------------------------------+ +| Change | Prior to zfs module load | ++----------------------------+----------------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------+----------------------------------------+ + +zfs_arc_overflow_shift +~~~~~~~~~~~~~~~~~~~~~~ + +The ARC size is considered to be overflowing if it exceeds the current +ARC target size (``/proc/spl/kstat/zfs/arcstats`` entry ``c``) by a +threshold determined by ``zfs_arc_overflow_shift``. The threshold is +calculated as a fraction of c using the formula: (ARC target size) +``c >> zfs_arc_overflow_shift`` + +The default value of 8 causes the ARC to be considered to be overflowing +if it exceeds the target size by 1/256th (0.3%) of the target size. + +When the ARC is overflowing, new buffer allocations are stalled until +the reclaim thread catches up and the overflow condition no longer +exists. 
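+
+Following the formula above, a quick sketch of the overflow threshold
+for a given ARC target size:
+
+::
+
+   # The ARC is overflowing when its size exceeds the target c by more
+   # than c >> zfs_arc_overflow_shift (1/256 of the target by default).
+   def arc_overflow_threshold(c, overflow_shift=8):
+       return c >> overflow_shift
+
+   # Example: a 64 GiB target allows ~256 MiB of overshoot before new
+   # buffer allocations stall behind the reclaim thread.
+   print(arc_overflow_threshold(64 * 2**30) / 2**20, "MiB")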
+ +====================== ================ +zfs_arc_overflow_shift Notes +====================== ================ +Tags `ARC <#arc>`__ +When to change TBD +Data Type int +Units shift +Range 1 to INT_MAX +Default 8 +Change Dynamic +Versions Affected v0.6.5 and later +====================== ================ + +zfs_arc_p_min_shift +~~~~~~~~~~~~~~~~~~~ + +arc_p_min_shift is used to shift of ARC target size +(``/proc/spl/kstat/zfs/arcstats`` entry ``c``) for calculating both +minimum and maximum most recently used (MRU) target size +(``/proc/spl/kstat/zfs/arcstats`` entry ``p``) + +A value of 0 represents the default setting of ``arc_p_min_shift`` = 4. +However, once changed, dynamically setting ``zfs_arc_p_min_shift`` to 0 +will not return to the default. + ++---------------------+-----------------------------------------------+ +| zfs_arc_p_min_shift | Notes | ++=====================+===============================================+ +| Tags | `ARC <#arc>`__ | ++---------------------+-----------------------------------------------+ +| When to change | TBD | ++---------------------+-----------------------------------------------+ +| Data Type | int | ++---------------------+-----------------------------------------------+ +| Units | shift | ++---------------------+-----------------------------------------------+ +| Range | 1 to INT_MAX | ++---------------------+-----------------------------------------------+ +| Default | 0 (internal default = 4) | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Verification | Observe changes to | +| | ``/proc/spl/kstat/zfs/arcstats`` entry ``p`` | ++---------------------+-----------------------------------------------+ +| Versions Affected | all | ++---------------------+-----------------------------------------------+ + +zfs_arc_p_dampener_disable +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When data is being added to the ghost lists, the MRU target size is +adjusted. The amount of adjustment is based on the ratio of the MRU/MFU +sizes. When enabled, the ratio is capped to 10, avoiding large +adjustments. + ++----------------------------+----------------------------------------+ +| zfs_arc_p_dampener_disable | Notes | ++============================+========================================+ +| Tags | `ARC <#arc>`__ | ++----------------------------+----------------------------------------+ +| When to change | Testing ARC ghost list behaviour | ++----------------------------+----------------------------------------+ +| Data Type | boolean | ++----------------------------+----------------------------------------+ +| Range | 0=avoid large adjustments, 1=permit | +| | large adjustments | ++----------------------------+----------------------------------------+ +| Default | 1 | ++----------------------------+----------------------------------------+ +| Change | Dynamic | ++----------------------------+----------------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------+----------------------------------------+ + +zfs_arc_shrink_shift +~~~~~~~~~~~~~~~~~~~~ + +``arc_shrink_shift`` is used to adjust the ARC target sizes when large +reduction is required. The current ARC target size, ``c``, and MRU size +``p`` can be reduced by by the current ``size >> arc_shrink_shift``. For +the default value of 7, this reduces the target by approximately 0.8%. + +A value of 0 represents the default setting of arc_shrink_shift = 7. 
+However, once changed, dynamically setting arc_shrink_shift to 0 will +not return to the default. + ++----------------------+----------------------------------------------+ +| zfs_arc_shrink_shift | Notes | ++======================+==============================================+ +| Tags | `ARC <#arc>`__, `memory <#memory>`__ | ++----------------------+----------------------------------------------+ +| When to change | During memory shortfall, reducing | +| | ``zfs_arc_shrink_shift`` increases the rate | +| | of ARC shrinkage | ++----------------------+----------------------------------------------+ +| Data Type | int | ++----------------------+----------------------------------------------+ +| Units | shift | ++----------------------+----------------------------------------------+ +| Range | 1 to INT_MAX | ++----------------------+----------------------------------------------+ +| Default | 0 (``arc_shrink_shift`` = 7) | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | all | ++----------------------+----------------------------------------------+ + +zfs_arc_pc_percent +~~~~~~~~~~~~~~~~~~ + +``zfs_arc_pc_percent`` allows ZFS arc to play more nicely with the +kernel's LRU pagecache. It can guarantee that the arc size won't +collapse under scanning pressure on the pagecache, yet still allows arc +to be reclaimed down to zfs_arc_min if necessary. This value is +specified as percent of pagecache size (as measured by +``NR_FILE_PAGES``) where that percent may exceed 100. This only operates +during memory pressure/reclaim. + ++--------------------+------------------------------------------------+ +| zfs_arc_pc_percent | Notes | ++====================+================================================+ +| Tags | `ARC <#arc>`__, `memory <#memory>`__ | ++--------------------+------------------------------------------------+ +| When to change | When using file systems under memory | +| | shortfall, if the page scanner causes the ARC | +| | to shrink too fast, then adjusting | +| | ``zfs_arc_pc_percent`` can reduce the shrink | +| | rate | ++--------------------+------------------------------------------------+ +| Data Type | int | ++--------------------+------------------------------------------------+ +| Units | percent | ++--------------------+------------------------------------------------+ +| Range | 0 to 100 | ++--------------------+------------------------------------------------+ +| Default | 0 (disabled) | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------+------------------------------------------------+ + +zfs_arc_sys_free +~~~~~~~~~~~~~~~~ + +``zfs_arc_sys_free`` is the target number of bytes the ARC should leave +as free memory on the system. Defaults to the larger of 1/64 of physical +memory or 512K. Setting this option to a non-zero value will override +the default. + +A value of 0 represents the default setting of larger of 1/64 of +physical memory or 512 KiB. However, once changed, dynamically setting +zfs_arc_sys_free to 0 will not return to the default. 
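+
+A short sketch of the default target described above, the larger of
+1/64 of physical memory and 512 KiB:
+
+::
+
+   # Default zfs_arc_sys_free target: max(physmem / 64, 512 KiB).
+   import os
+
+   def default_arc_sys_free():
+       physmem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
+       return max(physmem // 64, 512 * 1024)
+
+   # Example: a 64 GiB host keeps about 1 GiB free for the rest of the system
+   print(default_arc_sys_free() / 2**20, "MiB")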
+ ++-------------------+-------------------------------------------------+ +| zfs_arc_sys_free | Notes | ++===================+=================================================+ +| Tags | `ARC <#arc>`__, `memory <#memory>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Change if more free memory is desired as a | +| | margin against memory demand by applications | ++-------------------+-------------------------------------------------+ +| Data Type | ulong | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 0 to ULONG_MAX | ++-------------------+-------------------------------------------------+ +| Default | 0 (default to larger of 1/64 of physical memory | +| | or 512 KiB) | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++-------------------+-------------------------------------------------+ + +zfs_autoimport_disable +~~~~~~~~~~~~~~~~~~~~~~ + +Disable reading zpool.cache file (see +`spa_config_path <#spa_config_path>`__) when loading the zfs module. + ++------------------------+--------------------------------------------+ +| zfs_autoimport_disable | Notes | ++========================+============================================+ +| Tags | `import <#import>`__ | ++------------------------+--------------------------------------------+ +| When to change | Leave as default so that zfs behaves as | +| | other Linux kernel modules | ++------------------------+--------------------------------------------+ +| Data Type | boolean | ++------------------------+--------------------------------------------+ +| Range | 0=read ``zpool.cache`` at module load, | +| | 1=do not read ``zpool.cache`` at module | +| | load | ++------------------------+--------------------------------------------+ +| Default | 1 | ++------------------------+--------------------------------------------+ +| Change | Dynamic | ++------------------------+--------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++------------------------+--------------------------------------------+ + +zfs_commit_timeout_pct +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_commit_timeout_pct`` controls the amount of time that a log (ZIL) +write block (lwb) remains "open" when it isn't "full" and it has a +thread waiting to commit to stable storage. The timeout is scaled based +on a percentage of the last lwb latency to avoid significantly impacting +the latency of each individual intent log transaction (itx). + +====================== ============== +zfs_commit_timeout_pct Notes +====================== ============== +Tags `ZIL <#zil>`__ +When to change TBD +Data Type int +Units percent +Range 1 to 100 +Default 5 +Change Dynamic +Versions Affected v0.8.0 +====================== ============== + +zfs_dbgmsg_enable +~~~~~~~~~~~~~~~~~ + +| Internally ZFS keeps a small log to facilitate debugging. The contents + of the log are in the ``/proc/spl/kstat/zfs/dbgmsg`` file. +| Writing 0 to ``/proc/spl/kstat/zfs/dbgmsg`` file clears the log. 
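+
+A minimal sketch for tailing and clearing the debug log from a script,
+using only the ``/proc/spl/kstat/zfs/dbgmsg`` file described above
+(clearing requires root):
+
+::
+
+   # Print the most recent internal debug messages; optionally clear them.
+   DBGMSG = "/proc/spl/kstat/zfs/dbgmsg"
+
+   def tail_dbgmsg(count=20):
+       with open(DBGMSG) as f:
+           return f.readlines()[-count:]
+
+   def clear_dbgmsg():
+       with open(DBGMSG, "w") as f:
+           f.write("0")      # writing 0 clears the log
+
+   if __name__ == "__main__":
+       print("".join(tail_dbgmsg()))
+       # clear_dbgmsg()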
+ +See also `zfs_dbgmsg_maxsize <#zfs_dbgmsg_maxsize>`__ + +================= ================================================= +zfs_dbgmsg_enable Notes +================= ================================================= +Tags `debug <#debug>`__ +When to change To view ZFS internal debug log +Data Type boolean +Range 0=do not log debug messages, 1=log debug messages +Default 0 (1 for debug builds) +Change Dynamic +Versions Affected v0.6.5 and later +================= ================================================= + +zfs_dbgmsg_maxsize +~~~~~~~~~~~~~~~~~~ + +The ``/proc/spl/kstat/zfs/dbgmsg`` file size limit is set by +zfs_dbgmsg_maxsize. + +See also zfs_dbgmsg_enable + +================== ================== +zfs_dbgmsg_maxsize Notes +================== ================== +Tags `debug <#debug>`__ +When to change TBD +Data Type int +Units bytes +Range 0 to INT_MAX +Default 4 MiB +Change Dynamic +Versions Affected v0.6.5 and later +================== ================== + +zfs_dbuf_state_index +~~~~~~~~~~~~~~~~~~~~ + +The ``zfs_dbuf_state_index`` feature is currently unused. It is normally +used for controlling values in the ``/proc/spl/kstat/zfs/dbufs`` file. + +==================== ================== +zfs_dbuf_state_index Notes +==================== ================== +Tags `debug <#debug>`__ +When to change Do not change +Data Type int +Units TBD +Range TBD +Default 0 +Change Dynamic +Versions Affected v0.6.5 and later +==================== ================== + +zfs_deadman_enabled +~~~~~~~~~~~~~~~~~~~ + +When a pool sync operation takes longer than zfs_deadman_synctime_ms +milliseconds, a "slow spa_sync" message is logged to the debug log (see +`zfs_dbgmsg_enable <#zfs_dbgmsg_enable>`__). If ``zfs_deadman_enabled`` +is set to 1, then all pending IO operations are also checked and if any +haven't completed within zfs_deadman_synctime_ms milliseconds, a "SLOW +IO" message is logged to the debug log and a "deadman" system event (see +zpool events command) with the details of the hung IO is posted. + +=================== ===================================== +zfs_deadman_enabled Notes +=================== ===================================== +Tags `debug <#debug>`__ +When to change To disable logging of slow I/O +Data Type boolean +Range 0=do not log slow I/O, 1=log slow I/O +Default 1 +Change Dynamic +Versions Affected v0.8.0 +=================== ===================================== + +zfs_deadman_checktime_ms +~~~~~~~~~~~~~~~~~~~~~~~~ + +Once a pool sync operation has taken longer than +`zfs_deadman_synctime_ms <#zfs_deadman_synctime_ms>`__ milliseconds, +continue to check for slow operations every +`zfs_deadman_checktime_ms <#zfs_deadman_synctime_ms>`__ milliseconds. + +======================== ======================= +zfs_deadman_checktime_ms Notes +======================== ======================= +Tags `debug <#debug>`__ +When to change When debugging slow I/O +Data Type ulong +Units milliseconds +Range 1 to ULONG_MAX +Default 60,000 (1 minute) +Change Dynamic +Versions Affected v0.8.0 +======================== ======================= + +zfs_deadman_ziotime_ms +~~~~~~~~~~~~~~~~~~~~~~ + +When an individual I/O takes longer than ``zfs_deadman_ziotime_ms`` +milliseconds, then the operation is considered to be "hung". If +`zfs_deadman_enabled <#zfs_deadman_enabled>`__ is set then the deadman +behaviour is invoked as described by the +`zfs_deadman_failmode <#zfs_deadman_failmode>`__ option. 
+ +====================== ==================== +zfs_deadman_ziotime_ms Notes +====================== ==================== +Tags `debug <#debug>`__ +When to change Testing ABD features +Data Type ulong +Units milliseconds +Range 1 to ULONG_MAX +Default 300,000 (5 minutes) +Change Dynamic +Versions Affected v0.8.0 +====================== ==================== + +zfs_deadman_synctime_ms +~~~~~~~~~~~~~~~~~~~~~~~ + +The I/O deadman timer expiration time has two meanings + +1. determines when the ``spa_deadman()`` logic should fire, indicating + the txg sync has not completed in a timely manner +2. determines if an I/O is considered "hung" + +In version v0.8.0, any I/O that has not completed in +``zfs_deadman_synctime_ms`` is considered "hung" resulting in one of +three behaviors controlled by the +`zfs_deadman_failmode <#zfs_deadman_failmode>`__ parameter. + +``zfs_deadman_synctime_ms`` takes effect if +`zfs_deadman_enabled <#zfs_deadman_enabled>`__ = 1. + +======================= ======================= +zfs_deadman_synctime_ms Notes +======================= ======================= +Tags `debug <#debug>`__ +When to change When debugging slow I/O +Data Type ulong +Units milliseconds +Range 1 to ULONG_MAX +Default 600,000 (10 minutes) +Change Dynamic +Versions Affected v0.6.5 and later +======================= ======================= + +zfs_deadman_failmode +~~~~~~~~~~~~~~~~~~~~ + +zfs_deadman_failmode controls the behavior of the I/O deadman timer when +it detects a "hung" I/O. Valid values are: + +- wait - Wait for the "hung" I/O (default) +- continue - Attempt to recover from a "hung" I/O +- panic - Panic the system + +==================== =============================================== +zfs_deadman_failmode Notes +==================== =============================================== +Tags `debug <#debug>`__ +When to change In some cluster cases, panic can be appropriate +Data Type string +Range *wait*, *continue*, or *panic* +Default wait +Change Dynamic +Versions Affected v0.8.0 +==================== =============================================== + +zfs_dedup_prefetch +~~~~~~~~~~~~~~~~~~ + +ZFS can prefetch deduplication table (DDT) entries. +``zfs_dedup_prefetch`` allows DDT prefetches to be enabled. + ++--------------------+------------------------------------------------+ +| zfs_dedup_prefetch | Notes | ++====================+================================================+ +| Tags | `prefetch <#prefetch>`__, `memory <#memory>`__ | ++--------------------+------------------------------------------------+ +| When to change | For systems with limited RAM using the dedup | +| | feature, disabling deduplication table | +| | prefetch can reduce memory pressure | ++--------------------+------------------------------------------------+ +| Data Type | boolean | ++--------------------+------------------------------------------------+ +| Range | 0=do not prefetch, 1=prefetch dedup table | +| | entries | ++--------------------+------------------------------------------------+ +| Default | 0 | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++--------------------+------------------------------------------------+ + +zfs_delete_blocks +~~~~~~~~~~~~~~~~~ + +``zfs_delete_blocks`` defines a large file for the purposes of delete. 
+Files containing more than ``zfs_delete_blocks`` will be deleted +asynchronously while smaller files are deleted synchronously. Decreasing +this value reduces the time spent in an ``unlink(2)`` system call at the +expense of a longer delay before the freed space is available. + +The ``zfs_delete_blocks`` value is specified in blocks, not bytes. The +size of blocks can vary and is ultimately limited by the filesystem's +recordsize property. + ++-------------------+-------------------------------------------------+ +| zfs_delete_blocks | Notes | ++===================+=================================================+ +| Tags | `filesystem <#filesystem>`__, | +| | `delete <#delete>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If applications delete large files and blocking | +| | on ``unlink(2)`` is not desired | ++-------------------+-------------------------------------------------+ +| Data Type | ulong | ++-------------------+-------------------------------------------------+ +| Units | blocks | ++-------------------+-------------------------------------------------+ +| Range | 1 to ULONG_MAX | ++-------------------+-------------------------------------------------+ +| Default | 20,480 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +zfs_delay_min_dirty_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ZFS write throttle begins to delay each transaction when the amount +of dirty data reaches the threshold ``zfs_delay_min_dirty_percent`` of +`zfs_dirty_data_max <#zfs_dirty_data_max>`__. This value should be >= +`zfs_vdev_async_write_active_max_dirty_percent <#zfs_vdev_async_write_active_max_dirty_percent>`__. + +=========================== ==================================== +zfs_delay_min_dirty_percent Notes +=========================== ==================================== +Tags `write_throttle <#write_throttle>`__ +When to change See section "ZFS TRANSACTION DELAY" +Data Type int +Units percent +Range 0 to 100 +Default 60 +Change Dynamic +Versions Affected v0.6.4 and later +=========================== ==================================== + +zfs_delay_scale +~~~~~~~~~~~~~~~ + +``zfs_delay_scale`` controls how quickly the ZFS write throttle +transaction delay approaches infinity. Larger values cause longer delays +for a given amount of dirty data. + +For the smoothest delay, this value should be about 1 billion divided by +the maximum number of write operations per second the pool can sustain. +The throttle will smoothly handle between 10x and 1/10th +``zfs_delay_scale``. + +Note: ``zfs_delay_scale`` \* +`zfs_dirty_data_max <#zfs_dirty_data_max>`__ must be < 2^64. + +================= ==================================== +zfs_delay_scale Notes +================= ==================================== +Tags `write_throttle <#write_throttle>`__ +When to change See section "ZFS TRANSACTION DELAY" +Data Type ulong +Units scalar (nanoseconds) +Range 0 to ULONG_MAX +Default 500,000 +Change Dynamic +Versions Affected v0.6.4 and later +================= ==================================== + +zfs_dirty_data_max +~~~~~~~~~~~~~~~~~~ + +``zfs_dirty_data_max`` is the ZFS write throttle dirty space limit. Once +this limit is exceeded, new writes are delayed until space is freed by +writes being committed to the pool. 
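+
+Both ``zfs_dirty_data_max`` and the ``zfs_delay_scale`` guideline above
+can be applied from the command line like any other module parameter. A
+minimal sketch, assuming a pool that can sustain roughly 20,000 write
+operations per second and a desired 4 GiB dirty data limit (both numbers
+are purely illustrative, not recommendations):
+
+::
+
+   # zfs_delay_scale guideline: 1,000,000,000 / sustainable write IOPS
+   # 1,000,000,000 / 20,000 = 50,000
+   echo 50000 > /sys/module/zfs/parameters/zfs_delay_scale
+
+   # dirty data limit of 4 GiB for the running system
+   echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max
+
+   # make the settings persistent across module loads
+   echo "options zfs zfs_delay_scale=50000 zfs_dirty_data_max=4294967296" \
+       >> /etc/modprobe.d/zfs.conf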
+ +zfs_dirty_data_max takes precedence over +`zfs_dirty_data_max_percent <#zfs_dirty_data_max_percent>`__. + ++--------------------+------------------------------------------------+ +| zfs_dirty_data_max | Notes | ++====================+================================================+ +| Tags | `write_throttle <#write_throttle>`__ | ++--------------------+------------------------------------------------+ +| When to change | See section "ZFS TRANSACTION DELAY" | ++--------------------+------------------------------------------------+ +| Data Type | ulong | ++--------------------+------------------------------------------------+ +| Units | bytes | ++--------------------+------------------------------------------------+ +| Range | 1 to | +| | `zfs_d | +| | irty_data_max_max <#zfs_dirty_data_max_max>`__ | ++--------------------+------------------------------------------------+ +| Default | 10% of physical RAM | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------+------------------------------------------------+ + +zfs_dirty_data_max_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_dirty_data_max_percent`` is an alternative method of specifying +`zfs_dirty_data_max <#zfs_dirty_data_max>`__, the ZFS write throttle +dirty space limit. Once this limit is exceeded, new writes are delayed +until space is freed by writes being committed to the pool. + +`zfs_dirty_data_max <#zfs_dirty_data_max>`__ takes precedence over +``zfs_dirty_data_max_percent``. + ++----------------------------+----------------------------------------+ +| zfs_dirty_data_max_percent | Notes | ++============================+========================================+ +| Tags | `write_throttle <#write_throttle>`__ | ++----------------------------+----------------------------------------+ +| When to change | See section "ZFS TRANSACTION DELAY" | ++----------------------------+----------------------------------------+ +| Data Type | int | ++----------------------------+----------------------------------------+ +| Units | percent | ++----------------------------+----------------------------------------+ +| Range | 1 to 100 | ++----------------------------+----------------------------------------+ +| Default | 10% of physical RAM | ++----------------------------+----------------------------------------+ +| Change | Prior to zfs module load or a memory | +| | hot plug event | ++----------------------------+----------------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------+----------------------------------------+ + +zfs_dirty_data_max_max +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_dirty_data_max_max`` is the maximum allowable value of +`zfs_dirty_data_max <#zfs_dirty_data_max>`__. + +``zfs_dirty_data_max_max`` takes precedence over +`zfs_dirty_data_max_max_percent <#zfs_dirty_data_max_max_percent>`__. 
+ +====================== ==================================== +zfs_dirty_data_max_max Notes +====================== ==================================== +Tags `write_throttle <#write_throttle>`__ +When to change See section "ZFS TRANSACTION DELAY" +Data Type ulong +Units bytes +Range 1 to physical RAM size +Default 25% of physical RAM +Change Prior to zfs module load +Versions Affected v0.6.4 and later +====================== ==================================== + +zfs_dirty_data_max_max_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_dirty_data_max_max_percent`` an alternative to +`zfs_dirty_data_max_max <#zfs_dirty_data_max_max>`__ for setting the +maximum allowable value of `zfs_dirty_data_max <#zfs_dirty_data_max>`__ + +`zfs_dirty_data_max_max <#zfs_dirty_data_max_max>`__ takes precedence +over ``zfs_dirty_data_max_max_percent`` + +============================== ==================================== +zfs_dirty_data_max_max_percent Notes +============================== ==================================== +Tags `write_throttle <#write_throttle>`__ +When to change See section "ZFS TRANSACTION DELAY" +Data Type int +Units percent +Range 1 to 100 +Default 25% of physical RAM +Change Prior to zfs module load +Versions Affected v0.6.4 and later +============================== ==================================== + +zfs_dirty_data_sync +~~~~~~~~~~~~~~~~~~~ + +When there is at least ``zfs_dirty_data_sync`` dirty data, a transaction +group sync is started. This allows a transaction group sync to occur +more frequently than the transaction group timeout interval (see +`zfs_txg_timeout <#zfs_txg_timeout>`__) when there is dirty data to be +written. + ++---------------------+-----------------------------------------------+ +| zfs_dirty_data_sync | Notes | ++=====================+===============================================+ +| Tags | `write_throttle <#write_throttle>`__, | +| | `ZIO_scheduler <#ZIO_scheduler>`__ | ++---------------------+-----------------------------------------------+ +| When to change | TBD | ++---------------------+-----------------------------------------------+ +| Data Type | ulong | ++---------------------+-----------------------------------------------+ +| Units | bytes | ++---------------------+-----------------------------------------------+ +| Range | 1 to ULONG_MAX | ++---------------------+-----------------------------------------------+ +| Default | 67,108,864 (64 MiB) | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.6.4 through v0.8.x, deprecation planned | +| | for v2 | ++---------------------+-----------------------------------------------+ + +zfs_dirty_data_sync_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When there is at least ``zfs_dirty_data_sync_percent`` of +`zfs_dirty_data_max <#zfs_dirty_data_max>`__ dirty data, a transaction +group sync is started. This allows a transaction group sync to occur +more frequently than the transaction group timeout interval (see +`zfs_txg_timeout <#zfs_txg_timeout>`__) when there is dirty data to be +written. 
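+
+As a worked example of the threshold above: with a (purely illustrative)
+``zfs_dirty_data_max`` of 4 GiB and the default of 20 percent, an early
+txg sync is requested once roughly 0.20 times 4096 MiB, or about 819
+MiB, of dirty data has accumulated. On releases that provide this
+parameter it can be read like any other module parameter:
+
+::
+
+   cat /sys/module/zfs/parameters/zfs_dirty_data_sync_percent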
+ ++-----------------------------+---------------------------------------+ +| zfs_dirty_data_sync_percent | Notes | ++=============================+=======================================+ +| Tags | `write_throttle <#write_throttle>`__, | +| | `ZIO_scheduler <#ZIO_scheduler>`__ | ++-----------------------------+---------------------------------------+ +| When to change | TBD | ++-----------------------------+---------------------------------------+ +| Data Type | int | ++-----------------------------+---------------------------------------+ +| Units | percent | ++-----------------------------+---------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_async_write_ac | +| | tive_min_dirty_percent <#zfs_vdev_asy | +| | nc_write_active_min_dirty_percent>`__ | ++-----------------------------+---------------------------------------+ +| Default | 20 | ++-----------------------------+---------------------------------------+ +| Change | Dynamic | ++-----------------------------+---------------------------------------+ +| Versions Affected | planned for v2, deprecates | +| | `zfs_dirt | +| | y_data_sync <#zfs_dirty_data_sync>`__ | ++-----------------------------+---------------------------------------+ + +zfs_fletcher_4_impl +~~~~~~~~~~~~~~~~~~~ + +Fletcher-4 is the default checksum algorithm for metadata and data. When +the zfs kernel module is loaded, a set of microbenchmarks are run to +determine the fastest algorithm for the current hardware. The +``zfs_fletcher_4_impl`` parameter allows a specific implementation to be +specified other than the default (fastest). Selectors other than +*fastest* and *scalar* require instruction set extensions to be +available and will only appear if ZFS detects their presence. The +*scalar* implementation works on all processors. + +The results of the microbenchmark are visible in the +``/proc/spl/kstat/zfs/fletcher_4_bench`` file. Larger numbers indicate +better performance. Since ZFS is processor endian-independent, the +microbenchmark is run against both big and little-endian transformation. 
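+
+For example, the benchmark results and the currently selected
+implementation can be inspected, and a specific implementation forced,
+as follows (forcing ``scalar`` is shown only as an example; the default
+``fastest`` is normally the right choice):
+
+::
+
+   # microbenchmark results; larger numbers indicate better performance
+   cat /proc/spl/kstat/zfs/fletcher_4_bench
+
+   # list the available implementations (the active one is typically
+   # shown in brackets)
+   cat /sys/module/zfs/parameters/zfs_fletcher_4_impl
+
+   # example only: force the portable scalar implementation
+   echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl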
+ ++---------------------+-----------------------------------------------+ +| zfs_fletcher_4_impl | Notes | ++=====================+===============================================+ +| Tags | `CPU <#cpu>`__, `checksum <#checksum>`__ | ++---------------------+-----------------------------------------------+ +| When to change | Testing Fletcher-4 algorithms | ++---------------------+-----------------------------------------------+ +| Data Type | string | ++---------------------+-----------------------------------------------+ +| Range | *fastest*, *scalar*, *superscalar*, | +| | *superscalar4*, *sse2*, *ssse3*, *avx2*, | +| | *avx512f*, or *aarch64_neon* depending on | +| | hardware support | ++---------------------+-----------------------------------------------+ +| Default | fastest | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++---------------------+-----------------------------------------------+ + +zfs_free_bpobj_enabled +~~~~~~~~~~~~~~~~~~~~~~ + +The processing of the free_bpobj object can be enabled by +``zfs_free_bpobj_enabled`` + ++------------------------+--------------------------------------------+ +| zfs_free_bpobj_enabled | Notes | ++========================+============================================+ +| Tags | `delete <#delete>`__ | ++------------------------+--------------------------------------------+ +| When to change | If there's a problem with processing | +| | free_bpobj (e.g. i/o error or bug) | ++------------------------+--------------------------------------------+ +| Data Type | boolean | ++------------------------+--------------------------------------------+ +| Range | 0=do not process free_bpobj objects, | +| | 1=process free_bpobj objects | ++------------------------+--------------------------------------------+ +| Default | 1 | ++------------------------+--------------------------------------------+ +| Change | Dynamic | ++------------------------+--------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++------------------------+--------------------------------------------+ + +zfs_free_max_blocks +~~~~~~~~~~~~~~~~~~~ + +``zfs_free_max_blocks`` sets the maximum number of blocks to be freed in +a single transaction group (txg). For workloads that delete (free) large +numbers of blocks in a short period of time, the processing of the frees +can negatively impact other operations, including txg commits. +``zfs_free_max_blocks`` acts as a limit to reduce the impact. 
+ ++---------------------+-----------------------------------------------+ +| zfs_free_max_blocks | Notes | ++=====================+===============================================+ +| Tags | `filesystem <#filesystem>`__, | +| | `delete <#delete>`__ | ++---------------------+-----------------------------------------------+ +| When to change | For workloads that delete large files, | +| | ``zfs_free_max_blocks`` can be adjusted to | +| | meet performance requirements while reducing | +| | the impacts of deletion | ++---------------------+-----------------------------------------------+ +| Data Type | ulong | ++---------------------+-----------------------------------------------+ +| Units | blocks | ++---------------------+-----------------------------------------------+ +| Range | 1 to ULONG_MAX | ++---------------------+-----------------------------------------------+ +| Default | 100,000 | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++---------------------+-----------------------------------------------+ + +zfs_vdev_async_read_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Maximum asynchronous read I/Os active to each device. + ++--------------------------------+------------------------------------+ +| zfs_vdev_async_read_max_active | Notes | ++================================+====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------------+------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------------+------------------------------------+ +| Data Type | uint32 | ++--------------------------------+------------------------------------+ +| Units | I/O operations | ++--------------------------------+------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_ma | +| | x_active <#zfs_vdev_max_active>`__ | ++--------------------------------+------------------------------------+ +| Default | 3 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------------+------------------------------------+ + +zfs_vdev_async_read_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Minimum asynchronous read I/Os active to each device. 
+ ++--------------------------------+------------------------------------+ +| zfs_vdev_async_read_min_active | Notes | ++================================+====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------------+------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------------+------------------------------------+ +| Data Type | uint32 | ++--------------------------------+------------------------------------+ +| Units | I/O operations | ++--------------------------------+------------------------------------+ +| Range | 1 to | +| | ( | +| | `zfs_vdev_async_read_max_active <# | +| | zfs_vdev_async_read_max_active>`__ | +| | - 1) | ++--------------------------------+------------------------------------+ +| Default | 1 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------------+------------------------------------+ + +zfs_vdev_async_write_active_max_dirty_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When the amount of dirty data exceeds the threshold +``zfs_vdev_async_write_active_max_dirty_percent`` of +`zfs_dirty_data_max <#zfs_dirty_data_max>`__ dirty data, then +`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ +is used to limit active async writes. If the dirty data is between +`zfs_vdev_async_write_active_min_dirty_percent <#zfs_vdev_async_write_active_min_dirty_percent>`__ +and ``zfs_vdev_async_write_active_max_dirty_percent``, the active I/O +limit is linearly interpolated between +`zfs_vdev_async_write_min_active <#zfs_vdev_async_write_min_active>`__ +and +`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ + ++----------------------------------+----------------------------------+ +| zfs_vdev_asyn | Notes | +| c_write_active_max_dirty_percent | | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `Z | +| | IO_scheduler <#zio_scheduler>`__ | ++----------------------------------+----------------------------------+ +| When to change | See `ZFS I/O | +| | Sch | +| | eduler `__ | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | percent of | +| | `zfs_dirty_d | +| | ata_max <#zfs_dirty_data_max>`__ | ++----------------------------------+----------------------------------+ +| Range | 0 to 100 | ++----------------------------------+----------------------------------+ +| Default | 60 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_async_write_active_min_dirty_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If the amount of dirty data is between +``zfs_vdev_async_write_active_min_dirty_percent`` and +`zfs_vdev_async_write_active_max_dirty_percent <#zfs_vdev_async_write_active_max_dirty_percent>`__ +of `zfs_dirty_data_max <#zfs_dirty_data_max>`__, the active I/O limit is +linearly interpolated between +`zfs_vdev_async_write_min_active <#zfs_vdev_async_write_min_active>`__ +and 
+`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ + ++----------------------------------+----------------------------------+ +| zfs_vdev_asyn | Notes | +| c_write_active_min_dirty_percent | | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `Z | +| | IO_scheduler <#zio_scheduler>`__ | ++----------------------------------+----------------------------------+ +| When to change | See `ZFS I/O | +| | Sch | +| | eduler `__ | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | percent of zfs_dirty_data_max | ++----------------------------------+----------------------------------+ +| Range | 0 to | +| | (`z | +| | fs_vdev_async_write_active_max_d | +| | irty_percent <#zfs_vdev_async_wr | +| | ite_active_max_dirty_percent>`__ | +| | - 1) | ++----------------------------------+----------------------------------+ +| Default | 30 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_async_write_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_async_write_max_active`` sets the maximum asynchronous write +I/Os active to each device. + ++---------------------------------+-----------------------------------+ +| zfs_vdev_async_write_max_active | Notes | ++=================================+===================================+ +| Tags | `vdev <#vdev>`__, | +| | ` | +| | ZIO_scheduler <#zio_scheduler>`__ | ++---------------------------------+-----------------------------------+ +| When to change | See `ZFS I/O | +| | S | +| | cheduler `__ | ++---------------------------------+-----------------------------------+ +| Data Type | uint32 | ++---------------------------------+-----------------------------------+ +| Units | I/O operations | ++---------------------------------+-----------------------------------+ +| Range | 1 to | +| | `zfs_vdev_max | +| | _active <#zfs_vdev_max_active>`__ | ++---------------------------------+-----------------------------------+ +| Default | 10 | ++---------------------------------+-----------------------------------+ +| Change | Dynamic | ++---------------------------------+-----------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------------------+-----------------------------------+ + +zfs_vdev_async_write_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_async_write_min_active`` sets the minimum asynchronous write +I/Os active to each device. + +Lower values are associated with better latency on rotational media but +poorer resilver performance. The default value of 2 was chosen as a +compromise. A value of 3 has been shown to improve resilver performance +further at a cost of further increasing latency. 
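+
+If resilver throughput matters more than write latency on a particular
+system, the trade-off described above could be tried as follows (3 is
+simply the value mentioned above, not a general recommendation):
+
+::
+
+   echo 3 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active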
+ ++---------------------------------+-----------------------------------+ +| zfs_vdev_async_write_min_active | Notes | ++=================================+===================================+ +| Tags | `vdev <#vdev>`__, | +| | ` | +| | ZIO_scheduler <#zio_scheduler>`__ | ++---------------------------------+-----------------------------------+ +| When to change | See `ZFS I/O | +| | S | +| | cheduler `__ | ++---------------------------------+-----------------------------------+ +| Data Type | uint32 | ++---------------------------------+-----------------------------------+ +| Units | I/O operations | ++---------------------------------+-----------------------------------+ +| Range | 1 to | +| | `zfs | +| | _vdev_async_write_max_active <#zf | +| | s_vdev_async_write_max_active>`__ | ++---------------------------------+-----------------------------------+ +| Default | 1 for v0.6.x, 2 for v0.7.0 and | +| | later | ++---------------------------------+-----------------------------------+ +| Change | Dynamic | ++---------------------------------+-----------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------------------+-----------------------------------+ + +zfs_vdev_max_active +~~~~~~~~~~~~~~~~~~~ + +The maximum number of I/Os active to each device. Ideally, +``zfs_vdev_max_active`` >= the sum of each queue's max_active. + +Once queued to the device, the ZFS I/O scheduler is no longer able to +prioritize I/O operations. The underlying device drivers have their own +scheduler and queue depth limits. Values larger than the device's +maximum queue depth can have the affect of increased latency as the I/Os +are queued in the intervening device driver layers. + ++---------------------+-----------------------------------------------+ +| zfs_vdev_max_active | Notes | ++=====================+===============================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++---------------------+-----------------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++---------------------+-----------------------------------------------+ +| Data Type | uint32 | ++---------------------+-----------------------------------------------+ +| Units | I/O operations | ++---------------------+-----------------------------------------------+ +| Range | sum of each queue's min_active to UINT32_MAX | ++---------------------+-----------------------------------------------+ +| Default | 1,000 | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------+-----------------------------------------------+ + +zfs_vdev_scrub_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_scrub_max_active`` sets the maximum scrub or scan read I/Os +active to each device. 
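+
+One way to review all of the per-queue limits at once, and to compare
+their sum against ``zfs_vdev_max_active`` as suggested above, is shown
+below (a sketch; the exact set of ``zfs_vdev_*_active`` parameters
+depends on the ZFS version):
+
+::
+
+   # all per-queue min/max limits plus zfs_vdev_max_active itself
+   grep . /sys/module/zfs/parameters/zfs_vdev_*_active
+
+   # sum of the per-queue max_active values, for comparison with
+   # zfs_vdev_max_active
+   awk '{ sum += $1 } END { print sum }' \
+       /sys/module/zfs/parameters/zfs_vdev_*_max_active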
+ ++---------------------------+-----------------------------------------+ +| zfs_vdev_scrub_max_active | Notes | ++===========================+=========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__, | +| | `scrub <#scrub>`__, | +| | `resilver <#resilver>`__ | ++---------------------------+-----------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++---------------------------+-----------------------------------------+ +| Data Type | uint32 | ++---------------------------+-----------------------------------------+ +| Units | I/O operations | ++---------------------------+-----------------------------------------+ +| Range | 1 to | +| | `zfs_vd | +| | ev_max_active <#zfs_vdev_max_active>`__ | ++---------------------------+-----------------------------------------+ +| Default | 2 | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------------+-----------------------------------------+ + +zfs_vdev_scrub_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_scrub_min_active`` sets the minimum scrub or scan read I/Os +active to each device. + ++---------------------------+-----------------------------------------+ +| zfs_vdev_scrub_min_active | Notes | ++===========================+=========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__, | +| | `scrub <#scrub>`__, | +| | `resilver <#resilver>`__ | ++---------------------------+-----------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++---------------------------+-----------------------------------------+ +| Data Type | uint32 | ++---------------------------+-----------------------------------------+ +| Units | I/O operations | ++---------------------------+-----------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_scrub_max | +| | _active <#zfs_vdev_scrub_max_active>`__ | ++---------------------------+-----------------------------------------+ +| Default | 1 | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Versions Affected | v0.6.4 and later | ++---------------------------+-----------------------------------------+ + +zfs_vdev_sync_read_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Maximum synchronous read I/Os active to each device. 
+ ++-------------------------------+-------------------------------------+ +| zfs_vdev_sync_read_max_active | Notes | ++===============================+=====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------------------+-------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++-------------------------------+-------------------------------------+ +| Data Type | uint32 | ++-------------------------------+-------------------------------------+ +| Units | I/O operations | ++-------------------------------+-------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_m | +| | ax_active <#zfs_vdev_max_active>`__ | ++-------------------------------+-------------------------------------+ +| Default | 10 | ++-------------------------------+-------------------------------------+ +| Change | Dynamic | ++-------------------------------+-------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-------------------------------+-------------------------------------+ + +zfs_vdev_sync_read_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_sync_read_min_active`` sets the minimum synchronous read I/Os +active to each device. + ++-------------------------------+-------------------------------------+ +| zfs_vdev_sync_read_min_active | Notes | ++===============================+=====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------------------+-------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++-------------------------------+-------------------------------------+ +| Data Type | uint32 | ++-------------------------------+-------------------------------------+ +| Units | I/O operations | ++-------------------------------+-------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_sync_read_max_active | +| | <#zfs_vdev_sync_read_max_active>`__ | ++-------------------------------+-------------------------------------+ +| Default | 10 | ++-------------------------------+-------------------------------------+ +| Change | Dynamic | ++-------------------------------+-------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-------------------------------+-------------------------------------+ + +zfs_vdev_sync_write_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_sync_write_max_active`` sets the maximum synchronous write +I/Os active to each device. 
+ ++--------------------------------+------------------------------------+ +| zfs_vdev_sync_write_max_active | Notes | ++================================+====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------------+------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------------+------------------------------------+ +| Data Type | uint32 | ++--------------------------------+------------------------------------+ +| Units | I/O operations | ++--------------------------------+------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_ma | +| | x_active <#zfs_vdev_max_active>`__ | ++--------------------------------+------------------------------------+ +| Default | 10 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------------+------------------------------------+ + +zfs_vdev_sync_write_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_sync_write_min_active`` sets the minimum synchronous write +I/Os active to each device. + ++--------------------------------+------------------------------------+ +| zfs_vdev_sync_write_min_active | Notes | ++================================+====================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------------+------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------------+------------------------------------+ +| Data Type | uint32 | ++--------------------------------+------------------------------------+ +| Units | I/O operations | ++--------------------------------+------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_sync_write_max_active <# | +| | zfs_vdev_sync_write_max_active>`__ | ++--------------------------------+------------------------------------+ +| Default | 10 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------------+------------------------------------+ + +zfs_vdev_queue_depth_pct +~~~~~~~~~~~~~~~~~~~~~~~~ + +Maximum number of queued allocations per top-level vdev expressed as a +percentage of +`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__. +This allows the system to detect devices that are more capable of +handling allocations and to allocate more blocks to those devices. It +also allows for dynamic allocation distribution when devices are +imbalanced as fuller devices will tend to be slower than empty devices. +Once the queue depth reaches (``zfs_vdev_queue_depth_pct`` \* +`zfs_vdev_async_write_max_active <#zfs_vdev_async_write_max_active>`__ / +100) then allocator will stop allocating blocks on that top-level device +and switch to the next. 
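+
+With the default values this works out to a limit of 100 queued
+allocations per top-level vdev, as a worked example of the formula above
+(not a tuning recommendation):
+
+::
+
+   # zfs_vdev_queue_depth_pct * zfs_vdev_async_write_max_active / 100
+   # = 1000 * 10 / 100 = 100 queued allocations per top-level vdev
+   grep . /sys/module/zfs/parameters/zfs_vdev_queue_depth_pct \
+          /sys/module/zfs/parameters/zfs_vdev_async_write_max_active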
+ +See also `zio_dva_throttle_enabled <#zio_dva_throttle_enabled>`__ + ++--------------------------+------------------------------------------+ +| zfs_vdev_queue_depth_pct | Notes | ++==========================+==========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------+------------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------+------------------------------------------+ +| Data Type | uint32 | ++--------------------------+------------------------------------------+ +| Units | I/O operations | ++--------------------------+------------------------------------------+ +| Range | 1 to UINT32_MAX | ++--------------------------+------------------------------------------+ +| Default | 1,000 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------------+------------------------------------------+ + +zfs_disable_dup_eviction +~~~~~~~~~~~~~~~~~~~~~~~~ + +Disable duplicate buffer eviction from ARC. + ++--------------------------+------------------------------------------+ +| zfs_disable_dup_eviction | Notes | ++==========================+==========================================+ +| Tags | `ARC <#arc>`__, `dedup <#dedup>`__ | ++--------------------------+------------------------------------------+ +| When to change | TBD | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=duplicate buffers can be evicted, 1=do | +| | not evict duplicate buffers | ++--------------------------+------------------------------------------+ +| Default | 0 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.6.5, deprecated in v0.7.0 | ++--------------------------+------------------------------------------+ + +zfs_expire_snapshot +~~~~~~~~~~~~~~~~~~~ + +Snapshots of filesystems are normally automounted under the filesystem's +``.zfs/snapshot`` subdirectory. When not in use, snapshots are unmounted +after zfs_expire_snapshot seconds. 
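+
+For example, simply listing a snapshot's directory triggers the
+automount, and the mount expires again after ``zfs_expire_snapshot``
+seconds of inactivity (the dataset and snapshot names below are
+hypothetical):
+
+::
+
+   # visiting the directory automounts the snapshot
+   ls /tank/data/.zfs/snapshot/before-upgrade
+
+   # the automount expires after zfs_expire_snapshot seconds (default 300)
+   cat /sys/module/zfs/parameters/zfs_expire_snapshot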
+
++---------------------+-----------------------------------------------+
+| zfs_expire_snapshot | Notes |
++=====================+===============================================+
+| Tags | `filesystem <#filesystem>`__, |
+| | `snapshot <#snapshot>`__ |
++---------------------+-----------------------------------------------+
+| When to change | TBD |
++---------------------+-----------------------------------------------+
+| Data Type | int |
++---------------------+-----------------------------------------------+
+| Units | seconds |
++---------------------+-----------------------------------------------+
+| Range | 0 disables automatic unmounting, maximum time |
+| | is INT_MAX |
++---------------------+-----------------------------------------------+
+| Default | 300 |
++---------------------+-----------------------------------------------+
+| Change | Dynamic |
++---------------------+-----------------------------------------------+
+| Versions Affected | v0.6.1 and later |
++---------------------+-----------------------------------------------+
+
+zfs_admin_snapshot
+~~~~~~~~~~~~~~~~~~
+
+Allow the creation, removal, or renaming of entries in the
+``.zfs/snapshot`` subdirectory to cause the creation, destruction, or
+renaming of snapshots. When enabled, this functionality works both
+locally and over NFS exports which have the "no_root_squash" option set.
+
++--------------------+------------------------------------------------+
+| zfs_admin_snapshot | Notes |
++====================+================================================+
+| Tags | `filesystem <#filesystem>`__, |
+| | `snapshot <#snapshot>`__ |
++--------------------+------------------------------------------------+
+| When to change | TBD |
++--------------------+------------------------------------------------+
+| Data Type | boolean |
++--------------------+------------------------------------------------+
+| Range | 0=do not allow snapshot manipulation via the |
+| | filesystem, 1=allow snapshot manipulation via |
+| | the filesystem |
++--------------------+------------------------------------------------+
+| Default | 0 |
++--------------------+------------------------------------------------+
+| Change | Dynamic |
++--------------------+------------------------------------------------+
+| Versions Affected | v0.6.5 and later |
++--------------------+------------------------------------------------+
+
+zfs_flags
+~~~~~~~~~
+
+Set additional debugging flags (see
+`zfs_dbgmsg_enable <#zfs_dbgmsg_enable>`__).
+
++------------+---------------------------+---------------------------+
+| flag value | symbolic name | description |
++============+===========================+===========================+
+| 0x1 | ZFS_DEBUG_DPRINTF | Enable dprintf entries in |
+| | | the debug log |
++------------+---------------------------+---------------------------+
+| 0x2 | ZFS_DEBUG_DBUF_VERIFY | Enable extra dbuf |
+| | | verifications |
++------------+---------------------------+---------------------------+
+| 0x4 | ZFS_DEBUG_DNODE_VERIFY | Enable extra dnode |
+| | | verifications |
++------------+---------------------------+---------------------------+
+| 0x8 | ZFS_DEBUG_SNAPNAMES | Enable snapshot name |
+| | | verification |
++------------+---------------------------+---------------------------+
+| 0x10 | ZFS_DEBUG_MODIFY | Check for illegally |
+| | | modified ARC buffers |
++------------+---------------------------+---------------------------+
+| 0x20 | ZFS_DEBUG_SPA | Enable spa_dbgmsg entries |
+| | | in the debug log |
++------------+---------------------------+---------------------------+
+| 0x40 | ZFS_DEBUG_ZIO_FREE | Enable verification of |
+| | | block frees |
++------------+---------------------------+---------------------------+
+| 0x80 | ZFS_DEBUG_HISTOGRAM_VERIFY | Enable extra spacemap |
+| | | histogram verifications |
++------------+---------------------------+---------------------------+
+| 0x100 | ZFS_DEBUG_METASLAB_VERIFY | Verify space accounting |
+| | | on disk matches in-core |
+| | | range_trees |
++------------+---------------------------+---------------------------+
+| 0x200 | ZFS_DEBUG_SET_ERROR | Enable SET_ERROR and |
+| | | dprintf entries in the |
+| | | debug log |
++------------+---------------------------+---------------------------+
+
++-------------------+-------------------------------------------------+
+| zfs_flags | Notes |
++===================+=================================================+
+| Tags | `debug <#debug>`__ |
++-------------------+-------------------------------------------------+
+| When to change | When debugging ZFS |
++-------------------+-------------------------------------------------+
+| Data Type | int |
++-------------------+-------------------------------------------------+
+| Default | 0 no debug flags set, for debug builds: all |
+| | except ZFS_DEBUG_DPRINTF and ZFS_DEBUG_SPA |
++-------------------+-------------------------------------------------+
+| Change | Dynamic |
++-------------------+-------------------------------------------------+
+| Versions Affected | v0.6.4 and later |
++-------------------+-------------------------------------------------+
+
+zfs_free_leak_on_eio
+~~~~~~~~~~~~~~~~~~~~
+
+If destroy encounters an I/O error (EIO) while reading metadata (e.g.
+indirect blocks), space referenced by the missing metadata cannot be
+freed. Normally, this causes the background destroy to become "stalled",
+as the destroy is unable to make forward progress. While in this stalled
+state, all remaining space to free from the error-encountering
+filesystem is temporarily leaked. Set ``zfs_free_leak_on_eio = 1`` to
+ignore the EIO, permanently leak the space from indirect blocks that
+cannot be read, and continue to free everything else that it can.
+
+The default stalling behavior is useful if the storage partially fails
+(e.g. some but not all I/Os fail) and then later recovers. In this case,
+we will be able to continue pool operations while it is partially
+failed, and when it recovers, we can continue to free the space, with no
+leaks. However, note that this case is rare.
+
+Typically pools either:
+
+1. fail completely, but perhaps temporarily (e.g. a top-level vdev going
+   offline), or
+
+2. have localized, permanent errors (e.g. a disk returns the wrong data
+   due to a bit flip or firmware bug)
+
+In case (1), the ``zfs_free_leak_on_eio`` setting does not matter
+because the pool will be suspended and the sync thread will not be able
+to make forward progress. In case (2), because the error is permanent,
+the best we can do is leak the minimum amount of space. Therefore, it is
+reasonable for ``zfs_free_leak_on_eio`` to be set, but by default the
+more conservative approach is taken, so that there is no possibility of
+leaking space in the "partial temporary" failure case.
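+
+The amount of space that a background destroy still has to reclaim is
+visible as the pool's ``freeing`` property, which is one way to notice
+the stalled state described above (the pool name is illustrative):
+
+::
+
+   zpool get freeing tank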
+ ++----------------------+----------------------------------------------+ +| zfs_free_leak_on_eio | Notes | ++======================+==============================================+ +| Tags | `debug <#debug>`__ | ++----------------------+----------------------------------------------+ +| When to change | When debugging I/O errors during destroy | ++----------------------+----------------------------------------------+ +| Data Type | boolean | ++----------------------+----------------------------------------------+ +| Range | 0=normal behavior, 1=ignore error and | +| | permanently leak space | ++----------------------+----------------------------------------------+ +| Default | 0 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++----------------------+----------------------------------------------+ + +zfs_free_min_time_ms +~~~~~~~~~~~~~~~~~~~~ + +During a ``zfs destroy`` operation using ``feature@async_destroy`` a +minimum of ``zfs_free_min_time_ms`` time will be spent working on +freeing blocks per txg commit. + +==================== ============================== +zfs_free_min_time_ms Notes +==================== ============================== +Tags `delete <#delete>`__ +When to change TBD +Data Type int +Units milliseconds +Range 1 to (zfs_txg_timeout \* 1000) +Default 1,000 +Change Dynamic +Versions Affected v0.6.0 and later +==================== ============================== + +zfs_immediate_write_sz +~~~~~~~~~~~~~~~~~~~~~~ + +If a pool does not have a log device, data blocks equal to or larger +than ``zfs_immediate_write_sz`` are treated as if the dataset being +written to had the property setting ``logbias=throughput`` + +Terminology note: ``logbias=throughput`` writes the blocks in "indirect +mode" to the ZIL where the data is written to the pool and a pointer to +the data is written to the ZIL. + ++------------------------+--------------------------------------------+ +| zfs_immediate_write_sz | Notes | ++========================+============================================+ +| Tags | `ZIL <#zil>`__ | ++------------------------+--------------------------------------------+ +| When to change | TBD | ++------------------------+--------------------------------------------+ +| Data Type | long | ++------------------------+--------------------------------------------+ +| Units | bytes | ++------------------------+--------------------------------------------+ +| Range | 512 to 16,777,216 (valid block sizes) | ++------------------------+--------------------------------------------+ +| Default | 32,768 (32 KiB) | ++------------------------+--------------------------------------------+ +| Change | Dynamic | ++------------------------+--------------------------------------------+ +| Verification | Data blocks that exceed | +| | ``zfs_immediate_write_sz`` or are written | +| | as ``logbias=throughput`` increment the | +| | ``zil_itx_indirect_count`` entry in | +| | ``/proc/spl/kstat/zfs/zil`` | ++------------------------+--------------------------------------------+ +| Versions Affected | all | ++------------------------+--------------------------------------------+ + +zfs_max_recordsize +~~~~~~~~~~~~~~~~~~ + +ZFS supports logical record (block) sizes from 512 bytes to 16 MiB. The +benefits of larger blocks, and thus larger average I/O sizes, can be +weighed against the cost of copy-on-write of large block to modify one +byte. 
Additionally, very large blocks can have a negative impact on both +I/O latency at the device level and the memory allocator. The +``zfs_max_recordsize`` parameter limits the upper bound of the dataset +volblocksize and recordsize properties. + +Larger blocks can be created by enabling ``zpool`` ``large_blocks`` +feature and changing this ``zfs_max_recordsize``. Pools with larger +blocks can always be imported and used, regardless of the value of +``zfs_max_recordsize``. + +For 32-bit systems, ``zfs_max_recordsize`` also limits the size of +kernel virtual memory caches used in the ZFS I/O pipeline (``zio_buf_*`` +and ``zio_data_buf_*``). + +See also the ``zpool`` ``large_blocks`` feature. + ++--------------------+------------------------------------------------+ +| zfs_max_recordsize | Notes | ++====================+================================================+ +| Tags | `filesystem <#filesystem>`__, | +| | `memory <#memory>`__, `volume <#volume>`__ | ++--------------------+------------------------------------------------+ +| When to change | To create datasets with larger volblocksize or | +| | recordsize | ++--------------------+------------------------------------------------+ +| Data Type | int | ++--------------------+------------------------------------------------+ +| Units | bytes | ++--------------------+------------------------------------------------+ +| Range | 512 to 16,777,216 (valid block sizes) | ++--------------------+------------------------------------------------+ +| Default | 1,048,576 | ++--------------------+------------------------------------------------+ +| Change | Dynamic, set prior to creating volumes or | +| | changing filesystem recordsize | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.6.5 and later | ++--------------------+------------------------------------------------+ + +zfs_mdcomp_disable +~~~~~~~~~~~~~~~~~~ + +``zfs_mdcomp_disable`` allows metadata compression to be disabled. + +================== =============================================== +zfs_mdcomp_disable Notes +================== =============================================== +Tags `CPU <#cpu>`__, `metadata <#metadata>`__ +When to change When CPU cycles cost less than I/O +Data Type boolean +Range 0=compress metadata, 1=do not compress metadata +Default 0 +Change Dynamic +Versions Affected from v0.6.0 to v0.8.0 +================== =============================================== + +zfs_metaslab_fragmentation_threshold +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Allow metaslabs to keep their active state as long as their +fragmentation percentage is less than or equal to this value. When +writing, an active metaslab whose fragmentation percentage exceeds +``zfs_metaslab_fragmentation_threshold`` is avoided allowing metaslabs +with less fragmentation to be preferred. + +Metaslab fragmentation is used to calculate the overall pool +``fragmentation`` property value. However, individual metaslab +fragmentation levels are observable using the ``zdb`` with the ``-mm`` +option. + +``zfs_metaslab_fragmentation_threshold`` works at the metaslab level and +each top-level vdev has approximately +`metaslabs_per_vdev <#metaslabs_per_vdev>`__ metaslabs. 
See also +`zfs_mg_fragmentation_threshold <#zfs_mg_fragmentation_threshold>`__ + ++----------------------------------+----------------------------------+ +| zfs_ | Notes | +| metaslab_fragmentation_threshold | | ++==================================+==================================+ +| Tags | `allocation <#allocation>`__, | +| | `fr | +| | agmentation <#fragmentation>`__, | +| | `vdev <#vdev>`__ | ++----------------------------------+----------------------------------+ +| When to change | Testing metaslab allocation | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | percent | ++----------------------------------+----------------------------------+ +| Range | 1 to 100 | ++----------------------------------+----------------------------------+ +| Default | 70 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.6.4 and later | ++----------------------------------+----------------------------------+ + +zfs_mg_fragmentation_threshold +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Metaslab groups (top-level vdevs) are considered eligible for +allocations if their fragmentation percentage metric is less than or +equal to ``zfs_mg_fragmentation_threshold``. If a metaslab group exceeds +this threshold then it will be skipped unless all metaslab groups within +the metaslab class have also crossed the +``zfs_mg_fragmentation_threshold`` threshold. + ++--------------------------------+------------------------------------+ +| zfs_mg_fragmentation_threshold | Notes | ++================================+====================================+ +| Tags | `allocation <#allocation>`__, | +| | ` | +| | fragmentation <#fragmentation>`__, | +| | `vdev <#vdev>`__ | ++--------------------------------+------------------------------------+ +| When to change | Testing metaslab allocation | ++--------------------------------+------------------------------------+ +| Data Type | int | ++--------------------------------+------------------------------------+ +| Units | percent | ++--------------------------------+------------------------------------+ +| Range | 1 to 100 | ++--------------------------------+------------------------------------+ +| Default | 85 | ++--------------------------------+------------------------------------+ +| Change | Dynamic | ++--------------------------------+------------------------------------+ +| Versions Affected | v0.6.4 and later | ++--------------------------------+------------------------------------+ + +zfs_mg_noalloc_threshold +~~~~~~~~~~~~~~~~~~~~~~~~ + +Metaslab groups (top-level vdevs) with free space percentage greater +than ``zfs_mg_noalloc_threshold`` are eligible for new allocations. If a +metaslab group's free space is less than or equal to the threshold, the +allocator avoids allocating to that group unless all groups in the pool +have reached the threshold. Once all metaslab groups have reached the +threshold, all metaslab groups are allowed to accept allocations. The +default value of 0 disables the feature and causes all metaslab groups +to be eligible for allocations. + +This parameter allows one to deal with pools having heavily imbalanced +vdevs such as would be the case when a new vdev has been added. 
Setting +the threshold to a non-zero percentage will stop allocations from being +made to vdevs that aren't filled to the specified percentage and allow +lesser filled vdevs to acquire more allocations than they otherwise +would under the older ``zfs_mg_alloc_failures`` facility. + ++--------------------------+------------------------------------------+ +| zfs_mg_noalloc_threshold | Notes | ++==========================+==========================================+ +| Tags | `allocation <#allocation>`__, | +| | `fragmentation <#fragmentation>`__, | +| | `vdev <#vdev>`__ | ++--------------------------+------------------------------------------+ +| When to change | To force rebalancing as top-level vdevs | +| | are added or expanded | ++--------------------------+------------------------------------------+ +| Data Type | int | ++--------------------------+------------------------------------------+ +| Units | percent | ++--------------------------+------------------------------------------+ +| Range | 0 to 100 | ++--------------------------+------------------------------------------+ +| Default | 0 (disabled) | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------------+------------------------------------------+ + +zfs_multihost_history +~~~~~~~~~~~~~~~~~~~~~ + +The pool ``multihost`` multimodifier protection (MMP) subsystem can +record historical updates in the +``/proc/spl/kstat/zfs/POOL_NAME/multihost`` file for debugging purposes. +The number of lines of history is determined by zfs_multihost_history. + +===================== ==================================== +zfs_multihost_history Notes +===================== ==================================== +Tags `MMP <#mmp>`__, `import <#import>`__ +When to change When testing multihost feature +Data Type int +Units lines +Range 0 to INT_MAX +Default 0 +Change Dynamic +Versions Affected v0.7.0 and later +===================== ==================================== + +zfs_multihost_interval +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_multihost_interval`` controls the frequency of multihost writes +performed by the pool multihost multimodifier protection (MMP) +subsystem. The multihost write period is (``zfs_multihost_interval`` / +number of leaf-vdevs) milliseconds. Thus on average a multihost write +will be issued for each leaf vdev every ``zfs_multihost_interval`` +milliseconds. In practice, the observed period can vary with the I/O +load and this observed value is the delay which is stored in the +uberblock. + +On import the multihost activity check waits a minimum amount of time +determined by (``zfs_multihost_interval`` \* +`zfs_multihost_import_intervals <#zfs_multihost_import_intervals>`__) +with a lower bound of 1 second. The activity check time may be further +extended if the value of mmp delay found in the best uberblock indicates +actual multihost updates happened at longer intervals than +``zfs_multihost_interval`` + +Note: the multihost protection feature applies to storage devices that +can be shared between multiple systems. 
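+
+As a rough worked example with the default values: the import-time
+activity check described above waits at least the product of
+``zfs_multihost_interval`` and ``zfs_multihost_import_intervals``, i.e.
+1000 ms times 10, or about 10 seconds, plus the random factor described
+below. The current settings can be listed like any other module
+parameters (parameter names vary slightly between releases):
+
+::
+
+   grep . /sys/module/zfs/parameters/zfs_multihost_*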
+
++------------------------+--------------------------------------------+
+| zfs_multihost_interval | Notes                                      |
++========================+============================================+
+| Tags                   | `MMP <#mmp>`__, `import <#import>`__,      |
+|                        | `vdev <#vdev>`__                           |
++------------------------+--------------------------------------------+
+| When to change         | To optimize pool import time against       |
+|                        | possibility of simultaneous import by      |
+|                        | another system                             |
++------------------------+--------------------------------------------+
+| Data Type              | ulong                                      |
++------------------------+--------------------------------------------+
+| Units                  | milliseconds                               |
++------------------------+--------------------------------------------+
+| Range                  | 100 to ULONG_MAX                           |
++------------------------+--------------------------------------------+
+| Default                | 1000                                       |
++------------------------+--------------------------------------------+
+| Change                 | Dynamic                                    |
++------------------------+--------------------------------------------+
+| Versions Affected      | v0.7.0 and later                           |
++------------------------+--------------------------------------------+
+
+zfs_multihost_import_intervals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_multihost_import_intervals`` controls the duration of the activity
+test on pool import for the multihost multimodifier protection (MMP)
+subsystem. The activity test can be expected to take a minimum time of
+(``zfs_multihost_import_intervals`` \*
+`zfs_multihost_interval <#zfs_multihost_interval>`__ \* ``random(25%)``)
+milliseconds. The random period of up to 25% improves simultaneous
+import detection. For example, if two hosts are rebooted at the same
+time and automatically attempt to import the pool, then it is highly
+probable that one host will win.
+
+Smaller values of ``zfs_multihost_import_intervals`` reduce the import
+time but increase the risk of failing to detect an active pool. The
+total activity check time is never allowed to drop below one second.
+
+Note: the multihost protection feature applies to storage devices that
+can be shared between multiple systems.
+
+============================== ====================================
+zfs_multihost_import_intervals Notes
+============================== ====================================
+Tags                           `MMP <#mmp>`__, `import <#import>`__
+When to change                 TBD
+Data Type                      uint
+Units                          intervals
+Range                          1 to UINT_MAX
+Default                        10
+Change                         Dynamic
+Versions Affected              v0.7.0 and later
+============================== ====================================
+
+zfs_multihost_fail_intervals
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_multihost_fail_intervals`` controls the behavior of the pool when
+write failures are detected in the multihost multimodifier protection
+(MMP) subsystem.
+
+If ``zfs_multihost_fail_intervals = 0`` then multihost write failures
+are ignored. The write failures are reported to the ZFS event daemon
+(``zed``) which can take action such as suspending the pool or offlining
+a device.
+
+| If ``zfs_multihost_fail_intervals > 0`` then sequential multihost
+  write failures will cause the pool to be suspended. This occurs when
+  (``zfs_multihost_fail_intervals`` \*
+  `zfs_multihost_interval <#zfs_multihost_interval>`__) milliseconds
+  have passed since the last successful multihost write.
+| This guarantees the activity test will see multihost writes if the
+  pool is attempted to be imported by another system.
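+
+For example, a minimal shell sketch of the suspension window described
+above (the parameter paths are the standard module parameter locations;
+this only restates the formula and is not authoritative):
+
+::
+
+   fail=$(cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals)
+   interval=$(cat /sys/module/zfs/parameters/zfs_multihost_interval)
+   if [ "$fail" -eq 0 ]; then
+       echo "MMP write failures are ignored (reported to zed only)"
+   else
+       echo "pool suspends after $((fail * interval)) ms without a successful MMP write"
+   fi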
+ +============================ ==================================== +zfs_multihost_fail_intervals Notes +============================ ==================================== +Tags `MMP <#mmp>`__, `import <#import>`__ +When to change TBD +Data Type uint +Units intervals +Range 0 to UINT_MAX +Default 5 +Change Dynamic +Versions Affected v0.7.0 and later +============================ ==================================== + +zfs_delays_per_second +~~~~~~~~~~~~~~~~~~~~~ + +The ZFS Event Daemon (zed) processes events from ZFS. However, it can be +overwhelmed by high rates of error reports which can be generated by +failing, high-performance devices. ``zfs_delays_per_second`` limits the +rate of delay events reported to zed. + ++-----------------------+---------------------------------------------+ +| zfs_delays_per_second | Notes | ++=======================+=============================================+ +| Tags | `zed <#zed>`__, `delay <#delay>`__ | ++-----------------------+---------------------------------------------+ +| When to change | If processing delay events at a higher rate | +| | is desired | ++-----------------------+---------------------------------------------+ +| Data Type | uint | ++-----------------------+---------------------------------------------+ +| Units | events per second | ++-----------------------+---------------------------------------------+ +| Range | 0 to UINT_MAX | ++-----------------------+---------------------------------------------+ +| Default | 20 | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.7.7 and later | ++-----------------------+---------------------------------------------+ + +zfs_checksums_per_second +~~~~~~~~~~~~~~~~~~~~~~~~ + +The ZFS Event Daemon (zed) processes events from ZFS. However, it can be +overwhelmed by high rates of error reports which can be generated by +failing, high-performance devices. ``zfs_checksums_per_second`` limits +the rate of checksum events reported to zed. + +Note: do not set this value lower than the SERD limit for ``checksum`` +in zed. By default, ``checksum_N`` = 10 and ``checksum_T`` = 10 minutes, +resulting in a practical lower limit of 1. + ++--------------------------+------------------------------------------+ +| zfs_checksums_per_second | Notes | ++==========================+==========================================+ +| Tags | `zed <#zed>`__, `checksum <#checksum>`__ | ++--------------------------+------------------------------------------+ +| When to change | If processing checksum error events at a | +| | higher rate is desired | ++--------------------------+------------------------------------------+ +| Data Type | uint | ++--------------------------+------------------------------------------+ +| Units | events per second | ++--------------------------+------------------------------------------+ +| Range | 0 to UINT_MAX | ++--------------------------+------------------------------------------+ +| Default | 20 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.7 and later | ++--------------------------+------------------------------------------+ + +zfs_no_scrub_io +~~~~~~~~~~~~~~~ + +When ``zfs_no_scrub_io = 1`` scrubs do not actually scrub data and +simply doing a metadata crawl of the pool instead. 
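+
+For example, a hedged sketch of using this while testing the scrub
+feature (the pool name ``tank`` is a placeholder; restore the default
+when finished):
+
+::
+
+   # Skip scrub data reads; only the metadata traversal is performed
+   echo 1 > /sys/module/zfs/parameters/zfs_no_scrub_io
+   zpool scrub tank
+   zpool status tank
+
+   # Restore normal scrub behavior
+   echo 0 > /sys/module/zfs/parameters/zfs_no_scrub_io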
+ +================= =============================================== +zfs_no_scrub_io Notes +================= =============================================== +Tags `scrub <#scrub>`__ +When to change Testing scrub feature +Data Type boolean +Range 0=perform scrub I/O, 1=do not perform scrub I/O +Default 0 +Change Dynamic +Versions Affected v0.6.0 and later +================= =============================================== + +zfs_no_scrub_prefetch +~~~~~~~~~~~~~~~~~~~~~ + +When ``zfs_no_scrub_prefetch = 1``, prefetch is disabled for scrub I/Os. + ++-----------------------+-----------------------------------------------------+ +| zfs_no_scrub_prefetch | Notes | ++=======================+=====================================================+ +| Tags | `prefetch <#prefetch>`__, `scrub <#scrub>`__ | ++-----------------------+-----------------------------------------------------+ +| When to change | Testing scrub feature | ++-----------------------+-----------------------------------------------------+ +| Data Type | boolean | ++-----------------------+-----------------------------------------------------+ +| Range | 0=prefetch scrub I/Os, 1=do not prefetch scrub I/Os | ++-----------------------+-----------------------------------------------------+ +| Default | 0 | ++-----------------------+-----------------------------------------------------+ +| Change | Dynamic | ++-----------------------+-----------------------------------------------------+ +| Versions Affected | v0.6.4 and later | ++-----------------------+-----------------------------------------------------+ + +zfs_nocacheflush +~~~~~~~~~~~~~~~~ + +ZFS uses barriers (volatile cache flush commands) to ensure data is +committed to permanent media by devices. This ensures consistent +on-media state for devices where caches are volatile (eg HDDs). + +For devices with nonvolatile caches, the cache flush operation can be a +no-op. However, in some RAID arrays, cache flushes can cause the entire +cache to be flushed to the backing devices. + +To ensure on-media consistency, keep cache flush enabled. + ++-------------------+-------------------------------------------------+ +| zfs_nocacheflush | Notes | ++===================+=================================================+ +| Tags | `disks <#disks>`__ | ++-------------------+-------------------------------------------------+ +| When to change | If the storage device has nonvolatile cache, | +| | then disabling cache flush can save the cost of | +| | occasional cache flush comamnds | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=send cache flush commands, 1=do not send | +| | cache flush commands | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +zfs_nopwrite_enabled +~~~~~~~~~~~~~~~~~~~~ + +The NOP-write feature is enabled by default when a +crytographically-secure checksum algorithm is in use by the dataset. +``zfs_nopwrite_enabled`` allows the NOP-write feature to be completely +disabled. 
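+
+A brief sketch for checking or disabling it at runtime (standard module
+parameter path; disabling is normally only useful for testing):
+
+::
+
+   # 1 = NOP-write enabled (default), 0 = disabled
+   cat /sys/module/zfs/parameters/zfs_nopwrite_enabled
+
+   # Disable NOP-write, e.g. while testing checksum or compression behavior
+   echo 0 > /sys/module/zfs/parameters/zfs_nopwrite_enabled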
+ ++----------------------+----------------------------------------------+ +| zfs_nopwrite_enabled | Notes | ++======================+==============================================+ +| Tags | `checksum <#checksum>`__, `debug <#debug>`__ | ++----------------------+----------------------------------------------+ +| When to change | TBD | ++----------------------+----------------------------------------------+ +| Data Type | boolean | ++----------------------+----------------------------------------------+ +| Range | 0=disable NOP-write feature, 1=enable | +| | NOP-write feature | ++----------------------+----------------------------------------------+ +| Default | 1 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | v0.6.0 and later | ++----------------------+----------------------------------------------+ + +zfs_dmu_offset_next_sync +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_dmu_offset_next_sync`` enables forcing txg sync to find holes. +This causes ZFS to act like older versions when ``SEEK_HOLE`` or +``SEEK_DATA`` flags are used: when a dirty dnode causes txgs to be +synced so the previous data can be found. + ++--------------------------+------------------------------------------+ +| zfs_dmu_offset_next_sync | Notes | ++==========================+==========================================+ +| Tags | `DMU <#dmu>`__ | ++--------------------------+------------------------------------------+ +| When to change | TBD | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=do not force txg sync to find holes, | +| | 1=force txg sync to find holes | ++--------------------------+------------------------------------------+ +| Default | 0 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------------+------------------------------------------+ + +zfs_pd_bytes_max +~~~~~~~~~~~~~~~~ + +``zfs_pd_bytes_max`` limits the number of bytes prefetched during a pool +traversal (eg ``zfs send`` or other data crawling operations). These +prefetches are referred to as "prescient prefetches" and are always 100% +hit rate. The traversal operations do not use the default data or +metadata prefetcher. + +================= ========================================== +zfs_pd_bytes_max Notes +================= ========================================== +Tags `prefetch <#prefetch>`__, `send <#send>`__ +When to change TBD +Data Type int32 +Units bytes +Range 0 to INT32_MAX +Default 52,428,800 (50 MiB) +Change Dynamic +Versions Affected TBD +================= ========================================== + +zfs_per_txg_dirty_frees_percent +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_per_txg_dirty_frees_percent`` as a percentage of +`zfs_dirty_data_max <#zfs_dirty_data_max>`__ controls the percentage of +dirtied blocks from frees in one txg. After the threshold is crossed, +additional dirty blocks from frees wait until the next txg. Thus, when +deleting large files, filling consecutive txgs with deletes/frees, does +not throttle other, perhaps more important, writes. 
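+
+As a hedged illustration of the threshold, the per-txg budget for dirty
+frees can be estimated from the two parameters (standard module
+parameter paths):
+
+::
+
+   pct=$(cat /sys/module/zfs/parameters/zfs_per_txg_dirty_frees_percent)
+   max=$(cat /sys/module/zfs/parameters/zfs_dirty_data_max)
+   echo "up to $((max * pct / 100)) bytes of frees per txg before deferring"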
+ +A side effect of this throttle can impact ``zfs receive`` workloads that +contain a large number of frees and the +`ignore_hole_birth <#ignore_hole_birth>`__ optimization is disabled. The +symptom is that the receive workload causes an increase in the frequency +of txg commits. The frequency of txg commits is observable via the +``otime`` column of ``/proc/spl/kstat/zfs/POOLNAME/txgs``. Since txg +commits also flush data from volatile caches in HDDs to media, HDD +performance can be negatively impacted. Also, since the frees do not +consume much bandwidth over the pipe, the pipe can appear to stall. Thus +the overall progress of receives is slower than expected. + +A value of zero will disable this throttle. + ++---------------------------------+-----------------------------------+ +| zfs_per_txg_dirty_frees_percent | Notes | ++=================================+===================================+ +| Tags | `delete <#delete>`__ | ++---------------------------------+-----------------------------------+ +| When to change | For ``zfs receive`` workloads, | +| | consider increasing or disabling. | +| | See section `ZFS I/O | +| | S | +| | cheduler `__ | ++---------------------------------+-----------------------------------+ +| Data Type | ulong | ++---------------------------------+-----------------------------------+ +| Units | percent | ++---------------------------------+-----------------------------------+ +| Range | 0 to 100 | ++---------------------------------+-----------------------------------+ +| Default | 30 | ++---------------------------------+-----------------------------------+ +| Change | Dynamic | ++---------------------------------+-----------------------------------+ +| Versions Affected | v0.7.0 and later | ++---------------------------------+-----------------------------------+ + +zfs_prefetch_disable +~~~~~~~~~~~~~~~~~~~~ + +``zfs_prefetch_disable`` controls the predictive prefetcher. + +Note that it leaves "prescient" prefetch (eg prefetch for ``zfs send``) +intact (see `zfs_pd_bytes_max <#zfs_pd_bytes_max>`__) + ++----------------------+----------------------------------------------+ +| zfs_prefetch_disable | Notes | ++======================+==============================================+ +| Tags | `prefetch <#prefetch>`__ | ++----------------------+----------------------------------------------+ +| When to change | In some case where the workload is | +| | completely random reads, overall performance | +| | can be better if prefetch is disabled | ++----------------------+----------------------------------------------+ +| Data Type | boolean | ++----------------------+----------------------------------------------+ +| Range | 0=prefetch enabled, 1=prefetch disabled | ++----------------------+----------------------------------------------+ +| Default | 0 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Verification | prefetch efficacy is observed by | +| | ``arcstat``, ``arc_summary``, and the | +| | relevant entries in | +| | ``/proc/spl/kstat/zfs/arcstats`` | ++----------------------+----------------------------------------------+ +| Versions Affected | all | ++----------------------+----------------------------------------------+ + +zfs_read_chunk_size +~~~~~~~~~~~~~~~~~~~ + +``zfs_read_chunk_size`` is the limit for ZFS filesystem reads. 
If an +application issues a ``read()`` larger than ``zfs_read_chunk_size``, +then the ``read()`` is divided into multiple operations no larger than +``zfs_read_chunk_size`` + +=================== ============================ +zfs_read_chunk_size Notes +=================== ============================ +Tags `filesystem <#filesystem>`__ +When to change TBD +Data Type ulong +Units bytes +Range 512 to ULONG_MAX +Default 1,048,576 +Change Dynamic +Versions Affected all +=================== ============================ + +zfs_read_history +~~~~~~~~~~~~~~~~ + +Historical statistics for the last ``zfs_read_history`` reads are +available in ``/proc/spl/kstat/zfs/POOL_NAME/reads`` + +================= ================================= +zfs_read_history Notes +================= ================================= +Tags `debug <#debug>`__ +When to change To observe read operation details +Data Type int +Units lines +Range 0 to INT_MAX +Default 0 +Change Dynamic +Versions Affected all +================= ================================= + +zfs_read_history_hits +~~~~~~~~~~~~~~~~~~~~~ + +When `zfs_read_history <#zfs_read_history>`__\ ``> 0``, +zfs_read_history_hits controls whether ARC hits are displayed in the +read history file, ``/proc/spl/kstat/zfs/POOL_NAME/reads`` + ++-----------------------+---------------------------------------------+ +| zfs_read_history_hits | Notes | ++=======================+=============================================+ +| Tags | `debug <#debug>`__ | ++-----------------------+---------------------------------------------+ +| When to change | To observe read operation details with ARC | +| | hits | ++-----------------------+---------------------------------------------+ +| Data Type | boolean | ++-----------------------+---------------------------------------------+ +| Range | 0=do not include data for ARC hits, | +| | 1=include ARC hit data | ++-----------------------+---------------------------------------------+ +| Default | 0 | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | all | ++-----------------------+---------------------------------------------+ + +zfs_recover +~~~~~~~~~~~ + +``zfs_recover`` can be set to true (1) to attempt to recover from +otherwise-fatal errors, typically caused by on-disk corruption. 
When +set, calls to ``zfs_panic_recover()`` will turn into warning messages +rather than calling ``panic()`` + ++-------------------+-------------------------------------------------+ +| zfs_recover | Notes | ++===================+=================================================+ +| Tags | `import <#import>`__ | ++-------------------+-------------------------------------------------+ +| When to change | zfs_recover should only be used as a last | +| | resort, as it typically results in leaked | +| | space, or worse | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=normal operation, 1=attempt recovery zpool | +| | import | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Verification | check output of ``dmesg`` and other logs for | +| | details | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.4 or later | ++-------------------+-------------------------------------------------+ + +zfs_resilver_min_time_ms +~~~~~~~~~~~~~~~~~~~~~~~~ + +Resilvers are processed by the sync thread in syncing context. While +resilvering, ZFS spends at least ``zfs_resilver_min_time_ms`` time +working on a resilver between txg commits. + +The `zfs_txg_timeout <#zfs_txg_timeout>`__ tunable sets a nominal +timeout value for the txg commits. By default, this timeout is 5 seconds +and the ``zfs_resilver_min_time_ms`` is 3 seconds. However, many +variables contribute to changing the actual txg times. The measured txg +interval is observed as the ``otime`` column (in nanoseconds) in the +``/proc/spl/kstat/zfs/POOL_NAME/txgs`` file. + +See also `zfs_txg_timeout <#zfs_txg_timeout>`__ and +`zfs_scan_min_time_ms <#zfs_scan_min_time_ms>`__ + ++--------------------------+------------------------------------------+ +| zfs_resilver_min_time_ms | Notes | ++==========================+==========================================+ +| Tags | `resilver <#resilver>`__ | ++--------------------------+------------------------------------------+ +| When to change | In some resilvering cases, increasing | +| | ``zfs_resilver_min_time_ms`` can result | +| | in faster completion | ++--------------------------+------------------------------------------+ +| Data Type | int | ++--------------------------+------------------------------------------+ +| Units | milliseconds | ++--------------------------+------------------------------------------+ +| Range | 1 to | +| | `zfs_txg_timeout <#zfs_txg_timeout>`__ | +| | converted to milliseconds | ++--------------------------+------------------------------------------+ +| Default | 3,000 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | all | ++--------------------------+------------------------------------------+ + +zfs_scan_min_time_ms +~~~~~~~~~~~~~~~~~~~~ + +Scrubs are processed by the sync thread in syncing context. While +scrubbing, ZFS spends at least ``zfs_scan_min_time_ms`` time working on +a scrub between txg commits. 
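+
+For example, a hedged sketch of temporarily favoring an in-progress
+scrub (the values are illustrative; 1,000 ms is the documented default,
+restored afterwards):
+
+::
+
+   # Give the scrub a larger share of each txg
+   echo 3000 > /sys/module/zfs/parameters/zfs_scan_min_time_ms
+
+   # ... after the scrub completes, restore the default
+   echo 1000 > /sys/module/zfs/parameters/zfs_scan_min_time_ms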
+ +See also `zfs_txg_timeout <#zfs_txg_timeout>`__ and +`zfs_resilver_min_time_ms <#zfs_resilver_min_time_ms>`__ + ++----------------------+----------------------------------------------+ +| zfs_scan_min_time_ms | Notes | ++======================+==============================================+ +| Tags | `scrub <#scrub>`__ | ++----------------------+----------------------------------------------+ +| When to change | In some scrub cases, increasing | +| | ``zfs_scan_min_time_ms`` can result in | +| | faster completion | ++----------------------+----------------------------------------------+ +| Data Type | int | ++----------------------+----------------------------------------------+ +| Units | milliseconds | ++----------------------+----------------------------------------------+ +| Range | 1 to `zfs_txg_timeout <#zfs_txg_timeout>`__ | +| | converted to milliseconds | ++----------------------+----------------------------------------------+ +| Default | 1,000 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | all | ++----------------------+----------------------------------------------+ + +zfs_scan_checkpoint_intval +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To preserve progress across reboots the sequential scan algorithm +periodically needs to stop metadata scanning and issue all the +verifications I/Os to disk every ``zfs_scan_checkpoint_intval`` seconds. + +========================== ============================================ +zfs_scan_checkpoint_intval Notes +========================== ============================================ +Tags `resilver <#resilver>`__, `scrub <#scrub>`__ +When to change TBD +Data Type int +Units seconds +Range 1 to INT_MAX +Default 7,200 (2 hours) +Change Dynamic +Versions Affected v0.8.0 and later +========================== ============================================ + +zfs_scan_fill_weight +~~~~~~~~~~~~~~~~~~~~ + +This tunable affects how scrub and resilver I/O segments are ordered. A +higher number indicates that we care more about how filled in a segment +is, while a lower number indicates we care more about the size of the +extent without considering the gaps within a segment. + +==================== ============================================ +zfs_scan_fill_weight Notes +==================== ============================================ +Tags `resilver <#resilver>`__, `scrub <#scrub>`__ +When to change Testing sequential scrub and resilver +Data Type int +Units scalar +Range 0 to INT_MAX +Default 3 +Change Prior to zfs module load +Versions Affected v0.8.0 and later +==================== ============================================ + +zfs_scan_issue_strategy +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_scan_issue_strategy`` controls the order of data verification +while scrubbing or resilvering. + ++-------+-------------------------------------------------------------+ +| value | description | ++=======+=============================================================+ +| 0 | fs will use strategy 1 during normal verification and | +| | strategy 2 while taking a checkpoint | ++-------+-------------------------------------------------------------+ +| 1 | data is verified as sequentially as possible, given the | +| | amount of memory reserved for scrubbing (see | +| | `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__). This | +| | can improve scrub performance if the pool's data is heavily | +| | fragmented. 
| ++-------+-------------------------------------------------------------+ +| 2 | the largest mostly-contiguous chunk of found data is | +| | verified first. By deferring scrubbing of small segments, | +| | we may later find adjacent data to coalesce and increase | +| | the segment size. | ++-------+-------------------------------------------------------------+ + +======================= ============================================ +zfs_scan_issue_strategy Notes +======================= ============================================ +Tags `resilver <#resilver>`__, `scrub <#scrub>`__ +When to change TBD +Data Type enum +Range 0 to 2 +Default 0 +Change Dynamic +Versions Affected TBD +======================= ============================================ + +zfs_scan_legacy +~~~~~~~~~~~~~~~ + +Setting ``zfs_scan_legacy = 1`` enables the legacy scan and scrub +behavior instead of the newer sequential behavior. + ++-------------------+-------------------------------------------------+ +| zfs_scan_legacy | Notes | ++===================+=================================================+ +| Tags | `resilver <#resilver>`__, `scrub <#scrub>`__ | ++-------------------+-------------------------------------------------+ +| When to change | In some cases, the new scan mode can consumer | +| | more memory as it collects and sorts I/Os; | +| | using the legacy algorithm can be more memory | +| | efficient at the expense of HDD read efficiency | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=use new method: scrubs and resilvers will | +| | gather metadata in memory before issuing | +| | sequential I/O, 1=use legacy algorithm will be | +| | used where I/O is initiated as soon as it is | +| | discovered | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic, however changing to 0 does not affect | +| | in-progress scrubs or resilvers | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.8.0 and later | ++-------------------+-------------------------------------------------+ + +zfs_scan_max_ext_gap +~~~~~~~~~~~~~~~~~~~~ + +``zfs_scan_max_ext_gap`` limits the largest gap in bytes between scrub +and resilver I/Os that will still be considered sequential for sorting +purposes. 
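+
+A short sketch for inspecting these two related knobs at runtime
+(standard module parameter paths):
+
+::
+
+   # 0 = sequential (sorted) scan, 1 = legacy on-discovery scan
+   cat /sys/module/zfs/parameters/zfs_scan_legacy
+
+   # Largest gap (bytes) still treated as sequential when sorting scan I/Os
+   cat /sys/module/zfs/parameters/zfs_scan_max_ext_gap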
+
++----------------------+----------------------------------------------+
+| zfs_scan_max_ext_gap | Notes                                        |
++======================+==============================================+
+| Tags                 | `resilver <#resilver>`__, `scrub <#scrub>`__ |
++----------------------+----------------------------------------------+
+| When to change       | TBD                                          |
++----------------------+----------------------------------------------+
+| Data Type            | ulong                                        |
++----------------------+----------------------------------------------+
+| Units                | bytes                                        |
++----------------------+----------------------------------------------+
+| Range                | 512 to ULONG_MAX                             |
++----------------------+----------------------------------------------+
+| Default              | 2,097,152 (2 MiB)                            |
++----------------------+----------------------------------------------+
+| Change               | Dynamic, however changing to 0 does not      |
+|                      | affect in-progress scrubs or resilvers       |
++----------------------+----------------------------------------------+
+| Versions Affected    | v0.8.0 and later                             |
++----------------------+----------------------------------------------+
+
+zfs_scan_mem_lim_fact
+~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_scan_mem_lim_fact`` limits the maximum fraction of RAM used for
+I/O sorting by the sequential scan algorithm. When the limit is reached,
+metadata scanning is stopped and data verification I/O is started. Data
+verification I/O continues until the memory used by the sorting
+algorithm drops below
+`zfs_scan_mem_lim_soft_fact <#zfs_scan_mem_lim_soft_fact>`__.
+
+Memory used by the sequential scan algorithm can be observed as the kmem
+sio_cache. This is visible from procfs as
+``grep sio_cache /proc/slabinfo`` and can be monitored using
+slab-monitoring tools such as ``slabtop``.
+
++-----------------------+---------------------------------------------+
+| zfs_scan_mem_lim_fact | Notes                                       |
++=======================+=============================================+
+| Tags                  | `memory <#memory>`__,                       |
+|                       | `resilver <#resilver>`__,                   |
+|                       | `scrub <#scrub>`__                          |
++-----------------------+---------------------------------------------+
+| When to change        | TBD                                         |
++-----------------------+---------------------------------------------+
+| Data Type             | int                                         |
++-----------------------+---------------------------------------------+
+| Units                 | divisor of physical RAM                     |
++-----------------------+---------------------------------------------+
+| Range                 | TBD                                         |
++-----------------------+---------------------------------------------+
+| Default               | 20 (physical RAM / 20 or 5%)                |
++-----------------------+---------------------------------------------+
+| Change                | Dynamic                                     |
++-----------------------+---------------------------------------------+
+| Versions Affected     | v0.8.0 and later                            |
++-----------------------+---------------------------------------------+
+
+zfs_scan_mem_lim_soft_fact
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_scan_mem_lim_soft_fact`` sets the fraction of the hard limit,
+`zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__, used to determine
+the RAM soft limit for I/O sorting by the sequential scan algorithm.
+After `zfs_scan_mem_lim_fact <#zfs_scan_mem_lim_fact>`__ has been +reached, metadata scanning is stopped until the RAM usage drops below +``zfs_scan_mem_lim_soft_fact`` + ++----------------------------+----------------------------------------+ +| zfs_scan_mem_lim_soft_fact | Notes | ++============================+========================================+ +| Tags | `resilver <#resilver>`__, | +| | `scrub <#scrub>`__ | ++----------------------------+----------------------------------------+ +| When to change | TBD | ++----------------------------+----------------------------------------+ +| Data Type | int | ++----------------------------+----------------------------------------+ +| Units | divisor of (physical RAM / | +| | `zfs_scan_mem | +| | _lim_fact <#zfs_scan_mem_lim_fact>`__) | ++----------------------------+----------------------------------------+ +| Range | 1 to INT_MAX | ++----------------------------+----------------------------------------+ +| Default | 20 (for default | +| | `zfs_scan_mem | +| | _lim_fact <#zfs_scan_mem_lim_fact>`__, | +| | 0.25% of physical RAM) | ++----------------------------+----------------------------------------+ +| Change | Dynamic | ++----------------------------+----------------------------------------+ +| Versions Affected | v0.8.0 and later | ++----------------------------+----------------------------------------+ + +zfs_scan_vdev_limit +~~~~~~~~~~~~~~~~~~~ + +``zfs_scan_vdev_limit`` is the maximum amount of data that can be +concurrently issued at once for scrubs and resilvers per leaf vdev. +``zfs_scan_vdev_limit`` attempts to strike a balance between keeping the +leaf vdev queues full of I/Os while not overflowing the queues causing +high latency resulting in long txg sync times. While +``zfs_scan_vdev_limit`` represents a bandwidth limit, the existing I/O +limit of `zfs_vdev_scrub_max_active <#zfs_vdev_scrub_max_active>`__ +remains in effect, too. + ++---------------------+-----------------------------------------------+ +| zfs_scan_vdev_limit | Notes | ++=====================+===============================================+ +| Tags | `resilver <#resilver>`__, `scrub <#scrub>`__, | +| | `vdev <#vdev>`__ | ++---------------------+-----------------------------------------------+ +| When to change | TBD | ++---------------------+-----------------------------------------------+ +| Data Type | ulong | ++---------------------+-----------------------------------------------+ +| Units | bytes | ++---------------------+-----------------------------------------------+ +| Range | 512 to ULONG_MAX | ++---------------------+-----------------------------------------------+ +| Default | 4,194,304 (4 MiB) | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.8.0 and later | ++---------------------+-----------------------------------------------+ + +zfs_send_corrupt_data +~~~~~~~~~~~~~~~~~~~~~ + +``zfs_send_corrupt_data`` enables ``zfs send`` to send of corrupt data +by ignoring read and checksum errors. 
The corrupted or unreadable blocks +are replaced with the value ``0x2f5baddb10c`` (ZFS bad block) + ++-----------------------+---------------------------------------------+ +| zfs_send_corrupt_data | Notes | ++=======================+=============================================+ +| Tags | `send <#send>`__ | ++-----------------------+---------------------------------------------+ +| When to change | When data corruption exists and an attempt | +| | to recover at least some data via | +| | ``zfs send`` is needed | ++-----------------------+---------------------------------------------+ +| Data Type | boolean | ++-----------------------+---------------------------------------------+ +| Range | 0=do not send corrupt data, 1=replace | +| | corrupt data with cookie | ++-----------------------+---------------------------------------------+ +| Default | 0 | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.6.0 and later | ++-----------------------+---------------------------------------------+ + +zfs_sync_pass_deferred_free +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The SPA sync process is performed in multiple passes. Once the pass +number reaches ``zfs_sync_pass_deferred_free``, frees are no long +processed and must wait for the next SPA sync. + +The ``zfs_sync_pass_deferred_free`` value is expected to be removed as a +tunable once the optimal value is determined during field testing. + +The ``zfs_sync_pass_deferred_free`` pass must be greater than 1 to +ensure that regular blocks are not deferred. + +=========================== ======================== +zfs_sync_pass_deferred_free Notes +=========================== ======================== +Tags `SPA <#spa>`__ +When to change Testing SPA sync process +Data Type int +Units SPA sync passes +Range 1 to INT_MAX +Default 2 +Change Dynamic +Versions Affected all +=========================== ======================== + +zfs_sync_pass_dont_compress +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The SPA sync process is performed in multiple passes. Once the pass +number reaches ``zfs_sync_pass_dont_compress``, data block compression +is no longer processed and must wait for the next SPA sync. + +The ``zfs_sync_pass_dont_compress`` value is expected to be removed as a +tunable once the optimal value is determined during field testing. + +=========================== ======================== +zfs_sync_pass_dont_compress Notes +=========================== ======================== +Tags `SPA <#spa>`__ +When to change Testing SPA sync process +Data Type int +Units SPA sync passes +Range 1 to INT_MAX +Default 5 +Change Dynamic +Versions Affected all +=========================== ======================== + +zfs_sync_pass_rewrite +~~~~~~~~~~~~~~~~~~~~~ + +The SPA sync process is performed in multiple passes. Once the pass +number reaches ``zfs_sync_pass_rewrite``, blocks can be split into gang +blocks. + +The ``zfs_sync_pass_rewrite`` value is expected to be removed as a +tunable once the optimal value is determined during field testing. 
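+
+A minimal sketch for inspecting the three sync-pass thresholds together
+(standard module parameter paths):
+
+::
+
+   for p in zfs_sync_pass_deferred_free \
+            zfs_sync_pass_dont_compress \
+            zfs_sync_pass_rewrite; do
+       printf '%s = %s\n' "$p" "$(cat /sys/module/zfs/parameters/$p)"
+   done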
+ +===================== ======================== +zfs_sync_pass_rewrite Notes +===================== ======================== +Tags `SPA <#spa>`__ +When to change Testing SPA sync process +Data Type int +Units SPA sync passes +Range 1 to INT_MAX +Default 2 +Change Dynamic +Versions Affected all +===================== ======================== + +zfs_sync_taskq_batch_pct +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_sync_taskq_batch_pct`` controls the number of threads used by the +DSL pool sync taskq, ``dp_sync_taskq`` + ++--------------------------+------------------------------------------+ +| zfs_sync_taskq_batch_pct | Notes | ++==========================+==========================================+ +| Tags | `SPA <#spa>`__ | ++--------------------------+------------------------------------------+ +| When to change | to adjust the number of | +| | ``dp_sync_taskq`` threads | ++--------------------------+------------------------------------------+ +| Data Type | int | ++--------------------------+------------------------------------------+ +| Units | percent of number of online CPUs | ++--------------------------+------------------------------------------+ +| Range | 1 to 100 | ++--------------------------+------------------------------------------+ +| Default | 75 | ++--------------------------+------------------------------------------+ +| Change | Prior to zfs module load | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------------+------------------------------------------+ + +zfs_txg_history +~~~~~~~~~~~~~~~ + +Historical statistics for the last ``zfs_txg_history`` txg commits are +available in ``/proc/spl/kstat/zfs/POOL_NAME/txgs`` + +The work required to measure the txg commit (SPA statistics) is low. +However, for debugging purposes, it can be useful to observe the SPA +statistics. + +================= ====================================================== +zfs_txg_history Notes +================= ====================================================== +Tags `debug <#debug>`__ +When to change To observe details of SPA sync behavior. +Data Type int +Units lines +Range 0 to INT_MAX +Default 0 for version v0.6.0 to v0.7.6, 100 for version v0.8.0 +Change Dynamic +Versions Affected all +================= ====================================================== + +zfs_txg_timeout +~~~~~~~~~~~~~~~ + +The open txg is committed to the pool periodically (SPA sync) and +``zfs_txg_timeout`` represents the default target upper limit. + +txg commits can occur more frequently and a rapid rate of txg commits +often indicates a busy write workload, quota limits reached, or the free +space is critically low. + +Many variables contribute to changing the actual txg times. txg commits +can also take longer than ``zfs_txg_timeout`` if the ZFS write throttle +is not properly tuned or the time to sync is otherwise delayed (eg slow +device). Shorter txg commit intervals can occur due to +`zfs_dirty_data_sync <#zfs_dirty_data_sync>`__ for write-intensive +workloads. The measured txg interval is observed as the ``otime`` column +(in nanoseconds) in the ``/proc/spl/kstat/zfs/POOL_NAME/txgs`` file. 
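+
+For example, a hedged sketch of observing the actual txg interval (the
+pool name ``tank`` is a placeholder; ``otime`` is reported in
+nanoseconds, as noted above, and requires ``zfs_txg_history > 0``):
+
+::
+
+   # Print recent txg otime values in seconds
+   awk '/otime/ { for (i = 1; i <= NF; i++) if ($i == "otime") col = i; next }
+        col     { printf "txg %s: %.2f s\n", $1, $col / 1e9 }' \
+       /proc/spl/kstat/zfs/tank/txgs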
+ +See also `zfs_dirty_data_sync <#zfs_dirty_data_sync>`__ and +`zfs_txg_history <#zfs_txg_history>`__ + ++-------------------+-------------------------------------------------+ +| zfs_txg_timeout | Notes | ++===================+=================================================+ +| Tags | `SPA <#spa>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------+-------------------------------------------------+ +| When to change | To optimize the work done by txg commit | +| | relative to the pool requirements. See also | +| | section `ZFS I/O | +| | Scheduler `__ | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | seconds | ++-------------------+-------------------------------------------------+ +| Range | 1 to INT_MAX | ++-------------------+-------------------------------------------------+ +| Default | 5 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | all | ++-------------------+-------------------------------------------------+ + +zfs_vdev_aggregation_limit +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To reduce IOPs, small, adjacent I/Os can be aggregated (coalesced) into +a large I/O. For reads, aggregations occur across small adjacency gaps. +For writes, aggregation can occur at the ZFS or disk level. +``zfs_vdev_aggregation_limit`` is the upper bound on the size of the +larger, aggregated I/O. + +Setting ``zfs_vdev_aggregation_limit = 0`` effectively disables +aggregation by ZFS. However, the block device scheduler can still merge +(aggregate) I/Os. Also, many devices, such as modern HDDs, contain +schedulers that can aggregate I/Os. + +In general, I/O aggregation can improve performance for devices, such as +HDDs, where ordering I/O operations for contiguous LBAs is a benefit. +For random access devices, such as SSDs, aggregation might not improve +performance relative to the CPU cycles needed to aggregate. 
For devices +that represent themselves as having no rotation, the +`zfs_vdev_aggregation_limit_non_rotating <#zfs_vdev_aggregation_limit_non_rotating>`__ +parameter is used instead of ``zfs_vdev_aggregation_limit`` + ++----------------------------+----------------------------------------+ +| zfs_vdev_aggregation_limit | Notes | ++============================+========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++----------------------------+----------------------------------------+ +| When to change | If the workload does not benefit from | +| | aggregation, the | +| | ``zfs_vdev_aggregation_limit`` can be | +| | reduced to avoid aggregation attempts | ++----------------------------+----------------------------------------+ +| Data Type | int | ++----------------------------+----------------------------------------+ +| Units | bytes | ++----------------------------+----------------------------------------+ +| Range | 0 to 1,048,576 (default) or 16,777,216 | +| | (if ``zpool`` ``large_blocks`` feature | +| | is enabled) | ++----------------------------+----------------------------------------+ +| Default | 1,048,576, or 131,072 for `__, | +| | `vdev_cache <#vdev_cache>`__ | ++---------------------+-----------------------------------------------+ +| When to change | Do not change | ++---------------------+-----------------------------------------------+ +| Data Type | int | ++---------------------+-----------------------------------------------+ +| Units | bytes | ++---------------------+-----------------------------------------------+ +| Range | 0 to MAX_INT | ++---------------------+-----------------------------------------------+ +| Default | 0 (vdev cache is disabled) | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Verification | vdev cache statistics are availabe in the | +| | ``/proc/spl/kstat/zfs/vdev_cache_stats`` file | ++---------------------+-----------------------------------------------+ +| Versions Affected | all | ++---------------------+-----------------------------------------------+ + +zfs_vdev_cache_bshift +~~~~~~~~~~~~~~~~~~~~~ + +Note: with the current ZFS code, the vdev cache is not helpful and in +some cases actually harmful. Thus it is disabled by setting the +`zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ to zero. This related +tunable is, by default, inoperative. + +All read I/Os smaller than `zfs_vdev_cache_max <#zfs_vdev_cache_max>`__ +are turned into (``1 << zfs_vdev_cache_bshift``) byte reads by the vdev +cache. At most `zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ bytes will +be kept in each vdev's cache. + +===================== ============================================== +zfs_vdev_cache_bshift Notes +===================== ============================================== +Tags `vdev <#vdev>`__, `vdev_cache <#vdev_cache>`__ +When to change Do not change +Data Type int +Units shift +Range 1 to INT_MAX +Default 16 (65,536 bytes) +Change Dynamic +Versions Affected all +===================== ============================================== + +zfs_vdev_cache_max +~~~~~~~~~~~~~~~~~~ + +Note: with the current ZFS code, the vdev cache is not helpful and in +some cases actually harmful. Thus it is disabled by setting the +`zfs_vdev_cache_size <#zfs_vdev_cache_size>`__ to zero. This related +tunable is, by default, inoperative. 
+ +All read I/Os smaller than zfs_vdev_cache_max will be turned into +(``1 <<``\ `zfs_vdev_cache_bshift <#zfs_vdev_cache_bshift>`__ byte reads +by the vdev cache. At most ``zfs_vdev_cache_size`` bytes will be kept in +each vdev's cache. + +================== ============================================== +zfs_vdev_cache_max Notes +================== ============================================== +Tags `vdev <#vdev>`__, `vdev_cache <#vdev_cache>`__ +When to change Do not change +Data Type int +Units bytes +Range 512 to INT_MAX +Default 16,384 (16 KiB) +Change Dynamic +Versions Affected all +================== ============================================== + +zfs_vdev_mirror_rotating_inc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The mirror read algorithm uses current load and an incremental weighting +value to determine the vdev to service a read operation. Lower values +determine the preferred vdev. The weighting value is +``zfs_vdev_mirror_rotating_inc`` for rotating media and +`zfs_vdev_mirror_non_rotating_inc <#zfs_vdev_mirror_non_rotating_inc>`__ +for nonrotating media. + +Verify the rotational setting described by a block device in sysfs by +observing ``/sys/block/DISK_NAME/queue/rotational`` + ++------------------------------+--------------------------------------+ +| zfs_vdev_mirror_rotating_inc | Notes | ++==============================+======================================+ +| Tags | `vdev <#vdev>`__, | +| | `mirror <#mirror>`__, `HDD <#hdd>`__ | ++------------------------------+--------------------------------------+ +| When to change | Increasing for mirrors with both | +| | rotating and nonrotating media more | +| | strongly favors the nonrotating | +| | media | ++------------------------------+--------------------------------------+ +| Data Type | int | ++------------------------------+--------------------------------------+ +| Units | scalar | ++------------------------------+--------------------------------------+ +| Range | 0 to MAX_INT | ++------------------------------+--------------------------------------+ +| Default | 0 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.7.0 and later | ++------------------------------+--------------------------------------+ + +zfs_vdev_mirror_non_rotating_inc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The mirror read algorithm uses current load and an incremental weighting +value to determine the vdev to service a read operation. Lower values +determine the preferred vdev. The weighting value is +`zfs_vdev_mirror_rotating_inc <#zfs_vdev_mirror_rotating_inc>`__ for +rotating media and ``zfs_vdev_mirror_non_rotating_inc`` for nonrotating +media. 
+ +Verify the rotational setting described by a block device in sysfs by +observing ``/sys/block/DISK_NAME/queue/rotational`` + ++----------------------------------+----------------------------------+ +| zfs_vdev_mirror_non_rotating_inc | Notes | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `mirror <#mirror>`__, | +| | `SSD <#ssd>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | scalar | ++----------------------------------+----------------------------------+ +| Range | 0 to INT_MAX | ++----------------------------------+----------------------------------+ +| Default | 0 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_mirror_rotating_seek_inc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For rotating media in a mirror, if the next I/O offset is within +`zfs_vdev_mirror_rotating_seek_offset <#zfs_vdev_mirror_rotating_seek_offset>`__ +then the weighting factor is incremented by +(``zfs_vdev_mirror_rotating_seek_inc / 2``). Otherwise the weighting +factor is increased by ``zfs_vdev_mirror_rotating_seek_inc``. This +algorithm prefers rotating media with lower seek distance. + +Verify the rotational setting described by a block device in sysfs by +observing ``/sys/block/DISK_NAME/queue/rotational`` + ++----------------------------------+----------------------------------+ +| z | Notes | +| fs_vdev_mirror_rotating_seek_inc | | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `mirror <#mirror>`__, | +| | `HDD <#hdd>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | scalar | ++----------------------------------+----------------------------------+ +| Range | 0 to INT_MAX | ++----------------------------------+----------------------------------+ +| Default | 5 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_mirror_rotating_seek_offset +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For rotating media in a mirror, if the next I/O offset is within +``zfs_vdev_mirror_rotating_seek_offset`` then the weighting factor is +incremented by +(`zfs_vdev_mirror_rotating_seek_inc <#zfs_vdev_mirror_rotating_seek_inc>`__\ ``/ 2``). +Otherwise the weighting factor is increased by +``zfs_vdev_mirror_rotating_seek_inc``. This algorithm prefers rotating +media with lower seek distance. 
+ +Verify the rotational setting described by a block device in sysfs by +observing ``/sys/block/DISK_NAME/queue/rotational`` + ++----------------------------------+----------------------------------+ +| zfs_ | Notes | +| vdev_mirror_rotating_seek_offset | | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `mirror <#mirror>`__, | +| | `HDD <#hdd>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | bytes | ++----------------------------------+----------------------------------+ +| Range | 0 to INT_MAX | ++----------------------------------+----------------------------------+ +| Default | 1,048,576 (1 MiB) | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_mirror_non_rotating_seek_inc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +For nonrotating media in a mirror, a seek penalty is applied as +sequential I/O's can be aggregated into fewer operations, avoiding +unnecessary per-command overhead, often boosting performance. + +Verify the rotational setting described by a block device in SysFS by +observing ``/sys/block/DISK_NAME/queue/rotational`` + ++----------------------------------+----------------------------------+ +| zfs_v | Notes | +| dev_mirror_non_rotating_seek_inc | | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `mirror <#mirror>`__, | +| | `SSD <#ssd>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | scalar | ++----------------------------------+----------------------------------+ +| Range | 0 to INT_MAX | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------------+----------------------------------+ + +zfs_vdev_read_gap_limit +~~~~~~~~~~~~~~~~~~~~~~~ + +To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into +into a large I/O. 
For reads, aggregations occur across small adjacency +gaps where the gap is less than ``zfs_vdev_read_gap_limit`` + ++-------------------------+-------------------------------------------+ +| zfs_vdev_read_gap_limit | Notes | ++=========================+===========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------------+-------------------------------------------+ +| When to change | TBD | ++-------------------------+-------------------------------------------+ +| Data Type | int | ++-------------------------+-------------------------------------------+ +| Units | bytes | ++-------------------------+-------------------------------------------+ +| Range | 0 to INT_MAX | ++-------------------------+-------------------------------------------+ +| Default | 32,768 (32 KiB) | ++-------------------------+-------------------------------------------+ +| Change | Dynamic | ++-------------------------+-------------------------------------------+ +| Versions Affected | all | ++-------------------------+-------------------------------------------+ + +zfs_vdev_write_gap_limit +~~~~~~~~~~~~~~~~~~~~~~~~ + +To reduce IOPs, small, adjacent I/Os are aggregated (coalesced) into +into a large I/O. For writes, aggregations occur across small adjacency +gaps where the gap is less than ``zfs_vdev_write_gap_limit`` + ++--------------------------+------------------------------------------+ +| zfs_vdev_write_gap_limit | Notes | ++==========================+==========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------+------------------------------------------+ +| When to change | TBD | ++--------------------------+------------------------------------------+ +| Data Type | int | ++--------------------------+------------------------------------------+ +| Units | bytes | ++--------------------------+------------------------------------------+ +| Range | 0 to INT_MAX | ++--------------------------+------------------------------------------+ +| Default | 4,096 (4 KiB) | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | all | ++--------------------------+------------------------------------------+ + +zfs_vdev_scheduler +~~~~~~~~~~~~~~~~~~ + +When the pool is imported, for whole disk vdevs, the block device I/O +scheduler is set to ``zfs_vdev_scheduler``. The most common schedulers +are: *noop*, *cfq*, *bfq*, and *deadline*. + +In some cases, the scheduler is not changeable using this method. Known +schedulers that cannot be changed are: *scsi_mq* and *none*. In these +cases, the scheduler is unchanged and an error message can be reported +to logs. 
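+
+For example, a hedged sketch of checking and changing the scheduler for
+one whole-disk vdev member (``sda`` is a placeholder device name):
+
+::
+
+   # Scheduler ZFS requests for whole-disk vdevs at import
+   cat /sys/module/zfs/parameters/zfs_vdev_scheduler
+
+   # Scheduler actually in effect for a member disk; the active one is in []
+   cat /sys/block/sda/queue/scheduler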
+ ++--------------------+------------------------------------------------+ +| zfs_vdev_scheduler | Notes | ++====================+================================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------+------------------------------------------------+ +| When to change | since ZFS has its own I/O scheduler, using a | +| | simple scheduler can result in more consistent | +| | performance | ++--------------------+------------------------------------------------+ +| Data Type | string | ++--------------------+------------------------------------------------+ +| Range | expected: *noop*, *cfq*, *bfq*, and *deadline* | ++--------------------+------------------------------------------------+ +| Default | *noop* | ++--------------------+------------------------------------------------+ +| Change | Dynamic, but takes effect upon pool creation | +| | or import | ++--------------------+------------------------------------------------+ +| Versions Affected | all | ++--------------------+------------------------------------------------+ + +zfs_vdev_raidz_impl +~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_raidz_impl`` overrides the raidz parity algorithm. By +default, the algorithm is selected at zfs module load time by the +results of a microbenchmark of algorithms based on the current hardware. + +Once the module is loaded, the content of +``/sys/module/zfs/parameters/zfs_vdev_raidz_impl`` shows available +options with the currently selected enclosed in ``[]``. Details of the +results of the microbenchmark are observable in the +``/proc/spl/kstat/zfs/vdev_raidz_bench`` file. + ++----------------+----------------------+-------------------------+ +| algorithm | architecture | description | ++================+======================+=========================+ +| fastest | all | fastest implementation | +| | | selected by | +| | | microbenchmark | ++----------------+----------------------+-------------------------+ +| original | all | original raidz | +| | | implementation | ++----------------+----------------------+-------------------------+ +| scalar | all | scalar raidz | +| | | implementation | ++----------------+----------------------+-------------------------+ +| sse2 | 64-bit x86 | uses SSE2 instruction | +| | | set | ++----------------+----------------------+-------------------------+ +| ssse3 | 64-bit x86 | uses SSSE3 instruction | +| | | set | ++----------------+----------------------+-------------------------+ +| avx2 | 64-bit x86 | uses AVX2 instruction | +| | | set | ++----------------+----------------------+-------------------------+ +| avx512f | 64-bit x86 | uses AVX512F | +| | | instruction set | ++----------------+----------------------+-------------------------+ +| avx512bw | 64-bit x86 | uses AVX512F & AVX512BW | +| | | instruction sets | ++----------------+----------------------+-------------------------+ +| aarch64_neon | aarch64/64 bit ARMv8 | uses NEON | ++----------------+----------------------+-------------------------+ +| aarch64_neonx2 | aarch64/64 bit ARMv8 | uses NEON with more | +| | | unrolling | ++----------------+----------------------+-------------------------+ + +=================== ==================================================== +zfs_vdev_raidz_impl Notes +=================== ==================================================== +Tags `CPU <#cpu>`__, `raidz <#raidz>`__, `vdev <#vdev>`__ +When to change testing raidz algorithms +Data Type string +Range see table above +Default *fastest* +Change Dynamic 
+Versions Affected v0.7.0 and later +=================== ==================================================== + +zfs_zevent_cols +~~~~~~~~~~~~~~~ + +``zfs_zevent_cols`` is a soft wrap limit in columns (characters) for ZFS +events logged to the console. + +================= ========================== +zfs_zevent_cols Notes +================= ========================== +Tags `debug <#debug>`__ +When to change if 80 columns isn't enough +Data Type int +Units characters +Range 1 to INT_MAX +Default 80 +Change Dynamic +Versions Affected all +================= ========================== + +zfs_zevent_console +~~~~~~~~~~~~~~~~~~ + +If ``zfs_zevent_console`` is true (1), then ZFS events are logged to the +console. + +More logging and log filtering capabilities are provided by ``zed`` + +================== ========================================= +zfs_zevent_console Notes +================== ========================================= +Tags `debug <#debug>`__ +When to change to log ZFS events to the console +Data Type boolean +Range 0=do not log to console, 1=log to console +Default 0 +Change Dynamic +Versions Affected all +================== ========================================= + +zfs_zevent_len_max +~~~~~~~~~~~~~~~~~~ + +``zfs_zevent_len_max`` is the maximum ZFS event queue length. A value of +0 results in a calculated value (16 \* number of CPUs) with a minimum of +64. Events in the queue can be viewed with the ``zpool events`` command. + +================== ================================ +zfs_zevent_len_max Notes +================== ================================ +Tags `debug <#debug>`__ +When to change increase to see more ZFS events +Data Type int +Units events +Range 0 to INT_MAX +Default 0 (calculate as described above) +Change Dynamic +Versions Affected all +================== ================================ + +zfs_zil_clean_taskq_maxalloc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +During a SPA sync, intent log transaction groups (itxg) are cleaned. The +cleaning work is dispatched to the DSL pool ZIL clean taskq +(``dp_zil_clean_taskq``). +`zfs_zil_clean_taskq_minalloc <#zfs_zil_clean_taskq_minalloc>`__ is the +minumum and ``zfs_zil_clean_taskq_maxalloc`` is the maximum number of +cached taskq entries for ``dp_zil_clean_taskq``. The actual number of +taskq entries dynamically varies between these values. + +When ``zfs_zil_clean_taskq_maxalloc`` is exceeded transaction records +(itxs) are cleaned synchronously with possible negative impact to the +performance of SPA sync. + +Ideally taskq entries are pre-allocated prior to being needed by +``zil_clean()``, thus avoiding dynamic allocation of new taskq entries. 
+ ++------------------------------+--------------------------------------+ +| zfs_zil_clean_taskq_maxalloc | Notes | ++==============================+======================================+ +| Tags | `ZIL <#zil>`__ | ++------------------------------+--------------------------------------+ +| When to change | If more ``dp_zil_clean_taskq`` | +| | entries are needed to prevent the | +| | itxs from being synchronously | +| | cleaned | ++------------------------------+--------------------------------------+ +| Data Type | int | ++------------------------------+--------------------------------------+ +| Units | ``dp_zil_clean_taskq`` taskq entries | ++------------------------------+--------------------------------------+ +| Range | `zfs_zil_clean_taskq_minallo | +| | c <#zfs_zil_clean_taskq_minalloc>`__ | +| | to ``INT_MAX`` | ++------------------------------+--------------------------------------+ +| Default | 1,048,576 | ++------------------------------+--------------------------------------+ +| Change | Dynamic, takes effect per-pool when | +| | the pool is imported | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.8.0 | ++------------------------------+--------------------------------------+ + +zfs_zil_clean_taskq_minalloc +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +During a SPA sync, intent log transaction groups (itxg) are cleaned. The +cleaning work is dispatched to the DSL pool ZIL clean taskq +(``dp_zil_clean_taskq``). ``zfs_zil_clean_taskq_minalloc`` is the +minumum and +`zfs_zil_clean_taskq_maxalloc <#zfs_zil_clean_taskq_maxalloc>`__ is the +maximum number of cached taskq entries for ``dp_zil_clean_taskq``. The +actual number of taskq entries dynamically varies between these values. + +``zfs_zil_clean_taskq_minalloc`` is the minimum number of ZIL +transaction records (itxs). + +Ideally taskq entries are pre-allocated prior to being needed by +``zil_clean()``, thus avoiding dynamic allocation of new taskq entries. + ++------------------------------+--------------------------------------+ +| zfs_zil_clean_taskq_minalloc | Notes | ++==============================+======================================+ +| Tags | `ZIL <#zil>`__ | ++------------------------------+--------------------------------------+ +| When to change | TBD | ++------------------------------+--------------------------------------+ +| Data Type | int | ++------------------------------+--------------------------------------+ +| Units | dp_zil_clean_taskq taskq entries | ++------------------------------+--------------------------------------+ +| Range | 1 to | +| | `zfs_zil_clean_taskq_maxallo | +| | c <#zfs_zil_clean_taskq_maxalloc>`__ | ++------------------------------+--------------------------------------+ +| Default | 1,024 | ++------------------------------+--------------------------------------+ +| Change | Dynamic, takes effect per-pool when | +| | the pool is imported | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.8.0 | ++------------------------------+--------------------------------------+ + +zfs_zil_clean_taskq_nthr_pct +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_zil_clean_taskq_nthr_pct`` controls the number of threads used by +the DSL pool ZIL clean taskq (``dp_zil_clean_taskq``). The default value +of 100% will create a maximum of one thread per cpu. 
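+
+As a sketch, a persistent setting can be applied with a module option
+so that it is in place before pools are imported (the value 50 here is
+purely illustrative)::
+
+   # /etc/modprobe.d/zfs.conf
+   options zfs zfs_zil_clean_taskq_nthr_pct=50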
+ ++------------------------------+--------------------------------------+ +| zfs_zil_clean_taskq_nthr_pct | Notes | ++==============================+======================================+ +| Tags | `taskq <#taskq>`__, `ZIL <#zil>`__ | ++------------------------------+--------------------------------------+ +| When to change | Testing ZIL clean and SPA sync | +| | performance | ++------------------------------+--------------------------------------+ +| Data Type | int | ++------------------------------+--------------------------------------+ +| Units | percent of number of CPUs | ++------------------------------+--------------------------------------+ +| Range | 1 to 100 | ++------------------------------+--------------------------------------+ +| Default | 100 | ++------------------------------+--------------------------------------+ +| Change | Dynamic, takes effect per-pool when | +| | the pool is imported | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.8.0 | ++------------------------------+--------------------------------------+ + +zil_replay_disable +~~~~~~~~~~~~~~~~~~ + +If ``zil_replay_disable = 1``, then when a volume or filesystem is +brought online, no attempt to replay the ZIL is made and any existing +ZIL is destroyed. This can result in loss of data without notice. + +================== ================================== +zil_replay_disable Notes +================== ================================== +Tags `debug <#debug>`__, `ZIL <#zil>`__ +When to change Do not change +Data Type boolean +Range 0=replay ZIL, 1=destroy ZIL +Default 0 +Change Dynamic +Versions Affected v0.6.5 +================== ================================== + +zil_slog_bulk +~~~~~~~~~~~~~ + +``zil_slog_bulk`` is the log device write size limit per commit executed +with synchronous priority. Writes below ``zil_slog_bulk`` are executed +with synchronous priority. Writes above ``zil_slog_bulk`` are executed +with lower (asynchronous) priority to reduct potential log device abuse +by a single active ZIL writer. + ++-------------------+-------------------------------------------------+ +| zil_slog_bulk | Notes | ++===================+=================================================+ +| Tags | `ZIL <#zil>`__ | ++-------------------+-------------------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++-------------------+-------------------------------------------------+ +| Data Type | ulong | ++-------------------+-------------------------------------------------+ +| Units | bytes | ++-------------------+-------------------------------------------------+ +| Range | 0 to ULONG_MAX | ++-------------------+-------------------------------------------------+ +| Default | 786,432 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.8.0 | ++-------------------+-------------------------------------------------+ + +zio_delay_max +~~~~~~~~~~~~~ + +If a ZFS I/O operation takes more than ``zio_delay_max`` milliseconds to +complete, then an event is logged. Note that this is only a logging +facility, not a timeout on operations. 
See also ``zpool events`` + +================= ======================= +zio_delay_max Notes +================= ======================= +Tags `debug <#debug>`__ +When to change when debugging slow I/O +Data Type int +Units milliseconds +Range 1 to INT_MAX +Default 30,000 (30 seconds) +Change Dynamic +Versions Affected all +================= ======================= + +zio_dva_throttle_enabled +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zio_dva_throttle_enabled`` controls throttling of block allocations in +the ZFS I/O (ZIO) pipeline. When enabled, the maximum number of pending +allocations per top-level vdev is limited by +`zfs_vdev_queue_depth_pct <#zfs_vdev_queue_depth_pct>`__ + ++--------------------------+------------------------------------------+ +| zio_dva_throttle_enabled | Notes | ++==========================+==========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------+------------------------------------------+ +| When to change | Testing ZIO block allocation algorithms | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=do not throttle ZIO block allocations, | +| | 1=throttle ZIO block allocations | ++--------------------------+------------------------------------------+ +| Default | 1 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++--------------------------+------------------------------------------+ + +zio_requeue_io_start_cut_in_line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zio_requeue_io_start_cut_in_line`` controls prioritization of a +re-queued ZFS I/O (ZIO) in the ZIO pipeline by the ZIO taskq. + ++----------------------------------+----------------------------------+ +| zio_requeue_io_start_cut_in_line | Notes | ++==================================+==================================+ +| Tags | `Z | +| | IO_scheduler <#zio_scheduler>`__ | ++----------------------------------+----------------------------------+ +| When to change | Do not change | ++----------------------------------+----------------------------------+ +| Data Type | boolean | ++----------------------------------+----------------------------------+ +| Range | 0=don't prioritize re-queued | +| | I/Os, 1=prioritize re-queued | +| | I/Os | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | all | ++----------------------------------+----------------------------------+ + +zio_taskq_batch_pct +~~~~~~~~~~~~~~~~~~~ + +``zio_taskq_batch_pct`` sets the number of I/O worker threads as a +percentage of online CPUs. These workers threads are responsible for IO +work such as compression and checksum calculations. + +Each block is handled by one worker thread, so maximum overall worker +thread throughput is function of the number of concurrent blocks being +processed, the number of worker threads, and the algorithms used. The +default value of 75% is chosen to avoid using all CPUs which can result +in latency issues and inconsistent application performance, especially +when high compression is enabled. 
+ +The taskq batch processes are: + ++-------------+--------------+---------------------------------------+ +| taskq | process name | Notes | ++=============+==============+=======================================+ +| Write issue | z_wr_iss[_#] | Can be CPU intensive, runs at lower | +| | | priority than other taskqs | ++-------------+--------------+---------------------------------------+ + +Other taskqs exist, but most have fixed numbers of instances and +therefore require recompiling the kernel module to adjust. + ++---------------------+-----------------------------------------------+ +| zio_taskq_batch_pct | Notes | ++=====================+===============================================+ +| Tags | `taskq <#taskq>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++---------------------+-----------------------------------------------+ +| When to change | To tune parallelism in multiprocessor systems | ++---------------------+-----------------------------------------------+ +| Data Type | int | ++---------------------+-----------------------------------------------+ +| Units | percent of number of CPUs | ++---------------------+-----------------------------------------------+ +| Range | 1 to 100, fractional number of CPUs are | +| | rounded down | ++---------------------+-----------------------------------------------+ +| Default | 75 | ++---------------------+-----------------------------------------------+ +| Change | Prior to zfs module load | ++---------------------+-----------------------------------------------+ +| Verification | The number of taskqs for each batch group can | +| | be observed using ``ps`` and counting the | +| | threads | ++---------------------+-----------------------------------------------+ +| Versions Affected | TBD | ++---------------------+-----------------------------------------------+ + +zvol_inhibit_dev +~~~~~~~~~~~~~~~~ + +``zvol_inhibit_dev`` controls the creation of volume device nodes upon +pool import. + ++-------------------+-------------------------------------------------+ +| zvol_inhibit_dev | Notes | ++===================+=================================================+ +| Tags | `import <#import>`__, `volume <#volume>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Inhibiting can slightly improve startup time on | +| | systems with a very large number of volumes | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=create volume device nodes, 1=do not create | +| | volume device nodes | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic, takes effect per-pool when the pool is | +| | imported | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.0 and later | ++-------------------+-------------------------------------------------+ + +zvol_major +~~~~~~~~~~ + +``zvol_major`` is the default major number for volume devices. 
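+
+For example, the major number appears in a device listing (``zd0`` is a
+hypothetical volume instance; the default major is 230)::
+
+   ls -l /dev/zd0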
+
++-------------------+-------------------------------------------------+
+| zvol_major        | Notes                                           |
++===================+=================================================+
+| Tags              | `volume <#volume>`__                            |
++-------------------+-------------------------------------------------+
+| When to change    | Do not change                                   |
++-------------------+-------------------------------------------------+
+| Data Type         | uint                                            |
++-------------------+-------------------------------------------------+
+| Default           | 230                                             |
++-------------------+-------------------------------------------------+
+| Change            | Dynamic, takes effect per-pool when the pool is |
+|                   | imported or volumes are created                 |
++-------------------+-------------------------------------------------+
+| Versions Affected | all                                             |
++-------------------+-------------------------------------------------+
+
+zvol_max_discard_blocks
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Discard (aka ATA TRIM or SCSI UNMAP) operations on volumes are done in
+batches of ``zvol_max_discard_blocks`` blocks. The block size is
+determined by the ``volblocksize`` property of the volume.
+
+Some applications, such as ``mkfs``, discard the whole volume at once
+using the maximum possible discard size. As a result, many gigabytes of
+discard requests are not uncommon. Unfortunately, if a large amount of
+data is already allocated in the volume, ZFS can be quite slow to
+process discard requests. This is especially true if the
+``volblocksize`` is small (e.g. the default 8 KiB). Consequently, very
+large discard requests can take a very long time (perhaps minutes under
+heavy load) to complete. This can cause a number of problems, most
+notably if the volume is accessed remotely (e.g. via iSCSI), in which
+case the client has a high probability of timing out on the request.
+
+Lowering ``zvol_max_discard_blocks`` reduces the size of individual
+discard requests because the value is used to set the
+``discard_max_bytes`` and ``discard_max_hw_bytes`` limits for the
+volume's block device in sysfs. These limits are readable by volume
+device consumers.
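+
+For example, with the default ``zvol_max_discard_blocks`` of 16,384 and
+the default 8 KiB ``volblocksize``, the advertised limit is
+16,384 * 8,192 = 134,217,728 bytes (128 MiB). For a hypothetical volume
+instance ``zd0`` it can be observed with::
+
+   cat /sys/block/zd0/queue/discard_max_bytes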
+
++-------------------------+--------------------------------------------------------+
+| zvol_max_discard_blocks | Notes                                                  |
++=========================+========================================================+
+| Tags                    | `discard <#discard>`__, `volume <#volume>`__           |
++-------------------------+--------------------------------------------------------+
+| When to change          | if volume discard activity severely impacts other      |
+|                         | workloads                                              |
++-------------------------+--------------------------------------------------------+
+| Data Type               | ulong                                                  |
++-------------------------+--------------------------------------------------------+
+| Units                   | number of blocks of size volblocksize                  |
++-------------------------+--------------------------------------------------------+
+| Range                   | 0 to ULONG_MAX                                         |
++-------------------------+--------------------------------------------------------+
+| Default                 | 16,384                                                 |
++-------------------------+--------------------------------------------------------+
+| Change                  | Dynamic, takes effect per-pool when the pool is        |
+|                         | imported or volumes are created                        |
++-------------------------+--------------------------------------------------------+
+| Verification            | Observe the value of                                   |
+|                         | ``/sys/block/VOLUME_INSTANCE/queue/discard_max_bytes`` |
++-------------------------+--------------------------------------------------------+
+| Versions Affected       | v0.6.0 and later                                       |
++-------------------------+--------------------------------------------------------+
+
+zvol_prefetch_bytes
+~~~~~~~~~~~~~~~~~~~
+
+When importing a pool with volumes or adding a volume to a pool,
+``zvol_prefetch_bytes`` are prefetched from the start and end of the
+volume. Prefetching these regions of the volume is desirable because
+they are likely to be accessed immediately by ``blkid(8)`` or by the
+kernel scanning for a partition table.
+
+=================== ==============================================
+zvol_prefetch_bytes Notes
+=================== ==============================================
+Tags                `prefetch <#prefetch>`__, `volume <#volume>`__
+When to change      TBD
+Data Type           uint
+Units               bytes
+Range               0 to UINT_MAX
+Default             131,072
+Change              Dynamic
+Versions Affected   v0.6.5 and later
+=================== ==============================================
+
+zvol_request_sync
+~~~~~~~~~~~~~~~~~
+
+When ``zvol_request_sync`` is set to 1, I/O requests for a volume are
+submitted synchronously. This effectively limits the queue depth to 1
+for each I/O submitter. When set to 0, requests are handled
+asynchronously by the "zvol" thread pool.
+ +See also `zvol_threads <#zvol_threads>`__ + ++-------------------+-------------------------------------------------+ +| zvol_request_sync | Notes | ++===================+=================================================+ +| Tags | `volume <#volume>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Testing concurrent volume requests | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=do concurrent (async) volume requests, 1=do | +| | sync volume requests | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.7.2 and later | ++-------------------+-------------------------------------------------+ + +zvol_threads +~~~~~~~~~~~~ + +zvol_threads controls the maximum number of threads handling concurrent +volume I/O requests. + +The default of 32 threads behaves similarly to a disk with a 32-entry +command queue. The actual number of threads required can vary widely by +workload and available CPUs. If lock analysis shows high contention in +the zvol taskq threads, then reducing the number of zvol_threads or +workload queue depth can improve overall throughput. + +See also `zvol_request_sync <#zvol_request_sync>`__ + ++-------------------+-------------------------------------------------+ +| zvol_threads | Notes | ++===================+=================================================+ +| Tags | `volume <#volume>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Matching the number of concurrent volume | +| | requests with workload requirements can improve | +| | concurrency | ++-------------------+-------------------------------------------------+ +| Data Type | uint | ++-------------------+-------------------------------------------------+ +| Units | threads | ++-------------------+-------------------------------------------------+ +| Range | 1 to UINT_MAX | ++-------------------+-------------------------------------------------+ +| Default | 32 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic, takes effect per-volume when the pool | +| | is imported or volumes are created | ++-------------------+-------------------------------------------------+ +| Verification | ``iostat`` using ``avgqu-sz`` or ``aqu-sz`` | +| | results | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++-------------------+-------------------------------------------------+ + +zvol_volmode +~~~~~~~~~~~~ + +``zvol_volmode`` defines volume block devices behaviour when the +``volmode`` property is set to ``default`` + +Note: to maintain compatibility with ZFS on BSD, "geom" is synonymous +with "full" + +===== ======= =========================================== +value volmode Description +===== ======= =========================================== +1 full legacy fully functional behaviour (default) +2 dev hide partitions on volume block devices +3 none not exposing volumes outside ZFS +===== ======= =========================================== + +================= ==================== +zvol_volmode Notes +================= ==================== +Tags `volume 
<#volume>`__ +When to change TBD +Data Type enum +Range 1, 2, or 3 +Default 1 +Change Dynamic +Versions Affected v0.7.0 and later +================= ==================== + +zfs_qat_disable +~~~~~~~~~~~~~~~ + +``zfs_qat_disable`` controls the Intel QuickAssist Technology (QAT) +driver providing hardware acceleration for gzip compression. When the +QAT hardware is present and qat driver available, the default behaviour +is to enable QAT. + ++-------------------+-------------------------------------------------+ +| zfs_qat_disable | Notes | ++===================+=================================================+ +| Tags | `compression <#compression>`__, `QAT <#qat>`__ | ++-------------------+-------------------------------------------------+ +| When to change | Testing QAT functionality | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=use QAT acceleration if available, 1=do not | +| | use QAT acceleration | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.7, renamed to | +| | `zfs_qat_ | +| | compress_disable <#zfs_qat_compress_disable>`__ | +| | in v0.8 | ++-------------------+-------------------------------------------------+ + +zfs_qat_checksum_disable +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_qat_checksum_disable`` controls the Intel QuickAssist Technology +(QAT) driver providing hardware acceleration for checksums. When the QAT +hardware is present and qat driver available, the default behaviour is +to enable QAT. + ++--------------------------+------------------------------------------+ +| zfs_qat_checksum_disable | Notes | ++==========================+==========================================+ +| Tags | `checksum <#checksum>`__, `QAT <#qat>`__ | ++--------------------------+------------------------------------------+ +| When to change | Testing QAT functionality | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=use QAT acceleration if available, | +| | 1=do not use QAT acceleration | ++--------------------------+------------------------------------------+ +| Default | 0 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.8.0 | ++--------------------------+------------------------------------------+ + +zfs_qat_compress_disable +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_qat_compress_disable`` controls the Intel QuickAssist Technology +(QAT) driver providing hardware acceleration for gzip compression. When +the QAT hardware is present and qat driver available, the default +behaviour is to enable QAT. 
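+
+For example, QAT-accelerated compression can be switched off at runtime
+while testing::
+
+   echo 1 > /sys/module/zfs/parameters/zfs_qat_compress_disable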
+ ++--------------------------+------------------------------------------+ +| zfs_qat_compress_disable | Notes | ++==========================+==========================================+ +| Tags | `compression <#compression>`__, | +| | `QAT <#qat>`__ | ++--------------------------+------------------------------------------+ +| When to change | Testing QAT functionality | ++--------------------------+------------------------------------------+ +| Data Type | boolean | ++--------------------------+------------------------------------------+ +| Range | 0=use QAT acceleration if available, | +| | 1=do not use QAT acceleration | ++--------------------------+------------------------------------------+ +| Default | 0 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | v0.8.0 | ++--------------------------+------------------------------------------+ + +zfs_qat_encrypt_disable +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_qat_encrypt_disable`` controls the Intel QuickAssist Technology +(QAT) driver providing hardware acceleration for encryption. When the +QAT hardware is present and qat driver available, the default behaviour +is to enable QAT. + ++-------------------------+-------------------------------------------+ +| zfs_qat_encrypt_disable | Notes | ++=========================+===========================================+ +| Tags | `encryption <#encryption>`__, | +| | `QAT <#qat>`__ | ++-------------------------+-------------------------------------------+ +| When to change | Testing QAT functionality | ++-------------------------+-------------------------------------------+ +| Data Type | boolean | ++-------------------------+-------------------------------------------+ +| Range | 0=use QAT acceleration if available, 1=do | +| | not use QAT acceleration | ++-------------------------+-------------------------------------------+ +| Default | 0 | ++-------------------------+-------------------------------------------+ +| Change | Dynamic | ++-------------------------+-------------------------------------------+ +| Versions Affected | v0.8.0 | ++-------------------------+-------------------------------------------+ + +dbuf_cache_hiwater_pct +~~~~~~~~~~~~~~~~~~~~~~ + +The ``dbuf_cache_hiwater_pct`` and +`dbuf_cache_lowater_pct <#dbuf_cache_lowater_pct>`__ define the +operating range for dbuf cache evict thread. The hiwater and lowater are +percentages of the `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +value. When the dbuf cache grows above ((100% + +``dbuf_cache_hiwater_pct``) \* +`dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__) then the dbuf cache +thread begins evicting. When the dbug cache falls below ((100% - +`dbuf_cache_lowater_pct <#dbuf_cache_lowater_pct>`__) \* +`dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__) then the dbuf cache +thread stops evicting. + +====================== ============================= +dbuf_cache_hiwater_pct Notes +====================== ============================= +Tags `dbuf_cache <#dbuf_cache>`__ +When to change Testing dbuf cache algorithms +Data Type uint +Units percent +Range 0 to UINT_MAX +Default 10 +Change Dynamic +Versions Affected v0.7.0 and later +====================== ============================= + +dbuf_cache_lowater_pct +~~~~~~~~~~~~~~~~~~~~~~ + +The dbuf_cache_hiwater_pct and dbuf_cache_lowater_pct define the +operating range for dbuf cache evict thread. 
The hiwater and lowater are +percentages of the `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ +value. When the dbuf cache grows above ((100% + +`dbuf_cache_hiwater_pct <#dbuf_cache_hiwater_pct>`__) \* +`dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__) then the dbuf cache +thread begins evicting. When the dbug cache falls below ((100% - +``dbuf_cache_lowater_pct``) \* +`dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__) then the dbuf cache +thread stops evicting. + +====================== ============================= +dbuf_cache_lowater_pct Notes +====================== ============================= +Tags `dbuf_cache <#dbuf_cache>`__ +When to change Testing dbuf cache algorithms +Data Type uint +Units percent +Range 0 to UINT_MAX +Default 10 +Change Dynamic +Versions Affected v0.7.0 and later +====================== ============================= + +dbuf_cache_max_bytes +~~~~~~~~~~~~~~~~~~~~ + +The dbuf cache maintains a list of dbufs that are not currently held but +have been recently released. These dbufs are not eligible for ARC +eviction until they are aged out of the dbuf cache. Dbufs are added to +the dbuf cache once the last hold is released. If a dbuf is later +accessed and still exists in the dbuf cache, then it will be removed +from the cache and later re-added to the head of the cache. Dbufs that +are aged out of the cache will be immediately destroyed and become +eligible for ARC eviction. + +The size of the dbuf cache is set by ``dbuf_cache_max_bytes``. The +actual size is dynamically adjusted to the minimum of current ARC target +size (``c``) >> `dbuf_cache_max_shift <#dbuf_cache_max_shift>`__ and the +default ``dbuf_cache_max_bytes`` + +==================== ============================= +dbuf_cache_max_bytes Notes +==================== ============================= +Tags `dbuf_cache <#dbuf_cache>`__ +When to change Testing dbuf cache algorithms +Data Type ulong +Units bytes +Range 16,777,216 to ULONG_MAX +Default 104,857,600 (100 MiB) +Change Dynamic +Versions Affected v0.7.0 and later +==================== ============================= + +dbuf_cache_max_shift +~~~~~~~~~~~~~~~~~~~~ + +The `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ minimum is the +lesser of `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ and the +current ARC target size (``c``) >> ``dbuf_cache_max_shift`` + +==================== ============================= +dbuf_cache_max_shift Notes +==================== ============================= +Tags `dbuf_cache <#dbuf_cache>`__ +When to change Testing dbuf cache algorithms +Data Type int +Units shift +Range 1 to 63 +Default 5 +Change Dynamic +Versions Affected v0.7.0 and later +==================== ============================= + +dmu_object_alloc_chunk_shift +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each of the concurrent object allocators grabs +``2^dmu_object_alloc_chunk_shift`` dnode slots at a time. The default is +to grab 128 slots, or 4 blocks worth. This default value was +experimentally determined to be the lowest value that eliminates the +measurable effect of lock contention in the DMU object allocation code +path. 
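+
+For example, raising the shift to its maximum allocates 2^9 = 512 dnode
+slots per grab::
+
+   cat /sys/module/zfs/parameters/dmu_object_alloc_chunk_shift
+   echo 9 > /sys/module/zfs/parameters/dmu_object_alloc_chunk_shift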
+ ++------------------------------+--------------------------------------+ +| dmu_object_alloc_chunk_shift | Notes | ++==============================+======================================+ +| Tags | `allocation <#allocation>`__, | +| | `DMU <#dmu>`__ | ++------------------------------+--------------------------------------+ +| When to change | If the workload creates many files | +| | concurrently on a system with many | +| | CPUs, then increasing | +| | ``dmu_object_alloc_chunk_shift`` can | +| | decrease lock contention | ++------------------------------+--------------------------------------+ +| Data Type | int | ++------------------------------+--------------------------------------+ +| Units | shift | ++------------------------------+--------------------------------------+ +| Range | 7 to 9 | ++------------------------------+--------------------------------------+ +| Default | 7 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | v0.7.0 and later | ++------------------------------+--------------------------------------+ + +send_holes_without_birth_time +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Alias for `ignore_hole_birth <#ignore_hole_birth>`__ + +zfs_abd_scatter_enabled +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_abd_scatter_enabled`` controls the ARC Buffer Data (ABD) +scatter/gather feature. + +When disabled, the legacy behaviour is selected using linear buffers. +For linear buffers, all the data in the ABD is stored in one contiguous +buffer in memory (from a ``zio_[data_]buf_*`` kmem cache). + +When enabled (default), the data in the ABD is split into equal-sized +chunks (from the ``abd_chunk_cache`` kmem_cache), with pointers to the +chunks recorded in an array at the end of the ABD structure. This allows +more efficient memory allocation for buffers, especially when large +recordsizes are used. + ++-------------------------+-------------------------------------------+ +| zfs_abd_scatter_enabled | Notes | ++=========================+===========================================+ +| Tags | `ABD <#abd>`__, `memory <#memory>`__ | ++-------------------------+-------------------------------------------+ +| When to change | Testing ABD | ++-------------------------+-------------------------------------------+ +| Data Type | boolean | ++-------------------------+-------------------------------------------+ +| Range | 0=use linear allocation only, 1=allow | +| | scatter/gather | ++-------------------------+-------------------------------------------+ +| Default | 1 | ++-------------------------+-------------------------------------------+ +| Change | Dynamic | ++-------------------------+-------------------------------------------+ +| Verification | ABD statistics are observable in | +| | ``/proc/spl/kstat/zfs/abdstats``. Slab | +| | allocations are observable in | +| | ``/proc/slabinfo`` | ++-------------------------+-------------------------------------------+ +| Versions Affected | v0.7.0 and later | ++-------------------------+-------------------------------------------+ + +zfs_abd_scatter_max_order +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_abd_scatter_max_order`` sets the maximum order for physical page +allocation when ABD is enabled (see +`zfs_abd_scatter_enabled <#zfs_abd_scatter_enabled>`__) + +See also Buddy Memory Allocation in the Linux kernel documentation. 
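+
+For example, the effect of the limit can be observed in the scatter
+counters of the ABD statistics::
+
+   grep scatter /proc/spl/kstat/zfs/abdstats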
+ ++---------------------------+-----------------------------------------+ +| zfs_abd_scatter_max_order | Notes | ++===========================+=========================================+ +| Tags | `ABD <#abd>`__, `memory <#memory>`__ | ++---------------------------+-----------------------------------------+ +| When to change | Testing ABD features | ++---------------------------+-----------------------------------------+ +| Data Type | int | ++---------------------------+-----------------------------------------+ +| Units | orders | ++---------------------------+-----------------------------------------+ +| Range | 1 to 10 (upper limit is | +| | hardware-dependent) | ++---------------------------+-----------------------------------------+ +| Default | 10 | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Verification | ABD statistics are observable in | +| | ``/proc/spl/kstat/zfs/abdstats`` | ++---------------------------+-----------------------------------------+ +| Versions Affected | v0.7.0 and later | ++---------------------------+-----------------------------------------+ + +zfs_compressed_arc_enabled +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When compression is enabled for a dataset, later reads of the data can +store the blocks in ARC in their on-disk, compressed state. This can +increse the effective size of the ARC, as counted in blocks, and thus +improve the ARC hit ratio. + ++----------------------------+----------------------------------------+ +| zfs_compressed_arc_enabled | Notes | ++============================+========================================+ +| Tags | `ABD <#abd>`__, | +| | `compression <#compression>`__ | ++----------------------------+----------------------------------------+ +| When to change | Testing ARC compression feature | ++----------------------------+----------------------------------------+ +| Data Type | boolean | ++----------------------------+----------------------------------------+ +| Range | 0=compressed ARC disabled (legacy | +| | behaviour), 1=compress ARC data | ++----------------------------+----------------------------------------+ +| Default | 1 | ++----------------------------+----------------------------------------+ +| Change | Dynamic | ++----------------------------+----------------------------------------+ +| Verification | raw ARC statistics are observable in | +| | ``/proc/spl/kstat/zfs/arcstats`` and | +| | ARC hit ratios can be observed using | +| | ``arcstat`` | ++----------------------------+----------------------------------------+ +| Versions Affected | v0.7.0 and later | ++----------------------------+----------------------------------------+ + +zfs_key_max_salt_uses +~~~~~~~~~~~~~~~~~~~~~ + +For encrypted datasets, the salt is regenerated every +``zfs_key_max_salt_uses`` blocks. This automatic regeneration reduces +the probability of collisions due to the Birthday problem. When set to +the default (400,000,000) the probability of collision is approximately +1 in 1 trillion. 
+ +===================== ============================ +zfs_key_max_salt_uses Notes +===================== ============================ +Tags `encryption <#encryption>`__ +When to change Testing encryption features +Data Type ulong +Units blocks encrypted +Range 1 to ULONG_MAX +Default 400,000,000 +Change Dynamic +Versions Affected v0.8.0 and later +===================== ============================ + +zfs_object_mutex_size +~~~~~~~~~~~~~~~~~~~~~ + +``zfs_object_mutex_size`` facilitates resizing the the per-dataset znode +mutex array for testing deadlocks therein. + +===================== =================================== +zfs_object_mutex_size Notes +===================== =================================== +Tags `debug <#debug>`__ +When to change Testing znode mutex array deadlocks +Data Type uint +Units orders +Range 1 to UINT_MAX +Default 64 +Change Dynamic +Versions Affected v0.7.0 and later +===================== =================================== + +zfs_scan_strict_mem_lim +~~~~~~~~~~~~~~~~~~~~~~~ + +When scrubbing or resilvering, by default, ZFS checks to ensure it is +not over the hard memory limit before each txg commit. If finer-grained +control of this is needed ``zfs_scan_strict_mem_lim`` can be set to 1 to +enable checking before scanning each block. + ++-------------------------+-------------------------------------------+ +| zfs_scan_strict_mem_lim | Notes | ++=========================+===========================================+ +| Tags | `memory <#memory>`__, | +| | `resilver <#resilver>`__, | +| | `scrub <#scrub>`__ | ++-------------------------+-------------------------------------------+ +| When to change | Do not change | ++-------------------------+-------------------------------------------+ +| Data Type | boolean | ++-------------------------+-------------------------------------------+ +| Range | 0=normal scan behaviour, 1=check hard | +| | memory limit strictly during scan | ++-------------------------+-------------------------------------------+ +| Default | 0 | ++-------------------------+-------------------------------------------+ +| Change | Dynamic | ++-------------------------+-------------------------------------------+ +| Versions Affected | v0.8.0 | ++-------------------------+-------------------------------------------+ + +zfs_send_queue_length +~~~~~~~~~~~~~~~~~~~~~ + +``zfs_send_queue_length`` is the maximum number of bytes allowed in the +zfs send queue. 
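+
+As a sketch, when sending datasets that use the 16 MiB maximum
+recordsize, the queue can be raised to twice that size::
+
+   echo 33554432 > /sys/module/zfs/parameters/zfs_send_queue_length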
+ ++-----------------------+---------------------------------------------+ +| zfs_send_queue_length | Notes | ++=======================+=============================================+ +| Tags | `send <#send>`__ | ++-----------------------+---------------------------------------------+ +| When to change | When using the largest recordsize or | +| | volblocksize (16 MiB), increasing can | +| | improve send efficiency | ++-----------------------+---------------------------------------------+ +| Data Type | int | ++-----------------------+---------------------------------------------+ +| Units | bytes | ++-----------------------+---------------------------------------------+ +| Range | Must be at least twice the maximum | +| | recordsize or volblocksize in use | ++-----------------------+---------------------------------------------+ +| Default | 16,777,216 bytes (16 MiB) | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.8.1 | ++-----------------------+---------------------------------------------+ + +zfs_recv_queue_length +~~~~~~~~~~~~~~~~~~~~~ + +``zfs_recv_queue_length`` is the maximum number of bytes allowed in the +zfs receive queue. + ++-----------------------+---------------------------------------------+ +| zfs_recv_queue_length | Notes | ++=======================+=============================================+ +| Tags | `receive <#receive>`__ | ++-----------------------+---------------------------------------------+ +| When to change | When using the largest recordsize or | +| | volblocksize (16 MiB), increasing can | +| | improve receive efficiency | ++-----------------------+---------------------------------------------+ +| Data Type | int | ++-----------------------+---------------------------------------------+ +| Units | bytes | ++-----------------------+---------------------------------------------+ +| Range | Must be at least twice the maximum | +| | recordsize or volblocksize in use | ++-----------------------+---------------------------------------------+ +| Default | 16,777,216 bytes (16 MiB) | ++-----------------------+---------------------------------------------+ +| Change | Dynamic | ++-----------------------+---------------------------------------------+ +| Versions Affected | v0.8.1 | ++-----------------------+---------------------------------------------+ + +zfs_arc_min_prefetch_lifespan +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``arc_min_prefetch_lifespan`` is the minimum time for a prefetched block +to remain in ARC before it is eligible for eviction. + +============================= ====================================== +zfs_arc_min_prefetch_lifespan Notes +============================= ====================================== +Tags `ARC <#ARC>`__ +When to change TBD +Data Type int +Units clock ticks +Range 0 = use default value +Default 1 second (as expressed in clock ticks) +Change Dynamic +Versions Affected v0.7.0 +============================= ====================================== + +zfs_scan_ignore_errors +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_scan_ignore_errors`` allows errors discovered during scrub or +resilver to be ignored. This can be tuned as a workaround to remove the +dirty time list (DTL) when completing a pool scan. It is intended to be +used during pool repair or recovery to prevent resilvering when the pool +is imported. 
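+
+A minimal recovery sketch, assuming a placeholder pool name ``tank``;
+the parameter is set before the import::
+
+   echo 1 > /sys/module/zfs/parameters/zfs_scan_ignore_errors
+   zpool import tank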
+
++------------------------+--------------------------------------------+
+| zfs_scan_ignore_errors | Notes                                      |
++========================+============================================+
+| Tags                   | `resilver <#resilver>`__                   |
++------------------------+--------------------------------------------+
+| When to change         | See description above                      |
++------------------------+--------------------------------------------+
+| Data Type              | boolean                                    |
++------------------------+--------------------------------------------+
+| Range                  | 0 = do not ignore errors, 1 = ignore       |
+|                        | errors during pool scrub or resilver       |
++------------------------+--------------------------------------------+
+| Default                | 0                                          |
++------------------------+--------------------------------------------+
+| Change                 | Dynamic                                    |
++------------------------+--------------------------------------------+
+| Versions Affected      | v0.8.1                                     |
++------------------------+--------------------------------------------+
+
+zfs_top_maxinflight
+~~~~~~~~~~~~~~~~~~~
+
+``zfs_top_maxinflight`` is used to limit the maximum number of I/Os
+queued to top-level vdevs during scrub or resilver operations. The
+actual top-level vdev limit is calculated by multiplying the number of
+child vdevs by ``zfs_top_maxinflight``. This limit is an additional cap
+over and above the scan limits.
+
++---------------------+-----------------------------------------------+
+| zfs_top_maxinflight | Notes                                         |
++=====================+===============================================+
+| Tags                | `resilver <#resilver>`__, `scrub <#scrub>`__, |
+|                     | `ZIO_scheduler <#zio_scheduler>`__            |
++---------------------+-----------------------------------------------+
+| When to change      | for modern ZFS versions, the ZIO scheduler    |
+|                     | limits usually take precedence                |
++---------------------+-----------------------------------------------+
+| Data Type           | int                                           |
++---------------------+-----------------------------------------------+
+| Units               | I/O operations                                |
++---------------------+-----------------------------------------------+
+| Range               | 1 to MAX_INT                                  |
++---------------------+-----------------------------------------------+
+| Default             | 32                                            |
++---------------------+-----------------------------------------------+
+| Change              | Dynamic                                       |
++---------------------+-----------------------------------------------+
+| Versions Affected   | v0.6.0                                        |
++---------------------+-----------------------------------------------+
+
+zfs_resilver_delay
+~~~~~~~~~~~~~~~~~~
+
+``zfs_resilver_delay`` sets a time-based delay for resilver I/Os. This
+delay is in addition to the ZIO scheduler's treatment of scrub
+workloads. 
See also `zfs_scan_idle <#zfs_scan_idle>`__ + ++--------------------+------------------------------------------------+ +| zfs_resilver_delay | Notes | ++====================+================================================+ +| Tags | `resilver <#resilver>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------+------------------------------------------------+ +| When to change | increasing can reduce impact of resilver | +| | workload on dynamic workloads | ++--------------------+------------------------------------------------+ +| Data Type | int | ++--------------------+------------------------------------------------+ +| Units | clock ticks | ++--------------------+------------------------------------------------+ +| Range | 0 to MAX_INT | ++--------------------+------------------------------------------------+ +| Default | 2 | ++--------------------+------------------------------------------------+ +| Change | Dynamic | ++--------------------+------------------------------------------------+ +| Versions Affected | v0.6.0 | ++--------------------+------------------------------------------------+ + +zfs_scrub_delay +~~~~~~~~~~~~~~~ + +``zfs_scrub_delay`` sets a time-based delay for scrub I/Os. This delay +is in addition to the ZIO scheduler's treatment of scrub workloads. See +also `zfs_scan_idle <#zfs_scan_idle>`__ + ++-------------------+-------------------------------------------------+ +| zfs_scrub_delay | Notes | ++===================+=================================================+ +| Tags | `scrub <#scrub>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------+-------------------------------------------------+ +| When to change | increasing can reduce impact of scrub workload | +| | on dynamic workloads | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | clock ticks | ++-------------------+-------------------------------------------------+ +| Range | 0 to MAX_INT | ++-------------------+-------------------------------------------------+ +| Default | 4 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.0 | ++-------------------+-------------------------------------------------+ + +zfs_scan_idle +~~~~~~~~~~~~~ + +When a non-scan I/O has occurred in the past ``zfs_scan_idle`` clock +ticks, then `zfs_resilver_delay <#zfs_resilver_delay>`__ or +`zfs_scrub_delay <#zfs_scrub_delay>`__ are enabled. 
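+
+For example, the three related settings can be read together::
+
+   grep . /sys/module/zfs/parameters/zfs_scan_idle \
+          /sys/module/zfs/parameters/zfs_resilver_delay \
+          /sys/module/zfs/parameters/zfs_scrub_delay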
+ ++-------------------+-------------------------------------------------+ +| zfs_scan_idle | Notes | ++===================+=================================================+ +| Tags | `resilver <#resilver>`__, `scrub <#scrub>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-------------------+-------------------------------------------------+ +| When to change | as part of a resilver/scrub tuning effort | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | clock ticks | ++-------------------+-------------------------------------------------+ +| Range | 0 to MAX_INT | ++-------------------+-------------------------------------------------+ +| Default | 50 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.0 | ++-------------------+-------------------------------------------------+ + +icp_aes_impl +~~~~~~~~~~~~ + +By default, ZFS will choose the highest performance, hardware-optimized +implementation of the AES encryption algorithm. The ``icp_aes_impl`` +tunable overrides this automatic choice. + +Note: ``icp_aes_impl`` is set in the ``icp`` kernel module, not the +``zfs`` kernel module. + +To observe the available options +``cat /sys/module/icp/parameters/icp_aes_impl`` The default option is +shown in brackets '[]' + +================= ==================================== +icp_aes_impl Notes +================= ==================================== +Tags `encryption <#encryption>`__ +Kernel module icp +When to change debugging ZFS encryption on hardware +Data Type string +Range varies by hardware +Default automatic, depends on the hardware +Change dynamic +Versions Affected planned for v2 +================= ==================================== + +icp_gcm_impl +~~~~~~~~~~~~ + +By default, ZFS will choose the highest performance, hardware-optimized +implementation of the GCM encryption algorithm. The ``icp_gcm_impl`` +tunable overrides this automatic choice. + +Note: ``icp_gcm_impl`` is set in the ``icp`` kernel module, not the +``zfs`` kernel module. + +To observe the available options +``cat /sys/module/icp/parameters/icp_gcm_impl`` The default option is +shown in brackets '[]' + +================= ==================================== +icp_gcm_impl Notes +================= ==================================== +Tags `encryption <#encryption>`__ +Kernel module icp +When to change debugging ZFS encryption on hardware +Data Type string +Range varies by hardware +Default automatic, depends on the hardware +Change Dynamic +Versions Affected planned for v2 +================= ==================================== + +zfs_abd_scatter_min_size +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_abd_scatter_min_size`` changes the ARC buffer data (ABD) +allocator's threshold for using linear or page-based scatter buffers. +Allocations smaller than ``zfs_abd_scatter_min_size`` use linear ABDs. + +Scatter ABD's use at least one page each, so sub-page allocations waste +some space when allocated as scatter allocations. For example, 2KB +scatter allocation wastes half of each page. Using linear ABD's for +small allocations results in slabs containing many allocations. 
This can
+improve memory efficiency, at the expense of more work for ARC evictions
+attempting to free pages, because all the buffers on one slab need to be
+freed in order to free the slab and its underlying pages.
+
+Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's
+possible for them to actually waste more memory than scatter
+allocations:
+
+- one page per buf = wasting 3/4 or 7/8
+- one buf per slab = wasting 15/16
+
+Spill blocks are typically 512B and are heavily used on systems running
+*selinux* with the default dnode size and the ``xattr=sa`` property set.
+
+By default, linear ABDs are used for 512B and 1KB allocations, and
+scatter ABDs are used for larger (>= 1.5KB) allocation requests.
+
++--------------------------+------------------------------------------+
+| zfs_abd_scatter_min_size | Notes |
++==========================+==========================================+
+| Tags | `ARC <#ARC>`__ |
++--------------------------+------------------------------------------+
+| When to change | debugging memory allocation, especially |
+| | for large pages |
++--------------------------+------------------------------------------+
+| Data Type | int |
++--------------------------+------------------------------------------+
+| Units | bytes |
++--------------------------+------------------------------------------+
+| Range | 0 to MAX_INT |
++--------------------------+------------------------------------------+
+| Default | 1536 (512B and 1KB allocations will be |
+| | linear) |
++--------------------------+------------------------------------------+
+| Change | Dynamic |
++--------------------------+------------------------------------------+
+| Versions Affected | planned for v2 |
++--------------------------+------------------------------------------+
+
+zfs_unlink_suspend_progress
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_unlink_suspend_progress`` changes the policy for removing pending
+unlinks. When enabled, files will not be asynchronously removed from the
+list of pending unlinks and the space they consume will be leaked. Once
+this option has been disabled and the dataset is remounted, the pending
+unlinks will be processed and the freed space returned to the pool.
+
++-----------------------------+---------------------------------------+
+| zfs_unlink_suspend_progress | Notes |
++=============================+=======================================+
+| Tags | |
++-----------------------------+---------------------------------------+
+| When to change | used by the ZFS test suite (ZTS) to |
+| | facilitate testing |
++-----------------------------+---------------------------------------+
+| Data Type | boolean |
++-----------------------------+---------------------------------------+
+| Range | 0 = use async unlink removal, 1 = do |
+| | not async unlink thus leaking space |
++-----------------------------+---------------------------------------+
+| Default | 0 |
++-----------------------------+---------------------------------------+
+| Change | prior to dataset mount |
++-----------------------------+---------------------------------------+
+| Versions Affected | planned for v2 |
++-----------------------------+---------------------------------------+
+
+spa_load_verify_shift
+~~~~~~~~~~~~~~~~~~~~~
+
+``spa_load_verify_shift`` sets the fraction of ARC that can be used by
+inflight I/Os when verifying the pool during import. This value is a
+"shift" representing the fraction of ARC target size
+(``grep -w c /proc/spl/kstat/zfs/arcstats``).
The ARC target size is
+shifted to the right. Thus a value of '2' results in the fraction = 1/4,
+while a value of '4' results in the fraction = 1/8.
+
+For large memory machines, pool import can consume large amounts of ARC:
+much larger than the value of maxinflight. This can result in
+`spa_load_verify_maxinflight <#spa_load_verify_maxinflight>`__ having a
+value of 0, causing the system to hang. Setting ``spa_load_verify_shift``
+can reduce this limit and allow importing without hanging.
+
++-----------------------+---------------------------------------------+
+| spa_load_verify_shift | Notes |
++=======================+=============================================+
+| Tags | `import <#import>`__, `ARC <#ARC>`__, |
+| | `SPA <#SPA>`__ |
++-----------------------+---------------------------------------------+
+| When to change | troubleshooting pool import on large memory |
+| | machines |
++-----------------------+---------------------------------------------+
+| Data Type | int |
++-----------------------+---------------------------------------------+
+| Units | shift |
++-----------------------+---------------------------------------------+
+| Range | 1 to MAX_INT |
++-----------------------+---------------------------------------------+
+| Default | 4 |
++-----------------------+---------------------------------------------+
+| Change | prior to importing a pool |
++-----------------------+---------------------------------------------+
+| Versions Affected | planned for v2 |
++-----------------------+---------------------------------------------+
+
+spa_load_print_vdev_tree
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``spa_load_print_vdev_tree`` enables printing of the attempted pool
+import's vdev tree to the ZFS debug message log
+``/proc/spl/kstat/zfs/dbgmsg``. Both the provided vdev tree and the MOS
+vdev tree are printed, which can be useful for debugging problems with
+the zpool ``cachefile``.
+
++--------------------------+------------------------------------------+
+| spa_load_print_vdev_tree | Notes |
++==========================+==========================================+
+| Tags | `import <#import>`__, `SPA <#SPA>`__ |
++--------------------------+------------------------------------------+
+| When to change | troubleshooting pool import failures |
++--------------------------+------------------------------------------+
+| Data Type | boolean |
++--------------------------+------------------------------------------+
+| Range | 0 = do not print pool configuration in |
+| | logs, 1 = print pool configuration in |
+| | logs |
++--------------------------+------------------------------------------+
+| Default | 0 |
++--------------------------+------------------------------------------+
+| Change | prior to pool import |
++--------------------------+------------------------------------------+
+| Versions Affected | planned for v2 |
++--------------------------+------------------------------------------+
+
+zfs_max_missing_tvds
+~~~~~~~~~~~~~~~~~~~~
+
+When importing a pool in readonly mode
+(``zpool import -o readonly=on ...``), up to ``zfs_max_missing_tvds``
+top-level vdevs can be missing and the import can still be attempted.
+
+Note: This is strictly intended for advanced pool recovery cases since
+missing data is almost inevitable.
Pools with missing devices can only +be imported read-only for safety reasons, and the pool's ``failmode`` +property is automatically set to ``continue`` + +The expected use case is to recover pool data immediately after +accidentally adding a non-protected vdev to a protected pool. + +- With 1 missing top-level vdev, ZFS should be able to import the pool + and mount all datasets. User data that was not modified after the + missing device has been added should be recoverable. Thus snapshots + created prior to the addition of that device should be completely + intact. + +- With 2 missing top-level vdevs, some datasets may fail to mount since + there are dataset statistics that are stored as regular metadata. + Some data might be recoverable if those vdevs were added recently. + +- With 3 or more top-level missing vdevs, the pool is severely damaged + and MOS entries may be missing entirely. Chances of data recovery are + very low. Note that there are also risks of performing an inadvertent + rewind as we might be missing all the vdevs with the latest + uberblocks. + +==================== ========================================== +zfs_max_missing_tvds Notes +==================== ========================================== +Tags `import <#import>`__ +When to change troubleshooting pools with missing devices +Data Type int +Units missing top-level vdevs +Range 0 to MAX_INT +Default 0 +Change prior to pool import +Versions Affected planned for v2 +==================== ========================================== + +dbuf_metadata_cache_shift +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``dbuf_metadata_cache_shift`` sets the size of the dbuf metadata cache +as a fraction of ARC target size. This is an alternate method for +setting dbuf metadata cache size than +`dbuf_metadata_cache_max_bytes <#dbuf_metadata_cache_max_bytes>`__. + +`dbuf_metadata_cache_max_bytes <#dbuf_metadata_cache_max_bytes>`__ +overrides ``dbuf_metadata_cache_shift`` + +This value is a "shift" representing the fraction of ARC target size +(``grep -w c /proc/spl/kstat/zfs/arcstats``). The ARC target size is +shifted to the right. Thus a value of '2' results in the fraction = 1/4, +while a value of '6' results in the fraction = 1/64. + ++---------------------------+-----------------------------------------+ +| dbuf_metadata_cache_shift | Notes | ++===========================+=========================================+ +| Tags | `ARC <#ARC>`__, | +| | `dbuf_cache <#dbuf_cache>`__ | ++---------------------------+-----------------------------------------+ +| When to change | | ++---------------------------+-----------------------------------------+ +| Data Type | int | ++---------------------------+-----------------------------------------+ +| Units | shift | ++---------------------------+-----------------------------------------+ +| Range | practical range is | +| | (` | +| | dbuf_cache_shift <#dbuf_cache_shift>`__ | +| | + 1) to MAX_INT | ++---------------------------+-----------------------------------------+ +| Default | 6 | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Versions Affected | planned for v2 | ++---------------------------+-----------------------------------------+ + +dbuf_metadata_cache_max_bytes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``dbuf_metadata_cache_max_bytes`` sets the size of the dbuf metadata +cache as a number of bytes. 
This is an alternate method for setting dbuf +metadata cache size than +`dbuf_metadata_cache_shift <#dbuf_metadata_cache_shift>`__ + +`dbuf_metadata_cache_max_bytes <#dbuf_metadata_cache_max_bytes>`__ +overrides ``dbuf_metadata_cache_shift`` + ++-------------------------------+-------------------------------------+ +| dbuf_metadata_cache_max_bytes | Notes | ++===============================+=====================================+ +| Tags | `dbuf_cache <#dbuf_cache>`__ | ++-------------------------------+-------------------------------------+ +| When to change | | ++-------------------------------+-------------------------------------+ +| Data Type | int | ++-------------------------------+-------------------------------------+ +| Units | bytes | ++-------------------------------+-------------------------------------+ +| Range | 0 = use | +| | `dbuf_metadata_cache_sh | +| | ift <#dbuf_metadata_cache_shift>`__ | +| | to ARC ``c_max`` | ++-------------------------------+-------------------------------------+ +| Default | 0 | ++-------------------------------+-------------------------------------+ +| Change | Dynamic | ++-------------------------------+-------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------------------+-------------------------------------+ + +dbuf_cache_shift +~~~~~~~~~~~~~~~~ + +``dbuf_cache_shift`` sets the size of the dbuf cache as a fraction of +ARC target size. This is an alternate method for setting dbuf cache size +than `dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__. + +`dbuf_cache_max_bytes <#dbuf_cache_max_bytes>`__ overrides +``dbuf_cache_shift`` + +This value is a "shift" representing the fraction of ARC target size +(``grep -w c /proc/spl/kstat/zfs/arcstats``). The ARC target size is +shifted to the right. Thus a value of '2' results in the fraction = 1/4, +while a value of '5' results in the fraction = 1/32. + +Performance tuning of dbuf cache can be monitored using: + +- ``dbufstat`` command +- `node_exporter `__ ZFS + module for prometheus environments +- `telegraf `__ ZFS plugin for + general-purpose metric collection +- ``/proc/spl/kstat/zfs/dbufstats`` kstat + ++-------------------+-------------------------------------------------+ +| dbuf_cache_shift | Notes | ++===================+=================================================+ +| Tags | `ARC <#ARC>`__, `dbuf_cache <#dbuf_cache>`__ | ++-------------------+-------------------------------------------------+ +| When to change | to improve performance of read-intensive | +| | channel programs | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | shift | ++-------------------+-------------------------------------------------+ +| Range | 5 to MAX_INT | ++-------------------+-------------------------------------------------+ +| Default | 5 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------+-------------------------------------------------+ + +.. _dbuf_cache_max_bytes-1: + +dbuf_cache_max_bytes +~~~~~~~~~~~~~~~~~~~~ + +``dbuf_cache_max_bytes`` sets the size of the dbuf cache in bytes. 
This +is an alternate method for setting dbuf cache size than +`dbuf_cache_shift <#dbuf_cache_shift>`__ + +Performance tuning of dbuf cache can be monitored using: + +- ``dbufstat`` command +- `node_exporter `__ ZFS + module for prometheus environments +- `telegraf `__ ZFS plugin for + general-purpose metric collection +- ``/proc/spl/kstat/zfs/dbufstats`` kstat + ++----------------------+----------------------------------------------+ +| dbuf_cache_max_bytes | Notes | ++======================+==============================================+ +| Tags | `ARC <#ARC>`__, `dbuf_cache <#dbuf_cache>`__ | ++----------------------+----------------------------------------------+ +| When to change | | ++----------------------+----------------------------------------------+ +| Data Type | int | ++----------------------+----------------------------------------------+ +| Units | bytes | ++----------------------+----------------------------------------------+ +| Range | 0 = use | +| | `dbuf_cache_shift <#dbuf_cache_shift>`__ to | +| | ARC ``c_max`` | ++----------------------+----------------------------------------------+ +| Default | 0 | ++----------------------+----------------------------------------------+ +| Change | Dynamic | ++----------------------+----------------------------------------------+ +| Versions Affected | planned for v2 | ++----------------------+----------------------------------------------+ + +metaslab_force_ganging +~~~~~~~~~~~~~~~~~~~~~~ + +When testing allocation code, ``metaslab_force_ganging`` forces blocks +above the specified size to be ganged. + +====================== ========================================== +metaslab_force_ganging Notes +====================== ========================================== +Tags `allocation <#allocation>`__ +When to change for development testing purposes only +Data Type ulong +Units bytes +Range SPA_MINBLOCKSIZE to (SPA_MAXBLOCKSIZE + 1) +Default SPA_MAXBLOCKSIZE + 1 (16,777,217 bytes) +Change Dynamic +Versions Affected planned for v2 +====================== ========================================== + +zfs_vdev_default_ms_count +~~~~~~~~~~~~~~~~~~~~~~~~~ + +When adding a top-level vdev, ``zfs_vdev_default_ms_count`` is the +target number of metaslabs. + ++---------------------------+-----------------------------------------+ +| zfs_vdev_default_ms_count | Notes | ++===========================+=========================================+ +| Tags | `allocation <#allocation>`__ | ++---------------------------+-----------------------------------------+ +| When to change | for development testing purposes only | ++---------------------------+-----------------------------------------+ +| Data Type | int | ++---------------------------+-----------------------------------------+ +| Range | 16 to MAX_INT | ++---------------------------+-----------------------------------------+ +| Default | 200 | ++---------------------------+-----------------------------------------+ +| Change | prior to creating a pool or adding a | +| | top-level vdev | ++---------------------------+-----------------------------------------+ +| Versions Affected | planned for v2 | ++---------------------------+-----------------------------------------+ + +vdev_removal_max_span +~~~~~~~~~~~~~~~~~~~~~ + +During top-level vdev removal, chunks of data are copied from the vdev +which may include free space in order to trade bandwidth for IOPS. +``vdev_removal_max_span`` sets the maximum span of free space included +as unnecessary data in a chunk of copied data. 
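+
+This parameter only takes effect while a top-level vdev removal is in
+progress. As a rough sketch (pool and device names are placeholders), a
+removal is started and observed with the standard ``zpool`` commands:
+
+::
+
+   # evacuate and remove a previously added top-level vdev
+   zpool remove tank sdc
+
+   # the evacuation progress is reported by
+   zpool status tank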
+
+===================== ================================
+vdev_removal_max_span Notes
+===================== ================================
+Tags `vdev_removal <#vdev_removal>`__
+When to change TBD
+Data Type int
+Units bytes
+Range 0 to MAX_INT
+Default 32,768 (32 KiB)
+Change Dynamic
+Versions Affected planned for v2
+===================== ================================
+
+zfs_removal_ignore_errors
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When removing a device, ``zfs_removal_ignore_errors`` controls the
+process for handling hard I/O errors. When set, if a device encounters a
+hard I/O error during the removal process, the removal will not be
+cancelled. This can result in a normally recoverable block becoming
+permanently damaged and is not recommended. This should only be used as
+a last resort when the pool cannot be returned to a healthy state prior
+to removing the device.
+
++---------------------------+-----------------------------------------+
+| zfs_removal_ignore_errors | Notes |
++===========================+=========================================+
+| Tags | `vdev_removal <#vdev_removal>`__ |
++---------------------------+-----------------------------------------+
+| When to change | See description for caveat |
++---------------------------+-----------------------------------------+
+| Data Type | boolean |
++---------------------------+-----------------------------------------+
+| Range | during device removal: 0 = hard errors |
+| | are not ignored, 1 = hard errors are |
+| | ignored |
++---------------------------+-----------------------------------------+
+| Default | 0 |
++---------------------------+-----------------------------------------+
+| Change | Dynamic |
++---------------------------+-----------------------------------------+
+| Versions Affected | planned for v2 |
++---------------------------+-----------------------------------------+
+
+zfs_removal_suspend_progress
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_removal_suspend_progress`` is used during automated testing of the
+ZFS code to increase test coverage.
+
+============================ ======================================
+zfs_removal_suspend_progress Notes
+============================ ======================================
+Tags `vdev_removal <#vdev_removal>`__
+When to change do not change
+Data Type boolean
+Range 0 = do not suspend during vdev removal
+Default 0
+Change Dynamic
+Versions Affected planned for v2
+============================ ======================================
+
+zfs_condense_indirect_commit_entry_delay_ms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+During vdev removal, the vdev indirection layer sleeps for
+``zfs_condense_indirect_commit_entry_delay_ms`` milliseconds during
+mapping generation. This parameter is used during automated testing of
+the ZFS code to improve test coverage.
+ ++----------------------------------+----------------------------------+ +| zfs_condens | Notes | +| e_indirect_commit_entry_delay_ms | | ++==================================+==================================+ +| Tags | `vdev_removal <#vdev_removal>`__ | ++----------------------------------+----------------------------------+ +| When to change | do not change | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | milliseconds | ++----------------------------------+----------------------------------+ +| Range | 0 to MAX_INT | ++----------------------------------+----------------------------------+ +| Default | 0 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_condense_indirect_vdevs_enable +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +During vdev removal, condensing process is an attempt to save memory by +removing obsolete mappings. ``zfs_condense_indirect_vdevs_enable`` +enables condensing indirect vdev mappings. When set, ZFS attempts to +condense indirect vdev mappings if the mapping uses more than +`zfs_condense_min_mapping_bytes <#zfs_condense_min_mapping_bytes>`__ +bytes of memory and if the obsolete space map object uses more than +`zfs_condense_max_obsolete_bytes <#zfs_condense_max_obsolete_bytes>`__ +bytes on disk. + ++----------------------------------+----------------------------------+ +| zf | Notes | +| s_condense_indirect_vdevs_enable | | ++==================================+==================================+ +| Tags | `vdev_removal <#vdev_removal>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | boolean | ++----------------------------------+----------------------------------+ +| Range | 0 = do not save memory, 1 = save | +| | memory by condensing obsolete | +| | mapping after vdev removal | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_condense_max_obsolete_bytes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After vdev removal, ``zfs_condense_max_obsolete_bytes`` sets the limit +for beginning the condensing process. Condensing begins if the obsolete +space map takes up more than ``zfs_condense_max_obsolete_bytes`` of +space on disk (logically). The default of 1 GiB is small enough relative +to a typical pool that the space consumed by the obsolete space map is +minimal. 
+ +See also +`zfs_condense_indirect_vdevs_enable <#zfs_condense_indirect_vdevs_enable>`__ + +=============================== ================================ +zfs_condense_max_obsolete_bytes Notes +=============================== ================================ +Tags `vdev_removal <#vdev_removal>`__ +When to change no not change +Data Type ulong +Units bytes +Range 0 to MAX_ULONG +Default 1,073,741,824 (1 GiB) +Change Dynamic +Versions Affected planned for v2 +=============================== ================================ + +zfs_condense_min_mapping_bytes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After vdev removal, ``zfs_condense_min_mapping_bytes`` is the lower +limit for determining when to condense the in-memory obsolete space map. +The condensing process will not continue unless a minimum of +``zfs_condense_min_mapping_bytes`` of memory can be freed. + +See also +`zfs_condense_indirect_vdevs_enable <#zfs_condense_indirect_vdevs_enable>`__ + +============================== ================================ +zfs_condense_min_mapping_bytes Notes +============================== ================================ +Tags `vdev_removal <#vdev_removal>`__ +When to change do not change +Data Type ulong +Units bytes +Range 0 to MAX_ULONG +Default 128 KiB +Change Dynamic +Versions Affected planned for v2 +============================== ================================ + +zfs_vdev_initializing_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_initializing_max_active`` sets the maximum initializing I/Os +active to each device. + ++----------------------------------+----------------------------------+ +| zfs_vdev_initializing_max_active | Notes | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `Z | +| | IO_scheduler <#zio_scheduler>`__ | ++----------------------------------+----------------------------------+ +| When to change | See `ZFS I/O | +| | Sch | +| | eduler `__ | ++----------------------------------+----------------------------------+ +| Data Type | uint32 | ++----------------------------------+----------------------------------+ +| Units | I/O operations | ++----------------------------------+----------------------------------+ +| Range | 1 to | +| | `zfs_vdev_max_ | +| | active <#zfs_vdev_max_active>`__ | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_vdev_initializing_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_initializing_min_active`` sets the minimum initializing I/Os +active to each device. 
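+
+These limits apply to the initializing I/O class, which is only exercised
+by ``zpool initialize``. A rough sketch (the pool name is a placeholder):
+
+::
+
+   # write the initialization pattern to all uninitialized space
+   zpool initialize tank
+
+   # suspend, then later resume, the initialization
+   zpool initialize -s tank
+   zpool initialize tank
+
+   # per-vdev initialization progress is reported by
+   zpool status tank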
+ ++----------------------------------+----------------------------------+ +| zfs_vdev_initializing_min_active | Notes | ++==================================+==================================+ +| Tags | `vdev <#vdev>`__, | +| | `Z | +| | IO_scheduler <#zio_scheduler>`__ | ++----------------------------------+----------------------------------+ +| When to change | See `ZFS I/O | +| | Sch | +| | eduler `__ | ++----------------------------------+----------------------------------+ +| Data Type | uint32 | ++----------------------------------+----------------------------------+ +| Units | I/O operations | ++----------------------------------+----------------------------------+ +| Range | 1 to | +| | `zfs_vde | +| | v_initializing_max_active <#zfs_ | +| | vdev_initializing_max_active>`__ | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_vdev_removal_max_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_removal_max_active`` sets the maximum top-level vdev removal +I/Os active to each device. + ++-----------------------------+---------------------------------------+ +| zfs_vdev_removal_max_active | Notes | ++=============================+=======================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-----------------------------+---------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++-----------------------------+---------------------------------------+ +| Data Type | uint32 | ++-----------------------------+---------------------------------------+ +| Units | I/O operations | ++-----------------------------+---------------------------------------+ +| Range | 1 to | +| | `zfs_vdev | +| | _max_active <#zfs_vdev_max_active>`__ | ++-----------------------------+---------------------------------------+ +| Default | 2 | ++-----------------------------+---------------------------------------+ +| Change | Dynamic | ++-----------------------------+---------------------------------------+ +| Versions Affected | planned for v2 | ++-----------------------------+---------------------------------------+ + +zfs_vdev_removal_min_active +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_removal_min_active`` sets the minimum top-level vdev removal +I/Os active to each device. 
+ ++-----------------------------+---------------------------------------+ +| zfs_vdev_removal_min_active | Notes | ++=============================+=======================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++-----------------------------+---------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++-----------------------------+---------------------------------------+ +| Data Type | uint32 | ++-----------------------------+---------------------------------------+ +| Units | I/O operations | ++-----------------------------+---------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_removal_max_act | +| | ive <#zfs_vdev_removal_max_active>`__ | ++-----------------------------+---------------------------------------+ +| Default | 1 | ++-----------------------------+---------------------------------------+ +| Change | Dynamic | ++-----------------------------+---------------------------------------+ +| Versions Affected | planned for v2 | ++-----------------------------+---------------------------------------+ + +zfs_vdev_trim_max_active +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_trim_max_active`` sets the maximum trim I/Os active to each +device. + ++--------------------------+------------------------------------------+ +| zfs_vdev_trim_max_active | Notes | ++==========================+==========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------+------------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------+------------------------------------------+ +| Data Type | uint32 | ++--------------------------+------------------------------------------+ +| Units | I/O operations | ++--------------------------+------------------------------------------+ +| Range | 1 to | +| | `zfs_v | +| | dev_max_active <#zfs_vdev_max_active>`__ | ++--------------------------+------------------------------------------+ +| Default | 2 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | planned for v2 | ++--------------------------+------------------------------------------+ + +zfs_vdev_trim_min_active +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_trim_min_active`` sets the minimum trim I/Os active to each +device. 
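+
+These limits apply to the trim I/O class used by manual and automatic
+trim. As a rough sketch (the pool name is a placeholder):
+
+::
+
+   # start a manual trim and check its per-vdev status
+   zpool trim tank
+   zpool status -t tank
+
+   # observe trim latency and queue depth while the trim runs
+   zpool iostat -w tank 5
+   zpool iostat -q tank 5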
+ ++--------------------------+------------------------------------------+ +| zfs_vdev_trim_min_active | Notes | ++==========================+==========================================+ +| Tags | `vdev <#vdev>`__, | +| | `ZIO_scheduler <#zio_scheduler>`__ | ++--------------------------+------------------------------------------+ +| When to change | See `ZFS I/O | +| | Scheduler `__ | ++--------------------------+------------------------------------------+ +| Data Type | uint32 | ++--------------------------+------------------------------------------+ +| Units | I/O operations | ++--------------------------+------------------------------------------+ +| Range | 1 to | +| | `zfs_vdev_trim_m | +| | ax_active <#zfs_vdev_trim_max_active>`__ | ++--------------------------+------------------------------------------+ +| Default | 1 | ++--------------------------+------------------------------------------+ +| Change | Dynamic | ++--------------------------+------------------------------------------+ +| Versions Affected | planned for v2 | ++--------------------------+------------------------------------------+ + +zfs_initialize_value +~~~~~~~~~~~~~~~~~~~~ + +When initializing a vdev, ZFS writes patterns of +``zfs_initialize_value`` bytes to the device. + ++----------------------+----------------------------------------------+ +| zfs_initialize_value | Notes | ++======================+==============================================+ +| Tags | `vdev_initialize <#vdev_initialize>`__ | ++----------------------+----------------------------------------------+ +| When to change | when debugging initialization code | ++----------------------+----------------------------------------------+ +| Data Type | uint32 or uint64 | ++----------------------+----------------------------------------------+ +| Default | 0xdeadbeef for 32-bit systems, | +| | 0xdeadbeefdeadbeee for 64-bit systems | ++----------------------+----------------------------------------------+ +| Change | prior to running ``zpool initialize`` | ++----------------------+----------------------------------------------+ +| Versions Affected | planned for v2 | ++----------------------+----------------------------------------------+ + +zfs_lua_max_instrlimit +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_lua_max_instrlimit`` limits the maximum time for a ZFS channel +program to run. + ++------------------------+--------------------------------------------+ +| zfs_lua_max_instrlimit | Notes | ++========================+============================================+ +| Tags | `channel_programs <#channel_programs>`__ | ++------------------------+--------------------------------------------+ +| When to change | to enforce a CPU usage limit on ZFS | +| | channel programs | ++------------------------+--------------------------------------------+ +| Data Type | ulong | ++------------------------+--------------------------------------------+ +| Units | LUA instructions | ++------------------------+--------------------------------------------+ +| Range | 0 to MAX_ULONG | ++------------------------+--------------------------------------------+ +| Default | 100,000,000 | ++------------------------+--------------------------------------------+ +| Change | Dynamic | ++------------------------+--------------------------------------------+ +| Versions Affected | planned for v2 | ++------------------------+--------------------------------------------+ + +zfs_lua_max_memlimit +~~~~~~~~~~~~~~~~~~~~ + +'zfs_lua_max_memlimit' is the maximum memory limit for a ZFS channel +program. 
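+
+These module-wide limits cap what a single channel program invocation may
+request. As a sketch, assuming the ``-m`` (memory limit) and ``-t``
+(instruction limit) options of ``zfs program`` (the pool name and script
+path are placeholders), a per-invocation request must stay at or below
+the module limits:
+
+::
+
+   # run a channel program with a 10 MiB memory limit and a
+   # 10,000,000 instruction limit, both below the module defaults
+   zfs program -m 10485760 -t 10000000 tank /path/to/script.lua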
+ +==================== ======================================== +zfs_lua_max_memlimit Notes +==================== ======================================== +Tags `channel_programs <#channel_programs>`__ +When to change +Data Type ulong +Units bytes +Range 0 to MAX_ULONG +Default 104,857,600 (100 MiB) +Change Dynamic +Versions Affected planned for v2 +==================== ======================================== + +zfs_max_dataset_nesting +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_max_dataset_nesting`` limits the depth of nested datasets. Deeply +nested datasets can overflow the stack. The maximum stack depth depends +on kernel compilation options, so it is impractical to predict the +possible limits. For kernels compiled with small stack sizes, +``zfs_max_dataset_nesting`` may require changes. + ++-------------------------+-------------------------------------------+ +| zfs_max_dataset_nesting | Notes | ++=========================+===========================================+ +| Tags | `dataset <#dataset>`__ | ++-------------------------+-------------------------------------------+ +| When to change | can be tuned temporarily to fix existing | +| | datasets that exceed the predefined limit | ++-------------------------+-------------------------------------------+ +| Data Type | int | ++-------------------------+-------------------------------------------+ +| Units | datasets | ++-------------------------+-------------------------------------------+ +| Range | 0 to MAX_INT | ++-------------------------+-------------------------------------------+ +| Default | 50 | ++-------------------------+-------------------------------------------+ +| Change | Dynamic, though once on-disk the value | +| | for the pool is set | ++-------------------------+-------------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------------+-------------------------------------------+ + +zfs_ddt_data_is_special +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_ddt_data_is_special`` enables the deduplication table (DDT) to +reside on a special top-level vdev. + ++-------------------------+-------------------------------------------+ +| zfs_ddt_data_is_special | Notes | ++=========================+===========================================+ +| Tags | `dedup <#dedup>`__, | +| | `special_vdev <#special_vdev>`__ | ++-------------------------+-------------------------------------------+ +| When to change | when using a special top-level vdev and | +| | no dedup top-level vdev and it is desired | +| | to store the DDT in the main pool | +| | top-level vdevs | ++-------------------------+-------------------------------------------+ +| Data Type | boolean | ++-------------------------+-------------------------------------------+ +| Range | 0=do not use special vdevs to store DDT, | +| | 1=store DDT in special vdevs | ++-------------------------+-------------------------------------------+ +| Default | 1 | ++-------------------------+-------------------------------------------+ +| Change | Dynamic | ++-------------------------+-------------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------------+-------------------------------------------+ + +zfs_user_indirect_is_special +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If special vdevs are in use, ``zfs_user_indirect_is_special`` enables +user data indirect blocks (a form of metadata) to be written to the +special vdevs. 
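+
+Both this tunable and ``zfs_ddt_data_is_special`` only matter for pools
+that contain a special allocation class vdev. As a rough sketch (pool and
+device names are placeholders):
+
+::
+
+   # add a mirrored special vdev to hold metadata (and, optionally, DDT)
+   zpool add tank special mirror sdd sde
+
+   # per-vdev space usage, including the special vdev, is shown by
+   zpool list -v tank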
+ ++------------------------------+--------------------------------------+ +| zfs_user_indirect_is_special | Notes | ++==============================+======================================+ +| Tags | `special_vdev <#special_vdev>`__ | ++------------------------------+--------------------------------------+ +| When to change | to force user data indirect blocks | +| | to remain in the main pool top-level | +| | vdevs | ++------------------------------+--------------------------------------+ +| Data Type | boolean | ++------------------------------+--------------------------------------+ +| Range | 0=do not write user indirect blocks | +| | to a special vdev, 1=write user | +| | indirect blocks to a special vdev | ++------------------------------+--------------------------------------+ +| Default | 1 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | planned for v2 | ++------------------------------+--------------------------------------+ + +zfs_reconstruct_indirect_combinations_max +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +After device removal, if an indirect split block contains more than +``zfs_reconstruct_indirect_combinations_max`` many possible unique +combinations when being reconstructed, it can be considered too +computationally expensive to check them all. Instead, at most +``zfs_reconstruct_indirect_combinations_max`` randomly-selected +combinations are attempted each time the block is accessed. This allows +all segment copies to participate fairly in the reconstruction when all +combinations cannot be checked and prevents repeated use of one bad +copy. + ++----------------------------------+----------------------------------+ +| zfs_recon | Notes | +| struct_indirect_combinations_max | | ++==================================+==================================+ +| Tags | `vdev_removal <#vdev_removal>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | int | ++----------------------------------+----------------------------------+ +| Units | attempts | ++----------------------------------+----------------------------------+ +| Range | 0=do not limit attempts, 1 to | +| | MAX_INT = limit for attempts | ++----------------------------------+----------------------------------+ +| Default | 4096 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_send_unmodified_spill_blocks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_send_unmodified_spill_blocks`` enables sending of unmodified spill +blocks in the send stream. Under certain circumstances, previous +versions of ZFS could incorrectly remove the spill block from an +existing object. Including unmodified copies of the spill blocks creates +a backwards compatible stream which will recreate a spill block if it +was incorrectly removed. 
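+
+As a sketch, assuming the tunable is exposed under
+``/sys/module/zfs/parameters/`` (pool and dataset names are placeholders):
+
+::
+
+   # 1 = unmodified spill blocks are included in send streams
+   cat /sys/module/zfs/parameters/zfs_send_unmodified_spill_blocks
+
+   # streams generated while enabled can recreate an incorrectly
+   # removed spill block on the receiving side
+   zfs send -R tank/fs@snap | zfs receive -d backup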
+ ++----------------------------------+----------------------------------+ +| zfs_send_unmodified_spill_blocks | Notes | ++==================================+==================================+ +| Tags | `send <#send>`__ | ++----------------------------------+----------------------------------+ +| When to change | TBD | ++----------------------------------+----------------------------------+ +| Data Type | boolean | ++----------------------------------+----------------------------------+ +| Range | 0=do not send unmodified spill | +| | blocks, 1=send unmodified spill | +| | blocks | ++----------------------------------+----------------------------------+ +| Default | 1 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_spa_discard_memory_limit +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_spa_discard_memory_limit`` sets the limit for maximum memory used +for prefetching a pool's checkpoint space map on each vdev while +discarding a pool checkpoint. + +============================ ============================ +zfs_spa_discard_memory_limit Notes +============================ ============================ +Tags `checkpoint <#checkpoint>`__ +When to change TBD +Data Type int +Units bytes +Range 0 to MAX_INT +Default 16,777,216 (16 MiB) +Change Dynamic +Versions Affected planned for v2 +============================ ============================ + +zfs_special_class_metadata_reserve_pct +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_special_class_metadata_reserve_pct`` sets a threshold for space in +special vdevs to be reserved exclusively for metadata. This prevents +small blocks or dedup table from completely consuming a special vdev. + +====================================== ================================ +zfs_special_class_metadata_reserve_pct Notes +====================================== ================================ +Tags `special_vdev <#special_vdev>`__ +When to change TBD +Data Type int +Units percent +Range 0 to 100 +Default 25 +Change Dynamic +Versions Affected planned for v2 +====================================== ================================ + +zfs_trim_extent_bytes_max +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_trim_extent_bytes_max`` sets the maximum size of a trim (aka +discard, scsi unmap) command. Ranges larger than +``zfs_trim_extent_bytes_max`` are split in to chunks no larger than +``zfs_trim_extent_bytes_max`` bytes prior to being issued to the device. +Use ``zpool iostat -w`` to observe the latency of trim commands. 
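+
+As a worked example with the default value: a single 1 GiB run of free
+space is issued as eight 128 MiB trim commands. Raising
+``zfs_trim_extent_bytes_max`` allows the same run to be issued as fewer,
+larger commands on devices that handle large trims efficiently.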
+ ++---------------------------+-----------------------------------------+ +| zfs_trim_extent_bytes_max | Notes | ++===========================+=========================================+ +| Tags | `trim <#trim>`__ | ++---------------------------+-----------------------------------------+ +| When to change | if the device can efficiently handle | +| | larger trim requests | ++---------------------------+-----------------------------------------+ +| Data Type | uint | ++---------------------------+-----------------------------------------+ +| Units | bytes | ++---------------------------+-----------------------------------------+ +| Range | `zfs_trim_extent_by | +| | tes_min <#zfs_trim_extent_bytes_min>`__ | +| | to MAX_UINT | ++---------------------------+-----------------------------------------+ +| Default | 134,217,728 (128 MiB) | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Versions Affected | planned for v2 | ++---------------------------+-----------------------------------------+ + +zfs_trim_extent_bytes_min +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_trim_extent_bytes_min`` sets the minimum size of trim (aka +discard, scsi unmap) commands. Trim ranges smaller than +``zfs_trim_extent_bytes_min`` are skipped unless they're part of a +larger range which was broken in to chunks. Some devices have +performance degradation during trim operations, so using a larger +``zfs_trim_extent_bytes_min`` can reduce the total amount of space +trimmed. Use ``zpool iostat -w`` to observe the latency of trim +commands. + ++---------------------------+-----------------------------------------+ +| zfs_trim_extent_bytes_min | Notes | ++===========================+=========================================+ +| Tags | `trim <#trim>`__ | ++---------------------------+-----------------------------------------+ +| When to change | when trim is in use and device | +| | performance suffers from trimming small | +| | allocations | ++---------------------------+-----------------------------------------+ +| Data Type | uint | ++---------------------------+-----------------------------------------+ +| Units | bytes | ++---------------------------+-----------------------------------------+ +| Range | 0=trim all unallocated space, otherwise | +| | minimum physical block size to MAX\_ | ++---------------------------+-----------------------------------------+ +| Default | 32,768 (32 KiB) | ++---------------------------+-----------------------------------------+ +| Change | Dynamic | ++---------------------------+-----------------------------------------+ +| Versions Affected | planned for v2 | ++---------------------------+-----------------------------------------+ + +zfs_trim_metaslab_skip +~~~~~~~~~~~~~~~~~~~~~~ + +| ``zfs_trim_metaslab_skip`` enables uninitialized metaslabs to be + skipped during the trim (aka discard, scsi unmap) process. + ``zfs_trim_metaslab_skip`` can be useful for pools constructed from + large thinly-provisioned devices where trim operations perform slowly. +| As a pool ages an increasing fraction of the pool's metaslabs are + initialized, progressively degrading the usefulness of this option. + This setting is stored when starting a manual trim and persists for + the duration of the requested trim. Use ``zpool iostat -w`` to observe + the latency of trim commands. 
+
++------------------------+--------------------------------------------+
+| zfs_trim_metaslab_skip | Notes |
++========================+============================================+
+| Tags | `trim <#trim>`__ |
++------------------------+--------------------------------------------+
+| When to change | |
++------------------------+--------------------------------------------+
+| Data Type | boolean |
++------------------------+--------------------------------------------+
+| Range | 0=do not skip uninitialized metaslabs |
+| | during trim, 1=skip uninitialized |
+| | metaslabs during trim |
++------------------------+--------------------------------------------+
+| Default | 0 |
++------------------------+--------------------------------------------+
+| Change | Dynamic |
++------------------------+--------------------------------------------+
+| Versions Affected | planned for v2 |
++------------------------+--------------------------------------------+
+
+zfs_trim_queue_limit
+~~~~~~~~~~~~~~~~~~~~
+
+``zfs_trim_queue_limit`` sets the maximum queue depth for leaf vdevs.
+See also `zfs_vdev_trim_max_active <#zfs_vdev_trim_max_active>`__ and
+`zfs_trim_extent_bytes_max <#zfs_trim_extent_bytes_max>`__. Use
+``zpool iostat -q`` to observe trim queue depth.
+
++----------------------+------------------------------------------------------+
+| zfs_trim_queue_limit | Notes |
++======================+======================================================+
+| Tags | `trim <#trim>`__ |
++----------------------+------------------------------------------------------+
+| When to change | to restrict the number of trim commands in the queue |
++----------------------+------------------------------------------------------+
+| Data Type | uint |
++----------------------+------------------------------------------------------+
+| Units | I/O operations |
++----------------------+------------------------------------------------------+
+| Range | 1 to MAX_UINT |
++----------------------+------------------------------------------------------+
+| Default | 10 |
++----------------------+------------------------------------------------------+
+| Change | Dynamic |
++----------------------+------------------------------------------------------+
+| Versions Affected | planned for v2 |
++----------------------+------------------------------------------------------+
+
+zfs_trim_txg_batch
+~~~~~~~~~~~~~~~~~~
+
+``zfs_trim_txg_batch`` sets the number of transaction groups worth of
+frees which should be aggregated before trim (aka discard, scsi unmap)
+commands are issued to a device. This setting represents a trade-off
+between issuing larger, more efficient trim commands and the delay
+before the recently trimmed space is available for use by the device.
+
+Increasing this value will allow frees to be aggregated for a longer
+time. This will result in larger trim operations and potentially
+increased memory usage. Decreasing this value will have the opposite
+effect. The default value of 32 was empirically determined to be a
+reasonable compromise.
+
+================== ===================
+zfs_trim_txg_batch Notes
+================== ===================
+Tags `trim <#trim>`__
+When to change TBD
+Data Type uint
+Units transaction groups
+Range 1 to MAX_UINT
+Default 32
+Change Dynamic
+Versions Affected planned for v2
+================== ===================
+
+zfs_vdev_aggregate_trim
+~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_vdev_aggregate_trim`` allows trim I/Os to be aggregated.
This is
+normally not helpful because the extents to be trimmed will have
+already been aggregated by the metaslab.
+
++-------------------------+-------------------------------------------+
+| zfs_vdev_aggregate_trim | Notes |
++=========================+===========================================+
+| Tags | `trim <#trim>`__, `vdev <#vdev>`__, |
+| | `ZIO_scheduler <#zio_scheduler>`__ |
++-------------------------+-------------------------------------------+
+| When to change | when debugging trim code or trim |
+| | performance issues |
++-------------------------+-------------------------------------------+
+| Data Type | boolean |
++-------------------------+-------------------------------------------+
+| Range | 0=do not attempt to aggregate trim |
+| | commands, 1=attempt to aggregate trim |
+| | commands |
++-------------------------+-------------------------------------------+
+| Default | 0 |
++-------------------------+-------------------------------------------+
+| Change | Dynamic |
++-------------------------+-------------------------------------------+
+| Versions Affected | planned for v2 |
++-------------------------+-------------------------------------------+
+
+zfs_vdev_aggregation_limit_non_rotating
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``zfs_vdev_aggregation_limit_non_rotating`` is the equivalent of
+`zfs_vdev_aggregation_limit <#zfs_vdev_aggregation_limit>`__ for devices
+which represent themselves as non-rotating to the Linux blkdev
+interfaces. Such devices have a value of 0 in
+``/sys/block/DEVICE/queue/rotational`` and are expected to be SSDs.
+
++----------------------------------+----------------------------------+
+| zfs_vde | Notes |
+| v_aggregation_limit_non_rotating | |
++==================================+==================================+
+| Tags | `vdev <#vdev>`__, |
+| | `Z |
+| | IO_scheduler <#zio_scheduler>`__ |
++----------------------------------+----------------------------------+
+| When to change | see |
+| | `zfs_vdev_aggregation_limit |
+| | <#zfs_vdev_aggregation_limit>`__ |
++----------------------------------+----------------------------------+
+| Data Type | int |
++----------------------------------+----------------------------------+
+| Units | bytes |
++----------------------------------+----------------------------------+
+| Range | 0 to MAX_INT |
++----------------------------------+----------------------------------+
+| Default | 131,072 bytes (128 KiB) |
++----------------------------------+----------------------------------+
+| Change | Dynamic |
++----------------------------------+----------------------------------+
+| Versions Affected | planned for v2 |
++----------------------------------+----------------------------------+
+
+zil_nocacheflush
+~~~~~~~~~~~~~~~~
+
+ZFS uses barriers (volatile cache flush commands) to ensure data is
+committed to permanent media by devices. This ensures consistent
+on-media state for devices where caches are volatile (e.g. HDDs).
+
+``zil_nocacheflush`` disables the cache flush commands that are normally
+sent to devices by the ZIL after a log write has completed.
+
+The difference between ``zil_nocacheflush`` and
+`zfs_nocacheflush <#zfs_nocacheflush>`__ is that ``zil_nocacheflush``
+applies to ZIL writes, while `zfs_nocacheflush <#zfs_nocacheflush>`__
+disables barrier writes to the pool devices at the end of transaction
+group syncs.
+
+WARNING: setting this can cause ZIL corruption on power loss if the
+device has a volatile write cache.
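+
+As a sketch, assuming the tunable is exposed under
+``/sys/module/zfs/parameters/``, and only for devices whose write cache
+is known to be nonvolatile:
+
+::
+
+   # 0 = ZIL cache flushes are sent (the safe default)
+   cat /sys/module/zfs/parameters/zil_nocacheflush
+
+   # disable ZIL cache flushes; unsafe if any log device has a
+   # volatile write cache
+   echo 1 > /sys/module/zfs/parameters/zil_nocacheflush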
+
++-------------------+-------------------------------------------------+
+| zil_nocacheflush | Notes |
++===================+=================================================+
+| Tags | `disks <#disks>`__, `ZIL <#ZIL>`__ |
++-------------------+-------------------------------------------------+
+| When to change | If the storage device has nonvolatile cache, |
+| | then disabling cache flush can save the cost of |
+| | occasional cache flush commands |
++-------------------+-------------------------------------------------+
+| Data Type | boolean |
++-------------------+-------------------------------------------------+
+| Range | 0=send cache flush commands, 1=do not send |
+| | cache flush commands |
++-------------------+-------------------------------------------------+
+| Default | 0 |
++-------------------+-------------------------------------------------+
+| Change | Dynamic |
++-------------------+-------------------------------------------------+
+| Versions Affected | planned for v2 |
++-------------------+-------------------------------------------------+
+
+zio_deadman_log_all
+~~~~~~~~~~~~~~~~~~~
+
+``zio_deadman_log_all`` enables debugging messages for all ZFS I/Os,
+rather than only for leaf ZFS I/Os for a vdev. This is meant to be used
+by developers to gain diagnostic information for hang conditions which
+don't involve a mutex or other locking primitive. Typically these are
+conditions where a thread in the zio pipeline is looping indefinitely.
+
+See also `zfs_dbgmsg_enable <#zfs_dbgmsg_enable>`__
+
++---------------------+-----------------------------------------------+
+| zio_deadman_log_all | Notes |
++=====================+===============================================+
+| Tags | `debug <#debug>`__ |
++---------------------+-----------------------------------------------+
+| When to change | when debugging the ZFS I/O pipeline |
++---------------------+-----------------------------------------------+
+| Data Type | boolean |
++---------------------+-----------------------------------------------+
+| Range | 0=do not log all deadman events, 1=log all |
+| | deadman events |
++---------------------+-----------------------------------------------+
+| Default | 0 |
++---------------------+-----------------------------------------------+
+| Change | Dynamic |
++---------------------+-----------------------------------------------+
+| Versions Affected | planned for v2 |
++---------------------+-----------------------------------------------+
+
+zio_decompress_fail_fraction
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If non-zero, ``zio_decompress_fail_fraction`` represents the denominator
+of the probability that ZFS should induce a decompression failure. For
+instance, for a 5% decompression failure rate, this value should be set
+to 20.
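+
+As a sketch, assuming the tunable is exposed under
+``/sys/module/zfs/parameters/``, the induced failure rate is
+1/``zio_decompress_fail_fraction``:
+
+::
+
+   # induce a decompression failure on roughly 1 in 20 (5%) of decompressions
+   echo 20 > /sys/module/zfs/parameters/zio_decompress_fail_fraction
+
+   # disable induced failures again
+   echo 0 > /sys/module/zfs/parameters/zio_decompress_fail_fraction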
+ ++------------------------------+--------------------------------------+ +| zio_decompress_fail_fraction | Notes | ++==============================+======================================+ +| Tags | `debug <#debug>`__ | ++------------------------------+--------------------------------------+ +| When to change | when debugging ZFS internal | +| | compressed buffer code | ++------------------------------+--------------------------------------+ +| Data Type | ulong | ++------------------------------+--------------------------------------+ +| Units | probability of induced decompression | +| | failure is | +| | 1/``zio_decompress_fail_fraction`` | ++------------------------------+--------------------------------------+ +| Range | 0 = do not induce failures, or 1 to | +| | MAX_ULONG | ++------------------------------+--------------------------------------+ +| Default | 0 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | planned for v2 | ++------------------------------+--------------------------------------+ + +zio_slow_io_ms +~~~~~~~~~~~~~~ + +An I/O operation taking more than ``zio_slow_io_ms`` milliseconds to +complete is marked as a slow I/O. Slow I/O counters can be observed with +``zpool status -s``. Each slow I/O causes a delay zevent, observable +using ``zpool events``. See also ``zfs-events(5)``. + ++-------------------+-------------------------------------------------+ +| zio_slow_io_ms | Notes | ++===================+=================================================+ +| Tags | `vdev <#vdev>`__, `zed <#zed>`__ | ++-------------------+-------------------------------------------------+ +| When to change | when debugging slow devices and the default | +| | value is inappropriate | ++-------------------+-------------------------------------------------+ +| Data Type | int | ++-------------------+-------------------------------------------------+ +| Units | milliseconds | ++-------------------+-------------------------------------------------+ +| Range | 0 to MAX_INT | ++-------------------+-------------------------------------------------+ +| Default | 30,000 (30 seconds) | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------+-------------------------------------------------+ + +vdev_validate_skip +~~~~~~~~~~~~~~~~~~ + +``vdev_validate_skip`` disables label validation steps during pool +import. Changing is not recommended unless you know what you are doing +and are recovering a damaged label. 
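+
+Because the change must be in effect before the pool is imported, a
+hypothetical recovery session would set it at module load time (the
+pool name and device directory below are placeholders):
+
+::
+
+   # modprobe zfs vdev_validate_skip=1
+   # zpool import -d /dev/disk/by-id tank
+   # echo 0 > /sys/module/zfs/parameters/vdev_validate_skip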
+ ++--------------------+------------------------------------------------+ +| vdev_validate_skip | Notes | ++====================+================================================+ +| Tags | `vdev <#vdev>`__ | ++--------------------+------------------------------------------------+ +| When to change | do not change | ++--------------------+------------------------------------------------+ +| Data Type | boolean | ++--------------------+------------------------------------------------+ +| Range | 0=validate labels during pool import, 1=do not | +| | validate vdev labels during pool import | ++--------------------+------------------------------------------------+ +| Default | 0 | ++--------------------+------------------------------------------------+ +| Change | prior to pool import | ++--------------------+------------------------------------------------+ +| Versions Affected | planned for v2 | ++--------------------+------------------------------------------------+ + +zfs_async_block_max_blocks +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_async_block_max_blocks`` limits the number of blocks freed in a +single transaction group commit. During deletes of large objects, such +as snapshots, the number of freed blocks can cause the DMU to extend txg +sync times well beyond `zfs_txg_timeout <#zfs_txg_timeout>`__. +``zfs_async_block_max_blocks`` is used to limit these effects. + +========================== ==================================== +zfs_async_block_max_blocks Notes +========================== ==================================== +Tags `delete <#delete>`__, `DMU <#DMU>`__ +When to change TBD +Data Type ulong +Units blocks +Range 1 to MAX_ULONG +Default MAX_ULONG (do not limit) +Change Dynamic +Versions Affected planned for v2 +========================== ==================================== + +zfs_checksum_events_per_second +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_checksum_events_per_second`` is a rate limit for checksum events. +Note that this should not be set below the ``zed`` thresholds (currently +10 checksums over 10 sec) or else ``zed`` may not trigger any action. + +============================== ============================= +zfs_checksum_events_per_second Notes +============================== ============================= +Tags `vdev <#vdev>`__ +When to change TBD +Data Type uint +Units checksum events +Range ``zed`` threshold to MAX_UINT +Default 20 +Change Dynamic +Versions Affected planned for v2 +============================== ============================= + +zfs_disable_ivset_guid_check +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_disable_ivset_guid_check`` disables requirement for IVset guids to +be present and match when doing a raw receive of encrypted datasets. +Intended for users whose pools were created with ZFS on Linux +pre-release versions and now have compatibility issues. + +For a ZFS raw receive, from a send stream created by ``zfs send --raw``, +the crypt_keydata nvlist includes a to_ivset_guid to be set on the new +snapshot. This value will override the value generated by the snapshot +code. However, this value may not be present, because older +implementations of the raw send code did not include this value. When +``zfs_disable_ivset_guid_check`` is enabled, the receive proceeds and a +newly-generated value is used. 
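+
+A hedged example of receiving such a stream with the check disabled;
+the dataset names are placeholders and the sysfs path assumes the usual
+module parameter interface:
+
+::
+
+   # echo 1 > /sys/module/zfs/parameters/zfs_disable_ivset_guid_check
+   # zfs send --raw pool/encrypted@snap | zfs receive backup/encrypted
+   # echo 0 > /sys/module/zfs/parameters/zfs_disable_ivset_guid_check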
+ ++------------------------------+--------------------------------------+ +| zfs_disable_ivset_guid_check | Notes | ++==============================+======================================+ +| Tags | `receive <#receive>`__ | ++------------------------------+--------------------------------------+ +| When to change | debugging pre-release ZFS raw sends | ++------------------------------+--------------------------------------+ +| Data Type | boolean | ++------------------------------+--------------------------------------+ +| Range | 0=check IVset guid, 1=do not check | +| | IVset guid | ++------------------------------+--------------------------------------+ +| Default | 0 | ++------------------------------+--------------------------------------+ +| Change | Dynamic | ++------------------------------+--------------------------------------+ +| Versions Affected | planned for v2 | ++------------------------------+--------------------------------------+ + +zfs_obsolete_min_time_ms +~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_obsolete_min_time_ms`` is similar to +`zfs_free_min_time_ms <#zfs_free_min_time_ms>`__ and used for cleanup of +old indirection records for vdevs removed using the ``zpool remove`` +command. + +======================== ========================================== +zfs_obsolete_min_time_ms Notes +======================== ========================================== +Tags `delete <#delete>`__, `remove <#remove>`__ +When to change TBD +Data Type int +Units milliseconds +Range 0 to MAX_INT +Default 500 +Change Dynamic +Versions Affected planned for v2 +======================== ========================================== + +zfs_override_estimate_recordsize +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_override_estimate_recordsize`` overrides the default logic for +estimating block sizes when doing a zfs send. The default heuristic is +that the average block size will be the current recordsize. + ++----------------------------------+----------------------------------+ +| zfs_override_estimate_recordsize | Notes | ++==================================+==================================+ +| Tags | `send <#send>`__ | ++----------------------------------+----------------------------------+ +| When to change | if most data in your dataset is | +| | not of the current recordsize | +| | and you require accurate zfs | +| | send size estimates | ++----------------------------------+----------------------------------+ +| Data Type | ulong | ++----------------------------------+----------------------------------+ +| Units | bytes | ++----------------------------------+----------------------------------+ +| Range | 0=do not override, 1 to | +| | MAX_ULONG | ++----------------------------------+----------------------------------+ +| Default | 0 | ++----------------------------------+----------------------------------+ +| Change | Dynamic | ++----------------------------------+----------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------------+----------------------------------+ + +zfs_remove_max_segment +~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_remove_max_segment`` sets the largest contiguous segment that ZFS +attempts to allocate when removing a vdev. This can be no larger than +16MB. If there is a performance problem with attempting to allocate +large blocks, consider decreasing this. The value is rounded up to a +power-of-2. 
+ ++------------------------+--------------------------------------------+ +| zfs_remove_max_segment | Notes | ++========================+============================================+ +| Tags | `remove <#remove>`__ | ++------------------------+--------------------------------------------+ +| When to change | after removing a top-level vdev, consider | +| | decreasing if there is a performance | +| | degradation when attempting to allocate | +| | large blocks | ++------------------------+--------------------------------------------+ +| Data Type | int | ++------------------------+--------------------------------------------+ +| Units | bytes | ++------------------------+--------------------------------------------+ +| Range | maximum of the physical block size of all | +| | vdevs in the pool to 16,777,216 bytes (16 | +| | MiB) | ++------------------------+--------------------------------------------+ +| Default | 16,777,216 bytes (16 MiB) | ++------------------------+--------------------------------------------+ +| Change | Dynamic | ++------------------------+--------------------------------------------+ +| Versions Affected | planned for v2 | ++------------------------+--------------------------------------------+ + +zfs_resilver_disable_defer +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_resilver_disable_defer`` disables the ``resilver_defer`` pool +feature. The ``resilver_defer`` feature allows ZFS to postpone new +resilvers if an existing resilver is in progress. + ++----------------------------+----------------------------------------+ +| zfs_resilver_disable_defer | Notes | ++============================+========================================+ +| Tags | `resilver <#resilver>`__ | ++----------------------------+----------------------------------------+ +| When to change | if resilver postponement is not | +| | desired due to overall resilver time | +| | constraints | ++----------------------------+----------------------------------------+ +| Data Type | boolean | ++----------------------------+----------------------------------------+ +| Range | 0=allow ``resilver_defer`` to postpone | +| | new resilver operations, 1=immediately | +| | restart resilver when needed | ++----------------------------+----------------------------------------+ +| Default | 0 | ++----------------------------+----------------------------------------+ +| Change | Dynamic | ++----------------------------+----------------------------------------+ +| Versions Affected | planned for v2 | ++----------------------------+----------------------------------------+ + +zfs_scan_suspend_progress +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_scan_suspend_progress`` causes a scrub or resilver scan to freeze +without actually pausing. + +========================= ============================================ +zfs_scan_suspend_progress Notes +========================= ============================================ +Tags `resilver <#resilver>`__, `scrub <#scrub>`__ +When to change testing or debugging scan code +Data Type boolean +Range 0=do not freeze scans, 1=freeze scans +Default 0 +Change Dynamic +Versions Affected planned for v2 +========================= ============================================ + +zfs_scrub_min_time_ms +~~~~~~~~~~~~~~~~~~~~~ + +Scrubs are processed by the sync thread. While scrubbing at least +``zfs_scrub_min_time_ms`` time is spent working on a scrub between txg +syncs. 
+ +===================== ================================================= +zfs_scrub_min_time_ms Notes +===================== ================================================= +Tags `scrub <#scrub>`__ +When to change +Data Type int +Units milliseconds +Range 1 to (`zfs_txg_timeout <#zfs_txg_timeout>`__ - 1) +Default 1,000 +Change Dynamic +Versions Affected planned for v2 +===================== ================================================= + +zfs_slow_io_events_per_second +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_slow_io_events_per_second`` is a rate limit for slow I/O events. +Note that this should not be set below the ``zed`` thresholds (currently +10 checksums over 10 sec) or else ``zed`` may not trigger any action. + +============================= ============================= +zfs_slow_io_events_per_second Notes +============================= ============================= +Tags `vdev <#vdev>`__ +When to change TBD +Data Type uint +Units slow I/O events +Range ``zed`` threshold to MAX_UINT +Default 20 +Change Dynamic +Versions Affected planned for v2 +============================= ============================= + +zfs_vdev_min_ms_count +~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_min_ms_count`` is the minimum number of metaslabs to create +in a top-level vdev. + ++-----------------------+---------------------------------------------+ +| zfs_vdev_min_ms_count | Notes | ++=======================+=============================================+ +| Tags | `metaslab <#metaslab>`__, `vdev <#vdev>`__ | ++-----------------------+---------------------------------------------+ +| When to change | TBD | ++-----------------------+---------------------------------------------+ +| Data Type | int | ++-----------------------+---------------------------------------------+ +| Units | metaslabs | ++-----------------------+---------------------------------------------+ +| Range | 16 to | +| | `zfs_vdev_m | +| | s_count_limit <#zfs_vdev_ms_count_limit>`__ | ++-----------------------+---------------------------------------------+ +| Default | 16 | ++-----------------------+---------------------------------------------+ +| Change | prior to creating a pool or adding a | +| | top-level vdev | ++-----------------------+---------------------------------------------+ +| Versions Affected | planned for v2 | ++-----------------------+---------------------------------------------+ + +zfs_vdev_ms_count_limit +~~~~~~~~~~~~~~~~~~~~~~~ + +``zfs_vdev_ms_count_limit`` is the practical upper limit for the number +of metaslabs per top-level vdev. 
+ ++-------------------------+-------------------------------------------+ +| zfs_vdev_ms_count_limit | Notes | ++=========================+===========================================+ +| Tags | `metaslab <#metaslab>`__, | +| | `vdev <#vdev>`__ | ++-------------------------+-------------------------------------------+ +| When to change | TBD | ++-------------------------+-------------------------------------------+ +| Data Type | int | ++-------------------------+-------------------------------------------+ +| Units | metaslabs | ++-------------------------+-------------------------------------------+ +| Range | `zfs_vdev | +| | _min_ms_count <#zfs_vdev_min_ms_count>`__ | +| | to 131,072 | ++-------------------------+-------------------------------------------+ +| Default | 131,072 | ++-------------------------+-------------------------------------------+ +| Change | prior to creating a pool or adding a | +| | top-level vdev | ++-------------------------+-------------------------------------------+ +| Versions Affected | planned for v2 | ++-------------------------+-------------------------------------------+ + +spl_hostid +~~~~~~~~~~ + +| ``spl_hostid`` is a unique system id number. It orginated in Sun's + products where most systems had a unique id assigned at the factory. + This assignment does not exist in modern hardware. +| In ZFS, the hostid is stored in the vdev label and can be used to + determine if another system had imported the pool. When set + ``spl_hostid`` can be used to uniquely identify a system. By default + this value is set to zero which indicates the hostid is disabled. It + can be explicitly enabled by placing a unique non-zero value in the + file shown in `spl_hostid_path <#spl_hostid_path>`__ + ++-------------------+-------------------------------------------------+ +| spl_hostid | Notes | ++===================+=================================================+ +| Tags | `hostid <#hostid>`__, `MMP <#MMP>`__ | ++-------------------+-------------------------------------------------+ +| Kernel module | spl | ++-------------------+-------------------------------------------------+ +| When to change | to uniquely identify a system when vdevs can be | +| | shared across multiple systems | ++-------------------+-------------------------------------------------+ +| Data Type | ulong | ++-------------------+-------------------------------------------------+ +| Range | 0=ignore hostid, 1 to 4,294,967,295 (32-bits or | +| | 0xffffffff) | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | prior to importing pool | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.6.1 | ++-------------------+-------------------------------------------------+ + +spl_hostid_path +~~~~~~~~~~~~~~~ + +``spl_hostid_path`` is the path name for a file that can contain a +unique hostid. For testing purposes, ``spl_hostid_path`` can be +overridden by the ZFS_HOSTID environment variable. 
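+
+As a brief sketch, a hostid may also be assigned directly through the
+module parameter at load time (the value below is arbitrary, and the
+sysfs path assumes the usual module parameter interface):
+
+::
+
+   # modprobe spl spl_hostid=19088743
+   # cat /sys/module/spl/parameters/spl_hostid
+   19088743
+
+Remember that the value must be in place before the pool is imported.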
+
++-------------------+-------------------------------------------------+
+| spl_hostid_path   | Notes                                           |
++===================+=================================================+
+| Tags              | `hostid <#hostid>`__, `MMP <#MMP>`__            |
++-------------------+-------------------------------------------------+
+| Kernel module     | spl                                             |
++-------------------+-------------------------------------------------+
+| When to change    | when creating a new ZFS distribution where the  |
+|                   | default value is inappropriate                  |
++-------------------+-------------------------------------------------+
+| Data Type         | string                                          |
++-------------------+-------------------------------------------------+
+| Default           | "/etc/hostid"                                   |
++-------------------+-------------------------------------------------+
+| Change            | read-only, can only be changed prior to spl     |
+|                   | module load                                     |
++-------------------+-------------------------------------------------+
+| Versions Affected | v0.6.1                                          |
++-------------------+-------------------------------------------------+
+
+spl_kmem_alloc_max
+~~~~~~~~~~~~~~~~~~
+
+Large ``kmem_alloc()`` allocations fail if they exceed KMALLOC_MAX_SIZE,
+as determined by the kernel source. Allocations which are marginally
+smaller than this limit may succeed but should still be avoided due to
+the expense of locating a contiguous range of free pages. Therefore, a
+maximum kmem size with a reasonable safety margin of 4x is set.
+``kmem_alloc()`` allocations larger than this maximum will quickly fail.
+``vmem_alloc()`` allocations less than or equal to this value will use
+``kmalloc()``, but shift to ``vmalloc()`` when exceeding this value.
+
+================== ====================
+spl_kmem_alloc_max Notes
+================== ====================
+Tags               `memory <#memory>`__
+Kernel module      spl
+When to change     TBD
+Data Type          uint
+Units              bytes
+Range              TBD
+Default            KMALLOC_MAX_SIZE / 4
+Change             Dynamic
+Versions Affected  v0.7.0
+================== ====================
+
+spl_kmem_alloc_warn
+~~~~~~~~~~~~~~~~~~~
+
+As a general rule ``kmem_alloc()`` allocations should be small,
+preferably just a few pages, since they must be physically contiguous.
+Therefore, a rate-limited warning is printed to the console for any
+``kmem_alloc()`` which exceeds the threshold ``spl_kmem_alloc_warn``.
+
+The default warning threshold is set to eight pages but capped at 32K to
+accommodate systems using large pages. This value was selected to be
+small enough to ensure the largest allocations are quickly noticed and
+fixed, but large enough to avoid logging warnings when an allocation
+size is larger than optimal but not a serious concern. Since this value
+is tunable, developers are encouraged to set it lower when testing so
+any new large allocations are quickly caught. These warnings may be
+disabled by setting the threshold to zero.
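+
+For instance, assuming the usual ``/sys/module/spl/parameters``
+interface for this tunable:
+
+::
+
+   # silence the large kmem_alloc() warnings entirely
+   echo 0 > /sys/module/spl/parameters/spl_kmem_alloc_warn
+
+   # or catch smaller-than-default allocations while testing
+   echo 16384 > /sys/module/spl/parameters/spl_kmem_alloc_warn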
+ ++---------------------+-----------------------------------------------+ +| spl_kmem_alloc_warn | Notes | ++=====================+===============================================+ +| Tags | `memory <#memory>`__ | ++---------------------+-----------------------------------------------+ +| Kernel module | spl | ++---------------------+-----------------------------------------------+ +| When to change | developers are encouraged lower when testing | +| | so any new, large allocations are quickly | +| | caught | ++---------------------+-----------------------------------------------+ +| Data Type | uint | ++---------------------+-----------------------------------------------+ +| Units | bytes | ++---------------------+-----------------------------------------------+ +| Range | 0=disable the warnings, | ++---------------------+-----------------------------------------------+ +| Default | 32,768 (32 KiB) | ++---------------------+-----------------------------------------------+ +| Change | Dynamic | ++---------------------+-----------------------------------------------+ +| Versions Affected | v0.7.0 | ++---------------------+-----------------------------------------------+ + +spl_kmem_cache_expire +~~~~~~~~~~~~~~~~~~~~~ + +Cache expiration is part of default illumos cache behavior. The idea is +that objects in magazines which have not been recently accessed should +be returned to the slabs periodically. This is known as cache aging and +when enabled objects will be typically returned after 15 seconds. + +On the other hand Linux slabs are designed to never move objects back to +the slabs unless there is memory pressure. This is possible because +under Linux the cache will be notified when memory is low and objects +can be released. + +By default only the Linux method is enabled. It has been shown to +improve responsiveness on low memory systems and not negatively impact +the performance of systems with more memory. This policy may be changed +by setting the ``spl_kmem_cache_expire`` bit mask as follows, both +policies may be enabled concurrently. + +===================== ================================================= +spl_kmem_cache_expire Notes +===================== ================================================= +Tags `memory <#memory>`__ +Kernel module spl +When to change TBD +Data Type bitmask +Range 0x01 - Aging (illumos), 0x02 - Low memory (Linux) +Default 0x02 +Change Dynamic +Versions Affected v0.6.1 +===================== ================================================= + +spl_kmem_cache_kmem_limit +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Depending on the size of a memory cache object it may be backed by +``kmalloc()`` or ``vmalloc()`` memory. This is because the size of the +required allocation greatly impacts the best way to allocate the memory. + +When objects are small and only a small number of memory pages need to +be allocated, ideally just one, then ``kmalloc()`` is very efficient. +However, allocating multiple pages with ``kmalloc()`` gets increasingly +expensive because the pages must be physically contiguous. + +For this reason we shift to ``vmalloc()`` for slabs of large objects +which which removes the need for contiguous pages. ``vmalloc()`` cannot +be used in all cases because there is significant locking overhead +involved. This function takes a single global lock over the entire +virtual address range which serializes all allocations. Using slightly +different allocation functions for small and large objects allows us to +handle a wide range of object sizes. 
+ +The ``spl_kmem_cache_kmem_limit`` value is used to determine this cutoff +size. One quarter of the kernel's compiled PAGE_SIZE is used as the +default value because +`spl_kmem_cache_obj_per_slab <#spl_kmem_cache_obj_per_slab>`__ defaults +to 16. With these default values, at most four contiguous pages are +allocated. + +========================= ==================== +spl_kmem_cache_kmem_limit Notes +========================= ==================== +Tags `memory <#memory>`__ +Kernel module spl +When to change TBD +Data Type uint +Units pages +Range TBD +Default PAGE_SIZE / 4 +Change Dynamic +Versions Affected v0.7.0 +========================= ==================== + +spl_kmem_cache_max_size +~~~~~~~~~~~~~~~~~~~~~~~ + +``spl_kmem_cache_max_size`` is the maximum size of a kmem cache slab in +MiB. This effectively limits the maximum cache object size to +``spl_kmem_cache_max_size`` / +`spl_kmem_cache_obj_per_slab <#spl_kmem_cache_obj_per_slab>`__ Kmem +caches may not be created with object sized larger than this limit. + +======================= ========================================= +spl_kmem_cache_max_size Notes +======================= ========================================= +Tags `memory <#memory>`__ +Kernel module spl +When to change TBD +Data Type uint +Units MiB +Range TBD +Default 4 for 32-bit kernel, 32 for 64-bit kernel +Change Dynamic +Versions Affected v0.7.0 +======================= ========================================= + +spl_kmem_cache_obj_per_slab +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``spl_kmem_cache_obj_per_slab`` is the preferred number of objects per +slab in the kmem cache. In general, a larger value will increase the +caches memory footprint while decreasing the time required to perform an +allocation. Conversely, a smaller value will minimize the footprint and +improve cache reclaim time but individual allocations may take longer. + +=========================== ==================== +spl_kmem_cache_obj_per_slab Notes +=========================== ==================== +Tags `memory <#memory>`__ +Kernel module spl +When to change TBD +Data Type uint +Units kmem cache objects +Range TBD +Default 8 +Change Dynamic +Versions Affected v0.7.0 +=========================== ==================== + +spl_kmem_cache_obj_per_slab_min +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``spl_kmem_cache_obj_per_slab_min`` is the minimum number of objects +allowed per slab. Normally slabs will contain +`spl_kmem_cache_obj_per_slab <#spl_kmem_cache_obj_per_slab>`__ objects +but for caches that contain very large objects it's desirable to only +have a few, or even just one, object per slab. + +=============================== =============================== +spl_kmem_cache_obj_per_slab_min Notes +=============================== =============================== +Tags `memory <#memory>`__ +Kernel module spl +When to change debugging kmem cache operations +Data Type uint +Units kmem cache objects +Range TBD +Default 1 +Change Dynamic +Versions Affected v0.7.0 +=============================== =============================== + +spl_kmem_cache_reclaim +~~~~~~~~~~~~~~~~~~~~~~ + +``spl_kmem_cache_reclaim`` prevents Linux from being able to rapidly +reclaim all the memory held by the kmem caches. This may be useful in +circumstances where it's preferable that Linux reclaim memory from some +other subsystem first. Setting ``spl_kmem_cache_reclaim`` increases the +likelihood out of memory events on a memory constrained system. 
+
++------------------------+--------------------------------------------+
+| spl_kmem_cache_reclaim | Notes                                      |
++========================+============================================+
+| Tags                   | `memory <#memory>`__                       |
++------------------------+--------------------------------------------+
+| Kernel module          | spl                                        |
++------------------------+--------------------------------------------+
+| When to change         | TBD                                        |
++------------------------+--------------------------------------------+
+| Data Type              | boolean                                    |
++------------------------+--------------------------------------------+
+| Range                  | 0=enable rapid memory reclaim from kmem    |
+|                        | caches, 1=disable rapid memory reclaim     |
+|                        | from kmem caches                           |
++------------------------+--------------------------------------------+
+| Default                | 0                                          |
++------------------------+--------------------------------------------+
+| Change                 | Dynamic                                    |
++------------------------+--------------------------------------------+
+| Versions Affected      | v0.7.0                                     |
++------------------------+--------------------------------------------+
+
+spl_kmem_cache_slab_limit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For small objects the Linux slab allocator should be used to make the
+most efficient use of the memory. However, large objects are not
+supported by the Linux slab allocator and therefore the SPL
+implementation is preferred. ``spl_kmem_cache_slab_limit`` is used to
+determine the cutoff between a small and large object.
+
+Objects of ``spl_kmem_cache_slab_limit`` or smaller will be allocated
+using the Linux slab allocator; larger objects use the SPL allocator. A
+cutoff of 16 KiB was determined to be optimal for architectures using 4
+KiB pages.
+
++---------------------------+-----------------------------------------+
+| spl_kmem_cache_slab_limit | Notes                                   |
++===========================+=========================================+
+| Tags                      | `memory <#memory>`__                    |
++---------------------------+-----------------------------------------+
+| Kernel module             | spl                                     |
++---------------------------+-----------------------------------------+
+| When to change            | TBD                                     |
++---------------------------+-----------------------------------------+
+| Data Type                 | uint                                    |
++---------------------------+-----------------------------------------+
+| Units                     | bytes                                   |
++---------------------------+-----------------------------------------+
+| Range                     | TBD                                     |
++---------------------------+-----------------------------------------+
+| Default                   | 16,384 (16 KiB) when kernel PAGE_SIZE = |
+|                           | 4KiB, 0 for other PAGE_SIZE values      |
++---------------------------+-----------------------------------------+
+| Change                    | Dynamic                                 |
++---------------------------+-----------------------------------------+
+| Versions Affected         | v0.7.0                                  |
++---------------------------+-----------------------------------------+
+
+spl_max_show_tasks
+~~~~~~~~~~~~~~~~~~
+
+``spl_max_show_tasks`` is the limit of tasks per pending list in each
+taskq shown in ``/proc/spl/taskq`` and ``/proc/spl/taskq-all``. Reading
+these ProcFS files walks the lists with a lock held, which could cause a
+lockup if a list grows too large. If a list is larger than the
+limit, the string "(truncated)" is printed.
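+
+For example, to inspect a busy taskq list in full (the new limit below
+is arbitrary, and the sysfs path assumes the usual module parameter
+interface):
+
+::
+
+   # echo 4096 > /sys/module/spl/parameters/spl_max_show_tasks
+   # cat /proc/spl/taskq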
+ +================== =================================== +spl_max_show_tasks Notes +================== =================================== +Tags `taskq <#taskq>`__ +Kernel module spl +When to change TBD +Data Type uint +Units tasks reported +Range 0 disables the limit, 1 to MAX_UINT +Default 512 +Change Dynamic +Versions Affected v0.7.0 +================== =================================== + +spl_panic_halt +~~~~~~~~~~~~~~ + +``spl_panic_halt`` enables kernel panic upon assertion failures. When +not enabled, the asserting thread is halted to facilitate further +debugging. + ++-------------------+-------------------------------------------------+ +| spl_panic_halt | Notes | ++===================+=================================================+ +| Tags | `debug <#debug>`__, `panic <#panic>`__ | ++-------------------+-------------------------------------------------+ +| Kernel module | spl | ++-------------------+-------------------------------------------------+ +| When to change | when debugging assertions and kernel core dumps | +| | are desired | ++-------------------+-------------------------------------------------+ +| Data Type | boolean | ++-------------------+-------------------------------------------------+ +| Range | 0=halt thread upon assertion, 1=panic kernel | +| | upon assertion | ++-------------------+-------------------------------------------------+ +| Default | 0 | ++-------------------+-------------------------------------------------+ +| Change | Dynamic | ++-------------------+-------------------------------------------------+ +| Versions Affected | v0.7.0 | ++-------------------+-------------------------------------------------+ + +spl_taskq_kick +~~~~~~~~~~~~~~ + +Upon writing a non-zero value to ``spl_taskq_kick``, all taskqs are +scanned. If any taskq has a pending task more than 5 seconds old, the +taskq spawns more threads. This can be useful in rare deadlock +situations caused by one or more taskqs not spawning a thread when it +should. + +================= ===================== +spl_taskq_kick Notes +================= ===================== +Tags `taskq <#taskq>`__ +Kernel module spl +When to change See description above +Data Type uint +Units N/A +Default 0 +Change Dynamic +Versions Affected v0.7.0 +================= ===================== + +spl_taskq_thread_bind +~~~~~~~~~~~~~~~~~~~~~ + +``spl_taskq_thread_bind`` enables binding taskq threads to specific +CPUs, distributed evenly over the available CPUs. By default, this +behavior is disabled to allow the Linux scheduler the maximum +flexibility to determine where a thread should run. 
+
++-----------------------+---------------------------------------------+
+| spl_taskq_thread_bind | Notes                                       |
++=======================+=============================================+
+| Tags                  | `CPU <#CPU>`__, `taskq <#taskq>`__          |
++-----------------------+---------------------------------------------+
+| Kernel module         | spl                                         |
++-----------------------+---------------------------------------------+
+| When to change        | when debugging CPU scheduling options       |
++-----------------------+---------------------------------------------+
+| Data Type             | boolean                                     |
++-----------------------+---------------------------------------------+
+| Range                 | 0=taskqs are not bound to specific CPUs,    |
+|                       | 1=taskqs are bound to CPUs                  |
++-----------------------+---------------------------------------------+
+| Default               | 0                                           |
++-----------------------+---------------------------------------------+
+| Change                | prior to loading spl kernel module          |
++-----------------------+---------------------------------------------+
+| Versions Affected     | v0.7.0                                      |
++-----------------------+---------------------------------------------+
+
+spl_taskq_thread_dynamic
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``spl_taskq_thread_dynamic`` enables dynamic taskqs. When enabled,
+taskqs which set the TASKQ_DYNAMIC flag will by default create only a
+single thread. New threads will be created on demand up to a maximum
+allowed number to facilitate the completion of outstanding tasks.
+Threads which are no longer needed are promptly destroyed. By default
+this behavior is enabled but it can be disabled.
+
+See also
+`zfs_zil_clean_taskq_nthr_pct <#zfs_zil_clean_taskq_nthr_pct>`__,
+`zio_taskq_batch_pct <#zio_taskq_batch_pct>`__
+
++--------------------------+------------------------------------------+
+| spl_taskq_thread_dynamic | Notes                                    |
++==========================+==========================================+
+| Tags                     | `taskq <#taskq>`__                       |
++--------------------------+------------------------------------------+
+| Kernel module            | spl                                      |
++--------------------------+------------------------------------------+
+| When to change           | disable for performance analysis or      |
+|                          | troubleshooting                          |
++--------------------------+------------------------------------------+
+| Data Type                | boolean                                  |
++--------------------------+------------------------------------------+
+| Range                    | 0=taskq threads are not dynamic, 1=taskq |
+|                          | threads are dynamically created and      |
+|                          | destroyed                                |
++--------------------------+------------------------------------------+
+| Default                  | 1                                        |
++--------------------------+------------------------------------------+
+| Change                   | prior to loading spl kernel module       |
++--------------------------+------------------------------------------+
+| Versions Affected        | v0.7.0                                   |
++--------------------------+------------------------------------------+
+
+spl_taskq_thread_priority
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+| ``spl_taskq_thread_priority`` allows newly created taskq threads to
+  set a non-default scheduler priority. When enabled, the priority
+  specified when a taskq is created will be applied to all threads
+  created by that taskq.
+| When disabled, all threads will use the default Linux kernel thread
+  priority.
+
++---------------------------+-----------------------------------------+
+| spl_taskq_thread_priority | Notes                                   |
++===========================+=========================================+
+| Tags                      | `CPU <#CPU>`__, `taskq <#taskq>`__      |
++---------------------------+-----------------------------------------+
+| Kernel module             | spl                                     |
++---------------------------+-----------------------------------------+
+| When to change            | when troubleshooting CPU                |
+|                           | scheduling-related performance issues   |
++---------------------------+-----------------------------------------+
+| Data Type                 | boolean                                 |
++---------------------------+-----------------------------------------+
+| Range                     | 0=taskq threads use the default Linux   |
+|                           | kernel thread priority, 1=taskq threads |
+|                           | use the priority specified when the     |
+|                           | taskq was created                       |
++---------------------------+-----------------------------------------+
+| Default                   | 1                                       |
++---------------------------+-----------------------------------------+
+| Change                    | prior to loading spl kernel module      |
++---------------------------+-----------------------------------------+
+| Versions Affected         | v0.7.0                                  |
++---------------------------+-----------------------------------------+
+
+spl_taskq_thread_sequential
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``spl_taskq_thread_sequential`` is the number of items a taskq worker
+thread must handle without interruption before requesting a new worker
+thread be spawned. ``spl_taskq_thread_sequential`` controls how quickly
+taskqs ramp up the number of threads processing the queue. Because Linux
+thread creation and destruction are relatively inexpensive, a small
+default value has been selected. Thus threads are created aggressively,
+which is typically desirable. Increasing this value results in a slower
+thread creation rate, which may be preferable for some configurations.
+
+=========================== ==================================
+spl_taskq_thread_sequential Notes
+=========================== ==================================
+Tags                        `CPU <#CPU>`__, `taskq <#taskq>`__
+Kernel module               spl
+When to change              TBD
+Data Type                   int
+Units                       taskq items
+Range                       1 to MAX_INT
+Default                     4
+Change                      Dynamic
+Versions Affected           v0.7.0
+=========================== ==================================
+
+spl_kmem_cache_kmem_threads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``spl_kmem_cache_kmem_threads`` shows the current number of
+``spl_kmem_cache`` threads. This task queue is responsible for
+allocating new slabs for use by the kmem caches. For the majority of
+systems and workloads only a small number of threads are required.
+
++-----------------------------+---------------------------------------+
+| spl_kmem_cache_kmem_threads | Notes                                 |
++=============================+=======================================+
+| Tags                        | `CPU <#CPU>`__, `memory <#memory>`__  |
++-----------------------------+---------------------------------------+
+| Kernel module               | spl                                   |
++-----------------------------+---------------------------------------+
+| When to change              | read-only                             |
++-----------------------------+---------------------------------------+
+| Data Type                   | int                                   |
++-----------------------------+---------------------------------------+
+| Range                       | 1 to MAX_INT                          |
++-----------------------------+---------------------------------------+
+| Units                       | threads                               |
++-----------------------------+---------------------------------------+
+| Default                     | 4                                     |
++-----------------------------+---------------------------------------+
+| Change                      | read-only, can only be changed prior  |
+|                             | to spl module load                    |
++-----------------------------+---------------------------------------+
+| Versions Affected           | v0.7.0                                |
++-----------------------------+---------------------------------------+
+
+spl_kmem_cache_magazine_size
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``spl_kmem_cache_magazine_size`` sets the maximum size of the kmem cache
+magazines. Cache magazines are an optimization designed to minimize the
+cost of allocating memory. They do this by keeping a per-cpu cache of
+recently freed objects, which can then be reallocated without taking a
+lock. This can improve performance on highly contended caches. However,
+because objects in magazines will prevent otherwise empty slabs from
+being immediately released this may not be ideal for low memory
+machines.
+
+For this reason spl_kmem_cache_magazine_size can be used to set a
+maximum magazine size. When this value is set to 0 the magazine size
+will be automatically determined based on the object size. Otherwise
+magazines will be limited to 2-256 objects per magazine (i.e. per CPU).
+Magazines cannot be disabled entirely in this implementation.
+
++------------------------------+--------------------------------------+
+| spl_kmem_cache_magazine_size | Notes                                |
++==============================+======================================+
+| Tags                         | `CPU <#CPU>`__, `memory <#memory>`__ |
++------------------------------+--------------------------------------+
+| Kernel module                | spl                                  |
++------------------------------+--------------------------------------+
+| When to change               |                                      |
++------------------------------+--------------------------------------+
+| Data Type                    | int                                  |
++------------------------------+--------------------------------------+
+| Units                        | objects                              |
++------------------------------+--------------------------------------+
+| Range                        | 0=automatically scale magazine size, |
+|                              | otherwise 2 to 256                   |
++------------------------------+--------------------------------------+
+| Default                      | 0                                    |
++------------------------------+--------------------------------------+
+| Change                       | read-only, can only be changed prior |
+|                              | to spl module load                   |
++------------------------------+--------------------------------------+
+| Versions Affected            | v0.7.0                               |
++------------------------------+--------------------------------------+
diff --git a/docs/ZIO-Scheduler.rst b/docs/ZIO-Scheduler.rst
new file mode 100644
index 0000000..4ffa9d8
--- /dev/null
+++ b/docs/ZIO-Scheduler.rst
@@ -0,0 +1,98 @@
+ZFS I/O (ZIO) Scheduler
+=======================
+
+ZFS issues I/O operations to leaf vdevs (usually devices) to satisfy and
+complete I/Os.
The ZIO scheduler determines when and in what order those
+operations are issued. Operations are grouped into five I/O classes,
+prioritized in the following order:
+
++----------+-------------+-------------------------------------------+
+| Priority | I/O Class   | Description                               |
++==========+=============+===========================================+
+| highest  | sync read   | most reads                                |
++----------+-------------+-------------------------------------------+
+|          | sync write  | as defined by application or via 'zfs'    |
+|          |             | 'sync' property                           |
++----------+-------------+-------------------------------------------+
+|          | async read  | prefetch reads                            |
++----------+-------------+-------------------------------------------+
+|          | async write | most writes                               |
++----------+-------------+-------------------------------------------+
+| lowest   | scrub read  | scan read: includes both scrub and        |
+|          |             | resilver                                  |
++----------+-------------+-------------------------------------------+
+
+Each queue defines the minimum and maximum number of concurrent
+operations issued to the device. In addition, the device has an
+aggregate maximum, zfs_vdev_max_active. Note that the sum of the
+per-queue minimums must not exceed the aggregate maximum. If the sum of
+the per-queue maximums exceeds the aggregate maximum, then the number of
+active I/Os may reach zfs_vdev_max_active, in which case no further I/Os
+are issued regardless of whether all per-queue minimums have been met.
+
+=========== =============================== ===============================
+I/O Class   Min Active Parameter            Max Active Parameter
+=========== =============================== ===============================
+sync read   zfs_vdev_sync_read_min_active   zfs_vdev_sync_read_max_active
+sync write  zfs_vdev_sync_write_min_active  zfs_vdev_sync_write_max_active
+async read  zfs_vdev_async_read_min_active  zfs_vdev_async_read_max_active
+async write zfs_vdev_async_write_min_active zfs_vdev_async_write_max_active
+scrub read  zfs_vdev_scrub_min_active       zfs_vdev_scrub_max_active
+=========== =============================== ===============================
+
+For many physical devices, throughput increases with the number of
+concurrent operations, but latency typically suffers. Further, physical
+devices typically have a limit at which more concurrent operations have
+no effect on throughput or can actually cause performance to decrease.
+
+The ZIO scheduler selects the next operation to issue by first looking
+for an I/O class whose minimum has not been satisfied. Once all are
+satisfied and the aggregate maximum has not been hit, the scheduler
+looks for classes whose maximum has not been satisfied. Iteration
+through the I/O classes is done in the order specified above. No further
+operations are issued if the aggregate maximum number of concurrent
+operations has been hit or if there are no operations queued for an I/O
+class that has not hit its maximum. Every time an I/O is queued or an
+operation completes, the I/O scheduler looks for new operations to
+issue.
+
+In general, smaller max_active's will lead to lower latency of
+synchronous operations.
Larger max_active's may lead to higher overall +throughput, depending on underlying storage and the I/O mix. + +The ratio of the queues' max_actives determines the balance of +performance between reads, writes, and scrubs. For example, when there +is contention, increasing zfs_vdev_scrub_max_active will cause the scrub +or resilver to complete more quickly, but reads and writes to have +higher latency and lower throughput. + +All I/O classes have a fixed maximum number of outstanding operations +except for the async write class. Asynchronous writes represent the data +that is committed to stable storage during the syncing stage for +transaction groups (txgs). Transaction groups enter the syncing state +periodically so the number of queued async writes quickly bursts up and +then reduce down to zero. The zfs_txg_timeout tunable (default=5 +seconds) sets the target interval for txg sync. Thus a burst of async +writes every 5 seconds is a normal ZFS I/O pattern. + +Rather than servicing I/Os as quickly as possible, the ZIO scheduler +changes the maximum number of active async write I/Os according to the +amount of dirty data in the pool. Since both throughput and latency +typically increase as the number of concurrent operations issued to +physical devices, reducing the burstiness in the number of concurrent +operations also stabilizes the response time of operations from other +queues. This is particular important for the sync read and write queues, +where the periodic async write bursts of the txg sync can lead to +device-level contention. In broad strokes, the ZIO scheduler issues more +concurrent operations from the async write queue as there's more dirty +data in the pool. diff --git a/docs/_Footer.rst b/docs/_Footer.rst new file mode 100644 index 0000000..48aae1d --- /dev/null +++ b/docs/_Footer.rst @@ -0,0 +1,5 @@ +[[Home]] / [[Project and Community]] / [[Developer Resources]] / +[[License]] |Creative Commons License| + +.. |Creative Commons License| image:: https://i.creativecommons.org/l/by-sa/3.0/80x15.png + :target: http://creativecommons.org/licenses/by-sa/3.0/ diff --git a/docs/_Sidebar.rst b/docs/_Sidebar.rst new file mode 100644 index 0000000..af7b4b8 --- /dev/null +++ b/docs/_Sidebar.rst @@ -0,0 +1,49 @@ +- [[Home]] +- [[Getting Started]] + + - `ArchLinux `__ + - [[Debian]] + - [[Fedora]] + - `FreeBSD `__ + - `Gentoo `__ + - `openSUSE `__ + - [[RHEL and CentOS]] + - [[Ubuntu]] + +- [[Project and Community]] + + - [[Admin Documentation]] + - [[FAQ]] + - [[Mailing Lists]] + - `Releases `__ + - [[Signing Keys]] + - `Issue Tracker `__ + - `Roadmap `__ + +- [[Developer Resources]] + + - [[Custom Packages]] + - [[Building ZFS]] + - `Buildbot + Status `__ + - `Buildbot Issue + Tracking `__ + - `Buildbot + Options `__ + - `OpenZFS + Tracking `__ + - [[OpenZFS Patches]] + - [[OpenZFS Exceptions]] + - `OpenZFS + Documentation `__ + - [[Git and GitHub for beginners]] + +- Performance and Tuning + + - [[ZFS on Linux Module Parameters]] + - `ZFS Transaction Delay and Write + Throttle `__ + - [[ZIO Scheduler]] + - [[Checksums]] + - `Asynchronous + Writes `__ diff --git a/docs/dRAID-HOWTO.rst b/docs/dRAID-HOWTO.rst new file mode 100644 index 0000000..95c5b26 --- /dev/null +++ b/docs/dRAID-HOWTO.rst @@ -0,0 +1,411 @@ +Introduction +============ + +raidz vs draid +-------------- + +ZFS users are most likely very familiar with raidz already, so a +comparison with draid would help. The illustrations below are +simplified, but sufficient for the purpose of a comparison. 
For example, +31 drives can be configured as a zpool of 6 raidz1 vdevs and a hot +spare: |raidz1| + +As shown above, if drive 0 fails and is replaced by the hot spare, only +5 out of the 30 surviving drives will work to resilver: drives 1-4 read, +and drive 30 writes. + +The same 30 drives can be configured as 1 draid1 vdev of the same level +of redundancy (i.e. single parity, 1/4 parity ratio) and single spare +capacity: |draid1| + +The drives are shuffled in a way that, after drive 0 fails, all 30 +surviving drives will work together to restore the lost data/parity: + +- All 30 drives read, because unlike the raidz1 configuration shown + above, in the draid1 configuration the neighbor drives of the failed + drive 0 (i.e. drives in a same data+parity group) are not fixed. +- All 30 drives write, because now there is no dedicated spare drive. + Instead, spare blocks come from all drives. + +To summarize: + +- Normal application IO: draid and raidz are very similar. There's a + slight advantage in draid, since there's no dedicated spare drive + which is idle when not in use. +- Restore lost data/parity: for raidz, not all surviving drives will + work to rebuild, and in addition it's bounded by the write throughput + of a single replacement drive. For draid, the rebuild speed will + scale with the total number of drives because all surviving drives + will work to rebuild. + +The dRAID vdev must shuffle its child drives in a way that regardless of +which drive has failed, the rebuild IO (both read and write) will +distribute evenly among all surviving drives, so the rebuild speed will +scale. The exact mechanism used by the dRAID vdev driver is beyond the +scope of this simple introduction here. If interested, please refer to +the recommended readings in the next section. + +Recommended Reading +------------------- + +Parity declustering (the fancy term for shuffling drives) has been an +active research topic, and many papers have been published in this area. +The `Permutation Development Data +Layout `__ is a +good paper to begin. The dRAID vdev driver uses a shuffling algorithm +loosely based on the mechanism described in this paper. + +Using dRAID +=========== + +First get the code `here `__, +build zfs with *configure --enable-debug*, and install. Then load the +zfs kernel module with the following options which help dRAID rebuild +performance. + +- zfs_vdev_scrub_max_active=10 +- zfs_vdev_async_write_min_active=4 + +Create a dRAID vdev +------------------- + +Similar to raidz vdev a dRAID vdev can be created using the +``zpool create`` command: + +:: + + # zpool create draid[1,2,3][ + +Unlike raidz, additional options may be provided as part of the +``draid`` vdev type to specify an exact dRAID layout. When unspecific +reasonable defaults will be chosen. + +:: + + # zpool create draid[1,2,3][:g][:s][:d][:] + +- groups - Number of redundancy groups (default: 1 group per 12 vdevs) +- spares - Number of distributed hot spares (default: 1) +- data - Number of data devices per group (default: determined by + number of groups) +- iterations - Number of iterations to perform generating a valid dRAID + mapping (default 3). + +*Notes*: + +- The default values are not set in stone and may change. +- For the majority of common configurations we intend to provide + pre-computed balanced dRAID mappings. +- When *data* is specified then: (draid_children - spares) % (parity + + data) == 0, otherwise the pool creation will fail. 
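+
+For instance, the pool shown in the status output below could have been
+created with a command along these lines (hypothetical device names;
+double parity, 4 redundancy groups, 2 distributed spares, 53 child
+drives in total):
+
+::
+
+   # zpool create tank draid2:4g:2s \
+         /dev/disk/by-id/drive-00 /dev/disk/by-id/drive-01 ... /dev/disk/by-id/drive-52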
+ +Now the dRAID vdev is online and ready for IO: + +:: + + pool: tank + state: ONLINE + config: + + NAME STATE READ WRITE CKSUM + tank ONLINE 0 0 0 + draid2:4g:2s-0 ONLINE 0 0 0 + L0 ONLINE 0 0 0 + L1 ONLINE 0 0 0 + L2 ONLINE 0 0 0 + L3 ONLINE 0 0 0 + ... + L50 ONLINE 0 0 0 + L51 ONLINE 0 0 0 + L52 ONLINE 0 0 0 + spares + s0-draid2:4g:2s-0 AVAIL + s1-draid2:4g:2s-0 AVAIL + + errors: No known data errors + +There are two logical hot spare vdevs shown above at the bottom: + +- The names begin with a ``s-`` followed by the name of the parent + dRAID vdev. +- These hot spares are logical, made from reserved blocks on all the 53 + child drives of the dRAID vdev. +- Unlike traditional hot spares, the distributed spare can only replace + a drive in its parent dRAID vdev. + +The dRAID vdev behaves just like a raidz vdev of the same parity level. +You can do IO to/from it, scrub it, fail a child drive and it'd operate +in degraded mode. + +Rebuild to distributed spare +---------------------------- + +When there's a failed/offline child drive, the dRAID vdev supports a +completely new mechanism to reconstruct lost data/parity, in addition to +the resilver. First of all, resilver is still supported - if a failed +drive is replaced by another physical drive, the resilver process is +used to reconstruct lost data/parity to the new replacement drive, which +is the same as a resilver in a raidz vdev. + +But if a child drive is replaced with a distributed spare, a new process +called rebuild is used instead of resilver: + +:: + + # zpool offline tank sdo + # zpool replace tank sdo '%draid1-0-s0' + # zpool status + pool: tank + state: DEGRADED + status: One or more devices has been taken offline by the administrator. + Sufficient replicas exist for the pool to continue functioning in a + degraded state. + action: Online the device using 'zpool online' or replace the device with + 'zpool replace'. + scan: rebuilt 2.00G in 0h0m5s with 0 errors on Fri Feb 24 20:37:06 2017 + config: + + NAME STATE READ WRITE CKSUM + tank DEGRADED 0 0 0 + draid1-0 DEGRADED 0 0 0 + sdd ONLINE 0 0 0 + sde ONLINE 0 0 0 + sdf ONLINE 0 0 0 + sdg ONLINE 0 0 0 + sdh ONLINE 0 0 0 + sdu ONLINE 0 0 0 + sdj ONLINE 0 0 0 + sdv ONLINE 0 0 0 + sdl ONLINE 0 0 0 + sdm ONLINE 0 0 0 + sdn ONLINE 0 0 0 + spare-11 DEGRADED 0 0 0 + sdo OFFLINE 0 0 0 + %draid1-0-s0 ONLINE 0 0 0 + sdp ONLINE 0 0 0 + sdq ONLINE 0 0 0 + sdr ONLINE 0 0 0 + sds ONLINE 0 0 0 + sdt ONLINE 0 0 0 + spares + %draid1-0-s0 INUSE currently in use + %draid1-0-s1 AVAIL + +The scan status line of the *zpool status* output now says *"rebuilt"* +instead of *"resilvered"*, because the lost data/parity was rebuilt to +the distributed spare by a brand new process called *"rebuild"*. The +main differences from *resilver* are: + +- The rebuild process does not scan the whole block pointer tree. + Instead, it only scans the spacemap objects. +- The IO from rebuild is sequential, because it rebuilds metaslabs one + by one in sequential order. +- The rebuild process is not limited to block boundaries. For example, + if 10 64K blocks are allocated contiguously, then rebuild will fix + 640K at one time. So rebuild process will generate larger IOs than + resilver. +- For all the benefits above, there is one price to pay. The rebuild + process cannot verify block checksums, since it doesn't have block + pointers. +- Moreover, the rebuild process requires support from on-disk format, + and **only** works on draid and mirror vdevs. 
+
+Although the rebuild process creates larger IOs, the drives will not
+necessarily see large IO requests. The block device queue parameter
+*/sys/block/*/queue/max_sectors_kb* must be tuned accordingly. However,
+since the rebuild IO is already sequential, the benefits of enabling
+larger IO requests might be marginal.
+
+At this point, redundancy has been fully restored without adding any
+new drive to the pool. If another drive is offlined, the pool is still
+able to do IO:
+
+::
+
+   # zpool offline tank sdj
+   # zpool status
+    state: DEGRADED
+   status: One or more devices has been taken offline by the administrator.
+           Sufficient replicas exist for the pool to continue functioning in a
+           degraded state.
+   action: Online the device using 'zpool online' or replace the device with
+           'zpool replace'.
+     scan: rebuilt 2.00G in 0h0m5s with 0 errors on Fri Feb 24 20:37:06 2017
+   config:
+
+       NAME                 STATE     READ WRITE CKSUM
+       tank                 DEGRADED     0     0     0
+         draid1-0           DEGRADED     0     0     0
+           sdd              ONLINE       0     0     0
+           sde              ONLINE       0     0     0
+           sdf              ONLINE       0     0     0
+           sdg              ONLINE       0     0     0
+           sdh              ONLINE       0     0     0
+           sdu              ONLINE       0     0     0
+           sdj              OFFLINE      0     0     0
+           sdv              ONLINE       0     0     0
+           sdl              ONLINE       0     0     0
+           sdm              ONLINE       0     0     0
+           sdn              ONLINE       0     0     0
+           spare-11         DEGRADED     0     0     0
+             sdo            OFFLINE      0     0     0
+             %draid1-0-s0   ONLINE       0     0     0
+           sdp              ONLINE       0     0     0
+           sdq              ONLINE       0     0     0
+           sdr              ONLINE       0     0     0
+           sds              ONLINE       0     0     0
+           sdt              ONLINE       0     0     0
+       spares
+         %draid1-0-s0       INUSE     currently in use
+         %draid1-0-s1       AVAIL
+
+As shown above, the *draid1-0* vdev is still in *DEGRADED* mode even
+though two child drives have failed and it has only single parity.
+Since *%draid1-0-s1* is still *AVAIL*, full redundancy can be restored
+by replacing *sdj* with it, without adding a new drive to the pool:
+
+::
+
+   # zpool replace tank sdj '%draid1-0-s1'
+   # zpool status
+    state: DEGRADED
+   status: One or more devices has been taken offline by the administrator.
+           Sufficient replicas exist for the pool to continue functioning in a
+           degraded state.
+   action: Online the device using 'zpool online' or replace the device with
+           'zpool replace'.
+     scan: rebuilt 2.13G in 0h0m5s with 0 errors on Fri Feb 24 23:20:59 2017
+   config:
+
+       NAME                 STATE     READ WRITE CKSUM
+       tank                 DEGRADED     0     0     0
+         draid1-0           DEGRADED     0     0     0
+           sdd              ONLINE       0     0     0
+           sde              ONLINE       0     0     0
+           sdf              ONLINE       0     0     0
+           sdg              ONLINE       0     0     0
+           sdh              ONLINE       0     0     0
+           sdu              ONLINE       0     0     0
+           spare-6          DEGRADED     0     0     0
+             sdj            OFFLINE      0     0     0
+             %draid1-0-s1   ONLINE       0     0     0
+           sdv              ONLINE       0     0     0
+           sdl              ONLINE       0     0     0
+           sdm              ONLINE       0     0     0
+           sdn              ONLINE       0     0     0
+           spare-11         DEGRADED     0     0     0
+             sdo            OFFLINE      0     0     0
+             %draid1-0-s0   ONLINE       0     0     0
+           sdp              ONLINE       0     0     0
+           sdq              ONLINE       0     0     0
+           sdr              ONLINE       0     0     0
+           sds              ONLINE       0     0     0
+           sdt              ONLINE       0     0     0
+       spares
+         %draid1-0-s0       INUSE     currently in use
+         %draid1-0-s1       INUSE     currently in use
+
+Again, full redundancy has been restored without adding any new drive.
+If another drive fails, the pool will still be able to handle IO, but
+there would be no distributed spare left to rebuild to (both are in the
+*INUSE* state now). At this point, there is no urgency to add a new
+replacement drive because the pool can survive yet another drive
+failure.
+
+Rebuild for mirror vdev
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The sequential rebuild process also works for the mirror vdev, when a
+drive is attached to a mirror or when a mirror child vdev is replaced.
+
+By default, rebuild for mirror vdevs is turned off. It can be turned on
+using the zfs module option *spa_rebuild_mirror=1*.
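+
+For example, assuming the dRAID branch exposes this option through the
+usual zfs module parameter interface (a sketch; the sysfs path is an
+assumption and is not taken from the original text), it could be
+enabled at runtime with:
+
+::
+
+   # echo 1 > /sys/module/zfs/parameters/spa_rebuild_mirror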
+
+Rebuild throttling
+~~~~~~~~~~~~~~~~~~
+
+The rebuild process may delay a *zio* by *spa_vdev_scan_delay* if the
+draid vdev has seen any important IO in the recent *spa_vdev_scan_idle*
+period. But when a dRAID vdev has lost all redundancy, e.g. a draid2
+with 2 faulted child drives, the rebuild process will go full speed,
+ignoring *spa_vdev_scan_delay* and *spa_vdev_scan_idle* altogether,
+because the vdev is now in a critical state.
+
+After any delay, the rebuild zio is issued using priority
+*ZIO_PRIORITY_SCRUB* for reads and *ZIO_PRIORITY_ASYNC_WRITE* for
+writes. Therefore the options that control the queuing of these two IO
+priorities will affect rebuild *zio* as well, for example
+*zfs_vdev_scrub_min_active*, *zfs_vdev_scrub_max_active*,
+*zfs_vdev_async_write_min_active*, and
+*zfs_vdev_async_write_max_active*.
+
+Rebalance
+---------
+
+Distributed spare space can be made available again by simply replacing
+any failed drive with a new drive. This process is called *rebalance*,
+which is essentially a *resilver*:
+
+::
+
+   # zpool replace -f tank sdo sdw
+   # zpool status
+    state: DEGRADED
+   status: One or more devices has been taken offline by the administrator.
+           Sufficient replicas exist for the pool to continue functioning in a
+           degraded state.
+   action: Online the device using 'zpool online' or replace the device with
+           'zpool replace'.
+     scan: resilvered 2.21G in 0h0m58s with 0 errors on Fri Feb 24 23:31:45 2017
+   config:
+
+       NAME                 STATE     READ WRITE CKSUM
+       tank                 DEGRADED     0     0     0
+         draid1-0           DEGRADED     0     0     0
+           sdd              ONLINE       0     0     0
+           sde              ONLINE       0     0     0
+           sdf              ONLINE       0     0     0
+           sdg              ONLINE       0     0     0
+           sdh              ONLINE       0     0     0
+           sdu              ONLINE       0     0     0
+           spare-6          DEGRADED     0     0     0
+             sdj            OFFLINE      0     0     0
+             %draid1-0-s1   ONLINE       0     0     0
+           sdv              ONLINE       0     0     0
+           sdl              ONLINE       0     0     0
+           sdm              ONLINE       0     0     0
+           sdn              ONLINE       0     0     0
+           sdw              ONLINE       0     0     0
+           sdp              ONLINE       0     0     0
+           sdq              ONLINE       0     0     0
+           sdr              ONLINE       0     0     0
+           sds              ONLINE       0     0     0
+           sdt              ONLINE       0     0     0
+       spares
+         %draid1-0-s0       AVAIL
+         %draid1-0-s1       INUSE     currently in use
+
+Note that the scan status now says *"resilvered"*. Also, the state of
+*%draid1-0-s0* has become *AVAIL* again. Since the resilver process
+checks block checksums, it makes up for the lack of checksum
+verification during the previous rebuild.
+
+The dRAID1 vdev in this example shuffles three (4 data + 1 parity)
+redundancy groups across the 17 drives. For any single drive failure,
+only about 1/3 of the blocks are affected (and should be
+resilvered/rebuilt). The rebuild process is able to avoid unnecessary
+work, but the resilver process by default will not. The rebalance
+(which is essentially a resilver) can be sped up significantly by
+setting the module option *zfs_no_resilver_skip* to 0. This feature is
+turned off by default because of issue
+`https://github.com/zfsonlinux/zfs/issues/5806 `__.
+
+Troubleshooting
+===============
+
+Please report bugs to `the dRAID
+PR `__, as long as the
+code is not yet merged upstream.
+
+.. |raidz1| image:: https://cloud.githubusercontent.com/assets/6722662/23642396/9790e432-02b7-11e7-8198-ae9f17c61d85.png
+.. |draid1| image:: https://cloud.githubusercontent.com/assets/6722662/23642395/9783ef8e-02b7-11e7-8d7e-31d1053ee4ff.png
diff --git a/docs/hole_birth-FAQ.rst b/docs/hole_birth-FAQ.rst
new file mode 100644
index 0000000..0e00838
--- /dev/null
+++ b/docs/hole_birth-FAQ.rst
@@ -0,0 +1,62 @@
+Short explanation
+~~~~~~~~~~~~~~~~~
+
+The hole_birth feature has (or had) bugs, the result of which is that,
+if you do a ``zfs send -i`` (or ``-R``, since it uses ``-i``) from an
+affected dataset, the receiver will not see any checksum or other
+errors, but the resulting destination snapshot will not match the
+source.
+
+ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the
+faulty metadata which causes this issue *on the sender side*.
+
+FAQ
+~~~
+
+I have a pool with hole_birth enabled, how do I know if I am affected?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is technically possible to calculate whether you have any affected
+files, but it requires scraping zdb output for each file in each
+snapshot in each dataset, which is a combinatoric nightmare. (If you
+really want it, there is a proof of concept
+`here `__.)
+
+Is there any less painful way to fix this if we have already received an affected snapshot?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+No, the data you need was simply not present in the send stream,
+unfortunately, and cannot feasibly be rewritten in place.
+
+Long explanation
+~~~~~~~~~~~~~~~~
+
+hole_birth is a feature to speed up ``zfs send -i``: in particular, ZFS
+previously did not store metadata about when "holes" (sparse regions)
+in files were created, so every ``zfs send -i`` needed to include every
+hole.
+
+hole_birth, as the name implies, added tracking of the txg (transaction
+group) in which a hole was created, so that ``zfs send -i`` could send
+only the holes with a birth_time between (starting snapshot txg) and
+(ending snapshot txg), and life was wonderful.
+
+Unfortunately, hole_birth had a number of edge cases where it could
+"forget" to set the birth_time of holes, causing it to record the
+birth_time as 0 (the value used prior to hole_birth, and essentially
+equivalent to "since file creation").
+
+This meant that, when you did a ``zfs send -i``, since ``zfs send``
+does not have any knowledge of the surrounding snapshots when sending a
+given snapshot, it would see the creation txg as 0, conclude "oh, it is
+0, I must have already sent this before", and not include it.
+
+This means that the receiving side does not know those holes should
+exist, and does not create them. This leads to differences between the
+source and the destination.
+
+ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring this
+metadata and always sending holes with birth_time 0, configurable using
+the tunable known as ``ignore_hole_birth`` or
+``send_holes_without_birth_time``. The latter is what OpenZFS
+standardized on. ZoL version 0.6.5.8 only has the former, but for any
+ZoL version with ``send_holes_without_birth_time``, they point to the
+same value, so changing either will work.
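+
+As an illustration, on ZoL releases that provide
+``send_holes_without_birth_time``, the tunable is exposed as a zfs
+module parameter and can be inspected at runtime (a sketch assuming the
+usual Linux sysfs path; a value of 1 corresponds to the default
+"always send holes with birth_time 0" behavior described above):
+
+::
+
+   # cat /sys/module/zfs/parameters/send_holes_without_birth_time
+   1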