From 5643c66781a079de4a4eb8437f437020faa4f16f Mon Sep 17 00:00:00 2001
From: George Melikov
Date: Mon, 31 Oct 2022 01:01:17 +0300
Subject: [PATCH] Add basic RAIDZ intro

Signed-off-by: George Melikov
---
 docs/Basic Concepts/RAIDZ.rst | 89 +++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)
 create mode 100644 docs/Basic Concepts/RAIDZ.rst

diff --git a/docs/Basic Concepts/RAIDZ.rst b/docs/Basic Concepts/RAIDZ.rst
new file mode 100644
index 0000000..ede5df0
--- /dev/null
+++ b/docs/Basic Concepts/RAIDZ.rst
@@ -0,0 +1,89 @@
+RAIDZ
+=====
+
+tl;dr: RAIDZ is effective for large block sizes and sequential workloads.
+
+Introduction
+~~~~~~~~~~~~
+
+RAIDZ is a variation on RAID-5 that allows for better distribution of parity
+and eliminates the RAID-5 “write hole” (in which data and parity become
+inconsistent after a power loss).
+Data and parity are striped across all disks within a raidz group.
+
+A raidz group can have single, double, or triple parity, meaning that the raidz
+group can sustain one, two, or three failures, respectively, without losing any
+data. The ``raidz1`` vdev type specifies a single-parity raidz group; the ``raidz2``
+vdev type specifies a double-parity raidz group; and the ``raidz3`` vdev type
+specifies a triple-parity raidz group. The ``raidz`` vdev type is an alias for
+raidz1.
+
+A raidz group with N disks of size X with P parity disks can hold
+approximately (N-P)*X bytes and can withstand P devices failing without
+losing data. The minimum number of devices in a raidz group is one more
+than the number of parity disks. The recommended number is between 3 and 9
+to help increase performance.
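As a rough illustration of the (N-P)*X estimate above, here is a minimal Python sketch. The function name ``raidz_usable_bytes`` and its parameters are hypothetical and are not part of any ZFS tool; the result is only the approximate figure described in the text, not what a real pool will report.

```python
def raidz_usable_bytes(disks: int, parity: int, disk_size_bytes: int) -> int:
    """Approximate usable capacity of one raidz group: (N - P) * X.

    Hypothetical helper for illustration only.
    """
    if parity not in (1, 2, 3):
        raise ValueError("raidz supports single, double, or triple parity")
    # The minimum number of devices is one more than the number of parity disks.
    if disks < parity + 1:
        raise ValueError("need at least parity + 1 disks")
    return (disks - parity) * disk_size_bytes


if __name__ == "__main__":
    # raidz2 of 5 disks, 4 TB each: roughly 3 disks' worth of data.
    print(raidz_usable_bytes(5, 2, 4 * 10**12))
```

For a raidz2 group of five 4 TB disks this gives about 12 TB, and any two disks can fail without data loss.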
+
+
+Space efficiency
+~~~~~~~~~~~~~~~~
+
+Actual used space for a block in RAIDZ is based on several points:
+
+- minimal write size is the disk sector size (can be set via the ``ashift`` vdev parameter)
+
+- stripe width in RAIDZ is dynamic: a stripe holds at least one data block part, and up to
+  ``disks count`` minus ``parity number`` parts of a data block
+
+- one block of data with size of ``recordsize`` is
+  split equally into ``sector size`` parts
+  and written across the stripes of the RAIDZ vdev
+- each stripe of data will have a part of the block
+
+- in addition to the data, one, two or three blocks of parity are written,
+  one per disk; so, for raidz2 of 5 disks there will be 3 blocks of data and
+  2 blocks of parity
+
+Due to these inputs, if ``recordsize`` is less than or equal to the sector size,
+then RAIDZ's parity size will be effectively equal to a mirror with the same redundancy.
+For example, for raidz1 of 3 disks with ``ashift=12`` and ``recordsize=4K``
+we will allocate on disk:
+
+- one 4K block of data
+
+- one 4K parity block
+
+so the usable space ratio will be 50%, the same as with a two-disk mirror.
+
+
+Another example for ``ashift=12`` and ``recordsize=128K`` for raidz1 of 3 disks:
+
+- total stripe width is 3
+
+- one stripe can have up to 2 data parts of 4K size because of 1 parity block
+
+- we will have 128K/8K = 16 stripes with 8K of data and 4K of parity each
+
+so the usable space ratio in this case will be about 66%.
+
+
+If a RAIDZ vdev has more disks, its stripe width will be larger, and its space
+efficiency better too.
+
+You can find actual parity cost per RAIDZ size here:
+
+.. raw:: html
+
+
+
+(`source `__)
+
+
+Performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Write
+^^^^^
+
+Because of the full stripe width, writing one block writes a stripe part to each disk.
+As a result, in the worst case one RAIDZ vdev has the write IOPS of its single slowest disk.
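The two worked examples in the Space efficiency section can be reproduced with a small Python sketch. It follows only the simplified rules listed there (sector-sized parts, up to ``disks - parity`` data parts per stripe, ``parity`` parity sectors per stripe). The function name ``raidz_allocation`` is hypothetical, and the model deliberately ignores anything the text does not describe, such as compression or any extra allocation padding, so treat its output as an estimate.

```python
def raidz_allocation(disks: int, parity: int, ashift: int, recordsize: int) -> dict:
    """Estimate sectors allocated for one block under the simplified model.

    Hypothetical helper: not an OpenZFS API, and not the exact on-disk
    allocation logic.
    """
    sector = 1 << ashift                      # ashift=12 -> 4096-byte sectors
    data_per_stripe = disks - parity          # max data parts in one stripe
    if data_per_stripe < 1:
        raise ValueError("need at least parity + 1 disks")

    # A block never takes less than one sector; round up to whole sectors.
    data_sectors = max(1, -(-recordsize // sector))
    # Each (possibly partial) stripe carries `parity` parity sectors.
    stripes = -(-data_sectors // data_per_stripe)
    parity_sectors = stripes * parity
    total_sectors = data_sectors + parity_sectors

    return {
        "stripes": stripes,
        "data_sectors": data_sectors,
        "parity_sectors": parity_sectors,
        "usable_ratio": data_sectors / total_sectors,
    }


if __name__ == "__main__":
    # raidz1 of 3 disks, ashift=12, recordsize=4K
    print(raidz_allocation(3, 1, 12, 4 * 1024))
    # raidz1 of 3 disks, ashift=12, recordsize=128K
    print(raidz_allocation(3, 1, 12, 128 * 1024))
```

Under these assumptions the 4K case yields a usable ratio of 0.5 and the 128K case a ratio of about 0.67, matching the 50% and 66% figures given in the text.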