Update dRAID documentation (#78)
Refresh the dRAID documentation page to accurately reflect the implementation of dRAID which has been merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

dRAID
=====

.. note::
   This page describes functionality which has been merged to the
   master branch but is not in the OpenZFS 2.0 release. In order to
   use dRAID you'll need to check out the latest source and build
   `custom packages`_ to install.
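
   As a rough sketch (the exact steps vary by distribution; see the
   `custom packages`_ page for the authoritative instructions), building
   the latest source generally looks like::

      # Clone the OpenZFS repository and build the master branch.
      git clone https://github.com/openzfs/zfs.git
      cd zfs
      sh autogen.sh
      ./configure
      make -s -j$(nproc)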

Introduction
~~~~~~~~~~~~

`dRAID`_ is a variant of raidz that provides integrated distributed hot
spares which allows for faster resilvering while retaining the benefits
of raidz. A dRAID vdev is constructed from multiple internal raidz
groups, each with D data devices and P parity devices. These groups
are distributed over all of the children in order to fully utilize the
available disk performance. This is known as parity declustering and
it has been an active area of research. The image below is simplified,
but it helps illustrate this key difference between dRAID and raidz.

|draid1|

Additionally, a dRAID vdev must shuffle its child vdevs in such a way
that regardless of which drive has failed, the rebuild IO (both read
and write) will distribute evenly among all surviving drives. This
is accomplished by using carefully chosen precomputed permutation
maps. This has the advantage of both keeping pool creation fast and
making it impossible for the mapping to be damaged or lost.

Another way dRAID differs from raidz is that it uses a fixed stripe
width (padding as necessary with zeros). This allows a dRAID vdev to
be sequentially resilvered, however the fixed stripe width significantly
affects both usable capacity and IOPS. For example, with the default
D=8 and 4k disk sectors the minimum allocation size is 32k. If using
compression, this relatively large allocation size can reduce the
effective compression ratio. When using ZFS volumes and dRAID the
default volblocksize property is increased to account for the allocation
size. If a dRAID pool will hold a significant amount of small blocks,
it is recommended to also add a mirrored special vdev to store those
blocks.
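
As a sketch of that recommendation (the pool name ``tank`` and the devices
``sdx`` and ``sdy`` are hypothetical), a mirrored special vdev can be added
and small blocks directed to it with the ``special_small_blocks`` property:

::

  # Add a mirrored special allocation class vdev for metadata and,
  # optionally, small file blocks.
  # zpool add tank special mirror sdx sdy

  # Store blocks of 32k and smaller on the special vdev rather than
  # padding them out to the full dRAID stripe width.
  # zfs set special_small_blocks=32K tank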

In regards to IO/s, performance is similar to raidz since for any
read all D data disks must be accessed.  Delivered random IOPS can be
reasonably approximated as floor((N-S)/(D+P)) * <single-drive-IOPS>.
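
As a purely illustrative example, a 90 drive draid2 vdev with 2 distributed
spares and 8 data devices per group (N=90, S=2, D=8, P=2) yields
floor((90-2)/(8+2)) = 8 redundancy groups, so with drives capable of roughly
250 random IOPS each the vdev can be expected to deliver on the order of
8 * 250 = 2000 random IOPS:

::

  # Shell arithmetic mirrors the formula; integer division acts as floor().
  # echo $(( (90 - 2) / (8 + 2) * 250 ))
  2000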

In summary, dRAID can provide the same level of redundancy and
performance as raidz, while also providing a fast integrated distributed
spare.

Create a dRAID vdev
~~~~~~~~~~~~~~~~~~~

A dRAID vdev is created like any other by using the ``zpool create``
command and enumerating the disks which should be used.

::

  # zpool create <pool> draid[1,2,3] <vdevs...>

Like raidz, the parity level is specified immediately after the ``draid``
vdev type. However, unlike raidz, additional colon separated options can be
specified. The most important of these is the ``:<spares>s`` option, which
controls the number of distributed hot spares to create. By default, no
spares are created. The ``:<data>d`` option can be specified to set the
number of data devices to use in each RAID stripe (D+P). When unspecified
reasonable defaults are chosen.

::

  # zpool create <pool> draid[<parity>][:<data>d][:<children>c][:<spares>s] <vdevs...>

- **parity** - The parity level (1-3).

- **data** - The number of data devices per redundancy group. In general
  a smaller value of D will increase IOPS, improve the compression ratio,
  and speed up resilvering at the expense of total usable capacity.
  Defaults to 8, unless N-P-S is less than 8.

- **children** - The expected number of children. Useful as a cross-check
  when listing a large number of devices. An error is returned when the
  provided number of children differs.

- **spares** - The number of distributed hot spares. Defaults to zero.

For example, to create an 11 disk dRAID pool with 4+1 redundancy and a
single distributed spare the command would be:

::

  # zpool create tank draid:4d:1s:11c /dev/sd[a-k]
  # zpool status tank

    pool: tank
   state: ONLINE
  config:

          NAME                  STATE     READ WRITE CKSUM
          tank                  ONLINE       0     0     0
            draid1:4d:11c:1s-0  ONLINE       0     0     0
              sda               ONLINE       0     0     0
              sdb               ONLINE       0     0     0
              sdc               ONLINE       0     0     0
              sdd               ONLINE       0     0     0
              sde               ONLINE       0     0     0
              sdf               ONLINE       0     0     0
              sdg               ONLINE       0     0     0
              sdh               ONLINE       0     0     0
              sdi               ONLINE       0     0     0
              sdj               ONLINE       0     0     0
              sdk               ONLINE       0     0     0
          spares
            draid1-0-0          AVAIL

Note that the dRAID vdev name, ``draid1:4d:11c:1s``, fully describes the
configuration and that all of the disks which are part of the dRAID are
listed.  Furthermore, the logical distributed hot spare is shown as an
available spare disk.
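
As a further sketch (the device names and counts are hypothetical), a larger
pool with double parity, 8 data devices per group, and two distributed
spares could be created from 24 disks with:

::

  # Double parity, 8 data devices per group, 24 children, 2 spares.
  # zpool create tank draid2:8d:24c:2s /dev/sd[a-x]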

Rebuilding to a Distributed Spare
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the major advantages of dRAID is that it supports both sequential
and traditional healing resilvers.  When performing a sequential resilver
to a distributed hot spare the performance scales with the number of disks
divided by the stripe width (D+P).  This can greatly reduce resilver times
and restore full redundancy in a fraction of the usual time.  For example,
the following graph shows the observed sequential resilver time in hours
for a 90 HDD based dRAID filled to 90% capacity.

|draid-resilver|

When using dRAID and a distributed spare, the process for handling a
failed disk is almost identical to raidz with a traditional hot spare.
When a disk failure is detected the ZFS Event Daemon (ZED) will start
rebuilding to a spare if one is available.  The only difference is that
for dRAID a sequential resilver is started, while a healing resilver must
be used for raidz.

::

  # echo offline >/sys/block/sdg/device/state
  # zpool replace -s tank sdg draid1-0-0
  # zpool status

    pool: tank
   state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
          continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
    scan: resilver (draid1:4d:11c:1s-0) in progress since Tue Nov 24 14:34:25 2020
          3.51T scanned at 13.4G/s, 1.59T issued at 6.07G/s, 6.13T total
          326G resilvered, 57.17% done, 00:03:21 to go
  config:

          NAME                   STATE     READ WRITE CKSUM
          tank                   DEGRADED     0     0     0
            draid1:4d:11c:1s-0   DEGRADED     0     0     0
              sda                ONLINE       0     0     0  (resilvering)
              sdb                ONLINE       0     0     0  (resilvering)
              sdc                ONLINE       0     0     0  (resilvering)
              sdd                ONLINE       0     0     0  (resilvering)
              sde                ONLINE       0     0     0  (resilvering)
              sdf                ONLINE       0     0     0  (resilvering)
              spare-6            DEGRADED     0     0     0
                sdg              UNAVAIL      0     0     0
                draid1-0-0       ONLINE       0     0     0  (resilvering)
              sdh                ONLINE       0     0     0  (resilvering)
              sdi                ONLINE       0     0     0  (resilvering)
              sdj                ONLINE       0     0     0  (resilvering)
              sdk                ONLINE       0     0     0  (resilvering)
          spares
            draid1-0-0           INUSE     currently in use

While both types of resilvering achieve the same goal it's worth taking
a moment to summarize the key differences.

- A traditional healing resilver scans the entire block tree.  This
  means the checksum for each block is available while it's being
  repaired and can be immediately verified.  The downside is that this
  creates a random read workload which is not ideal for performance.

- A sequential resilver instead scans the space maps in order to
  determine what space is allocated and what must be repaired.
  This rebuild process is not limited to block boundaries and can
  sequentially read from the disks and make repairs using larger
  I/Os.  The price to pay for this performance improvement is that
  the block checksums cannot be verified while resilvering.  Therefore,
  a scrub is started to verify the checksums after the sequential
  resilver completes.

For a more in depth explanation of the differences between sequential
and healing resilvering check out these `sequential resilver`_ slides
which were presented at the OpenZFS Developer Summit.
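
As a small usage sketch (the pool name ``tank`` is hypothetical), the
``zpool wait`` command can be used to block until the sequential resilver
finishes, after which ``zpool status`` should report the automatically
started verification scrub on its ``scan:`` line:

::

  # Wait for the sequential resilver to complete, then check the scan line.
  # zpool wait -t resilver tank
  # zpool status tank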

Rebalancing
~~~~~~~~~~~

Distributed spare space can be made available again by simply replacing
any failed drive with a new drive.  This process is called rebalancing
and is essentially a resilver.  When performing rebalancing a healing
resilver is recommended since the pool is no longer degraded.  This
ensures all checksums are verified when rebuilding to the new disk
and eliminates the need to perform a subsequent scrub of the pool.

::

  # zpool replace tank sdg sdl
  # zpool status

    pool: tank
   state: DEGRADED
  status: One or more devices is currently being resilvered.  The pool will
          continue to function, possibly in a degraded state.
  action: Wait for the resilver to complete.
    scan: resilver in progress since Tue Nov 24 14:45:16 2020
          6.13T scanned at 7.82G/s, 6.10T issued at 7.78G/s, 6.13T total
          565G resilvered, 99.44% done, 00:00:04 to go
  config:

          NAME                   STATE     READ WRITE CKSUM
          tank                   DEGRADED     0     0     0
            draid1:4d:11c:1s-0   DEGRADED     0     0     0
              sda                ONLINE       0     0     0  (resilvering)
              sdb                ONLINE       0     0     0  (resilvering)
              sdc                ONLINE       0     0     0  (resilvering)
              sdd                ONLINE       0     0     0  (resilvering)
              sde                ONLINE       0     0     0  (resilvering)
              sdf                ONLINE       0     0     0  (resilvering)
              spare-6            DEGRADED     0     0     0
                replacing-0      DEGRADED     0     0     0
                  sdg            UNAVAIL      0     0     0
                  sdl            ONLINE       0     0     0  (resilvering)
                draid1-0-0       ONLINE       0     0     0  (resilvering)
              sdh                ONLINE       0     0     0  (resilvering)
              sdi                ONLINE       0     0     0  (resilvering)
              sdj                ONLINE       0     0     0  (resilvering)
              sdk                ONLINE       0     0     0  (resilvering)
          spares
            draid1-0-0           INUSE     currently in use

After the resilvering completes the distributed hot spare is once again
available for use and the pool has been restored to its normal healthy
state.
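
Once the rebalancing resilver finishes, ``zpool status`` should again list
the distributed spare as ``AVAIL``, and the pool's overall health can be
quickly confirmed with the ``-x`` option (shown here for the hypothetical
pool ``tank``):

::

  # zpool status -x tank
  pool 'tank' is healthy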

.. |draid1| image:: /_static/img/raidz_draid.png
.. |draid-resilver| image:: /_static/img/draid-resilver-hours.png
.. _dRAID: https://docs.google.com/presentation/d/1uo0nBfY84HIhEqGWEx-Tbm8fPbJKtIP3ICo4toOPcJo/edit
.. _sequential resilver: https://docs.google.com/presentation/d/1vLsgQ1MaHlifw40C9R2sPsSiHiQpxglxMbK2SMthu0Q/edit#slide=id.g995720a6cf_1_39
.. _custom packages: https://openzfs.github.io/openzfs-docs/Developer%20Resources/Custom%20Packages.html#