Rename Basics concepts

This eliminates the typo and uses title case.

Signed-off-by: Richard Laager <rlaager@wiktel.com>

docs/Basic Concepts/Checksums.rst (new file)

Checksums and Their Use in ZFS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

End-to-end checksums are a key feature of ZFS and an important
differentiator for ZFS over other RAID implementations and filesystems.
Advantages of end-to-end checksums include:

- data corruption is detected upon reading from media
- blocks that are detected as corrupt are automatically repaired if
  possible, by using the RAID protection in suitably configured pools,
  or redundant copies (see the zfs ``copies`` property)
- periodic scrubs can check data to detect and repair latent media
  degradation (bit rot) and corruption from other sources
- checksums on ZFS replication streams, ``zfs send`` and
  ``zfs receive``, ensure the data received is not corrupted by
  intervening storage or transport mechanisms

Checksum Algorithms
^^^^^^^^^^^^^^^^^^^

The checksum algorithms in ZFS can be changed for datasets (filesystems
or volumes). The checksum algorithm used for each block is stored in the
block pointer (metadata). The block checksum is calculated when the
block is written, so changing the algorithm only affects writes
occurring after the change.

The checksum algorithm for a dataset can be changed by setting the
``checksum`` property:

.. code:: bash

   zfs set checksum=sha256 pool_name/dataset_name
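
The value in effect (including inherited values) can be confirmed
afterwards; a quick check, assuming the pool and dataset names above:

.. code:: bash

   zfs get checksum pool_name/dataset_name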

+-----------+--------------+------------------------+-------------------------+
| Checksum  | Ok for dedup | Compatible with        | Notes                   |
|           | and nopwrite?| other ZFS              |                         |
|           |              | implementations?       |                         |
+===========+==============+========================+=========================+
| on        | see notes    | yes                    | ``on`` is a shorthand   |
|           |              |                        | for ``fletcher4`` for   |
|           |              |                        | non-deduped datasets    |
|           |              |                        | and ``sha256`` for      |
|           |              |                        | deduped datasets        |
+-----------+--------------+------------------------+-------------------------+
| off       | no           | yes                    | Do not use ``off``      |
+-----------+--------------+------------------------+-------------------------+
| fletcher2 | no           | yes                    | Deprecated              |
|           |              |                        | implementation of the   |
|           |              |                        | Fletcher checksum; use  |
|           |              |                        | ``fletcher4`` instead   |
+-----------+--------------+------------------------+-------------------------+
| fletcher4 | no           | yes                    | Fletcher algorithm,     |
|           |              |                        | also used for           |
|           |              |                        | ``zfs send`` streams    |
+-----------+--------------+------------------------+-------------------------+
| sha256    | yes          | yes                    | Default for deduped     |
|           |              |                        | datasets                |
+-----------+--------------+------------------------+-------------------------+
| noparity  | no           | yes                    | Do not use              |
|           |              |                        | ``noparity``            |
+-----------+--------------+------------------------+-------------------------+
| sha512    | yes          | requires pool feature  | salted ``sha512``       |
|           |              | ``org.illumos:sha512`` | currently not supported |
|           |              |                        | for any filesystem on   |
|           |              |                        | the boot pools          |
+-----------+--------------+------------------------+-------------------------+
| skein     | yes          | requires pool feature  | salted ``skein``        |
|           |              | ``org.illumos:skein``  | currently not supported |
|           |              |                        | for any filesystem on   |
|           |              |                        | the boot pools          |
+-----------+--------------+------------------------+-------------------------+
| edonr     | yes          | requires pool feature  | salted ``edonr``        |
|           |              | ``org.illumos:edonr``  | currently not supported |
|           |              |                        | for any filesystem on   |
|           |              |                        | the boot pools          |
+-----------+--------------+------------------------+-------------------------+
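
The last three algorithms depend on the pool features listed above. As a
sketch (assuming the running ZFS version supports the feature), a feature
can be enabled on an existing pool before switching a dataset to the
corresponding checksum:

.. code:: bash

   # enable the pool feature, then switch the dataset's checksum
   zpool set feature@sha512=enabled pool_name
   zfs set checksum=sha512 pool_name/dataset_name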

Checksum Accelerators
^^^^^^^^^^^^^^^^^^^^^

ZFS has the ability to offload checksum operations to Intel
QuickAssist Technology (QAT) adapters.

Checksum Microbenchmarks
^^^^^^^^^^^^^^^^^^^^^^^^

Some ZFS features use microbenchmarks when the ``zfs.ko`` kernel module
is loaded to determine the optimal algorithm for checksums. The results
of the microbenchmarks are observable in the ``/proc/spl/kstat/zfs``
directory. The winning algorithm is reported as the "fastest" and
becomes the default. The default can be overridden by setting ``zfs``
module parameters.

========= ==================================== ========================
Checksum  Results Filename                     ``zfs`` module parameter
========= ==================================== ========================
Fletcher4 /proc/spl/kstat/zfs/fletcher_4_bench zfs_fletcher_4_impl
========= ==================================== ========================
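
For example, the Fletcher4 results can be inspected, and the selected
implementation overridden at runtime (the available implementation names
depend on the CPU):

.. code:: bash

   # show the benchmark results; the winner is reported as "fastest"
   cat /proc/spl/kstat/zfs/fletcher_4_bench

   # show the current selection and the available implementations
   cat /sys/module/zfs/parameters/zfs_fletcher_4_impl

   # override the default (e.g. force the scalar implementation)
   echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl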

Disabling Checksums
^^^^^^^^^^^^^^^^^^^

While it may be tempting to disable checksums to improve CPU
performance, it is widely considered by the ZFS community to be an
extraordinarily bad idea. Don't disable checksums.

docs/Basic Concepts/Troubleshooting.rst (new file)

Troubleshooting
===============

.. todo::
   This page is a draft.

This page contains tips for troubleshooting ZFS on Linux and what
information developers might want for bug triage.

- `About Log Files <#about-log-files>`__

  - `Generic Kernel Log <#generic-kernel-log>`__
  - `ZFS Kernel Module Debug
    Messages <#zfs-kernel-module-debug-messages>`__

- `Unkillable Process <#unkillable-process>`__
- `ZFS Events <#zfs-events>`__

--------------

About Log Files
---------------

Log files can be very useful for troubleshooting. In some cases,
interesting information is stored in multiple log files that are
correlated to system events.

Pro tip: logging infrastructure tools like *elasticsearch*, *fluentd*,
*influxdb*, or *splunk* can simplify log analysis and event correlation.

Generic Kernel Log
~~~~~~~~~~~~~~~~~~

Typically, Linux kernel log messages are available from ``dmesg -T``,
``/var/log/syslog``, or wherever kernel log messages are sent (e.g. by
``rsyslogd``).
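
For example, ZFS- and SPL-related kernel messages can be pulled out of
the kernel log like this (a sketch; the exact log location depends on
the distribution):

.. code:: bash

   # kernel ring buffer, with human-readable timestamps
   dmesg -T | grep -iE 'zfs|spl'

   # on systemd-based systems, the kernel log via the journal
   journalctl -k | grep -iE 'zfs|spl'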

ZFS Kernel Module Debug Messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ZFS kernel modules use an internal log buffer for detailed logging
information. This log information is available in the pseudo file
``/proc/spl/kstat/zfs/dbgmsg`` for ZFS builds where the ZFS module
parameter `zfs_dbgmsg_enable =
1 <https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_dbgmsg_enable>`__.
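
A minimal sketch of enabling and reading the debug buffer at runtime
(root required; the parameter can also be set at module load time):

.. code:: bash

   # enable detailed debug messages
   echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable

   # read the debug buffer
   cat /proc/spl/kstat/zfs/dbgmsg

   # clear the buffer
   echo 0 > /proc/spl/kstat/zfs/dbgmsg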

--------------

Unkillable Process
------------------

Symptom: a ``zfs`` or ``zpool`` command appears hung, does not return,
and is not killable

Likely cause: kernel thread hung or panic

Log files of interest: `Generic Kernel Log <#generic-kernel-log>`__,
`ZFS Kernel Module Debug Messages <#zfs-kernel-module-debug-messages>`__

Important information: if a kernel thread is stuck, then a backtrace of
the stuck thread can be in the logs. In some cases, the stuck thread is
not logged until the deadman timer expires. See also `debug
tunables <https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#debug>`__.
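
When reporting such an issue, stack traces of the stuck processes are
very helpful. A sketch of collecting them on a typical Linux system
(requires root; the sysrq trigger dumps all blocked tasks to the kernel
log):

.. code:: bash

   # stack of a specific stuck process, e.g. the hung zpool command
   cat /proc/$(pgrep -o zpool)/stack

   # dump backtraces of all uninterruptible (blocked) tasks to the kernel log
   echo w > /proc/sysrq-trigger
   dmesg -T | tail -n 200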

--------------

ZFS Events
----------

ZFS uses an event-based messaging interface for communication of
important events to other consumers running on the system. The ZFS Event
Daemon (zed) is a userland daemon that listens for these events and
processes them. zed is extensible so you can write shell scripts or
other programs that subscribe to events and take action. For example,
the script usually installed at ``/etc/zfs/zed.d/all-syslog.sh`` writes
a formatted event message to ``syslog``. See the man page for ``zed(8)``
for more information.
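
As a rough sketch of what such a zedlet can look like (the script name
and the notification command are made up for illustration; zed passes
event details to zedlets in ``ZEVENT_*`` environment variables, see
``zed(8)``; zedlets must be executable):

.. code:: bash

   #!/bin/sh
   # /etc/zfs/zed.d/all-notify-example.sh (hypothetical example)
   # "all-" zedlets run for every event; only react to data errors here.
   [ "${ZEVENT_CLASS}" = "ereport.fs.zfs.data" ] || exit 0

   # Log a short message; replace with mail, a webhook, etc. as needed.
   logger -t zed-example \
       "data error on pool ${ZEVENT_POOL:-unknown} (class ${ZEVENT_CLASS})"
   exit 0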

A history of events is also available via the ``zpool events`` command.
This history begins at ZFS kernel module load and includes events from
any pool. These events are stored in RAM and limited in count to a value
determined by the kernel tunable
`zfs_zevent_len_max <https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zfs_zevent_len_max>`__.
``zed`` has an internal throttling mechanism to prevent overconsumption
of system resources while processing ZFS events.

More detailed information about events is observable using
``zpool events -v``. The contents of the verbose events are subject to
change, based on the event and information available at the time of the
event.

Each event has a class identifier used for filtering event types.
Commonly seen events are those related to pool management with class
``sysevent.fs.zfs.*``, including import, export, configuration updates,
and ``zpool history`` updates.

Events related to errors are reported as class ``ereport.*``. These can
be invaluable for troubleshooting. Some faults can cause multiple
ereports as various layers of the software deal with the fault. For
example, on a simple pool without parity protection, a faulty disk could
cause an ``ereport.io`` during a read from the disk that results in an
``ereport.fs.zfs.checksum`` at the pool level. These events are also
reflected by the error counters observed in ``zpool status``. If you see
checksum or read/write errors in ``zpool status``, then there should be
one or more corresponding ereports in the ``zpool events`` output.
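
For example, error reports can be pulled out of the event history and
new events watched live like this:

.. code:: bash

   # list only error reports from the event history
   zpool events | grep 'ereport\.'

   # show full details of all events
   zpool events -v

   # follow new events as they arrive (Ctrl-C to stop)
   zpool events -f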

docs/Basic Concepts/dRAID Howto.rst (new file)

dRAID Howto
===========

.. note::
   This page describes *work in progress* functionality, which is not
   yet merged in the master branch.

Introduction
------------

raidz vs draid
~~~~~~~~~~~~~~

ZFS users are most likely very familiar with raidz already, so a
comparison with draid would help. The illustrations below are
simplified, but sufficient for the purpose of a comparison. For example,
31 drives can be configured as a zpool of 6 raidz1 vdevs and a hot
spare: |raidz1|

As shown above, if drive 0 fails and is replaced by the hot spare, only
5 out of the 30 surviving drives will work to resilver: drives 1-4 read,
and drive 30 writes.

The same 31 drives can be configured as 1 draid1 vdev of the same level
of redundancy (i.e. single parity, 1/4 parity ratio) and single spare
capacity: |draid1|

The drives are shuffled in a way that, after drive 0 fails, all 30
surviving drives will work together to restore the lost data/parity:

- All 30 drives read, because unlike the raidz1 configuration shown
  above, in the draid1 configuration the neighbor drives of the failed
  drive 0 (i.e. drives in the same data+parity group) are not fixed.
- All 30 drives write, because now there is no dedicated spare drive.
  Instead, spare blocks come from all drives.

To summarize:

- Normal application IO: draid and raidz are very similar. There's a
  slight advantage in draid, since there's no dedicated spare drive
  which is idle when not in use.
- Restore lost data/parity: for raidz, not all surviving drives will
  work to rebuild, and in addition it's bounded by the write throughput
  of a single replacement drive. For draid, the rebuild speed will
  scale with the total number of drives because all surviving drives
  will work to rebuild.

The dRAID vdev must shuffle its child drives in a way that, regardless
of which drive has failed, the rebuild IO (both read and write) will
distribute evenly among all surviving drives, so the rebuild speed will
scale. The exact mechanism used by the dRAID vdev driver is beyond the
scope of this simple introduction. If interested, please refer to the
recommended readings in the next section.

Recommended Reading
~~~~~~~~~~~~~~~~~~~

Parity declustering (the fancy term for shuffling drives) has been an
active research topic, and many papers have been published in this area.
The `Permutation Development Data
Layout <http://www.cse.scu.edu/~tschwarz/TechReports/hpca.pdf>`__ is a
good paper to begin with. The dRAID vdev driver uses a shuffling
algorithm loosely based on the mechanism described in this paper.

Using dRAID
-----------

First get the code `here <https://github.com/openzfs/zfs/pull/10102>`__,
build zfs with *configure --enable-debug*, and install. Then load the
zfs kernel module with the following options, which help dRAID rebuild
performance (see the example after the list):

- zfs_vdev_scrub_max_active=10
- zfs_vdev_async_write_min_active=4

Create a dRAID vdev
~~~~~~~~~~~~~~~~~~~

Similar to a raidz vdev, a dRAID vdev can be created using the
``zpool create`` command:

::

   # zpool create <pool> draid[1,2,3] <vdevs...>

Unlike raidz, additional options may be provided as part of the
``draid`` vdev type to specify an exact dRAID layout. When unspecified,
reasonable defaults will be chosen.

::

   # zpool create <pool> draid[1,2,3][:<groups>g][:<spares>s][:<data>d][:<iterations>] <vdevs...>

- groups - Number of redundancy groups (default: 1 group per 12 vdevs)
- spares - Number of distributed hot spares (default: 1)
- data - Number of data devices per group (default: determined by
  number of groups)
- iterations - Number of iterations to perform generating a valid dRAID
  mapping (default: 3)

*Notes*:

- The default values are not set in stone and may change.
- For the majority of common configurations we intend to provide
  pre-computed balanced dRAID mappings.
- When *data* is specified then: (draid_children - spares) % (parity +
  data) == 0, otherwise the pool creation will fail.
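
For illustration only (the drive names are made up, and the exact syntax
may still change while the feature is under review), a draid2 vdev with
2 groups, 2 distributed spares, and 4 data drives per group could be
built from 14 drives, since (14 - 2) is divisible by (2 parity + 4 data):

::

   # zpool create tank draid2:2g:2s:4d sd[a-n]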

Now the dRAID vdev is online and ready for IO:

::

     pool: tank
    state: ONLINE
   config:

           NAME                  STATE     READ WRITE CKSUM
           tank                  ONLINE       0     0     0
             draid2:4g:2s-0      ONLINE       0     0     0
               L0                ONLINE       0     0     0
               L1                ONLINE       0     0     0
               L2                ONLINE       0     0     0
               L3                ONLINE       0     0     0
               ...
               L50               ONLINE       0     0     0
               L51               ONLINE       0     0     0
               L52               ONLINE       0     0     0
           spares
             s0-draid2:4g:2s-0   AVAIL
             s1-draid2:4g:2s-0   AVAIL

   errors: No known data errors

There are two logical hot spare vdevs shown above at the bottom:

- The names begin with ``s<id>-`` followed by the name of the parent
  dRAID vdev.
- These hot spares are logical, made from reserved blocks on all the 53
  child drives of the dRAID vdev.
- Unlike traditional hot spares, the distributed spare can only replace
  a drive in its parent dRAID vdev.

The dRAID vdev behaves just like a raidz vdev of the same parity level.
You can do IO to/from it, scrub it, fail a child drive, and it will
operate in degraded mode.

Rebuild to distributed spare
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When there's a failed/offline child drive, the dRAID vdev supports a
completely new mechanism to reconstruct lost data/parity, in addition to
the resilver. First of all, resilver is still supported - if a failed
drive is replaced by another physical drive, the resilver process is
used to reconstruct lost data/parity to the new replacement drive, which
is the same as a resilver in a raidz vdev.

But if a child drive is replaced with a distributed spare, a new process
called rebuild is used instead of resilver:

::

   # zpool offline tank sdo
   # zpool replace tank sdo '%draid1-0-s0'
   # zpool status
     pool: tank
    state: DEGRADED
   status: One or more devices has been taken offline by the administrator.
           Sufficient replicas exist for the pool to continue functioning in a
           degraded state.
   action: Online the device using 'zpool online' or replace the device with
           'zpool replace'.
     scan: rebuilt 2.00G in 0h0m5s with 0 errors on Fri Feb 24 20:37:06 2017
   config:

           NAME                  STATE     READ WRITE CKSUM
           tank                  DEGRADED     0     0     0
             draid1-0            DEGRADED     0     0     0
               sdd               ONLINE       0     0     0
               sde               ONLINE       0     0     0
               sdf               ONLINE       0     0     0
               sdg               ONLINE       0     0     0
               sdh               ONLINE       0     0     0
               sdu               ONLINE       0     0     0
               sdj               ONLINE       0     0     0
               sdv               ONLINE       0     0     0
               sdl               ONLINE       0     0     0
               sdm               ONLINE       0     0     0
               sdn               ONLINE       0     0     0
               spare-11          DEGRADED     0     0     0
                 sdo             OFFLINE      0     0     0
                 %draid1-0-s0    ONLINE       0     0     0
               sdp               ONLINE       0     0     0
               sdq               ONLINE       0     0     0
               sdr               ONLINE       0     0     0
               sds               ONLINE       0     0     0
               sdt               ONLINE       0     0     0
           spares
             %draid1-0-s0        INUSE     currently in use
             %draid1-0-s1        AVAIL

The scan status line of the *zpool status* output now says *"rebuilt"*
instead of *"resilvered"*, because the lost data/parity was rebuilt to
the distributed spare by a brand new process called *"rebuild"*. The
main differences from *resilver* are:

- The rebuild process does not scan the whole block pointer tree.
  Instead, it only scans the spacemap objects.
- The IO from rebuild is sequential, because it rebuilds metaslabs one
  by one in sequential order.
- The rebuild process is not limited to block boundaries. For example,
  if 10 64K blocks are allocated contiguously, then rebuild will fix
  640K at one time. So the rebuild process generates larger IOs than
  resilver.
- For all the benefits above, there is one price to pay. The rebuild
  process cannot verify block checksums, since it doesn't have block
  pointers.
- Moreover, the rebuild process requires support from the on-disk
  format, and **only** works on draid and mirror vdevs. Resilver, on
  the other hand, works with any vdev (including draid).

Although the rebuild process creates larger IOs, the drives will not
necessarily see large IO requests. The block device queue parameter
``/sys/block/*/queue/max_sectors_kb`` must be tuned accordingly.
However, since the rebuild IO is already sequential, the benefits of
enabling larger IO requests might be marginal.
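
For example, to allow larger requests on one of the member drives (a
sketch; repeat for each child drive, and note that the accepted maximum
is bounded by the drive's ``max_hw_sectors_kb``):

::

   # echo 4096 > /sys/block/sdd/queue/max_sectors_kb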

At this point, redundancy has been fully restored without adding any new
drive to the pool. If another drive is offlined, the pool is still able
to do IO:

::

   # zpool offline tank sdj
   # zpool status
    state: DEGRADED
   status: One or more devices has been taken offline by the administrator.
           Sufficient replicas exist for the pool to continue functioning in a
           degraded state.
   action: Online the device using 'zpool online' or replace the device with
           'zpool replace'.
     scan: rebuilt 2.00G in 0h0m5s with 0 errors on Fri Feb 24 20:37:06 2017
   config:

           NAME                  STATE     READ WRITE CKSUM
           tank                  DEGRADED     0     0     0
             draid1-0            DEGRADED     0     0     0
               sdd               ONLINE       0     0     0
               sde               ONLINE       0     0     0
               sdf               ONLINE       0     0     0
               sdg               ONLINE       0     0     0
               sdh               ONLINE       0     0     0
               sdu               ONLINE       0     0     0
               sdj               OFFLINE      0     0     0
               sdv               ONLINE       0     0     0
               sdl               ONLINE       0     0     0
               sdm               ONLINE       0     0     0
               sdn               ONLINE       0     0     0
               spare-11          DEGRADED     0     0     0
                 sdo             OFFLINE      0     0     0
                 %draid1-0-s0    ONLINE       0     0     0
               sdp               ONLINE       0     0     0
               sdq               ONLINE       0     0     0
               sdr               ONLINE       0     0     0
               sds               ONLINE       0     0     0
               sdt               ONLINE       0     0     0
           spares
             %draid1-0-s0        INUSE     currently in use
             %draid1-0-s1        AVAIL

As shown above, the *draid1-0* vdev is still in *DEGRADED* mode although
two child drives have failed and it's only single-parity. Since
*%draid1-0-s1* is still *AVAIL*, full redundancy can be restored by
replacing *sdj* with it, without adding a new drive to the pool:

::

   # zpool replace tank sdj '%draid1-0-s1'
   # zpool status
    state: DEGRADED
   status: One or more devices has been taken offline by the administrator.
           Sufficient replicas exist for the pool to continue functioning in a
           degraded state.
   action: Online the device using 'zpool online' or replace the device with
           'zpool replace'.
     scan: rebuilt 2.13G in 0h0m5s with 0 errors on Fri Feb 24 23:20:59 2017
   config:

           NAME                  STATE     READ WRITE CKSUM
           tank                  DEGRADED     0     0     0
             draid1-0            DEGRADED     0     0     0
               sdd               ONLINE       0     0     0
               sde               ONLINE       0     0     0
               sdf               ONLINE       0     0     0
               sdg               ONLINE       0     0     0
               sdh               ONLINE       0     0     0
               sdu               ONLINE       0     0     0
               spare-6           DEGRADED     0     0     0
                 sdj             OFFLINE      0     0     0
                 %draid1-0-s1    ONLINE       0     0     0
               sdv               ONLINE       0     0     0
               sdl               ONLINE       0     0     0
               sdm               ONLINE       0     0     0
               sdn               ONLINE       0     0     0
               spare-11          DEGRADED     0     0     0
                 sdo             OFFLINE      0     0     0
                 %draid1-0-s0    ONLINE       0     0     0
               sdp               ONLINE       0     0     0
               sdq               ONLINE       0     0     0
               sdr               ONLINE       0     0     0
               sds               ONLINE       0     0     0
               sdt               ONLINE       0     0     0
           spares
             %draid1-0-s0        INUSE     currently in use
             %draid1-0-s1        INUSE     currently in use

Again, full redundancy has been restored without adding any new drive.
If another drive fails, the pool will still be able to handle IO, but
there'd be no more distributed spare to rebuild (both are in *INUSE*
state now). At this point, there's no urgency to add a new replacement
drive because the pool can survive yet another drive failure.

Rebuild for mirror vdev
~~~~~~~~~~~~~~~~~~~~~~~

The sequential rebuild process also works for the mirror vdev, when a
drive is attached to a mirror or a mirror child vdev is replaced.

By default, rebuild for mirror vdev is turned off. It can be turned on
using the zfs module option *spa_rebuild_mirror=1*.
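
For example, when loading the module on a build that includes the dRAID
code (a sketch; the option does not exist on regular releases):

::

   # modprobe zfs spa_rebuild_mirror=1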

Rebuild throttling
~~~~~~~~~~~~~~~~~~

The rebuild process may delay *zio* by *spa_vdev_scan_delay* if the
draid vdev has seen any important IO in the recent *spa_vdev_scan_idle*
period. But when a dRAID vdev has lost all redundancy, e.g. a draid2
with 2 faulted child drives, the rebuild process will go full speed by
ignoring *spa_vdev_scan_delay* and *spa_vdev_scan_idle* altogether
because the vdev is now in critical state.

After delaying, the rebuild zio is issued using priority
*ZIO_PRIORITY_SCRUB* for reads and *ZIO_PRIORITY_ASYNC_WRITE* for
writes. Therefore the options that control the queuing of these two IO
priorities will affect rebuild *zio* as well, for example
*zfs_vdev_scrub_min_active*, *zfs_vdev_scrub_max_active*,
*zfs_vdev_async_write_min_active*, and
*zfs_vdev_async_write_max_active*.

Rebalance
---------

Distributed spare space can be made available again by simply replacing
any failed drive with a new drive. This process is called *rebalance*,
which is essentially a *resilver*:

::

   # zpool replace -f tank sdo sdw
   # zpool status
    state: DEGRADED
   status: One or more devices has been taken offline by the administrator.
           Sufficient replicas exist for the pool to continue functioning in a
           degraded state.
   action: Online the device using 'zpool online' or replace the device with
           'zpool replace'.
     scan: resilvered 2.21G in 0h0m58s with 0 errors on Fri Feb 24 23:31:45 2017
   config:

           NAME                  STATE     READ WRITE CKSUM
           tank                  DEGRADED     0     0     0
             draid1-0            DEGRADED     0     0     0
               sdd               ONLINE       0     0     0
               sde               ONLINE       0     0     0
               sdf               ONLINE       0     0     0
               sdg               ONLINE       0     0     0
               sdh               ONLINE       0     0     0
               sdu               ONLINE       0     0     0
               spare-6           DEGRADED     0     0     0
                 sdj             OFFLINE      0     0     0
                 %draid1-0-s1    ONLINE       0     0     0
               sdv               ONLINE       0     0     0
               sdl               ONLINE       0     0     0
               sdm               ONLINE       0     0     0
               sdn               ONLINE       0     0     0
               sdw               ONLINE       0     0     0
               sdp               ONLINE       0     0     0
               sdq               ONLINE       0     0     0
               sdr               ONLINE       0     0     0
               sds               ONLINE       0     0     0
               sdt               ONLINE       0     0     0
           spares
             %draid1-0-s0        AVAIL
             %draid1-0-s1        INUSE     currently in use

Note that the scan status now says *"resilvered"*. Also, the state of
*%draid1-0-s0* has become *AVAIL* again. Since the resilver process
checks block checksums, it makes up for the lack of checksum
verification during the previous rebuild.

The dRAID1 vdev in this example shuffles three (4 data + 1 parity)
redundancy groups to the 17 drives. For any single drive failure, only
about 1/3 of the blocks are affected (and should be resilvered/rebuilt).
The rebuild process is able to avoid unnecessary work, but the resilver
process by default will not. The rebalance (which is essentially a
resilver) can be sped up significantly by setting the module option
*zfs_no_resilver_skip* to 0. This feature is turned off by default
because of issue :issue:`5806`.
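
For example (a sketch; *zfs_no_resilver_skip* ships with the dRAID
branch, so confirm the parameter exists on your build first):

::

   # echo 0 > /sys/module/zfs/parameters/zfs_no_resilver_skip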

Troubleshooting
---------------

Please report bugs to `the dRAID
PR <https://github.com/zfsonlinux/zfs/pull/10102>`__, as long as the
code is not merged upstream.

.. |raidz1| image:: /_static/img/draid_raidz.png
.. |draid1| image:: /_static/img/draid_draid.png

docs/Basic Concepts/index.rst (new file)

Basic Concepts
==============

.. toctree::
   :maxdepth: 2
   :caption: Contents:
   :glob:

   *