From 9a074f2843a2fd5927105816b2901f4d037c5691 Mon Sep 17 00:00:00 2001
From: George Melikov
Date: Sun, 13 Sep 2020 14:20:41 +0300
Subject: [PATCH] Hardware raw mediawiki->rst conversion

---
 docs/Performance and Tuning/Hardware.rst | 797 +++++++++++++++++++++++
 1 file changed, 797 insertions(+)
 create mode 100644 docs/Performance and Tuning/Hardware.rst

diff --git a/docs/Performance and Tuning/Hardware.rst b/docs/Performance and Tuning/Hardware.rst
new file mode 100644
index 0000000..43951d1
--- /dev/null
+++ b/docs/Performance and Tuning/Hardware.rst
@@ -0,0 +1,797 @@
Introduction
============

Storage before ZFS involved rather expensive hardware that was unable to
protect against silent corruption and did not scale very well. The
introduction of ZFS has enabled people to use far less expensive
hardware than previously used in the industry with superior scaling.
This page attempts to provide some basic guidance to people buying
hardware for use in ZFS-based servers and workstations.

Hardware that adheres to this guidance will enable ZFS to reach its full
potential for performance and reliability. Hardware that does not adhere
to it will serve as a handicap. Unless otherwise stated, such handicaps
apply to all storage stacks and are by no means specific to ZFS. Systems
built using competing storage stacks will also benefit from these
suggestions.

.. _bios_cpu_microcode_updates:

BIOS / CPU microcode updates
============================

Running the latest BIOS and CPU microcode is highly recommended.

Background
----------

Computer microprocessors are very complex designs that often have bugs,
which are called errata. Modern microprocessors are designed to utilize
microcode. This puts part of the hardware design into quasi-software
that can be patched without replacing the entire chip. Errata are often
resolved through CPU microcode updates. These are often bundled in BIOS
updates. In some cases, the BIOS interactions with the CPU through
machine registers can be modified to fix things with the same microcode.
If a newer microcode is not bundled as part of a BIOS update, it can
often be loaded by the operating system bootloader or the operating
system itself.
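
As a rough illustration of how to confirm this on one platform, the
following commands show the microcode revision currently in use. This is
a minimal sketch and assumes an x86 Linux system; other OpenZFS
platforms expose this information differently. ::

   # Microcode revision reported by the first CPU (x86 Linux)
   grep -m1 microcode /proc/cpuinfo

   # Kernel log entries from the microcode loader, if an update was applied
   dmesg | grep -i microcode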

.. _ecc_memory:

ECC Memory
==========

Bit flips can have fairly dramatic consequences for all computer
filesystems and ZFS is no exception. No technique used in ZFS (or any
other filesystem) is capable of protecting against bit flips.
Consequently, ECC Memory is highly recommended.

.. _background_1:

Background
----------

Ordinary background radiation will randomly flip bits in computer
memory, which causes undefined behavior. These are known as "bit flips".
Each bit flip can have any of four possible consequences depending on
which bit is flipped:

- Bit flips can have no effect.

  - Bit flips that have no effect occur in unused memory.

- Bit flips can cause runtime failures.

  - This is the case when a bit flip occurs in something read from
    disk.
  - Failures are typically observed when program code is altered.
  - If the bit flip is in a routine within the system's kernel or
    /sbin/init, the system will likely crash. Otherwise, reloading the
    affected data can clear it. This is typically achieved by a
    reboot.

- Bit flips can cause data corruption.

  - This is the case when the bit is in use by data being written to
    disk.
  - If the bit flip occurs before ZFS' checksum calculation, ZFS will
    not realize that the data is corrupt.
  - If the bit flip occurs after ZFS' checksum calculation, but before
    write-out, ZFS will detect it, but it might not be able to correct
    it.

- Bit flips can cause metadata corruption.

  - This is the case when a bit flips in an on-disk structure being
    written to disk.
  - If the bit flip occurs before ZFS' checksum calculation, ZFS will
    not realize that the metadata is corrupt.
  - If the bit flip occurs after ZFS' checksum calculation, but before
    write-out, ZFS will detect it, but it might not be able to correct
    it.
  - Recovery from such an event will depend on what was corrupted. In
    the worst case, a pool could be rendered unimportable.

    - All filesystems have poor reliability in their absolute worst
      case bit-flip failure scenarios. Such scenarios should be
      considered extraordinarily rare.
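
For those who want to verify that ECC is present and reporting errors,
the following is a minimal sketch for Linux systems whose memory
controller is supported by the kernel's EDAC subsystem (an assumption;
not every platform exposes these counters). ::

   # Reported error correction type for the installed memory
   dmidecode -t memory | grep -i 'error correction'

   # Corrected (ce) and uncorrectable (ue) error counts per memory controller
   grep . /sys/devices/system/edac/mc/mc*/ce_count \
          /sys/devices/system/edac/mc/mc*/ue_count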

.. _drive_interfaces:

Drive Interfaces
================

.. _sas_versus_sata:

SAS versus SATA
---------------

ZFS depends on the block device layer for storage. Consequently, ZFS is
affected by the same things that affect other filesystems, such as
driver support and non-working hardware. As a result, there are a few
things to note:

- Never place SATA disks into a SAS expander without a SAS interposer.

  - If you do this and it does work, it is the exception rather than
    the rule.

- Do not expect SAS controllers to be compatible with SATA port
  multipliers.

  - This configuration is typically not tested.
  - The disks could be unrecognized.

- Support for SATA port multipliers is inconsistent across OpenZFS
  platforms.

  - Linux drivers generally support them.
  - Illumos drivers generally do not support them.
  - FreeBSD drivers are somewhere between Linux and Illumos in terms
    of support.

.. _usb_hard_drives_andor_adapters:

USB Hard Drives and/or Adapters
-------------------------------

These have problems involving sector size reporting, SMART passthrough,
the ability to set ERC, and other areas. ZFS will perform as well on
such devices as they are capable of allowing, but try to avoid them.
They should not be expected to have the same uptime as SAS and SATA
drives and should be considered unreliable.

Controllers
===========

The ideal storage controller for ZFS has the following attributes:

- Driver support on major OpenZFS platforms

  - Stability is important.

- High per-port bandwidth

  - PCI Express interface bandwidth divided by the number of ports

- Low cost

  - Support for RAID, Battery Backup Units and hardware write caches
    is unnecessary.

Marc Bevand's blog post `From 32 to 2 ports: Ideal SATA/SAS Controllers
for ZFS & Linux MD RAID `__ contains an
excellent list of storage controllers that meet these criteria. He
regularly updates it as newer controllers become available.

.. _hardware_raid_controllers:

Hardware RAID controllers
-------------------------

Hardware RAID controllers should not be used with ZFS. While ZFS will
likely be more reliable than other filesystems on hardware RAID, it will
not be as reliable as it would be on its own.

- Hardware RAID will limit opportunities for ZFS to perform self
  healing on checksum failures. When ZFS does RAID-Z or mirroring, a
  checksum failure on one disk can be corrected by treating the disk
  containing the sector as bad for the purpose of reconstructing the
  original information. This cannot be done when a RAID controller
  handles the redundancy unless ZFS itself stores a duplicate copy,
  which is the case when the corruption involves metadata, when the
  copies property is set, or when the RAID array is one of several
  devices in a mirror or RAID-Z vdev within ZFS.

- Sector size information is not necessarily passed correctly by
  hardware RAID on RAID 1 and cannot be passed correctly on RAID 5/6.
  Hardware RAID 1 is more likely to experience read-modify-write
  overhead from partial sector writes and hardware RAID 5/6 will almost
  certainly suffer from partial stripe writes (i.e. the RAID write
  hole). Using ZFS with the disks directly will allow it to obtain the
  sector size information reported by the disks to avoid
  read-modify-write on sectors, while ZFS avoids partial stripe writes
  on RAID-Z by design through its use of copy-on-write.

  - There can be sector alignment problems on ZFS when a drive
    misreports its sector size. Such drives are typically NAND-flash
    based solid state drives and older SATA drives from the advanced
    format (4K sector size) transition before Windows XP EoL occurred.
    This can be `manually corrected `__ at vdev creation (see the
    example at the end of this section).
  - It is possible for the RAID header to cause misalignment of sector
    writes on RAID 1 by starting the array within a sector on an
    actual drive, such that manual correction of sector alignment at
    vdev creation does not solve the problem.

- Controller failures can require that the controller be replaced with
  the same model or, in less extreme cases, a model from the same
  manufacturer. Using ZFS by itself allows any controller to be used.

- If a hardware RAID controller's write cache is used, an additional
  failure point is introduced that can only be partially mitigated by
  additional complexity from adding flash to save data in power loss
  events. The data can still be lost if the battery fails when it is
  required to survive a power loss event, or if there is no flash and
  power is not restored in a timely manner. The loss of the data in the
  write cache can severely damage anything stored on a RAID array when
  many outstanding writes are cached. In addition, all writes are
  stored in the cache rather than just the synchronous writes that
  require a write cache, which is inefficient, and the write cache is
  relatively small. ZFS allows synchronous writes to be written
  directly to flash, which should provide similar acceleration to
  hardware RAID and the ability to accelerate many more in-flight
  operations.

- Behavior during RAID reconstruction when silent corruption damages
  data is undefined. There are reports of RAID 5 and 6 arrays being
  lost during reconstruction when the controller encounters silent
  corruption. ZFS' checksums allow it to avoid this situation by
  determining whether enough information exists to reconstruct data. If
  it does not, the file is listed as damaged in ``zpool status`` and
  the system administrator has the opportunity to restore it from a
  backup.

- IO response times will suffer whenever the OS blocks on IO
  operations, because the system CPU must wait on the much weaker
  embedded CPU used in the RAID controller. This lowers IOPS relative
  to what ZFS could have achieved.

- The controller's firmware is an additional layer of complexity that
  cannot be inspected by arbitrary third parties. The ZFS source code
  is open source and can be inspected by anyone.

- If multiple RAID arrays are formed by the same controller and one
  fails, the identifiers provided by the arrays exposed to the OS might
  become inconsistent. Giving the drives directly to the OS allows this
  to be avoided via naming that maps to a unique port or unique drive
  identifier.

  - e.g. If you have arrays A, B, C and D, and array B dies, the
    interaction between the hardware RAID controller and the OS might
    rename arrays C and D to look like arrays B and C respectively.
    This can fault pools that were imported verbatim from the
    cachefile.
  - Not all RAID controllers behave this way. However, this issue has
    been observed on both Linux and FreeBSD when system administrators
    used single-drive RAID 0 arrays. It has also been observed with
    controllers from different vendors.

One might be inclined to use single-drive RAID 0 arrays to make a RAID
controller behave like an HBA, but this is not recommended for many of
the reasons listed for other hardware RAID types. It is best to use an
HBA instead of a RAID controller, for both performance and reliability.
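
As referenced in the sector size note above, a pool can be created with
an explicit ashift so that a misreported sector size does not cause
misaligned writes. The following is a hedged sketch: the pool name and
device paths are placeholders, and ``-o ashift`` applies on OpenZFS
platforms that accept the property at creation time. ::

   # Force 4096-byte alignment (2^12 = 4096) for a mirror built from
   # drives that misreport 512-byte sectors. Names are examples only.
   zpool create -o ashift=12 tank mirror \
       /dev/disk/by-id/ata-EXAMPLE_DRIVE_1 \
       /dev/disk/by-id/ata-EXAMPLE_DRIVE_2

   # One way to confirm the ashift actually used by each vdev
   zdb -C tank | grep ashift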

.. _hard_drives:

Hard drives
===========

.. _sector_size:

Sector Size
-----------

Historically, all hard drives had 512-byte sectors, with the exception
of some SCSI drives that could be modified to support slightly larger
sectors. In 2009, the industry migrated from 512-byte sectors to
4096-byte "Advanced Format" sectors. Since Windows XP is not compatible
with 4096-byte sectors or drives larger than 2TB, some of the first
advanced format drives implemented hacks to maintain Windows XP
compatibility.

- The first advanced format drives on the market misreported their
  sector size as 512 bytes for Windows XP compatibility. As of 2013, it
  is believed that such hard drives are no longer in production.
  Advanced format hard drives made during or after this time should
  report their true physical sector size.
- Drives storing 2TB and smaller might have a jumper that can be set to
  map all sectors off by 1. This is to provide proper alignment for
  Windows XP, which started its first partition at sector 63. This
  jumper setting should be off when using such drives with ZFS.

As of 2014, there are still 512-byte and 4096-byte drives on the market,
but they are known to properly identify themselves unless behind a USB
to SATA controller. Replacing a 512-byte sector drive with a 4096-byte
sector drive in a vdev created with 512-byte sector drives will
adversely affect performance. Replacing a 4096-byte sector drive with a
512-byte sector drive will have no negative effect on performance.
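
Before creating a vdev, it is worth checking what the drive actually
reports. This is a minimal sketch for Linux; the device path is an
example, and other platforms have their own tools (e.g. ``diskinfo`` on
FreeBSD). ::

   # Logical and physical sector sizes as reported to the kernel
   lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda

   # The same values for a single device, one per line
   blockdev --getpbsz --getss /dev/sda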

.. _error_recovery_control:

Error recovery control
----------------------

ZFS is said to be able to use cheap drives. This was true when it was
introduced and hard drives supported error recovery control. Since ZFS'
introduction, error recovery control has been removed from low-end
drives from certain manufacturers, most notably Western Digital.
Consistent performance requires hard drives that support error recovery
control.

.. _background_2:

Background
~~~~~~~~~~

Hard drives store data using small polarized regions of a magnetic
surface. Reading from and/or writing to this surface poses a few
reliability problems. One is that imperfections in the surface can
corrupt bits. Another is that vibrations can cause drive heads to miss
their targets. Consequently, hard drive sectors are composed of three
regions:

- A sector number
- The actual data
- ECC

The sector number and ECC enable hard drives to detect and respond to
such events. When either event occurs during a read, hard drives will
retry the read many times until they either succeed or conclude that the
data cannot be read. The latter case can take a substantial amount of
time and, consequently, IO to the drive will stall.

Enterprise hard drives and some consumer hard drives implement a feature
called Time-Limited Error Recovery (TLER) by Western Digital, Error
Recovery Control (ERC) by Seagate and Command Completion Time Limit by
Hitachi and Samsung, which permits the time drives are willing to spend
on such events to be limited by the system administrator.

Drives that lack such functionality can be expected to have arbitrarily
high limits. Several minutes is not impossible. Drives with this
functionality typically default to 7 seconds. ZFS does not currently
adjust this setting on drives. However, it is advisable to write a
script to set the error recovery time to a low value, such as 0.1
seconds, until ZFS is modified to control it. This must be done on every
boot.
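
Such a boot-time script might look like the following sketch, which uses
``smartctl`` to query and set the SCT Error Recovery Control timers in
units of 100 milliseconds. The device path and the 0.1 second value are
examples; drives without ERC support will simply report that the
command is unsupported. ::

   # Show the current SCT ERC read/write timers (units of 100 ms)
   smartctl -l scterc /dev/sda

   # Limit read and write error recovery to 0.1 seconds (1 x 100 ms)
   smartctl -l scterc,1,1 /dev/sda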

.. _rpm_speeds:

RPM Speeds
----------

High RPM drives have lower seek times, which has historically been
regarded as desirable. They increase cost and sacrifice storage density
in order to achieve what is typically no more than a factor of 6
improvement over their lower RPM counterparts.

To provide some numbers, a 15k RPM drive from a major manufacturer is
rated for 3.4 millisecond average reads and 3.9 millisecond average
writes. Presumably, this number assumes that the target sector is at
most half the number of drive tracks away from the head and half the
disk away. Being even further away is worst-case 2 times slower.
Manufacturer numbers for 7200 RPM drives are not available, but they
average 13 to 16 milliseconds in empirical measurements. 5400 RPM drives
can be expected to be slower.

ARC and ZIL are able to mitigate much of the benefit of lower seek
times. Far larger increases in IOPS performance can be obtained by
adding additional RAM for ARC, L2ARC devices and SLOG devices. Even
higher increases in performance can be obtained by replacing hard drives
with solid state storage entirely. Such things are typically more cost
effective than high RPM drives when considering IOPS.

.. _command_queuing:

Command Queuing
---------------

Drives with command queues are able to reorder IO operations to increase
IOPS. This is called Native Command Queuing on SATA and Tagged Command
Queuing on PATA/SCSI/SAS. ZFS stores objects in metaslabs and it can use
several metaslabs at any given time. Consequently, ZFS is not only
designed to take advantage of command queuing, but good ZFS performance
requires command queuing. Almost all drives manufactured within the past
10 years can be expected to support command queuing. The exceptions are:

- Consumer PATA/IDE drives
- First generation SATA drives, which used IDE to SATA translation
  chips, from 2003 to 2004.
- SATA drives operating under IDE emulation that was configured in the
  system BIOS.

Each OpenZFS platform has a different method for checking whether
command queuing is supported. On Linux,
``hdparm -I /path/to/device | grep Queue`` is used. On FreeBSD,
``camcontrol identify $DEVICE`` is used.

.. _nand_flash_ssds:

NAND Flash SSDs
===============

As of 2014, solid state storage is dominated by NAND flash and most
articles on solid state storage focus on it exclusively. As of 2014, the
most popular form of flash storage used with ZFS involves drives with
SATA interfaces. Enterprise models with SAS interfaces are beginning to
become available.

As of 2017, solid state storage using NAND flash with PCI-E interfaces
is widely available on the market. These are predominantly enterprise
drives that utilize an NVMe interface, which has lower overhead than the
ATA used in SATA or the SCSI used in SAS. There is also an interface
known as M.2 that is primarily used by consumer SSDs, although not
necessarily limited to them. It can provide electrical connectivity for
multiple buses, such as SATA, PCI-E and USB. M.2 SSDs appear to use
either SATA or NVMe.

.. _nvme_low_level_formatting:

NVMe low level formatting
-------------------------

Many NVMe SSDs support both 512-byte sectors and 4096-byte sectors. They
often ship with 512-byte sectors, which are less performant than
4096-byte sectors. Some also support metadata for T10/DIF CRC to try to
improve reliability, although this is unnecessary with ZFS.

NVMe drives should be
`formatted `__
to use 4096-byte sectors without metadata prior to being given to ZFS
for best performance, unless they indicate that 512-byte sectors are as
performant as 4096-byte sectors, although this is unlikely. Lower
numbers in the Rel_Perf column of the Supported LBA Sizes table reported
by ``smartctl -a /dev/$device_namespace`` (for example
``smartctl -a /dev/nvme1n1``) indicate higher performance low level
formats, with 0 being the best. The format currently in use is marked by
a plus sign under the Fmt column.

You may format a drive using ``nvme format /dev/nvme1n1 -l $ID``. The
``$ID`` corresponds to the Id field value from the Supported LBA Sizes
SMART information.
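
A hedged sketch of the procedure described above follows. The namespace
path and the LBA format index are examples only, and the format step
destroys all data on the namespace, so verify the Id reported by your
own drive first. ::

   # Inspect the Supported LBA Sizes table (Rel_Perf 0 is best)
   smartctl -a /dev/nvme1n1

   # Alternative view; the format currently in use is flagged
   nvme id-ns /dev/nvme1n1 -H

   # DESTRUCTIVE: low level format the namespace to LBA format index 1,
   # assuming that index corresponds to the 4096-byte format on this drive
   nvme format /dev/nvme1n1 -l 1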

.. _power_failure_protection:

Power Failure Protection
------------------------

.. _background_3:

Background
~~~~~~~~~~

On-flash data structures are highly complex and traditionally have been
highly vulnerable to corruption. In the past, such corruption would
result in the loss of *all* drive data and an event such as a PSU
failure could result in multiple drives simultaneously failing. Since
the drive firmware is not available for review, the traditional
conclusion was that all drives that lack hardware features to avoid
power failure events cannot be trusted, which was found to be the case
multiple times in the
past\ `1 `__\ `2 `__\ `3 `__.
Discussion of power failures bricking NAND flash SSDs appears to have
vanished from literature following the year 2015. SSD manufacturers now
claim that firmware power loss protection is robust enough to provide
equivalent protection to hardware power loss protection. Kingston is one
example\ `4 `__.
Firmware power loss protection is used to guarantee the protection of
flushed data and the drives' own metadata, which is all that filesystems
such as ZFS need.

However, those that either need or want strong guarantees that firmware
bugs are unlikely to be able to brick drives following power loss events
should continue to use drives that provide hardware power loss
protection. The basic concept behind how hardware power failure
protection works has been `documented by
Intel `__
for those who wish to read about the details. As of 2020, use of
hardware power loss protection is now a feature solely of enterprise
SSDs that attempt to protect unflushed data in addition to drive
metadata and flushed data. This additional protection beyond protecting
flushed data and the drive metadata provides no additional benefit to
ZFS, but it does not hurt it.

It should also be noted that drives in data centers and laptops are
unlikely to experience power loss events, reducing the usefulness of
hardware power loss protection. This is especially the case in
datacenters where redundant power, UPS power and the use of IPMI to do
forced reboots should prevent most drives from experiencing power loss
events.

Lists of drives that provide hardware power loss protection are
maintained below for those who need or want it. Since ZFS, like other
filesystems, only requires power failure protection for flushed data and
drive metadata, older drives that only protect these things are included
on the lists.

.. _nvme_drives_with_power_failure_protection:

NVMe drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of NVMe drives with power failure protection is as
follows:

- Intel 750
- Intel DC P3500/P3600/P3608/P3700
- Samsung PM963 (M.2 form factor)
- Samsung PM1725/PM1725a
- Samsung XS1715
- Toshiba ZD6300
- Seagate Nytro 5000 M.2 (XP1920LE30002 tested; **read notes below
  before buying**)

  - Inexpensive 22110 M.2 enterprise drive using consumer MLC that is
    optimized for read-mostly workloads. It is not a good choice for a
    SLOG device, which is a write-mostly workload.
  - The
    `manual `__
    for this drive specifies airflow requirements. If the drive does
    not receive sufficient airflow from case fans, it will overheat at
    idle. Its thermal throttling will severely degrade performance
    such that write throughput performance will be limited to 1/10 of
    the specification and read latencies will reach several hundred
    milliseconds. Under continuous load, the device will continue to
    become hotter until it suffers a "degraded reliability" event
    where all data on at least one NVMe namespace is lost. The NVMe
    namespace is then unusable until a secure erase is done. Even with
    sufficient airflow under normal circumstances, data loss is
    possible under load following the failure of fans in an enterprise
    environment. Anyone deploying this into production in an
    enterprise environment should be mindful of this failure mode.
  - Those who wish to use this drive in a low airflow situation can
    work around this failure mode by placing a passive heatsink such
    as `this `__ on the
    NAND flash controller. It is the chip under the sticker closest to
    the capacitors. This was tested by placing the heatsink over the
    sticker (as removing it was considered undesirable). The heatsink
    will prevent the drive from overheating to the point of data loss,
    but it will not fully alleviate the overheating situation under
    load without active airflow. A scrub will cause it to overheat
    after a few hundred gigabytes are read. However, the thermal
    throttling will quickly cool the drive from 76 degrees Celsius to
    74 degrees Celsius, restoring performance.

    - It might be possible to use the heatsink in an enterprise
      environment to provide protection against data loss following
      fan failures. However, this was not evaluated.
      Furthermore, operating temperatures for consumer NAND flash
      should be at or above 40 degrees Celsius for long term data
      integrity. Therefore, the use of a heatsink to provide protection
      against data loss following fan failures in an enterprise
      environment should be evaluated before deploying drives into
      production to ensure that the drive is not overcooled.

.. _sas_drives_with_power_failure_protection:

SAS drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of SAS drives with power failure protection is as
follows:

- Samsung PM1633/PM1633a
- Samsung SM1625
- Samsung PM853T
- Toshiba PX05SHB***/PX04SHB***/PX04SHQ***
- Toshiba PX05SLB***/PX04SLB***/PX04SLQ***
- Toshiba PX05SMB***/PX04SMB***/PX04SMQ***
- Toshiba PX05SRB***/PX04SRB***/PX04SRQ***
- Toshiba PX05SVB***/PX04SVB***/PX04SVQ***

.. _sata_drives_with_power_failure_protection:

SATA drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of SATA drives with power failure protection is as
follows:

- Crucial MX100/MX200/MX300
- Crucial M500/M550/M600
- Intel 320

  - Early reports claimed that the 330 and 335 had power failure
    protection too, but they do
    not.\ `5 `__

- Intel 710
- Intel 730
- Intel DC S3500/S3510/S3610/S3700/S3710
- Micron 5210 Ion

  - First QLC drive on the list. High capacity with a low price per
    gigabyte.

- Samsung PM863/PM863a
- Samsung SM843T (do not confuse with SM843)
- Samsung SM863/SM863a
- Samsung 845DC Evo
- Samsung 845DC Pro

  - High sustained write
    IOPS\ `6 `__

- Toshiba HK4E/HK3E2
- Toshiba HK4R/HK3R2/HK3R

.. _criteriaprocess_for_inclusion_into_these_lists:

Criteria/process for inclusion into these lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These lists have been compiled on a volunteer basis by OpenZFS
contributors (mainly Richard Yao) from trustworthy sources of
information. The lists are intended to be vendor neutral and are not
intended to benefit any particular manufacturer. Any perceived bias
toward any manufacturer is caused by a lack of awareness and a lack of
time to research additional options. Confirmation of the presence of
adequate power loss protection by a reliable source is the only
requirement for inclusion into this list. Adequate power loss protection
means that the drive must protect both its own internal metadata and all
flushed data. Protection of unflushed data is irrelevant and therefore
not a requirement. ZFS only expects storage to protect flushed data.
Consequently, solid state drives whose power loss protection only
protects flushed data are sufficient for ZFS to ensure that data remains
safe.

Anyone who believes an unlisted drive to provide adequate power failure
protection may contact the `OpenZFS mailing list `__ with
a request for inclusion and substantiation for the claim that power
failure protection is provided. Examples of substantiation include
pictures of drive internals showing the presence of capacitors,
statements by well-regarded independent review sites such as Anandtech
and manufacturer specification sheets. The latter are accepted on the
honor system until a manufacturer is found to misstate reality on the
protection of the drives' own internal metadata structures and/or the
protection of flushed data. Thus far, all manufacturers have been
honest.

.. _flash_pages:

Flash pages
-----------

The smallest unit on a NAND chip that can be written is a flash page.
The first NAND-flash SSDs on the market had 4096-byte pages. Further
complicating matters is that the page size has been doubled twice since
then. NAND flash SSDs *should* report these pages as being sectors, but
so far, all of them incorrectly report 512-byte sectors for Windows XP
compatibility. The consequence is that we have a similar situation to
what we had with early advanced format hard drives.

As of 2014, most NAND-flash SSDs on the market have 8192-byte page
sizes. However, models using 128-Gbit NAND from certain manufacturers
have a 16384-byte page size. Maximum performance requires that vdevs be
created with correct ashift values (13 for 8192-byte and 14 for
16384-byte). However, not all OpenZFS platforms support this. The Linux
port supports ashift=13, while others are limited to ashift=12
(4096-byte).

As of 2017, NAND-flash SSDs are tuned for 4096-byte IOs. Matching the
flash page size is unnecessary and ashift=12 is usually the correct
choice. Public documentation on flash page size is also nearly
non-existent.

.. _ata_trim_scsi_unmap:

ATA TRIM / SCSI UNMAP
---------------------

At this time, only the FreeBSD port has support for sending block
discard commands to vdevs to generate appropriate ATA TRIM and/or SCSI
UNMAP commands. It should be noted that this is a separate case from
discard on zvols or hole punching on filesystems. Those work regardless
of whether ATA TRIM / SCSI UNMAP is sent to the actual block devices.

.. _ata_trim_performance_issues:

ATA TRIM Performance Issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ATA TRIM command in SATA 3.0 and earlier is a non-queued command.
Issuing a TRIM command on a SATA drive conforming to SATA 3.0 or earlier
will cause the drive to drain its IO queue and stop servicing requests
until it finishes, which hurts performance. SATA 3.1 removed this
limitation, but very few SATA drives on the market are conformant to
SATA 3.1 and it is difficult to distinguish them from SATA 3.0 drives.
At the same time, SCSI UNMAP has no such problems.

.. _optane_3d_xpoint_ssds:

Optane / 3D XPoint SSDs
=======================

These are SSDs with far better latencies and write endurance than NAND
flash SSDs. They are byte addressable, such that ashift=9 is fine for
use on them. Unlike NAND flash SSDs, they do not require any special
power failure protection circuitry for reliability. There is also no
need to run TRIM on them. However, they cost more per GB than NAND flash
(as of 2020). The enterprise models make excellent SLOG devices. Here is
a list of models that are known to perform well:

- Intel DC
  P4800X\ `7 `__

  - This gives basically the highest performance you can get as of
    June 2020.

Also, at the time of writing in June 2020, only one model is listed.
This is due to there being few such drives on the market. The client
models are likely to be outperformed by well-configured NAND flash
drives, so they have not been listed (although they are likely cheaper
than NAND flash). More will likely be added in the future.

Note that SLOG devices rarely have more than 4GB in use at any given
time, so the smaller sized devices are generally the best choice in
terms of cost, with larger sizes giving no benefit. Larger sizes could
be a good choice for other vdev types, depending on performance needs
and cost considerations.
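
For reference, adding one of these drives to an existing pool as a SLOG
is a single command. This is a hedged sketch: the pool name and device
paths are placeholders, and mirroring the log vdev is optional but
common when the rest of the pool is redundant. ::

   # Add a mirrored SLOG built from two small power-loss-protected SSDs;
   # names are examples only.
   zpool add tank log mirror \
       /dev/disk/by-id/nvme-EXAMPLE_SSD_1 \
       /dev/disk/by-id/nvme-EXAMPLE_SSD_2

   # Confirm that the log vdev is present
   zpool status tank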

Power
=====

Ensuring that computers are properly grounded is highly recommended.
There have been cases in user homes where machines experienced random
failures when plugged into power receptacles that had open grounds (i.e.
no ground wire at all). This can cause random failures on any computer
system, whether it uses ZFS or not.

Power should also be relatively stable. Large dips in voltages from
brownouts are preferably avoided through the use of UPS units or line
conditioners. Systems subject to unstable power that do not outright
shut down can exhibit undefined behavior. PSUs with longer hold-up times
should be able to provide partial protection against this, but hold-up
times are often undocumented and are not a substitute for a UPS or line
conditioner.

.. _pwr_ok_signal:

PWR_OK signal
-------------

PSUs are supposed to deassert a PWR_OK signal to indicate that provided
voltages are no longer within the rated specification. This should force
an immediate shutdown. However, the system clock of a developer
workstation was observed to significantly deviate from the expected
value during a series of ~1 second brownouts. This machine did not use a
UPS at the time. However, the PWR_OK mechanism should have protected
against this. The observation of the PWR_OK signal failing to force a
shutdown with adverse consequences (to the system clock in this case)
suggests that the PWR_OK mechanism is not a strict guarantee.

.. _psu_hold_up_times:

PSU Hold-up Times
-----------------

A PSU hold-up time is the amount of time that a PSU can continue to
output power at maximum output within standard voltage tolerances
following the loss of input power. This is important for supporting UPS
units because `the transfer
time `__
taken by a standard UPS to supply power from its battery can leave
machines without power for "5-12 ms". `Intel's ATX Power Supply design
guide `__
specifies a hold-up time of 17 milliseconds at maximum continuous
output. The hold-up time is an inverse function of how much power is
being output by the PSU, with lower power output increasing hold-up
times.

Capacitor aging in PSUs will lower the hold-up time below what it was
when new, which could cause reliability issues as equipment ages.
Machines using substandard PSUs with hold-up times below the
specification therefore require higher end UPS units for protection to
ensure that the transfer time does not exceed the hold-up time. A
hold-up time below the transfer time during a transfer to battery power
can cause undefined behavior should the PWR_OK signal not become
deasserted to force the machine to power off.

If in doubt, use a double conversion UPS unit. Double conversion UPS
units always run off the battery, such that the transfer time is 0. This
is unless they are high efficiency models that are hybrids between
standard UPS units and double conversion UPS units, although these are
reported to have much lower transfer times than standard UPS units. You
could also contact your PSU manufacturer for the hold-up time
specification, but if reliability for years is a requirement, you should
use a higher end UPS with a low transfer time.

Note that double conversion units are at most 94% efficient unless they
support a high efficiency mode, which adds latency to the time to
transition to battery power.

.. _ups_batteries:

UPS batteries
-------------

The lead-acid batteries in UPS units generally need to be replaced
regularly to ensure that they provide power during power outages. For
home systems, this is every 3 to 5 years, although this varies with
temperature\ `8 `__. For
enterprise systems, contact your vendor.