Hardware
********

.. contents:: Table of Contents
   :local:

Introduction
============

Storage before ZFS involved rather expensive hardware that was unable to
protect against silent corruption and did not scale very well. The
introduction of ZFS has enabled people to use far less expensive
hardware than previously used in the industry with superior scaling.
This page attempts to provide some basic guidance to people buying
hardware for use in ZFS-based servers and workstations.

Hardware that adheres to this guidance will enable ZFS to reach its full
potential for performance and reliability. Hardware that does not adhere
to it will serve as a handicap. Unless otherwise stated, such handicaps
apply to all storage stacks and are by no means specific to ZFS. Systems
built using competing storage stacks will also benefit from these
suggestions.

.. _bios_cpu_microcode_updates:

BIOS / CPU microcode updates
============================

Running the latest BIOS and CPU microcode is highly recommended.

Background
----------

Computer microprocessors are very complex designs that often have bugs,
which are called errata. Modern microprocessors are designed to utilize
microcode. This puts part of the hardware design into quasi-software
that can be patched without replacing the entire chip. Errata are often
resolved through CPU microcode updates. These are often bundled in BIOS
updates. In some cases, the BIOS interactions with the CPU through
machine registers can be modified to fix things with the same microcode.
If a newer microcode is not bundled as part of a BIOS update, it can
often be loaded by the operating system bootloader or the operating
system itself.

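On Linux, for example, the microcode revision currently in use can be
checked from userspace. This is a minimal illustrative sketch; output
formats vary by distribution, kernel and CPU vendor::

   # Microcode revision reported for the first CPU
   grep -m1 microcode /proc/cpuinfo

   # Messages from the kernel's early microcode loader, if any were logged
   dmesg | grep -i microcode
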
.. _ecc_memory:

ECC Memory
==========

Bit flips can have fairly dramatic consequences for all computer
filesystems and ZFS is no exception. No technique used in ZFS (or any
other filesystem) is capable of protecting against bit flips.
Consequently, ECC Memory is highly recommended.

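Whether ECC is present and actually enabled can usually be verified from
the operating system. The following is a hedged Linux example; it
requires root and the ``dmidecode`` utility, and reporting quality
varies by platform firmware::

   # "Error Correction Type" should read something like "Multi-bit ECC"
   # rather than "None" on a system with working ECC memory
   dmidecode --type memory | grep -i 'error correction'
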
.. _background_1:

Background
----------

Ordinary background radiation will randomly flip bits in computer
memory, which causes undefined behavior. These are known as "bit flips".
Each bit flip can have any of the following consequences, depending on
which bit is flipped:

- Bit flips can have no effect.

  - Bit flips that have no effect occur in unused memory.

- Bit flips can cause runtime failures.

  - This is the case when a bit flip occurs in something read from
    disk.
  - Failures are typically observed when program code is altered.
  - If the bit flip is in a routine within the system's kernel or
    /sbin/init, the system will likely crash. Otherwise, reloading the
    affected data can clear it. This is typically achieved by a
    reboot.

- It can cause data corruption.

  - This is the case when the bit is in use by data being written to
    disk.
  - If the bit flip occurs before ZFS' checksum calculation, ZFS will
    not realize that the data is corrupt.
  - If the bit flip occurs after ZFS' checksum calculation, but before
    write-out, ZFS will detect it, but it might not be able to correct
    it.

- It can cause metadata corruption.

  - This is the case when a bit flips in an on-disk structure being
    written to disk.
  - If the bit flip occurs before ZFS' checksum calculation, ZFS will
    not realize that the metadata is corrupt.
  - If the bit flip occurs after ZFS' checksum calculation, but before
    write-out, ZFS will detect it, but it might not be able to correct
    it.
  - Recovery from such an event will depend on what was corrupted. In
    the worst case, a pool could be rendered unimportable.

    - All filesystems have poor reliability in their absolute worst
      case bit-flip failure scenarios. Such scenarios should be
      considered extraordinarily rare.

.. _drive_interfaces:

Drive Interfaces
================

.. _sas_versus_sata:

SAS versus SATA
---------------

ZFS depends on the block device layer for storage. Consequently, ZFS is
affected by the same things that affect other filesystems, such as
driver support and non-working hardware. As a result, there are a few
things to note:

- Never place SATA disks into a SAS expander without a SAS interposer.

  - If you do this and it does work, it is the exception, rather than
    the rule.

- Do not expect SAS controllers to be compatible with SATA port
  multipliers.

  - This configuration is typically not tested.
  - The disks could be unrecognized.

- Support for SATA port multipliers is inconsistent across OpenZFS
  platforms.

  - Linux drivers generally support them.
  - Illumos drivers generally do not support them.
  - FreeBSD drivers are somewhere between Linux and Illumos in terms
    of support.

.. _usb_hard_drives_andor_adapters:

USB Hard Drives and/or Adapters
-------------------------------

These have problems involving sector size reporting, SMART passthrough,
the ability to set ERC and other areas. ZFS will perform as well on such
devices as they are capable of allowing, but try to avoid them. They
should not be expected to have the same up-time as SAS and SATA drives
and should be considered unreliable.

Controllers
===========

The ideal storage controller for ZFS has the following attributes:

- Driver support on major OpenZFS platforms

  - Stability is important.

- High per-port bandwidth

  - PCI Express interface bandwidth divided by the number of ports

- Low cost

  - Support for RAID, Battery Backup Units and hardware write caches
    is unnecessary.

Marc Bevand's blog post `From 32 to 2 ports: Ideal SATA/SAS Controllers
for ZFS & Linux MD RAID <http://blog.zorinaq.com/?e=10>`__ contains an
excellent list of storage controllers that meet these criteria. He
regularly updates it as newer controllers become available.

.. _hardware_raid_controllers:

Hardware RAID controllers
-------------------------

Hardware RAID controllers should not be used with ZFS. While ZFS will
likely be more reliable than other filesystems on Hardware RAID, it will
not be as reliable as it would be on its own.

- Hardware RAID will limit opportunities for ZFS to perform self
  healing on checksum failures. When ZFS does RAID-Z or mirroring, a
  checksum failure on one disk can be corrected by treating the disk
  containing the sector as bad for the purpose of reconstructing the
  original information. This cannot be done when a RAID controller
  handles the redundancy, unless ZFS stores a duplicate copy of the
  data, which is the case when the corruption involves metadata, when
  the copies property is set, or when the RAID array is part of a
  mirror/raid-z vdev within ZFS.

- Sector size information is not necessarily passed correctly by
  hardware RAID on RAID 1 and cannot be passed correctly on RAID 5/6.
  Hardware RAID 1 is more likely to experience read-modify-write
  overhead from partial sector writes and Hardware RAID 5/6 will almost
  certainly suffer from partial stripe writes (i.e. the RAID write
  hole). Using ZFS with the disks directly will allow it to obtain the
  sector size information reported by the disks to avoid
  read-modify-write on sectors, while ZFS avoids partial stripe writes
  on RAID-Z by design through its use of copy-on-write.

  - There can be sector alignment problems on ZFS when a drive
    misreports its sector size. Such drives are typically NAND-flash
    based solid state drives and older SATA drives from the advanced
    format (4K sector size) transition before Windows XP EoL occurred.
    This can be :ref:`manually corrected <alignment_shift_ashift>` at
    vdev creation.
  - It is possible for the RAID header to cause misalignment of sector
    writes on RAID 1 by starting the array within a sector on an
    actual drive, such that manual correction of sector alignment at
    vdev creation does not solve the problem.

- Controller failures can require that the controller be replaced with
  the same model, or in less extreme cases, a model from the same
  manufacturer. Using ZFS by itself allows any controller to be used.

- If a hardware RAID controller's write cache is used, an additional
  failure point is introduced that can only be partially mitigated by
  additional complexity from adding flash to save data in power loss
  events. The data can still be lost if the battery fails when it is
  required to survive a power loss event or there is no flash and power
  is not restored in a timely manner. The loss of the data in the write
  cache can severely damage anything stored on a RAID array when many
  outstanding writes are cached. In addition, all writes are stored in
  the cache rather than just synchronous writes that require a write
  cache, which is inefficient, and the write cache is relatively small.
  ZFS allows synchronous writes to be written directly to flash, which
  should provide similar acceleration to hardware RAID and the ability
  to accelerate many more in-flight operations.

- Behavior during RAID reconstruction when silent corruption damages
  data is undefined. There are reports of RAID 5 and 6 arrays being
  lost during reconstruction when the controller encounters silent
  corruption. ZFS' checksums allow it to avoid this situation by
  determining whether enough information exists to reconstruct data.
  If not, the file is listed as damaged in zpool status and the
  system administrator has the opportunity to restore it from a backup.

- IO response times will suffer whenever the OS blocks on IO
  operations, because the system CPU must wait on the much weaker
  embedded CPU used in the RAID controller. This lowers IOPS relative
  to what ZFS could have achieved.

- The controller's firmware is an additional layer of complexity that
  cannot be inspected by arbitrary third parties. The ZFS source code
  is open source and can be inspected by anyone.

- If multiple RAID arrays are formed by the same controller and one
  fails, the identifiers provided by the arrays exposed to the OS might
  become inconsistent. Giving the drives directly to the OS allows this
  to be avoided via naming that maps to a unique port or unique drive
  identifier (see the example at the end of this section).

  - e.g. if you have arrays A, B, C and D, and array B dies, the
    interaction between the hardware RAID controller and the OS might
    rename arrays C and D to look like arrays B and C respectively.
    This can fault pools verbatim imported from the cachefile.
  - Not all RAID controllers behave this way. However, this issue has
    been observed on both Linux and FreeBSD when system administrators
    used single drive RAID 0 arrays. It has also been observed with
    controllers from different vendors.

One might be inclined to use single-drive RAID 0 arrays to try to use a
RAID controller like an HBA, but this is not recommended for many of the
reasons listed for other hardware RAID types. It is best to use an HBA
instead of a RAID controller, for both performance and reliability.

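For example, on Linux, giving whole disks to ZFS and referring to them
by persistent identifiers sidesteps the renaming problem described
above. This is a hedged sketch; the pool name and device paths are
placeholders::

   # Create a mirrored pool using stable /dev/disk/by-id names rather
   # than enumeration-order names such as /dev/sdb
   zpool create tank mirror \
       /dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL_1 \
       /dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL_2

   # An existing pool can be re-imported using by-id names
   zpool export tank
   zpool import -d /dev/disk/by-id tank
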
.. _hard_drives:

Hard drives
===========

.. _sector_size:

Sector Size
-----------

Historically, all hard drives had 512-byte sectors, with the exception
of some SCSI drives that could be modified to support slightly larger
sectors. In 2009, the industry migrated from 512-byte sectors to
4096-byte "Advanced Format" sectors. Since Windows XP is not compatible
with 4096-byte sectors or drives larger than 2TB, some of the first
advanced format drives implemented hacks to maintain Windows XP
compatibility.

- The first advanced format drives on the market misreported their
  sector size as 512-bytes for Windows XP compatibility. As of 2013, it
  is believed that such hard drives are no longer in production.
  Advanced format hard drives made during or after this time should
  report their true physical sector size.
- Drives storing 2TB and smaller might have a jumper that can be set to
  map all sectors off by 1. This is to provide proper alignment for
  Windows XP, which started its first partition at sector 63. This
  jumper setting should be off when using such drives with ZFS.

As of 2014, there are still 512-byte and 4096-byte drives on the market,
but they are known to properly identify themselves unless behind a USB
to SATA controller. Replacing a 512-byte sector drive with a 4096-byte
sector drive in a vdev created with 512-byte sector drives will
adversely affect performance. Replacing a 4096-byte sector drive with a
512-byte sector drive will have no negative effect on performance.

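On Linux, the sector sizes that a drive reports can be inspected, and
the alignment shift can be forced at vdev creation when a drive is known
to misreport. This is a hedged example; the pool name and device paths
are placeholders::

   # Logical and physical sector sizes as seen by the block layer
   lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda

   # Force 4096-byte alignment when creating a pool from drives that
   # misreport 512-byte sectors
   zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb
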
.. _error_recovery_control:

Error recovery control
----------------------

ZFS is said to be able to use cheap drives. This was true when it was
introduced and hard drives supported error recovery control. Since ZFS'
introduction, error recovery control has been removed from low-end
drives from certain manufacturers, most notably Western Digital.
Consistent performance requires hard drives that support error recovery
control.

.. _background_2:

Background
~~~~~~~~~~

Hard drives store data using small polarized regions of a magnetic
surface. Reading from and/or writing to this surface poses a few
reliability problems. One is that imperfections in the surface can
corrupt bits. Another is that vibrations can cause drive heads to miss
their targets. Consequently, hard drive sectors are composed of three
regions:

- A sector number
- The actual data
- ECC

The sector number and ECC enable hard drives to detect and respond to
such events. When either event occurs during a read, hard drives will
retry the read many times until they either succeed or conclude that the
data cannot be read. The latter case can take a substantial amount of
time and consequently, IO to the drive will stall.

Enterprise hard drives and some consumer hard drives implement a feature
called Time-Limited Error Recovery (TLER) by Western Digital, Error
Recovery Control (ERC) by Seagate and Command Completion Time Limit by
Hitachi and Samsung, which permits the time drives are willing to spend
on such events to be limited by the system administrator.

Drives that lack such functionality can be expected to have arbitrarily
high limits. Several minutes is not impossible. Drives with this
functionality typically default to 7 seconds. ZFS does not currently
adjust this setting on drives. However, it is advisable to write a
script that sets the error recovery time to a low value, such as 0.1
seconds, until ZFS is modified to control it. This must be done on every
boot.

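On Linux, ``smartctl`` can query and set these timers on drives that
support them. This is a hedged sketch; the device path is a placeholder
and the values are in units of 100 milliseconds::

   # Show the current SCT error recovery control (ERC/TLER) timers
   smartctl -l scterc /dev/sda

   # Set both the read and write recovery timers to 0.1 seconds.
   # The setting does not persist across power cycles, so run this
   # from a script at every boot.
   smartctl -l scterc,1,1 /dev/sda
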
.. _rpm_speeds:

RPM Speeds
----------

High RPM drives have lower seek times, which is historically regarded as
being desirable. They increase cost and sacrifice storage density in
order to achieve what is typically no more than a factor of 6
improvement over their lower RPM counterparts.

To provide some numbers, a 15k RPM drive from a major manufacturer is
rated for 3.4 millisecond average read and 3.9 millisecond average
write. Presumably, this number assumes that the target sector is at most
half the number of drive tracks away from the head and half the disk
away. Being even further away is worst-case 2 times slower. Manufacturer
numbers for 7200 RPM drives are not available, but they average 13 to 16
milliseconds in empirical measurements. 5400 RPM drives can be expected
to be slower.

ARC and ZIL are able to mitigate much of the benefit of lower seek
times. Far larger increases in IOPS performance can be obtained by
adding additional RAM for ARC, L2ARC devices and SLOG devices. Even
higher increases in performance can be obtained by replacing hard drives
with solid state storage entirely. Such things are typically more cost
effective than high RPM drives when considering IOPS.

.. _command_queuing:

Command Queuing
---------------

Drives with command queues are able to reorder IO operations to increase
IOPS. This is called Native Command Queuing on SATA and Tagged Command
Queuing on PATA/SCSI/SAS. ZFS stores objects in metaslabs and it can use
several metaslabs at any given time. Consequently, ZFS is not only
designed to take advantage of command queuing, but good ZFS performance
requires command queuing. Almost all drives manufactured within the past
10 years can be expected to support command queuing. The exceptions are:

- Consumer PATA/IDE drives
- First generation SATA drives, which used IDE to SATA translation
  chips, from 2003 to 2004.
- SATA drives operating under IDE emulation that was configured in the
  system BIOS.

Each OpenZFS system has different methods for checking whether command
queuing is supported. On Linux, ``hdparm -I /path/to/device | grep
Queue`` is used. On FreeBSD, ``camcontrol identify $DEVICE`` is used.

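On a SATA drive that supports NCQ, the Linux check produces output along
these lines (an approximate sample; the exact wording and queue depth
vary by drive)::

   # A drive that supports NCQ reports a queue depth greater than 1 and
   # lists NCQ among its capabilities, e.g.:
   #     Queue depth: 32
   #        *    Native Command Queueing (NCQ)
   hdparm -I /dev/sda | grep -i queue
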
.. _nand_flash_ssds:

NAND Flash SSDs
===============

As of 2014, solid state storage is dominated by NAND-flash and most
articles on solid state storage focus on it exclusively. As of 2014, the
most popular form of flash storage used with ZFS involves drives with
SATA interfaces. Enterprise models with SAS interfaces are beginning to
become available.

As of 2017, solid state storage using NAND-flash with PCI-E interfaces
is widely available on the market. These are predominantly enterprise
drives that utilize an NVMe interface, which has lower overhead than the
ATA used in SATA or the SCSI used in SAS. There is also an interface
known as M.2 that is primarily used by consumer SSDs, although not
necessarily limited to them. It can provide electrical connectivity for
multiple buses, such as SATA, PCI-E and USB. M.2 SSDs appear to use
either SATA or NVMe.

.. _nvme_low_level_formatting:

NVMe low level formatting
-------------------------

Many NVMe SSDs support both 512-byte sectors and 4096-byte sectors. They
often ship with 512-byte sectors, which are less performant than
4096-byte sectors. Some also support metadata for T10/DIF CRC to try to
improve reliability, although this is unnecessary with ZFS.

NVMe drives should be
`formatted <https://filers.blogspot.com/2018/12/how-to-format-nvme-drive.html>`__
to use 4096-byte sectors without metadata prior to being given to ZFS
for best performance unless they indicate that 512-byte sectors are as
performant as 4096-byte sectors, although this is unlikely. Lower
numbers in the Rel_Perf column of the Supported LBA Sizes table from
``smartctl -a /dev/$device_namespace`` (for example ``smartctl -a
/dev/nvme1n1``) indicate higher performance low level formats, with 0
being the best. The current format is marked by a plus sign under the
Fmt column.

You may format a drive using ``nvme format /dev/nvme1n1 -l $ID``. The $ID
corresponds to the Id field value from the Supported LBA Sizes SMART
information.

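A representative workflow looks something like the following (a hedged
sketch; the namespace path is a placeholder and the exact smartctl
output layout varies by drive)::

   # List the supported LBA formats; the current one is marked with '+'
   smartctl -a /dev/nvme1n1
   # Supported LBA Sizes (NSID 0x1)
   # Id Fmt  Data  Metadt  Rel_Perf
   #  0 +     512       0         2
   #  1 -    4096       0         0

   # Reformat the namespace to the 4096-byte LBA format (Id 1 here).
   # WARNING: this destroys all data on the namespace.
   nvme format /dev/nvme1n1 -l 1
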
.. _power_failure_protection:

Power Failure Protection
------------------------

.. _background_3:

Background
~~~~~~~~~~

On-flash data structures are highly complex and traditionally have been
highly vulnerable to corruption. In the past, such corruption would
result in the loss of \*all\* drive data and an event such as a PSU
failure could result in multiple drives simultaneously failing. Since
the drive firmware is not available for review, the traditional
conclusion was that all drives that lack hardware features to avoid
power failure events cannot be trusted, which was found to be the case
multiple times in the
past [#ssd_analysis]_ [#ssd_analysis2]_ [#ssd_analysis3]_.
Discussion of power failures bricking NAND flash SSDs appears to have
vanished from literature following the year 2015. SSD manufacturers now
claim that firmware power loss protection is robust enough to provide
equivalent protection to hardware power loss protection. `Kingston is one
example <https://www.kingston.com/us/solutions/servers-data-centers/ssd-power-loss-protection>`__.
Firmware power loss protection is used to guarantee the protection of
flushed data and the drives' own metadata, which is all that filesystems
such as ZFS need.

However, those that either need or want strong guarantees that firmware
bugs are unlikely to be able to brick drives following power loss events
should continue to use drives that provide hardware power loss
protection. The basic concept behind how hardware power failure
protection works has been `documented by
Intel <https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-power-loss-imminent-technology-brief.pdf>`__
for those who wish to read about the details. As of 2020, use of
hardware power loss protection is now a feature solely of enterprise
SSDs that attempt to protect unflushed data in addition to drive
metadata and flushed data. This additional protection beyond protecting
flushed data and the drive metadata provides no additional benefit to
ZFS, but it does not hurt it.

It should also be noted that drives in data centers and laptops are
unlikely to experience power loss events, reducing the usefulness of
hardware power loss protection. This is especially the case in
datacenters where redundant power, UPS power and the use of IPMI to do
forced reboots should prevent most drives from experiencing power loss
events.

Lists of drives that provide hardware power loss protection are
maintained below for those who need/want it. Since ZFS, like other
filesystems, only requires power failure protection for flushed data and
drive metadata, older drives that only protect these things are included
on the lists.

.. _nvme_drives_with_power_failure_protection:

NVMe drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of NVMe drives with power failure protection is as
follows:

- Intel 750
- Intel DC P3500/P3600/P3608/P3700
- Samsung PM963 (M.2 form factor)
- Samsung PM1725/PM1725a
- Samsung XS1715
- Toshiba ZD6300
- Seagate Nytro 5000 M.2 (XP1920LE30002 tested; **read notes below
  before buying**)

  - Inexpensive 22110 M.2 enterprise drive using consumer MLC that is
    optimized for read mostly workloads. It is not a good choice for a
    SLOG device, which is a write mostly workload.
  - The
    `manual <https://www.seagate.com/www-content/support-content/enterprise-storage/solid-state-drives/nytro-5000/_shared/docs/nytro-5000-mp2-pm-100810195d.pdf>`__
    for this drive specifies airflow requirements. If the drive does
    not receive sufficient airflow from case fans, it will overheat at
    idle. Its thermal throttling will severely degrade performance
    such that write throughput performance will be limited to 1/10 of
    the specification and read latencies will reach several hundred
    milliseconds. Under continuous load, the device will continue to
    become hotter until it suffers a "degraded reliability" event
    where all data on at least one NVMe namespace is lost. The NVMe
    namespace is then unusable until a secure erase is done. Even with
    sufficient airflow under normal circumstances, data loss is
    possible under load following the failure of fans in an enterprise
    environment. Anyone deploying this into production in an
    enterprise environment should be mindful of this failure mode.
  - Those who wish to use this drive in a low airflow situation can
    work around this failure mode by placing a passive heatsink such as
    `this <https://smile.amazon.com/gp/product/B07BDKN3XV>`__ on the
    NAND flash controller. It is the chip under the sticker closest to
    the capacitors. This was tested by placing the heatsink over the
    sticker (as removing it was considered undesirable). The heatsink
    will prevent the drive from overheating to the point of data loss,
    but it will not fully alleviate the overheating situation under
    load without active airflow. A scrub will cause it to overheat
    after a few hundred gigabytes are read. However, the thermal
    throttling will quickly cool the drive from 76 degrees Celsius to
    74 degrees Celsius, restoring performance.

    - It might be possible to use the heatsink in an enterprise
      environment to provide protection against data loss following
      fan failures. However, this was not evaluated. Furthermore,
      operating temperatures for consumer NAND flash should be at or
      above 40 degrees Celsius for long term data integrity.
      Therefore, the use of a heatsink to provide protection against
      data loss following fan failures in an enterprise environment
      should be evaluated before deploying drives into production to
      ensure that the drive is not overcooled.

.. _sas_drives_with_power_failure_protection:

SAS drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of SAS drives with power failure protection is as
follows:

- Samsung PM1633/PM1633a
- Samsung SM1625
- Samsung PM853T
- Toshiba PX05SHB***/PX04SHB***/PX04SHQ**\*
- Toshiba PX05SLB***/PX04SLB***/PX04SLQ**\*
- Toshiba PX05SMB***/PX04SMB***/PX04SMQ**\*
- Toshiba PX05SRB***/PX04SRB***/PX04SRQ**\*
- Toshiba PX05SVB***/PX04SVB***/PX04SVQ**\*

.. _sata_drives_with_power_failure_protection:

SATA drives with power failure protection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A non-exhaustive list of SATA drives with power failure protection is as
follows:

- Crucial MX100/MX200/MX300
- Crucial M500/M550/M600
- Intel 320

  - Early reports claimed that the 330 and 335 had power failure
    protection too, `but they do
    not <http://blog.nordeus.com/dev-ops/power-failure-testing-with-ssds.htm>`__.

- Intel 710
- Intel 730
- Intel DC S3500/S3510/S3610/S3700/S3710
- Micron 5210 Ion

  - First QLC drive on the list. High capacity with a low price per
    gigabyte.

- Samsung PM863/PM863a
- Samsung SM843T (do not confuse with SM843)
- Samsung SM863/SM863a
- Samsung 845DC Evo
- Samsung 845DC Pro

  - `High sustained write
    IOPS <http://www.anandtech.com/show/8319/samsung-ssd-845dc-evopro-preview-exploring-worstcase-iops/5>`__

- Toshiba HK4E/HK3E2
- Toshiba HK4R/HK3R2/HK3R

.. _criteriaprocess_for_inclusion_into_these_lists:

Criteria/process for inclusion into these lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These lists have been compiled on a volunteer basis by OpenZFS
contributors (mainly Richard Yao) from trustworthy sources of
information. The lists are intended to be vendor neutral and are not
intended to benefit any particular manufacturer. Any perceived bias
toward any manufacturer is caused by a lack of awareness and a lack of
time to research additional options. Confirmation of the presence of
adequate power loss protection by a reliable source is the only
requirement for inclusion into these lists. Adequate power loss
protection means that the drive must protect both its own internal
metadata and all flushed data. Protection of unflushed data is
irrelevant and therefore not a requirement. ZFS only expects storage to
protect flushed data. Consequently, a solid state drive whose power loss
protection only covers flushed data is sufficient for ZFS to ensure that
data remains safe.

Anyone who believes that an unlisted drive provides adequate power
failure protection may contact the :ref:`mailing_lists` with a request
for inclusion and substantiation for the claim that power failure
protection is provided. Examples of substantiation include pictures of
drive internals showing the presence of capacitors, statements by well
regarded independent review sites such as Anandtech and manufacturer
specification sheets. The latter are accepted on the honor system until
a manufacturer is found to misstate reality on the protection of the
drives' own internal metadata structures and/or the protection of
flushed data. Thus far, all manufacturers have been honest.

.. _flash_pages:

Flash pages
-----------

The smallest unit on a NAND chip that can be written is a flash page.
The first NAND-flash SSDs on the market had 4096-byte pages. Further
complicating matters is that the page size has been doubled twice
since then. NAND flash SSDs **should** report these pages as being
sectors, but so far, all of them incorrectly report 512-byte sectors for
Windows XP compatibility. The consequence is that we have a similar
situation to what we had with early advanced format hard drives.

As of 2014, most NAND-flash SSDs on the market have 8192-byte page
sizes. However, models using 128-Gbit NAND from certain manufacturers
have a 16384-byte page size. Maximum performance requires that vdevs be
created with correct ashift values (13 for 8192-byte and 14 for
16384-byte). However, not all OpenZFS platforms support this. The Linux
port supports ashift=13, while others are limited to ashift=12
(4096-byte).

As of 2017, NAND-flash SSDs are tuned for 4096-byte IOs. Matching the
flash page size is unnecessary and ashift=12 is usually the correct
choice. Public documentation on flash page size is also nearly
non-existent.

.. _ata_trim_scsi_unmap:

ATA TRIM / SCSI UNMAP
---------------------

It should be noted that this is a separate case from discard on zvols or
hole punching on filesystems. Those work regardless of whether ATA TRIM
/ SCSI UNMAP is sent to the actual block devices.

.. _ata_trim_performance_issues:

ATA TRIM Performance Issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ATA TRIM command in SATA 3.0 and earlier is a non-queued command.
Issuing a TRIM command on a SATA drive conforming to SATA 3.0 or earlier
will cause the drive to drain its IO queue and stop servicing requests
until it finishes, which hurts performance. SATA 3.1 removed this
limitation, but very few SATA drives on the market are conformant to
SATA 3.1 and it is difficult to distinguish them from SATA 3.0 drives.
At the same time, SCSI UNMAP has no such problems.

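For reference, OpenZFS can send TRIM/UNMAP to a pool's devices either
automatically or on demand. A brief sketch follows; ``tank`` is a
placeholder pool name::

   # Continuously trim freed space in the background
   zpool set autotrim=on tank

   # Or run a one-off manual trim and check its progress
   zpool trim tank
   zpool status -t tank
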
.. _optane_3d_xpoint_ssds:

Optane / 3D XPoint SSDs
=======================

These are SSDs with far better latencies and write endurance than NAND
flash SSDs. They are byte addressable, such that ashift=9 is fine for
use on them. Unlike NAND flash SSDs, they do not require any special
power failure protection circuitry for reliability. There is also no
need to run TRIM on them. However, they cost more per GB than NAND flash
(as of 2020). The enterprise models make excellent SLOG devices. Here is
a list of models that are known to perform well:

- `Intel DC
  P4800X <https://www.servethehome.com/intel-optane-hands-on-real-world-benchmark-and-test-results/>`__

  - This gives basically the highest performance you can get as of
    June 2020.

Also, at the time of writing in June 2020, only one model is listed.
This is due to there being few such drives on the market. The client
models are likely to be outperformed by well configured NAND flash
drives, so they have not been listed (although they are likely cheaper
than NAND flash). More will likely be added in the future.

Note that SLOG devices rarely have more than 4GB in use at any given
time, so the smaller sized devices are generally the best choice in
terms of cost, with larger sizes giving no benefit. Larger sizes could
be a good choice for other vdev types, depending on performance needs
and cost considerations.

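As an illustration, an Optane SSD would typically be attached to an
existing pool as a SLOG like this (a hedged example; the pool name and
device paths are placeholders)::

   # Add a single dedicated log (SLOG) device
   zpool add tank log /dev/disk/by-id/nvme-EXAMPLE_OPTANE_1

   # Or add a mirrored SLOG to survive the loss of one log device
   zpool add tank log mirror \
       /dev/disk/by-id/nvme-EXAMPLE_OPTANE_1 \
       /dev/disk/by-id/nvme-EXAMPLE_OPTANE_2
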
Power
=====

Ensuring that computers are properly grounded is highly recommended.
There have been cases in user homes where machines experienced random
failures when plugged into power receptacles that had open grounds (i.e.
no ground wire at all). This can cause random failures on any computer
system, whether it uses ZFS or not.

Power should also be relatively stable. Large dips in voltages from
brownouts are preferably avoided through the use of UPS units or line
conditioners. Systems subject to unstable power that do not outright
shut down can exhibit undefined behavior. PSUs with longer hold-up times
should be able to provide partial protection against this, but hold-up
times are often undocumented and are not a substitute for a UPS or line
conditioner.

.. _pwr_ok_signal:

PWR_OK signal
-------------

PSUs are supposed to deassert a PWR_OK signal to indicate that provided
voltages are no longer within the rated specification. This should force
an immediate shutdown. However, the system clock of a developer
workstation was observed to significantly deviate from the expected
value during a series of ~1 second brownouts. This machine did not use a
UPS at the time. However, the PWR_OK mechanism should have protected
against this. The observation of the PWR_OK signal failing to force a
shutdown with adverse consequences (to the system clock in this case)
suggests that the PWR_OK mechanism is not a strict guarantee.

.. _psu_hold_up_times:

PSU Hold-up Times
-----------------

A PSU hold-up time is the amount of time that a PSU can continue to
output power at maximum output within standard voltage tolerances
following the loss of input power. This is important for supporting UPS
units because `the transfer
time <https://www.sunpower-uk.com/glossary/what-is-transfer-time/>`__
taken by a standard UPS to supply power from its battery can leave
machines without power for "5-12 ms". `Intel's ATX Power Supply design
guide <https://paginas.fe.up.pt/~asousa/pc-info/atxps09_atx_pc_pow_supply.pdf>`__
specifies a hold-up time of 17 milliseconds at maximum continuous
output. The hold-up time is an inverse function of how much power is
being output by the PSU, with lower power output increasing hold-up
times.

Capacitor aging in PSUs will lower the hold-up time below what it was
when new, which could cause reliability issues as equipment ages.
Machines using substandard PSUs with hold-up times below the
specification therefore require higher end UPS units for protection to
ensure that the transfer time does not exceed the hold-up time. A
hold-up time below the transfer time during a transfer to battery power
can cause undefined behavior should the PWR_OK signal not become
deasserted to force the machine to power off.

If in doubt, use a double conversion UPS unit. Double conversion UPS
units always run off the battery, such that the transfer time is 0. The
exception is high efficiency models that are hybrids between standard
UPS units and double conversion UPS units, although these are reported
to have much lower transfer times than standard UPS units. You could
also contact your PSU manufacturer for the hold-up time specification,
but if reliability for years is a requirement, you should use a higher
end UPS with a low transfer time.

Note that double conversion units are at most 94% efficient unless they
support a high efficiency mode, which adds latency to the time to
transition to battery power.

.. _ups_batteries:

UPS batteries
-------------

The lead acid batteries in UPS units generally need to be replaced
regularly to ensure that they provide power during power outages. For
home systems, this is every 3 to 5 years, although this varies with
temperature [#ups_temp]_. For enterprise systems, contact your vendor.


.. rubric:: Footnotes

.. [#ssd_analysis] <http://lkcl.net/reports/ssd_analysis.html>
.. [#ssd_analysis2] <https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf>
.. [#ssd_analysis3] <http://blog.nordeus.com/dev-ops/power-failure-testing-with-ssds.htm>
.. [#ups_temp] <https://www.apc.com/us/en/faqs/FA158934/>