Initial wiki md to rst auto conversion

docs/FAQ.rst (new file)

Table Of Contents
-----------------

- `What is ZFS on Linux <#what-is-zfs-on-linux>`__
- `Hardware Requirements <#hardware-requirements>`__
- `Do I have to use ECC memory for ZFS? <#do-i-have-to-use-ecc-memory-for-zfs>`__
- `Installation <#installation>`__
- `Supported Architectures <#supported-architectures>`__
- `Supported Kernels <#supported-kernels>`__
- `32-bit vs 64-bit Systems <#32-bit-vs-64-bit-systems>`__
- `Booting from ZFS <#booting-from-zfs>`__
- `Selecting /dev/ names when creating a pool <#selecting-dev-names-when-creating-a-pool>`__
- `Setting up the /etc/zfs/vdev_id.conf file <#setting-up-the-etczfsvdev_idconf-file>`__
- `Changing /dev/ names on an existing pool <#changing-dev-names-on-an-existing-pool>`__
- `The /etc/zfs/zpool.cache file <#the-etczfszpoolcache-file>`__
- `Generating a new /etc/zfs/zpool.cache file <#generating-a-new-etczfszpoolcache-file>`__
- `Sending and Receiving Streams <#sending-and-receiving-streams>`__

  - `hole_birth Bugs <#hole_birth-bugs>`__
  - `Sending Large Blocks <#sending-large-blocks>`__

- `CEPH/ZFS <#cephzfs>`__

  - `ZFS Configuration <#zfs-configuration>`__
  - `CEPH Configuration (ceph.conf) <#ceph-configuration-cephconf>`__
  - `Other General Guidelines <#other-general-guidelines>`__

- `Performance Considerations <#performance-considerations>`__
- `Advanced Format Disks <#advanced-format-disks>`__
- `ZVOL used space larger than expected <#zvol-used-space-larger-than-expected>`__
- `Using a zvol for a swap device <#using-a-zvol-for-a-swap-device>`__
- `Using ZFS on Xen Hypervisor or Xen Dom0 <#using-zfs-on-xen-hypervisor-or-xen-dom0>`__
- `udisks2 creating /dev/mapper/ entries for zvol <#udisks2-creating-devmapper-entries-for-zvol>`__
- `Licensing <#licensing>`__
- `Reporting a problem <#reporting-a-problem>`__
- `Does ZFS on Linux have a Code of Conduct? <#does-zfs-on-linux-have-a-code-of-conduct>`__

What is ZFS on Linux
--------------------

The ZFS on Linux project is an implementation of
`OpenZFS <http://open-zfs.org/wiki/Main_Page>`__ designed to work in a
Linux environment. OpenZFS is an outstanding storage platform that
encompasses the functionality of traditional filesystems, volume
managers, and more, with consistent reliability, functionality, and
performance across all distributions. Additional information about
OpenZFS can be found in the `OpenZFS Wikipedia
article <https://en.wikipedia.org/wiki/OpenZFS>`__.

Hardware Requirements
---------------------

Because ZFS was originally designed for Sun Solaris it was long
considered a filesystem for large servers and for companies that could
afford the best and most powerful hardware available. But since the
porting of ZFS to numerous open source platforms (the BSDs, Illumos, and
Linux - under the umbrella organization "OpenZFS"), these requirements
have been lowered.

The suggested hardware requirements are:

- ECC memory. This isn't really a requirement, but it's highly
  recommended.
- 8GB+ of memory for the best performance. It's perfectly possible to
  run with 2GB or less (and people do), but you'll need more if using
  deduplication.

Do I have to use ECC memory for ZFS?
------------------------------------

Using ECC memory for OpenZFS is strongly recommended for enterprise
environments where the strongest data integrity guarantees are required.
Without ECC memory rare random bit flips caused by cosmic rays or by
faulty memory can go undetected. If this were to occur OpenZFS (or any
other filesystem) would write the damaged data to disk and be unable to
automatically detect the corruption.

Unfortunately, ECC memory is not always supported by consumer grade
hardware. And even when it is, ECC memory will be more expensive. For
home users the additional safety brought by ECC memory might not justify
the cost. It's up to you to determine what level of protection your data
requires.

Installation
------------

ZFS on Linux is available for all major Linux distributions. Refer to
the [[getting started]] section of the wiki for links to installation
instructions for many popular distributions. If your distribution isn't
listed you can always build ZFS on Linux from the latest official
`tarball <https://github.com/zfsonlinux/zfs/releases>`__.

Supported Architectures
-----------------------

ZFS on Linux is regularly compiled for the following architectures:
x86_64, x86, aarch64, arm, ppc64, ppc.

Supported Kernels
-----------------

The `release notes <https://github.com/zfsonlinux/zfs/releases>`__ for a
given ZFS on Linux release will include the range of supported kernels.
Point releases will be tagged as needed in order to support the *stable*
kernel available from `kernel.org <https://www.kernel.org/>`__. The
oldest supported kernel is 2.6.32 due to its prominence in Enterprise
Linux distributions.

.. _32-bit-vs-64-bit-systems:

32-bit vs 64-bit Systems
------------------------

You are **strongly** encouraged to use a 64-bit kernel. ZFS on Linux
will build for 32-bit kernels but you may encounter stability problems.

ZFS was originally developed for the Solaris kernel which differs from
the Linux kernel in several significant ways. Perhaps most importantly
for ZFS it is common practice in the Solaris kernel to make heavy use of
the virtual address space. However, use of the virtual address space is
strongly discouraged in the Linux kernel. This is particularly true on
32-bit architectures where the virtual address space is limited to 100M
by default. Using the virtual address space on 64-bit Linux kernels is
also discouraged but the address space is so much larger than physical
memory it is less of an issue.

If you are bumping up against the virtual memory limit on a 32-bit
system you will see the following message in your system logs. You can
increase the virtual address size with the boot option ``vmalloc=512M``.

::

    vmap allocation for size 4198400 failed: use vmalloc=<size> to increase size.
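
On most distributions this option can be added to the kernel command
line through the boot loader. A rough sketch, assuming a GRUB 2 based
system (file locations and the regeneration command vary by
distribution):

::

    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... vmalloc=512M"

    # then regenerate the GRUB configuration, e.g.
    $ sudo update-grub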

However, even after making this change your system will likely not be
entirely stable. Proper support for 32-bit systems is contingent upon
the OpenZFS code being weaned off its dependence on virtual memory. This
will take some time to do correctly but it is planned for OpenZFS. This
change is also expected to improve how efficiently OpenZFS manages the
ARC cache and allow for tighter integration with the standard Linux page
cache.

Booting from ZFS
----------------

Booting from ZFS on Linux is possible and many people do it. There are
excellent walkthroughs available for [[Debian]], [[Ubuntu]] and
`Gentoo <https://github.com/pendor/gentoo-zfs-install/tree/master/install>`__.

Selecting /dev/ names when creating a pool
------------------------------------------

There are different /dev/ names that can be used when creating a ZFS
pool. Each option has advantages and drawbacks, and the right choice for
your ZFS pool really depends on your requirements. For development and
testing, using /dev/sdX naming is quick and easy. A typical home server
might prefer /dev/disk/by-id/ naming for simplicity and readability,
while very large configurations with multiple controllers, enclosures,
and switches will likely prefer /dev/disk/by-vdev naming for maximum
control. In the end, how you choose to identify your disks is up to you.
A quick way to see which persistent names are available on your system
is shown after the list below.

- **/dev/sdX, /dev/hdX:** Best for development/test pools

  - Summary: The top level /dev/ names are the default for consistency
    with other ZFS implementations. They are available under all Linux
    distributions and are commonly used. However, because they are not
    persistent they should only be used with ZFS for development/test
    pools.
  - Benefits: This method is easy for a quick test, the names are
    short, and they will be available on all Linux distributions.
  - Drawbacks: The names are not persistent and will change depending
    on what order the disks are detected in. Adding or removing
    hardware for your system can easily cause the names to change. You
    would then need to remove the zpool.cache file and re-import the
    pool using the new names.
  - Example: ``zpool create tank sda sdb``

- **/dev/disk/by-id/:** Best for small pools (less than 10 disks)

  - Summary: This directory contains disk identifiers with more human
    readable names. The disk identifier usually consists of the
    interface type, vendor name, model number, device serial number,
    and partition number. This approach is more user friendly because
    it simplifies identifying a specific disk.
  - Benefits: Nice for small systems with a single disk controller.
    Because the names are persistent and guaranteed not to change, it
    doesn't matter how the disks are attached to the system. You can
    take them all out, randomly mix them up on the desk, put them
    back anywhere in the system and your pool will still be
    automatically imported correctly.
  - Drawbacks: Configuring redundancy groups based on physical
    location becomes difficult and error prone.
  - Example:
    ``zpool create tank scsi-SATA_Hitachi_HTS7220071201DP1D10DGG6HMRP``

- **/dev/disk/by-path/:** Good for large pools (greater than 10 disks)

  - Summary: This approach uses device names which include the
    physical cable layout in the system, which means that a particular
    disk is tied to a specific location. The name describes the PCI
    bus number, as well as enclosure names and port numbers. This
    allows the most control when configuring a large pool.
  - Benefits: Encoding the storage topology in the name is not only
    helpful for locating a disk in large installations, but it also
    allows you to explicitly lay out your redundancy groups over
    multiple adapters or enclosures.
  - Drawbacks: These names are long, cumbersome, and difficult for a
    human to manage.
  - Example:
    ``zpool create tank pci-0000:00:1f.2-scsi-0:0:0:0 pci-0000:00:1f.2-scsi-1:0:0:0``

- **/dev/disk/by-vdev/:** Best for large pools (greater than 10 disks)

  - Summary: This approach provides administrative control over device
    naming using the configuration file /etc/zfs/vdev_id.conf. Names
    for disks in JBODs can be generated automatically to reflect their
    physical location by enclosure IDs and slot numbers. The names can
    also be manually assigned based on existing udev device links,
    including those in /dev/disk/by-path or /dev/disk/by-id. This
    allows you to pick your own unique meaningful names for the disks.
    These names will be displayed by all the zfs utilities so they can
    be used to clarify the administration of a large complex pool. See
    the vdev_id and vdev_id.conf man pages for further details.
  - Benefits: The main benefit of this approach is that it allows you
    to choose meaningful human-readable names. Beyond that, the
    benefits depend on the naming method employed. If the names are
    derived from the physical path the benefits of /dev/disk/by-path
    are realized. On the other hand, aliasing the names based on drive
    identifiers or WWNs has the same benefits as using
    /dev/disk/by-id.
  - Drawbacks: This method relies on having an /etc/zfs/vdev_id.conf
    file properly configured for your system. To configure this file
    please refer to section `Setting up the /etc/zfs/vdev_id.conf
    file <#setting-up-the-etczfsvdev_idconf-file>`__. As with the
    benefits, the drawbacks of /dev/disk/by-id or /dev/disk/by-path
    may apply depending on the naming method employed.
  - Example: ``zpool create tank mirror A1 B1 mirror A2 B2``
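
To see which persistent names udev has created on your system, you can
simply list the relevant directories (purely illustrative; the output
differs from machine to machine):

::

    $ ls -l /dev/disk/by-id/
    $ ls -l /dev/disk/by-path/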

.. _setting-up-the-etczfsvdev_idconf-file:

Setting up the /etc/zfs/vdev_id.conf file
-----------------------------------------

In order to use /dev/disk/by-vdev/ naming the ``/etc/zfs/vdev_id.conf``
file must be configured. The format of this file is described in the
vdev_id.conf man page. Several examples follow.

A non-multipath configuration with direct-attached SAS enclosures and an
arbitrary slot re-mapping.

::

    multipath     no
    topology      sas_direct
    phys_per_port 4

    # PCI_SLOT HBA PORT  CHANNEL NAME
    channel 85:00.0 1    A
    channel 85:00.0 0    B

    # Linux   Mapped
    # Slot    Slot
    slot 0    2
    slot 1    6
    slot 2    0
    slot 3    3
    slot 4    5
    slot 5    7
    slot 6    4
    slot 7    1

A SAS-switch topology. Note that the channel keyword takes only two
arguments in this example.

::

    topology      sas_switch

    # SWITCH PORT  CHANNEL NAME
    channel 1      A
    channel 2      B
    channel 3      C
    channel 4      D

A multipath configuration. Note that channel names have multiple
definitions - one per physical path.

::

    multipath yes

    # PCI_SLOT HBA PORT  CHANNEL NAME
    channel 85:00.0 1    A
    channel 85:00.0 0    B
    channel 86:00.0 1    A
    channel 86:00.0 0    B

A configuration using device link aliases.

::

    # by-vdev
    # name    fully qualified or base name of device link
    alias d1  /dev/disk/by-id/wwn-0x5000c5002de3b9ca
    alias d2  wwn-0x5000c5002def789e

After defining the new disk names run ``udevadm trigger`` to prompt udev
to parse the configuration file. This will result in a new
/dev/disk/by-vdev directory which is populated with symlinks to /dev/sdX
names. Following the first example above, you could then create the new
pool of mirrors with the following command:

::

    $ zpool create tank \
        mirror A0 B0 mirror A1 B1 mirror A2 B2 mirror A3 B3 \
        mirror A4 B4 mirror A5 B5 mirror A6 B6 mirror A7 B7

    $ zpool status
      pool: tank
     state: ONLINE
      scan: none requested
    config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            A0      ONLINE       0     0     0
            B0      ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            A1      ONLINE       0     0     0
            B1      ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            A2      ONLINE       0     0     0
            B2      ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            A3      ONLINE       0     0     0
            B3      ONLINE       0     0     0
          mirror-4  ONLINE       0     0     0
            A4      ONLINE       0     0     0
            B4      ONLINE       0     0     0
          mirror-5  ONLINE       0     0     0
            A5      ONLINE       0     0     0
            B5      ONLINE       0     0     0
          mirror-6  ONLINE       0     0     0
            A6      ONLINE       0     0     0
            B6      ONLINE       0     0     0
          mirror-7  ONLINE       0     0     0
            A7      ONLINE       0     0     0
            B7      ONLINE       0     0     0

    errors: No known data errors

Changing /dev/ names on an existing pool
----------------------------------------

Changing the /dev/ names on an existing pool can be done by simply
exporting the pool and re-importing it with the -d option to specify
which new names should be used. For example, to use the custom names in
/dev/disk/by-vdev:

::

    $ zpool export tank
    $ zpool import -d /dev/disk/by-vdev tank

.. _the-etczfszpoolcache-file:

The /etc/zfs/zpool.cache file
-----------------------------

Whenever a pool is imported on the system it will be added to the
``/etc/zfs/zpool.cache`` file. This file stores pool configuration
information, such as the device names and pool state. If this file
exists when running the ``zpool import`` command then it will be used to
determine the list of pools available for import. When a pool is not
listed in the cache file it will need to be detected and imported using
the ``zpool import -d /dev/disk/by-id`` command.
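
For example, to scan a specific device directory and import a pool that
is missing from the cache file (the pool name here is only a
placeholder):

::

    $ zpool import -d /dev/disk/by-id tank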

.. _generating-a-new-etczfszpoolcache-file:

Generating a new /etc/zfs/zpool.cache file
------------------------------------------

The ``/etc/zfs/zpool.cache`` file will be automatically updated when
your pool configuration is changed. However, if for some reason it
becomes stale you can force the generation of a new
``/etc/zfs/zpool.cache`` file by setting the cachefile property on the
pool.

::

    $ zpool set cachefile=/etc/zfs/zpool.cache tank

Conversely, the cache file can be disabled by setting ``cachefile=none``.
This is useful for failover configurations where the pool should always
be explicitly imported by the failover software.

::

    $ zpool set cachefile=none tank

Sending and Receiving Streams
-----------------------------

hole_birth Bugs
~~~~~~~~~~~~~~~

The hole_birth feature has/had bugs, the result of which is that, if you
do a ``zfs send -i`` (or ``-R``, since it uses ``-i``) from an affected
dataset, the receiver *will not see any checksum or other errors, but
will not match the source*.

ZoL versions 0.6.5.8 and 0.7.0-rc1 (and above) default to ignoring the
faulty metadata which causes this issue *on the sender side*.

For more details, see the [[hole_birth FAQ]].

Sending Large Blocks
~~~~~~~~~~~~~~~~~~~~

When sending incremental streams which contain large blocks (>128K) the
``--large-block`` flag must be specified. Inconsistent use of the flag
between incremental sends can result in files being incorrectly zeroed
when they are received. Raw encrypted send/recvs automatically imply the
``--large-block`` flag and are therefore unaffected.
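
For instance, a full send followed by an incremental send of a
large-block dataset might look like the following (dataset and snapshot
names are only placeholders):

::

    $ zfs send --large-block tank/data@snap1 | zfs receive backup/data
    $ zfs send --large-block -i tank/data@snap1 tank/data@snap2 | zfs receive backup/data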

For more details, see `issue
6224 <https://github.com/zfsonlinux/zfs/issues/6224>`__.

CEPH/ZFS
--------

There is a lot of tuning that can be done that's dependent on the
workload that is being put on CEPH/ZFS, as well as some general
guidelines. Some are as follows:

ZFS Configuration
~~~~~~~~~~~~~~~~~

The CEPH filestore back-end heavily relies on xattrs; for optimal
performance all CEPH workloads will benefit from the following ZFS
dataset parameters:

- ``xattr=sa``
- ``dnodesize=auto``

Beyond that, typically rbd/cephfs focused workloads benefit from a small
recordsize (16K-128K), while objectstore/s3/rados focused workloads
benefit from a large recordsize (128K-1M).
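
As a sketch, an OSD-backing dataset could be created with these
properties set up front (the dataset name and recordsize value are only
placeholders; pick a recordsize that matches your workload):

::

    $ zfs create -o xattr=sa -o dnodesize=auto -o recordsize=128K tank/ceph/osd0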

.. _ceph-configuration-cephconf:

CEPH Configuration (ceph.conf)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Additionally CEPH sets various values internally for handling xattrs
based on the underlying filesystem. As CEPH only officially
supports/detects XFS and BTRFS, for all other filesystems it falls back
to rather `limited "safe"
values <https://github.com/ceph/ceph/blob/4fe7e2a458a1521839bc390c2e3233dd809ec3ac/src/common/config_opts.h#L1125-L1148>`__.
On newer releases the need for larger xattrs will prevent OSDs from even
starting.

The officially recommended workaround (`see
here <http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/#not-recommended>`__)
has some severe downsides, and more specifically is geared toward
filesystems with "limited" xattr support such as ext4.

ZFS does not internally limit xattr length, so we can treat it similarly
to how CEPH treats XFS. We can set overrides for three internal values
to match those used with XFS (`see
here <https://github.com/ceph/ceph/blob/9b317f7322848802b3aab9fec3def81dddd4a49b/src/os/filestore/FileStore.cc#L5714-L5737>`__
and
`here <https://github.com/ceph/ceph/blob/4fe7e2a458a1521839bc390c2e3233dd809ec3ac/src/common/config_opts.h#L1125-L1148>`__)
and allow it to be used without the severe limitations of the "official"
workaround.

::

    [osd]
    filestore_max_inline_xattrs = 10
    filestore_max_inline_xattr_size = 65536
    filestore_max_xattr_value_size = 65536

Other General Guidelines
~~~~~~~~~~~~~~~~~~~~~~~~

- Use a separate journal device. Do not colocate the CEPH journal on a
  ZFS dataset if at all possible; this will quickly lead to terrible
  fragmentation, not to mention terrible performance upfront even
  before fragmentation (the CEPH journal does a dsync for every write).
- Use a SLOG device, even with a separate CEPH journal device. For some
  workloads, skipping SLOG and setting ``logbias=throughput`` may be
  acceptable.
- Use a high-quality SLOG/CEPH journal device; a consumer based SSD or
  even NVMe device WILL NOT DO (Samsung 830, 840, 850, etc.) for a
  variety of reasons. CEPH will kill them quickly, on top of the
  performance being quite low in this use. Generally recommended are
  [Intel DC S3610, S3700, S3710, P3600, P3700], or [Samsung SM853,
  SM863], or better.
- If using a high-quality SSD or NVMe device (as mentioned above), you
  CAN share SLOG and CEPH journal on a single device with good results.
  A ratio of 4 HDDs to 1 SSD (Intel DC S3710 200GB), with each SSD
  partitioned (remember to align!) to 4x10GB (for ZIL/SLOG) + 4x20GB
  (for CEPH journal) has been reported to work well.

Again - CEPH + ZFS will KILL a consumer based SSD VERY quickly. Even
ignoring the lack of power-loss protection and endurance ratings, you
will be very disappointed with the performance of a consumer based SSD
under such a workload.

Performance Considerations
--------------------------

To achieve good performance with your pool there are some easy best
practices you should follow. Additionally, it should be made clear that
the ZFS on Linux implementation has not yet been optimized for
performance. As the project matures we can expect performance to
improve.

- **Evenly balance your disks across controllers:** Often the limiting
  factor for performance is not the disk but the controller. By
  balancing your disks evenly across controllers you can often improve
  throughput.
- **Create your pool using whole disks:** When running zpool create use
  whole disk names. This will allow ZFS to automatically partition the
  disk to ensure correct alignment. It will also improve
  interoperability with other OpenZFS implementations which honor the
  wholedisk property.
- **Have enough memory:** A minimum of 2GB of memory is recommended for
  ZFS. Additional memory is strongly recommended when the compression
  and deduplication features are enabled.
- **Improve performance by setting ashift=12:** You may be able to
  improve performance for some workloads by setting ``ashift=12``. This
  tuning can only be set when block devices are first added to a pool,
  such as when the pool is first created or when a new vdev is added to
  the pool. This tuning parameter can result in a decrease of capacity
  for RAIDZ configurations.

Advanced Format Disks
---------------------

Advanced Format (AF) is a new disk format which natively uses a 4,096
byte, instead of 512 byte, sector size. To maintain compatibility with
legacy systems many AF disks emulate a sector size of 512 bytes. By
default, ZFS will automatically detect the sector size of the drive.
This combination can result in poorly aligned disk accesses which will
greatly degrade the pool performance.

Therefore, the ability to set the ashift property has been added to the
zpool command. This allows users to explicitly assign the sector size
when devices are first added to a pool (typically at pool creation time
or when adding a vdev to the pool). The ashift values range from 9 to 16
with the default value 0 meaning that ZFS should auto-detect the sector
size. This value is actually a bit shift value, so the ashift value for
512 bytes is 9 (2^9 = 512) while the ashift value for 4,096 bytes is 12
(2^12 = 4,096).

To force the pool to use 4,096 byte sectors at pool creation time, you
may run:

::

    $ zpool create -o ashift=12 tank mirror sda sdb

To force the pool to use 4,096 byte sectors when adding a vdev to a
pool, you may run:

::

    $ zpool add -o ashift=12 tank mirror sdc sdd
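
If you are unsure whether a drive is an Advanced Format disk, the block
layer's view of the physical and logical sector sizes can be listed
(a quick, illustrative check):

::

    $ lsblk -o NAME,PHY-SEC,LOG-SEC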

ZVOL used space larger than expected
------------------------------------

Depending on the filesystem used on the zvol (e.g. ext4) and the usage
(e.g. deletion and creation of many files) the ``used`` and
``referenced`` properties reported by the zvol may be larger than the
"actual" space that is being used as reported by the consumer. This can
happen due to the way some filesystems work, in which they prefer to
allocate files in new untouched blocks rather than the fragmented used
blocks marked as free. This forces ZFS to reference all blocks that the
underlying filesystem has ever touched. This is in itself not much of a
problem, as when the ``used`` property reaches the configured
``volsize`` the underlying filesystem will start reusing blocks. But the
problem arises if it is desired to snapshot the zvol, as the space
referenced by the snapshots will contain the unused blocks.

This issue can be prevented by using the ``fstrim`` command to allow the
kernel to tell ZFS which blocks are unused. Executing an ``fstrim``
command before a snapshot is taken will ensure a minimum snapshot size.
Adding the ``discard`` option for the mounted ZVOL in ``/etc/fstab``
effectively enables the Linux kernel to issue the trim commands
continuously, without the need to execute fstrim on-demand.
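
A sketch of what such an ``/etc/fstab`` entry could look like for an
ext4 filesystem on a zvol (the pool, volume, and mount point names are
only placeholders):

::

    /dev/zvol/tank/vol1  /srv/vol1  ext4  defaults,discard  0  2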

Using a zvol for a swap device
------------------------------

You may use a zvol as a swap device but you'll need to configure it
appropriately.

**CAUTION:** for now swap on zvol may lead to deadlock, in this case
please send your logs
`here <https://github.com/zfsonlinux/zfs/issues/7734>`__.

- Set the volume block size to match your system's page size. This
  tuning prevents ZFS from having to perform read-modify-write
  operations on a larger block while the system is already low on
  memory.
- Set the ``logbias=throughput`` and ``sync=always`` properties. Data
  written to the volume will be flushed immediately to disk freeing up
  memory as quickly as possible.
- Set ``primarycache=metadata`` to avoid keeping swap data in RAM via
  the ARC.
- Disable automatic snapshots of the swap device.

::

    $ zfs create -V 4G -b $(getconf PAGESIZE) \
        -o logbias=throughput -o sync=always \
        -o primarycache=metadata \
        -o com.sun:auto-snapshot=false rpool/swap
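
Once the volume exists it still has to be formatted and enabled as swap.
A minimal sketch, assuming the ``rpool/swap`` volume from the example
above (add a matching ``/etc/fstab`` entry if you want it enabled at
boot):

::

    $ mkswap /dev/zvol/rpool/swap
    $ swapon /dev/zvol/rpool/swap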

Using ZFS on Xen Hypervisor or Xen Dom0
---------------------------------------

It is usually recommended to keep virtual machine storage and hypervisor
pools quite separate, although a few people have managed to successfully
deploy and run ZFS on Linux using the same machine configured as Dom0.
There are a few caveats:

- Set a fair amount of memory in grub.conf, dedicated to Dom0.

  - ``dom0_mem=16384M,max:16384M``

- Allocate no more than 30-40% of Dom0's memory to ZFS in
  ``/etc/modprobe.d/zfs.conf``.

  - ``options zfs zfs_arc_max=6442450944``

- Disable Xen's auto-ballooning in ``/etc/xen/xl.conf``.
- Watch out for any Xen bugs, such as `this
  one <https://github.com/zfsonlinux/zfs/issues/1067>`__ related to
  ballooning.
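
A minimal sketch of the two settings from the list above, shown where
they would live (the memory figures are just the example values; size
them for your own system):

::

    # grub.conf - appended to the Xen hypervisor (multiboot) line
    ... dom0_mem=16384M,max:16384M

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_arc_max=6442450944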

udisks2 creating /dev/mapper/ entries for zvol
----------------------------------------------

To prevent udisks2 from creating /dev/mapper entries that must be
manually removed or maintained during zvol remove / rename, create a
udev rule such as ``/etc/udev/rules.d/80-udisks2-ignore-zfs.rules`` with
the following contents:

::

    ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_FS_TYPE}=="zfs_member", ENV{ID_PART_ENTRY_TYPE}=="6a898cc3-1dd2-11b2-99a6-080020736631", ENV{UDISKS_IGNORE}="1"
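
For the new rule to take effect without a reboot you can ask udev to
reload its rules and re-trigger events (a common approach; exact steps
may vary by distribution):

::

    $ sudo udevadm control --reload-rules
    $ sudo udevadm trigger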

Licensing
---------

ZFS is licensed under the Common Development and Distribution License
(`CDDL <http://hub.opensolaris.org/bin/view/Main/opensolaris_license>`__),
and the Linux kernel is licensed under the GNU General Public License
Version 2 (`GPLv2 <http://www.gnu.org/licenses/gpl2.html>`__). While
both are free open source licenses they are restrictive licenses. The
combination of them causes problems because it prevents using pieces of
code exclusively available under one license with pieces of code
exclusively available under the other in the same binary. In the case of
the kernel, this prevents us from distributing ZFS on Linux as part of
the kernel binary. However, there is nothing in either license that
prevents distributing it in the form of a binary module or in the form
of source code.

Additional reading and opinions:

- `Software Freedom Law
  Center <https://www.softwarefreedom.org/resources/2016/linux-kernel-cddl.html>`__
- `Software Freedom
  Conservancy <https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/>`__
- `Free Software
  Foundation <https://www.fsf.org/licensing/zfs-and-linux>`__
- `Encouraging closed source
  modules <http://www.networkworld.com/article/2301697/smb/encouraging-closed-source-modules-part-1--copyright-and-software.html>`__

Reporting a problem
-------------------

You can open a new issue and search existing issues using the public
`issue tracker <https://github.com/zfsonlinux/zfs/issues>`__. The issue
tracker is used to organize outstanding bug reports, feature requests,
and other development tasks. Anyone may post comments after signing up
for a GitHub account.

Please make sure that what you're actually seeing is a bug and not a
support issue. If in doubt, please ask on the mailing list first, and if
you're then asked to file an issue, do so.

When opening a new issue include this information at the top of the
issue:

- What distribution you're using and the version.
- What spl/zfs packages you're using and the version.
- Describe the problem you're observing.
- Describe how to reproduce the problem.
- Include any warnings/errors/backtraces from the system logs.
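
A few commands that are commonly used to gather the version details
above, shown as a rough sketch (exact paths and package names vary by
distribution):

::

    $ cat /etc/os-release                 # distribution and version
    $ uname -r                            # kernel version
    $ modinfo zfs | grep -iw version      # zfs module version
    $ modinfo spl | grep -iw version      # spl module version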

When a new issue is opened it's not uncommon for a developer to request
additional information about the problem. In general, the more detail
you share about a problem the quicker a developer can resolve it. For
example, providing a simple test case is always exceptionally helpful.
Be prepared to work with the developer looking into your bug in order
to get it resolved. They may ask for information like:

- Your pool configuration as reported by ``zdb`` or ``zpool status``.
- Your hardware configuration, such as

  - Number of CPUs.
  - Amount of memory.
  - Whether your system has ECC memory.
  - Whether it is running under a VMM/Hypervisor.
  - Kernel version.
  - Values of the spl/zfs module parameters.

- Stack traces which may be logged to ``dmesg``.

Does ZFS on Linux have a Code of Conduct?
-----------------------------------------

Yes, the ZFS on Linux community has a code of conduct. See the `Code of
Conduct <http://open-zfs.org/wiki/Code_of_Conduct>`__ for details.