Workload Tuning
===============

Below are tips for various workloads.

.. _basic_concepts:

Basic concepts
--------------

Descriptions of ZFS internals that have an effect on application performance follow.

.. _adaptive_replacement_cache:

Adaptive Replacement Cache
~~~~~~~~~~~~~~~~~~~~~~~~~~

For decades, operating systems have used RAM as a cache to avoid the necessity of waiting on disk IO, which is extremely slow. This concept is called page replacement. Until ZFS, virtually all filesystems used the Least Recently Used (LRU) page replacement algorithm, in which the least recently used pages are the first to be replaced. Unfortunately, the LRU algorithm is vulnerable to cache flushes, where an occasional brief change in workload removes all frequently used data from the cache. The Adaptive Replacement Cache (ARC) algorithm was implemented in ZFS to replace LRU. It solves this problem by maintaining four lists:

#. A list for recently cached entries.
#. A list for recently cached entries that have been accessed more than once.
#. A list for entries evicted from #1.
#. A list for entries evicted from #2.

Data is evicted from the first list while an effort is made to keep data in the second list. In this way, ARC is able to outperform LRU by providing a superior hit rate.

In addition, a dedicated cache device (typically an SSD) can be added to the pool with ``zpool add`` *poolname* ``cache`` *devicename*. The cache device is managed by the L2ARC, which scans entries that are next to be evicted and writes them to the cache device. The data stored in ARC and L2ARC can be controlled via the ``primarycache`` and ``secondarycache`` zfs properties respectively, which can be set on both zvols and datasets. Possible settings are ``all``, ``none`` and ``metadata``. It is possible to improve performance when a zvol or dataset hosts an application that does its own caching by caching only metadata. One example is PostgreSQL. Another would be a virtual machine using ZFS.

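
As an illustration, the following minimal sketch attaches a cache device and restricts a PostgreSQL dataset to metadata caching. The pool name ``tank``, the dataset ``tank/postgres`` and the device path are hypothetical::

   # Attach an SSD as an L2ARC cache device (device path is a placeholder).
   zpool add tank cache /dev/disk/by-id/ata-EXAMPLE-SSD

   # Cache only metadata for a dataset whose application does its own data caching.
   zfs set primarycache=metadata tank/postgres
   zfs set secondarycache=metadata tank/postgres
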

.. _alignment_shift_ashift:

Alignment Shift (ashift)
~~~~~~~~~~~~~~~~~~~~~~~~

Top-level vdevs contain an internal property called ashift, which stands for alignment shift. It is set at vdev creation and is immutable. It can be read using the ``zdb`` command. It is calculated as the maximum base 2 logarithm of the physical sector size of any child vdev and it alters the disk format such that writes are always done according to it. This makes 2^ashift the smallest possible IO on a vdev. Configuring ashift correctly is important because partial sector writes incur a penalty where the sector must be read into a buffer before it can be written. ZFS makes the implicit assumption that the sector size reported by drives is correct and calculates ashift based on that.

In an ideal world, physical sector size is always reported correctly and therefore, this requires no attention. Unfortunately, this is not the case. The sector size on all storage devices was 512 bytes prior to the creation of flash-based solid state drives. Some operating systems, such as Windows XP, were written under this assumption and will not function when drives report a different sector size.

Flash-based solid state drives came to market around 2007. These devices report 512-byte sectors, but the actual flash pages, which roughly correspond to sectors, are never 512 bytes. The early models used 4096-byte pages while the newer models have moved to an 8192-byte page. In addition, "Advanced Format" hard drives have been created which also use a 4096-byte sector size. Partial page writes suffer from similar performance degradation as partial sector writes. In some cases, the design of NAND flash makes the performance degradation even worse, but that is beyond the scope of this description.

Reporting the correct sector size is the responsibility of the block device layer. This unfortunately means that the proper handling of devices that misreport their sector size differs across platforms. The respective methods are as follows:

- `sd.conf `__ on illumos
- `gnop(8) `__ on FreeBSD; see for example `FreeBSD on 4K sector drives `__ (2011-01-01)
- `ashift= `__ on ZFS on Linux
- ``-o ashift=`` also works with both MacZFS (pool version 8) and ZFS-OSX (pool version 5000).

``-o ashift=`` is convenient, but it is flawed in that the creation of pools containing top-level vdevs that have multiple optimal sector sizes requires the use of multiple commands. `A newer syntax `__ that will rely on the actual sector sizes has been discussed as a cross-platform replacement and will likely be implemented in the future.

In addition, `Richard Yao `__ has contributed a `database of drives known to misreport sector sizes `__ to the ZFS on Linux project. It is used to automatically adjust ashift without the assistance of the system administrator. This approach is unable to fully compensate for misreported sector sizes whenever drive identifiers are used ambiguously (e.g. virtual machines, iSCSI LUNs, some rare SSDs), but it does a great amount of good. The format is roughly compatible with illumos' sd.conf and it is expected that other implementations will integrate the database in future releases. Strictly speaking, this database does not belong in ZFS, but the difficulty of patching the Linux kernel (especially older ones) necessitated that this be implemented in ZFS itself for Linux. The same is true for MacZFS. However, FreeBSD and illumos are both able to implement this in the correct layer.

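
For example, when a drive is known to use 4K physical sectors but reports 512-byte sectors, ashift can be forced at pool creation on ZFS on Linux. A hedged sketch; the pool name and device paths are placeholders::

   # Force 4K alignment (2^12 = 4096 bytes) when creating the pool.
   zpool create -o ashift=12 tank mirror /dev/disk/by-id/ata-DRIVE-A /dev/disk/by-id/ata-DRIVE-B

   # One way to verify the ashift actually used by the top-level vdevs.
   zdb -C tank | grep ashift
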

Compression
~~~~~~~~~~~

Internally, ZFS allocates data using multiples of the device's sector size, typically either 512 bytes or 4KB (see above). When compression is enabled, a smaller number of sectors can be allocated for each block. The uncompressed block size is set by the ``recordsize`` (defaults to 128KB) or ``volblocksize`` (defaults to 8KB) property (for filesystems vs volumes).

The following compression algorithms are available:

- LZ4

  - A new algorithm added after feature flags were created. It is significantly superior to LZJB in all metrics tested. It is the new default compression algorithm (compression=on) in OpenZFS\ `1 `__. It is available on all platforms as of 2020.

- LZJB

  - The original default compression algorithm (compression=on) for ZFS. It was created to satisfy the desire for a compression algorithm suitable for use in filesystems. Specifically, that it provides fair compression, has a high compression speed, has a high decompression speed and detects incompressible data quickly.

- GZIP (1 through 9)

  - Classic Lempel-Ziv implementation. It provides high compression, but it often makes IO CPU-bound.

- ZLE (Zero Length Encoding)

  - A very simple algorithm that only compresses zeroes.

If you want to use compression and are uncertain which to use, use LZ4. It averages a 2.1:1 compression ratio while gzip-1 averages 2.7:1, but gzip is much slower. Both figures are obtained from `testing by the LZ4 project `__ on the Silesia corpus. The greater compression ratio of gzip is usually only worthwhile for rarely accessed data.

.. _raid_z_stripe_width:

RAID-Z stripe width
~~~~~~~~~~~~~~~~~~~

Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is rarely worthwhile. See this `blog post `__ for more details.

.. _dataset_recordsize:

Dataset recordsize
~~~~~~~~~~~~~~~~~~

ZFS datasets use an internal recordsize of 128KB by default. The dataset recordsize is the basic unit of data used for internal copy-on-write on files. Partial record writes require that data be read from either ARC (cheap) or disk (expensive). recordsize can be set to any power of 2 from 512 bytes to 128 kilobytes. Software that writes in fixed record sizes (e.g. databases) will benefit from the use of a matching recordsize.

Changing the recordsize on a dataset will only take effect for new files. If you change the recordsize because your application should perform better with a different one, you will need to recreate its files. A cp followed by a mv on each file is sufficient. Alternatively, send/recv should recreate the files with the correct recordsize when a full receive is done.

.. _larger_record_sizes:

Larger record sizes
^^^^^^^^^^^^^^^^^^^

Record sizes of up to 16M are supported with the large_blocks pool feature, which is enabled by default on new pools on systems that support it. However, record sizes larger than 1M are disabled by default unless the zfs_max_recordsize kernel module parameter is set to allow sizes higher than 1M. Record sizes larger than 1M are not as well tested as 1M, although they should work. ``zfs send`` operations must specify -L to ensure that blocks larger than 128KB are sent, and the receiving pools must support the large_blocks feature.

.. _zvol_volblocksize:

zvol volblocksize
~~~~~~~~~~~~~~~~~

Zvols have a volblocksize property that is analogous to recordsize. The default size is 8KB, which is the size of a page on the SPARC architecture. Workloads that use smaller sized IOs (such as swap on x86, which uses 4096-byte pages) will benefit from a smaller volblocksize.

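
As an illustration of matching these properties to the workload, a dataset for an application that writes in 16KB units and a zvol intended for swap on x86 might be created as follows. This is a sketch only; the pool and dataset names and the 4GB volume size are hypothetical::

   # Match the dataset recordsize to the application's fixed write size.
   zfs create -o recordsize=16K tank/db

   # Match the zvol volblocksize to the 4096-byte page size used by swap on x86.
   zfs create -V 4G -o volblocksize=4K tank/swap
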

Deduplication
~~~~~~~~~~~~~

Deduplication uses an on-disk hash table, using `extensible hashing `__ as implemented in the ZAP (ZFS Attribute Processor). Each cached entry uses slightly more than 320 bytes of memory. The DDT code relies on ARC for caching the DDT entries, such that there is no double caching or internal fragmentation from the kernel memory allocator. Each pool has a global deduplication table shared across all datasets and zvols on which deduplication is enabled. Each entry in the hash table is a record of a unique block in the pool. (The block size is set by the ``recordsize`` or ``volblocksize`` properties.)

The hash table (also known as the DDT or DeDup Table) must be accessed for every dedup-able block that is written or freed (regardless of whether it has multiple references). If there is insufficient memory for the DDT to be cached in memory, each cache miss will require reading a random block from disk, resulting in poor performance. For example, if operating on a single 7200RPM drive that can do 100 io/s, uncached DDT reads would limit overall write throughput to 100 blocks per second, or 400KB/s with 4KB blocks.

The consequence is that sufficient memory to store deduplication data is required for good performance. The deduplication data is considered metadata and therefore can be cached if the ``primarycache`` or ``secondarycache`` properties are set to ``metadata``. In addition, the deduplication table will compete with other metadata for metadata storage, which can have a negative effect on performance. Simulation of the number of deduplication table entries needed for a given pool can be done using the -D option to zdb. Then a simple multiplication by 320 bytes can be done to get the approximate memory requirements. Alternatively, you can estimate an upper bound on the number of unique blocks by dividing the amount of storage you plan to use on each dataset (taking into account that partial records each count as a full recordsize for the purposes of deduplication) by the recordsize and each zvol by the volblocksize, summing and then multiplying by 320 bytes.

.. _metaslab_allocator:

Metaslab Allocator
~~~~~~~~~~~~~~~~~~

ZFS top-level vdevs are divided into metaslabs from which blocks can be independently allocated, allowing concurrent IOs to perform allocations without blocking one another. At present, there is a regression\ `2 `__ on the Linux and Mac OS X ports that causes serialization to occur.

By default, the selection of a metaslab is biased toward lower LBAs to improve performance of spinning disks, but this does not make sense on solid state media. This behavior can be adjusted globally by setting the ZFS module's global metaslab_lba_weighting_enabled tunable to 0. This tunable is only advisable on systems that only use solid state media for pools.

The metaslab allocator will allocate blocks on a first-fit basis when a metaslab has more than or equal to 4 percent free space and on a best-fit basis when a metaslab has less than 4 percent free space. The former is much faster than the latter, but it is not possible to tell when this behavior occurs by looking at the pool's overall free space. However, the command ``zdb -mmm $POOLNAME`` will provide this information.

.. _pool_geometry:

Pool Geometry
~~~~~~~~~~~~~

If small random IOPS are of primary importance, mirrored vdevs will outperform raidz vdevs. Read IOPS on mirrors will scale with the number of drives in each mirror while raidz vdevs will each be limited to the IOPS of the slowest drive.

If sequential writes are of primary importance, raidz will outperform mirrored vdevs. Sequential write throughput increases linearly with the number of data disks in raidz while writes are limited to the slowest drive in mirrored vdevs. Sequential read performance should be roughly the same on each.

Both IOPS and throughput will increase by the respective sums of the IOPS and throughput of each top-level vdev, regardless of whether they are raidz or mirrors.

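
To make the trade-off concrete, the two alternative layouts below use the same six disks: the first favours random IOPS, the second favours sequential throughput and usable space. A sketch only; the short device names are placeholders (``/dev/disk/by-id`` paths are preferable in practice)::

   # Three 2-way mirrors: random read IOPS scale with all six drives.
   zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf

   # One 6-disk raidz2: better sequential write throughput and more usable space,
   # but random IOPS roughly that of the slowest drive in the vdev.
   zpool create tank raidz2 sda sdb sdc sdd sde sdf
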

.. _whole_disks_versus_partitions:

Whole Disks versus Partitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS will behave differently on different platforms when given a whole disk.

On illumos, ZFS attempts to enable the write cache on a whole disk. The illumos UFS driver cannot ensure integrity with the write cache enabled, so by default Sun/Solaris systems using the UFS file system for boot were shipped with the drive write cache disabled (long ago, when Sun was still an independent company). For safety on illumos, if ZFS is not given the whole disk, it could be shared with UFS and thus it is not appropriate for ZFS to enable the write cache. In this case, the write cache setting is not changed and will remain as-is. Today, most vendors ship drives with the write cache enabled by default.

On Linux, the Linux IO elevator is largely redundant given that ZFS has its own IO elevator, so ZFS will set the IO elevator to noop to avoid unnecessary CPU overhead.

ZFS will also create a GPT partition table with its own partitions when given a whole disk under illumos on x86/amd64 and on Linux. This is mainly to make booting through UEFI possible because UEFI requires a small FAT partition to be able to boot the system. The ZFS driver is able to tell whether the pool was given the entire disk via the whole_disk field in the label.

This is not done on FreeBSD. Pools created by FreeBSD will always have the whole_disk field set to true, such that a pool created on FreeBSD and imported on another platform will always be treated as if the whole disks were given to ZFS.

.. _general_recommendations:

General recommendations
-----------------------

.. _alignment_shift:

Alignment shift
~~~~~~~~~~~~~~~

Make sure that you create your pools such that the vdevs have the correct alignment shift for your storage device's sector size. If dealing with flash media, this is going to be either 12 (4K sectors) or 13 (8K sectors). For SSD ephemeral storage on Amazon EC2, the proper setting is 12.

.. _atime_updates:

Atime Updates
~~~~~~~~~~~~~

Set either relatime=on or atime=off to minimize IOs used to update access time stamps. For backward compatibility with the small percentage of software that relies on access times, relatime is preferred when available and should be set on your entire pool. atime=off should be used more selectively.

.. _free_space:

Free Space
~~~~~~~~~~

Keep pool free space above 10% to avoid having many metaslabs reach the 4% free space threshold at which the allocator switches from the first-fit to the best-fit allocation strategy. When the threshold is hit, the `metaslab allocator `__ becomes very CPU intensive in an attempt to protect itself from fragmentation. This reduces IOPS, especially as more metaslabs reach the 4% threshold.

The recommendation is 10% rather than 5% because metaslab selection considers both location and free space unless the global metaslab_lba_weighting_enabled tunable is set to 0. When that tunable is 0, ZFS will consider only free space, so the expense of the best-fit allocator can be avoided by keeping free space above 5%. That setting should only be used on systems with pools that consist of solid state drives because it will reduce sequential IO performance on mechanical disks.

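
Capacity and fragmentation can be watched per pool so that action is taken before many metaslabs cross the threshold. A small sketch, assuming a pool named ``tank``::

   # CAP should stay below roughly 90%; a rising FRAG value is an early warning sign.
   zpool list -o name,size,allocated,free,capacity,fragmentation tank
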

.. _lz4_compression:

LZ4 compression
~~~~~~~~~~~~~~~

Set compression=lz4 on your pools' root datasets so that all datasets inherit it, unless you have a reason not to enable it. Userland tests of LZ4 compression of incompressible data in a single thread have shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. Furthermore, incompressible data will be stored without compression, so reads of incompressible data with compression enabled will not be subject to decompression. Writes are so fast that incompressible data is unlikely to see a performance penalty from the use of LZ4 compression. The reduction in IO from LZ4 will typically be a performance win.

Note that larger record sizes will increase compression ratios on compressible data by allowing compression algorithms to process more data at a time.

.. _nvme_low_level_formatting:

NVMe low level formatting
~~~~~~~~~~~~~~~~~~~~~~~~~

See `Hardware#NVMe_low_level_formatting `__.

.. _pool_geometry_1:

Pool Geometry
~~~~~~~~~~~~~

Do not put more than ~16 disks in raidz. The rebuild times on mechanical disks will be excessive when the pool is full.

.. _synchronous_io:

Synchronous I/O
~~~~~~~~~~~~~~~

If your workload involves fsync or O_SYNC and your pool is backed by mechanical storage, consider adding one or more SLOG devices. Pools that have multiple SLOG devices will distribute ZIL operations across them. The best choice for SLOG device(s) is likely Optane / 3D XPoint SSDs. See `Hardware#Optane_.2F_3D_XPoint_SSDs `__ for a description of them. If an Optane / 3D XPoint SSD is an option, the rest of this section on synchronous I/O need not be read. If Optane / 3D XPoint SSDs are not an option, see `Hardware#NAND_Flash_SSDs `__ for suggestions for NAND flash SSDs and also read the information below.

To ensure maximum ZIL performance on NAND flash SSD-based SLOG devices, you should also overprovision spare area to increase IOPS\ `3 `__. Only about 4GB is needed, so the rest can be left as overprovisioned storage. The choice of 4GB is somewhat arbitrary. Most systems do not write anything close to 4GB to the ZIL between transaction group commits, so overprovisioning all storage beyond the 4GB partition should be alright. If a workload needs more, then make it no more than the maximum ARC size. Even under extreme workloads, ZFS will not benefit from more SLOG storage than the maximum ARC size. That is half of system memory on Linux and 3/4 of system memory on illumos.

.. _overprovisioning_by_secure_erase_and_partition_table_trick:

Overprovisioning by secure erase and partition table trick
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can do this with a mix of a secure erase and a partition table trick, such as the following:

#. Run a secure erase on the NAND-flash SSD.
#. Create a partition table on the NAND-flash SSD.
#. Create a 4GB partition.
#. Give the partition to ZFS to use as a log device.

A sketch of these steps on Linux is shown at the end of this subsection. If using the secure erase and partition table trick, do *not* use the unpartitioned space for other things, even temporarily. That will reduce or eliminate the overprovisioning by marking pages as dirty.

Alternatively, some devices allow you to change the sizes that they report. This would also work, although a secure erase should be done prior to changing the reported size to ensure that the SSD recognizes the additional spare area. Changing the reported size can be done on drives that support it with ``hdparm -N`` on systems that have laptop-mode-tools.

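
A minimal sketch of the partition-table approach on Linux, assuming the freshly erased SSD is ``/dev/sdb`` and the pool is named ``tank`` (both hypothetical); the secure erase itself is vendor-specific and only indicated here::

   # Step 1: secure erase first (vendor-specific; hdparm's ATA security erase is one option).

   # Steps 2 and 3: create a GPT partition table with a single 4GB partition,
   # leaving the rest of the drive unpartitioned as spare area.
   parted -s /dev/sdb mklabel gpt
   parted -s /dev/sdb mkpart primary 1MiB 4GiB

   # Step 4: give the partition to ZFS as a log (SLOG) device.
   zpool add tank log /dev/sdb1
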

.. _nvme_overprovisioning:

NVMe overprovisioning
^^^^^^^^^^^^^^^^^^^^^

On NVMe, you can use namespaces to achieve overprovisioning:

#. Do a sanitize command as a precaution to ensure the device is completely clean.
#. Delete the default namespace.
#. Create a new namespace of size 4GB.
#. Give the namespace to ZFS to use as a log device, e.g. ``zpool add tank log /dev/nvme1n1``.

.. _whole_disks:

Whole disks
~~~~~~~~~~~

Whole disks should be given to ZFS rather than partitions. If you must use a partition, make certain that the partition is properly aligned to avoid read-modify-write overhead. See the section on `Alignment Shift `__ for a description of proper alignment. Also, see the section on `Whole Disks versus Partitions `__ for a description of changes in ZFS behavior when operating on a partition.

Single disk RAID 0 arrays from RAID controllers are not equivalent to whole disks. The `Hardware `__ page explains in detail.

.. _bit_torrent:

Bit Torrent
-----------

Bit torrent performs 16KB random reads/writes. The 16KB writes cause read-modify-write overhead. The read-modify-write overhead can reduce performance by a factor of 16 with 128KB record sizes when the amount of data written exceeds system memory. This can be avoided by using a dedicated dataset for bit torrent downloads with recordsize=16K.

When the files are read sequentially through an HTTP server, the random order in which the pieces were written creates fragmentation that has been observed to reduce sequential read performance by a factor of two on 7200RPM hard disks. If performance is a problem, fragmentation can be eliminated by rewriting the files sequentially in either of two ways:

The first method is to configure your client to download the files to a temporary directory and then copy them into their final location when the downloads are finished, provided that your client supports this.

The second method is to use send/recv to recreate a dataset sequentially.

In practice, defragmenting files obtained through bit torrent should only improve performance when the files are stored on magnetic storage and are subject to significant sequential read workloads after creation.

.. _database_workloads:

Database workloads
------------------

Setting redundant_metadata=mostly can increase IOPS by at least a few percentage points by eliminating redundant metadata at the lowest level of the indirect block tree. This comes with the caveat that data loss will occur if a metadata block pointing to data blocks is corrupted and there are no duplicate copies, but this is generally not a problem in production on mirrored or raidz vdevs.

MySQL
~~~~~

InnoDB
^^^^^^

Make separate datasets for InnoDB's data files and log files. Set recordsize=16K on InnoDB's data files to avoid expensive partial record writes and leave recordsize=128K on the log files. Set primarycache=metadata on both to prefer InnoDB's caching.\ `4 `__ Set logbias=throughput on the data to stop ZIL from writing twice.

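
For example, the datasets described above might be created as follows; the pool and dataset names, and where MySQL is pointed at them, are assumptions::

   # Data files: 16K records to match InnoDB pages, metadata-only ARC caching,
   # and logbias=throughput so the ZIL does not write the data twice.
   zfs create -o recordsize=16K -o primarycache=metadata -o logbias=throughput tank/mysql-data

   # Log files: keep the default 128K recordsize, but still prefer InnoDB's own caching.
   zfs create -o primarycache=metadata tank/mysql-log
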

Set skip-innodb_doublewrite in my.cnf to prevent InnoDB from writing twice. The double writes are a data integrity feature meant to protect against corruption from partially-written records, but those are not possible on ZFS. It should be noted that Percona's blog\ `5 `__ had advocated using an ext4 configuration where double writes were turned off for a performance gain, but later recanted it because it caused data corruption. Following a well-timed power failure, an in-place filesystem such as ext4 can have half of an 8KB record be old while the other half is new. This is the corruption that caused Percona to recant its advice. However, ZFS' copy-on-write design would cause it to return the old, correct data following a power failure (no matter what the timing is). That prevents the corruption that the double write feature is intended to prevent from ever happening. The double write feature is therefore unnecessary on ZFS and can be safely turned off for better performance.

On Linux, the driver's AIO implementation is a compatibility shim that just barely passes the POSIX standard. InnoDB performance suffers when using its default AIO codepath. Set innodb_use_native_aio=0 and innodb_use_atomic_writes=0 in my.cnf to disable AIO. Both of these settings must be disabled to disable AIO.

PostgreSQL
~~~~~~~~~~

Make separate datasets for PostgreSQL's data and WAL. Set recordsize=8K on both to avoid expensive partial record writes. Set logbias=throughput on PostgreSQL's data to avoid writing twice.

SQLite
~~~~~~

Make a separate dataset for the database. Set the recordsize to 64K. Set the SQLite page size to 65536 bytes\ `6 `__.

Note that SQLite databases typically are not exercised enough to merit special tuning, but this will provide it. Note the side effect on cache size mentioned at SQLite.org\ `7 `__.

.. _file_servers:

File servers
------------

Create a dedicated dataset for files being served.

See `Performance_tuning#Sequential_workloads `__ for configuration recommendations.

.. _sequential_workloads:

Sequential workloads
--------------------

Set recordsize=1M on datasets that are subject to sequential workloads. Read `Performance_tuning#Larger_record_sizes `__ for documentation on things that should be known before setting 1M record sizes.

Set compression=lz4 as per the general recommendation for `LZ4 compression `__.

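
A one-line sketch combining both recommendations, assuming a pool named ``tank`` and a share called ``media`` (both hypothetical)::

   # Large records for sequential reads and writes, LZ4 for cheap compression.
   zfs create -o recordsize=1M -o compression=lz4 tank/media
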

.. _video_games_directories:

Video games directories
-----------------------

Create a dedicated dataset, use chown to make it user accessible (or create a directory under it and use chown on that) and then configure the game download application to place games there. Specific information on how to configure various ones is below.

See `Performance_tuning#Sequential_workloads `__ for configuration recommendations before installing games.

Note that the performance gains from this tuning are likely to be small and limited to load times. However, the combination of 1M records and LZ4 will allow more games to be stored, which is why this tuning is documented despite the performance gains being limited. A Steam library of 300 games (mostly from Humble Bundle) that had these tweaks applied to it saw 20% space savings. Both faster load times and significant space savings are possible on compressible games when this tuning has been done. Games whose assets are already compressed will see little to no benefit.

Lutris
~~~~~~

Open the context menu by left-clicking on the triple bar icon in the upper right. Go to "Preferences" and then the "System options" tab. Change the default installation directory and click save.

Steam
~~~~~

Go to "Settings" -> "Downloads" -> "Steam Library Folders" and use "Add Library Folder" to set the directory for Steam to use to store games. Make sure to set it to the default by right-clicking on it and clicking "Make Default Folder" before closing the dialogue.

.. _virtual_machines:

Virtual machines
----------------

Virtual machine images on ZFS should be stored using either zvols or raw files to avoid unnecessary overhead. The recordsize/volblocksize and the guest filesystem's block size should be configured to match to avoid overhead from partial record modification. This would typically be 4K. If raw files are used, a separate dataset should be used to make it easy to configure recordsize independently of other things stored on ZFS.

.. _qemu_kvm_xen:

QEMU / KVM / Xen
~~~~~~~~~~~~~~~~

AIO should be used to maximize IOPS when using files for guest storage.

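
A hedged example of a QEMU invocation using native AIO with a raw image stored on a dedicated dataset; the image path and memory size are placeholders, and ``aio=native`` generally requires ``cache=none`` so that O_DIRECT is used::

   qemu-system-x86_64 -m 4G -enable-kvm \
       -drive file=/tank/vm/guest.raw,format=raw,if=virtio,cache=none,aio=native

With libvirt, the corresponding settings are typically ``io='native'`` and ``cache='none'`` on the disk's ``<driver>`` element.
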