ZFS on Linux/src 20f7b91 tests/zfs-tests/tests/functional/vdev_zaps vdev_zaps_005_pos.ksh

ZTS: Fix vdev_zaps_005_pos on CentOS 6

The ancient version of blkid (v2.17.2) used in CentOS 6 will not
detect the newly created pool unless it has been written to.
Force a pool sync so `zpool import` will detect the newly created
pool.

Reviewed-by: John Kennedy <john.kennedy at delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #9199 

ZFS on Linux/src a9ebdfd module/lua llex.c, module/zfs abd.c vdev_raidz_math_scalar.c

Linux 5.3: Fix switch() fall through compiler errors

Fix some switch() fall-through compiler errors:

    abd.c:1504:9: error: this statement may fall through

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #9170 
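
For illustration, a minimal sketch of the pattern behind this class of error and one common way to mark it as intentional. The function `weight()` is hypothetical and is not taken from abd.c; the sketch assumes GCC's default -Wimplicit-fallthrough level, which accepts a `/* FALLTHROUGH */` comment placed directly before the next case label.

```c
/* Hypothetical example; not taken from abd.c. */
static int
weight(int kind)
{
	int w = 0;

	switch (kind) {
	case 2:
		w += 10;
		/* FALLTHROUGH */
	case 1:
		w += 1;
		break;
	default:
		break;
	}
	return (w);
}
```

Without the comment (or a `break`), newer compilers report "this statement may fall through" at the end of `case 2`.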

ZFS on Linux/src f66a1f8. Makefile.am

Minor cleanup in Makefile.am

Split long lines where adding license info to dist archive.

Remove extra colon from target line.

Reviewed-by: Chris Dunlop <chris at onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ryan Moeller <ryan at ixsystems.com>
Closes #9189 
Delta    File
+11 -6   Makefile.am
+11 -6   1 files

ZFS on Linux/src c759b33 etc/init.d zfs-functions.in

zfs-functions.in: in_mtab() always returns 1

in_mtab() used $fs with the wrong sed command where it should use
$mntpnt, to match the variable exported by read_mtab().

The fix is mostly to reuse the sed command found in read_mtab().

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Michael Niewöhner <foss at mniewoehner.de>
Signed-off-by: Alexey Smirnoff <fling at member.fsf.org>
Closes #9168 

ZFS on Linux/src 325d288 include/sys dsl_deadlist.h, module/zfs dsl_deadlist.c dsl_destroy.c

Add fast path for zfs_ioc_space_snaps() handling of empty_bpobj

When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing.  With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.

By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).

Reviewed-by: Pavel Zakharov <pavel.zakharov at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-58348
Closes #8744 
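
The idea of the fast path can be sketched as follows. This is a simplified model under stated assumptions, not the code from dsl_deadlist.c: `entry_t`, `bpobj_space_slow()`, `deadlist_space()` and the `opens` counter are all hypothetical names. The point it illustrates is that an entry referring to the shared empty bpobj contributes no space, so it can be skipped without the expensive open.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real structures live in dsl_deadlist.c. */
typedef struct entry {
	uint64_t obj;	/* object number of the bpobj */
	uint64_t used;	/* bytes referenced by that bpobj */
} entry_t;

static uint64_t opens;	/* counts simulated "expensive" bpobj opens */

static uint64_t
bpobj_space_slow(const entry_t *e)
{
	opens++;	/* stands in for opening and reading the bpobj */
	return (e->used);
}

static uint64_t
deadlist_space(const entry_t *ents, size_t n, uint64_t empty_obj)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++) {
		if (ents[i].obj == empty_obj)
			continue;	/* fast path: empty bpobj adds nothing */
		total += bpobj_space_slow(&ents[i]);
	}
	return (total);
}
```

When ~99% of the entries are the empty bpobj, as described above, almost every iteration takes the cheap branch.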

ZFS on Linux/src 3beb0a7 module/zfs sa.c

Fix lockdep circular locking false positive involving sa_lock

There are two different deadlock scenarios, but they share a common
link, which is thread 1 holding sa_lock and trying to get
zap->zap_rwlock:
    zap_lockdir_impl+0x858/0x16c0 [zfs]
    zap_lockdir+0xd2/0x100 [zfs]
    zap_lookup_norm+0x7f/0x100 [zfs]
    zap_lookup+0x12/0x20 [zfs]
    sa_setup+0x902/0x1380 [zfs]
    zfsvfs_init+0x3d6/0xb20 [zfs]
    zfsvfs_create+0x5dd/0x900 [zfs]
    zfs_domount+0xa3/0xe20 [zfs]

and thread 2 trying to get sa_lock, either in sa_setup:
   sa_setup+0x742/0x1380 [zfs]
   zfsvfs_init+0x3d6/0xb20 [zfs]
   zfsvfs_create+0x5dd/0x900 [zfs]
   zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
   sa_build_index+0x13d/0x790 [zfs]
   sa_handle_get_from_db+0x368/0x500 [zfs]
   zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
   zfs_znode_alloc+0x3da/0x1a40 [zfs]
   zfs_zget+0x39a/0x6e0 [zfs]

    [18 lines not shown]
Delta    File
+1 -1    module/zfs/sa.c
+1 -1    1 files

ZFS on Linux/src ff4b68e. .gitignore, module Makefile.in

Linux 5.3 compat: Makefile subdir-m no longer supported

Uses obj-m instead, due to kernel changes.

See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Dominic Pearson <dsp at technoanimal.net>
Closes #9169 
Delta    File
+12 -12  module/Makefile.in
+11 -0   .gitignore
+23 -12  2 files

ZFS on Linux/src f6fbe25 contrib/initramfs/scripts zfs.in

Set "none" scheduler if available (initramfs)

Existing zfs initramfs script logic will attempt to set the 'noop' 
scheduler if it's available on the vdev block devices. Newer kernels 
have the similar 'none' scheduler on multiqueue devices; this change 
alters the initramfs script logic to also attempt to set this scheduler 
if it's available.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Garrett Fields <ghfields at gmail.com>
Reviewed-by: Richard Laager <rlaager at wiktel.com>
Signed-off-by: Colm Buckley <colm at tuatha.org>
Closes #9042 

ZFS on Linux/src 1a26cb6 tests/runfiles linux.run, tests/zfs-tests/tests/functional/refquota refquota_008_neg.ksh refquota_007_neg.ksh

Add more refquota tests

It used to be possible for zfs receive (and other operations related 
to clone swap) to bypass refquotas. This can cause a number of issues, 
and there should be an automated test for it.

Added tests for rollback and receive not overriding refquota.

Reviewed-by: Pavel Zakharov <pavel.zakharov at delphix.com>
Reviewed-by: John Kennedy <john.kennedy at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #9139 

ZFS on Linux/src f09fda5 include/sys metaslab_impl.h, man/man5 zfs-module-parameters.5

Cap metaslab memory usage

On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: George Wilson <gwilson at delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy at delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #9128 

ZFS on Linux/src 9323aad contrib/initramfs Makefile.am, contrib/initramfs/hooks zfs.in

initramfs: fixes for (debian) initramfs

* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions
At least Debian needs /etc/default/zfs and /etc/zfs/zfs-functions for
its initramfs. Include both in build when initramfs is configured.

* contrib/initramfs: include 60-zvol.rules and zvol_id
Include 60-zvol.rules and zvol_id and set udev as a predependency instead
of Debian's zdev. This makes Debian's additional zdev hook unneeded.

* Correct initconfdir substitution for some distros
Not every Linux distro uses @sysconfdir@/default; the right location is
@initconfdir@, which is already determined by configure. Let's use it.

* systemd: prevent possible conflict between systemd and sysvinit
Systemd will not load a sysvinit service if a unit exists with the same
name. This prevents conflicts between sysvinit and systemd.
In ZFS there is one sysvinit service that does not have a systemd
service but a target counterpart, zfs-import.target.
Usually it does not make any sense to install both but it is possible.
Let's prevent any conflict by masking zfs-import.service by default.
This does not harm even if init.d/zfs-import does not exist.

Reviewed-by: Chris Wedgwood <cw at f00f.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>

    [5 lines not shown]

ZFS on Linux/src 0f8ff49 module/zfs dbuf.c dsl_pool.c

dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()

Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess by
combining the information in the filed issue with the code.

From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed].  Thus, the only logical explanation
for the bug being hit is that the waiter that was just awakened
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:

[1] The waiter didn't call into any of the two functions - which I
    find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
    with?).
[2] The waiter did call into one of the above functions but it passed 0
    as the space/delta to be dirtied (or undirtied) and then the callee
    returned immediately (e.g. both `dsl_pool_dirty_space()` and
    `dsl_pool_undirty_space()` return immediately when space is 0).


    [72 lines not shown]

ZFS on Linux/src c8bbf7c module/zfs zfs_log.c

Improve write performance by using dmu_read_by_dnode()

In zfs_log_write(), we can use dmu_read_by_dnode() rather than
dmu_read() thus avoiding unnecessary dnode_hold() calls.

We get a 2-5% performance gain for large sequential_writes tests, >=128K
writes to files with recordsize=8K.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen at delphix.com>
Closes #9156 

ZFS on Linux/src 0e37a0f module/zfs dbuf.c dnode.c

Assert that a dnode's bonuslen never exceeds its recorded size

This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.

This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().

A side effect of this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tom Caputi <tcaputi at datto.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #8348 

ZFS on Linux/src e2b31b5 module/zfs zfs_vfsops.c

Make txg_wait_synced conditional in zfsvfs_teardown

The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski at datto.com>
Closes #9115 

ZFS on Linux/src dc04a8c include/sys refcount.h spa.h, module/zfs refcount.c zio.c

Prevent race in blkptr_verify against device removal

When we check the vdev of the blkptr in zfs_blkptr_verify, we can run 
into a race condition where that vdev is temporarily unavailable. This 
happens during a device removal operation, when the old vdev_t has been
removed from the array, but the new indirect vdev has not yet been 
inserted.

We hold the spa_config_lock while doing our sensitive verification. 
To ensure that we don't deadlock, we only grab the lock if we don't 
have config_writer held. In addition, I had to const the tags of the 
refcounts and the spa_config_lock arguments.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #9112 

ZFS on Linux/src 8e556c5 include/sys zfs_znode.h, module/zfs zfs_log.c zil.c

Fix out-of-order ZIL txtype lost on hardlinked files

We should only call zil_remove_async when an object is removed. However,
in the current implementation, it is called whenever a TX_REMOVE record
is logged. In the case of a hardlinked file, every unlink generates
TX_REMOVE, causing operations to be dropped even when the object is not
removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.

Reviewed-by: George Wilson <gwilson at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya at delphix.com>
Signed-off-by: Chunwei Chen <david.chen at nutanix.com>
Closes #8769
Closes #9061 
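
The rule described above can be modelled in a few lines. This is a sketch under assumptions, not the code from zfs_log.c or zil.c: `file_t`, `log_unlink()` and the `async_removed` counter are hypothetical, and the counter merely stands in for the call to zil_remove_async.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a file with a link count (must start > 0). */
typedef struct file {
	uint64_t links;
} file_t;

static int async_removed;	/* counts simulated zil_remove_async() calls */

/* Returns true when the object itself went away. */
static bool
log_unlink(file_t *f)
{
	bool unlinked = (--f->links == 0);

	if (unlinked)
		async_removed++;	/* only now drop pending async records */
	return (unlinked);
}
```

With two hard links, the first unlink leaves pending records alone and only the second one, which removes the object, discards them.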

ZFS on Linux/src 475ebd7 lib/libefi rdwr_efi.c

Fix device expansion when VM is powered off

When running on an ESXi based VM, I've found that "zpool online -e" will
not expand the zpool, if the disk was expanded in ESXi while the VM was
powered off.

For example, take the following scenario:

 1. VM running on top of VMware ESXi
 2. ZFS pool created with a given device "sda" of size 8GB
 3. VM powered off
 4. Device "sda" size expanded to 16GB
 5. VM powered on
 6. "zpool online -e" used on device "sda"

In this situation, after (2) the zpool will be roughly 8GB in size.
After (6), the expectation is the zpool's size will expand to roughly
16GB in size; i.e. expand to the new size of the "sda" device.
Unfortunately, I've seen that after (6), the zpool size does not change.

What's happening is after (5), the EFI label of the "sda" device will be
such that fields "efi_last_u_lba", "efi_last_lba", and "efi_altern_lba"
all reflect the new size of the disk; i.e. "33554398", "33554431", and
"33554431" respectively.


    [25 lines not shown]
Delta    File
+90 -28  lib/libefi/rdwr_efi.c
+90 -28  1 files

ZFS on Linux/src d2a3291 module/zfs dsl_dataset.c

Mark dsl_livelist_should_disable() static

This function is not used outside of dsl_dataset.c

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Sara Hartse <sara.hartse at delphix.com>
Signed-off-by: Allan Jude <allanjude at freebsd.org>
Closes #9154 

ZFS on Linux/src c8242a9 cmd/zdb zdb.c, include/sys spa_impl.h

spa_load_verify() may consume too much memory

When a pool is imported it will scan the pool to verify the integrity 
of the data and metadata. The amount it scans will depend on the 
import flags provided. On systems with small amounts of memory or 
when importing a pool from the crash kernel, it's possible for 
spa_load_verify to issue so many I/Os that it consumes all the memory 
of the system resulting in an OOM message or a hang.

To prevent this, we limit the amount of memory that the initial pool
scan can consume. This change will, by default, use 1/16th of the ARC
for scan I/Os to prevent running the system out of memory during import.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Signed-off-by: George Wilson <george.wilson at delphix.com>
External-issue: DLPX-65237
External-issue: DLPX-65238
Closes #9146 
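
As a rough illustration of the 1/16th cap, the arithmetic might look like the sketch below. All names here (`VERIFY_SHIFT`, `verify_limit()`, `verify_can_issue()`) are hypothetical, and the sketch assumes the cap is tracked as bytes of in-flight scan I/O; the commit message does not spell out the accounting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shift: 1/16th of the ARC target size. */
#define	VERIFY_SHIFT	4

static uint64_t
verify_limit(uint64_t arc_target_bytes)
{
	return (arc_target_bytes >> VERIFY_SHIFT);
}

/* True if another I/O of 'size' bytes fits under the in-flight cap. */
static bool
verify_can_issue(uint64_t inflight, uint64_t size, uint64_t limit)
{
	return (inflight + size <= limit);
}
```

For a 1 GiB ARC target this gives a 64 MiB cap; a caller would wait for outstanding I/Os to complete whenever `verify_can_issue()` returns false.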

ZFS on Linux/src a43570c include/sys zfs_znode.h, module/zfs zfs_znode.c zfs_vnops.c

Change boolean-like uint8_t fields in znode_t to boolean_t

Given znode_t is an in-core structure, it's more readable to have
them as boolean. Also co-locate existing boolean fields with them
for space efficiency (expecting 8 booleans to be packed/aligned).

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #9092 

ZFS on Linux/src fccbd1d include/spl/sys kmem_cache.h, module/spl spl-zlib.c

Drop KMC_NOEMERGENCY

This is not implemented. If it were implemented, using it would risk
deadlocks on pre-3.18 kernels. Let's just drop it.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Michael Niewöhner <foss at mniewoehner.de>
Signed-off-by: Richard Yao <ryao at gentoo.org>
Closes #9119 

ZFS on Linux/src 3b9edd7 man/man8 zfs-program.8, module/zfs zcp_iter.c

Introduce getting holds and listing bookmarks through ZCP

Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor refactoring was also
applied to distinguish between user and system properties
in ZCP.

Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Ported-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Signed-off-by: Dan Kimmel <dan.kimmel at delphix.com>

OpenZFS-issue: https://illumos.org/issues/8862
Closes #7902 

ZFS on Linux/src 2081db7 module/zfs spa_log_spacemap.c

Sort log spacemap tunables in alphabetical order

Although this commit is really just a nit, it should bring
the diff of the spa_log_spacemap.c source file between
ZoL and delphix/zfs down to zero.

Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Chris Dunlop <chris at onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9143 

ZFS on Linux/src c81f179 cmd/zdb zdb.c, include/sys metaslab_impl.h range_tree.h

Metaslab max_size should be persisted while unloaded

When we unload metaslabs today in ZFS, the cached max_size value is
discarded. We instead use the histogram to determine whether or not we
think we can satisfy an allocation from the metaslab. This can result in
situations where, if we're doing I/Os of a size not aligned to a
histogram bucket, a metaslab is loaded even though it cannot satisfy the
allocation we think it can. For example, a metaslab with 16 entries in
the 16k-32k bucket may have entirely 16kB entries. If we try to allocate
a 24kB buffer, we will load that metaslab because we think it should be
able to handle the allocation. Doing so is expensive in CPU time, disk
reads, and average IO latency. This is exacerbated if the write being
attempted is a sync write.

This change makes ZFS cache the max_size after the metaslab is
unloaded. If we ever get a free (or a coalesced group of frees) larger
than the max_size, we will update it. Otherwise, we leave it as is. When
attempting to allocate, we use the max_size as a lower bound, and
respect it unless we are in try_hard. However, we do age the max_size
out at some point, since we expect the actual max_size to increase as we
do more frees. A more sophisticated algorithm here might be helpful, but
this works reasonably well.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>

    [2 lines not shown]
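
The 16kB/24kB example from the message above can be worked through with a small model. This is a sketch only: `hist_might_fit()`, `max_size_fits()` and `NBUCKETS` are hypothetical names, and the power-of-two bucket layout is an assumption made for illustration, not a copy of the metaslab code.

```c
#include <stdbool.h>
#include <stdint.h>

#define	NBUCKETS	32

/* hist[i] counts free segments with size in [2^i, 2^(i+1)). */
static bool
hist_might_fit(const uint64_t hist[NBUCKETS], uint64_t asize)
{
	for (int i = NBUCKETS - 1; i >= 0; i--) {
		/* The largest non-empty bucket bounds what could fit. */
		if (hist[i] != 0)
			return (asize < (2ULL << i));
	}
	return (false);
}

/* With a cached max_size, the answer is exact for that snapshot. */
static bool
max_size_fits(uint64_t cached_max_size, uint64_t asize)
{
	return (asize <= cached_max_size);
}
```

With 16 segments in the 16k-32k bucket, the histogram check accepts a 24kB request even if every segment is exactly 16kB, whereas a cached max_size of 16kB rejects it without loading the metaslab.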

ZFS on Linux/src 99e755d module/zfs fm.c

Don't wakeup unnecessarily in 'zpool events -f'

ZED can prevent CPUs from properly sleeping.

Rather than periodically waking up in the zevents code, just go to
sleep and wait for a wakeup.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: DHE <git at dehacked.net>
Closes #9091
Delta    File
+1 -2    module/zfs/fm.c
+1 -2    1 files

ZFS on Linux/src 8098465 tests/runfiles linux.run, tests/zfs-tests/tests/functional/removal removal_cancel.ksh Makefile.am

Test cancelling a removal in ZTS

This patch adds a new test that sanity checks cancelling a removal.

Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: John Kennedy <john.kennedy at delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9101

ZFS on Linux/src 48be0df module/zfs dsl_pool.c

lockdep false positive - move txg_kick() outside of ->dp_lock

This fixes a lockdep warning by breaking a link between ->tx_sync_lock
and ->dp_lock.

The deadlock envisioned by lockdep is this:
    thread 1 holds db->db_mtx and tries to get dp->dp_lock:
        dsl_pool_dirty_space+0x70/0x2d0 [zfs]
        dbuf_dirty+0x778/0x31d0 [zfs]

    thread 2 holds bpo->bpo_lock and tries to get db->db_mtx:
        dmu_buf_will_dirty_impl
        dmu_buf_will_dirty+0x6b/0x6c0 [zfs]
        bpobj_iterate_impl+0xbe6/0x1410 [zfs]

    thread 3 holds tx->tx_sync_lock and tries to get bpo->bpo_lock:
        bpobj_space+0x63/0x470 [zfs]
        dsl_scan_active+0x340/0x3d0 [zfs]
        txg_sync_thread+0x3f2/0x1370 [zfs]

    thread 4 holds dp->dp_lock and tries to get tx->tx_sync_lock
       txg_kick+0x61/0x420 [zfs]
       dsl_pool_need_dirty_delay+0x1c7/0x3f0 [zfs]

This patch is originally from Brian Behlendorf and slightly simplified

    [10 lines not shown]

ZFS on Linux/src cae97c8 man/man5 zpool-features.5

List log_spacemap feature in zpool-features.5 manual

Update zpool-features.5 manpage to describe the log_spacemap feature.

Reviewed-by: Matthew Ahrens <mahrens at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Pavel Zakharov <pavel.zakharov at delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9096

ZFS on Linux/src f489458. configure.ac, contrib Makefile.am

Add channel program for property based snapshots

Channel programs that many users find useful should be included with zfs
in the /contrib directory. This is the first of these contributions: a
channel program to recursively take snapshots of datasets with the
property com.sun:auto-snapshot=true.

Reviewed-by: Kash Pande <kash at tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Clint Armstrong <clint at clintarmstrong.net>
Closes #8443 
Closes #9050 

ZFS on Linux/src 1ba4f3e module/zfs spa_log_spacemap.c

9072 handle error of zap_cursor_retrieve() for log spacemap zap

In spa_ld_log_sm_metadata(), it is possible for zap_cursor_retrieve()
to return errors other than ENOENT, which is the expected result when
we are at the end of the zap. Ensure that these error cases are handled
correctly by the import path.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Sara Hartse <sara.hartse at delphix.com>
Reviewed-by: Matt Ahrens <matt at delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9074
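
The error-handling shape described above can be sketched with a toy cursor. Everything here is hypothetical (`cursor_t`, `cursor_retrieve()`, `cursor_advance()`, `load_entries()`); the real zap cursor API has different signatures. The sketch only shows the distinction between ENOENT as the normal end of iteration and any other error.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical cursor over an array of return codes, one per retrieve. */
typedef struct cursor {
	const int *rc;
	size_t pos;
} cursor_t;

static int
cursor_retrieve(cursor_t *c)
{
	return (c->rc[c->pos]);
}

static void
cursor_advance(cursor_t *c)
{
	c->pos++;
}

/* Returns 0 on a clean end of iteration, or the first real error. */
static int
load_entries(cursor_t *c, size_t *nloaded)
{
	int err;

	*nloaded = 0;
	while ((err = cursor_retrieve(c)) == 0) {
		(*nloaded)++;
		cursor_advance(c);
	}
	if (err != ENOENT)
		return (err);	/* e.g. EIO: propagate to the caller */
	return (0);		/* ENOENT: reached the end */
}
```

A loop that only tests for a zero return would treat an I/O error exactly like the end of the zap; checking the final error code is what lets the caller fail the load instead.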

ZFS on Linux/src 2fcf448 module/zfs metaslab.c

mismerged log spacemap comment for metaslab_verify_weight_and_frag

When the log spacemap commit was merged in ZoL, the
metaslab_verify_unflushed_changes() debugging function
was deleted as the feature was pretty much stable by
then. Unfortunately though there was a reference to
it from a comment in metaslab_verify_weight_and_frag().

This patch deletes the reference and pastes that
comment as is.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Igor Kozhukhov <igor at dilos.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9097 

ZFS on Linux/src a6c8289 contrib/initramfs Makefile.am, contrib/initramfs/hooks Makefile.am

install path fixes

* rpm: correct pkgconfig path

pkgconfig files get installed to $datarootdir/pkgconfig but rpm expects
them to be at $datadir. This works when $datarootdir==$datadir which is
the case most of the time but will fail when they differ.

* install: make initramfs-tools path static

The initramfs-tools path is not something we control, since it is an
external package, so it does not make sense to install the zfs additions
anywhere else. Simply use /usr/share/initramfs-tools as the path.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Laager <rlaager at wiktel.com>
Signed-off-by: Michael Niewöhner <foss at mniewoehner.de>
Closes #9087 

ZFS on Linux/src 85ce79b lib/libzfs libzfs_util.c

Increase default zcmd allocation to 256K

When creating hundreds of clones (for example using containers with
LXD) cloning slows down as the number of clones increases over time.
The reason for this is that the fetching of the clone information
using a small zcmd buffer requires two ioctl calls, one to determine
the size and a second to return the data. However, this requires
gathering the data twice, once to determine the size and again to
populate the zcmd buffer to return it to userspace.
These are expensive ioctl() calls, so instead, make the default buffer
size much larger: 256K.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Colin Ian King <colin.king at canonical.com>
Signed-off-by: Michael Niewöhner <foss at mniewoehner.de>
Closes #9084 
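
The two-call pattern and the effect of a larger default can be sketched with a simulated ioctl. This is a model under assumptions: `fake_ioctl()`, `fetch()` and `ioctl_calls` are hypothetical, the 16K starting size used in the usage note is an arbitrary stand-in for "a small zcmd buffer", and the real libzfs code differs.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int ioctl_calls;		/* counts simulated ioctl() round trips */

/* Hypothetical ioctl: needs 'need' bytes; reports the size on ENOMEM. */
static int
fake_ioctl(char *buf, size_t *len, size_t need)
{
	ioctl_calls++;
	if (*len < need) {
		*len = need;	/* tell the caller how much to allocate */
		return (ENOMEM);
	}
	memset(buf, 'x', need);	/* stands in for gathering the data */
	return (0);
}

/* 'initial' must be nonzero. Returns 0 on success, else an errno. */
static int
fetch(size_t initial, size_t need)
{
	size_t len = initial;
	char *buf = malloc(len);
	int err;

	if (buf == NULL)
		return (ENOMEM);
	while ((err = fake_ioctl(buf, &len, need)) == ENOMEM) {
		char *nbuf = realloc(buf, len);	/* grow and retry */
		if (nbuf == NULL)
			break;
		buf = nbuf;
	}
	free(buf);
	return (err);
}
```

In this model, `fetch(16 * 1024, 100 * 1024)` costs two round trips while `fetch(256 * 1024, 100 * 1024)` costs one, which is the saving the larger default is after.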

ZFS on Linux/src 0eb8ba6 module/zfs sa.c zfs_vnops.c

Improve performance by using dmu_tx_hold_*_by_dnode()

In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand.  This eliminates some calls to dnode_hold(), which can be
expensive.  This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.

We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.

This also removes some unnecessary casts that are in the area.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
Closes #9081 

ZFS on Linux/src 1e620c9 tests/runfiles linux.run, tests/zfs-tests/cmd/online_recv online_recv.c Makefile.am

Revert "Develop tests for issues #5866 and #8858"

This reverts commit 693c1fc478cc8118dd0168c4815c0ae3be41c9c3.  This
change resulted in a kmem leak being observed in existing code which
needs to be identified and addressed.

Reviewed-by: Paul Zuchowski <pzuchowski at datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Issue #8978
Closes #9090 

ZFS on Linux/src adf495e module/lua ldo.c

Fix channel programs on s390x

When adapting the original sources for s390x the JMP_BUF_CNT was
mistakenly halved due to an incorrect assumption of the size of
an unsigned long, which is 8 bytes on the s390x architecture.
Increase JMP_BUF_CNT accordingly.

Authored-by: Don Brady <don.brady at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reported-by: Colin Ian King <canonical.com>
Tested-by: Colin Ian King <canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #8992
Closes #9080
Delta    File
+1 -1    module/lua/ldo.c
+1 -1    1 files

ZFS on Linux/src 453bb47 etc/systemd/system zfs-share.service.in

Race between zfs-share and zfs-mount services

When a system boots the zfs-mount.service and the
zfs-share.service can start simultaneously. What may be
unclear is that sharing a filesystem will first mount
the filesystem if it's not already mounted. This means
that both services can race to mount the same filesystem.
This race can result in a SEGFAULT or an EBUSY condition.

This change explicitly defines the start ordering between the
two services such that the zfs-mount.service is solely
responsible for mounting filesystems eliminating the race
between "zfs mount -a" and "zfs share -a" commands.

Reviewed-by: Sebastien Roy <sebastien.roy at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: George Wilson <george.wilson at delphix.com>
Closes #9083 

ZFS on Linux/src 693c1fc tests/runfiles linux.run, tests/zfs-tests/cmd/online_recv online_recv.c Makefile.am

Develop tests for issues #5866 and #8858

Provide zfstest coverage for these two issues which
were a panic accessing extended attributes and
a problem comparing 64 bit and 32 bit generation
numbers.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski at datto.com>
Issue #5866
Issue #8858 
Closes #8978 

ZFS on Linux/src 9fb6abe tests/zfs-tests/tests/functional/suid suid_write_to_file.c suid_write_to_none.ksh

Implement secpolicy_vnode_setid_retain()

Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.

https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #9035 
Closes #9043 

ZFS on Linux/src 4b5c9d9 cmd/zed/agents zfs_agents.c

zed crashes when devid not present

zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.

zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.

Typically this is not a problem because usually either all vdevs have
devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devid's and some do
not, then the problem is easily reproduced.  This can happen if the pool
has been moved from a system that has devid's to one that does not.

The fix is for zfs_agent_iter_vdev() to only try to match the devid's if
both nvl and gsp have devid's present.

Reviewed-by: Prashanth Sreenivasa <pks at delphix.com>
Reviewed-by: Don Brady <don.brady at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu at gmail.com>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-65090

    [2 lines not shown]
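
The guard described in the fix can be sketched as a standalone helper. This is an illustration, not the code from zfs_agents.c: `devid_matches()` and its parameters are hypothetical, with `event_devid` standing in for the devid from the nvl and `search_devid` for gs_devid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Either argument may be NULL when that side has no devid. */
static bool
devid_matches(const char *event_devid, const char *search_devid)
{
	if (event_devid == NULL || search_devid == NULL)
		return (false);	/* nothing to compare; do not dereference */
	return (strcmp(event_devid, search_devid) == 0);
}
```

Checking both sides before the comparison covers the mixed case described above, where some vdevs have devids and others do not.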

ZFS on Linux/src 37f03da cmd/zdb zdb.c, module/zfs spa.c dsl_deadlist.c

Fast Clone Deletion

Deleting a clone requires finding the blocks that are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it's time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef at ixsystems.com>
Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Signed-off-by: Sara Hartse <sara.hartse at delphix.com>
Closes #8416 

ZFS on Linux/src d274ac5 module/zfs zfs_ioctl.c

Don't directly cast unsigned long to void*

Cast to uintptr_t first for portability on integer to/from pointer
conversion.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #9065 
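
A minimal sketch of the conversion in both directions. The helper names `arg_to_ptr()` and `ptr_to_arg()` are hypothetical and are not taken from zfs_ioctl.c; the round trip is lossless only where unsigned long is at least as wide as a pointer, as on Linux.

```c
#include <stdint.h>

/* Integer to pointer: go through uintptr_t rather than casting directly. */
static void *
arg_to_ptr(unsigned long arg)
{
	return ((void *)(uintptr_t)arg);
}

/* Pointer to integer: same intermediate type in the other direction. */
static unsigned long
ptr_to_arg(const void *p)
{
	return ((unsigned long)(uintptr_t)p);
}
```

uintptr_t is the integer type C defines for holding a converted pointer, so routing the cast through it makes the intent explicit to the compiler and to readers.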

ZFS on Linux/src 1ff4682 include/sys dmu_zfetch.h, module/zfs dmu_zfetch.c dnode.c

Replace zf_rwlock with a mutex

The rwlock implementation on Linux does not perform as well as mutexes.
We can realize a performance benefit by replacing the zf_rwlock with a
mutex.  Local microbenchmarks show ~50% improvement, and over NFS we see
~5% improvement on several of the ZFS Performance Tests cases,
especially randwrite and seq_write.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
Closes #9062 

ZFS on Linux/src 09276fd module/zfs zfs_vnops.c

Fix module_param() type for zfs_read_chunk_size

zfs_read_chunk_size is unsigned long.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #9051 

ZFS on Linux/src d84c5a1 tests/zfs-tests/tests/functional/cli_root/zpool_status zpool_status_-c_searchpath.ksh zpool_status_-c_homedir.ksh, tests/zfs-tests/tests/functional/cli_user/zpool_status zpool_status_-c_searchpath.ksh zpool_status_003_pos.ksh

Move some tests to cli_user/zpool_status

The tests in tests/functional/cli_root/zpool_status should all require
root. However, linux.run has "user =" specified for those tests, which
means they run as a normal user.  When I removed that line to run them
as root, the following tests did not pass:

zpool_status_003_pos
zpool_status_-c_disable
zpool_status_-c_homedir
zpool_status_-c_searchpath

These tests need to be run as a normal user.  To fix this, move these
tests to a new tests/functional/cli_user/zpool_status directory.

Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80 at gmail.com>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #9057 

ZFS on Linux/src 7f31908 module/zfs metaslab.c

Tricky semantics of ms_max_size in metaslab_should_allocate()

metaslab_should_allocate() is used in two places:
[1] When trying to select a metaslab to allocate from
[2] When trying to allocate from a metaslab

In [2] we always expect the metaslab to be loaded, and after
the refactoring of the log spacemap changes, whenever we load
a metaslab we set ms_max_size to the biggest range in the
ms_allocatable tree. Thus, when it is used in [2], if that
field is 0, it means that the metaslab doesn't have any
segments that can be used for allocations now (though it may
have some free space but that space can be in the freeing,
freed, or deferred trees).

In [1] a metaslab can be loaded or unloaded at which point 0
can either mean the metaslab doesn't have any space or the
metaslab is just not loaded thus we go ahead and try to make
an estimation based on its weight.

The issue here is when we call the above function for [2] and
the metaslab doesn't have any allocatable space, we still go
ahead and check its ms_weight which may be out of date because
we haven't run metaslab_sync_done() yet. At that point we are
allowing an allocation to be attempted even though we know

    [9 lines not shown]
Delta    File
+10 -7   module/zfs/metaslab.c
+10 -7   1 files
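
The distinction between the loaded and unloaded cases can be sketched as follows. This is an assumption-laden illustration, not the code from the patch (part of whose description is not shown above): `ms_sketch_t`, `should_allocate_sketch()` and the `ms_weight_est` field are hypothetical, and the weight-based estimate is reduced to a single precomputed number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the metaslab fields discussed above. */
typedef struct ms_sketch {
	bool	 ms_loaded;	/* is the allocatable tree in memory? */
	uint64_t ms_max_size;	/* largest allocatable range, valid if loaded */
	uint64_t ms_weight_est;	/* assumed: size estimate derived from weight */
} ms_sketch_t;

static bool
should_allocate_sketch(const ms_sketch_t *ms, uint64_t asize)
{
	/* Loaded: ms_max_size is authoritative, even when it is 0. */
	if (ms->ms_loaded)
		return (ms->ms_max_size >= asize);

	/* Unloaded: 0 is ambiguous, so fall back to the weight estimate. */
	return (ms->ms_weight_est >= asize);
}
```

With this shape, a loaded metaslab whose `ms_max_size` is 0 is rejected immediately instead of being judged by a weight that may be out of date.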

ZFS on Linux/src 43a8536 cmd/ztest ztest.c, include libzfs.h

Race condition between spa async threads and export

In the past we've seen multiple race conditions that have
to do with open-context threads, async threads, and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, thereby eliminating any race condition
bugs introduced by concurrent calls to this function.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9015 
Closes #9044 
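
The "only one thread in the main body, everyone else gets an error" behaviour can be sketched with a C11 `atomic_flag`. This is a swapped-in technique for illustration, not the mechanism used in spa_export_common(); `pool_sketch_t`, `export_enter()`, `export_exit()` and `ERR_EXPORT_BUSY_SKETCH` are all hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define	ERR_EXPORT_BUSY_SKETCH	(-1)	/* hypothetical error code */

/* Hypothetical stand-in for the per-pool state. */
typedef struct pool_sketch {
	atomic_flag ps_exporting;	/* set while an export is running */
} pool_sketch_t;

static void
pool_sketch_init(pool_sketch_t *ps)
{
	atomic_flag_clear(&ps->ps_exporting);
}

/*
 * Try to become the single exporter. atomic_flag_test_and_set() returns
 * the previous value, so "true" means another caller is already inside.
 */
static int
export_enter(pool_sketch_t *ps)
{
	if (atomic_flag_test_and_set(&ps->ps_exporting))
		return (ERR_EXPORT_BUSY_SKETCH);
	return (0);
}

/* Leave the guarded section so that a later export can proceed. */
static void
export_exit(pool_sketch_t *ps)
{
	atomic_flag_clear(&ps->ps_exporting);
}
```

A caller would wrap the export body between `export_enter()` and `export_exit()` and return the error to its own caller when entry fails, rather than waiting.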

ZFS on Linux/src 1c44a5c module/zfs arc.c

hdr_recl calls zthr_wakeup() on destroyed zthr

There exists a race condition where hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:

[1] hdr_recl() runs first and goes into zthr_wakeup()
    because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
    that flag, destroying the zthr, and goes into
    buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
    and we blow up.

This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9047 
Delta    File
+16 -4   module/zfs/arc.c
+16 -4   1 files

ZFS on Linux/src bac15c1 cmd/zdb zdb.c

zdb: don't print log spacemap stats in pools without the feature

Creating a pool with no features enabled and running
`zdb -mmmmmm` on it before the patch:

```
Log Space Maps in Pool:

Log Space Map Obsolete Entry Statistics:
0        valid entries out of 0        - txg 0
0        valid entries out of 0        - total
```

After this patch the above output goes away.

Reviewed-by: Matt Ahrens <matt at delphix.com>
Reviewed-by: Sara Hartse <sara.hartse at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Closes #9048 
Delta    File
+6 -0    cmd/zdb/zdb.c
+6 -0    1 files
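
The behaviour change amounts to an early return when the feature is absent. A self-contained sketch of that guard, assuming a hypothetical `pool_info_sketch_t` with a plain boolean in place of zdb's real feature check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the pool state that zdb inspects. */
typedef struct pool_info_sketch {
	bool pi_log_spacemap_enabled;	/* assumed flag for the feature */
	unsigned long pi_valid;		/* valid log space map entries */
	unsigned long pi_total;		/* total log space map entries */
} pool_info_sketch_t;

/*
 * Print the statistics only when the feature is enabled.
 * Returns the number of lines written to out.
 */
static int
dump_log_spacemap_stats(const pool_info_sketch_t *pi, FILE *out)
{
	/* Without the feature there is nothing meaningful to report. */
	if (!pi->pi_log_spacemap_enabled)
		return (0);

	fprintf(out, "Log Space Map Obsolete Entry Statistics:\n");
	fprintf(out, "%-8lu valid entries out of %-8lu - total\n",
	    pi->pi_valid, pi->pi_total);
	return (2);
}
```

For a pool created without the feature the function writes nothing, which matches the "output goes away" behaviour described above.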