zio: try to execute TYPE_NULL ZIOs on the current task
Many TYPE_NULL ZIOs are used to provide a sync point for child ZIOs, and
do not do any actual work themselves. However, they are still dispatched
to a dedicated, single-thread taskq, which leads to their execution
being entirely task switch and dequeue overhead for no actual reason.
This commit changes it so that when selecting a parent ZIO to execute,
if the parent is TYPE_NULL and has no done function (that is, no
additional work), it is executed on the same thread. This reduces task
switches and frees up CPU cores for other work.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #16134
vdev probe to slow disk can stall mmp write checker
Simplify vdev probes in the zio_vdev_io_done context to
avoid holding the spa config lock for a long duration.
Also allow zpool clear if no evidence of another host
is using the pool.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Don Brady <don.brady at klarasystems.com>
Closes #15839
Fix arcstats for FreeBSD after zfetch support
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #16141
Overflowing refreservation is bad
Someone came to me and pointed out that you could pretty
readily cause the refreservation calculation to exceed
2**64, given the 2**17 multiplier in it, and produce
refreservations wildly less than the actual volsize in cases where
it should have failed.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain at gmail.com>
Closes #15996
GCC: Fixes for gcc 14 on Fedora 40
- Workaround dangling pointer in uu_list.c (#16124)
- Fix calloc() transposed arguments in zpool_vdev_os.c
- Make some temp variables unsigned to prevent triggering a
'-Werror=alloc-size-larger-than' error.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #16124
Closes #16125
Fix updating the zvol_htable when renaming a zvol
When renaming a zvol, insert it into zvol_htable using the new name, not
the old name. Otherwise some operations won't work. For example,
"zfs set volsize" while the zvol is open.
Sponsored by: Axcient
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk at axcient.com>
Signed-off-by: Alan Somers <asomers at FreeBSD.org>
Closes #16127
Closes #16128
Python 3.12 deprecated python3-distutils
As for python-3.12 the distutils package has been deprecated.
The latest ax_python_devel.m4 macro from the autoconf archive
has been updated accordingly so let's pull in the new version.
We can also drop the changes made to our customized version
to continue if the development version is not installed since
this functionality has been included upstream.
Reviewed-by: Rich Ercolani <rincebrain at gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #16126
Closes #16129
Fast Dedup: ZAP Shrinking
This allows ZAPs to shrink. When there are two empty sibling leafs,
one of them is collapsed and its storage space is reused.
This improved performance on directories that at one time contained
a large number of files, but many or all of those files have since
been deleted.
This also applies to all other types of ZAPs as well.
Sponsored-by: iXsystems, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Alexander Stetsenko <alex.stetsenko at klarasystems.com>
Closes #15888
Make more taskq parameters writable
There is no reason for these module parameters to be read-only.
Being modified they just apply on next pool import/creation, that
is useful for testing different values.
Reviewed-by: Rich Ercolani <rincebrain at gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16118
L2ARC: Cleanup buffer re-compression
When compressed ARC is disabled, we may have to re-compress when
writing into L2ARC. If doing so we can't fit it into the original
physical size, we should just fail immediately, since even if it
may still fit into allocation size, its checksum will never match.
While there, refactor the code similar to other compression places
without using abd_return_buf_copy().
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16038
zfs-kmod: fix empty rpm requires/conflicts
Fix an error in zfs-kmod.spec that causes kmod-zfs packages not to
include the correct RPM requires/conflicts relationships. With this
change applied, RPM correctly no longer allows kmod-zfs & zfs-dkms
packages to be installed together.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Todd Seidelmann <18294602+seidelma at users.noreply.github.com>
Closes #16121
Refactor dbuf_read() for safer decryption
In dbuf_read_verify_dnode_crypt():
- We don't need original dbuf locked there. Instead take a lock
on a dnode dbuf, that is actually manipulated.
- Block decryption for a dnode dbuf if it is currently being
written. ARC hash lock does not protect anonymous buffers, so
arc_untransform() is unsafe when used on buffers being written,
that may happen in case of encrypted dnode buffers, since they
are not copied by dbuf_dirty()/dbuf_hold_copy().
In dbuf_read():
- If the buffer is in flight, recheck its compression/encryption
status after it is cached, since it may need arc_untransform().
Tested-by: Rich Ercolani <rincebrain at gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16104
zfs get: add '-t fs' and '-t vol' options
Make `zfs get` accept `fs` for `filesystem` and `vol` for `volume`.
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ryan <errornointernet at envs.net>
Closes #16117
ztest: use ASSERT3P to compare pointers
With a sufficiently modern gcc (I saw this with gcc13), gcc complains
when casting pointers to an integer of a different type (even a larger
one). On 32-bt ASSERT3U does this on 32-bit systems by casting a 32-bit
pointer to uint64_t so use ASSERT3P which uses uintptr_t.
Fixes: 5caeef02fa53 RAID-Z expansion feature
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis at sri.com>
Closes #16115
ZTS: user_namespace_004.ksh avoid error in cleanup if unsupported
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Seth Troisi <sethtroisi at google.com>
Closes #16114
Add newline to two zpool messages
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Seth Troisi <sethtroisi at google.com>
Closes #16113
Parallel pool import
This commit allow spa_load() to drop the spa_namespace_lock so
that imports can happen concurrently. Prior to dropping the
spa_namespace_lock, the import logic will set the spa_load_thread
value to track the thread which is doing the import.
Consumers of spa_lookup() retain the same behavior by blocking
when either a thread is holding the spa_namespace_lock or the
spa_load_thread value is set. This will ensure that critical
concurrent operations cannot take place while a pool is being
imported.
The zpool command is also enhanced to provide multi-threaded support
when invoking zpool import -a.
Lastly, zinject provides a mechanism to insert artificial delays
when importing a pool and new zfs tests are added to verify parallel
import functionality.
[4 lines not shown]
abd_iter_page: rework to handle multipage scatterlists
Previously, abd_iter_page() would assume that every scatterlist would
contain a single page (compound or no), because that's all we ever
create in abd_alloc_chunks(). However, scatterlists can contain multiple
pages of arbitrary provenance, and if we get one of those, we'd get all
the math wrong.
This reworks things to handle multiple pages in a scatterlist, by
properly finding the right page within it for the given offset, and
understanding better where the end of the page is and not crossing it.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reported-by: Brian Atkinson <batkinson at lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Brian Atkinson <batkinson at lanl.gov>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #16108
Handle FLUSH errors as "expected"
Before #16061 zio_vdev_io_done() was not used for FLUSH requests.
Addition of it triggers reprobe each TXG for vdevs not supporting
them. Since those errors are often expected, they are normally
handled by individual vdev drivers and should be ignored here.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16110
tests/quota: consistently clear quota property between tests
When run in isolation, quota_005_pos would fail in cleanup because it
would attempt restore the previous quota, which was 0, and so get an
error (because you can't set quota to '0', you have to use 'none').
It worked as part of the quota tag set because the previous tests did
not clean up their quota, so there was always a non-zero quota to return
to.
This adds a simple quota reset function, and has all quota tests run it
at cleanup. For the ones that weren't cleaning up, they now do, and for
quota_005_pos, which was trying to do the right thing, it now just
resets it.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
[2 lines not shown]
tests/quota_005_pos: use a long int for doubling the quota size
When run in isolation, quota_005_pos would see an empty ~300G dataset.
Doubling it's space overflows a int32, which meant it was trying to then
set the quota to a negative value, and would fail.
When run as part of the quota tests, the filesystem appears to have
stuff in it, and so a lower available space, which doesn't overflow, and
so succeeds.
The bare minimum fix seems to be to use a int64 for the available space,
so it can be comfortably doubled. Here it is.
(Also a typo fix and a tiny bit of cleanup).
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
[2 lines not shown]
bdev_discard_supported: understand discard_granularity=0
Kernel documentation for the discard_granularity property says:
A discard_granularity of 0 means that the device does not support
discard functionality.
Some older kernels had drivers (notably loop, but also some USB-SATA
adapters) that would set the QUEUE_FLAG_DISCARD capability flag, but
have discard_granularity=0. Since 5.10 (torvalds/linux at b35fd7422c2f) the
discard entry point blkdev_issue_discard() has had a check for this,
which would immediately reject the call with EOPNOTSUPP, and throw a
scary diagnostic message into the log. See #16068.
Since 6.8, the block layer sets a non-zero default for
discard_granularity (torvalds/linux at 3c407dc723bb), and a future kernel
will remove the check entirely[1].
As such, there's no good reason for us to enable discard when
[10 lines not shown]
Add zfetch stats in arcstats
arc_summary also reports zfetch stats but it's inconvenient to monitor
contiguously incrementing numbers. Adding them in arcstats allows us to
observe streams more conveniently.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #16094
Fix: FreeBSD Arm64 does not build currently
The define LD_VERSION isn't defined on FreeBSD Arm64 when OpenZFS is
build with the default compiler: clang.
I used only gcc for testing - my fault.
Fast fix as suggested by @mmatuska
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Martin Matuska <mm at FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs at mcmilk.de>
Closes #16103
L2ARC: Relax locking during write
Previous code held ARC state sublist lock throughout all L2ARC
write process, which included number of allocations and even ZIO
issues. Being blocked in any of those places the code could also
block ARC eviction, that could cause OOM activation or even dead-
lock if system is low on memory or one is too fragmented.
Fix it by dropping the lock as soon as we see a block eligible
for L2ARC writing and pick it up later using earlier inserted
marker. While there, also reduce scope of hash lock, moving
ZIO allocation and other operations not requiring header access
out of it. All operations requiring header access move under
hash lock, since L2_WRITING flag does not prevent header eviction
only transition to arc_l2c_only state with L1 header.
To be able to manipulate sublist lock and marker as needed add few
more multilist functions and modify one.
[3 lines not shown]
BRT: Skip getting length in brt_entry_lookup()
Unlike DDT, where ZAP values may have different lengths due to
compression, all BRT entries are identical 8-byte counters. It
does not make sense to first fetch the length only to assert it.
zap_lookup_uint64() is specifically designed to work with counters
of different size and should return error if something odd found.
Calling it straight allows to save some measurable CPU time.
Reviewed-by: Pawel Jakub Dawidek <pawel at dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15950
BRT: Make BRT block sizes configurable
Similar to DDT make BRT data and indirect block sizes configurable
via module parameters. I am not sure what would be the best yet,
but similar to DDT 4KB blocks kill all chances of compression on
vdev with ashift=12 or more, that on my tests reaches 3x.
While here, fix documentation for respective DDT parameters.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15967
ZAP: Massively switch to _by_dnode() interfaces
Before this change ZAP called dnode_hold() for almost every block
access, that was clearly visible in profiler under heavy load, such
as BRT. This patch makes it always hold the dnode reference between
zap_lockdir() and zap_unlockdir(). It allows to avoid most of dnode
operations between those. It also adds several new _by_dnode() APIs
to ZAP and uses them in BRT code. Also adds dmu_prefetch_by_dnode()
variant and uses it in the ZAP code.
After this there remains only one call to dmu_buf_dnode_enter(),
which seems to be unneeded. So remove the call and the functions.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15951
BRT: Skip duplicate BRT prefetches
If there is a pending entry for this block, then we've already
issued BRT prefetch for it within this TXG, so don't do it again.
BRT vdev lookup and following zap_prefetch_uint64() call can be
pretty expensive and should be avoided when not necessary.
Reviewed-by: Pawel Jakub Dawidek <pawel at dawidek.net>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15941
Update resume token at object receive.
Before this change resume token was updated only on data receive.
Usually it is enough to resume replication without much overlap.
But we've got a report of a curios case, where replication source
was traversed with recursive grep, which through enabled atime
modified every object without modifying any data. It produced
several gigabytes of replication traffic without a single data
write and so without a single resume point.
While the resume token was not designed to resume from an object,
I've found that the send implementation always sends object before
any data. So by requesting resume from offset 0 we are effectively
resuming from the object, followed (or not) by the data at offset
0, just as we need it.
Reviewed-by: Allan Jude <allan at klarasystems.com>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15927