hammer2 - performance pass
* Get rid of vfs.hammer2.cluster_write and stop using cluster_write()
for the block device I/O. cluster_write() interacted badly with the
common unlock/lock sequences on chains, which acquire and retire
the DIO, and thus usually also the underlying buffer, many times
before the buffer really needs to be committed.
This greatly reduces unnecessary writes to disk.
* Increase HAMMER2_FLUSH_DEPTH_LIMIT to 60. It was set to 10 for
debugging purposes. This created an O(N^2) overhead situation
in hammer2_flush(). 20,000 dirty inodes would translate to
30 million chain scans, resulting in CPU-bound stalls for long
periods of time.
Fixing this value reduces a 20,000 dirty inode flush to around
200,000 chain scans (100x faster).
* Use hammer2_chain_ref_hold() and hammer2_chain_drop_unhold()
to reduce the amount of buffer cache buffer cycling that occurs
during a flush, by retaining the DIO associated with a parent
chain across its unlock/recurse/relock sequence.
The number of buffers held locked is limited by the flush recursion
depth.
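
For reference, the scan counts quoted above for the
HAMMER2_FLUSH_DEPTH_LIMIT change can be checked with a few lines of
C. This is only an arithmetic sketch: scans_per_inode() and
scan_reduction() are hypothetical helper names and are not part of
hammer2. The only inputs are the figures stated in this message.

```c
#include <stdint.h>

/*
 * Hypothetical helpers (not hammer2 code): plain integer arithmetic
 * on the scan counts quoted in the commit message.
 */

/* Average number of chain scans per dirty inode in one flush. */
uint64_t
scans_per_inode(uint64_t chain_scans, uint64_t dirty_inodes)
{
	return (dirty_inodes ? chain_scans / dirty_inodes : 0);
}

/* Factor by which the total scan count dropped. */
uint64_t
scan_reduction(uint64_t scans_before, uint64_t scans_after)
{
	return (scans_after ? scans_before / scans_after : 0);
}
```

With the quoted figures, scans_per_inode(30000000, 20000) gives 1500
and scans_per_inode(200000, 20000) gives 10, so the per-inode cost no
longer grows with the number of dirty inodes. scan_reduction(30000000,
200000) gives 150, the same order of magnitude as the roughly 100x
speedup noted above.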