DragonFlyBSD/src 8e5d7c4 sys/kern/subr_cpu_topology.c, sys/platform/pc64/acpica/acpi_srat.c

kernel - Fix NUMA contention due to asymmetric memory

* Fix NUMA contention in situations where memory is associated
  with CPU cores asymmetrically.  In particular, with the 2990WX,
  half the cores will have no memory associated with them.

* This was forcing DFly to allocate memory from queues belonging to
  other nearby cores, causing unnecessary SMP contention as well
  as burning extra time iterating queues.

* Fix by calculating the average number of free pages per-core,
  and then adjusting any VM page queue holding fewer pages than the
  average by stealing pages from queues holding more than the average.
  We use a simple iterator to steal pages, so the CPUs with less
  (or zero) direct-attached memory will operate more UMA-like
  (just on 4K boundaries instead of 256-1024 byte boundaries).
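
  The averaging step described above can be sketched roughly as follows.
  This is a toy model under stated assumptions, not the code in
  sys/vm/vm_page.c: it rebalances plain per-core counters rather than
  real VM page queues, and the function name rebalance_free_pages and
  the free_pages array are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not the kernel implementation).
 * free_pages[i] is the number of free pages on the queue owned by
 * core i.  Every queue below the per-core average is topped up to the
 * average by taking pages from queues that are above it.
 */
void
rebalance_free_pages(long *free_pages, size_t ncores)
{
	long total = 0;
	long avg;
	size_t donor = 0;	/* simple forward iterator over donors */
	size_t i;

	assert(ncores > 0);

	/* Average number of free pages per core (rounded down). */
	for (i = 0; i < ncores; ++i)
		total += free_pages[i];
	avg = total / (long)ncores;

	for (i = 0; i < ncores; ++i) {
		while (free_pages[i] < avg) {
			long want = avg - free_pages[i];
			long take;

			/* Skip queues with nothing to spare. */
			while (donor < ncores && free_pages[donor] <= avg)
				++donor;
			if (donor == ncores)
				return;	/* no surplus left anywhere */

			/* Steal only the donor's surplus above the average. */
			take = free_pages[donor] - avg;
			if (take > want)
				take = want;
			free_pages[donor] -= take;
			free_pages[i] += take;
		}
	}
}
```

  In this model, four cores with {0, 0, 100, 100} free pages end up
  with 50 each, which mirrors the 2990WX case where half the cores
  start with no memory of their own.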

* Tested with a 64-thread concurrent compile test.  systat -pv 1
  showed all remaining contention disappear.  Literally, *ZERO*
  contention when we run the test with each thread in its own jail
  with no shared resources.

* NOTE!  This fix is specific to asymmetric NUMA configurations,
  which are fairly rare in the wild, and will not speed up more
  conventional systems.

* Before and after timings on the 2990WX.

  cd /tmp/src
  time make -j 128 nativekernel NO_MODULES=TRUE > /dev/null

  BEFORE
  703.915u 167.605s 0:49.97 1744.0%       9993+749k 22188+8io 216pf+0w
  699.550u 171.148s 0:50.87 1711.5%       9994+749k 21066+8io 150pf+0w

  AFTER
  678.406u 108.857s 0:45.66 1724.1%       10105+757k 22188+8io 216pf+0w
  674.805u 115.256s 0:46.67 1692.8%       10077+755k 21066+8io 150pf+0w

  This is a 4.2 second difference on the second run (0:50.87 down
  to 0:46.67), an improvement of just over 8%, which is nothing to
  sneeze at.
Delta     File
+153 -16  sys/vm/vm_page.c
+16 -1    sys/kern/subr_cpu_topology.c
+7 -1     sys/vm/vm_page.h
+2 -0     sys/sys/cpu_topology.h
+1 -0     sys/platform/pc64/acpica/acpi_srat.c
+179 -18  5 files
