Joe Obbish walks us through a scenario with automatic soft-NUMA leading to poor performance:
Consider a server with soft-NUMA nodes of 8 schedulers with
MAXDOP 8. The first parallel query will be sent to numa node 0. The number of active workers matches the number of schedulers exactly so each active worker is assigned to a different scheduler in the NUMA node. The second parallel query will be sent to NUMA node 1. The third parallel query will be sent to NUMA node 2, and so on. Execution of serial queries or creation of sessions does not matter. That advances a counter that’s separate from the “global enumerator” used for parallel query scheduler placement. As far as I can tell the scheduler assigned to execution context 0 does not affect the scheduling of the parallel worker threads, although it can certainly affect parallel query performance.The scenario described above doesn’t sound so bad. It can work well if the parallel queries take roughly about the same amount of time to complete and query
MAXDOPmatches the number of schedulers per soft-NUMA node. Problems can emerge when at least one of those is not true. With the spread selection type it’s possible that the amount of work already assigned to schedulers has no effect on parallel query scheduler placement. Let that sink in. You could have 100 serial queries all assigned to schedulers in numa node 0 but SQL Server may send a parallel query to that NUMA node. It depends on the position of the “global enumerator” as opposed to current work on the server.
Joe offers up some alternatives if you find yourself dealing with this issue. Definitely a must-read.