Joe Chang throws down the gauntlet:
Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.
Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.
Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4’s.
This is some very interesting research. Joe has a some gaps in his research (meaning that there’s scope for people to expand upon this), but this is 100% worth the read.