I wrote a python program that runs a producer and a consumer for 30 minutes with different message sizes and measures how many messages per second it can deliver, or the Kafka cluster throughput.
I did not care about the message content, so the consumer only reads the messages from the topic and then discards them. I used a Three partition topic. I guess that on larger clusters with more partitions the performance will be better, but the message size – throughput ratio will remain roughly the same.
So I wrote a small python program that generates a dummy message in the desired size, then spawns two threads, one is a producer and the other is a consumer. The producer send the same message over and over and the consumer reads the messages from the topic and count how many messages it has read. The main program stops after 30 minutes but before it stops it prints how many messages were consumed and how many messages were consumed per second.
Read on for the results. More importantly, test in your own environment with your own equipment, as that value’s likely to differ a bit.
There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.
That the single processor system has low memory latency was interesting but non-actionable bit of knowledge, until recently. The widespread practice in IT world was to have the 2-way system as the baseline standard. Single socket systems were relegated to small business and turnkey solutions. From long ago to a few years ago, there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave new reason to continue the 2-way as baseline practice.
Joe points out that for a highly-used transactional system, the lower memory latency might make a single-socket server perform better than a multi-socket server.
It might seem that the 2-socket system continues to be a good choice, as two processors with an intermediate number of cores is less expensive than one processor with twice as many cores. An example is the Xeon Gold 6132 14-core versus the Xeon Platinum 8180 28-core processors. In addition, the two-socket system has twice the memory capacity and nominally twice as much memory bandwidth.
So, end of argument, right? Well, no.
Click through for his argument in favor of single-socket machines for OLTP systems.
AMD is really pushing the idea of a single-socket EPYC system as a better alternative to a two-socket Intel system for many server workloads. According to AMD, it will be much less expensive, yet will have plenty of cores, memory, and PCIe 3.0 lanes, along with no NUMA overhead. One key advantage AMD is touting is their Infinity Fabric modular interconnect technology, that works both within a single processor and between multiple processors.
For SQL Server 2016/2017 usage, you would still want the “top of the line” SKU for a given physical core count, to get the most performance for each physical core license that you buy. Unlike Intel, AMD does not increase the base clock speed in the lower core count models. These EPYC systems have a lot of PCIe 3.0 lanes and very high memory density, so they might work really well for large SQL Server DW/Reporting workloads. For OLTP workloads, the key will be how much single-threaded performance AMD is able to get from this first-generation of EPYC, and how they compare to Intel’s new Skylake-SP processors. Figure 3 shows the fastest EPYC processor at each core count, which is what you would want for SQL Server usage.
There aren’t too many hard numbers yet, but the worst case scenario is that they force Intel to improve their offerings.
All execution plans iterators that require memory grants have two fundamental code paths, one path for when the memory grant is blown and memory spills out into tempdb and one for when the memory grant is correct or under-estimated. Perhaps the database engine team may at some point include a third option, which is for when the grant can be accommodated inside the CPU cache.
As an example, if you run a log record generation intensive workload on the same CPU socket as the log writer, usually socket 0, this will run in a shorter time compared to running the exact same workload in a different socket
This is the type of post where I catch just enough of it to know that I need to dig deeper and learn more.
The Skylake-SP has a different cache architecture that changes from a shared-distributed model used in Broadwell-EP/EX to a private-local model used in Skylake-SP. How this change will affect SQL Server workloads remains to be seen.
In Broadwell-EP/EX, each physical core had a 256KB private L2 cache, while all of the cores shared a larger L3 cache that could be as large as 60MB (typically 2.5MB/core). All of the lines in the L2 cache for each core were also present in the inclusive, shared L3 cache.
In Skylake-SP, each physical core has a 1MB private L2 cache, while all of the cores share a larger L3 cache that can be as large as 38.5MB (typically 1.375MB/core). All of the lines in the L2 cache for each core may not be present in the non-inclusive, shared L3 cache.
A larger L2 cache increases the hit ratio from the L2 cache, resulting in lower effective memory latency and lowered demand on the L3 cache and the mesh interconnect. L2 cache is typically about 4X faster than L3 cache in Skylake-SP. Figure 2 details the new cache architecture changes in Skylake-SP.
Glenn explains what the performance ramifications of these changes are, and also gives a consumer caveat regarding a major price difference based on memory capacity per socket.
That’s from the “Hardware and Software Requirements for Installing SQL Server” product page. I’ve had people ask if I was using a Raspberry Pi, or some other Micro ATX PC. The answer is neither; the problem with a Pi is that it’s not a 64-bit processor. Pis use ARM architecture, and SQL Server doesn’t (yet) support ARM. Also, most Pi 3’s run at 1.2Ghz and only support 1GB of RAM. As for MicroATX form factor PCs, they’re closer to what we’d need, but they’re still heavy. Plus, you’d need a pretty substantial power supply that we couldn’t (safely) support that high up in the sky. Even if you stripped it down to bare components, it’d be pushing it.
There are companies that make small SoC solutions, but after evaluating them we determined that they were either pretty flaky or got so hot they risked bursting into flames even just booting into Windows. Instead, we found a really unique piece of hardware: the Intel Joule.
You wants 1×6 (one socket, 6 cores) because standard edition will only use the first 4 sockets in a server (up to 16 cores combined). There’s no getting around that.
From a NUMA perspective as long a vNUMA at the Hypervisor is disabled then it doesn’t matter as SQL Server standard edition isn’t NUMA aware (NUMA awareness is an Enterprise Edition feature).
Read on for a more nuanced answer when it comes to Enterprise Edition.
An example of this is if you wish to configure 384GB of RAM on a new server. The server has 24 memory slots. You could populate each of the memory slots with 16GB sticks of memory to get to the 384GB total. Or, you could spend a bit more money to buy 32GB sticks of memory and only fill up half of the memory slots. Your outcome is the same amount of RAM. Your price tag on the memory is slightly higher than the relatively cheaper smaller sticks.
In this configuration, your 16GB DIMM configuration runs the memory 22% slower than if you buy the higher density sticks. Check out page 63 of the server build guide for an HPE Proliant DL380 Gen9 server. The fully populated 16GB stick configuration runs the memory at 1866 MHz. If you only fill in the 32GB sticks on half the slots, the memory runs at 2400 MHz.
Very interesting information.
Since SQL Server 2016 Standard Edition has such a low per-instance memory limit, you should purposely choose an appropriate memory configuration that will let you use all of the license-limit memory while also getting the best memory performance possible. Only populating one DIMM per memory channel will give you the absolute best memory performance supported by your processor(s).
The major server vendors, such as Dell, offer detailed guidance on the possible memory configurations for their servers, depending on the number and specific type of processor selected. For SQL Server 2016 Standard Edition in a two-socket server with two Intel Xeon E5-2600 v4 family processors, choosing eight, 32GB DDR4 DIMMs would give you 256GB of RAM, running at the maximum supported speed of 2400MT/s.
This would allow you to set max server memory (for the buffer pool) to 131,072 MB (128GB), and still have plenty of memory left over for the operating system and for possible use by columnstore indexes and in-memory-OLTP. You would also have sixteen empty DIMM slots that could be used for future RAM expansion (which you could take advantage of if you did a subsequent Edition upgrade to Enterprise Edition). Another use for some of those empty DIMM slots would be for “tail of the log caching” on NVDIMMs (which is supported in SQL Server 2016 Standard Edition with SP1).
Click through for some very helpful advice. If your budget is tight enough that Enterprise Edition is out of the question, it’d be terrible to pick something which wastes hardware or, even worse, wastes hardware while still forcing you to pay more for licensing.