All execution plans iterators that require memory grants have two fundamental code paths, one path for when the memory grant is blown and memory spills out into tempdb and one for when the memory grant is correct or under-estimated. Perhaps the database engine team may at some point include a third option, which is for when the grant can be accommodated inside the CPU cache.
As an example, if you run a log record generation intensive workload on the same CPU socket as the log writer, usually socket 0, this will run in a shorter time compared to running the exact same workload in a different socket
This is the type of post where I catch just enough of it to know that I need to dig deeper and learn more.
The Skylake-SP has a different cache architecture that changes from a shared-distributed model used in Broadwell-EP/EX to a private-local model used in Skylake-SP. How this change will affect SQL Server workloads remains to be seen.
In Broadwell-EP/EX, each physical core had a 256KB private L2 cache, while all of the cores shared a larger L3 cache that could be as large as 60MB (typically 2.5MB/core). All of the lines in the L2 cache for each core were also present in the inclusive, shared L3 cache.
In Skylake-SP, each physical core has a 1MB private L2 cache, while all of the cores share a larger L3 cache that can be as large as 38.5MB (typically 1.375MB/core). All of the lines in the L2 cache for each core may not be present in the non-inclusive, shared L3 cache.
A larger L2 cache increases the hit ratio from the L2 cache, resulting in lower effective memory latency and lowered demand on the L3 cache and the mesh interconnect. L2 cache is typically about 4X faster than L3 cache in Skylake-SP. Figure 2 details the new cache architecture changes in Skylake-SP.
Glenn explains what the performance ramifications of these changes are, and also gives a consumer caveat regarding a major price difference based on memory capacity per socket.
That’s from the “Hardware and Software Requirements for Installing SQL Server” product page. I’ve had people ask if I was using a Raspberry Pi, or some other Micro ATX PC. The answer is neither; the problem with a Pi is that it’s not a 64-bit processor. Pis use ARM architecture, and SQL Server doesn’t (yet) support ARM. Also, most Pi 3’s run at 1.2Ghz and only support 1GB of RAM. As for MicroATX form factor PCs, they’re closer to what we’d need, but they’re still heavy. Plus, you’d need a pretty substantial power supply that we couldn’t (safely) support that high up in the sky. Even if you stripped it down to bare components, it’d be pushing it.
There are companies that make small SoC solutions, but after evaluating them we determined that they were either pretty flaky or got so hot they risked bursting into flames even just booting into Windows. Instead, we found a really unique piece of hardware: the Intel Joule.
You wants 1×6 (one socket, 6 cores) because standard edition will only use the first 4 sockets in a server (up to 16 cores combined). There’s no getting around that.
From a NUMA perspective as long a vNUMA at the Hypervisor is disabled then it doesn’t matter as SQL Server standard edition isn’t NUMA aware (NUMA awareness is an Enterprise Edition feature).
Read on for a more nuanced answer when it comes to Enterprise Edition.
An example of this is if you wish to configure 384GB of RAM on a new server. The server has 24 memory slots. You could populate each of the memory slots with 16GB sticks of memory to get to the 384GB total. Or, you could spend a bit more money to buy 32GB sticks of memory and only fill up half of the memory slots. Your outcome is the same amount of RAM. Your price tag on the memory is slightly higher than the relatively cheaper smaller sticks.
In this configuration, your 16GB DIMM configuration runs the memory 22% slower than if you buy the higher density sticks. Check out page 63 of the server build guide for an HPE Proliant DL380 Gen9 server. The fully populated 16GB stick configuration runs the memory at 1866 MHz. If you only fill in the 32GB sticks on half the slots, the memory runs at 2400 MHz.
Very interesting information.
Since SQL Server 2016 Standard Edition has such a low per-instance memory limit, you should purposely choose an appropriate memory configuration that will let you use all of the license-limit memory while also getting the best memory performance possible. Only populating one DIMM per memory channel will give you the absolute best memory performance supported by your processor(s).
The major server vendors, such as Dell, offer detailed guidance on the possible memory configurations for their servers, depending on the number and specific type of processor selected. For SQL Server 2016 Standard Edition in a two-socket server with two Intel Xeon E5-2600 v4 family processors, choosing eight, 32GB DDR4 DIMMs would give you 256GB of RAM, running at the maximum supported speed of 2400MT/s.
This would allow you to set max server memory (for the buffer pool) to 131,072 MB (128GB), and still have plenty of memory left over for the operating system and for possible use by columnstore indexes and in-memory-OLTP. You would also have sixteen empty DIMM slots that could be used for future RAM expansion (which you could take advantage of if you did a subsequent Edition upgrade to Enterprise Edition). Another use for some of those empty DIMM slots would be for “tail of the log caching” on NVDIMMs (which is supported in SQL Server 2016 Standard Edition with SP1).
Click through for some very helpful advice. If your budget is tight enough that Enterprise Edition is out of the question, it’d be terrible to pick something which wastes hardware or, even worse, wastes hardware while still forcing you to pay more for licensing.
Let me tell you the story of a small BI project I did a couple of years back. The project building a small data mart for the local branch of a very large multinational in consumer goods. The data was gathered by the retailers in the country, who send the details back to the company for analyses of the sales performance. The data was to be hosted internally at the company, not at the client. However, the server was commissioned but not yet delivered. In the meantime, I would simply start developing on my own machine – which was quite a powerful laptop with 8 cores and 32GB of RAM – and we would migrate the solution once the server arrived.
Read on for what happened. I was half-expecting “And that’s when my laptop became the new server” to be the punch line.
It might seem that given the pace of hardware change, such old model cannot possibly valid, resulting horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency for 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1 per HDD. So, it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.
Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3 and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS and 200 IOPS for the 15K HDD. But no further progress was made on HDD rotational speed. In same time period, hard disk sequential IO phenomenally quickly exceeding 100MB/s in the fourth generation 15K HDD around 2005.
In other words, the SQL Server cost model is based on a 1350/320 = 4.2 ratio. But 15K HDDs in the mid-2000’s were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential to 200 IOPS random for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.
It’s worth taking the time to read. I like Glenn Berry’s idea in the comments of building relative CPU/IO/memory measures and applying them rather than using the same values that were good for twenty years ago.
There are different types of SRAM. High-performance SRAM has 6 transistors, 6T. Intel may use 8T Intel Labs at ISSCC 2012 or even 10T for low power? (see real world tech NTV). It would seem that SRAM should be about six times less dense than DRAM, depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.
There is a Micron slide in Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1, or 0.67Gbit SRAM at the same die size as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.
Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3? Then 1G could fit in a 250mm2 die, plus area for the signals going off-die.
There is a lot of depth in this blog post.
So having read up on automatic soft-NUMA, I was eager to see what it did with my main production servers when I upgraded them. My main pair of production servers (they are paired into an Availability Group) have 4 NUMA nodes with 16 physical cores per node and hyperthreading for a total of 32 logical cores per node with 1.5 TB of RAM. Obviously, we are using core-based Enterprise Edition for these servers. I thought I knew what automatic soft-NUMA would do, and wanted to confirm if my expectations were right.
Read on, but it looks like there’s a “to be continued…” here.