Your Memory Configuration Could Slow Down Your System

David Klee warns against just throwing memory into servers willy-nilly:

An example of this is if you wish to configure 384GB of RAM on a new server. The server has 24 memory slots. You could populate each of the memory slots with 16GB sticks of memory to get to the 384GB total. Or, you could spend a bit more money to buy 32GB sticks of memory and only fill up half of the memory slots. Your outcome is the same amount of RAM. Your price tag on the memory is slightly higher than the relatively cheaper smaller sticks.

In this configuration, your 16GB DIMM configuration runs the memory 22% slower than if you buy the higher density sticks. Check out page 63 of the server build guide for an HPE Proliant DL380 Gen9 server. The fully populated 16GB stick configuration runs the memory at 1866 MHz. If you only fill in the 32GB sticks on half the slots, the memory runs at 2400 MHz.

Very interesting information.

Standard Edition Hardware

Glenn Berry tackles the question of maximizing bang for buck with hardware for SQL Server Standard Edition:

Since SQL Server 2016 Standard Edition has such a low per-instance memory limit, you should purposely choose an appropriate memory configuration that will let you use all of the license-limit memory while also getting the best memory performance possible. Only populating one DIMM per memory channel will give you the absolute best memory performance supported by your processor(s).

The major server vendors, such as Dell, offer detailed guidance on the possible memory configurations for their servers, depending on the number and specific type of processor selected. For SQL Server 2016 Standard Edition in a two-socket server with two Intel Xeon E5-2600 v4 family processors, choosing eight, 32GB DDR4 DIMMs would give you 256GB of RAM, running at the maximum supported speed of 2400MT/s.

This would allow you to set max server memory (for the buffer pool) to 131,072 MB (128GB), and still have plenty of memory left over for the operating system and for possible use by columnstore indexes and in-memory-OLTP. You would also have sixteen empty DIMM slots that could be used for future RAM expansion (which you could take advantage of if you did a subsequent Edition upgrade to Enterprise Edition). Another use for some of those empty DIMM slots would be for “tail of the log caching” on NVDIMMs (which is supported in SQL Server 2016 Standard Edition with SP1).

Click through for some very helpful advice.  If your budget is tight enough that Enterprise Edition is out of the question, it’d be terrible to pick something which wastes hardware or, even worse, wastes hardware while still forcing you to pay more for licensing.

Checking Server Specs

Koen Verbeeck had an issue with server power:

Let me tell you the story of a small BI project I did a couple of years back. The project building a small data mart for the local branch of a very large multinational in consumer goods. The data was gathered by the retailers in the country, who send the details back to the company for analyses of the sales performance. The data was to be hosted internally at the company, not at the client. However, the server was commissioned but not yet delivered. In the meantime, I would simply start developing on my own machine – which was quite a powerful laptop with 8 cores and 32GB of RAM – and we would migrate the solution once the server arrived.

Read on for what happened.  I was half-expecting “And that’s when my laptop became the new server” to be the punch line.

Thoughts On Cost-Based Optimizers

Joe Chang has the makings of an academic paper on the shortcomings of the current SQL Server cost optimizer model:

It might seem that given the pace of hardware change, such old model cannot possibly valid, resulting horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency for 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1 per HDD. So, it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.

Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3 and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS and 200 IOPS for the 15K HDD. But no further progress was made on HDD rotational speed. In same time period, hard disk sequential IO phenomenally quickly exceeding 100MB/s in the fourth generation 15K HDD around 2005.

In other words, the SQL Server cost model is based on a 1350/320 = 4.2 ratio. But 15K HDDs in the mid-2000’s were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential to 200 IOPS random for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.

It’s worth taking the time to read.  I like Glenn Berry’s idea in the comments of building relative CPU/IO/memory measures and applying them rather than using the same values that were good for twenty years ago.

Thinking Post-DRAM

Joe Chang argues that we may benefit more from a hardware architecture which uses lower-latency, lower-capacity RAM:

There are different types of SRAM. High-performance SRAM has 6 transistors, 6T. Intel may use 8T Intel Labs at ISSCC 2012 or even 10T for low power? (see real world tech NTV). It would seem that SRAM should be about six times less dense than DRAM, depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.

There is a Micron slide in Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1, or 0.67Gbit SRAM at the same die size as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.

Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3? Then 1G could fit in a 250mm2 die, plus area for the signals going off-die.

There is a lot of depth in this blog post.

Automatic Soft-NUMA In SQL Server

Robert Davis wants to find information on soft-NUMA in his SQL Server instance:

So having read up on automatic soft-NUMA, I was eager to see what it did with my main production servers when I upgraded them. My main pair of production servers (they are paired into an Availability Group) have 4 NUMA nodes with 16 physical cores per node and hyperthreading for a total of 32 logical cores per node with 1.5 TB of RAM. Obviously, we are using core-based Enterprise Edition for these servers. I thought I knew what automatic soft-NUMA would do, and wanted to confirm if my expectations were right.

Read on, but it looks like there’s a “to be continued…” here.

Thinking About Memory Latency

Joe Chang throws down the gauntlet:

Naturally, the database and application should be architected together with the SQL Server NUMA tuning options to support good scaling on multi-socket NUMA systems. If we neglected this in the original design, I am sure many DBA-developers know how well such a suggestion would be received by management.

Is there another option? Well yes. Get rid of the NUMA latency issue with a non-NUMA system. Such a system has a single processor socket, hence one memory node. Before anyone scoffs, the one socket is not just a Xeon E3 with four cores.

Still, a single quad-core processor today is 40 times more powerful than a 4-way system from twenty years ago (100,000 tpm-C per core is probably possible if TPC-C were still in use, versus 10,000 on a 4-way in 1996). The Xeon E3 could probably support many medium sized organizations. Maximum memory capacity is 64GB (4x16GB unbuffered ECC DIMMs, $130 each). My recollection is that many IO problems went away at the 32-64GB level. And we could still have powerful IO with 2 PCI-E x8 SSDs, or even 2 x4’s.

This is some very interesting research.  Joe has a some gaps in his research (meaning that there’s scope for people to expand upon this), but this is 100% worth the read.

Wiring A Raspberry Pi 3

Drew Furgiuele begins his project to build an easy button for backups:

I should also pause for a second and talk about wiring hobby boards like this. Good news first: you won’t electrocute yourself on it. I mean, if you do something really dumb like try to wire it underwater or eat it or something then maybe you could but you shouldn’t ever receive a shock while working with a board like this, even plugged in. The bad news is that even though you won’t damage yourself, you could very well damage the board if you just randomly plug things in. Here’s a hard and fast rule: if you’re not an electronics expert or an electrical engineer, leave it to experts to tell you where and how to wire. I’m not calling myself an expert here, but I have sort of a basic understanding of how to wire these things up. The point I’m attempting at making is: if you want to really learn and understand circuit design, there are lots of great resources of where to get started. And it’s quite a rabbit hole to go down, but it’s well worth your time if you want to learn more.

Read the whole thing.  Over a weekend, with your Pi 3.

SQL Server Easy Button

Drew Furgiuele has an easy button for SQL Server:

Sounds great, right? I bet some of you already already thinking “Oh man, I can’t wait to run the Linux version SQL Server on this thing!” There’s just one really big catch: the CPUs on Pi boards are ARM-architecture based. Unlike modern processors in our desktops and laptops, these chips are more akin to what you find in mobile phones or other small devices. It also means programs you run or write on your computer are probably 32 or 64 bit and designed for Intel or AMD processors. ARM is a completely different architecture, so we can’t upload something to it and expect it to run. Programs have to be designed for it.

Furthermore a lot of “stock” Pi operating system images are Linux based so it can be difficult to write code that interfaces with .NET or Windows-based services. Not that you can’t; you can certainly write bash scripts that make wget or curl requests.

Based on my experiences at least, nothing with Windows IoT was really that easy…  This is an intro post with a shopping list attached, so I am looking forward to Drew’s making everybody’s lives easier on a budget of $98.

New Samsung Drives

Joe Chang looks at the Samsung 960 PRO SSD:

All the previous PCI-E x4 gen3 NVMe SSDs were rated between 2,000-2,500MB/s in large block read. The 960 Pro is rated for 3,500MB/s read. This is pretty much the maximum possible bandwidth for PCI-E x4 gen3. Each PCI-E gen3 lane is 8Gbit/s, but the realizable bandwidth is less. In earlier generation products, an upper bound of 800MB/s realizable per 8Gbit/s nominal signaling rate was typical.

Presumably there was a reason why every PCI-E x4 was in the 2000-2500MB/s bandwidth. It could be that these were 8-channel controllers and the NAND interface was 333-400MB/s. Even though 8 x 400MB/s = 3,200MB/s, it is expected that excess bandwidth is necessary on the downstream side. The could be other reasons as well, perhaps the DRAM for caching NAND meta-data. Intel had an 18-channel controller, which produced 2,400MB/s in the P750 line, and 2,800MB/s in the P3x00 line.

If you’re looking at a test lab server, this might be a good disk for you.


April 2017
« Mar