Press "Enter" to skip to content

Category: Hardware

Persistent Memory for SQL Server on Linux

The SQL Server team shows how you can configure persistent memory for SQL Server on Linux:

With the release of SQL Server 2019 on Linux, Microsoft introduced persistent memory (PMEM) support on Linux. This is an exciting development, as previous versions of SQL Server on Linux didn’t support PMEM. Let’s look at how to configure the PMEM for SQL Server on Linux.

SQL Server 2016 introduced support for non-volatile DIMMs and an optimization called Tail of the Log Caching on NVDIMM. These leveraged Windows Server direct access to a persistent memory device in DAX mode to reduce the number of operations needed to harden a log buffer to persistent storage.

SQL Server 2019 extends the support for PMEM devices to Linux, providing full enlightenment of data and transaction logs placed on PMEM. Enlightenment is a way to access the storage device using efficient user-space memcpy() operations. Rather than going through the file system and storage stack, SQL Server leverages DAX support on Linux to place data directly into the device. This helps to reduce latency.

Click through for the configuration steps.

Comments closed

On Quantum Supremacy

John Cook has some thoughts on Google’s quantum supremacy announcement:

Google announced today that it has demonstrated “quantum supremacy,” i.e. that they have solved a problem on a quantum computer that could not be solved on a classical computer. Google says

Our machine performed the target computation in 200 seconds, and from measurements in our experiment we determined that it would take the world’s fastest supercomputer 10,000 years to produce a similar output.

IBM disputes this claim. They don’t dispute that Google has computed something with a quantum computer that would take a lot of conventional computing power, only that it “would take the world’s fastest supercomputer 10,000 years” to solve. IBM says it would take 2.5 days.

If you want to jump the gun but also stay on the Microsoft stack, the Q# programming language is open-source and you can run a simulator on your machine. Manning also has a Q# book in the works.

Comments closed

Reviewing the AMD EPYC Line for SQL Server

Glenn Berry takes a look at whether we want to invest in AMD’s latest server-grade processor:

On August 7, 2019, AMD finally unveiled their new 7nm EPYC 7002 Series of server processors, formerly code-named “Rome” at the AMD EPYC Horizon Event in San Francisco. This is the second generation EPYC server processor that uses the same Zen 2 architecture as the AMD Ryzen 3000 Series desktop processors. These new processors are socket compatible with the previous generation AMD EPYC 7001 Series processors, so they will work in existing model servers (with a BIOS update). Despite that, you will need a new model server to be able to use PCIe 4.0 support from the newer processors.

The AMD EPYC 7002 series includes 19 public launch SKUs that have anywhere from 8 to 64 physical cores, plus SMT, for twice the number of logical cores per processor. There are fourteen SKUs that will work in both one-socket and two-socket servers. There are also five less expensive processor SKUs (which have a “P” suffix) that only work in one-socket servers. This processor family has enough compute horsepower, memory bandwidth and capacity, and I/O bandwidth to support large server workloads on a single-socket server.

It certainly looks competitive. And that’s a great thing for consumers, even those who never make the switch, as it will force Intel to up its game.

Comments closed

AMD and Server CPUs

Glenn Berry has an interesting post on why he’s seriously considering recommending AMD CPUs to people:

AMD claims a 15% Instructions Per Clock (IPC) increase between the desktop Zen+ and Zen 2 generations, and we are likely to see a similar increase between the previous AMD EPYC 7001 “Naples” and the AMD EPYC 7002 series processors.

So far, we don’t know the official base and turbo clock speeds, but there was a recent leak of partial specifications and pricing by a European retailer that listed max boost clock speeds of up to 3.4 GHz. We won’t know the actual single-threaded performance of these processors until they have been released and benchmarked by neutral third-party testers. I am optimistic that they will have higher single-threaded CPU performance than Intel Cascade Lake-SP processors.

I’ve always had a soft spot in my heart for AMD, so I’d love to see them come through with a serious competitor to Intel in the server space, for nostalgic reasons but also to make price more competitive and to make Intel get back on its game.

Comments closed

Flink’s Network Stack

Nico Kruber dives into the internals of Apache Flink’s network stack:

Flink’s network stack is one of the core components that make up the flink-runtime module and sit at the heart of every Flink job. It connects individual work units (subtasks) from all TaskManagers. This is where your streamed-in data flows through and it is therefore crucial to the performance of your Flink job for both the throughput as well as latency you observe. In contrast to the coordination channels between TaskManagers and JobManagers which are using RPCs via Akka, the network stack between TaskManagers relies on a much lower-level API using Netty.

This blog post is the first in a series of posts about the network stack. In the sections below, we will first have a high-level look at what abstractions are exposed to the stream operators and then go into detail on the physical implementation and various optimisations Flink did. We will briefly present the result of these optimisations and Flink’s trade-off between throughput and latency. Future blog posts in this series will elaborate more on monitoring and metrics, tuning parameters, and common anti-patterns.

There’s a lot in here and it’s worth reading.

Comments closed

AMD vs Intel CPUs For Data Processing Jobs

Hariharan Iyer and Abhishek Srivastava run some tests against AWS’s new AMD-powered EC2 instances:

Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not as sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances. However, the speedup depends on the engine and the speedup for r5 (Intel) is lower for Spark (10%) than for Hive (25%).
– r5 instances are also either cheaper (by about 10% for Hive) or the same cost (for Spark)

At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.

Comments closed

Analytics Platform System V7 Released

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on appliance update 6 (APS 2016) release as a foundation. Upgrading to APS appliance update 6 is a prerequisite to upgrade to appliance update 7.

This is useful for the six customers which can afford the licensing for APS.

Comments closed

The Optimal Kafka Message Size

Guy Shilo wants to figure out the right chunk size for a Kafka message:

I wrote a python program that runs a producer and a consumer for 30 minutes with different message sizes and measures how many messages per second it can deliver, or the Kafka cluster throughput.

I did not care about the message content, so the consumer only reads the messages from the topic and then discards them. I used a Three partition topic. I guess that on larger clusters with more partitions the performance will be better, but the message size – throughput ratio will remain roughly the same.

So I wrote a small python program that generates a dummy message in the desired size, then spawns two threads, one is a producer and the other is a consumer. The producer send the same message over and over and the consumer reads the messages from the topic and count how many messages it has read. The main program stops after 30 minutes but before it stops it prints how many messages were consumed and how many messages were consumed per second.

Read on for the results.  More importantly, test in your own environment with your own equipment, as that value’s likely to differ a bit.

Comments closed

When Nanoseconds Count

Joe Chang thinks about single-socket servers:

There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.

That the single processor system has low memory latency was interesting but non-actionable bit of knowledge, until recently. The widespread practice in IT world was to have the 2-way system as the baseline standard. Single socket systems were relegated to small business and turnkey solutions. From long ago to a few years ago, there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave new reason to continue the 2-way as baseline practice.

Joe points out that for a highly-used transactional system, the lower memory latency might make a single-socket server perform better than a multi-socket server.

Comments closed

The Argument For Single-Socket Servers

Joe Chang wants us to think about socket counts:

It might seem that the 2-socket system continues to be a good choice, as two processors with an intermediate number of cores is less expensive than one processor with twice as many cores. An example is the Xeon Gold 6132 14-core versus the Xeon Platinum 8180 28-core processors. In addition, the two-socket system has twice the memory capacity and nominally twice as much memory bandwidth.

So, end of argument, right? Well, no.

Click through for his argument in favor of single-socket machines for OLTP systems.

Comments closed