Category: Hardware

Reviewing the AMD EPYC Line for SQL Server

Glenn Berry takes a look at whether we want to invest in AMD’s latest server-grade processor:

On August 7, 2019, AMD finally unveiled their new 7nm EPYC 7002 Series of server processors, formerly code-named “Rome,” at the AMD EPYC Horizon Event in San Francisco. This is the second-generation EPYC server processor, using the same Zen 2 architecture as the AMD Ryzen 3000 Series desktop processors. These new processors are socket-compatible with the previous-generation AMD EPYC 7001 Series processors, so they will work in existing model servers (with a BIOS update). You will, however, need a new model server to take advantage of the newer processors’ PCIe 4.0 support.

The AMD EPYC 7002 series includes 19 public launch SKUs with anywhere from 8 to 64 physical cores, plus SMT for twice that many logical cores per processor. Of those, 14 work in both one-socket and two-socket servers, while the other five (less expensive SKUs with a “P” suffix) work only in one-socket servers. This processor family has enough compute horsepower, memory bandwidth and capacity, and I/O bandwidth to support large server workloads on a single-socket server.

It certainly looks competitive. And that’s a great thing for consumers, even those who never make the switch, as it will force Intel to up its game.
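If you want to see how a given box's core counts line up with this, here is a minimal sketch for checking physical versus SMT logical cores; it assumes the third-party psutil package (pip install psutil), which is my addition, not something from Glenn's post.

```python
# A minimal sketch for comparing physical cores to SMT logical cores on
# a host. Assumes the third-party psutil package is installed.
import psutil

physical = psutil.cpu_count(logical=False)  # physical cores only
logical = psutil.cpu_count(logical=True)    # includes SMT siblings

print(f"Physical cores: {physical}")
print(f"Logical cores:  {logical}")
print(f"SMT factor:     {logical // physical}x")
```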


AMD and Server CPUs

Glenn Berry has an interesting post on why he’s seriously considering recommending AMD CPUs to people:

AMD claims a 15% Instructions Per Clock (IPC) increase between the desktop Zen+ and Zen 2 generations, and we are likely to see a similar increase between the previous AMD EPYC 7001 “Naples” and the AMD EPYC 7002 series processors.

So far, we don’t know the official base and turbo clock speeds, but there was a recent leak of partial specifications and pricing by a European retailer that listed max boost clock speeds of up to 3.4 GHz. We won’t know the actual single-threaded performance of these processors until they have been released and benchmarked by neutral third-party testers. I am optimistic that they will have higher single-threaded CPU performance than Intel Cascade Lake-SP processors.

I’ve always had a soft spot in my heart for AMD, so I’d love to see them come through with a serious competitor to Intel in the server space, partly out of nostalgia but also to make pricing more competitive and to force Intel to get back on its game.
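Until those neutral third-party benchmarks land, one crude way to compare single-threaded speed across machines you already have is to time a fixed CPU-bound loop. This is only a rough sketch, not a substitute for a real benchmark suite, and interpreter overhead dominates the measurement.

```python
# A crude single-threaded benchmark sketch: time a fixed CPU-bound loop
# so the same script can be run on different processors and compared.
# Treat the result only as a rough relative indicator.
import time

def spin(n: int = 10_000_000) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.perf_counter()
spin()
elapsed = time.perf_counter() - start
print(f"Single-threaded loop time: {elapsed:.3f}s (lower is better)")
```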


Flink’s Network Stack

Nico Kruber dives into the internals of Apache Flink’s network stack:

Flink’s network stack is one of the core components that make up the flink-runtime module and sit at the heart of every Flink job. It connects individual work units (subtasks) from all TaskManagers. This is where your streamed-in data flows through and it is therefore crucial to the performance of your Flink job for both the throughput as well as latency you observe. In contrast to the coordination channels between TaskManagers and JobManagers which are using RPCs via Akka, the network stack between TaskManagers relies on a much lower-level API using Netty.

This blog post is the first in a series about the network stack. In the sections below, we will first take a high-level look at the abstractions exposed to the stream operators, then go into detail on the physical implementation and the various optimisations Flink has made. We will briefly present the results of these optimisations and Flink’s trade-off between throughput and latency. Future posts in this series will elaborate on monitoring and metrics, tuning parameters, and common anti-patterns.

There’s a lot in here and it’s worth reading.
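As a taste of the throughput-versus-latency trade-off the post covers, here is a hedged PyFlink sketch showing the buffer timeout knob, which bounds how long a partially filled network buffer may wait before being flushed downstream. The job itself is a trivial placeholder, and I'm assuming the PyFlink DataStream API rather than anything from Nico's post.

```python
# A sketch of Flink's throughput/latency knob: the buffer timeout bounds
# how long a partially filled network buffer waits before being flushed.
# Assumes PyFlink is installed; the job is a placeholder.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Flush buffers after at most 5 ms (the default is 100 ms): lower values
# favor latency, higher values (or -1, flush only when full) favor
# throughput.
env.set_buffer_timeout(5)

env.from_collection([1, 2, 3]) \
   .map(lambda x: x * 2) \
   .print()

env.execute("buffer-timeout-demo")
```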


AMD vs Intel CPUs For Data Processing Jobs

Hariharan Iyer and Abhishek Srivastava run some tests against AWS’s new AMD-powered EC2 instances:

Our summary findings from TPCDS benchmarks are as follows:
– TPCDS queries are not very sensitive to local disk performance (and hence to EBS volume sizes)
– r5 (Intel) instances are consistently faster than r5a (AMD) instances, but the speedup depends on the engine: about 10% for Spark versus about 25% for Hive
– r5 instances are also either cheaper (by about 10%, in the Hive case) or the same cost (in the Spark case)

At least for Hadoop and Spark work, Intel CPUs are a bit better, but there is some nuance in the story so check it out.
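To see why a faster-but-pricier instance can still win on cost, here is the back-of-the-envelope math. The hourly prices are approximate us-east-1 on-demand rates for the xlarge sizes and the speedups are the ones quoted above; treat every figure as illustrative.

```python
# Back-of-the-envelope price/performance math for the r5-vs-r5a result.
# All dollar figures are approximate and illustrative.
r5_price, r5a_price = 0.252, 0.226  # $/hr: r5.xlarge vs r5a.xlarge (approx.)

for engine, speedup in [("Hive", 1.25), ("Spark", 1.10)]:
    # If r5 finishes the same work `speedup` times faster, its effective
    # cost per unit of work is its hourly price divided by the speedup.
    r5_effective = r5_price / speedup
    verdict = "cheaper" if r5_effective < r5a_price else "about the same"
    print(f"{engine}: r5 effective ${r5_effective:.3f}/hr "
          f"vs r5a ${r5a_price:.3f}/hr -> r5 {verdict}")
```

With these illustrative rates, the output mirrors the authors' findings: roughly 10% cheaper for Hive, a wash for Spark.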


Analytics Platform System V7 Released

Microsoft has released a new version of their Analytics Platform System:

Microsoft is pleased to announce that the Analytics Platform System (APS) appliance update 7 (AU7) is now generally available. APS is Microsoft’s scale-out Massively Parallel Processing (MPP) system based on SQL Server for data warehouse specific workloads on-premises.

Customers will get significantly improved query performance and enhanced security features with this release. APS AU7 builds on the appliance update 6 (APS 2016) release as its foundation; upgrading to appliance update 6 is a prerequisite for upgrading to appliance update 7.

This is useful for the six customers who can afford the licensing for APS.


The Optimal Kafka Message Size

Guy Shilo wants to figure out the right chunk size for a Kafka message:

I wrote a Python program that runs a producer and a consumer for 30 minutes with different message sizes and measures how many messages per second it can deliver, in other words, the Kafka cluster’s throughput.

I did not care about the message content, so the consumer only reads the messages from the topic and then discards them. I used a three-partition topic. I would guess that on larger clusters with more partitions the performance will be better, but the message size to throughput ratio will remain roughly the same.

So I wrote a small Python program that generates a dummy message of the desired size, then spawns two threads: one a producer and the other a consumer. The producer sends the same message over and over, and the consumer reads the messages from the topic and counts how many it has read. The main program stops after 30 minutes, but before stopping, it prints how many messages were consumed in total and per second.

Read on for the results.  More importantly, test in your own environment with your own equipment, as that value’s likely to differ a bit.
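The original program isn't shown, but a condensed sketch of that kind of harness might look like the following, using the kafka-python package (an assumption on my part) with a placeholder broker address and topic.

```python
# A condensed sketch of the harness described above: one daemon thread
# produces a fixed-size dummy message in a loop, another consumes and
# counts, and the main thread reports the rate at the end. Assumes the
# kafka-python package and a reachable broker.
import threading
import time

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"  # placeholder address
TOPIC = "size-test"         # e.g., a three-partition topic as in the post
MESSAGE_SIZE = 1024         # bytes; vary this across runs
DURATION = 30               # seconds (the post used 30 minutes)

stop = threading.Event()
consumed = 0

def produce() -> None:
    producer = KafkaProducer(bootstrap_servers=BROKERS)
    payload = b"x" * MESSAGE_SIZE
    while not stop.is_set():
        producer.send(TOPIC, payload)
    producer.flush()

def consume() -> None:
    global consumed
    consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                             auto_offset_reset="latest")
    for _ in consumer:  # read and discard; we only care about the count
        consumed += 1
        if stop.is_set():
            break

for target in (produce, consume):
    threading.Thread(target=target, daemon=True).start()

time.sleep(DURATION)
stop.set()
time.sleep(1)  # let in-flight messages drain
print(f"{consumed} messages in {DURATION}s = {consumed / DURATION:.0f} msg/s")
```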


When Nanoseconds Count

Joe Chang thinks about single-socket servers:

There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.

That the single-processor system has low memory latency was an interesting but non-actionable bit of knowledge until recently. The widespread practice in the IT world was to treat the 2-way system as the baseline standard, with single-socket systems relegated to small business and turnkey solutions. From long ago until a few years ago there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave a new reason to keep the 2-way as the baseline practice.

Joe points out that for a highly-used transactional system, the lower memory latency might make a single-socket server perform better than a multi-socket server.
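On Linux, you can at least see the firmware's view of relative memory latency without special tooling. This small sketch reads each NUMA node's ACPI SLIT distance vector from sysfs; a value of 10 means local access, and remote nodes on a two-socket box typically report 20 or more.

```python
# A small Linux-only sketch: print each NUMA node's distance vector, as
# reported by firmware via sysfs. On a single-socket box you'll usually
# see just one node; on a multi-socket server, remote entries are larger.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    distances = (node / "distance").read_text().split()
    print(f"{node.name}: distance to each node = {distances}")
```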


The Argument For Single-Socket Servers

Joe Chang wants us to think about socket counts:

It might seem that the 2-socket system continues to be a good choice, as two processors with an intermediate number of cores is less expensive than one processor with twice as many cores. An example is the Xeon Gold 6132 14-core versus the Xeon Platinum 8180 28-core processors. In addition, the two-socket system has twice the memory capacity and nominally twice as much memory bandwidth.

So, end of argument, right? Well, no.

Click through for his argument in favor of single-socket machines for OLTP systems.
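The arithmetic behind that trade-off is easy to sketch. The dollar figures below are approximate Intel launch list prices for the two SKUs Joe mentions, so treat them as illustrative rather than authoritative.

```python
# The cost side of the 2-socket-vs-1-socket trade-off, using approximate
# launch list prices (illustrative figures only).
configs = {
    "2x Xeon Gold 6132 (2 x 14 cores)":     (2, 14, 2_111),
    "1x Xeon Platinum 8180 (1 x 28 cores)": (1, 28, 10_009),
}

for name, (sockets, cores_per_socket, price_per_cpu) in configs.items():
    cores = sockets * cores_per_socket
    total = sockets * price_per_cpu
    print(f"{name}: {cores} cores, ${total:,} total, ${total / cores:,.0f}/core")
```

Same core count, very different price, which is exactly why the single-socket option needs a stronger argument than hardware cost alone.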


Early Thoughts On New AMD CPUs

Glenn Berry talks about the new AMD processor lines and how they might work with SQL Server:

AMD is really pushing the idea of a single-socket EPYC system as a better alternative to a two-socket Intel system for many server workloads. According to AMD, it will be much less expensive, yet will have plenty of cores, memory, and PCIe 3.0 lanes, along with no NUMA overhead. One key advantage AMD is touting is their Infinity Fabric modular interconnect technology, which works both within a single processor and between multiple processors.

For SQL Server 2016/2017 usage, you would still want the “top of the line” SKU for a given physical core count, to get the most performance for each physical core license that you buy. Unlike Intel, AMD does not increase the base clock speed in the lower core count models. These EPYC systems have a lot of PCIe 3.0 lanes and very high memory density, so they might work really well for large SQL Server DW/Reporting workloads. For OLTP workloads, the key will be how much single-threaded performance AMD is able to get from this first generation of EPYC, and how that compares to Intel’s new Skylake-SP processors. Figure 3 shows the fastest EPYC processor at each core count, which is what you would want for SQL Server usage.

There aren’t too many hard numbers yet, but the worst-case scenario is that they force Intel to improve its offerings.
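The per-core licensing point deserves a quick illustration. The roughly $7,128-per-core figure below is the commonly cited SQL Server Enterprise Edition list price; treat every number as approximate.

```python
# A rough illustration of why per-core performance dominates the hardware
# decision: SQL Server licensing dwarfs CPU prices. Approximate figures.
LICENSE_PER_CORE = 7_128  # USD, SQL Server Enterprise (approx. list price)

for cores in (8, 16, 32):
    print(f"{cores} cores: ~${cores * LICENSE_PER_CORE:,} in licenses alone")

# With ~$114k of licenses on a 16-core box, paying a modest hardware
# premium for the fastest-per-core SKU at that core count is an easy call.
```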


Thinking About Databases At Scale

Chris Adkin has a great post explaining some of the hardware and query principles behind scale issues:

All execution plan iterators that require memory grants have two fundamental code paths: one for when the memory grant is blown and memory spills out into tempdb, and one for when the memory grant is correct or under-estimated. Perhaps the database engine team may at some point include a third option, for when the grant can be accommodated inside the CPU cache.

As an example, if you run a log-record-generation-intensive workload on the same CPU socket as the log writer (usually socket 0), it will run in a shorter time than the exact same workload run on a different socket.

This is the type of post where I catch just enough of it to know that I need to dig deeper and learn more.
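For the socket-affinity experiment Chris describes, a hedged Linux-only sketch might pin the process to one socket's CPUs before running the workload, then repeat on the other socket and compare timings. The CPU id set below is a placeholder; the real CPU-to-socket mapping is machine-specific (check lscpu first).

```python
# A Linux-only sketch: pin the current process to a chosen set of CPUs
# (e.g., the cores on socket 0, near the log writer) before running a
# workload. The CPU ids below are placeholders, not a universal mapping.
import os

SOCKET0_CPUS = {0, 1, 2, 3}  # placeholder: CPUs assumed to be on socket 0

os.sched_setaffinity(0, SOCKET0_CPUS)  # 0 = the calling process
print(f"Now restricted to CPUs: {sorted(os.sched_getaffinity(0))}")

# ...run the log-record-generation-intensive workload here, then rerun
# the script pinned to another socket's CPUs and compare the timings.
```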
