
Day: January 9, 2020

Spark on Docker on YARN on Cloud

Adam Antal has included all of the layers:

Bringing your own libraries to run a Spark job on a shared YARN cluster can be a huge pain. In the past, you had to install the dependencies independently on each host or use different Python package management software. Nowadays Docker provides a much simpler way of packaging and managing dependencies so users can easily share a cluster without running into each other, or waiting for central IT to install packages on every node. Today, we are excited to announce the preview of Spark on Docker on YARN available on CDP DataCenter 1.0 release.

Joking about stack length aside, this looks really useful.


Optimal Kafka Partitioning

Paul Brebner is on a quest:

This blog provides an overview around the two fundamental concepts in Apache Kafka: Topics and Partitions. While developing and scaling our Anomalia Machina application we have discovered that distributed applications using Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and critical variables included the number of Kafka topics and partitions. In this blog, we test that theory and answer questions like “What impact does increasing partitions have on throughput?” and “Is there an optimal number of partitions for a cluster to maximize write throughput?” And more!

Read on for some interesting findings.


Upgrading SQL Server Windows Docker Containers

Emanuele Meazzo shows how you can upgrade SQL Server if you are using a Windows Docker container instead of Linux:

With the 1st CU for SQL 2019 released just yesterday, and Microsoft updating the docker image right away, the only natural response for me was to update the docker instance that I showed you how to deploy a few months back.

In theory, a docker container can’t be really “updated”, they’re meant to be stateless machines that you spin up and down responding to changes in demand; what we’re technically doing is creating a new container, based on a new image, that has the same configuration and uses the same persistent storage as the old one.

Read on to see how you can perform this upgrade without losing your data.
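As a rough sketch of the replace-rather-than-update idea described above (not the steps from the linked post), the Python below builds the sequence of Docker CLI calls that swap a container for a new one created from a newer image, reusing the same name, volume, port mapping, and environment. The function name `build_upgrade_commands` and every example value (container name, image reference, volume name, and path) are hypothetical; substitute whatever your original `docker run` used.

```python
from typing import Dict, List


def build_upgrade_commands(
    name: str,
    new_image: str,
    volumes: Dict[str, str],
    ports: Dict[int, int],
    env: Dict[str, str],
) -> List[List[str]]:
    """Return docker CLI calls that replace container `name` with one built from `new_image`.

    Nothing is executed here; the caller decides how to run the commands.
    """
    # Recreate the container with the same configuration as the old one.
    run_cmd = ["docker", "run", "-d", "--name", name]
    for source, target in volumes.items():
        # Reusing the same volume is what keeps the data across the swap.
        run_cmd += ["-v", f"{source}:{target}"]
    for host_port, container_port in ports.items():
        run_cmd += ["-p", f"{host_port}:{container_port}"]
    for key, value in env.items():
        run_cmd += ["-e", f"{key}={value}"]
    run_cmd.append(new_image)

    return [
        ["docker", "pull", new_image],  # fetch the updated image
        ["docker", "stop", name],       # stop the old container
        ["docker", "rm", name],         # remove it; named volumes are left in place
        run_cmd,                        # start the replacement
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    commands = build_upgrade_commands(
        name="sqlserver",
        new_image="example/sqlserver:new-tag",
        volumes={"sqldata": "C:/data"},
        ports={1433: 1433},
        env={"ACCEPT_EULA": "Y"},
    )
    for command in commands:
        print(" ".join(command))
```

The sketch only prints the commands so they can be reviewed first. The data survives because `docker rm` without `-v` does not delete named volumes, so the new container mounts the same storage the old one used.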


Undercover Catalogue 0.4

David Fowler announces a new release of the Undercover Catalogue:

The first major change that 0.4.0 brings is centralisation. With previous versions of the Catalogue, it’s been a requirement to have the Catalogue schema and procs installed on every server that you want to monitor.

0.4 changes that: there is now no need to have anything installed on any of the target instances. Simply install the Catalogue in one place, on your central configuration server, and add any instances that you require monitored to the Catalogue.ConfigInstances table.

This makes it much easier to add in instances to the Catalogue.

There are a few other updates as well, so check them out.


Partitioning on Columnstore Table Loading

Aaron Bertrand continues a series around learning about columnstore indexes:

In part 1, I showed how both page and columnstore compression could reduce the size of a 1TB table by 80% or more. While I was impressed I could shrink a table from 1TB to 50GB, I wasn’t very happy with the amount of time it took (anywhere from 2 to 14 hours). With some tips graciously borrowed from folks like Joe Obbish, Lonny Niederstadt, Niko Neugebauer, and others, in this post I will try to make some changes to my original attempt to get better load performance. Since the regular columnstore index didn’t compress better than page compression on this data set, and took 13 hours longer to get there, I’ll focus solely on the more advanced solution using COLUMNSTORE_ARCHIVE compression.

Click through for part 2.
