Press "Enter" to skip to content

Author: Kevin Feasel

Spark on Docker on YARN on Cloud

Adam Antal has included all of the layers:

Bringing your own libraries to run a Spark job on a shared YARN cluster can be a huge pain. In the past, you had to install the dependencies independently on each host or use different Python package management softwares. Nowadays Docker provides a much simpler way of packaging and managing dependencies so users can easily share a cluster without running into each other, or waiting for central IT to install packages on every node. Today, we are excited to announce the preview of Spark on Docker on YARN available on CDP DataCenter 1.0 release.

Joking about stack length aside, this looks really useful.

Comments closed

Upgrading SQL Server Windows Docker Containers

Emanuele Meazzo shows how you can upgrade SQL Server if you are using a Windows Docker container instead of Linux:

With the 1st CU for SQL 2019 released just yesterday, and Microsoft updating the docker image right away, the only natural response for me was to update the docker instance that I showed you how to deploy a few months back.

In theory, a docker container can’t be really “updated”, they’re meant to be stateless machines that you spin up and down responding to changes in demand; what we’re technically doing is creating a new container, based on a new image, that has the same configuration and uses the same persistent storage as the old one.

Read on to see how you can perform this upgrade without losing your data.

Comments closed

Optimal Kafka Partitioning

Paul Brebner is on a quest:

This blog provides an overview around the two fundamental concepts in Apache Kafka : Topics and Partitions. While developing and scaling our Anomalia Machina application we have discovered that distributed applications using Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and critical variables included the number of Kafka topics and partitions. In this blog, we test that theory and answer questions like “What impact does increasing partitions have on throughput?” and “Is there an optimal number of partitions for a cluster to maximize write throughput?” And more!

Read on for some interesting findings.

Comments closed

Partitioning on Columnstore Table Loading

Aaron Bertrand continues a series around learning about columnstore indexes:

In part 1, I showed how both page and columnstore compression could reduce the size of a 1TB table by 80% or more. While I was impressed I could shrink a table from 1TB to 50GB, I wasn’t very happy with the amount of time it took (anywhere from 2 to 14 hours). With some tips graciously borrowed from folks like Joe ObbishLonny NiederstadtNiko Neugebauer, and others, in this post I will try to make some changes to my original attempt to get better load performance. Since the regular columnstore index didn’t compress better than page compression on this data set, and took 13 hours longer to get there, I’ll focus solely on the more advanced solution using COLUMNSTORE_ARCHIVE compression.

Click through for part 2.

Comments closed

Undercover Catalogue 0.4

David Fowler announces a new release of the Undercover Catalogue:

The first major change that 0.4.0 brings is centralisation. With previous versions of the Catalogue, it’s been a requirement to have the Catalogue schema and procs installed on every server that you want to monitor.

0.4 changes that, there is now no need to have anything installed on any of the target instances. Simply install the Catalogue in one place, on your central configuration server and add any instances that you require monitored to the Catalogue.ConfigInstances table.

This makes it much easier to add in instances to the Catalogue.

There are a few other updates as well, so check them out.

Comments closed

Accessing S3 Data from Apache Spark

Divyansh Jain shows how we can connect to AWS’s S3 using Apache Spark:

Now, coming to the actual topic that how to read data from S3 bucket to Spark. Well, it is not very easy to read S3 bucket by just adding Spark-core dependencies to your Spark project and use spark.read to read you data from S3 Bucket.

So, to read data from an S3, below are the steps to be followed:

This isn’t a built-in source, so there is a little bit of work to do, but it’s not that bad.

Comments closed

Delegating Authentication using Managed Service Accounts

Jamie Wick helps us solve the classic Kerberos double-hop problem:

If the Report Server service doesn’t have permission to delegate to the SQL Server, it will try to connect anonymously (step 4 in the diagram above). Which results in this login error:

Login failed for user ‘NT AUTHORITY\ANONYMOUS LOGON’. Reason: Could not find a login matching the name provided. [CLIENT: <Client IP Address>]

Historically report server and SQL server services, that needed the ability to delegate authentication to other servers, were configured to run using an Active Directory user account. Enabling delegation on these accounts was simply a matter of setting the Trust level on the Delegation tab of the account’s properties (with Active Directory Users & Computers).

But Jamie is here to show us a better way.

Comments closed

Creating a Custom Partitioner for Apache Kafka

Swapnil Gosavi walks us through the process of creating a custom partitioner in Apache Kafka:

Assume we are collecting data from different departments. All the departments are sending data to a single topic named department. I planned five partitions for the topic. But, I want two partitions dedicated to a specific department, named IT, and the remaining three partitions for the rest of the departments. How would you achieve this?

You can solve this requirement, and any other type of partitioning needs by implementing a custom partitioner.

This is quite useful when you don’t necessarily want topic explosion but you do want more than what the classic partitioner allows.

Comments closed

The Costs of Virtualization

David Klee points out that virtualization, configured correctly, should not harm SQL Server performance much:

A wonderful reader of my blog sent me a note (thanks Jess!) about a single line notation in the latest SQL Server release notes. The notes is as follows.

Running SQL Server on a virtual machine will be slower than running natively because of the overhead of virtualization.

The question was simple. Why would Microsoft add this disclaimer? It was being used as a negative talking point towards SQL Server virtualization, and holding the DBA team back from getting the benefits of virtualization.

David gives us some rough numbers on what that means. Spoiler alert: if you set up your environment right, it’s not much.

Comments closed

Generating an Email List from Active Directory Users

James Livingston takes us through an interesting solution to a common problem:

If you’ve ever performed some impactful maintenance on a SQL Server, you probably notified users. If you’re great at documentation and already know exactly who to contact, this script isn’t for you. If you don’t have a user email list, this script will create it for you!

I used to manage 500 SQL Server instances and there was daily maintenance\changes going on constantly. I wrote this PowerShell script to automatically create an email list for me. This PowerShell script gathers the login information from an instance of SQL Server and then pulls their email address from Active Directory.

Read on to see the script in action.

Comments closed