PolyBase and Dockerized Hadoop

I have a solution to a problem which vexed me for quite some time:

Quite some time ago, I posted about PolyBase and the Hortonworks Data Platform 2.5 (and later) sandbox.

The summary of the problem is that data nodes in HDP 2.5 and later are on a Docker private network. For most cases, this works fine, but PolyBase expects publicly accessible data nodes by default—one of its performance enhancements with Hadoop was to have PolyBase scale-out group members interact directly with the Hadoop data nodes rather than having everything go through the NameNode and PolyBase control node.

Click through for the solution.

Null Checks in Spark DataFrames

Bipin Patwardhan gives us four techniques for validating whether data in Spark exists:

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column ‘null’). In order to keep things simple, I will be assuming that the data to be validated has been loaded into a Spark DataFrame named “df.”

Click through for those techniques.

strace and SQL Server Containers

Anthony Nocentino tries using strace to diagnose SQL Server process activity in a container:

We’re attaching to an already running docker container running SQL. But what we get is an idle SQL Server process this is great if we have a running workload we want to analyze but my goal for all of this is to see how SQL Server starts up and this isn’t going to cut it.
My next attempt was to stop the sql19 container and quickly start the strace container but the strace container still missed events at the startup of the sql19 container. So I needed a better way.

Don’t worry—Anthony finds a better way.

Port Forwarding in Kubernetes

Andrew Pruski shows off Kubernetes port forwarding and how to connect to a SQL Server instance:

The load balanced service’s IP can be usually be used to connect into the SQL instance running in the pod, but what if we’re unable to connect? Does the issue lie with the service or the pod?

In order to narrow this down, port forwarding can be used to directly connect to the pod: –

Read the whole thing.

dbatools: the Book

Chrissy LeMaire has an exciting announcement:

After nearly 10 months of work, early access to Learn dbatools in a Month of Lunches is now available from our favorite publisher, Manning Publications!

For years, people have asked if any dbatools books are available and the answer now can finally be yes, mostly. Learn dbatools in a Month of Lunches, written by me and Rob Sewell (the DBA with the beard), is now available for purchase, even as we’re still writing it. And as of today, you can even use the code bldbatools50 to get a whopping 50% off.

They’re in active book development, so buy a copy now and watch as the book evolves.

Backup to URL in Azure

Joey D’Antoni recommends backing up database to Blob Storage via URL in Azure:

Unlike in your on-premises environment, where you might have up to a 32 Gbps fibre channel connection to your storage array and then a separate 10 Gbps connection to the file share where you write your SQL Server backups, in the cloud you have a single connection to both storage and the rest of the network. That single connection is metered and correlates to the size (and $$$) of your VM. So bandwidth is somewhat sacred, since backups and normal storage traffic go over the same limited tunnel. This doesn’t mean you can’t have good storage performance, it just means you have to think about things. In the case of the customer I mentioned, they were saturating their network pipe, by writing backups to the file system, and then having the Azure backup service backup their VM, they were saturating their pipe and making regular SQL Server I/Os wait.

Read on to see what the alternative is.


September 2019
« Aug