Memory Management in Spark

Rishitesh Mishra has started a new series on slow or failing Spark applications and starts with the big reason:

If we were to got all Spark developers to vote, out of memory (OOM) conditions would surely be the number one problem everyone has faced. This comes as no big surprise as Spark’s architecture is memory-centric. Some of the most common causes of OOM are:
* Incorrect usage of Spark
* High concurrency
* Inefficient queries
* Incorrect configuration

Definitely worth the read.

An Analysis of Popular R Packages

John Mount gives us a directed graph of popular R packages:

In the above diagram each node is an R package. We are restricting our selves to popular extension packages: those that have at least 1,000 indirect uses in CRAN (via Depends/Imports/LinkingTo; excluding base packages such as stats and utils).

John has some interesting insights here as well.

Dynamic Routing with Kafka Streams

Yeva Byzek explains how you can use Kafka Streams to perform dynamic routing of messages:

A cleaner way is to provide the service with a separate stream that contains only the relevant subset of events that the microservice cares about. To achieve this, a streaming application can branch the original event stream into different substreams using the method KStream#branch(). This results in new Kafka topics, so then the microservice can subscribe to one of the branched streams directly.

For example, in the finance domain, consider a fraud remediation microservice that should process only the subset of events suspected of being fraudulent. As shown below, the original stream of events is branched into two new streams: one for suspicious events and one for validated events. This enables the fraud remediation microservice to process just the stream of suspicious events, without ever seeing the validated events.

Read on to learn more.

The (Un)Importance of Index Fragmentation

Tibor Karaszi argues that index fragmentation is a less serious issue than most DBAs think:

You know the story. Every week or so, we defragment the indexes. Many of us uses Ola Hallengren’s great script for this, some uses Maintenance Plans, and there are of course other alternatives as well. But are we just wasting time and effort? Quite probably we are. I’m going to start with some basics, and then do some reasoning, and finally give you some numbers of a very simple test that I ran. The T-SQL code is available. If you give it a try, please let us know your finding for your environment by adding a comment. I will do some generalizations and simplifications, to avid this post being 10 times longer.

Jeff Moden has a couple of great talks on the topic which really pushed me in this direction. Grab his slides from the SQL Saturday site for a much deeper look at this topic.

Learning with Limited Data

Shioulin Sam and Nisha Muktewar have new research on machine learning when getting labeled data is time-consuming or difficult:

We are excited to release Learning with Limited Labeled Data, the latest report and prototype from Cloudera Fast Forward Labs.

Being able to learn with limited labeled data relaxes the stringent labeled data requirement for supervised machine learning. Our report focuses on active learning, a technique that relies on collaboration between machines and humans to label smartly.

Active learning makes it possible to build applications using a small set of labeled data, and enables enterprises to leverage their large pools of unlabeled data. In this blog post, we explore how active learning works. (For a higher level introduction, please see our previous blogpost.

The research itself is behind a paywall but you can see their write-up to get an idea of the topic.

Understanding the Page Resource Cracker

John Morehouse looks at the new sys.fn_PageResCracker() function in SQL Server 2019:

In a previous blog post, I discussed two new methods in SQL Server 2019 to determine exactly which page a request might be waiting for when there is contention.  One of these new methods involves a new function, fn_pagerescracker.   Naturally, I wanted to see how this function operates.  Let’s look at the Master database to investigate how it works!

Click through for the function definition and what it all means.

Desired State Configuration Resources

Jess Pomfret takes us through resources in Desired State Configuration:

A critical part of our DSC configuration is made up of resources. These are the building blocks we need to to define our desired state.  There are two kinds of resources that we can use: class based and MOF based (most common). We are going to focus our efforts today on looking at MOF based resources.

Resources come packaged up as modules and our servers, which use at least WMF 4.0, come with several built-in. We have two main options for additional resources; we can find DSC resource modules in the PowerShell Gallery or we can write our own.

Jess wraps up the post with five useful resources for database administrators.

The Value of Differential Backups

Jamie Wick explains what differential backups are and when they can be useful:

SQL Server natively supports 3 types of backups: Full, Differential and Log. Full backups take a complete backup of the entire database, while Log backups take a backup of the database’s transaction log. So, What are Differential backups?Are they really necessary?

Read on to see what Jamie has to say.

Limitations on Inlining Functions

Erik Darling takes us through a limitation to SQL Server 2019 scalar function inlining:

There’s a lot of excitement (alright, maybe I’m sort of in a bubble with these things) about SQL Server 2019 being able to inline most scalar UDFs.

But there’s a sort of weird catch with them. It’s documented, but still.

If you use GETDATE in the function, it can’t be inlined.

GETDATE() and its bretheren are non-deterministic so I figured that would be an issue. Check out the documentation for the other limitations.


April 2019
« Mar