Data Frame Partial Caching

Kevin Feasel



Arijit Tarafdar shows how to capture partitions of a data frame in Spark, either horizontally or vertically:

In many Spark applications, performance benefit is obtained from caching the data if reused several times in the applications instead of reading them each time from persistent storage. However, there can be situations when the entire data cannot be cached in the cluster due to resource constraint in the cluster and/or the driver. In this blog we describe two schemes that can be used to partially cache the data by vertical and/or horizontal partitioning of the Distributed Data Frame (DDF) representing the data. Note that these schemes are application specific and are beneficial only if the cached part of the data is used multiple times in consecutive transformations or actions.

In the notebook we declare a Student case class with name, subject, major, school and year as members. The application is required to find out the number of students by name, subject, major, school and year.

Partitioning is an interesting idea for trying to speed up Spark performance by keeping everything in memory even when your entire data set is a bit too large.

Infrastructure Error

Suresh Kandoth explains the various causes of “Login-based server access validation failed with an infrastructure error” error messages:

The important piece of information from that message is the part about server access validation failed. So let’s examine that carefully. During the login process, the database engine has to perform several checks regarding the login and its various attributes before letting the application connect to the SQL Server instance. Out of that big list, related to this error condition, there are two checks to find out if this login is authorized to access this server instance. There are two permissions that come into play while performing these authorization checks:

– Does the login have the SERVER class permission named CONNECT SQL for this server instance?

– Does the login have the ENDPOINT class permission named CONNECT for the specific endpoint on which the client application established the connection?

Suresh lays out a number of scenarios and explains why they could cause this error to occur.

Continuous Integration

James Anderson is starting a series on continuous integration:

I had been using source control for years but it’s always felt like a tick box exercise that I was doing because I had to. I had never used it to review old versions to see where code went wrong or to quickly roll back changes if I decided I no longer wanted to go in a certain direction with the code. I never felt like I was getting anything back from using source control. Sometimes it takes a problem to arise for you to see the value of a solution.

In 2015 I started to inherit the code base for our internal maintenance database, the UtilityDB. This database is used to store performance metrics and to manage tasks such as index maintenance and backups. This database is installed on all of our instances.

This first post is an introduction to the series, and it looks like he’ll cover some heady topics.

Microsoft R Client

Kevin Feasel



Buck Woody discusses whether Microsoft R Client really is a client:

Enter the Microsoft R Client. It includes Microsoft R Open, and adds in some of the ScaleR functions, which makes processing data faster and more efficient. And again, it’s a full R environment – you can write and run code, right there on your desktop. But the important bit is that it can connect to a Microsoft R Server (MRS) by seting something called the “Compute Context“, which tells the R environment to run on a more powerful, scalable server environment, like you may be used to with SQL Server.

The naming is a bit of a head-scratcher, to be honest.

Notifying Different Sets Of Operators

Jason Clements has an interesting solution to the problem of user notification:

The other day I was asked How to notify multiple operators using database mail for failed jobs and a different operators for successes.

First I looked at the operator email addresses field [email protected];[email protected]….etc is not helping as there is a limit on the characters in the email name entry of operator and we still have the problem we need different groups for success and failures.

It makes perfect sense, but is non-trivial.  I like it.

Licensing Hyper-Converged Architectures

David Klee discusses SQL Server licensing on a unique (but growing in popularity) hardware architecture:

Now, look at the hyper-converged architectures. The CPUs used for I/O handling are the same as those that your VMs use to power your applications. A substantial portion of the host’s CPU power is now needed for I/O handling, and this activity comes first in the CPU scheduling queues.

This fact, by itself, is not necessarily a problem. Most virtualization host CPUs are relatively lightly utilized, and this amount of CPU needed for I/O handling is readily absorbed without causing a performance problem.

But, larger scale SQL Servers can read and write exceptionally large amounts of data around the clock. The I/O handling at the host layer can start to drain resources from the host. The additional activity scheduling time inside the hypervisor could be slowing down these SQL Servers without you knowing it.

As always, work out how your licensing would look under different alternatives or you could end up wasting a pretty penny.

Alternatives To Invoke-SQLCmd

Drew Furgiuele explains why he prefers not to use Invoke-SQLCmd and what he does instead:

I hope you’re wide awake and ready to get mad, because I have a very unpopular opinion regarding SQL Server and PowerShell: I don’t like using Invoke-SQLCmd . That’s not to say I haven’t used it; some of my older scripts on my GitHub use it but recently I have stopped. I think the cmdlet is of marginal use when compared to other methods to executing queries against a SQL Server database. Maybe you agree, but you probably don’t, so before you grab the torches and pitchforks and head to the comments, I encourage you to read on. It’s a little long, but I think it outlays my thoughts pretty well.

I think Drew makes a good point.  Read the whole thing.

Killing SSMS Using Powershell

Michael Bourgon shows how to kill an instance of SSMS using Powershell:

WMI in action!  I was running SSMS 2016 and it locked up on me.  So while I waited for it to become responsive I started up SSMS 2014. Still locked, but which do I kill in taskman? (The easy answer, is, of course: change the open query and then look in Applications, not Processes). I didn’t think of that, so used WMI.  You could also get fancy and figure out which is the oldest instance of SSMS and do it that way.

Pretty simple: Tell it the path (which is 130 for SSMS 2016), and kill that.

Click through for the script.  You can do more than just kill processes this way.  Think of a scenario in which you create a whitelist of valid processes and regularly check to see if anything outside that whitelist is running.  It’s a little more difficult to do than this script, but not that much tougher.

Cluster Shared Volumes

Edwin Sarmiento discusses Cluster Shared Volumes within Failover Cluster Instances:

CSVs still use the concept of a shared disk but with an added layer of abstraction. Instead of having the shared disk be accessible to only one WSFC node at a time, all of the WSFC nodes have their own logical paths to it. This is made possible thru the Server Message Block (SMB) protocol. Since the WSFC nodes are connected to each other thru the heartbeat network (it’s another reason to have a dedicated network for inter-node communication,) CSV can take advantage of this route to the shared storage to send I/O commands.

Don’t be confused. While all of the WSFC nodes have both physical and logical paths to the shared disk, only the node that owns the SQL Server clustered resource can own it, just like when using traditional shared disks. In other words, the owner node dictates how to send the I/O commands – either directly thru its own access path or thru the heartbeat network. This is why you won’t see any direct dependency on the shared disk if you look at the Dependency Report. It also means that the failover process will be faster with CSVs.

If you’re using SQL Server 2014 or later and have built a cluster, check this feature out and see if it fits in your environment.

Approximation Or Classification?

A blog post on the Algolytics blog discusses different approximation and classification models and when to use each:

Even if your target variable is a numeric one, sometimes it’s better to use classification methods instead of approximation ones. For instance if you have mostly zero target values and just a few non-zero values. Change the latter to 1, in this case you’ll have two categories: 1 (positive value of your target variable ) and 0. You can also split numerical variable into multiple subgroups : apartment prices for low, medium and high by equal subset width and predict them using classification algorithms. This process is called discretization.

Both types of models are common in machine learning, so a good understanding of when to use which is important.


June 2018
« May