Press "Enter" to skip to content

Curated SQL Posts

Single-Node PySpark

Gengliang Weng, et al, explain that even a single Spark node can be useful:

It’s been a few years since Intel was able to push CPU clock rate higher. Rather than making a single core more powerful with higher frequency, the latest chips are scaling in terms of core count. Hence, it is not uncommon for laptops or workstations to have 16 cores, and servers to have 64 or even 128 cores. In this manner, these multi-core single-node machines’ work resemble a distributed system more than a traditional single core machine.

We often hear that distributed systems are slower than single-node systems when data fits in a single machine’s memory. By comparing memory usage and performance between Spark and Pandas using common SQL queries, we observed that is not always the case. We used three common SQL queries to show single-node comparison of Spark and Pandas:

Query 1. SELECT max(ss_list_price) FROM store_sales

Query 2. SELECT count(distinct ss_customer_sk) FROM store_sales

Query 3. SELECT sum(ss_net_profit) FROM store_sales GROUP BY ss_store_sk

To demonstrate the above, we measure the maximum data size (both Parquet and CSV) Pandas can load on a single node with 244 GB of memory, and compare the performance of three queries.

Click through for the results.

Comments closed

Relationships Between Numerical Features

Stacia Varga continues her exploratory data analysis series using hockey data:

Let’s start with something easy and understandable to analyze. If I put age on the horizontal axis and weight on the vertical axis. It’s a common practice to put an explanatory variable on the horizontal axis and a response variable on the vertical axis. In other words, I’m looking to see how an increase in age (explanation) affects – or not – weight (response) for all the hockey players in the current season, regardless of team.

If I put age on the horizontal axis – does this explain weight? Sort of – the combinations of age and weight have some groupings. It almost appears that there is a greater number of younger, heavier players than older, heavier players, but it’s hard to tell here how the age/weight combinations are distributed because I can’t see all the individual points.

Read the whole thing, while keeping in mind that correlation does not imply causation.

Comments closed

Alerting On tempdb Growth

Lori Brown shows how to use a SQL Agent alert to warn you if tempdb grows beyond a certain size:

Lastly, create a SQL Alert to notify you as soon as tempdb grows past the threshold you stipulate. Using the GUI to create the alert, you need to fill out every field on the General page and make sure the Enabled checkbox is marked. Create a Name for the alerts, then specify the Type as SQL Server performance condition alert. The Object should be Databases, the Counter is Data File(s) Size (KB), and the Instance will be tempdb. The alert will trigger if counter rises above the value. The Value will depend upon the cumulative size of your tempdb files. In this case each tempdb file is 12GB (or 12,288,000 KB), so the total size is 98,304,000 KB.

I liked the approach of only firing the SQL Agent job after a trigger was met, rather than running a job which queries and then creates an e-mail afterward.

Comments closed

Default Displayed Properties In Powershell

Claudio Silva explains the default displayed properties in Powershell and how you can find non-default properties:

First, let me say that this person knows that Select-Object can be used to select the properties we want, so he tried to guess the property name using a trial/error approach.

The person tried:

Get-Service WinRM | Select-Object Startup, Status, Name, DisplayName

and also:

Get-Service WinRM | Select-Object StartupType, Status, Name, DisplayName

But all of them were just empty.

There is a better way.

Comments closed

Switching Between Windows And Linux Containers

Chris Taylor demonstrates a couple ways of switching from Linux to Windows containers in Docker for Windows:

If you are using Docker for Windows and want to switch between Linux or Windows containers you can do this by right clicking the Docker “Whale” in the systray and selecting “Switch to Windows containers”:

….but no one likes clicking around do they!

There is an alternative way to do this which I use in my docker session demo’s which makes things so much easier and the switch is a lot quicker!

Click through for the Powershell call, which has the added benefit of being scriptable.

Comments closed

Automatically Updating dbatools

Garry Bargsley gives us two ways to update dbatools on a schedule:

I have been using dbatools heavily since I was introduced to it.  I have automated processes and created new processes with it.  There are new commands that come out almost daily that fill in certain gaps or enhance current commands.  One way to stay current with these updates is to update your dbatools install frequently.

How better to do this than to have an auto update process that will run daily and get the latest dbatools version for you…

I have put together two ways of doing this based on your preferred method.  One is via a SQL Agent Job and the other is using a Windows Task Scheduler job.

Read on for examples of both techniques.

Comments closed

Changing Docker Named Volume Locations

Andrew Pruski answers an attendee question:

A few weeks ago I was presenting at SQL Saturday Raleigh and was asked a question that I didn’t know the answer to.

The question was, “can you change the location of named volumes in docker?”

This is one of the things that I love about presenting, being asked questions that I don’t know the answer to. They give me something to go away and investigate (many thanks to Dave Walden (b|t) for his help!)

Read on for Andrew’s answer.

Comments closed

Microsoft R Open 3.4.4

David Smith announces Microsoft R Open 3.4.4:

An update to Microsoft R Open (MRO) is now available for download on Windows, Mac and Linux. This release upgrades the R language engine to version 3.4.4, which addresses some minor issues with timezone detection and some edge cases in some statistics functions. As a maintenance release, it’s backwards-compatible with scripts and packages from the prior release of MRO.

MRO 3.4.4 points to a fixed CRAN snapshot taken on April 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).

David also spills the beans on when we’ll see MRO 3.5.0.

Comments closed

Let’s Not Talk About Timestamp

Randolph West hits us with a misnamed SQL Server data type:

It occurred to me that we haven’t covered the TIMESTAMP data type in this series about dates and times.

TIMESTAMP is the Windows Millennium Edition of data types. It has nothing to do with date and time. It’s a row version. Microsoft asks that we stop calling it TIMESTAMP and use ROWVERSION instead.

Much like DECIMAL is a synonym of NUMERIC, so too is TIMESTAMP a synonym of ROWVERSION. Please call it a ROWVERSION and pretend that TIMESTAMP doesn’t exist. Microsoft is deeply sorry for the confusion.

As I say, dates and times are hard.  But at least this is easy:  if you don’t use it, you won’t have problems with it.

Comments closed