Day: May 23, 2019

Exploratory Data Analysis with inspectdf

Laura Ellis continues a dive into Exploratory Data Analysis, this time using the inspectdf package:

I like this package because it’s got a lot of functionality and it’s incredibly straightforward to use. In short, it allows you to understand and visualize column types, sizes, values, value imbalance & distributions as well as correlations. Better yet, you can run each of these features for an individual data frame, or compare the differences between two data frames.

I liked the inspectdf package so much that in this blog, I’m going to extend my previous EDA tutorial with an overview of the package.

There are some interesting functions which make EDA easier, so check it out.

MRAN Changes and a Survey

David Smith discusses potential changes to MRAN:

As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We’re contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven’t been used. But before we do that, we’d like to hear from the community first. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we’ll use the feedback we gather in our decision-making process. Responses are anonymous, and we’ll summarize the responses in a future blog post. Thanks in advance!

Please take the survey as well. If you’ve used SQL Server Machine Learning Services (or SQL Server R Services), you’ve used MRAN.

PolyBase on Linux

I have a post showing how to set up PolyBase on Linux:

Now that we have SQL Server on Linux installed, we can begin to install PolyBase. There are some instructions here but because we started with the Docker image, we’ll need to do a little bit of prep work. Let’s get our shell on.

First, run docker ps to figure out your container ID. Mine is 818623137e9f. From there, run the following command, replacing the container ID with a reasonable facsimile of yours.

I actually fired up my copy of SimCity 2000 to take a screenshot for this post. The things I do for my audience.

Making SpeedPASSes Better

Wayne Sheffield shares some useful tips for making the SpeedPASS experience better:

As things were wrapping up for our event last year, there was an important change made at the SQLSaturday site for dealing with SpeedPASSes. Previously, the admission ticket was sized differently from everything else, which created hassles in cutting and in using perforated paper (which is why there were custom-printed SpeedPASSes – so that we could make them fit). However, the change was that all of the labels are now the same size. This means that perforated paper can now be used to print out the SpeedPASS PDF files that are generated at the SQLSaturday site.

Additionally, I was looking for ways to automate some of the manual process of merging the PDF files as described in the previous post, especially with the work necessary in the Excel spreadsheet. The end result is that we now have a new process that we used, and I’m sharing it with everyone to help them out.

Read on for the tips, which include perforated paper and PowerShell scripts to tie it all together.

Avoiding DONE Tokens in Loops

Emanuele Meazzo shows one reason why loops can be so much slower in T-SQL:

Not everybody knows that SQL Server sends a DONE token to the client each time that a SQL statement completes (so, everything except variable declarations). For the query above you can basically track it with extended events by tracking the “SQL Statement Completed” event.
What happens in a loop? For each statement that’s completed, a token is sent, which means that for this loop that contains 3 statements (the WHILE loop itself is a statement), 15 million tokens are sent to the client.

There can be a drastic difference, as Emanuele shows.
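
To make the arithmetic concrete, here is a minimal sketch of the pattern (my own illustration using a hypothetical dbo.Numbers table, not Emanuele’s actual query): a three-statement loop that would push roughly 15 million DONE tokens at the client, followed by a set-based rewrite that sends only a handful. You can watch the tokens pile up by tracing the sql_statement_completed extended event.

-- Illustrative only: hypothetical dbo.Numbers table, not the query from the post.
CREATE TABLE dbo.Numbers (n int NOT NULL);

-- Row-by-row loop: every completed statement sends a DONE token to the client,
-- so 5,000,000 iterations x 3 statements per pass is roughly 15,000,000 tokens.
DECLARE @i int = 0;                  -- the declaration itself sends no token

WHILE @i < 5000000                   -- statement 1: the WHILE check
BEGIN
    INSERT INTO dbo.Numbers (n)      -- statement 2: the INSERT
    VALUES (@i);

    SET @i += 1;                     -- statement 3: the increment
END;

-- Set-based rewrite: the whole load is a single statement, so only a handful
-- of DONE tokens ever cross the wire.
INSERT INTO dbo.Numbers (n)
SELECT TOP (5000000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;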

SQL Server 2019 CTP 3.0

The SQL Server team has announced the latest CTP for SQL Server 2019:

Big data clusters
– Scale out by supporting deployment configurations with an increased number of SQL Server instances in the compute pool. You can now specify up to 4 instances in the compute pool for optimal performance of your queries against data pool, storage pool, or other external data sources.
– The mssqlctl utility includes updates to ease the big data cluster management experience with enhancements to the login experience. There is also a new command to list the cluster endpoints.
– Persistent volumes abstract the details of how the storage is provided and how it’s consumed. In this release, we’re enhancing the supported storage configurations by enabling you to customize storage classes independently for logs and data. With these changes, we also consolidated the storage configurations for big data components, so that the number of persistent volume claims for a big data cluster has been reduced for a default minimum configuration cluster.

There are a few other changes announced in this CTP. Now that we’re at 3.0, we can see the light at the end of the tunnel.

Simplifying Imported Biml

David Stein takes us through an example of what imported Biml looks like and how you can make it better:

This post is part of a series, Short Cuts to Biml Mastery which shows you how to learn BimlScript through a series of short cuts and remain productive while doing so. In the previous article, I covered how to import existing packages into Biml using BimlExpress. In this article, we’ll examine that Biml and simplify it in preparation for automation. If you’ve been reading along in this series, you’ll remember that we started with a single SSIS package which extracts data from a SQL source and creates a CSV file with quote identifiers. 

Read on to see the example and David’s first steps toward improving it.
