Press "Enter" to skip to content

Author: Kevin Feasel

Storytelling With Data

Vik Paruchuri walks through exploratory data analysis using New York City schools data:

Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area.

We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll:

  • Group full by school district.

  • Compute the average of each column for each school district.

  • Convert the school_dist field to remove leading 0s, so we can match our geograpghic district data.

Also check out part 1 if you missed it.

Comments closed

Suppress Assembly Output

Richie Lee shows how to hide assembly load information in a Powershell script:

Recently I’ve been wondering how I can suppress the output in my PowerShell scripts when loading assemblies into them. I used to find them useful; but now I find them annoying and they are no substitute for error handling ( I used to find them handy as a way of telling me that the script had got this far in the script).

There is more than one way to suppress output to the console, but for assembly loading, I prefer to use [void] because it looks neater than the alternatives:

I’d never used this technique; I always piped to $null.

Comments closed

Using Lookup Components

Thomas LeBlanc explains how to use Lookup components as part of warehouse loading:

The “Fail component” option will stop processing of the import if no match is found in the lookup table. This is not a good option for loading data into a fact table. Most systems would want the import to continue for the customer surrogate keys that are found.

  1. Ignore Failure – Null will replace lookup values selected. Those rows with no match are streamed to the normal flow in package with the Null value in match columns selected.

  2. Redirect rows to error output – red line output will show a failure but can pipe the data to any component. Those rows with no match are not streamed to the normal flow in package.

  3. Fail component – the package stops with a failure if no match exists, if all match there is no failure

  4. Redirect rows to no match output – output can be piped to another component and processing continues. Those rows with no match are not streamed to the normal flow in package.

Lookup components are one of the more powerful components in a data flow task.  Read the whole thing.

Comments closed

Restoring With Standby

Kenneth Fisher describes the WITH STANDBY option for database restorations:

It then leaves the database in a read-only state. So now you can restore the database back to a point in time (even mid log backup). Check the data. Restore a few minutes into the future. Check the data again. Over and over again until you are where you need to be. It’s still going to be tedious but better than doing the full restore over again each time you need to check, right?

On top of that we can use the same idea and combine it with log shipping. Now you not only have a spare in case of a DR situation but that spare can be read only most of the time (except when actually restoring). You can use it to run reports, ad-hoc queries, etc. (Don’t use it for CHECKDBs.)

Those are a few good uses of the WITH STANDBY option.

Comments closed

CISL 1.3.0

Niko Neugebauer has released version 1.3.0 of his Columstore script library:

I am extremely proud to share with everyone the news that the long-awaited and quite overdue release of the CISL – Columnstore Indexes Scripts Library is finally public – the 1.3.0 version. The most important part of this release is the support of the SQL Server 2016 & Azure SQLDatabase – all 3 scenarios (Nonclustered Columnstore, Disk-based Clustered Columnstore & the Memory-Optimised Clustered Columnstore Indexes) are included.
You will be able to explore all the important new architecture objects, such as Deleted Buffer & Deleted Table, plus the scripts for every version supports the new output results, even though there were no In-Memory tables in SQL Server 2012 for example.

If you use columnstore indexes, check this out.

Comments closed

Help With Extended Events

Jason Brimhall has two recent blog posts on figuring out Extended Events information.  First is a republication of an older article:

First let’s tackle the problem of discovery.  When we want to use extended events to try and troubleshoot a problem or to capture more information, it is really good to know if such an event exists.  There are many events that capture data for various different things within SQL Server.  More and more events are being added with each release.  More and more data is being made available to the DBA to help perform a better job and to help the DBA better understand what is really happening within the database environment.

In order to determine if there might be an event, that can provide the data for that one “thing” that may be happening within your environment, we could start by querying the SQL Server Internals.  This next query will do just that for us.

After you read that article and check out the queries there, Jason has another post on finding the right event:

In my previous article I demonstrated how to find an event based solely on the name or description of the event. This is fantastic if the event name (or description) contains one of the magical words you have used. What if the event name or description has nothing to do with the terms you selected? Or, what if the data you seek may be attached to the event but wouldn’t necessarily stand out as a description for that event (by name or description details for that event)?

Now comes the more difficult task right? If the name or description of the event doesn’t relate to the search terms then you just might overlook a few events and be stuck trying to troubleshoot a problem. An equally big problem this could cause is yet another invisible barrier to using Extended Events. It would be easy to slide down the slippery slope and not transition to Extended Events just because an event, applicable to the problem at hand, could not be found.

Check out both of these posts.

Comments closed

Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Guassian clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. similar to each other than entities in other clusters. As opposed to classification problems where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning task.

There are many different clustering algorithms and a central notion in all of those is the definition of ’similarity’ between the entities that are being grouped. Different clustering algorithms may have different ways of measuring the similarity. In many clustering algorithms, another common notion is the so-called cluster center, which is a basis to represent the cluster. For example, in K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.

Comments closed

Rename A Primary Key Constraint

Steve Jones shows how to rename a primary key constraint:

When you compare that with the same table in another database, what’s the likelihood that you’ll have the PK named PK__OrderDet__D3B9D30C7D677BB4? Probably pretty low.

This means that if you are looking to deploy changes, and perhaps compare the deployment from one database to the next, you’ll think you have different indexes. Most comparison tools will then want to change the index on your target server, which might be using this technique. Or the choice might be something that performs much worse.

What we want to do is get this named the same on all databases. In this case, the easiest thing to do with rename the constraint on all systems. This is easy to do with sp_rename, which is better than dropping and rebuilding the index.

Do read this and avoid renaming a constraint the bad way.

Comments closed

Statistic Column Sort Order

Shaun J. Stuart points out an inconsistency in display order for columns on a statistic:

What’s going on? Why are the columns in the statistic not in the same order as the columns in the index? Well, it turns out, they are. If we look on the Details page, we see the density vector is, in fact, created as Col2, Col1, Col3, which is the order of the columns in the index:

Read the whole thing to avoid confusion next time you look at the statistics GUI.

Comments closed