Press "Enter" to skip to content

Curated SQL Posts

Extracting Numerical Data Points From Images

Matt Allington visualizes changes in the Gartner Magic Quadrant for BI tools:

Today Gartner released the 2019 Magic Quadrant for Business Intelligence. As expected (by me, at least), Microsoft is continuing its trailblazing and now has a clear lead over Tableau in both ability to execute and completeness of vision. I thought it would be interesting to see a trend over time for the last 5 years, as this is the time period that I have been a professional Power BI Consultant. I needed some way to extract the numerical data points from the images I had collected. This article shows you how to do that. Here is the final output – a scatter chart with a play axis in Power BI, of course.

I was just commenting the other day about how somebody should do this and Matt went and did it.


Clearing sysssislog

Eduardo Pivaral shows that you should clear out some of the bigger SSIS tables occasionally:

If you have SQL Server Integration Services installed on your server and you left the default configuration, a table named sysssislog is created in the msdb database. It contains logging entries for packages executed on that instance.

If you are not careful enough, this table can grow uncontrollably over time and can make subsequent insertions very slow.

This kind of batched delete works for more than just the Integration Services tables; it’s a good plan wherever you have a large table and need to delete a fairly significant number of records from it.
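By way of illustration, here is a minimal sketch of the batched delete pattern. This is not Eduardo's exact script; the three-month retention cutoff and the batch size are assumptions you would tune for your environment:

DECLARE @BatchSize int = 5000;
DECLARE @RowsDeleted int = 1;

WHILE @RowsDeleted > 0
BEGIN
    -- Delete in small chunks so each transaction stays short and
    -- the transaction log stays manageable.
    DELETE TOP (@BatchSize)
    FROM msdb.dbo.sysssislog
    WHERE starttime < DATEADD(MONTH, -3, GETDATE());

    SET @RowsDeleted = @@ROWCOUNT;
END;

Because each loop iteration commits separately, a long-running purge doesn't hold locks on the whole table or bloat the log the way a single giant DELETE would.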


sp_executesql WITH RECOMPILE

Erik Darling points out that running sp_executesql WITH RECOMPILE doesn’t really change anything:

This’ll give us the key lookup plan you see above. If I re-run the query and use the 2010-12-30 date, we’ll re-use the key lookup plan.

That’s an example of how parameters are sniffed.

Sometimes, that’s not a good thing. Like, if I passed in 2008-12-30, we probably wouldn’t like a lookup too much.

One common “solution” to parameter sniffing is to tack a recompile hint somewhere.

Click through for Erik’s demonstration.
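To make the distinction concrete, here is a hypothetical sketch (my own illustration, not Erik's demo; the dbo.Users table and the dates are stand-ins) contrasting the hint that doesn't help with the one that does:

-- WITH RECOMPILE on the outer EXEC applies to sp_executesql itself;
-- as Erik shows, it doesn't stop the inner statement's sniffed plan
-- from being cached and reused.
EXEC sys.sp_executesql
    N'SELECT COUNT(*) FROM dbo.Users WHERE CreationDate >= @d;',
    N'@d datetime',
    @d = '2010-12-30'
WITH RECOMPILE;

-- A recompile hint inside the statement itself is what actually
-- forces a fresh plan on each execution.
EXEC sys.sp_executesql
    N'SELECT COUNT(*) FROM dbo.Users WHERE CreationDate >= @d OPTION (RECOMPILE);',
    N'@d datetime',
    @d = '2008-12-30';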


Azure Data Studio: Extension Installation

Grant Fritchey shows how easy it is to add an extension to Azure Data Studio:

If you’re even thinking about experimenting with, let alone actively using, Azure Data Studio, you need to plan on installing a few extensions. Buck Woody has a great list that you should look through in this blog post. If you’re just getting started with Azure Data Studio, I have an introduction here.

Depending on the extension, this could be as simple as a mouse click. However, not all the extensions are that easy. Let’s explore this just a little so that when you do start using Azure Data Studio, things are easy.

You can reasonably install Management Studio and never think about adding extensions. Don’t do that with Azure Data Studio, though: a lot of the benefit comes from its extensibility. And Microsoft tends to add things as extensions before bringing them into the base product.


Working With Skewed Data In Pig

Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data:

The problem is that 3,000 map tasks are launched to read the daily data, and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s too much.

Reducing the number of map tasks is not always an option as it increases the execution time, and the number of files is still high (500 tasks will produce 125,000 files per day), while our goal is to have about 500-700 files daily for this single table.

A well-known solution for this problem is to force a reduce phase. But what if the data is highly skewed, for example, when there are a few event types with hundreds of millions of rows, while others have just a few million rows?

This is a combination of an interesting problem and an interesting solution.


Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams, or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like the Dataset/DataFrame APIs as well as SQL. No more dealing with RDDs directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.


Dealing With Massive Database Counts

Edward Harrison describes some of the issues you run into when working on a SQL Server instance with thousands of databases:

One of the headaches we frequently have to deal with is the excessive amount of time it takes for SQL Servers with the discussed design to go through any planned patches. We saw this while carrying out some planned maintenance for one of our customers: the work simply comprised failing over a SQL Server instance from the active node onto the passive node so that the former active node could be patched. The passive node had had its SQL Server patched earlier in the day while the active node was still on the older patch level.

This SQL Server had 10,000 databases, and all that we needed to do was click the buttons to fail over the SQL Server and then wait for it to come online on the other node. In the end, this failover took nearly four hours to complete because upgrade scripts had to be applied to each of the 10,000 databases.

Even a couple thousand databases on an instance can slow things down considerably. Consider an operation which takes 10 milliseconds per database, like laying out the databases in Object Explorer in SSMS. With 10K databases, that 10ms activity (note that I’m making up the number) would take 100 seconds. And these are often blocking tasks, meaning you’re sitting there spinning your wheels for 100 seconds before you can do the thing you wanted to do. The pain is real. And for server activities, most administrative functions (backups, CHECKDB, index rebuilding, etc.) are done one database at a time, meaning you have a lot of time spent on this stuff.
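As a minimal sketch of that serialization (a hypothetical example, not from Edward's post), here is the classic one-database-at-a-time loop; total runtime grows linearly with the number of databases:

DECLARE @db sysname, @sql nvarchar(max);

DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = 'ONLINE'
      AND database_id > 4;  -- skip the system databases

OPEN dbs;
FETCH NEXT FROM dbs INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Each database is processed strictly in sequence, so 10,000
    -- databases means 10,000 serial executions of this step.
    SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@db) + N') WITH NO_INFOMSGS;';
    EXEC (@sql);

    FETCH NEXT FROM dbs INTO @db;
END;

CLOSE dbs;
DEALLOCATE dbs;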


Installing R From Powershell

Tomaz Kastrun shows us how to install R and RStudio via PowerShell:

For the brevity of this post, I will only download a couple of R packages from the CRAN repository, but the list could go on indefinitely.
There are many ways to retrieve the CRAN packages for a particular R version using PowerShell. I will demonstrate this by using the Invoke-WebRequest cmdlet.
Point the cmdlet to the URL https://cran.r-project.org/bin/windows/contrib/3.5, where the list of all packages for this version is available. But first we need to extract the HTML tag where the information is stored.

There’s quite a bit of code here, but the upside is that you get the ability to automate server installs.


Creating Benchmarks With WorkloadTools

Gianluca Sartori announces a new tool:

Throughout my career, I had to go through the pain of benchmarking often enough to get fed up with all the existing tools and decide to code my own. The result of this endeavor is WorkloadTools: a collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud.

At the moment, the project includes 3 tools:
SqlWorkload – a command line tool to capture, replay and analyze a workload
ConvertWorkload – a command line tool to convert existing workloads (traces and extended events) to the format used by SqlWorkload
WorkloadViewer – a GUI tool to visualize and analyze workload data

Click through for the link to check it out.


Power Query SSIS Source Preview

Joost van Rossum is excited about a new SSIS data source in preview:

First you need to download and install the latest SSDT (version 15.9.0), and since it is still in preview, you can only use it within SSDT or on an Azure Integration Runtime in Azure Data Factory.

If you drag the new Power Query Source to your Data Flow canvas and edit it, you can paste your Power Query script from Power BI (or Excel) into the query textbox. So there is not yet(?) an editor, but this should simplify a lot of tasks that previously could only be solved with .NET scripting.

Read on for more details. Because it’s a preview, there’s still a lot of work yet to do, it seems. But on the bright side, it’s new functionality in Integration Services.
