Curated SQL Posts

RevoScaleR

Tomaz Kastrun explains how the RevoScaleR package is useful:

The RevoScaleR package and its computational functions were designed for parallel computation with no memory limitation, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame format was designed for fast processing of smaller chunks of data, and it gains its efficiency when reading and writing XDF data by loading chunks into RAM one at a time, and only what is needed. Done this way, there is no limitation on the size of RAM, and computations run much faster (because the algorithms are written in C++, which is faster than the originals, which were written in an interpreted language). The data scientist still makes a single R call, but R will use the DistributedR component to determine how many cores, sockets, and threads are available, and then launch a smaller portion of the load into each thread, analyzing the data a bit at a time. With XDF, data is retrieved many times, but since it is 5-10 times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and it is written and stored in the XDF file the same way it was extracted from memory, it enables faster computations, because no parsing of data chunks is required and because the way the data is stored maximizes the retrieval speed of the data.

If you’re using SQL Server R Services, these rx functions will become very important to you.
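To give a rough idea of where those rx functions show up, here is a minimal sketch of calling RevoScaleR from T-SQL via sp_execute_external_script. The table name is made up, and it assumes SQL Server R Services is installed and external scripts are enabled:

  EXEC sp_execute_external_script
      @language = N'R',
      @script = N'
          # rxSummary is a RevoScaleR rx function; it runs inside the SQL Server R runtime
          print(rxSummary(~ ., data = InputDataSet))
      ',
      @input_data_1 = N'SELECT TOP (10000) * FROM Sales.SalesOrderDetail';  -- hypothetical table

On larger data sets, you would typically point the rx functions at an XDF file built with rxImport rather than an in-memory data frame, which is where the chunked processing Tomaz describes pays off.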

Azure SQL Database Extended Events

Arun Sirpal compares on-prem extended events to what’s available in Azure SQL Database:

There are 22 actions and 261 events. That is naturally fewer than an on-premises SQL Server; for example, on my local 2014 machine, running the above query returned 50 actions and 284 events.

There are a few subtle differences and a couple of not-so-subtle differences, so it's worth digging into if you plan to spin up a database in Azure SQL Database.
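If you want to run the comparison yourself, a query along these lines (not necessarily the exact one Arun used) counts what each side exposes:

  -- Run against an Azure SQL Database and against an on-premises instance to compare.
  SELECT o.object_type, COUNT(*) AS object_count
  FROM sys.dm_xe_objects AS o
  WHERE o.object_type IN ('action', 'event')
  GROUP BY o.object_type;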

Local Azure Data Lake

Julie Koesmarno shows how to set up Azure Data Lake for local testing:

Late last year, I presented a Cognitive Intelligence demo using Azure Data Lake (ADL) at the PASS Summit keynote. It was a fun and quick demo! Watch it here :)

In case you're new to ADL, you can now (since Dec 2015) develop, compile, and run ADL locally in Visual Studio. This is huge, because you don't have to worry about your ADL Analytics Unit (AU) consumption. Plus, it lets you try before you buy!

Click through for the step-by-step installation instructions.

SSRS Sender Display Names

Andy Mallon shows how to customize a display name for a sender e-mail address in Reporting Services:

The business wants the sending email address to be pretty. They want to identify the application, project, company, or… who knows what. The business wants anything but the server name. They might want it to come from NoReply@domain.com or CRM@domain.com or Reports@domain.com or…who knows what.

The problem with these more generic sender addresses is that they don’t help me troubleshoot. Is CRM@domain.com coming from SSRS or the application server? Which SSRS server is Reports@domain.com coming from? I don’t want to check 10 SSRS servers to figure out which one sent a report.

Click through to find out how to make the troubleshooting DBA and the business side happy.

Understanding Memory Grants

Erik Darling explains memory grants in SQL Server:

Our query memory grants range from around 8 MB to around 560 MB. This isn't even ordering by the larger columns; this is just doing the work to sort results by them. Even if you're a smarty pants and you don't use unnecessary ORDER BY clauses in your queries, SQL may inject them into your query plans to support operations that require sorted data. Things like stream aggregates, merge joins, and occasionally key lookups may still be considered a 'cheaper' option by the optimizer, even with a sort in the plan.

Of course, in our query plans, we have warnings on the last two queries, which had to order the VARCHAR(8000) column.

This shows just how much difference a simple column size can make.
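If you want to see the effect yourself, here is a quick sketch; the table and column names are made up, and the point is only the difference in declared column width:

  -- Sort by a narrow column, then by a very wide varchar column on the same table.
  SELECT TOP (1000) * FROM dbo.Users ORDER BY DisplayName;   -- e.g. nvarchar(40)
  SELECT TOP (1000) * FROM dbo.Users ORDER BY AboutMe;       -- e.g. varchar(8000)

  -- While they run, compare the grants:
  SELECT session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
  FROM sys.dm_exec_query_memory_grants;

The wide column forces the optimizer to budget for the worst case, so the grant balloons even when the actual data in the column is short.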

Superheat

David Smith shows off a very cool heatmap package called superheat:

While the superheat package uses the ggplot2 package internally, it doesn't itself follow the grammar of graphics paradigm: the function is more like a traditional base R graphics function with a couple of dozen options, and it creates a plot directly rather than returning a ggplot2 object that can be further customized. But as long as the options cover your heatmap needs (and that's likely), you should find it a useful tool next time you need to represent data on a grid.

The superheat package apparently works with any R version after 3.1 (and I can confirm it works on the most recent, R 3.3.2). This arXiv paper provides some details and several case studies, and you can find more examples here. Check out the vignette for detailed usage instructions, and download it from its GitHub repository linked below.

Click through for some great-looking examples.

Spark UDFs

Curtis Howard explains how to create a user-defined function in Apache Spark:

It’s important to understand the performance implications of Apache Spark’s UDF features.  Python UDFs for example (such as our CTOF function) result in data being serialized between the executor JVM and the Python interpreter running the UDF logic – this significantly reduces performance as compared to UDF implementations in Java or Scala.  Potential solutions to alleviate this serialization bottleneck include:

  1. Accessing a Hive UDF from PySpark as discussed in the previous section.  The Java UDF implementation is accessible directly by the executor JVM.  Note again that this approach only provides access to the UDF from Apache Spark's SQL query language.
  2. Making use of the approach also shown to access UDFs implemented in Java or Scala from PySpark, as we demonstrated using the previously defined Scala UDAF example.

In general, UDF logic should be as lean as possible, given that it will be called for each row.  As an example, a step in the UDF logic taking 100 milliseconds to complete will quickly lead to major performance issues when scaling to 1 billion rows.

Definitely worth a read.  UDFs in Spark can come with a performance penalty, so they aren't free.
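For the first option, calling a Java- or Scala-implemented Hive UDF from Spark's SQL language, the registration looks roughly like this; the class name and table are hypothetical, and it assumes the containing jar is already available to the cluster:

  -- Spark SQL, not T-SQL: register a Hive UDF backed by a JVM class, then call it.
  CREATE TEMPORARY FUNCTION ctof AS 'com.example.hive.udf.CTOF';
  SELECT city, ctof(avg_temp_c) AS avg_temp_f FROM temperatures;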

Parsing Text Fragments

Aaron Bertrand looks at a way of speeding up LIKE %Something% queries and builds a fragment table:

It’s clear that in this specific case – with an address column of nvarchar(60) and a max length of 26 characters – breaking up each address into fragments can bring some relief to otherwise expensive “leading wildcard” searches. The better payoff seems to happen when the search pattern is larger and, as a result, more unique. I’ve also demonstrated why EXISTS is better in scenarios where multiple matches are possible – with a JOIN, you will get redundant output unless you add some “greatest n per group” logic.

Read the whole thing.  If you’re interested in the concept, I recommend reading up on n-grams, like Alan Burstein’s series and this TechNet article on implementing N-Grams in SQL Server.
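A stripped-down version of the idea, with made-up table names and assuming a numbers/tally table, looks something like this:

  -- Store every trailing substring ("fragment") of each address once...
  SELECT a.AddressID,
         Fragment = SUBSTRING(a.Address, n.n, LEN(a.Address) - n.n + 1)
  INTO dbo.AddressFragments
  FROM dbo.Addresses AS a
  JOIN dbo.Numbers AS n ON n.n <= LEN(a.Address);

  CREATE CLUSTERED INDEX CIX_Fragments ON dbo.AddressFragments (Fragment, AddressID);

  -- ...so a leading-wildcard search becomes a seekable trailing-wildcard search.
  -- EXISTS avoids duplicate rows when several fragments of one address match.
  SELECT a.*
  FROM dbo.Addresses AS a
  WHERE EXISTS (SELECT 1 FROM dbo.AddressFragments AS f
                WHERE f.AddressID = a.AddressID
                  AND f.Fragment LIKE N'Something%');

The clustered index on Fragment is what turns the formerly unseekable leading wildcard into a range seek.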

Agent Schedule Unique IDs

Chris Sommer has a gripe about the schedule_uid column in SQL Agent jobs:

When you script out a SQL Agent job, you'll notice that the job schedule will have a schedule_uid parameter (providing your job has a schedule). The gotcha lies in that schedule_uid. If you create another job schedule with the same schedule_uid, it will overwrite the schedule for any jobs that are using it, i.e., any other jobs that are using that schedule_uid will start using the new schedule. Normally I consider UIDs to be very unique, with a low chance of collision, but if you do a fair amount of copying jobs between SQL Servers, there's a good chance this will bite you eventually. That's what happened to us (more than once).

Let's see if I can explain it better with an example.

Something to think about when scripting SQL Agent jobs.
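The offending parameter is easy to spot in a scripted job. A trimmed-down example (all values illustrative) looks like this; deleting the @schedule_uid argument lets SQL Server generate a fresh GUID instead of reusing one copied from another server:

  EXEC msdb.dbo.sp_add_jobschedule
      @job_name          = N'NightlyETL',   -- illustrative job and schedule names
      @name              = N'Daily at 2 AM',
      @enabled           = 1,
      @freq_type         = 4,               -- daily
      @freq_interval     = 1,
      @active_start_time = 20000;           -- 02:00:00
      -- , @schedule_uid = N'...'           -- the copied GUID that causes the collision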

Testing Nested Sets

Nate Johnson continues his nested set series:

Now the complex.  To validate that our tree is properly structured, the following statements need to be true:

  1. Each node’s Right value is greater than its Left.

  2. More to the point, each node’s Right value is greater than all of its ancestors’ Left values.

  3. Similarly, each node’s Left value is less than all of its descendants’ Left values (and Right values, obviously!)

  4. Leaf nodes have no gaps between Left & Right: Right = Left + 1

  5. Depth is easy to verify because we already wrote the rCTE to calculate it!

  6. And of course, no orphans – all ParentIDs lead to an actual parent node, except of course if they’re NULL (root nodes).

Read on for further explanation of these points.
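A few of those rules translate into very short validation queries. Here is a sketch against a hypothetical dbo.Tree (ID, ParentID, Lft, Rgt), not Nate's actual schema:

  -- Rule 1: every node's right value must exceed its left value.
  SELECT t.ID FROM dbo.Tree AS t WHERE t.Rgt <= t.Lft;

  -- Rule 4: leaf nodes (no children) must have Rgt = Lft + 1.
  SELECT t.ID
  FROM dbo.Tree AS t
  WHERE NOT EXISTS (SELECT 1 FROM dbo.Tree AS c WHERE c.ParentID = t.ID)
    AND t.Rgt <> t.Lft + 1;

  -- Rule 6: no orphans; every non-NULL ParentID must point at a real node.
  SELECT t.ID
  FROM dbo.Tree AS t
  WHERE t.ParentID IS NOT NULL
    AND NOT EXISTS (SELECT 1 FROM dbo.Tree AS p WHERE p.ID = t.ParentID);

Each query should return zero rows on a healthy tree.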
