Press "Enter" to skip to content

Curated SQL Posts

Using The Spark Connector To Speed Up Data Loads

Denzil Riberio explains how you can use the Spark connector for Azure SQL DB and SQL Server to speed up inserting data from Spark into SQL Server 15x over the native JDBC client:

Since the load was taking longer than expected, we examined the sys.dm_exec_requests DMV while load was running, and saw that there was a fair amount of latch contention on various pages, which wouldn’t not be expected if data was being loaded via a bulk API.

Examining the statements being executed, we saw that the JDBC driver uses sp_prepare followed by sp_execute for each inserted row; therefore, the operation is not a bulk insert. One can further example the Spark JDBC connector source code, it builds a batch consisting of singleton insert statements, and then executes the batch via the prep/exec model.

It’s the power of bulk insertion.

Comments closed

Reducing Backup Sizes

Ned Otter gives us some options for reducing the size of full backups:

SPARSE columns

IF a column contains mostly NULLs, then depending on the data type, you can achieve space savings by using the SPARSE property (documentation here). SPARSE columns can be used with filtered indexes to theoretically reduce storage space and increase query performance. But there are a boatload of gotchas, such as issues with query plan caching (filtered indexes), and the fact that if you use SPARSE columns, neither the table or indexes can have any form of compression (the documentation is clear about not supporting table compression, but does not mention index compression being an issue – but it is).

As the documentation clearly states, when converting a column from non-sparse to sparse, the following steps are taken:

  1. Adds a new column to the table in the new storage size and format
  2. For each row in the table, updates and copies the value stored in the old column to the new column
  3. Removes the old column from the table schema
  4. Rebuilds the table (if there is no clustered index) or rebuilds the clustered index to reclaim space used by the old column

For large tables with even a few columns that you wanted to convert to SPARSE, this process would take forever, because you must do this for each column you want to convert.

I don’t like sparse columns at all, but I do like the rest of Ned’s options.

Comments closed

Updated DIML Catalog Browser

Andy Leonard announces a new version of his Catalog Browser utility:

Catalog Browser first displays the reference mapping in the context of the environment named DEV_Person. DEV_Person is a Catalog Environment that contains a collection of Catalog Environment Variables.

Catalog Browser next displays the reference mapping in the context of the SSS Connection Manager named AdventureWorks2014.OLEDB that consumes the Reference between the DEV_Person environment and the Load_Person.dtsx SSIS package. Note that this Reference Mapping is displayed as <Property Name> –> <Environment Variable Name>, or “ConnectionString –> SourceConnectionString”. Why? Catalog Browser is displaying the Reference Mapping from the perspective of the Connection Manager property.

The third instance of Values Everywhere is shown in the Package Connection References node. Remember, a reference “connects” a package or project to an SSIS Environment Variable (learn more at SSIS Catalog Environments– Step 20 of the Stairway to Integration Services).  From the perspective of the reference, the reference mapping is displayed as  <Environment Variable Name> –> <Property Name>, or “SourceConnectionString –> ConnectionString”. Why? Catalog Browser is displaying the Reference Mapping from the perspective of the Reference.

Check it out.

Comments closed

Update Statistics After An Upgrade

Erin Stellato gives us some good life advice:

There are a variety of methods we use for helping customers upgrade to a new SQL Server version, and one question we often get asked is whether or not statistics need to be updated as part of the upgrade process.

tl;dr

Yes.  Update statistics after an upgrade.  Further, if you’re upgrading to 2012 or higher from an earlier version, you should rebuild your indexes (which will update index statistics, so then you just need to update column statistics).

Make this one of your “Not too long; did read” posts of the day.

Comments closed

Choosing Between SQL, M, And DAX

Paul Turley chooses the form of his destructor:

A Date dimension table is an essential component in most any data warehouse or reporting database so techniques to generate these tables have been around for a long time.  The foundation of a Date dimension table is a table containing one row per contiguous date in a range that includes every possible transaction date or fact record.  To make reporting easier, it is common practice to have multiple date dimensions in the semantic model.  For example, if sales transaction facts have an Order Date and a Delivery Date, and both are used independently for reporting; there may be an Order Date dimension and a Delivery Date dimension in the model.

A common practice for building the dimension table is to just populate a single Date type column with the sequential date values.  After these rows are inserted, date part functions may be used to populate additional columns by referencing the Date value in an expression.  Most every language includes, for example, a MONTH() and YEAR() function to convert a date value into these date parts.

I’m hoping that Paul puts together several of these types of post, where he contrasts building something in SQL, M, and DAX so we can see which language helps most where.

Comments closed

Error Processing SSIS Task When TargetServerVersion Is SQL Server 2016

Shabnam Watson diagnoses an error condition when trying to run an Analysis Services processing task inside SQL Server Integration Services:

I ran into this problem a while ago at a client. They upgraded from Visual Studio 2013 to 2015 and the SSAS processing tasks started to error out immediately. The solution turned out to be setting the TargetSeverVersion to anything but SQL Server 2016. In this case, it was set to 2014 and that fixed the error.

Recently I ran into this post https://twitter.com/SQLKohai/status/994335086425399297 by Matt Cushing (@SQLKohai) and decided to dig in more.  Initially when I tested it, all was working fine. After I installed SSDT 2015 to test, I started getting the same error in SSDT 2017.  I played around with a DLL and got SSDT 2017 to work with all TargetVersionServers again. At the end I managed to break it again after I went through an uninstall and reinstall of all versions of SSDT. The reason I did the reinstall of SSDT was that I thought I might have had a broken registry entry that I was hoping the installation would fix. This did not work!

Read on for the solution and a detailed dive into the problem.

Comments closed

Non-Blocking Aggregations

Daniel Hutmacher tilts at windmills:

It’s not entirely uncommon to want to group by a computed expression in an aggregation query. The trouble is, whenever you group by a computed expression, SQL Server considers the ordering of the data to be lost, and this will turn your buttery-smooth Stream Aggregate operation into a Hash Match (aggregate) or create a corrective Sort operation, both of which are blocking.

Is there anything we can do about this? Yes, sometimes, like when those computed expressions are YEAR() and MONTH(), there is. But you should probably get your nerd on for this one.

There are many ways to solve a problem, and sometimes the best method is indirect.

Comments closed

Building A Neural Network With TensorFlow

Julien Heiduk gives us an example of building a neural network with TensorFlow:

To use Tensorflow we need to transform our data (features) in a special format. As a reminder, we have just the continuous features. So the first function used is: tf.contrib.layers.real_valued_column. The others cells allowed to us to create a train set and test set with our training dataset. The sampling is not the most relevant but it is not the goal of this article. So be careful! The sample 67-33 is not the rule!

It’s probably an indicator that I’m a casual, but I prefer to use Keras as an abstraction layer rather than working directly with TensorFlow.

Comments closed

Scientific Debt

David Robinson gives us the data science analog to technical debt:

In my new job as Chief Data Scientist at DataCamp, I’ve been thinking about the role of data science within a business, and discussing this with other professionals in the field. On a panel earlier this year, I realized that data scientists have a rough equivalent to this concept: “scientific debt.”

Scientific debt is when a team takes shortcuts in data analysis, experimental practices, and monitoring that could have long-term negative consequences. When you hear a statement like:

  • “We don’t have enough time to run a randomized test, let’s launch it”
  • “To a first approximation this effect is probably linear”
  • “This could be a confounding factor, but we’ll look into that later”
  • “It’s directionally accurate at least”

you’re hearing a little scientific debt being “borrowed”.

Read the whole thing.  I strongly agree with the premise.

Comments closed

Is Query Folding Taking Place?

Chris Webb has a quick way of seeing if query folding is taking place when importing data from Analysis Services into Power Query (either Power BI or Excel):

As a quick follow-on from last week’s post on how to detect whether query folding is taking place when importing from OData data sources, if you’re importing data from Analysis Services you have a similar problem: how do you know whether query folding is taking place? Ensuring that query folding takes place for as many of the steps in your query – especially those that filter or otherwise reduce the amount of data returned – is very important for data refresh performance.

Although the Power Query engine generates MDX queries when importing from Analysis Services in the same way it generates SQL queries when it imports from a relational database, the View Native Query option doesn’t work for Analysis Services data sources. You can of course use a Profiler trace or xEvents to see the MDX, but for most users that will not be an option for security reasons.

Read on for a better alternative.

Comments closed