Press "Enter" to skip to content

Curated SQL Posts

Migrating Excel Power Pivot Models To SSAS

Imke Feldmann has a walkthrough to show how to migrate a Power Pivot model in Excel into SQL Server Analysis Services by way of Power BI:

In Visual Studio there is a wizard to migrate an Excel Power Pivot model to a SSAS model. But this will not bring over the M-queries unfortunately. But there is a workaround to achieve this. It requires SQL Server 2017 or higher.

Click through for a list of operations and a video showing how it’s done.

Comments closed

Checking For Temp Table Existence

Wayne Sheffield offers a clinic on temp tables:

I was recently reviewing a newly created T-SQL stored procedure. This procedure was verifying temporary table existence with the following code:

Seeing this takes me back to one of my favorite presentations, where I compare Temporary Tables and Table Variables. In this, I go over several of the methods that I have seen for how code found on the internet actually does this task… and I show why they are all doing it wrong.

Read on to understand why this isn’t the correct answer.

Comments closed

Verifying SSIS Database Connections With ssisUnit

Bartosz Ratajczyk shows how to test project-level connections in SQL Server Integration Services with ssisUnit:

Previously we successfully prepared tests for variables and parameters using VariableCommandand and ParameterCommand. Now it’s time to communicate with the database, and for that, I will use connection manager defined on the project level. I know from the ssisUnit tutorials it works perfect with package connection managers, so it’s time to verify it against the projects. I will test the package 10_ProjectCM.dtsx – it is just getting a single value from the table in a database and storing it in a variable. All the packages and unit tests are on my GitHub.

The package contains three SQL Tasks: the first just checks if we can communicate with the database using SELECT 1 statement, the second gets the information from the table, and the third repeats the second on the container level.

Click through for the tests.

Comments closed

Optimizing SSIS Throughput With Buffer Properties

Andy Leonard explains how he uses data flow properties to tune SQL Server Integration Services package performance:

I started answering a question on SQL Community Slack’s #ssis channel and I realized this would be better served as a blog post. The question was about three SSIS Data Flow properties: DefaultBufferSize, Engine Thread and DefaultBufferMaxRows.

I rarely change the EngineThreads property.

DefaultBufferSize and DefaultBufferMaxRows are two ways of managing the size limits of a Data Flow buffer. The two Data Flow Task properties can – and should – be treated as a single property. DefaultBufferSize is the number of bytes per buffer. DefaultBufferMaxRows is the number of rows per buffer. The defaults are 10,485,760 (10M) and 10,000, respectively.

Click through to learn more about these properties.

Comments closed

Optimizing Powershell Module Load Times

Chrissy LeMaire explains ways that the dbatools team reduced how long it takes to load their module:

We noticed that the longest part of importing the module was importing all the extra SMO DLL’s that we require for many of the commands. We import about 150 DLLs and it looks like that number will only grow as we begin to support more functionality (such as Integration services, etc.)

To address this concern, Fred added multi-threading via runspaces to our import process. Too cool! This resulted in a significant decrease in time.

Read on for more details; some of these tips might work on other slow modules, too.

Comments closed

Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of  spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Comments closed

Hadoop 3.1 Released

Wangda Tan and Vinod Kumar Vavilapalli have a post on Hadoop 3.1.0:

This release is *not* yet ready for production use. Critical issues are being ironed out via testing and downstream adoption. Production users should wait for a 3.1.1/3.1.2 release.

The Hadoop community fixed 768 JIRAs (https://s.apache.org/apache-hadoop-3.1.0-all-tickets) in total as part of the 3.1.0 release. Of these fixes:
– 141 in Hadoop Common
– 266 in HDFS
– 329 in YARN
– 32 in MapReduce
Apache Hadoop 3.1.0 contains a number of significant features and enhancements.

YARN supporting GPUs and FPGAs is very interesting.

Comments closed

Deleting Lots Of Data

Kenneth Fisher wants to delete a lot of rows:

I recently had the task of deleting a bit over a billion rows from a table. Now I could have done just this:

DELETE FROM tablename WHERE createdate >= '1/1/2017'

But I have a few problems here. The table has no index on createdate, potentially causing problems with tempdb (the sort on createdate). Although in this case tempdb is pretty large because of some large batch work done at various times. I’m also going to be deleting > billion rows of ~6 billion which is probably going to fill up the log of the database (which fortunately isn’t in use yet) and end up rolling back my delete anyway. Even if I don’t fill up the log, I’m still going to bloat it pretty badly (autogrowth). And last, and anything but least, this is on a production server. Even if this database was on its own drive (meaning growth of the log can’t cause a problem with any other databases) that tempdb thing (let alone other resource usage) is going to be an issue.

Read on to see how to delete in batches.  My pattern is to have an explicit transaction within the WHILE loop, opening and closing for each deletion operation.  That has worked pretty well in the past when deleting large numbers of rows from a table.  It might also make sense to put a temporary filtered index on the table, dropping it afterward.

Comments closed

The Shuffling Operator And Azure SQL DW

Arun Sirpal is ready to deal:

For the purposes of this post the TSQL shown is elementary (don’t be surprised by that), the point is really about SHUFFLE. So, I select the estimated plan for the following code.

SELECT SOD.[SalesOrderID],SOD.[ProductID], SOH.[TotalDue]
FROM [SalesLT].[SalesOrderDetail] SOD
JOIN [SalesLT].[SalesOrderHeader] SOH ON
SOH.[SalesOrderID] = SOD.[SalesOrderID]
WHERE SOH.[TotalDue] > 1000

Shuffle me once, why not shuffle me twice. If you REALLY want to see the EXPLAIN command output, then it looks like this snippet below.

The DSQL operation clearly states SHUFFLE_MOVE. Why am I getting this? What does it mean?

Shuffling data isn’t the worst thing in the world, but it is a fairly expensive operation all things considered.  Ideally, your warehouse architecture limits the number of shuffle operations, but considering that you can only hash on one key, sometimes it’s inevitable.

Comments closed

The Most Powerful Force In The Universe, In DAX

Matt Allington talks about compound interest and shows how to calculate it in DAX:

Now back to the point of this article.  Compounding growth is very easy to do in Excel because you can write individual cell formulas (to do what ever you want), and each new formula can reference the answer from the previous formula as the starting point for the new formula. There is no such ability in the DAX language.  To solve such problems in DAX, you have to change the way you think and start to think about how to write a single formula that will work over an entire TABLE of data (or columns or multiple tables) – no cell by cell individual formulas are possible.

Below I will step you through the process of finding a solution to this problem.  As I often mention in my articles, it is the process that I believe is most important.  I seldom know how to answer a complex DAX problem when I start out (that’s the very definition of “complex”), and instead I follow a process to help me solve the problem.  Take a careful read below.  If you apply the same process when you write your formulas, you will be well on your way to becoming a DAX Superhero.

It’s an interesting problem when your growth rate is not always the same, but Matt has you covered.

Comments closed