Press "Enter" to skip to content

Curated SQL Posts

Using The Bot Framework

Jakub Kaczmarek demonstrates using the Microsoft Bot Framework:

Before starting a new bot project, you need to consider if it really is a solution for your business case. It’s not recommended to start bot development just because it’s a hot topic. However, in some cases, this kind of software can save a lot of time, money and resources. The following list of bot example use cases might help in making the decision:

  • Answer for typical questions

    • A bot can make use of Q&A knowledge to receive user question and provide an appropriate answer.
    • Questions can be matched to correct answers using a LUIS (language understanding intelligent service) cognitive service.
    • Reduced time can be spent by help desk staff answering typical questions.
    • Example use cases are help chat, contact pages and web stores.
  • Alternative system interface

    • By integrating a bot with external systems (e.g. Outlook, Jira, CRM, SharePoint) a bot can become an alternative interface to work with these systems.
    • A bot can simply ask some questions and gather the answers given by the user to submit data that normally would be filled in on a form.
    • Example use cases are creating support tickets, uploading SharePoint documents, making calendar appointments, and providing translations.
  • Entertainment & education

    • A bot can be also used to entertain and educate its recipients by sending various kinds of content to the user.
    • It’s a good idea to use media types like videos, audio, images and links to knowledge base articles.
    • Example use cases are workout coach, recipes book and product adviser.
  • Notification bot

    • A bot can be scheduled to initialize conversations at appropriate time, notifying the user about some actions or reminding about things he should do.

    • It’s important to remember that sending proactive messages is not always possible – it depends on the channel used for communication.

    • Example use cases are meeting reminders and timesheet reminders.

I try to avoid the term “intelligent bots” because we’re at least 2 or three generations away from that.  But it’s definitely worth getting your hands dirty with them today, at least to learn their limitations.

Comments closed

Migrating Excel Power Pivot Models To SSAS

Imke Feldmann has a walkthrough to show how to migrate a Power Pivot model in Excel into SQL Server Analysis Services by way of Power BI:

In Visual Studio there is a wizard to migrate an Excel Power Pivot model to a SSAS model. But this will not bring over the M-queries unfortunately. But there is a workaround to achieve this. It requires SQL Server 2017 or higher.

Click through for a list of operations and a video showing how it’s done.

Comments closed

Checking For Temp Table Existence

Wayne Sheffield offers a clinic on temp tables:

I was recently reviewing a newly created T-SQL stored procedure. This procedure was verifying temporary table existence with the following code:

Seeing this takes me back to one of my favorite presentations, where I compare Temporary Tables and Table Variables. In this, I go over several of the methods that I have seen for how code found on the internet actually does this task… and I show why they are all doing it wrong.

Read on to understand why this isn’t the correct answer.

Comments closed

Verifying SSIS Database Connections With ssisUnit

Bartosz Ratajczyk shows how to test project-level connections in SQL Server Integration Services with ssisUnit:

Previously we successfully prepared tests for variables and parameters using VariableCommandand and ParameterCommand. Now it’s time to communicate with the database, and for that, I will use connection manager defined on the project level. I know from the ssisUnit tutorials it works perfect with package connection managers, so it’s time to verify it against the projects. I will test the package 10_ProjectCM.dtsx – it is just getting a single value from the table in a database and storing it in a variable. All the packages and unit tests are on my GitHub.

The package contains three SQL Tasks: the first just checks if we can communicate with the database using SELECT 1 statement, the second gets the information from the table, and the third repeats the second on the container level.

Click through for the tests.

Comments closed

Optimizing SSIS Throughput With Buffer Properties

Andy Leonard explains how he uses data flow properties to tune SQL Server Integration Services package performance:

I started answering a question on SQL Community Slack’s #ssis channel and I realized this would be better served as a blog post. The question was about three SSIS Data Flow properties: DefaultBufferSize, Engine Thread and DefaultBufferMaxRows.

I rarely change the EngineThreads property.

DefaultBufferSize and DefaultBufferMaxRows are two ways of managing the size limits of a Data Flow buffer. The two Data Flow Task properties can – and should – be treated as a single property. DefaultBufferSize is the number of bytes per buffer. DefaultBufferMaxRows is the number of rows per buffer. The defaults are 10,485,760 (10M) and 10,000, respectively.

Click through to learn more about these properties.

Comments closed

Optimizing Powershell Module Load Times

Chrissy LeMaire explains ways that the dbatools team reduced how long it takes to load their module:

We noticed that the longest part of importing the module was importing all the extra SMO DLL’s that we require for many of the commands. We import about 150 DLLs and it looks like that number will only grow as we begin to support more functionality (such as Integration services, etc.)

To address this concern, Fred added multi-threading via runspaces to our import process. Too cool! This resulted in a significant decrease in time.

Read on for more details; some of these tips might work on other slow modules, too.

Comments closed

Spark Architecture: The Spark Streaming Receiver

Oleksii Yermolenko gives us an overview of the Receiver object in Spark Streaming:

The key component of Spark streaming application is called Receiver. It is responsible for opening new connections with the sources, listening events from them and aggregating incoming data within the memory. If receiver’s worker node is running out of memory, it starts using disk storage for persistence operations. But this negatively impacts the overall application’s performance.

All incoming data is first aggregated within receiver into chunks called Blocks. After preconfigured interval of time called batchInterval Spark does logical aggregation of these blocks into another entity called Batch. Batch has links to all blocks formed by receivers and uses this information for generation of RDD. This is the main Spark’s entity which is used by the engine for the operations upon the data. Normally RDD would consist of a number of partitions where each partition would reference the block generated by the receiver on the start stage. Streaming application can have lots of receivers located at different physical nodes, so the actual data would be distributed across the cluster from the start. Batch interval is global for the whole application and is defined on the stage of creation of Streaming Context. Block generation interval is a receiver based property which could be defined through the configuration of  spark.streaming.blockInterval property. By default blocks would be generated every 200ms but you can tune this property according to the nature of your data.

Read the whole thing, which includes some tips on design.

Comments closed

Hadoop 3.1 Released

Wangda Tan and Vinod Kumar Vavilapalli have a post on Hadoop 3.1.0:

This release is *not* yet ready for production use. Critical issues are being ironed out via testing and downstream adoption. Production users should wait for a 3.1.1/3.1.2 release.

The Hadoop community fixed 768 JIRAs (https://s.apache.org/apache-hadoop-3.1.0-all-tickets) in total as part of the 3.1.0 release. Of these fixes:
– 141 in Hadoop Common
– 266 in HDFS
– 329 in YARN
– 32 in MapReduce
Apache Hadoop 3.1.0 contains a number of significant features and enhancements.

YARN supporting GPUs and FPGAs is very interesting.

Comments closed

Deleting Lots Of Data

Kenneth Fisher wants to delete a lot of rows:

I recently had the task of deleting a bit over a billion rows from a table. Now I could have done just this:

DELETE FROM tablename WHERE createdate >= '1/1/2017'

But I have a few problems here. The table has no index on createdate, potentially causing problems with tempdb (the sort on createdate). Although in this case tempdb is pretty large because of some large batch work done at various times. I’m also going to be deleting > billion rows of ~6 billion which is probably going to fill up the log of the database (which fortunately isn’t in use yet) and end up rolling back my delete anyway. Even if I don’t fill up the log, I’m still going to bloat it pretty badly (autogrowth). And last, and anything but least, this is on a production server. Even if this database was on its own drive (meaning growth of the log can’t cause a problem with any other databases) that tempdb thing (let alone other resource usage) is going to be an issue.

Read on to see how to delete in batches.  My pattern is to have an explicit transaction within the WHILE loop, opening and closing for each deletion operation.  That has worked pretty well in the past when deleting large numbers of rows from a table.  It might also make sense to put a temporary filtered index on the table, dropping it afterward.

Comments closed

The Shuffling Operator And Azure SQL DW

Arun Sirpal is ready to deal:

For the purposes of this post the TSQL shown is elementary (don’t be surprised by that), the point is really about SHUFFLE. So, I select the estimated plan for the following code.

SELECT SOD.[SalesOrderID],SOD.[ProductID], SOH.[TotalDue]
FROM [SalesLT].[SalesOrderDetail] SOD
JOIN [SalesLT].[SalesOrderHeader] SOH ON
SOH.[SalesOrderID] = SOD.[SalesOrderID]
WHERE SOH.[TotalDue] > 1000

Shuffle me once, why not shuffle me twice. If you REALLY want to see the EXPLAIN command output, then it looks like this snippet below.

The DSQL operation clearly states SHUFFLE_MOVE. Why am I getting this? What does it mean?

Shuffling data isn’t the worst thing in the world, but it is a fairly expensive operation all things considered.  Ideally, your warehouse architecture limits the number of shuffle operations, but considering that you can only hash on one key, sometimes it’s inevitable.

Comments closed