
Curated SQL Posts

You Should Use Biml

Meagan Longoria explains why you should use Biml if you’re building Integration Services packages:

Biml provides a way to automate SSIS design patterns. This reduces the time required to complete a data integration project, and it helps employ consistent design patterns within and across projects. Re-generating multiple packages after making a change to a design pattern takes just a few minutes, so small changes to several similar packages are no longer a significant effort.

Automating SSIS design patterns allows teams to work more efficiently. Senior developers can stop solving the same problems over and over again. Instead, they can solve them once, automate the solution, and move on to new and interesting challenges. Junior developers still learn good development practices with Biml, but they require less supervision to create quality output in a shorter amount of time. SSIS developers that prefer typing code over the drag-and-drop interface of SQL Server Data Tools now get a better way to work in addition to the automation capabilities.

If there’s one piece of advice I can give ETL developers, it’s “learn Biml.”


Stored Procedure Last Run Times

Richie Lee has a script to see when stored procedures were last executed:

Quick script to get the last time a stored procedure was executed in the database. The reason for the seemingly over-engineered script is that different query plans can be generated, meaning that stored procedures can appear more than once in the list.

The query doesn’t quite work as-is, but making qs.execution_count into an aggregation and removing it from the GROUP BY would work.  I’d probably rewrite it to look a bit more like:

WITH querystats AS
(
    SELECT
        OBJECT_NAME(qt.objectid) AS ProcedureName,
        SUM(qs.execution_count) OVER (PARTITION BY OBJECT_NAME(qt.objectid)) AS ExecutionCount,
        qs.creation_time AS CreationTime,
        ROW_NUMBER() OVER (PARTITION BY OBJECT_NAME(qt.objectid) ORDER BY creation_time DESC) AS rownum
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
    WHERE
        qt.[dbid] = DB_ID()
        AND OBJECT_NAME(qt.objectid) LIKE '%%' -- '%%' matches everything; substitute a pattern here to filter to specific procedures
)
SELECT
    qs.ProcedureName,
    qs.ExecutionCount,
    qs.CreationTime
FROM querystats qs
WHERE
    qs.rownum = 1;
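
If you'd rather skip the windowing and just want a quick answer, sys.dm_exec_procedure_stats (SQL Server 2008 and later) exposes execution_count and last_execution_time per procedure directly. It shares the same limitation, in that it only covers procedures whose plans are still in cache, but as a rough sketch:

-- Per-procedure execution stats for plans currently in cache
SELECT
    OBJECT_NAME(ps.[object_id], ps.database_id) AS ProcedureName,
    ps.execution_count AS ExecutionCount,
    ps.last_execution_time AS LastExecutionTime
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.last_execution_time DESC;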

KPI Indicators

Devin Knight’s Power BI Custom Visuals class continues:

Change the Banding type property to one of the following:

  • Increasing is better –  Increasing is best when you’re measuring things like sales or profit. If you go over your profit target that’s a good thing!

  • Decreasing is better – Decreasing is probably best when you’re looking at something like budgeting. Staying under budget is usually a good thing, unless being too far under budget means you won’t get that money again next year, which leads to the last option.

  • Closer is better – This is for when you need your data to land in the middle of a bell curve, meaning that going too high or too low is a bad thing. This is often useful when looking at medical data.  For example, if your blood pressure is too high, that’s a bad thing, but if your blood pressure is too low, that’s also a bad thing. You need to land somewhere in the middle, which is what this option allows.

There’s plenty of good advice here, so check out the video.


Auto-Seeding Availability Groups

John Sterrett is the latest smart person to take a look at automatic seeding of Availability Groups:

The 600 GB databases took about 66 minutes to seed across the network from a primary replica to the secondary replica.  I noticed 1.4 Gbps of consistent throughput during the seeding process. This makes a lot of sense as it caps out around what the storage system can deliver in this environment.

The first thing I would look at for benchmarking throughput for network activity would be the bytes sent per second from the primary replica and bytes received per second on the secondary replicas.

John also includes an extended event session statement to track seeding.  Great read.
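
If you would rather not set up an extended events session, a rough alternative (these DMVs shipped with SQL Server 2016 and are not part of John's post) is to poll the seeding DMVs on the primary replica:

-- History of automatic seeding attempts, one row per attempt
SELECT *
FROM sys.dm_hadr_automatic_seeding;

-- Live transfer statistics while a seeding operation is in flight
SELECT *
FROM sys.dm_hadr_physical_seeding_stats;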


Finding Dates

Derik Hammer shows the right way and the wrong way of using date functions in a WHERE clause:

I then changed my thought process to find the birth date of a 65-year-old whose birthday is today. Then I compare the DateOfBirth column to that static value.

SELECT COUNT(*) FROM dbo.Person p WHERE DateOfBirth <= DATEADD(YEAR,-65,GETDATE())

With the above query I bought myself an index seek and 345 logical reads. That works out to <3% of the cost.

This is true not just for date functions; it applies to almost any scalar function wrapped around a column in a WHERE clause.
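
As a rough sketch of the difference, using the dbo.Person table from the post (the non-sargable version here is representative rather than Derik's exact query):

-- Non-sargable: wrapping the column in a function hides it from the index, forcing a scan.
SELECT COUNT(*)
FROM dbo.Person AS p
WHERE DATEADD(YEAR, 65, p.DateOfBirth) <= GETDATE();

-- Sargable: the bare column can be matched against an index on DateOfBirth, allowing a seek.
SELECT COUNT(*)
FROM dbo.Person AS p
WHERE p.DateOfBirth <= DATEADD(YEAR, -65, GETDATE());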


Startup Stored Procedures

Dave Mason looks at using sp_procoption to execute stored procedures at startup:

If you have more than one task you want to run at startup, you could include code for each task in a single stored procedure. But that’s generally bad coding practice. Go with the modular approach and create a separate stored procedure for each distinct task. With sp_procoption, you can set more than one stored procedure for automatic execution. The MSDN documentation was not clear regarding the order of execution, though. I thought there might be something similar to sp_settriggerorder, but I wasn’t able to find anything like that. I ran a test on SQL 2014 with 3 stored procedures set for automatic execution.

Dave has interesting notes on procedure run order, where these procedures need to live, and even some ideas on what you might put into startup stored procedures.
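
For reference, flagging a procedure for automatic execution looks something like the following. Here dbo.LogStartupTime is a hypothetical procedure; keep in mind that startup procedures must live in the master database and cannot take input or output parameters:

USE [master];
GO
-- Mark the procedure to run automatically each time the instance starts
EXEC sp_procoption
    @ProcName = N'dbo.LogStartupTime',
    @OptionName = 'startup',
    @OptionValue = 'on';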


CASE Statements In GROUP BY Clauses

Grant Fritchey looks at CASE statements within GROUP BY clauses:

The same basic set of structures, scans against both tables, to arrive at the data. Cost estimates between the two plans are very different though, with the targeted queries having a much lower estimated cost.

Performance-wise, interestingly enough, the average execution time of the first query, only returning the 10 rows, is 157ms on average, while the query grouping directly on the SalesPersonID averages about 190ms. Now, the reads tell a slightly different story with 17428 on the generic query and 5721 on the specific query. So, maybe a server under load will see a significant performance increase. However, let’s deal with what we have in front of us and say that, at least for these tests, the catch-all GROUP BY query performs well.

Grant’s recommendation is to split this out into several procedures, and if you’re having performance problems, that’s a solid move.  I’m a bit more likely to keep them (especially in warehousing reports), but it’s nice to have options.
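
To give a feel for the pattern under test, here is a generic sketch against AdventureWorks (not Grant's exact queries), with the catch-all query grouping on a CASE expression driven by a parameter:

DECLARE @GroupBy varchar(10) = 'Person';  -- hypothetical parameter picking the grouping level

SELECT
    CASE @GroupBy
        WHEN 'Person' THEN CAST(soh.SalesPersonID AS varchar(20))
        WHEN 'Territory' THEN CAST(soh.TerritoryID AS varchar(20))
        ELSE 'All'
    END AS GroupingKey,
    SUM(soh.TotalDue) AS TotalDue
FROM Sales.SalesOrderHeader AS soh
GROUP BY
    CASE @GroupBy
        WHEN 'Person' THEN CAST(soh.SalesPersonID AS varchar(20))
        WHEN 'Territory' THEN CAST(soh.TerritoryID AS varchar(20))
        ELSE 'All'
    END;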


Basic Spark Terminology

Denny Lee and Jules Damji explain some of the key terms and concepts around Apache Spark:

At the core of Apache Spark is the notion of data abstraction as a distributed collection of objects. This data abstraction, called a Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.

RDDs are immutable distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant, as they track data lineage information to rebuild lost data automatically on failure.

Some of these concepts are new to Spark 2.0, but all are worth learning.


Analyzing Real-Time Data

Manjeet Chayel connects Spark Streaming to Amazon Kinesis and shows how to analyze the data in real time:

To use this post to play around with streaming data, you need an AWS account and the AWS CLI configured on your machine. The entire pattern can be implemented in a few simple steps:

  1. Create an Amazon Kinesis stream.

  2. Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.

  3. Use a simple Java producer to push random IoT event data into the Amazon Kinesis stream.

  4. Connect to the Zeppelin notebook.

  5. Import the Zeppelin notebook from GitHub.

  6. Analyze and visualize the streaming data.

This is a good way of getting started with streaming data.  I’ve grown quite fond of notebooks in the short time that I’ve used them, as they make it very easy for people who know what they’re doing to provide code and information to people who want to know what they’re doing.


Home Labs

Chrissy LeMaire shows off her home lab:

I like to test my scripts against a variety of versions/editions and I don’t like spinning VMs up and down all the time. As for the cost, some people spend their money on golf, Polish pottery and gaming rigs. I spend mine on servers, Belgian beer and travel 😉

As you can see, I also have an old Macbook Pro with 256 SSD, 4TB HDD and 8GB RAM in the mix. It’s for photos and videos, however. And someone gave me an old silver Shuttle from like 2002, but I haven’t had the time to set it up yet.

The “cloud versus local” lab question is a tough call, as both sides have their advantages and disadvantages.
