Press "Enter" to skip to content

Curated SQL Posts

Lookups And Conditionals In Azure Data Factory V2

Alex Whittles shows us how to perform lookups and operations with IF clauses in Azure Data Factory V2:

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF a viable platform for most of our cloud-based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walkthrough of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS, then you’ll also need the SSIS Integration Runtime set up – although that is not needed just for the If Condition.

Read on for an example.
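
To make the pattern concrete, here is a rough sketch of the sort of query a Lookup activity might run, along with the expression an If Condition could use to branch on its result. The table name, activity name, and column below are hypothetical and not taken from Alex's post.

-- Hypothetical query for the Lookup activity (with "First row only" enabled)
SELECT COUNT(*) AS RowsToProcess
FROM dbo.StagingTable;

-- The If Condition activity can then branch on the result with an ADF
-- expression along these lines:
--   @greater(activity('LookupRowCount').output.firstRow.RowsToProcess, 0)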

Connecting To Azure SQL Database From On-Prem

Arun Sirpal shows how to set up a linked server instance between an on-prem SQL Server instance and Azure SQL Database:

You may (or may not) have a requirement to set up a linked server to Azure SQL Database from a locally installed SQL Server. One reason could be to pull down some reports from an Azure SQL Database to a local file share. Whatever your reason, hopefully you will find this blog post useful, because I ran into some complications along the way.

This is what your linked server creation screens in SSMS (SQL Server Management Studio) should look like.

Take advantage of Arun’s hard-earned experience and read his post.
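
If you would rather script it than click through the dialogs, the T-SQL behind those screens looks roughly like this (the linked server name, Azure server, database, and credentials are all placeholders):

EXEC sp_addlinkedserver
    @server     = N'AzureSqlDb',                      -- local name for the linked server
    @srvproduct = N'',
    @provider   = N'SQLNCLI11',                       -- or MSOLEDBSQL, depending on the installed driver
    @datasrc    = N'yourserver.database.windows.net', -- placeholder Azure SQL Database server
    @catalog    = N'YourDatabase';                    -- placeholder database name

EXEC sp_addlinkedsrvlogin
    @rmtsrvname  = N'AzureSqlDb',
    @useself     = N'FALSE',
    @locallogin  = NULL,
    @rmtuser     = N'your_sql_login',                 -- placeholder SQL authentication login
    @rmtpassword = N'your_password';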

Re-Sending An E-Mail From SQL Server

Adrian Buckman shows us how to re-send an e-mail that was sent through DB Mail:

Sometimes emails from SQL Server go missing, especially when you share an inbox with colleagues. On most occasions it doesn’t matter, as the job that generated the email can simply be re-run to produce the email once again. But what about those emails that contain time-specific information? We cannot simply kick off the job again, as it may be coded to evaluate right now and not provide us with the information that was sent two hours ago.

I was faced with exactly this issue the other day – I needed the information from the email that was sent at that given time, so I wrote a bit of code to get the job done, then decided that if I tidied this code up I could make it reusable. Here is what I came up with:

Check out Adrian’s helpful script.
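
This is not Adrian's script, but the general shape of a re-send relies on Database Mail's tables in msdb: pull the original message out of sysmail_sentitems and hand it back to sp_send_dbmail. The mailitem_id below is a placeholder.

DECLARE @recipients NVARCHAR(MAX),
        @subject    NVARCHAR(255),
        @body       NVARCHAR(MAX);

-- Grab the original message (hypothetical mailitem_id)
SELECT @recipients = s.recipients,
       @subject    = s.subject,
       @body       = s.body
FROM msdb.dbo.sysmail_sentitems AS s
WHERE s.mailitem_id = 12345;

-- Re-send it exactly as it was originally sent (uses the default mail profile)
EXEC msdb.dbo.sp_send_dbmail
    @recipients = @recipients,
    @subject    = @subject,
    @body       = @body;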

How Far Along Is That CHECKDB?

Steve Stedman shows us how to get the percentage complete of a DBCC CHECKDB run:

You have a DBCC CHECKDB script running, something like the following, and it may take several hours to run to confirm if there is any corruption in your SQL Server Database.

DBCC CHECKDB ('MyDatabaseName') WITH DATA_PURITY, NO_INFOMSGS;

Then someone asks you the age-old question… When will it be done?

Click through for a quick script and the answer.
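
Whatever Steve's exact script looks like, the core of this kind of check is usually sys.dm_exec_requests, which exposes percent_complete and an estimated completion time for running DBCC commands:

SELECT r.session_id,
       r.command,
       r.percent_complete,
       DATEADD(SECOND, r.estimated_completion_time / 1000, SYSDATETIME()) AS estimated_finish_time
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'DBCC%';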

Using Kafka To Go From Batch To Stream

Stephane Maarek has started a series on transforming a batch process into a streaming process using Apache Kafka.  Part one introduces the topic and two of the four microservices:

Before jumping straight in, it’s very important to map out the current process and see how we can improve each component. Below are my personal assumptions:

  • When a user writes a review, it gets POSTed to a Web Service (REST Endpoint), which will store that review into some kind of database table.

  • Every 24 hours, a batch job (could be Spark) would take all the new reviews and apply a spam filter to filter fraudulent reviews from legitimate ones.

  • New valid reviews are published to another database table (which contains all the historic valid reviews).

  • Another batch job or a SQL query computes new stats for courses. Stats include all-time average rating, all-time count of reviews, 90 days average rating, and 90 days count of reviews.

  • The website displays these metrics through a REST API when the user navigates a website.

Part two finishes up the story:

In the previous section, we learned about the early concepts of Kafka Streams and used them to take a stream and split it in two based on a spam evaluation function. Now, we need to perform some stateful computations, such as aggregations and windowing, in order to compute statistics on our stream of reviews.

Thankfully we can use some pre-defined operators in the High-Level DSL that will transform a KStream into a KTable. A KTable is basically a table that gets new events every time a new element arrives in the upstream KStream. The KTable then has some level of logic to update itself. Any KTable updates can then be forwarded downstream. For a quick overview of KStream and KTable, I recommend the quickstart on the Kafka website.

This is a nice introduction to Kafka Streams using a realistic example.
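
For contrast with the streaming version, the batch-style stats computation described in the fourth bullet of part one might look something like this in SQL; table and column names here are made up purely for illustration:

SELECT r.CourseId,
       AVG(CAST(r.Rating AS DECIMAL(4, 2)))              AS AllTimeAvgRating,
       COUNT(*)                                          AS AllTimeReviewCount,
       AVG(CASE WHEN r.ReviewDate >= DATEADD(DAY, -90, SYSUTCDATETIME())
                THEN CAST(r.Rating AS DECIMAL(4, 2)) END) AS Last90DaysAvgRating,
       COUNT(CASE WHEN r.ReviewDate >= DATEADD(DAY, -90, SYSUTCDATETIME())
                  THEN 1 END)                            AS Last90DaysReviewCount
FROM dbo.ValidReviews AS r
GROUP BY r.CourseId;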

Building Observability Tools At Scale

Kevin Lew and Sangeeta Narayanan give us some lessons learned from building logging and monitoring solutions at Netflix:

We started our tooling efforts with providing visibility into device and server logs, so that our users can go to one tool instead of having to use separate data-specific tools or logging into servers. Providing visibility into logs is valuable because log messages include important contextual information, especially when errors occur.

However, at some point in our business growth, storing device and server logs didn’t scale because the increasing volume of log data caused our storage cost to balloon and query times to increase. Besides reducing our storage retention time period, we addressed scalability by implementing a real-time stream processing platform called Mantis. Instead of saving all logs to persistent storage, Mantis enables our users to stream logs into memory, and keep only those logs that match SQL-like query criteria. Users also have the choice to transform and save matching logs to persistent storage. A query that retrieves a sample of playback start events for the Apple iPad is shown in the following screenshot:

It’s an interesting post.

JSON Output And SSIS

Stacia Varga works around an oddity in the way SSIS reads JSON outputs:

What happened? The T-SQL produces the correct results in SQL Server Management Studio (SSMS). However, in SSIS, the same T-SQL statement in an OLE DB Source in a Data Flow Task produces two rows of data, which adds a line feed into the flat file and renders the JSON unusable.

The problem is visible even before sending output to the flat file.

Click the link to see how Stacia solves this problem.
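
For background, this is the same behavior FOR XML exhibits: a long FOR JSON result comes back to most clients broken into multiple roughly 2 KB rows, and SSMS quietly stitches them together while an OLE DB Source does not. One common workaround, which may or may not be the route Stacia takes, is to force the JSON into a single scalar column, as in the second query below (table and column names are placeholders):

-- Looks like one JSON document in SSMS, but may arrive as multiple rows in SSIS
SELECT c.CustomerID, c.CustomerName
FROM dbo.Customer AS c
FOR JSON PATH;

-- Wrapping it as a scalar subquery returns exactly one row and one column
SELECT (SELECT c.CustomerID, c.CustomerName
        FROM dbo.Customer AS c
        FOR JSON PATH) AS JsonOutput;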

Permissions Error Executing R Scripts

Niels Berglund walks through a permissions error on a new installation of SQL Server 2017 CU 7 with Machine Learning Services:

Cool, all is “A-OK”! A couple of days go by, and I see that there is a Cumulative Update (CU) for SQL Server 2017 – CU7. I install it and do not think much about it. I mean: “what can go wrong, how hard can it be?”. A couple of days later I am busy writing the follow-up post to sp_execute_external_script and SQL Compute Context – when I try to execute sp_execute_external_script, it falls over!

Niels has a couple of false starts that he walks us through, but then lands on a solid answer.
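
If you want a quick smoke test for your own Machine Learning Services installation after applying a CU, the canonical minimal call is a trivial sp_execute_external_script invocation like this one:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT 1 AS HelloWorld;'
WITH RESULT SETS ((HelloWorld INT));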

Trickle Insertion With Clustered Columnstore Indexes

Sunil Agarwal provides a pattern for trickle loading clustered columnstore indexes:

A traditional scenario for loading data into a CCI is a nightly load from one or more data files containing millions of rows. The recommended technique is to load the data with a batch size >= 102400, as explained at https://blogs.msdn.microsoft.com/sqlserverstorageengine/2014/07/27/clustered-column-store-index-bulk-loading-the-data/. However, we are seeing many scenarios where the data source is a parallel data stream (i.e., trickle insert) to be loaded into a CCI for analytics, a typical IoT scenario. A CCI allows concurrent data streams into the same delta rowgroup. However, you will see higher page latch contention as you increase the concurrency. You may wonder why this is so. Well, each delta rowgroup is organized internally as a clustered B-tree index, and the data load follows the pattern of a monotonically increasing clustered index key, causing latch contention on the last page.

Check out Sunil’s post. I also have an older post covering my experiences with CCI trickle loads and three ETL patterns that can work.
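
If you want to experiment with the behavior Sunil describes, a minimal setup looks something like this (the IoT-style table is entirely made up):

CREATE TABLE dbo.SensorReadings
(
    DeviceId    INT            NOT NULL,
    ReadingTime DATETIME2(3)   NOT NULL,
    Reading     DECIMAL(18, 4) NOT NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX CCI_SensorReadings
    ON dbo.SensorReadings;

-- Batches of >= 102,400 rows compress directly into columnstore rowgroups;
-- smaller (trickle) inserts land in a delta rowgroup, whose internal B-tree
-- is where concurrent writers contend for the last page.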
