Press "Enter" to skip to content

Author: Kevin Feasel

Wrapping Up Azure Data Factory

Cathrine Wilhelmsen wraps up a long series on Azure Data Factory with three final posts. First is lookups:

Lookups are similar to copy data activities, except that you only get data from lookups. They have a source dataset, but they do not have a sink dataset. (So, like… half a copy data activity? :D) Instead of copying data into a destination, you use lookups to get configuration values that you use in later activities.

And how you use the configuration values in later activities depends on whether you choose to get the first row only or all rows.

From there, it’s the bottom line question:

Congratulations! You’ve made it through my entire Beginner’s Guide to Azure Data Factory 🤓 We’ve gone through the fundamentals in the first 23 posts, and now we just have one more thing to talk about: Pricing.

And today, I’m actually going to talk! You see, in November 2019, I presented a 20-minute session at Microsoft Ignite about understanding Azure Data Factory pricing. And since it was recorded and the recording is available for free for everyone… Well, let’s just say that after 23 posts, I think we could both appreciate a short break from reading and writing

In case you missed anything, Cathrine has a summary and shows where you can learn a lot more:

After this, I will be taking a break from creating new content. However, I will continue to edit, update, tweak, rewrite, and improve all 25 posts already published. I originally published one post per day as an Azure Data Factory Advent Calendar, and even while writing I noticed things that I didn’t have time to cover or things that I wanted to go back and improve. But! I needed to get all the posts published first. I consider this the first edition of the series. Now, the editing begins. Then, I will do my best to keep the content updated as Azure Data Factory keeps evolving

This was a huge series; kudos to Cathrine for putting it all together.

Comments closed

Calculating the Pain of UDFs

Taiob Ali points out something added to SQL Server 2017 (and later 2016 and 2014):

Microsoft SQL Server Management Studio (SSMS) version 17.5 added new showplan attributes UdfCpuTime and UdfElapsedTime to QueryTimeStats. These two attributes will measure the time and CPU spent on user-defined functions within a query execution hence helping to discover the impact of UDF execution within full query execution. This feature was first added in SQL Server 2017 CU3 and was backported to SQL Server 2016 SP2. Finding the execution time and CPU for UDF was always a challenge for Data professionals because the number of times a function will execute will vary.

This was a blind spot for a very long time.

Comments closed

Computed Columns and Temporal Tables

Randolph West shows us how to include computed columns in temporal tables:

As I say in my Back to the future with Temporal Tables session, there are several limitations we should be aware of with this feature. One I had not previously considered is that computed columns are not permitted in a history table. This is a problem if we are converting an existing auditing process to make use of temporal tables, and we have computed columns in our table.

Computed columns are columns that are created from an expression based on one or more existing columns in a table. They are useful if we need to index a portion of a wider column, for example.

Hat tip to Erik Darling for coming up with the solution.

Comments closed

Parquet Versus Avro

Matthew Rathbone compares the Parquet and Avro file formats:

JSON improves upon CSV as each row provides some indication of schema, but without a special header-row, there’s no way to derive a schema for every record in the file, and it isn’t always clear what type a ‘null’ value should be interpreted as.

Avro and Parquet on the other hand understand the schema of the data they store. When you write a file in these formats, you need to specify your schema. When you read the file back, it tells you the schema of the data stored within. This is super useful for a framework like Spark, which can use this information to give you a fully formed data-frame with minimal effort.

I was kind of hoping to see ORC in the comparison as well, though even when the Hortonworks-Cloudera competition was at its max, my recollection is that the differences between the two formats were pretty small (where ORC was a little faster for non-nested data and Parquet a little faster for nested data).

Comments closed

Choosing a SQL Platform on Hadoop

Sagar Kewalramani walks us through the choices for SQL platforms on the Cloudera Data Platform:

CDW on CDP is a new service that enables you to create a self-service data warehouse for teams of Business Intelligence (BI) analysts.  You can quickly provision a new data warehouse and share any data set with a specific team or detpartment. Do you remember when you could provision a data warehouse on your own?  Without infrastructure and platform teams getting involved? This was never possible.  CDW fulfills this mission.  

However, CDW makes several SQL engines available, and with more choice comes more opportunities for confusion.   Let’s explore the SQL engines available in CDW on CDP and talk about which is the right SQL option for the right use case.

So many choices!  Impala? Hive LLAP?  Spark? What to use when?  Let’s explore.

Infrastructure and platform teams start to get involved approximately two days after the unexpectedly large bill arrives.

That aside, this is a really nice article covering several platform technologies, including Impala, Hive LLAP, and Spark SQL.

Comments closed

Formatting SSIS Packages

Tim Mitchell shows us different options available when formatting SSIS packages:

Most folks I know go to one extreme or the other when it comes to the visual layout of SQL Server Integration Services packages: either they don’t care a tiny bit about the appearance, or they insist on an easy-to-read layout before a project is considered complete. I am definitely in the second group, and will almost always spend the time to make sure my packages are properly laid out.

Regardless of which of these groups you are in, it is very easy to apply some simple visual formatting to your packages. In this quick tip, I’ll show you how to use the visual formatting options in SSIS to automate the layout.

I consider it quite important—it gives a future viewer an idea of your vision as developer. If I can’t read what you’re doing, that does not bode well.

Comments closed

From SQL Server to Cassandra

Shel Burkow has started a new series:

A subset of related tables in a relational schema can satisfy any number of queries known and unknown at design time. Refactoring the schema into one Cassandra table to answer a specific query, though, will (re)introduce all the data redundancies the original design had sought to avoid.

In this series, I’ll do just that. Starting from a normalized SQL Server design and statement of the Cassandra query, I’ll develop four possible solutions in both logical and physical models. To get there, though, I’ll first lay the foundation.

This initial article focuses on the Cassandra primary key. There are significant differences from those in relational systems, and I’ll cover it in some depth. Each solution (Part III) will have a different key.

Cassandra (as well as Riak, while that was still a thing people cared about) has the concept of tables and SQL statements to work with them, but it’s quite different from a relational database, different enough that new design patterns are necessary. Just about the worst thing you could do would be to drop your relational database schema in Cassandra and call it a day.

Comments closed

Deferred Compilation and Compatibility Level 140 Query Hints

Milos Radivojevic shows that table-valued parameters do not care about your QUERY_OPTIMIZER_COMPATIBILITY_LEVEL_140 settings:

The plan is created under CL 140, but the estimation number of rows for the table variable is not 1 but the actual one. Even if you would specify the FORCE_LEGACY_CARDINALITY_ESTIMATION hint, the query will be deferred compiled and behavior of table variable would be the same.

Table variable deferred compilation respects settings at the database scope and ignores all hints except one.

Click through to learn what that one hint is.

Comments closed

Automating JSON to Tables

Dave Mason has a stored procedure for us:

The code for my stored procedure is below. It has a single input parameter for a string of JSON data. The name/value pairs of the first “row” of JSON data is parsed to obtain column names and types for the result set. A query string for the OPENJSON function is constructed and executed via EXEC. JSON’s support for data types is pretty sparse compared to SQL Server (notice the big CASE expression where I attempt to do some mapping). If you’re using SQL 2016, you’ll have to make an edit for the STRING_AGG function.

Click through for the code and demos.

Comments closed

Parameters, Variables, and ForEach Loops in ADF

Cathrine Wilhelmsen has a few more posts in the Azure Data Factory series for us. First up is on parameters:

We can build dynamic solutions!

Creating hardcoded datasets and pipelines is not a bad thing in itself. It’s only when you start creating many similar hardcoded resources that things get tedious and time-consuming. Not to mention, the risk of manual errors goes drastically up when you feel like you create the same resource over and over and over again.

After that is variables:

Parameters are external values passed into pipelines. They can’t be changed inside a pipeline. Variables, on the other hand, are internal values that live inside a pipeline. They can be changed inside that pipeline.

Parameters and variables can be completely separate, or they can work together. For example, you can pass a parameter into a pipeline, and then use that parameter value in a set variable or append variable activity.

And the latest post in the series is all about ForEach loops:

By default, the foreach loop tries to run as many iterations as possible in parallel. You can choose to run them sequentially instead, for example if you need to copy data into a single table and want to ensure that each copy finishes before the next one starts.

If you choose to run iterations in parallel, you can limit the number of parallel executions by setting the batch count. The default number is 20 and the max number is 50.

This has been a very nice series, and it looks like there is a little bit more to go.

Comments closed