Press "Enter" to skip to content

Author: Kevin Feasel

Updates in Azure Synapse Analytics

Saveen Reddy shows how the Synapse product team has been busy this year:

Previously, Synapse workspaces had a kind of database called a Spark Database. Spark databases had two key characteristics:

– Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes)

– Tables in Spark databases could be queried by both Spark pools and by serverless SQL pools.

To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, we have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before. They just have a new name.

Okay, this is the kind of change I can do without. That’s a really dumb name. Spark databases tell you what a thing is. It’s a database which lives in Apache Spark. Lake databases run what? Apache Spark. But if anything really should be called a Lake database, it’d be a serverless SQL pool’s database because everything in there is built on top of the data lake—it’s all external tables pointing to a lake. So calling a Spark database a Lake database brings more confusion than elucidation.

Most of the other changes on that list? Really cool. This one? Not at all.

Comments closed

Variables and Scope in Powershell

Dave Mason continues a quest into the bowels of Powershell:

Let’s talk a little bit about PowerShell variables and how long they exist within the scopes they’re defined. I’ve encountered some behavior that for me, was unexpected. It’s made my development efforts unproductive–especially when it comes to debugging.

Just like with notebooks, it’s important to remember that the Powershell prompt has a session, and that you aren’t running fresh every time. You can also use Dave’s solution to the problem, which makes sense as well.

Comments closed

Using the Fail Activity in Azure Data Factory

Rayis Imayev thinks about failure:

Recently, Microsoft introduced a new Fail activity (https://docs.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity) in the Azure Data Factory (ADF) and I wondered about a reason to fail a pipeline in ADF when my internal being tries very hard to make the pipelines successful once and for all. Yes, I understand a documented explanation that this activity can help to “customize both its error message and error code”, but why?

Click through for Rayis’s take. I’ll just be here cracking jokes about how Fail activities are banned in my code because I expect it to have a positive outlook on life.

Comments closed

A Heap of Pain

Chad Callihan explains the dislike for heaps in SQL Server:

A table is considered a heap when it is created without a clustered index. Data isn’t in any type of ordered state. Some data is over here, some data is over there.

When you are inserting data into a heap, that data is tossed in wherever. Think of it like your junk drawer. It’s not organized into its own little sections. What do you do when you have something to add such as a pair of scissors or an old pen? You open the drawer, toss it in, and close it up without giving it a second thought.

Like Chad mentions, there are uses for heaps. And when you move to Azure Synapse Analytics, there are more uses for heaps. But with on-premises SQL Server, a heap is usually a mistake.

Comments closed

Building an ETL Pipeline with Airflow and Containers

Nikita Vasilev needs to move some data:

Obviously, we can use one of the many ready-made ETL systems that implement the functions of loading information into the corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, Pentaho are just a sliver of off-the-shelf solutions. However, when it comes to large volumes of data at high speeds and Big Data infrastructure already in place, boxed solutions fall flat to satisfy your needs.

Therefore, Big Data pipelines require something like Apache Airflow. It’s an open-source set of libraries for developing, planning, and monitoring workflows. Airflow is written in Python and allows you to create and configure task chains both visually with a clear web-GUI and to write Python program code.

Click through for an example using Airflow with AWS’s Elastic Container Service.

Comments closed

Building 2048 in T-SQL

Tomaz Kastrun gives you a way to slack off at work while everybody else thinks you’re working on a really important SQL problem:

What is 2048 game? It is a classical puzzle game, that is easy and fun to play. The objective of the game is to move the numbers (tiles in the matrix/board) in a way to combine them to create a tile with the number 2048.

Click through to see how to use it and check out the scripts on Tomaz’s GitHub repo. This definitely merits the Wacky Ideas category.

Comments closed

Use Cases for Window Functions

I continue a series on window functions in SQL Server:

To this point, we’ve looked at five classes of window function in SQL Server. I’ve given you a couple of solid use cases, but for the most part, we’ve focused on what the classes of window functions are. Now we’re going to talk about why you want to use them.

As is the norm for me, it takes about seven posts before I explain why you might want to do something.

Comments closed

Data Processing in Data Explorer Pools

Tsuyoshi Matsuzaki shows us how Data Explorer pools work in Azure Synapse Analytics:

In Microsoft Ignite 2021, new Data Explorer (DX) pool in Azure Synapse Analytics is released in preview. You might wonder which one to choose among 3 different analytical pools – Spark pool, Dedicated SQL pool, and DX pool.

In this post, I’ll briefly summarize how data is processed in Data Explorer (Kusto) – Azure Data Explorer (ADX) and Azure Synapse Data Explorer (DX) pool.
I hope this will give you a hint for your optimal analytical platform.

Read on for this explanation.

Comments closed

CONVERT and Binary Styles

Abayomi Obawomiye has style:

I recently had a requirement to load some data from a source table to another destination table. The destination columns were exactly the same as the source columns with the same data types and length. The only difference was that some columns on the destination table must be encrypted. The task was to use the SHA2_512 encryption algorithm to encrypt the “sensitive” data. I will talk more about the encryption algorithm in another post.

To achieve this, I needed to use the HASHBYTES function in SQL Server. The challenge was that this function used with the SHA2_512 encryption algorithm will return a fixed character length of 64 characters which will be longer than the character length on my destination table. As a result, SQL Server will throw a truncation error. I will demonstrate this below.

One really important point: SHA is not encryption; it’s a hash (which is why the function is HASHBYTES() instead of something like EncryptByKey() as column-level security uses). Hashes are intended to be a one-way trip, whereas encryption implies an ability to decrypt if you have the relevant key details. Here, the use looks to be obfuscating the text of sensitive data fields, perhaps for loading in a dev/test environment, and so the actions themselves are quite reasonable.

By the way, the styles Abayomi talks about are all listed in this Docs page. Turns out that if you’re using a money datatype, you can use CONVERT() to display the end result with commas.

Comments closed