Press "Enter" to skip to content

Curated SQL Posts

Notes on Wrangling Data Flows

Rayis Imayev calculates the distance between two geographical points in an Azure Data Factory Wrangling Data Flow:

Brian Donovan and Dan Work from the University of Illinois have pointed out that this dataset “contains a large number of errors. For example, there are several trips where the reported meter distances are significantly shorter than the straight-line distance, violating Euclidean geometry.” So that triggered my interest in adding an additional column to this dataset with the straight-line distance between the pickup and dropoff geo-points, and that’s where I wanted Wrangling Data Flows to help me.

Read on for Rayis’s demonstration, as well as a long list of observations (positive and negative) about the current state of Wrangling Data Flows.
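
For reference, the straight-line calculation itself is easy to sanity-check outside of a data flow. Here's a minimal T-SQL sketch using the geography type; Rayis builds the equivalent logic inside the Wrangling Data Flow, so treat this as an illustration of the calculation rather than his method, and note that the table and column names are hypothetical:

```sql
-- Flag trips whose reported meter distance is shorter than the
-- straight-line distance between pickup and dropoff (the geometry
-- violation Donovan and Work describe). Names are hypothetical.
SELECT t.TripId,
       t.TripDistanceMiles,
       geography::Point(t.PickupLat, t.PickupLon, 4326)
           .STDistance(geography::Point(t.DropoffLat, t.DropoffLon, 4326))
           / 1609.344 AS StraightLineMiles  -- STDistance() returns meters
FROM dbo.TaxiTrips AS t
WHERE t.TripDistanceMiles <
      geography::Point(t.PickupLat, t.PickupLon, 4326)
          .STDistance(geography::Point(t.DropoffLat, t.DropoffLon, 4326))
          / 1609.344;
```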


Securing Data on ElasticMapReduce

Duncan Chen takes us through data encryption options when using ElasticMapReduce:

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is Amazon S3 encryption, which allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Click through for more details on each.
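
If you just want the shape of the thing, both options live in an EMR security configuration. This is a rough sketch of the JSON based on my reading of the post; the bucket name and KMS key ARNs are placeholders:

```json
{
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": false,
    "EnableAtRestEncryption": true,
    "AtRestEncryptionConfiguration": {
      "S3EncryptionConfiguration": {
        "EncryptionMode": "SSE-S3",
        "Overrides": [
          {
            "BucketName": "sensitive-data-bucket",
            "EncryptionMode": "SSE-KMS",
            "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
          }
        ]
      },
      "LocalDiskEncryptionConfiguration": {
        "EncryptionKeyProviderType": "AwsKms",
        "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
        "EnableEbsEncryption": true
      }
    }
  }
}
```

You register a file like this with aws emr create-security-configuration and then reference it when launching the cluster.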


Databricks + Azure Synapse Analytics

David Meyer and Clinton Ford explain how you can integrate Azure Databricks with Azure Synapse Analytics:

In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern’s success has been our close integration with Azure SQL Data Warehouse via a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

Something interesting about Synapse is that its implementation of Spark is not the same as the Databricks implementation (perhaps for licensing reasons). But that doesn’t stop us from using Databricks to process and curate data for Synapse Analytics.
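
That jointly-engineered connector is easy to miss if you haven't used it. Here's a hedged PySpark sketch of writing a curated DataFrame from a Databricks notebook to Synapse via the com.databricks.spark.sqldw format; the server, storage account, credentials, and table names are all placeholders:

```python
# Write a curated DataFrame (df) from Azure Databricks to Azure Synapse
# (formerly Azure SQL DW). The connector stages data through blob
# storage (tempDir) and bulk-loads it via PolyBase.
(df.write
   .format("com.databricks.spark.sqldw")
   .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;"
                  "database=mydw;user=loader@myserver;password=<secret>")
   .option("forwardSparkAzureStorageCredentials", "true")
   .option("tempDir", "wasbs://staging@mystorage.blob.core.windows.net/tmp")
   .option("dbTable", "dbo.CuratedSales")
   .mode("overwrite")
   .save())
```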


Capturing Inserts and Updates in MERGE Statements

The Purple Frog folks show us how to collect the counts of insert and update operations when using MERGE statements:

This post shows how you can capture and store the number of records inserted, updated, or deleted by a T-SQL MERGE statement.

This is in response to a question on an earlier post about using Merge to load SCDs in a Data Warehouse.

You can achieve this by using the OUTPUT clause of a merge statement, including the $Action column that OUTPUT returns.

Read on for the answer. If only MERGE weren’t so riddled with problems.
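
The pattern itself fits in a few lines. Here's a minimal sketch of the OUTPUT / $action approach, with illustrative table names:

```sql
-- Capture per-action row counts from a MERGE by routing $action
-- through the OUTPUT clause into a table variable.
DECLARE @MergeLog TABLE (MergeAction nvarchar(10) NOT NULL);

MERGE dbo.DimCustomer AS tgt
USING staging.Customer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, CustomerName)
    VALUES (src.CustomerKey, src.CustomerName)
OUTPUT $action INTO @MergeLog (MergeAction);

-- One row per action type: INSERT, UPDATE, DELETE
SELECT MergeAction, COUNT(*) AS RowsAffected
FROM @MergeLog
GROUP BY MergeAction;
```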


The State of SQL Server Tools

Vicky Harp provides us a general update on things related to SQL Server tooling:

This week we’re announcing the general availability of SQL Server 2019, a significant milestone for Azure Data and for SQL Server customers. This presents a good moment to give an update on the state of tooling for SQL Server.

Since SQL Server 2016, the tools for SQL Server have been released independently “out of box” from the server product. This allows us to be more agile to the needs of our users, get both features and bug fixes shipped more quickly, stay aligned with the more continuous release cycle of Azure SQL, and in general allows the tools team to innovate in exciting ways. However, one side effect is that it can be difficult to understand what’s happening across the tools landscape, as things change quickly in multiple products that are releasing as frequently as every month.

Azure Data Studio has gotten a lot better. SSMS has gotten a little better. We’ve also seen work around several other tools, including command-line options. It’s a good time to be in SQL Server.


Azure Synapse Analytics, Née Azure SQL DW

John Macintire explains Azure Synapse Analytics:

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

There’s a fair amount of marketing-speak in here, but the gist is Azure SQL Data Warehouse + Spark + on-demand serverless queries (so you can, among other things, write T-SQL against your HDFS data). I think it has a better chance of long-lasting success than Azure SQL Data Warehouse.
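
To make the on-demand piece concrete, this is roughly what querying files in the lake looks like from the serverless endpoint. The storage path here is a placeholder; check the docs for the exact syntax your workspace supports:

```sql
-- Query Parquet files directly from storage with the serverless
-- SQL engine; no table or cluster to provision first.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/nyctaxi/trips/*.parquet',
        FORMAT = 'PARQUET'
     ) AS trips;
```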


Licensing for SQL Server Big Data Clusters

Mohammad Darab tackles the licensing question for Big Data Clusters:

One of the biggest questions I had when I first started diving into Big Data Clusters was, “What about licensing… how will that work?” With so many different instances running on the storage pool, data pool, and compute pool nodes, will licensing cost too much? The answer I got from Microsoft was that it would “be competitive”.

Well, with the general availability of SQL Server 2019 this week, Microsoft is making it way more financially attractive than I thought. Below is a summary of the SQL Server 2019 Licensing Guide for Big Data Clusters.

Click through for the explanation. It really is pretty simple, all things considered.


Debugging Azure Data Factory Data Flows

Mark Kromer takes us through debugging Azure Data Factory Data Flows:

When you are designing your mapping data flows in ADF, you are working against a live Azure Databricks Spark cluster. The size of that cluster is configurable via the Azure Integration Runtime. If you do not configure a custom Azure IR, then you will use the default Azure IR. That sets a very small cluster size by default of 4 cores for a single worker node and 4 cores for a single driver node. In most cases, while debugging and using data preview, that should be fine. But when you start exploring your data with column statistics or increase the sampling size in debug settings, you may find that you’ve exceeded the capacity on that small default cluster. Below are the steps you need to take to increase the size of your debug cluster.

Click through for step-by-step instructions.
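
For a sense of what the custom Azure IR looks like under the covers, here's a sketch of the integration runtime JSON with a larger data flow cluster. The sizes are illustrative and the property names are my best reading of the ADF schema, so verify against Mark's steps:

```json
{
  "name": "DataFlowDebugIR",
  "properties": {
    "type": "Managed",
    "typeProperties": {
      "computeProperties": {
        "location": "AutoResolve",
        "dataFlowProperties": {
          "computeType": "General",
          "coreCount": 16,
          "timeToLive": 60
        }
      }
    }
  }
}
```

Point your debug session at an IR like this instead of the default, and data preview and column statistics get the bigger cluster.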


SQL Server Assessment API Generally Available

Ebru Ersan announces the general availability of the SQL Assessment API in PowerShell:

SQL Assessment API is a new mechanism to evaluate the configuration of your SQL Server for best practices. The API methods are used by means of a SQL Server Management Objects (SMO) extension and new cmdlets in the SqlServer PowerShell module. The API is delivered with a ruleset that is highly customizable and extensible. It can be used to assess SQL Server versions 2012 and higher, both on Windows and Linux, as well as Azure SQL DB Managed Instance. More products will be supported in future releases.

SQL Assessment API is shipped as part of the SqlServer PowerShell module (21.1.18206) and the SMO NuGet package (150.18208.0).

It’s a set of reasonably good practices and also lets you customize for your own environment, so check it out.
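
Getting started is a two-cmdlet affair. A short sketch, assuming the current SqlServer module is installed and using a placeholder instance name:

```powershell
# Requires the SqlServer module, 21.1.18206 or later
Import-Module SqlServer

# See which assessment checks apply to an instance
Get-SqlInstance -ServerInstance 'localhost' | Get-SqlAssessmentItem

# Run the assessment and review the recommendations
Get-SqlInstance -ServerInstance 'localhost' | Invoke-SqlAssessment -Verbose
```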


Troubleshooting Tez Performance

Dmitry Tolpeko digs through Tez logs to figure out a performance issue:

Why did it take so long to run the job? Is there any way to improve its performance?

Tez Application Master Log
I am going to use the Tez AM log to investigate vertex performance and find possible bottlenecks.
Note that there is the Timeline Server REST API that you can use to get statistics for Tez jobs, but the application master log is “event-driven”, shows the exact order of all events, and in general contains much more detail.

Click through for the process.
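
If you want to try the same spelunking, the vertex lifecycle events in the AM log are greppable. A rough sketch; the event-line format is an assumption that varies by Tez version and logging configuration, and the log file name is a placeholder:

```bash
# Pull vertex start/finish history events out of a Tez AM log and keep
# just the fields useful for spotting slow vertices. Adjust the
# patterns to match your Tez version's event format.
grep -E 'VERTEX_STARTED|VERTEX_FINISHED' tez-am-syslog.log \
  | grep -oE 'vertexName=[^,]+|timeTaken=[0-9]+|Event:[A-Z_]+'
```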
