June 2020 – Page 14 – Curated SQL

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

That’s the same in the relational world: merge joins are the fastest, assuming that your data is pre-sorted in the proper manner.
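
To make the streaming point concrete, here is a minimal sketch of a sort-merge inner join in Python. It is not Power Query's implementation of Table.Join or any database engine's merge join; the `(key, payload)` row shape and the function name are assumptions made for illustration. The idea it shows is that when both inputs arrive sorted ascending on the join key, each side can be read once, front to back, and only the rows for the current key need to be held in memory.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Iterable, Iterator, Tuple

Row = Tuple[Any, Any]  # (join key, payload); hypothetical row shape for this sketch


def sort_merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[Any, Any, Any]]:
    """Inner-join two inputs that are ALREADY sorted ascending by key.

    Yields (key, left_payload, right_payload) tuples lazily.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    left_group = next(left_groups, None)
    right_group = next(right_groups, None)

    while left_group is not None and right_group is not None:
        left_key, left_rows = left_group
        right_key, right_rows = right_group
        if left_key < right_key:
            # No match possible for this left key; move the left side forward.
            left_group = next(left_groups, None)
        elif left_key > right_key:
            # No match possible for this right key; move the right side forward.
            right_group = next(right_groups, None)
        else:
            # Keys match: buffer only the right-hand rows for this one key.
            right_values = [payload for _, payload in right_rows]
            for _, left_payload in left_rows:
                for right_payload in right_values:
                    yield left_key, left_payload, right_payload
            left_group = next(left_groups, None)
            right_group = next(right_groups, None)


# Illustrative data only; both lists are sorted ascending by key.
orders = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
customers = [(2, "x"), (3, "y"), (4, "z"), (4, "w")]
joined = list(sort_merge_join(orders, customers))
```

If either input is not actually sorted, this sketch silently drops matches, which is why the "you know the data is sorted" condition matters so much.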

Comments closed

Building a Stream Processing App with ksql

Published 2020-06-05 by Kevin Feasel

The Hadoop in Real World team walks us through event streaming with ksql:

ksqlDB is an event streaming database that enables creating powerful stream processing applications on top of Apache Kafka by using the familiar SQL syntax, which is referred to as KSQL. This is a powerful concept that abstracts away much of the complexity of stream processing from the user. Business users or analysts with a SQL background can query the complex data structures passing through Kafka and get real-time insights. In this article, we are going to see how to set up ksqlDB and also look at important concepts in KSQL and its usage.

Event streaming has become a lot easier over the past couple of years, as Kafka, Spark, and Flink have all matured.

Using UDFs in Spark without Registration

Published 2020-06-05 by Kevin Feasel

Sourabh Mehta shows how we can immediately call a user-defined function in Spark without registering it first:

Here, we will demonstrate the use of UDF via a small example.
Use Case: We need to change the value of an existing column of DF/DS to add some prefix or suffix to the existing value in a new column.

I’m actually not sure what benefit you gain from not registering the UDF, but there probably is one.

Bulk Loading SQL Server from .NET

Published 2020-06-05 by Kevin Feasel

Adrian Hills walks us through the SqlBulkCopy class:

Ever been in a situation where rumblings of “process X is too slow” suddenly build into a super-high priority ball of urgency when that next step up in data volume hits? Yeah, that can be fun. No, really, it can be fun because we have strategies to sort this stuff out, right?
In this blog post, I’m going to talk about one particular piece of functionality—SqlBulkCopy—that can help you with bulk data loading. If I had to single out my favorite .NET class, SqlBulkCopy would be at the top of the list. My goal is to introduce you to this class so that maybe it can become a part of your tool belt, too.

Click through to see how it works. If you’re familiar with SSIS, you’re already familiar with the concept if not the specifics.
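
For a rough feel of why batching helps, here is a small Python sketch. To be clear, this is not SqlBulkCopy and not SQL Server: it swaps in Python's standard-library `sqlite3` module as a stand-in, and the `readings` table, its columns, and the `load_in_batches` helper are all hypothetical. What it illustrates is the general shape of a bulk load: stream rows from a source, send them in fixed-size batches, and commit once per batch rather than once per row.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Tuple


def load_in_batches(conn: sqlite3.Connection,
                    rows: Iterable[Tuple[int, float]],
                    batch_size: int = 1000) -> int:
    """Insert rows into the hypothetical `readings` table one batch at a time."""
    iterator = iter(rows)
    total = 0
    while True:
        # Pull at most batch_size rows, so the full source never sits in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # One transaction per batch: commits on success, rolls back on error.
        with conn:
            conn.executemany(
                "INSERT INTO readings (id, value) VALUES (?, ?)", batch
            )
        total += len(batch)
    return total


# Illustrative usage against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
source_rows = ((i, i * 0.5) for i in range(2500))
loaded = load_in_batches(conn, source_rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

The batch size here is arbitrary; the right value depends on row width, the target system, and how much work you are willing to redo if a batch fails.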

Unit Testing Azure Data Factory Pipelines

Published 2020-06-05 by Kevin Feasel

Richard Swinbank walks us through what it takes to run a unit test against an Azure Data Factory pipeline:

In part three of this series I looked at functional tests for ADF pipelines: verifying, in isolation, that pipelines are “doing things right”. In this post I’ll be testing isolated pipelines to check that they’re “doing the right things” – this is one description of a unit test. In a general-purpose programming language, unit tests might be used to verify that an individual line of code is executed, or that it has a particular effect. In Azure Data Factory, the smallest unit of development – a “line of code” – is a pipeline activity. I will be writing tests to verify that specific activities are executed (or not) and to inspect their results.

There’s a fair bit involved in this sort of test.

Power Query Performance Differences in When You Remove Columns

Published 2020-06-05 by Kevin Feasel

Chris Webb continues a series on optimizing Power Query merge performance:

In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:
Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?

Read on for the result, as well as a pleasant surprise around Power BI’s capabilities.

Running Databricks ETL Code from Azure Data Factory

Published 2020-06-05 by Kevin Feasel

Ed Elliott shares several approaches for executing Databricks code from Azure Data Factory:

.Net
No post I write about Spark is going to miss off .Net; if you are writing C# or F# then you can’t use a notebook anyway. If you do this then you will also include the .Net driver JAR file (weird statement to make) and call that.

Read on for all of the techniques.

Backing Up SQL Server Instance Configuration

Published 2020-06-05 by Kevin Feasel

Claudio Silva has started a series on backing up your SQL Server instance configuration. Part 1 walks us through the basic process:

If you have never used this command, you can test for a single instance by running the following:
Export-DbaInstance -SqlInstance "devInstance" -Path "D:\temp"
This will create all scripts in the D:\temp folder. A folder named “devInstance-{date}” will be created.
In this folder, you will find 1 file per ‘object type’. The file names are in the form of “#-.sql” where the # is a number that represents the iterator on the order that the internal calls of the underlying functions happen.

Be sure to read the sections around passwords!

Part 2 introduces parallelism into the mix:

There are a couple of options, like the native PowerShell cmdlets Start-Job/Stop-Job a.k.a background jobs, Runspaces jobs and Thread jobs but I will just mention two of them. One is a nice addition to the most recent version of PowerShell (v7) and the other using a PowerShell module.
In case you don’t know, with PowerShell v7 it’s possible to use a new option -Parallel with ForEach-Object. Check PowerShell’s team blog post PowerShell ForEach-Object Parallel Feature.
However, because I don’t have (yet :-)) PS7, I will keep leveraging the PoshRSJob module, created by Boe Prox, which uses runspaces.

There’s a lot of meat in that second part, so check it out.

Portfolio Optimization with SAS and Python

Published 2020-06-04 by Kevin Feasel

Sophia Rowland shows off the sasoptpy package:

I started by declaring my parameters and sets, including my risk threshold, my stock portfolio, the expected return of my stock portfolio, and covariance matrix estimated using the shrinkage estimator of Ledoit and Wolf (2003). I will use these pieces of information in my objective function and constraints. Now I will need SWAT, sasoptpy, and my optimization model object.

Read on for a demo.
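
To show how those pieces fit together, here is a plain-Python sketch that does not use sasoptpy, SWAT, or any solver. It assumes one common formulation (maximize expected return subject to a cap on portfolio variance), which may differ from the model in the post, and every number in it is made up for illustration. The two formulas it relies on are standard: expected portfolio return is the weighted sum of the asset returns, and portfolio variance is the weights applied to the covariance matrix on both sides.

```python
from typing import Optional, Sequence, Tuple


def portfolio_return(weights: Sequence[float], expected_returns: Sequence[float]) -> float:
    """Weighted sum of the assets' expected returns."""
    return sum(w * r for w, r in zip(weights, expected_returns))


def portfolio_variance(weights: Sequence[float], cov: Sequence[Sequence[float]]) -> float:
    """Quadratic form: sum over i, j of w_i * cov[i][j] * w_j."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j] for i in range(n) for j in range(n))


def best_two_asset_mix(expected_returns: Sequence[float],
                       cov: Sequence[Sequence[float]],
                       max_variance: float,
                       steps: int = 100) -> Optional[Tuple[Tuple[float, float], float]]:
    """Brute-force grid search over two-asset weights (w, 1 - w).

    Returns the feasible mix with the highest expected return, or None
    if no grid point satisfies the risk threshold.
    """
    best = None
    for k in range(steps + 1):
        w = k / steps
        weights = (w, 1 - w)
        if portfolio_variance(weights, cov) <= max_variance:
            ret = portfolio_return(weights, expected_returns)
            if best is None or ret > best[1]:
                best = (weights, ret)
    return best


# Hypothetical inputs: two assets, an assumed covariance matrix, a variance cap.
mu = (0.08, 0.03)
sigma = ((0.04, 0.002), (0.002, 0.01))
result = best_two_asset_mix(mu, sigma, max_variance=0.015)
```

A grid search only works for toy problems like this one; with a real portfolio you would hand the same objective and constraint to an optimization solver.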

Understanding Scatterplots

Published 2020-06-04 by Kevin Feasel

Alex Velez describes the nature of the scatterplot:

A scatterplot is a niche chart, but it’s one of my favorites! If you are a statistician or work in a technical field, a scatterplot might be your go-to graph type. However, if you don’t perform a lot of statistical analysis, then these charts may be unfamiliar. Regardless of your current comfort level, scatterplots are extremely useful to focus on the relationship between two series—a scenario that is common in both technical and non-technical fields. Let’s explore some of the basics of scatterplots via an example; I’ll also cover tips for designing more effective ones and discuss common variations (bubble charts, connected scatterplots, etc.), too!

Read on for a good explanation of what scatterplots are, variants on the theme, and when they make sense to use.
