Press "Enter" to skip to content

Author: Kevin Feasel

Unit Testing Azure Data Factory Pipelines

Richard Swinbank walks us through what it takes to run a unit test against an Azure Data Factory pipeline:

In part three of this series I looked at functional tests for ADF pipelines: verifying, in isolation, that pipelines are “doing things right”. In this post I’ll be testing isolated pipelines to check that they’re “doing the right things” – this is one description of a unit test. In a general-purpose programming language, unit tests might be used to verify that an individual line of code is executed, or that it has a particular effect. In Azure Data Factory, the smallest unit of development – a “line of code” – is a pipeline activity. I will be writing tests to verify that specific activities are executed (or not) and to inspect their results.

There’s a fair bit involved in this sort of test.

Bulk Loading SQL Server from .NET

Adrian Hills walks us through the SqlBulkCopy class:

Ever been in a situation where rumblings of “process X is too slow” suddenly build into a super-high priority ball of urgency when that next step up in data volume hits? Yeah, that can be fun. No, really, it can be fun because we have strategies to sort this stuff out, right?

In this blog post, I’m going to talk about one particular piece of functionality—SqlBulkCopy—that can help you with bulk data loading. If I had to single out my favorite .NET class, SqlBulkCopy would be at the top of the list. My goal is to introduce you to this class so that maybe it can become a part of your tool belt, too.

Click through to see how it works. If you’re familiar with SSIS, you’re already familiar with the concept if not the specifics.
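
To give a flavor of the API the post describes, here is a minimal Windows PowerShell sketch of SqlBulkCopy loading an in-memory DataTable; the connection string, table, and column names are all made up for illustration.

$connectionString = "Server=devInstance;Database=Staging;Integrated Security=True"

# Build a small in-memory DataTable to act as the source data
$source = New-Object System.Data.DataTable
[void]$source.Columns.Add("CustomerID", [int])
[void]$source.Columns.Add("CustomerName", [string])
[void]$source.Rows.Add(1, "Contoso")
[void]$source.Rows.Add(2, "Fabrikam")

$connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
$connection.Open()
try {
    $bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy $connection
    $bulkCopy.DestinationTableName = "dbo.Customer"   # hypothetical target table
    $bulkCopy.BatchSize = 5000                        # commit in batches rather than one big transaction
    $bulkCopy.WriteToServer($source)                  # streams the rows to SQL Server via the bulk load API
}
finally {
    $connection.Close()
}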

Power Query Performance Differences Based on When You Remove Columns

Chris Webb continues a series on optimizing Power Query merge performance:

In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:

Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?

Read on for the result, as well as a pleasant surprise around Power BI’s capabilities.

Backing Up SQL Server Instance Configuration

Claudio Silva has started a series on backing up your SQL Server instance configuration. Part 1 walks us through the basic process:

If you have never used this command, you can test for a single instance by running the following:

Export-DbaInstance -SqlInstance "devInstance" -Path "D:\temp"

This will create all of the scripts under the D:\temp folder: a subfolder named "devInstance-{date}" will be created.
In this folder, you will find one file per object type. The file names are in the form “#-<object type>.sql”, where # is a number reflecting the order in which the underlying functions are called internally.

Be sure to read the sections around passwords!

Part 2 introduces parallelism into the mix:

There are a couple of options, like the native PowerShell Start-Job/Stop-Job cmdlets (a.k.a. background jobs), runspace jobs, and thread jobs, but I will just mention two of them. One is a nice addition to the most recent version of PowerShell (v7) and the other uses a PowerShell module.

In case you don't know, with PowerShell v7 it's possible to use a new -Parallel option with ForEach-Object. Check the PowerShell team's blog post PowerShell ForEach-Object Parallel Feature.

However, because I don't have PS7 (yet :-)), I will keep leveraging the PoshRSJob module, which uses runspaces and was created by Boe Prox (T | B).

There’s a lot of meat in that second part, so check it out.

Understanding Scatterplots

Alex Velez describes the nature of the scatterplot:

A scatterplot is a niche chart, but it’s one of my favorites! If you are a statistician or work in a technical field, a scatterplot might be your go-to graph type. However, if you don’t perform a lot of statistical analysis, then these charts may be unfamiliar. Regardless of your current comfort level, scatterplots are extremely useful to focus on the relationship between two series—a scenario that is common in both technical and non-technical fields. Let’s explore some of the basics of scatterplots via an example; I’ll also cover tips for designing more effective ones and discuss common variations (bubble charts, connected scatterplots, etc.), too!

Read on for a good explanation of what scatterplots are, variants on the theme, and when they make sense to use.

Portfolio Optimization with SAS and Python

Sophia Rowland shows off the sasoptpy package:

I started by declaring my parameters and sets, including my risk threshold, my stock portfolio, the expected return of my stock portfolio, and the covariance matrix estimated using the shrinkage estimator of Ledoit and Wolf (2003). I will use these pieces of information in my objective function and constraints. Now I will need SWAT, sasoptpy, and my optimization model object.

Read on for a demo.

Service Broker Conversations and Messages

Chris Johnson is working on a series on Service Broker fundamentals. This post covers conversations and messages:

First, each conversation has its own unique dialog_handle (not conversation handle, despite everything calling it a conversation from now on, score one for consistency Microsoft). We need to capture this handle in a UNIQUEIDENTIFIER variable, as we will need it later on to send messages across the conversation. In fact, the statement will error if you don’t supply a variable to capture the handle.

Second, we need to supply both FROM and TO services. These tell the conversation which service is the source and which is the target. Remember, each service is attached to a queue, and can have one or more contracts attached to it. The source service is a database object, but the target service is an NVARCHAR. This allows the target service to live outside the database, which is something that I will cover at some point in the Service Broker 201 series.

This is a nice explanation of the process, so if you aren’t particularly familiar with Service Broker, check out Chris’s series, starting with the first and second posts.

Cassandra Monitoring and Data Modeling

Instaclustr has put up a couple of interesting posts on Cassandra. First, Anup Shirolkar explains how we can monitor Cassandra installations:

Cassandra is developed in Java and is a JVM based system. Each Cassandra node runs a single Cassandra process. JVM based systems are enabled with JMX (Java Management Extensions) for monitoring and management. Cassandra exposes various metrics using MBeans which can be accessed through JMX. Cassandra monitoring tools are configured to scrape the metrics through JMX and then filter, aggregate, and render the metrics in the desired format. There are a few performance limitations in the JMX monitoring method, which are referred to later. 

Metrics management in Cassandra is performed using the Dropwizard library. The metrics are collected per node in Cassandra. However, they can be aggregated by the monitoring system.

On the development side, the Instaclustr team walks us through data modeling guidelines:

The ultimate goal of Cassandra data modeling and analysis is to develop a complete, well organized, and high performance Cassandra cluster. Following the five Cassandra data modeling best practices outlined will hopefully help you meet that goal:

1. Cassandra is not a relational database, don’t try to model it like one
2. Design your model to meet 3 fundamental goals for data distribution
3. Understand the importance of the Primary Key in the overall data structure 
4. Model around your queries but don’t forget about your data
5. Follow a six-step structured approach to building your model.

Because Cassandra uses a variant of SQL, it’s easy to forget that data is stored completely differently and that design decisions are quite different from what we see in the relational world.

Generating SQL Server Data Tools Solutions from Templates

Sander Stad walks us through creating a template for building SSDT solutions:

Yes, templates. But how are we going to create a template for an SSDT solution in such a way that it can be reused?

That's where the PowerShell module called "PSModuleDevelopment" comes in. PSModuleDevelopment is part of the PSFramework PowerShell module.

The PSModuleDevelopment module enables you to create templates not just for files but also for entire directories. Using placeholders, you can replace values in the template, making it possible to set the name and other variables.

This is where the SSDT template comes in. I have created a template for SSDT that contains two projects. One project is meant for the data model and the other project is meant for the unit tests.
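
As a rough orientation, the workflow with PSModuleDevelopment looks something like the sketch below; the paths and template name are hypothetical, and the parameter names are from memory rather than a verified reference, so check the module's help before relying on them.

# PSModuleDevelopment installs alongside PSFramework
Install-Module PSModuleDevelopment -Scope CurrentUser

# Turn an existing SSDT solution folder into a reusable template (hypothetical path and name)
New-PSMDTemplate -ReferencePath "C:\src\MySsdtSolution" -TemplateName SsdtSolution

# Stamp out a new solution from the template, filling in the placeholders
Invoke-PSMDTemplate -TemplateName SsdtSolution -OutPath "C:\src" -Name "NewDatabaseProject"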

Read the whole thing and check out Sander’s GitHub repo.
