Press "Enter" to skip to content

Month: December 2019

Custom Power BI Visuals with Charticulator

Kasper de Jonge shows how we can use Charticulator to build out custom visuals in Power BI:

Recently I attended an excellent session by Andy Kirk on the state of data visualization for 2019. One of the tools Andy is most excited about is called Charticulator. For those of you who haven’t heard of it, Charticulator is an open source project from Microsoft Research. Using a web UI, you can design almost any chart by interactively specifying constraints (NO code).

The best news is that you can use these visuals directly in Power BI. You can even use your own data coming from the PBI data model. This came as news to many at Andy Kirk’s session, so I decided to do a quick post on it. Building very custom charts entices a lot of data artists, but it is often hard to bring this to a mainstream BI product. The ease of the Charticulator and Power BI integration brings this type of data visualization to a much broader audience.

This is a lot easier than telling people to learn D3, though Charticulator will necessarily have more limitations than writing all of the code yourself.


Calculating Last Year to Date with DAX

Reza Rad wants to compare prior years to the current year, using year-to-date comparisons:

I have previously written about how to calculate same period last year and compare this year’s values with last year’s values. However, sometimes you don’t yet have the full year, especially for the current year. You might want to compare this year’s value with last year’s value up to the same day, just one year earlier. This is what I call same period last year to date. Here in this blog article, I’ll explain how you can do that using DAX in Power BI. To learn more about Power BI, read Power BI from Rookie to Rock Star.

Click through to see how it’s done.
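If you want to see the shape of the calculation outside of DAX, here is a minimal pandas sketch of the same idea (not Reza's DAX measures); the sales table, column names, and the December 15th "as of" date are all made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales data.
sales = pd.DataFrame({"date": pd.date_range("2018-01-01", "2019-12-15", freq="D")})
sales["amount"] = 100  # placeholder values

today = pd.Timestamp("2019-12-15")

# This year to date: January 1 of the current year through "today".
ytd = sales.loc[
    (sales["date"] >= pd.Timestamp(today.year, 1, 1)) & (sales["date"] <= today),
    "amount",
].sum()

# Same period last year to date: January 1 of last year through the
# same day last year, rather than the whole prior year.
same_day_last_year = today - pd.DateOffset(years=1)
ly_ytd = sales.loc[
    (sales["date"] >= pd.Timestamp(today.year - 1, 1, 1))
    & (sales["date"] <= same_day_last_year),
    "amount",
].sum()

print(ytd, ly_ytd)
```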


Azure Data Factory Pipelines

Cathrine Wilhelmsen continues a series on Azure Data Factory with a discussion of pipelines:

Pipelines are sorted by name, so I recommend that you decide on a naming convention early in your project. And yeah, I keep saying this to everyone else, but then I can never decide on how to name my own pipelines, haha 🙂 Don’t worry if you end up renaming your pipelines several times while you work on your project. It happens, and that’s completely fine, but try to stick to some kind of naming convention throughout your project.

In addition to naming conventions, you can create folders to organize your pipelines. Click the actions ellipsis next to the pipelines group, then click new folder.

Read on for more.


The A* Search Algorithm

Akash Kumar takes us through the A* algorithm and how it works for search:

Pathfinding is one of the oldest and most popular applications in computer programming. You can find the optimal path from a source to a destination by adding costs that represent time, money, etc. A* is one of the most popular algorithms for all the right reasons. In this article, let’s find out just why.

Click through for an explanation of what the algorithm does and pseudocode to implement it.
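As a rough companion to the article, here is a minimal Python sketch of A* over a small grid; the graph, unit step costs, and Manhattan-distance heuristic are stand-ins for illustration, not anything taken from the article:

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Generic A* search.

    neighbors(node) yields (next_node, step_cost) pairs, and
    heuristic(node, goal) must never overestimate the remaining cost,
    or the returned path may not be optimal.
    """
    # Priority queue ordered by f = g (cost so far) + h (estimated cost to goal).
    open_heap = [(heuristic(start, goal), 0, start)]
    came_from = {}
    best_g = {start: 0}

    while open_heap:
        f, g, node = heapq.heappop(open_heap)
        if node == goal:
            # Reconstruct the path by walking parent links back to the start.
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        if g > best_g.get(node, float("inf")):
            continue  # stale entry left behind by a cheaper rediscovery
        for nxt, step_cost in neighbors(node):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                came_from[nxt] = node
                heapq.heappush(open_heap, (new_g + heuristic(nxt, goal), new_g, nxt))
    return None  # goal unreachable

# Example: 4x4 grid, unit step costs, Manhattan distance as the heuristic.
def grid_neighbors(node):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 4 and 0 <= ny < 4:
            yield (nx, ny), 1

manhattan = lambda n, g: abs(n[0] - g[0]) + abs(n[1] - g[1])
print(a_star((0, 0), (3, 3), grid_neighbors, manhattan))
```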


JupyterLab Integration for Databricks

Bernhard Walter announces an integration between JupyterLab and Databricks:

This blog post starts with a quick overview of what using a remote Databricks cluster from your local JupyterLab looks like. It then provides an end-to-end example of working with JupyterLab Integration, followed by an explanation of the differences from Databricks Connect. If you want to try it yourself, the last section explains the installation.

I like this a lot, as it fights back a bit against the balkanization of data science: it means I don’t need to keep one set of notebooks here and another set of notebooks there and a third set of notebooks somewhere else.


Fastest Way to Delete Lots of Rows in SQL Server

Aaron Bertrand tries out a few methods for deleting data and looks at what SQL Server configuration settings do to this calculus:

That took far longer than I’m comfortable admitting. Part of that was because I had originally included a 0.1% test for rowperloop which, in some cases, took several hours. So I removed those from the table a few days in, and can easily say: if you are removing 1,000,000 rows, deleting 1,000 rows at a time is highly unlikely to be an optimal choice, regardless of any other variables.

I think Aaron lays out the caveats pretty well, but I’d reiterate that the main benefit of chunking delete operations is not so much to make things faster as to reduce the amount of time you spend blocking more important work, like user queries. It also reduces the risk of blowing out the transaction log file (and maybe running out of disk space too).
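To make the chunking idea concrete, here is a minimal pyodbc sketch (not Aaron's benchmark harness); the connection string, dbo.BigTable, and the date predicate are placeholders you would swap for your own:

```python
import pyodbc

# Placeholder connection string -- adjust driver, server, and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Sandbox;Trusted_Connection=yes;",
    autocommit=True,  # each DELETE commits on its own
)
cursor = conn.cursor()

batch_size = 100_000  # tune this; very small batches (e.g. 1,000 rows) add a lot of overhead

while True:
    # Deleting in capped batches keeps each transaction short, so locks are
    # released and the log can clear between batches instead of holding one
    # giant transaction open for the whole purge.
    cursor.execute(
        f"DELETE TOP ({batch_size}) FROM dbo.BigTable "
        "WHERE CreatedDate < '20150101';"
    )
    if cursor.rowcount < batch_size:
        break  # the final partial batch means we're done
```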


Azure Data Factory Components and Copy Data Wizard

Cathrine Wilhelmsen continues a series on Azure Data Factory. First, we get an overview of the available components:

Pipelines are the things you execute or run in Azure Data Factory, similar to packages in SQL Server Integration Services (SSIS). This is where you define your workflow: what you want to do and in which order. For example, a pipeline can first copy data from an on-premises data center to Azure Data Lake Storage, and then transform the data from Azure Data Lake Storage into Azure Synapse Analytics (previously Azure SQL Data Warehouse).

Then, Cathrine looks at the Copy Data wizard:

LEGO! Yay! I love LEGO. Rebrickable is an online service that will show you which LEGO sets you can build from the sets and parts you already own. Fun! 🙂

They also have a database of all official LEGO sets and parts (including themes and colors) that you can download for free as CSV files or JSON files.

The CSV files are automatically generated at the start of each month and can be found on rebrickable.com/downloads.

Cathrine takes this LEGO data and feeds it into Azure Data Lake Storage.
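Cathrine does this with the Copy Data wizard in the portal; purely for comparison, here is a rough sketch of pushing one of those CSVs into ADLS Gen2 with the azure-storage-file-datalake Python SDK, where the storage account, key, filesystem, download URL, and paths are all hypothetical:

```python
import requests
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder values -- substitute your own storage account, key, and file URL.
ACCOUNT_URL = "https://mystorageaccount.dfs.core.windows.net"
ACCOUNT_KEY = "<storage-account-key>"
CSV_URL = "https://rebrickable.com/downloads/sets.csv"  # hypothetical direct link

# Download one of the Rebrickable CSV files.
csv_bytes = requests.get(CSV_URL).content

# Write it into a filesystem (container) in Azure Data Lake Storage Gen2.
service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
filesystem = service.get_file_system_client(file_system="raw")
file_client = filesystem.get_file_client("lego/sets.csv")

file_client.create_file()
file_client.append_data(csv_bytes, offset=0, length=len(csv_bytes))
file_client.flush_data(len(csv_bytes))
```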


An Ode to the SQLCallstackResolver

Jonathan Kehayias really likes the SQLCallstackResolver project:

Lately, I have been doing a lot of work troubleshooting certain behaviors in SQL Server for workloads that are, to put it simply, designed horribly. As a part of this, I have found it necessary to collect callstacks with Extended Events and to materialize them using the debugger symbols for SQL Server to see where exactly certain types of issues are being encountered to better understand some of the internals of newer features in SQL Server 2017 and 2019. Years ago I blogged about how to use the package0.callstack action in Extended Events for this type of thing, and Paul also has a blog post that talks about how to download the PDB symbols for SQL Server as well as a post that also demonstrates using the package0.callstack action to determine what causes a particular wait type. Using the debugging tools to get the symbols is somewhat clunky and tedious, so when I happened on this amazingly simple method of getting symbol files I had to share it.

The SQLCallstackResolver on Github has to be one of the greatest things since sliced bread if you want to materialize callstacks from SQL Server. 

Arvind Shyamsundar deserves a lot of credit for putting it together; he did a great job with the project.


Power BI Aggregation Precedence

Shabnam Watson asks and answers a question of importance:

Precedence is one of the aggregation properties that you can define on an aggregation table in Power BI. In a model with multiple aggregation tables, you can use Precedence to define the order in which aggregation tables will be considered by Power BI to answer queries. The higher the Precedence number, the sooner the aggregation table will be considered. Very simple and easy to understand, but the question is:

What does Power BI do when there are multiple aggregation tables configured with the same Precedence value and they can all answer the same query? Which one is considered first? Does it choose the smallest one? Or is there another rule in place?

Click through to find out.


Persistent Memory for SQL Server on Linux

The SQL Server team shows how you can configure persistent memory for SQL Server on Linux:

With the release of SQL Server 2019 on Linux, Microsoft introduced persistent memory (PMEM) support on Linux. This is an exciting development, as previous versions of SQL Server on Linux didn’t support PMEM. Let’s look at how to configure the PMEM for SQL Server on Linux.

SQL Server 2016 introduced support for non-volatile DIMMs and an optimization called Tail of the Log Caching on NVDIMM. These leveraged Windows Server direct access to a persistent memory device in DAX mode to reduce the number of operations needed to harden a log buffer to persistent storage.

SQL Server 2019 extends the support for PMEM devices to Linux, providing full enlightenment of data and transaction logs placed on PMEM. Enlightenment is a way to access the storage device using efficient user-space memcpy() operations. Rather than going through the file system and storage stack, SQL Server leverages DAX support on Linux to place data directly into the device. This helps to reduce latency.

Click through for the configuration steps.
