May 2022 – Page 14 – Curated SQL

Securing Azure Storage

Published 2022-05-03 by Kevin Feasel

Craig Porteous continues a series on Azure Data Platform security:

This is the third in a series where I look at all of the resources common to a Data Lakehouse platform architecture and what you need to think about to get it past your security team.
Building upon Azure Databricks, I’ll move from the compute engine to our blob and data lake storage. Things are a little simpler to secure but the plethora of options available can have significant impacts on usability and cost so it’s important to understand the impact before baking them into your design.

Read on for some good advice around securing Azure storage accounts.

Comments closed

Multiple Aggregations with CASE

Published 2022-05-03 by Kevin Feasel

Chad Callihan shows off a good use of aggregate functions and the CASE statement:

Have you have been tasked with pulling multiple counts from the same table? Maybe you need to find how many records have a value for a column and how many are NULL. Or maybe you need to see how many records are true and how many are false.
It’s simple enough to run a query to count one set of criteria, run another query for the second set of criteria, and combine them when sending your results. Did you know you can get multiple counts with one query?
Here’s an example of how using COUNT and CASE can speed up your day.

Click through for an example. I mildly disagree with Chad’s conclusion that this is something you’ll rarely do—the more you work with reporting and analytical queries, the more you’ll appreciate this.

Comments closed

Power Query XML Inconsistencies

Published 2022-05-03 by Kevin Feasel

Chris Webb walks us through some inconsistencies:

A few months ago one of my colleagues at Microsoft, David Browne, showed me an interesting Power Query problem with how the Xml.Tables and Xml.Document M functions handle null or missing values. I’m posting the details here because the problem seems fairly common, it causes a lot of confusion and it’s not easy to deal with.
In XML there are two ways to represent a null or missing value:<a></a> or omitting the element completely. Unfortunately the Xml.Tables and Xml.Document M functions handle these inconsistently: they treat the <a></a> form as a table but the other as a scalar.

Click through for an example and a workaround for the issue.

Comments closed

Comparing R Package Versions with Diffify

Published 2022-05-02 by Kevin Feasel

Clarissa Barratt and Parisa Gregg announce an interesting tool:

You know that sinking feeling that you get when you’re months into a big project and you log in one day and nothing works? Turns out something has updated and things have been removed that you needed and now you need to spend hours-days figuring out what’s changed and your masters deadline is getting closer and … ok, apparently this took me back to a very specific event.
But I’m sure most of that sounds familiar to you if you’ve ever programmed something over a longer period of time.
Over the last few months, Jumping Rivers have been working on a tool that will make it easier to see differences between R package versions: Diffify.

Read on to see it in action. It looks quite useful for troubleshooting issues in which a package suddenly changed API functionality, something which tends to happen frequently in the R and Python worlds.

Comments closed

Playing with gganimate

Published 2022-05-02 by Kevin Feasel

Tomaz Kastrun tries out gganimate:

I firmly believe that animation and transition between different data states can give end-users much better insights and understanding of the data, than a single table with data points or correlation metrics.
With help of ggplot, gganimate, you can quickly create an animation based on your needs. This is a simple IRIS dataset example.

You can find more at the gganimate website. The real downside is that I don’t think it’s being maintained any longer, as the last commit was a year ago.

Comments closed

Slow File Open Times in Power BI

Published 2022-05-02 by Kevin Feasel

Marco Russo explains why opening some Power BI files might take so long:

There could be many reasons for that, but if you have calculated columns and/or calculated tables in your model, you should be aware that they could be the reasons why this happens. It could be, so I want to explain when this happens.
The short explanation is the following: when you open a PBIX file, Power BI Desktop automatically recalculates those calculated columns and calculated tables that depend on a volatile formula.

Read on for the longer explanation, which includes a (possibly incomplete) list of volatile formulas.

Comments closed

High Availability in SQL Managed Instance General Purpose Tier

Published 2022-05-02 by Kevin Feasel

Niko Neugebauer clears up what options you have for high availability in SQL MI’s General Purpose tier:

The two main requirements around high availability are commonly known as RTO and RPO.

RTO – stands for Recovery Time Objective and is the maximum allowable downtime when a failure occurs. In other words, how much time it takes for your databases to be up and running.

RPO – stands for Recovery Point Objective and is the maximum allowable data-loss when a failure occurs. Of course, the ideal scenario is not to lose any data, but a more realistic (and also ideal) scenario is to not lose any committed data, also known as Zero Committed Data Loss.

With those definitions out of the way, read on to learn more.

Comments closed

Sharing Individual Power BI Dataflows

Published 2022-05-02 by Kevin Feasel

Marc Lelijveld is in a sharing mood:

Recently, I have had a challenge at a customer, where a central teams maintains many dataflows in Power BI, to store their only and single version of the truth. However, this central team maintained many different dataflows in a single workspace, but did not want to share the entire workspace with others. What now? How can they share a single dataflows in Power BI?
In this blog, I will describe different ways to share dataflows in the Power BI service and highlight pros and cons of each solution. Read on to find out what options you have, and what my personal preference would be.

Read on to learn why you might want to share a dataflow, as well as four techniques to do it.

Comments closed

The Performance Cost of Subqueries in the SELECT Clause

Published 2022-05-02 by Kevin Feasel

Andrea Allred notes a potential performance issue:

Why is this query slow? If this query were to return 50 rows, it would run each query in the SELECT clause 50 times, and since there are two of them, that is 100 query runs. What if I returned 100,000 rows? That would be 200,000 query runs. How could I do this differently?

Read on for the answer.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Month: May 2022