2021-10-11 – Curated SQL

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.
Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Trying a Read-Only API

Published 2021-10-11 by Kevin Feasel

Mark Litwintschik reviews ROAPI:

ROAPI is an API Server that exposes CSV, JSON and Parquet files without the need to write any code. The project was started by Qingping Hou around this time last year. Qingping had spent the better part of four years working at LinkedIn prior to joining Scribd as a Senior Engineer. He is also a committer to both the Apache Airflow and Arrow projects.
ROAPI is made up of 4K lines of Rust. This line count is low due to the intense use of 3rd party libraries. These include Apache Arrow for, among other things, Parquet support, Arrow’s DataFusion Project, which provides SQL and query execution support, Actix, which provides the HTTP interface and Rusoto, the AWS SDK for Rust.

Click through to see how to set it up and how to use it.
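For a sense of what querying such a server could look like, here is a minimal Python sketch that only builds the HTTP request for a SQL query and does not send it. The `/api/sql` path, the `localhost:8080` address, and the `uk_cities` table name are assumptions for illustration, not details taken from the post; check them against the ROAPI documentation for your version.

```python
from urllib.request import Request


def build_sql_request(base_url: str, sql: str) -> Request:
    """Build (but do not send) a POST request carrying a SQL query as its body.

    Assumption: the server accepts plain-text SQL at <base_url>/api/sql.
    """
    url = base_url.rstrip("/") + "/api/sql"
    return Request(url, data=sql.encode("utf-8"), method="POST")


# Hypothetical address and table name, used only for illustration.
request = build_sql_request(
    "http://localhost:8080",
    "SELECT city, lat, lng FROM uk_cities LIMIT 2",
)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it to a running server; the sketch stops short of that so it runs without one.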

Power BI Storage Modes and Aggregations

Published 2021-10-11 by Kevin Feasel

Phil Seamark dives into storage modes in Power BI:

How to choose the correct storage mode for Power BI Tables.
This article aims to help explain the different storage modes available when designing an aggregation strategy for a Power BI Report. What each storage mode is and when you would use it. Picking the correct storage mode for each table in your model can significantly affect overall performance.

Click through for the tl;dr version, but stay for the whole thing.

Optimizing for Mediocre

Published 2021-10-11 by Kevin Feasel

Erik Darling points out an issue with some approaches to preventing parameter sniffing problems in queries:

Despite the many metric tons of blog posts warning people about this stuff, I still see many local variables and optimize for unknown hints. As a solution to parameter sniffing, it’s probably the best choice 1/1000th of the time. I still end up having to fix the other 999/1000 times, though.
In this post, I want to show you how using either optimize for unknown or local variables makes my job — and the job of anyone trying to fix this stuff — harder than it should be.

Click through for two methods, both of which end up being the wrong answer.
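To see why these approaches tend toward a mediocre plan, consider a simplified model of the row estimate. With a local variable or `OPTIMIZE FOR UNKNOWN`, the optimizer cannot look at the parameter's value, so it guesses from average density (roughly, total rows divided by distinct values) rather than from the histogram entry for that value. The Python sketch below illustrates that arithmetic with made-up data. It is not SQL Server's actual estimator, and the function names and the `status` distribution are hypothetical.

```python
from collections import Counter


def density_estimate(values):
    """Row estimate when the predicate value is unknown: rows / distinct values."""
    return len(values) / len(set(values))


def sniffed_estimate(values, param):
    """Row estimate when the value is known (idealized here as the exact count)."""
    return Counter(values)[param]


# Hypothetical skewed column: one very common value and two rare ones.
status = ["shipped"] * 9900 + ["pending"] * 90 + ["failed"] * 10

unknown = density_estimate(status)
for value in ("shipped", "pending", "failed"):
    print(value, "sniffed:", sniffed_estimate(status, value), "unknown:", round(unknown))
```

In this toy data every value gets the same guess of about 3,333 rows, which is too low for the common value and far too high for the rare ones.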

Multi-Value Parameters with Power Query Online

Published 2021-10-11 by Kevin Feasel

Chris Webb shows off multi-value parameters in Power Query:

Why is this interesting? In the past, Power Query parameters were always single values like a date or a string; now a parameter can contain multiple values.
There’s one other new feature in Power Query Online that goes along with this: In and Not In filters, which can use these new List parameters.

Click through for some examples.

Value Comparisons with Nullable Columns

Published 2021-10-11 by Kevin Feasel

Chad Baldwin wants to check if rows exist before inserting:

I haven’t posted in a while, so I thought I would throw a quick one together to hopefully restart the habit of writing and posting on a regular basis.
One of my first blog posts covered how to only update rows that changed. In that post, I described a popular method that uses EXISTS and EXCEPT to find rows that had changed while also implicitly handling NULL values.

Click through for two types of technique, one for non-nullable data and one which can include NULL.
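The EXISTS and EXCEPT pattern mentioned in the quote can be sketched as follows. This uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server, and the `target` and `source` tables and their columns are hypothetical. It relies on `EXCEPT` treating two NULLs as the same value, which a plain `<>` comparison does not.

```python
import sqlite3

# Rows differ when the source row, taken as a whole, is not matched by the target row.
EXISTS_EXCEPT = """
    SELECT t.id
    FROM target AS t
    JOIN source AS s ON s.id = t.id
    WHERE EXISTS (SELECT s.name, s.score EXCEPT SELECT t.name, t.score)
    ORDER BY t.id
"""

# Column-by-column inequality: any comparison involving NULL yields NULL, not true.
NAIVE = """
    SELECT t.id
    FROM target AS t
    JOIN source AS s ON s.id = t.id
    WHERE s.name <> t.name OR s.score <> t.score
    ORDER BY t.id
"""


def build_demo():
    """Create two small in-memory tables with some NULL scores."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
        CREATE TABLE source (id INTEGER PRIMARY KEY, name TEXT, score INTEGER);
        INSERT INTO target VALUES (1, 'a', 10), (2, 'b', NULL), (3, 'c', NULL), (4, 'd', 5);
        INSERT INTO source VALUES (1, 'a', 10), (2, 'b', 7),    (3, 'c', NULL), (4, 'd', 6);
    """)
    return conn


def changed_ids(conn, query):
    """Return the ids the given query reports as changed."""
    return [row[0] for row in conn.execute(query)]


conn = build_demo()
print("EXISTS/EXCEPT:", changed_ids(conn, EXISTS_EXCEPT))
print("plain <>:     ", changed_ids(conn, NAIVE))
```

Row 2 changes from NULL to 7, which the EXISTS/EXCEPT query reports and the plain comparison misses. Row 3 is NULL on both sides and is correctly treated as unchanged by both.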

Restoring a Database to Azure SQL Managed Instance

Published 2021-10-11 by Kevin Feasel

Arun Sirpal wants to restore a database:

Now that we have a Managed Instance built, the next question is how do we get data across? I will break this up into separate posts but the lesson for this blog post is ANALYSIS FIRST!

Click through for the analysis.

Day: October 11, 2021

Databricks Integration with Git Repos

Trying a Read-Only API

Power BI Storage Modes and Aggregations

Optimizing for Mediocre

Multi-Value Parameters with Power Query Online

Value Comparisons with Nullable Columns

Restoring a Database to Azure SQL Managed Instance