Press "Enter" to skip to content

Day: October 11, 2021

Databricks Integration with Git Repos

Ka-Hing Chueng and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Comments closed

Trying a Read-Only API

Mark Litwintschik reviews ROAPI:

ROAPI is an API Server that exposes CSV, JSON and Parquet files without the need to write any code. The project was started by Qingping Hou around this time last year. Qingping had spent the better part of four years working at LinkedIn prior to joining Scribd as a Senior Engineer. He is also a committer to both the Apache Airflow and Arrow projects.

ROAPI is made up of 4K lines of Rust. This line count is low due to the intense use of 3rd party libraries. These include Apache Arrow for, among other things, Parquet support, Arrow’s DataFusion Project, which provides SQL and query execution support, Actix, which provides the HTTP interface and Rusoto, the AWS SDK for Rust.

Click through to see how to set it up and how to use it.

Comments closed

Power BI Storage Modes and Aggregations

Phil Seamark dives into storage modes in Power BI:

How to choose the correct storage mode for Power BI Tables.

This article aims to help explain the different storage modes available when designing an aggregation strategy for a Power BI Report. What each storage mode is and when you would use it. Picking the correct storage mode for each table in your model can significantly affect overall performance.

Click through for the tl;dr version, but stay for the whole thing.

Comments closed

Optimizing for Mediocre

Erik Darling points out an issue with some approaches to preventing parameter sniffing problems in queries:

Despite the many metric tons of blog posts warning people about this stuff, I still see many local variables and optimize for unknown hints. As a solution to parameter sniffing, it’s probably the best choice 1/1000th of the time. I still end up having to fix the other 999/1000 times, though.

In this post, I want to show you how using either optimize for unknown or local variables makes my job — and the job of anyone trying to fix this stuff — harder than it should be.

Click through for two methods, both of which end up being the wrong answer.

Comments closed

Value Comparisons with Nullable Columns

Chad Baldwin wants to check if rows exist before inserting:

I haven’t posted in a while, so I thought I would throw a quick one together to hopefully restart the habit of writing and posting on a regular basis.

One of my first blog posts covered how to only update rows that changed. In that post, I described a popular method that uses EXISTS and EXCEPT to find rows that had changed while also implicitly handling NULL values.

Click through for two types of technique, one for non-nullable data and one which can include NULL.

Comments closed