Author: Kevin Feasel

Scaling Out vs Scaling Up

Jordan Braiuka compares two models for scaling:

We often get questions from customers about the best way to add capacity to their cluster. Is it better to add nodes, or simply to increase the capacity in their nodes? Unfortunately, the truth is there is no best way—like all complex issues in distributed systems, there are benefits and drawbacks to each scaling approach. 

While each of our highly distributed systems (Apache Cassandra, Apache Kafka, etc.) have slightly different implementations of scaling, the concepts remain consistent across most distributed systems. 

Click through for a comparison between the two approaches. As the article indicates, both are meaningful strategies and your choice might come down to a combination of the technology stack and the problem at hand.

Star Schemas versus Header-Detail Tables in Power BI

Marco Russo and Alberto Ferrari lay out another proof that the star schema is the right schema for Power BI:

We have already shown in a previous article (Power BI – Star schema or single table – SQLBI) how the star schema proves to be the best option when compared with a single table model. Single-table models are evil: do not be tempted by them, choose a star schema.

In this article, I want to show you an example in the opposite direction. A single table model denormalizes everything in one table, and we already learned that it is bad. But what if we keep a more normalized structure, as it often happens in header/detail models (like orders and order lines)? Is a header/detail model better than a star schema? The quick answer is: “No. Nope. No way. Not at all. Are you kidding me? No.”. Nonetheless, this might be just our personal opinion. The goal of the article is to provide you with some numbers and considerations to prove the previous statement.

Read on and you make the call.

Testing sp_ineachdb

Aaron Bertrand takes us to the Island of Misfit Databases:

The only database that requires extra handling is the one that contains a tab, because SQL Server doesn’t know how to generate file names when that character is present. I am sure there are a bunch of other less common but equally exotic characters that may cause the same problem.

This is how I actually tested sp_ineachdb, to make sure it was ready for just about any bad idea anyone used to name a database, and could handle various possible database states (for a lot more background on this procedure, and why it is better than the undocumented, unsupported, and buggy sp_msforeachdb, see this and this). Here you can see that the procedure works against all these poorly-named databases, and skips databases that are inaccessible (rather than raise an exception).

Click through to see the list of databases Aaron uses. Technically, I think Aaron’s blog post also counts as a Halloween post because some of those databases are spooky.

Searching T-SQL Objects

Rob Farley has a quick script to find references in SQL Server:

As a consultant, the kind of work that I do from customer to customer can change a bit. Sometimes I’m reviewing people’s environments; sometimes I’m performance tuning; sometimes I’m developing code or reports or cubes; sometimes I’m writing T-SQL, but it’s often DAX or PowerShell.

Click through for a quick script to search modules for a particular string.
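To give a feel for what a module search involves, here is a minimal Python sketch that builds a parameterized query against sys.sql_modules. This is not Rob's script: the function names, the column aliases, and the `?` placeholder style (as used by ODBC-based drivers) are all assumptions made for illustration. The one subtle part is escaping LIKE wildcards so that a search for something like sp_help does not treat the underscore as "any character".

```python
def escape_like(term: str) -> str:
    """Escape T-SQL LIKE wildcards by wrapping them in brackets.

    The opening bracket is handled first so that brackets added for
    '%' and '_' are not escaped a second time.
    """
    return term.replace("[", "[[]").replace("%", "[%]").replace("_", "[_]")


def build_module_search(term: str):
    """Return (sql, params) that find modules whose definition contains term.

    Hypothetical helper: the '?' placeholder assumes an ODBC-style driver.
    """
    sql = (
        "SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name, "
        "OBJECT_NAME(m.object_id) AS object_name "
        "FROM sys.sql_modules AS m "
        "WHERE m.definition LIKE ?"
    )
    params = ("%" + escape_like(term) + "%",)
    return sql, params


if __name__ == "__main__":
    sql, params = build_module_search("sp_help")
    print(sql)
    print(params)
```

The resulting pair can be handed to whatever database cursor you use; passing the search string as a parameter rather than concatenating it keeps quoting problems out of the picture.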

Databricks Integration with Git Repos

Ka-Hing Cheung and Vaibhav Sethi announce Databricks Repos is now generally available:

Thousands of Databricks customers have adopted Databricks Repos since its public preview and have standardized on it for their development and production workflows. Today, we are happy to announce that Databricks Repos is now generally available.

Databricks Repos was created to solve a persistent problem for data teams: most tools used by data engineering/machine learning practitioners offer poor or no integration with Git version control systems, forcing them to navigate through multiple files, steps and UIs to simply review and commit code. Not only is this time-consuming, but it’s also error-prone.

This has been a bit of a pain point with Databricks in the past, and they’ve come up with this solution. Given that Azure Synapse Analytics has some of the same pain points, I’d expect we’ll see something similar in time.

Power BI Storage Modes and Aggregations

Phil Seamark dives into storage modes in Power BI:

How to choose the correct storage mode for Power BI Tables.

This article aims to help explain the different storage modes available when designing an aggregation strategy for a Power BI Report. What each storage mode is and when you would use it. Picking the correct storage mode for each table in your model can significantly affect overall performance.

Click through for the tl;dr version, but stay for the whole thing.

Trying a Read-Only API

Mark Litwintschik reviews ROAPI:

ROAPI is an API Server that exposes CSV, JSON and Parquet files without the need to write any code. The project was started by Qingping Hou around this time last year. Qingping had spent the better part of four years working at LinkedIn prior to joining Scribd as a Senior Engineer. He is also a committer to both the Apache Airflow and Arrow projects.

ROAPI is made up of 4K lines of Rust. This line count is low due to the intense use of 3rd party libraries. These include Apache Arrow for, among other things, Parquet support, Arrow’s DataFusion Project, which provides SQL and query execution support, Actix, which provides the HTTP interface and Rusoto, the AWS SDK for Rust.

Click through to see how to set it up and how to use it.

Optimizing for Mediocre

Erik Darling points out an issue with some approaches to preventing parameter sniffing problems in queries:

Despite the many metric tons of blog posts warning people about this stuff, I still see many local variables and optimize for unknown hints. As a solution to parameter sniffing, it’s probably the best choice 1/1000th of the time. I still end up having to fix the other 999/1000 times, though.

In this post, I want to show you how using either optimize for unknown or local variables makes my job — and the job of anyone trying to fix this stuff — harder than it should be.

Click through for two methods, both of which end up being the wrong answer.
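To see why the "unknown" estimate tends toward mediocre, here is a simplified Python sketch of the arithmetic. With local variables or optimize for unknown, the optimizer cannot see the parameter value, so for an equality predicate it falls back to an average: total rows divided by the number of distinct values. The column data and function names below are hypothetical, and the model ignores histogram steps and everything else a real optimizer weighs; it only illustrates the averaging effect on skewed data.

```python
from collections import Counter


def density_estimate(values):
    """Row guess for 'column = ?' when the value is unknown:
    the average number of rows per distinct value."""
    return len(values) / len(set(values))


def sniffed_estimate(values, parameter):
    """Row guess when the value is visible: in this idealized model,
    the exact number of matching rows."""
    return Counter(values)[parameter]


# Hypothetical skewed column: one customer with 9,990 orders,
# ten customers with a single order each (10,000 rows, 11 distinct values).
column = ["big"] * 9990 + [f"small{i}" for i in range(10)]

if __name__ == "__main__":
    print("unknown:", round(density_estimate(column), 2))      # 10000 / 11
    print("sniffed 'big':", sniffed_estimate(column, "big"))
    print("sniffed 'small0':", sniffed_estimate(column, "small0"))
```

In this toy data set the averaged guess of roughly 909 rows is far too low for the big customer and far too high for the small ones, so the plan built from it fits nobody particularly well.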