Press "Enter" to skip to content

November 18, 2021

Benchmarking Databricks vs Snowflake

Mostafa Mokhtar, et al., respond to some benchmarking claims:

On Nov 2, 2021, we announced that we set the official world record for the fastest data warehouse with our Databricks SQL lakehouse platform. These results were audited and reported by the official Transaction Processing Performance Council (TPC) in a 37-page document available online at tpc.org. We also shared a third-party benchmark by the Barcelona Supercomputing Center (BSC) outlining that Databricks SQL is significantly faster and more cost effective than Snowflake.

A lot has happened since then: many congratulations, some questions, and some sour grapes. We take this opportunity to reiterate that we stand by our blog post and the results: Databricks SQL provides superior performance and price performance over Snowflake, even on data warehousing workloads (TPC-DS).

Posts like this are exactly why getting rid of the DeWitt clause is important. I’d rather have Snowflake and Databricks duking it out with publicly available and testable processes. To my mind, the most important part of this post is the several exhortations to try it out yourself, both for the Databricks test and the Snowflake test. Make benchmarking public, including hardware choices, configuration choices, and the testing process; then I can tell for sure whether your benchmark makes sense for my use case.

Aggregate Window Functions

I have a series on window functions:

Here, we get the sum of LineProfit by CustomerID. Because SUM() is an aggregate function, we need a GROUP BY clause for all non-aggregated columns. The full set of aggregate functions in T-SQL is available here, but you’ll probably be most familiar with MIN(), MAX(), SUM(), AVG(), and COUNT().

To turn this into a window function, we slap on an OVER() and boom! Note: “boom!” only works on SQL Server 2012 and later, so if you’re still on 2008 R2, it’s more of a fizzle than a boom.

Read on for several examples of this nature.
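As a minimal sketch of the difference (against a hypothetical dbo.Sales table; the names are mine, not from the series):

-- Aggregate form: one row per customer
SELECT s.CustomerID,
    SUM(s.LineProfit) AS TotalProfit
FROM dbo.Sales AS s
GROUP BY s.CustomerID;

-- Window form: slap on an OVER() and every detail row survives,
-- each carrying its customer's total
SELECT s.CustomerID,
    s.LineProfit,
    SUM(s.LineProfit) OVER (PARTITION BY s.CustomerID) AS TotalProfit
FROM dbo.Sales AS s;

For what it’s worth, the plain PARTITION BY form above works even before SQL Server 2012; it’s adding an ORDER BY inside the OVER() clause for running totals that requires 2012 or later.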

API Servers and the Importance of Learning

Steve Jones tells a story:

While talking with a client recently about their performance challenges, I was relieved to find that the database wasn’t the problem. Instead, their API server was overloaded by the number of calls taking place in their application. While the database did provide the backing for the API calls, there was a fair amount of caching. However, as they’d moved to microservices, more and more of the interaction between modules was taking place as a network call to a single server, which became overloaded.

Steve goes on to the broader point of people freely donating their time and expertise to explain how to solve problems. And the above is a major problem of moving to microservices: everything gets several times chattier. The biggest tricks I have there are to embrace asynchronous processing via queues and to ensure that messages passed back and forth are as small as possible, which means getting rid of the idea of passing big lists of fully hydrated objects around.
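As a rough sketch of that advice in SQL Server terms (the table and names here are hypothetical, not from Steve’s post), a plain queue table keeps the interaction asynchronous and the messages tiny; consumers claim work with READPAST so they don’t block each other:

-- Hypothetical work queue: the message is just a key, not a fully hydrated object
CREATE TABLE dbo.WorkQueue
(
    QueueID bigint IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    CustomerID int NOT NULL,
    EnqueuedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Producer: enqueue a small message
INSERT dbo.WorkQueue (CustomerID) VALUES (42);

-- Consumer: atomically claim one message; READPAST skips rows other
-- consumers have locked, so workers don't queue up behind each other
DELETE TOP (1)
FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
OUTPUT deleted.QueueID, deleted.CustomerID;

The design point is the payload: a CustomerID is enough for the consumer to fetch whatever it needs, so nothing large crosses the wire.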

Get Power BI Data from Google Sheets

Reza Rad shows off a new connector:

Power BI can now get data from Google Sheets. This functionality was released just yesterday and announced on both the Power BI and Power Query blogs. The feature is still in preview (beta), but it is worth looking at how it works in a quick article and video.

There are several steps involved, but it’s still a lot simpler than the old method of parsing a website, especially if you had any sort of security on the spreadsheets.

LOBs and OPTION(RECOMPILE)

Paul White has a warning for us:

All that is fairly well-known. The point of this short post is to draw your attention to another side-effect of adding OPTION (RECOMPILE) — the parameter embedding optimization (PEO).

When PEO is used, SQL Server takes the value of any variables and parameters and embeds the runtime values in the query text, pretty much as if you had entered them by hand before compiling. This is often very useful for plan quality, but there is a potential drawback when large object types (LOBs) are in play.

Click through for the explanation and a simple demo.
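As a minimal sketch of what PEO does (my own hypothetical query, not Paul’s demo):

DECLARE @Pattern nvarchar(max) = N'%widget%';

-- With OPTION (RECOMPILE), the optimizer embeds the runtime value of
-- @Pattern as a literal before compiling, much as if you had typed
-- WHERE p.ProductName LIKE N'%widget%' yourself
SELECT p.ProductID,
    p.ProductName
FROM dbo.Products AS p
WHERE p.ProductName LIKE @Pattern
OPTION (RECOMPILE);

The nvarchar(max) declaration is the relevant bit: the embedded value is a LOB, which is where the drawback Paul describes comes into play.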

Signs It’s Time to Move to Enterprise Edition

Everywhere are signs, says Erik Darling:

SQL Server Standard Edition hobbles batch mode pretty badly. DOP is limited to two, and there’s no SIMD support. It’s totally possible to have batch mode queries running slower than row mode queries, because the row mode queries can use much higher DOPs and spread the row workload out.

I’d almost rather use indexed views in Standard Edition for large aggregations, because there are no Edition-locked enhancements. You’ll probably wanna use the NOEXPAND hint either way.

Click through for several factors which may cause you to want Enterprise Edition over Standard Edition. Conversely, if none of those apply to you, Standard Edition could work well for you.
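For reference, a minimal sketch of the indexed view approach Erik mentions, against a hypothetical dbo.Sales table:

-- Indexed views require SCHEMABINDING, two-part names, and COUNT_BIG(*)
-- when grouping; this also assumes LineProfit is NOT NULL, since SUM()
-- over a nullable expression isn't allowed in an indexed view
CREATE VIEW dbo.SalesByCustomer
WITH SCHEMABINDING
AS
SELECT s.CustomerID,
    SUM(s.LineProfit) AS TotalProfit,
    COUNT_BIG(*) AS RowCnt
FROM dbo.Sales AS s
GROUP BY s.CustomerID;
GO

-- The unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX CX_SalesByCustomer
    ON dbo.SalesByCustomer (CustomerID);
GO

-- On Standard Edition, reference the view with NOEXPAND so the optimizer
-- reads the materialized rows instead of expanding the view definition
SELECT v.CustomerID,
    v.TotalProfit
FROM dbo.SalesByCustomer AS v WITH (NOEXPAND);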
