Press "Enter" to skip to content

Month: November 2021

Re-Making a Line Chart

Alex Velez cleans up a line chart:

When sharing a makeover, typically I’ll show a side-by-side “before” and “after” view. This is a powerful moment for many audiences as they witness the dramatic impact of an effective graph. I share this with you because, until recently, if a combination chart was found within my makeovers, it represented the “before” state. That’s because most combo charts are hard to read, so I tend to revise them into something simpler—like this makeover.

Today’s article shows the inverse of that process, where, in order to make a visual more informative and easier to understand, I chose to transform the original, a “simple” line chart, into a more “complicated” combination chart. 

This is a good reminder that visuals themselves aren’t necessarily bad (except for the pie chart, which is inherently evil and don’t try to convince me otherwise); it’s all about whether the specific chart makes sense given the story you are trying to tell.

ETL from an API

Bill Fellows unravels a bad practice:

The direction for software as a service providers is to provide APIs to access their data instead of structured file exports. Which is a pity, as every SaaS system requires a bespoke data extract solution. I inherited a solution that had an adverse pattern I’d like to talk about.

The next level of complexity, at least in the space Bill covers in the example, is what to do when the upstream provider changes its data after the fact, sometimes with revisions arriving as much as a week later.
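
One common way to handle late revisions (a sketch with hypothetical table and column names, not necessarily the approach from Bill's post) is to re-pull a trailing window of data from the API into a staging table on every run, then upsert so that corrections overwrite rows you already loaded:

    -- Upsert the staging window into the target: rows the provider revised
    -- after the initial load get updated; genuinely new rows get inserted.
    MERGE dbo.SalesOrders AS tgt
    USING dbo.SalesOrders_Staging AS src
        ON tgt.OrderID = src.OrderID
    WHEN MATCHED AND src.LastModified > tgt.LastModified THEN
        UPDATE SET tgt.Amount       = src.Amount,
                   tgt.OrderStatus  = src.OrderStatus,
                   tgt.LastModified = src.LastModified
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OrderID, Amount, OrderStatus, LastModified)
        VALUES (src.OrderID, src.Amount, src.OrderStatus, src.LastModified);

Because the load is idempotent, re-pulling a week's worth of data on every run is safe; the cost is just the size of the window you re-fetch.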

Non-Yielding IO Completion Ports

Sean Gallardy is here to demystify a concept:

IO Completion Ports are a set of Windows APIs which allow for efficient, fast, multithreaded asynchronous IO. Great, that pretty much tells you nothing.

SQL Server uses IO Completion Ports not for disk-based IO but for general network IO when it comes into SQL Server for TDS level items. This means it’s used for things such as connecting to an instance of SQL Server, sending batch and rpc information, etc., and is used to properly take actions on the incoming items. These actions should be extremely short and quick; the name of the game is low latency and high throughput, which means not doing things like reading or writing from disk, allocating memory, or calling functions that may block, in order to keep things flowing.

Read on to see what happens when there is a problem and what might cause that problem.
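
If you want to check whether your own instance has ever hit this, the non-yielding messages land in the SQL Server error log, so a quick search works (using the well-known, if undocumented, xp_readerrorlog search parameter):

    -- Search the current error log (0) of the SQL Server log type (1)
    -- for non-yielding messages, including IO Completion Listener ones.
    EXEC sys.xp_readerrorlog 0, 1, N'non-yielding';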

T-SQL Tuesday 144 Roundup

Victoria Holt recaps T-SQL Tuesday #144:

This month’s T-SQL Tuesday attracted some great responses! Thank you to everyone who participated!

My invitation for this month’s #tsql2sday was three-fold, asking you to share your experiences with data governance:

– The current cost of data governance versus its benefits

– The amazing things data governance has enabled you to achieve or will enable you to achieve in the future

– The potential uses for Azure Purview within your estates and the automated deployment options for that

Read on for the recap.

Benchmarking Databricks vs Snowflake

Mostafa Mokhtar, et al, respond to some benchmarking claims:

On Nov 2, 2021, we announced that we set the official world record for the fastest data warehouse with our Databricks SQL lakehouse platform. These results were audited and reported by the official Transaction Processing Performance Council (TPC) in a 37-page document available online at tpc.org. We also shared a third-party benchmark by the Barcelona Supercomputing Center (BSC) outlining that Databricks SQL is significantly faster and more cost effective than Snowflake.

A lot has happened since then: many congratulations, some questions, and some sour grapes. We take this opportunity to reiterate that we stand by our blog post and the results: Databricks SQL provides superior performance and price performance over Snowflake, even on data warehousing workloads (TPC-DS).

Posts like this are exactly why getting rid of the DeWitt clause is important. I’d rather have Snowflake and Databricks duking it out with publicly available and testable processes. The most important part of this post is its several exhortations to try it out yourself, both for the Databricks test and the Snowflake test. Make benchmarking public, including hardware choices, configuration choices, and the testing process; then, I can tell for sure whether your benchmark makes sense for my use case.

Aggregate Window Functions

I have a series on window functions:

Here, we get the sum of LineProfit by CustomerID. Because SUM() is an aggregate function, we need a GROUP BY clause for all non-aggregated columns. The full set of aggregate functions in T-SQL is available here, but you’ll probably be most familiar with MIN(), MAX(), SUM(), AVG(), and COUNT().

To turn this into a window function, we slap on an OVER() and boom! Note: “boom!” only works on SQL Server 2012 and later, so if you’re still on 2008 R2, it’s more of a fizzle than a boom.

Read on for several examples of this nature.
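
To make the before-and-after concrete, here's a minimal sketch, assuming a table with CustomerID and LineProfit columns (the table name is illustrative):

    -- Classic aggregation: one row per customer, and every
    -- non-aggregated column must appear in the GROUP BY.
    SELECT
        il.CustomerID,
        SUM(il.LineProfit) AS TotalProfit
    FROM dbo.InvoiceLines il
    GROUP BY il.CustomerID;

    -- Window version: slap on OVER() and keep every detail row,
    -- with the per-customer total repeated alongside each one.
    SELECT
        il.CustomerID,
        il.LineProfit,
        SUM(il.LineProfit) OVER (PARTITION BY il.CustomerID) AS TotalProfit
    FROM dbo.InvoiceLines il;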

API Servers and the Importance of Learning

Steve Jones tells a story:

While talking with a client recently about their performance challenges, I was relieved to find that the database wasn’t the problem. Instead, their API server was overloaded by the number of calls taking place in their application. While the database did provide the backing for the API calls, there was a fair amount of caching. However, as they’d moved to microservices, more and more of the interaction between modules was taking place as a network call to a single server, which became overloaded.

Steve goes on to the broader point of people freely donating their time and expertise to explain how to solve problems. And the above is a major problem of moving to microservices: everything gets several times chattier. The biggest tricks I have there are to embrace asynchronous processing via queues and to ensure that messages passed back and forth are as small as possible, which means getting rid of the idea of passing big lists of fully-hydrated objects around.

Get Power BI Data from Google Sheets

Reza Rad shows off a new connector:

Power BI can now get data from Google Sheets. This functionality was released just yesterday and announced on both the Power BI and Power Query blogs. The feature is still in preview (Beta), but it is worthwhile looking at how it works in a quick article and video.

There are several steps involved, but it’s still a lot simpler than the old method of parsing a website, especially if you have any sort of security on the spreadsheets.

LOBs and OPTION(RECOMPILE)

Paul White has a warning for us:

All that is fairly well-known. The point of this short post is to draw your attention to another side-effect of adding OPTION (RECOMPILE) — the parameter embedding optimization (PEO).

When PEO is used, SQL Server takes the value of any variables and parameters and embeds the runtime values in the query text, pretty much as if you had entered them by hand before compiling. This is often very useful for plan quality, but there is a potential drawback when large object types (LOBs) are in play.

Click through for the explanation and a simple demo.
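
As a quick illustration of the mechanism (not Paul's demo; the table and variable names here are hypothetical), consider what PEO does to a statement like this:

    -- With OPTION (RECOMPILE), the parameter embedding optimization compiles
    -- this as if the runtime value of @Search were pasted into the text.
    DECLARE @Search nvarchar(max) = N'widget';

    SELECT p.ProductID, p.ProductName
    FROM dbo.Products AS p
    WHERE p.ProductName LIKE N'%' + @Search + N'%'
    OPTION (RECOMPILE);

    -- Fine for a short string; if @Search held megabytes of LOB data, that
    -- embedded literal is what the compiler would have to chew through.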

Signs It’s Time to Move to Enterprise Edition

Everywhere are signs, says Erik Darling:

SQL Server Standard Edition hobbles batch mode pretty badly. DOP is limited to two, and there’s no SIMD support. It’s totally possible to have batch mode queries running slower than row mode queries, because the row mode queries can use much higher DOPs and spread the row workload out.

I’d almost rather use indexed views in Standard Edition for large aggregations, because there are no Edition-locked enhancements. You’ll probably wanna use the NOEXPAND hint either way.
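
For reference, the indexed view pattern Erik mentions looks something like this (a sketch with illustrative names; indexed views require SCHEMABINDING and, when aggregating, a COUNT_BIG(*) column):

    CREATE VIEW dbo.ProfitByCustomer
    WITH SCHEMABINDING
    AS
    SELECT
        CustomerID,
        SUM(LineProfit) AS TotalProfit,
        COUNT_BIG(*)    AS RowCnt  -- required for indexed views with GROUP BY
    FROM dbo.InvoiceLines
    GROUP BY CustomerID;
    GO

    -- The unique clustered index is what materializes the view on disk.
    CREATE UNIQUE CLUSTERED INDEX CX_ProfitByCustomer
        ON dbo.ProfitByCustomer (CustomerID);
    GO

    -- Outside Enterprise Edition, the optimizer won't automatically match
    -- queries to the view, so reference it directly and hint NOEXPAND.
    SELECT CustomerID, TotalProfit
    FROM dbo.ProfitByCustomer WITH (NOEXPAND);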

Click through for several factors which may cause you to want Enterprise Edition over Standard Edition. Conversely, if none of those apply to you, Standard Edition could work well for you.
