
Curated SQL Posts

Living in the Lakehouse

James Serra defines the term “data lakehouse”:

As a follow-up to my blog Data Lakehouse & Synapse, I wanted to talk about the various definitions I am seeing about what a data lakehouse is, including a recent paper by Databricks.

Databricks uses the term “Lakehouse” in their paper (see Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics), which argues that the data warehouse architecture as we know it today will wither in the coming years and be replaced by a new architectural pattern, the Lakehouse. Instead of the two-tier data lake + relational data warehouse model, you will just need a data lake, which is made possible by implementing data warehousing functionality over open data lake file formats.

While I agree there may be some use cases where technical designs may allow Lakehouse systems to completely replace relational data warehouses, I believe those use cases are much more limited than this paper suggests.

James is a sharp and perceptive fellow, so read the whole thing.

Comments closed

Power BI: New Features for Data Analysts

Tomaz Kastrun looks at some new functionality in Power BI which might interest data analysts:

Small multiples is a layout of small charts over a grouping variable, aligned side by side and sharing a common scale that is fitted to all the values across the smaller charts. An analyst should immediately see the differences between the values of the grouping variable (e.g., city, color, type) in the visualized data.

In Python, we know this as trellis plot or FacetGrid (seaborn) or simply subplots (Matplotlib).

In R, this is usually referred to as facets (ggplot2).

Read on for an example of this and two other features, along with how you might have worked with these ideas in Python and R.
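To make the "sharing common scale" part concrete, here is a minimal text-only sketch in plain Python. It is not taken from Tomaz's post: the `small_multiples` function and the sales figures are hypothetical, and the bars are drawn with `#` characters rather than a charting library. The point is that every panel is scaled against one shared maximum, so bar lengths stay comparable across panels.

```python
from collections import defaultdict


def small_multiples(rows, width=20):
    """Split (group, label, value) rows into panels that share one scale."""
    panels = defaultdict(list)
    for group, label, value in rows:
        panels[group].append((label, value))

    # One maximum across ALL panels: this is the shared scale.
    shared_max = max(value for _, _, value in rows)

    rendered = {}
    for group, items in panels.items():
        lines = []
        for label, value in items:
            bar = "#" * round(width * value / shared_max)
            lines.append(f"{label:<4}{bar}")
        rendered[group] = lines
    return rendered


# Hypothetical sales figures per city and quarter.
rows = [
    ("Oslo", "Q1", 10), ("Oslo", "Q2", 20),
    ("Rome", "Q1", 40), ("Rome", "Q2", 4),
]

for city, lines in small_multiples(rows).items():
    print(city)
    print("\n".join(lines))
```

With Matplotlib, a similar shared scale is what `plt.subplots(..., sharey=True)` gives you across a row of subplots.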


Estimating Row Counts without Statistics

Matthew McGiffen dives into rules of thumb:

I find this is a question that comes up again and again. What estimate for the number of rows returned does SQL Server use if you’re selecting from a column where there are no statistics available?

There are a few different algorithms used depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.

Read on for a few examples, noting that this specifically relates to tables and not things like table-valued parameters.


The Pain of SELECT *

Grant Fritchey strongly recommends against SELECT *:

Quite a few years ago, I wrote a post about SELECT * and performance. That post had a bit of a click-bait title (freely admitted). I wrote the post because there was a really bad checklist of performance tips making the rounds (pretty sure it’s still making the rounds). The checklist recommended a whole bunch of silly stuff. One silly thing it recommended was to simply substitute ALL columns (let me emphasize that again, name each and every column) instead of SELECT * because “it was faster”.

My post, linked above, showed that this statement was nonsense. Let’s be clear, I’m not a fan of SELECT *. Yes, it has some legitimate functionality. However, by and large, using SELECT * causes performance problems.

I’ll use SELECT * for one-off queries (well, something like SELECT TOP(100) * but same difference), but it’s a really bad practice to include that in application code for the reasons Grant mentions.
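As a small illustration of one commonly cited problem with SELECT * in application code, here is a sketch using Python's built-in `sqlite3` module as a stand-in for SQL Server. The `orders` table and the `column_names` helper are hypothetical and not from Grant's post. It shows that a SELECT * query silently starts returning more columns after a schema change, while an explicit column list keeps returning the same shape.

```python
import sqlite3

# In-memory SQLite database standing in for a real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.execute("INSERT INTO orders (customer, total) VALUES ('Ann', 12.5)")


def column_names(sql):
    """Run a query and return the names of the columns it produces."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


star_before = column_names("SELECT * FROM orders")
explicit_before = column_names("SELECT id, total FROM orders")

# Someone later widens the table.
conn.execute("ALTER TABLE orders ADD COLUMN notes TEXT")

star_after = column_names("SELECT * FROM orders")
explicit_after = column_names("SELECT id, total FROM orders")

print(star_before, "->", star_after)
print(explicit_before, "->", explicit_after)
```

The application that asked for `id` and `total` keeps moving the same amount of data; the one that asked for `*` now also pulls `notes` on every call, whether it needs it or not.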


Change Tracking Runthrough

Erik Darling provides a runthrough (which is a walkthrough but at a faster pace) of change tracking in SQL Server:

I’ve been working with CDC and CT way too much, and even I’m annoyed with how much it’s coming out in blog posts.

I’m going to cover a lot of ground quickly here. If you get lost, or there’s something you don’t understand, your best bet is to reference the documentation to get caught up.

Check it out.


Using OAuth 2 in R Packages

Maelle Salmon explains how OAuth 2 works and also how you can use it in R packages:

When writing an R package wrapping an API using OAuth 2.0 you’ll need the user to grant access to an “app”, which will allow you to create an access token and a refresh token. The access token will then often be passed to the API in a header when making requests, whilst the refresh token would be posted in a query string when the access token needs to be renewed.

Your problem is: how do I imitate a third-party app? Thankfully for you, in most cases the complexity can be handled by the httr package. For other cases, or if you want to e.g. only use curl, you will have to get creative. 

Read on for more detail.
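The two token roles described in the excerpt can be sketched without any network traffic. This is Python rather than R, and it is not what httr does internally: the URLs, the `api_request` and `refresh_request` helpers, and the token values are all hypothetical, and real providers differ in which fields they require. The sketch only builds the two requests: one carrying the access token in an `Authorization` header, and one posting the refresh token as URL-encoded form data.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoints; substitute the ones your provider documents.
API_URL = "https://api.example.com/v1/items"
TOKEN_URL = "https://auth.example.com/oauth/token"


def api_request(access_token):
    """Build an API call that carries the access token in a header."""
    return Request(API_URL, headers={"Authorization": f"Bearer {access_token}"})


def refresh_request(refresh_token, client_id):
    """Build the POST that trades a refresh token for a new access token."""
    body = urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
    }).encode("ascii")
    return Request(TOKEN_URL, data=body, method="POST")


# Placeholder tokens; nothing is sent over the network here.
req = api_request("ACCESS123")
renew = refresh_request("REFRESH456", "my-app")

print(req.get_header("Authorization"))
print(renew.get_method(), renew.data.decode("ascii"))
```

Passing either object to `urllib.request.urlopen` would actually send it, which is the point where a real client ID and a real token endpoint are needed.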


SSAS and Database Loading

Nigel Foulkes-Nock explains why SSAS might not be available even if the service is running:

When starting SQL Server Analysis Services (SSAS) Tabular, the Service is quick to report that it has started. In my opinion, this Status is not entirely accurate – SSAS may be running but you cannot access data until it has loaded all associated SSAS Databases into memory and performed its consistency checks. This can take a long time.

After starting SSAS, if you try to browse the Databases using SQL Server Management Studio (SSMS) then SSMS becomes unresponsive. You will receive errors if you try to query an SSAS Database. It’s busy but it doesn’t report as such and doesn’t give any clue of how long it’ll take.

Read on for the explanation.


Hiding Excel using PowerShell

Mikey Bronowski shows how you can hide an Excel worksheet, as well as specific rows and columns, using PowerShell:

This is part of the How to Excel with PowerShell series. Links to all the tips can be found in this post.
If you would like to learn more about the module with an interactive notebook, check this post out.

MS Excel offers many different features, and one of them is making things disappear: hiding worksheets, columns, rows, and even cells.

Read on to see how.
